Meta tags:
author= Chris Adams;
ricks com if you're not familiar with this form of CSS positioning. Since the OCRed word coordinates aren't consistently tightly cropped around the letters in the word, a minor CSS box-shadow is used to make the edges softer and more like a highlighter.

Notes

From a workflow perspective, I highly recommend recording the source of your OCR text and whether it's been reviewed. Since this is a fully automated process, it is extremely handy to be able to reprocess items in the future if your software improves, without accidentally clobbering any items which have been hand-corrected by humans.

The word coordinates are pixel-level coordinates based on the input file, but our requests are made using calculated percentages, since it's often the case that the scans are much higher resolution than we would want to display in a web browser, and our users wouldn't want to wait for a 600 DPI image to download in any case.

You might be wondering why all of this work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this approach is friendlier for caches: because a given image segment can be reused for multiple words, rounding the coordinates improves the cache hit ratio significantly, and both the image and word coordinates can thus be cached by CDN edge servers rather than requiring a full round trip back to the server.

One common example of the cacheability benefit is when you open a result and start reading it. In the viewer we display full page images rather than the trimmed slices, so we must fetch new images, but those are likely to be cached because they haven't been customized with the search text, and we can reuse the locally cached word coordinates to immediately display the highlighting. If you change your search text within an item, we can again immediately update the display while the revised page list is retrieved.

Challenges & Future Directions

This was supposed to be the simplest thing which could possibly work
and it turned out not to be that simple. As you might imagine, this leaves a number of open questions for where to go next.

OCR results vary considerably based on the quality of the input image. Accuracy can be improved considerably by preprocessing the image to remove borders or noise, or by using a more sophisticated algorithm to convert a full-color scan into the black-and-white image which Tesseract operates on. The trick is either coming up with good presets for your data, perhaps integrated with an image-processing tool like ScanTailor, or developing smarter code which can select filters based on the characteristics of the image.

For older items, the OCR process is complicated by the condition of the materials, more primitive printing technology, and stylistic choices like the long s (ſ) or ligatures which are no longer in common usage and thus not well supported by common OCR programs. One of my future goals is looking into the tools produced by the Early Modern OCR Project and seeing whether there's a production-ready path for this.

It would be interesting to combine the results of OCR with my earlier figure extraction project for innovative displays like the Mechanical Curator or, with more work, to try to extract full figures with captions.

Finally, there's considerable room for integrating crowd-sourcing approaches like the direct text correction epitomized by the National Library of Australia's wonderful Trove project, and promising improvements on that concept like the UMD MITH's ActiveOCR project. This seems like an area for research which any organization with large digitized collections should be supporting, particularly with an eye towards easier reuse. Ed Summers and I have idly discussed the idea for a generic web application which would display hOCR with the corresponding images for correction, with all of the data stored somewhere like GitHub for full change tracking and review. It seems like something along those lines would be particularly valuable as a public service to
avoid the expense of everyone reinventing large parts of this process, customized for their particular workflow.

Oct 24

Dear EFF: please don't pick the wrong fight

The fight against DRM is not worth discarding your integrity. Misrepresenting the W3C's Encrypted Media Extensions will not do anything useful, but it will hold the web back and make the EFF less effective.

First, some background: I've been a supporter of and donor to the Electronic Frontier Foundation for a long time (at least 2001, although I believe I started earlier, during the 90s crypto wars) and opposed to DRM for at least as long. I've also been a fan of Danny O'Brien's reporting and personal blog for a similarly long time.

Unfortunately, today had me reconsidering that support because of O'Brien's recent blog post, "Lowering Your Standards: DRM and the Future of the W3C". I feel this marks a dangerous trend of playing very loose with the facts in an attempt to pressure the W3C to drop the Encrypted Media Extensions (EME) spec, and that this is not only likely to fail but to actually backfire by ensuring that millions of people continue to access content through proprietary, closed systems.

Background

A little background information: most video played on the web, and particularly commercial content, uses Adobe's Flash or Microsoft's Silverlight plugins to run a video player inside a webpage. Both Flash and Silverlight are full programming environments with a significant range of capabilities beyond video playback, and have significant overlap with the features provided by your browser. They're distributed as browser plugins which require a hefty download to be installed before viewing anything, and both generally require proprietary tools for developers to create applications.

They're annoying for developers because they require using a completely different set of technologies than you use for everything else on the web, but many places will write that off as a cost of doing business. What's more of a concern is that both plugins
have a history of security problems, and neither Microsoft nor Adobe appears to be particularly motivated to build the kind of fast, reliable automatic update system which the modern browsers have. So in addition to requiring your users to download something before viewing content, you're contributing to one of the leading sources of security exploits for the average user. It also means that anyone who wishes to publish video on the web is generally subject to the development roadmap of one of two companies.

HTML5 offers a way out of this mess: browsers could play back video directly, avoiding the massive external dependency and allowing them to make improvements for video as quickly as they do anything else, rather than hoping a third-party developer wants to make improvements. HTML5 video is very easy to use, fast, and has a consistent, high-quality user experience. Unfortunately, anyone looking to use it for commercial content will learn that the licensing rules from all of the major content owners require the use of DRM, and thus HTML5 video is not an option.

What is EME, anyway?

The W3C's EME group is working on a way to reduce this dependency by adding a general mechanism which allows the use of HTML5 video with a little bit of JavaScript to specify a CDM and a decryption key for the file. This allows content providers to use the entire modern web stack and limit the DRM dependency to a small chunk of code which handles only the actual decryption. That dramatically lowers the attack surface and avoids the need for anywhere near as frequent updates, as the actual decryption mechanism is far less complex than the entire, largely duplicative platform which Flash or Silverlight provide.

The Problem

DRM does not work, and all DRMed content has ended up being available in unencrypted form very quickly, because the only way to make DRM work is by completely locking down a device to prevent its owner from running code which can access the unencrypted data, and of course there's always the analog
hole. The EFF has a long, laudable history of attempting to educate the public and lawmakers about these issues, and I completely support those efforts.

Unfortunately, this effort has failed: no significant amount of commercial video on the web is available without DRM, and users don't seem to care, as shown by the billions of dollars of sales through iTunes, Amazon, Google Play, etc., while Netflix is using somewhere around 30% of the total internet traffic in North America to serve DRM-encumbered video, mostly using Silverlight. Clearly, convenience and availability are more important to people.

The EFF has been taking a hard-line position on EME, focused on slippery-slope claims:

By approving this idea, the W3C has ceded control of the user agent (the term for a Web browser in W3C parlance) to a third party, the content distributor. That breaks a (perhaps until now unspoken) assurance about who has the final say in your Web experience, and indeed who has ultimate control over your computing device. A Web where you cannot cut and paste text; where your browser can't Save As an image; where the allowed uses of saved files are monitored beyond the browser; where JavaScript is sealed away in opaque tombs; and maybe even where we can no longer effectively View Source on some sites, is a very different Web from the one we have today. It's a Web where user agents (browsers) must navigate a nest of enforced duties every time they visit a page. It's a place where the next Tim Berners-Lee or Mozilla, if they were building a new browser from scratch, couldn't just look up the details of all the Web technologies. They'd have to negotiate and sign compliance agreements with a raft of DRM providers just to be fully standards-compliant and interoperable.

"Lowering Your Standards: DRM and the Future of the W3C", Danny O'Brien, EFF (emphasis mine)

This is similar to some of the past claims made by Cory Doctorow:

The first of these conditions, robustness against end-user modification, is a blanket ban on all free/open
source software (free/open source software, by definition, can be modified by its users). That means that the two most popular browser technologies on the web, WebKit (used in Chrome and Safari) and Gecko (used in Firefox and related browsers), would be legally prohibited from implementing whatever standard the W3C emerges.

"What I wish Tim Berners-Lee understood about DRM", Cory Doctorow, The Guardian (emphasis mine)

Both of these are simply wrong: there is no meaningful distinction between what EME proposes and what is already the case with a browser plugin. If Firefox can play Flash or Silverlight content, it can decrypt video using a CDM which is either included in the host operating system, bundled under an agreement similar to Chrome's Flash plugin, or installed by the user.

The real problem is that they're arguing the wrong point: those requests have always been made and in most cases have already happened. The lack of a W3C standard hasn't prevented the Amazon Kindle app from preventing your ability to save unencrypted text, iTunes from blocking saving snippets of a rented movie, etc., and it hasn't prevented either Adobe or Microsoft from adding every DRM feature requested by the content owners. What this has done is ensure that the web community hasn't had much say in the process, because all of the content is created and played using proprietary, closed software.

The EFF is shouting loudly, but only Adobe and Microsoft will benefit. There's no indication whatsoever that the studios are going to drop their DRM requirements if this W3C spec is scuttled; we'll just continue to see a lot of opaque plugin content and, of course, more pressure away from the web towards proprietary app stores. Mozilla's Asa Dotzler summed this up perfectly earlier today on Hacker News:

[T]he businesses (Hollywood) with the content that web users want have done that math and decided that DRM through plug-ins and native apps is an excellent system, and they're happy to keep mandating it forever. If plug-ins go away, as they're slowly but surely
doing, then native apps will be the only place to get this content.

This approach also runs the risk of damaging the reputation of the EFF and making it less effective. Beyond basic factual problems, exaggerating the risks will backfire badly if people look and correctly see that the situation isn't so terrible (Netflix at $10/month is absurdly popular despite the DRM) and discount future claims made by the EFF. They'll need that credibility as the war on general-purpose computing continues, and Cory is not wrong to sound the alarm over that.

What the open web community should be doing now is working to ensure that EME is designed in a way which improves security and reduces the proprietary footprint. If the standard for CDMs includes aggressive sandboxing, it's a huge win for security alone, even if all it does is turn Flash into a collective bad memory for web users. Additionally, separating the task of building a decryption module from building a high-performance video player with robust networking makes it significantly easier for new vendors to enter the market, and increases portability because so much less code needs to be adapted to a new platform.

There are some interesting long-term trends as well. More education about the risks of DRMed content is good, and reducing what consumers are willing to pay for restricted content may be the best long-term strategy. Some of that effort needs to be directed towards content owners and providers who are thinking about investing in complex, expensive systems which don't actually work.

A very interesting approach was highlighted by Mozilla's Brendan Eich earlier this year in the form of OTOY's pure-JavaScript video codec which, in addition to avoiding all of the issues with binary plugins, has first-class support for watermarking:

Watermarking, not DRM. This could be huge. OTOY's GPU cloud approach enables individually watermarking every intra-frame, and according to some of its Hollywood supporters, including Ari Emanuel, this may be enough to
eliminate the need for DRM.

"Today I Saw The Future", Brendan Eich, Mozilla CTO

Obviously, a shift away from the DRM obsession won't happen overnight, but it's not impossible either. As content owners are concerned about the market leverage which the major DRM vendors like Apple and Amazon have, there's space for smart players willing to back away from DRM in favor of an approach which works at least as well and doesn't require hardware vendors to sell out their users. As Brendan said, there is hope.

2013-10-24 Update

Brendan Eich, Mozilla's CTO, posted his position on the EME issue: "The Bridge of Khazad-DRM". Pushing the W3C for CDM-level interoperability is a good call and definitely feels characteristic of Mozilla, balancing the goal of protecting users' interests with the realistic constraints of the current browser market. I strongly hope they succeed.

Since Mozilla seems to be the only browser vendor taking a strong position in favor of user rights, now is a great time to support their work with a donation.

Disclaimer

wh...