[sdiy] From Bernie of Electronotes

john slee indigoid at oldcorollas.org
Mon Jun 26 00:28:59 CEST 2017


On 25 June 2017 at 23:39, Tom Wiltshire <tom at electricdruid.net> wrote:

> It certainly would. A lot. Scanning them all would be a big job, but
> OCR'ing them all too would be a major task.
>

It sure would. Some years ago I worked on a project that involved
digitizing roughly 619,000 sheets (!) of microfiche, each of which
depicted 14x8 (or something like that) page images. The subject matter
was the entire repository of Australian patent documents from 1905ish
(I think) until 2005 or so. Most of it was on microfiche or 35mm
microfilm (for which specialised scanners are available, albeit at
non-trivial cost), plus a whole lot of paper.
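
For a sense of scale, here is a back-of-envelope sketch of the page
count. It takes the 14x8 grid at face value (that figure is only
approximate, as noted above) and assumes every sheet is completely
full, so treat the result as a rough upper bound rather than the real
number. The variable names are just illustrative.

```python
# Rough upper bound on page images, assuming every microfiche sheet
# carries a full 14x8 grid of frames (the grid size is approximate).
SHEETS = 619_000
FRAMES_PER_SHEET = 14 * 8  # 112 page images per sheet

total_pages = SHEETS * FRAMES_PER_SHEET
print(f"{total_pages:,} page images at most")
```

That is on the order of tens of millions of pages before counting the
microfilm and paper.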

We ran full-page OCR for every page. The OCR alone is a lot of work
even if you already have the specialised workflow software to
coordinate it (which my employer had developed in-house over many
years) and the right people to run it.

The end result was pretty good. The output documents were PDFs with an
invisible text layer beneath the image, laid out so that if you
searched the text, the PDF viewer would appear to highlight the
corresponding part of the image.
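
To illustrate the idea (this is not the in-house software, just a
minimal sketch), the trick is mostly coordinate mapping: OCR engines
typically report word boxes in image pixels with the origin at the top
left, while PDF user space defaults to points (1/72 inch) with the
origin at the bottom left. The function names, the (left, top, right,
bottom) box layout and the "F1" font resource name are all assumptions
made for the example. Text rendering mode 3 ("3 Tr") is the standard
PDF way to draw text that is neither filled nor stroked, i.e.
invisible but still searchable.

```python
def pixel_box_to_pdf_points(box, image_height_px, dpi):
    """Map an OCR word box from pixels to PDF points.

    box is (left, top, right, bottom) in pixels, origin at top left
    (an assumed layout). Returns (x, y, width, height) in points with
    the origin at bottom left, which is PDF's default user space.
    """
    left, top, right, bottom = box
    scale = 72.0 / dpi  # points per pixel
    x = left * scale
    y = (image_height_px - bottom) * scale  # flip the vertical axis
    return x, y, (right - left) * scale, (bottom - top) * scale


def invisible_text_ops(word, box, image_height_px, dpi, font_name="F1"):
    """Build content-stream operators placing `word` invisibly over its box.

    The box height is used as the font size, which is only a crude
    approximation; font_name is a hypothetical resource name.
    """
    x, y, _width, height = pixel_box_to_pdf_points(box, image_height_px, dpi)
    # Escape the characters that are special inside a PDF literal string.
    escaped = word.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"BT 3 Tr /{font_name} {height:.2f} Tf {x:.2f} {y:.2f} Td ({escaped}) Tj ET"


# Example: a word found on a 300 dpi scan that is 3300 pixels tall.
print(invisible_text_ops("patent", (300, 600, 900, 660), 3300, 300))
```

A real pipeline would also stretch each word horizontally to match its
box width, so that the viewer's highlight lines up with the image.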

The actual scanning is the easy and relatively quick task IME. Load the
paper bits into a scanner with a proper document feeder. If you found
someone with a Kodak i840 or some other scanner with Kodak's Perfect
Page tech, you'd be done in less than an hour and with pretty good
quality. What takes (far) more time is classifying pages and splitting
things into their separate documents... and doing QA.

I would contribute to a crowdfunding effort to outsource scanning/OCR
if that was a possibility. I don't have access to the cool toys (like
the i840, or the workflow software!) to do it myself anymore :-(

John