[sdiy] Can anyone OCR the AN23.PDF File Here?
Bernard Arthur Hutchins Jr
bah13 at cornell.edu
Thu Jul 6 02:21:00 CEST 2017
________________________________
From: rsdio at audiobanshee.com <rsdio at audiobanshee.com>
Sent: Wednesday, July 5, 2017 7:18 PM
To: Bernard Arthur Hutchins Jr
Cc: synth-diy at synth-diy.org List
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
Bernie,
That's a reasonable challenge, but it's not really representative of how OCR is done. An endeavor like we're discussing would start with high-quality image scans, not a PDF. To be fair, I would say that starting with a TIFF or PNG scan of the original (not in PDF format) would be an appropriate challenge. Would it be possible to provide the original scans for AN23?
***********************************************************************************
Thanks, Brian. Well, there is no original scan. The original is typewritten text and pen drawings on white paper. It looks a GREAT deal like what you would get if you print out my 300 dpi PDF. The text is picked up quite well even from this scan of a copy. The figures and the equations are a complete failure.
*************************************************************************************
That said, I think that the biggest challenge will be proof-reading the OCR output and making corrections. As you've said, handling the non-text images could get messy in some places. Most of the effort of scanning the originals and running the first pass of OCR could be automated. My concern is that it will probably be necessary for a human to review all of the text and correct any confusing mis-reads by the automated OCR. The potential problems have already been outlined in this thread, because references in the text to parts and other schematic details are hard for software to get perfect.
***************************************************************************************
Who would proofread the OCRs? Except for the 6000 pages (!!!) I might be able to do the job and the average reader here might be of some use. Ultimately, it gets down to a word-by-word and sometimes character-by-character comparison. What? 3 months of 8 hour days? Something like that. Sorry.
*****************************************************************************************
There have been plenty of volunteers for the automated scan and OCR steps. I'm wondering if it would be possible to use a sort of "cloud sourced" document review (to borrow some terminology) so that a handful of volunteers could submit corrections. Of course, even with plenty of volunteers, this would still be a significant effort, if only in coordinating all the submissions to make sure they're benefitting the total effort.
***********************************************************************************
All this for a clearer font? Do you really think so?
***********************************************************************************
In other words, if you are interested in doing this project with the least amount of effort, it might make sense to divide the task into two phases. The first phase would be quality scans of the originals, handled by one or two volunteers. The second phase would be review of the first-pass OCR output by a much larger group of volunteers who have electronics knowledge and can make corrections. At the very least, the second group could make your task easier by pointing out the problem areas without requiring you to personally review every document.
***********************************************************************************
The least amount of effort is NOT doing it. Leave well-enough alone. ENWN-49.
***********************************************************************************
I hope this helps. It seems like an effort that would pay off for the readers, even though I doubt it will result in massive income.
***********************************************************************************
Which is why someone needs to submit a business plan and put up money (or admit that it is NOT viable). The surest way to ensure that a project will not get done is to NOT try to make money and rely on volunteers. Not my time, and certainly not my money.
**************************************************************************************
Did you note that my PDF is already searchable?
Brian Willoughby
Sound Consulting
On Jul 5, 2017, at 12:17 PM, Bernard Arthur Hutchins Jr <bah13 at cornell.edu> wrote:
> To begin with, the OCR request was to see if anyone could show me if it was possible to get a better OCR than the one I posted and discussed.
> http://electronotes.netfirms.com/ENWN49.pdf
>
> It was NOT a request to have anyone scan everything, but a test sample (AN23) to see what could be done beyond what I already knew about. I apologize for not being more clear. My thought was that someone would print out the two pages I posted and scan them into a modern OCR reader and perhaps have a file that is significantly smaller, with search/edit capabilities, and acceptably (automatically) re-positioned figures and equations.
>
> Everyone was bragging that this would be easy, so if you prefer, it was a challenge. I was fully prepared to be impressed. I got only one taker (thanks) but NO improvement. Essentially what my 10-year-old Lexmark returned. Others, whom I respect, said that it would be very difficult.
>
> If anyone CAN do better, please show your work!
>