[sdiy] Can anyone OCR the AN23.PDF File Here?

rsdio at audiobanshee.com rsdio at audiobanshee.com
Thu Jul 6 01:18:51 CEST 2017


Bernie,

That's a reasonable challenge, but it's not really representative of how OCR is done. An endeavor like the one we're discussing would start with high-quality image scans, not a PDF. To be fair, I would say that starting with a TIFF or PNG scan of the original (not in PDF format) would be an appropriate challenge. Would it be possible to provide the original scans for AN23?

That said, I think that the biggest challenge will be proof-reading the OCR output and making corrections. As you've said, handling the non-text images could get messy in some places. Most of the effort of scanning the originals and running the first pass of OCR could be automated. My concern is that it will probably be necessary for a human to review all of the text and correct any confusing mis-reads by the automated OCR. The potential problems have already been outlined in this thread: references in the text to parts and other schematic details are hard for software to get perfect.

There have been plenty of volunteers for the automated scan and OCR steps. I'm wondering if it would be possible to use a sort of "crowd-sourced" document review (to borrow some terminology) so that a handful of volunteers could submit corrections. Of course, even with plenty of volunteers, this would still be a significant effort, if only in coordinating all the submissions to make sure they're benefiting the overall effort.

In other words, if you are interested in doing this project with the least amount of effort, it might make sense to divide the task into two phases. The first phase would be quality scans of the originals, handled by one or two volunteers. The second phase would be review of the first-pass OCR output by a much larger group of volunteers who have electronics knowledge and can make corrections. At the very least, the second group could make your task easier by pointing out the problem areas without requiring you to personally review every document.
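
To make the second phase a little more concrete, here is a rough sketch of how the first-pass OCR text might be pre-screened so that reviewers know where to look first. It is only a heuristic of my own invention, not a feature of any OCR package: it assumes part designators are one or two capital letters followed by digits (R10, C3, IC2) and flags tokens where a digit appears to have been swapped for a look-alike letter (O, I, l, S, B). The names SUSPECT and flag_suspects are hypothetical.

```python
import re

# Assumed convention: a part designator is 1-2 capital letters plus digits.
# A token is suspect when its "number" part mixes at least one real digit
# with at least one letter that OCR commonly confuses with a digit.
SUSPECT = re.compile(
    r"\b[A-Z]{1,2}"
    r"(?=[0-9OIlSB]*[0-9])"    # the tail contains at least one digit
    r"(?=[0-9OIlSB]*[OIlSB])"  # ...and at least one look-alike letter
    r"[0-9OIlSB]+\b"
)


def flag_suspects(text):
    """Return (line_number, token) pairs that a human should double-check."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "Connect R1O to C3.\nQ1l drives IC2."
    for line_no, token in flag_suspects(sample):
        print(f"line {line_no}: check '{token}'")
```

Something like this will miss plenty and will raise false alarms, so it would not replace the human review. It would only give volunteers a list of likely trouble spots to start from.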

I hope this helps. It seems like an effort that would pay off for the readers, even though I doubt it will result in massive income.

Brian Willoughby
Sound Consulting


On Jul 5, 2017, at 12:17 PM, Bernard Arthur Hutchins Jr <bah13 at cornell.edu> wrote:
> To begin with, the OCR request was to see if anyone could show me if it was possible to get a better OCR than the one I posted and discussed.
> http://electronotes.netfirms.com/ENWN49.pdf
> 
> It was NOT a request to have anyone scan everything, but a test sample (AN23) to see what could be done beyond what I already knew about. I apologize for not being more clear. My thought was that someone would print out the two pages I posted and scan them into a modern OCR reader and perhaps have a file that is significantly smaller, with search/edit capabilities, and acceptably (automatically) re-positioned figures and equations.
> 
> Everyone was bragging that this would be easy, so if you prefer, it was a challenge. I was fully prepared to be impressed. I got only one taker (thanks) but NO improvement. Essentially my 10-year-old Lexmark returned. Others, whom I respect, said that it would be very difficult.
> 
> If anyone CAN do better, please show your work!
> 



