[sdiy] Can anyone OCR the AN23.PDF File Here?
Bernard Arthur Hutchins Jr
bah13 at cornell.edu
Thu Jul 6 19:30:38 CEST 2017
Thanks Rob -
True - the equations are now usable, but slightly more blurred than my original PDF. Likewise, the figures are now OK but of slightly lower quality, which does NOT matter much for hand drawings.
I did note a lot of OCR misreads in the text. A careful proofing of the text took me 18 minutes and turned up 25 errors, some not at all obscure, and for about 13 of them I had to look at the original to see what they were supposed to be. (One was hard to detect since it substituted an Rf for an Ri, a disaster.) A full proofread/correction would take at least 30 minutes (188 eight-hour days for 6000 pages). And I wrote this! Almost certainly a volunteer would have more trouble and miss errors.
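As a rough check on these figures, the arithmetic can be sketched in a few lines of Python. This is only a back-of-envelope sketch: it assumes AN23 is 2 pages long, which is not stated anywhere in the thread but is what makes the 30 minutes and the 188 days agree (and matches Rob's 10 minutes at 5 minutes/page). The names `workdays`, `AN23_PAGES` and so on are made up for illustration.

```python
# Back-of-envelope check of the effort estimates quoted in this thread.
# ASSUMPTION: AN23 is 2 pages long (inferred, not stated in the thread).

MINUTES_PER_WORKDAY = 8 * 60   # one eight-hour day
TOTAL_PAGES = 6000             # page count quoted in the thread
AN23_PAGES = 2                 # assumption, see above


def workdays(minutes_per_page, pages=TOTAL_PAGES):
    """Eight-hour days needed to process `pages` at the given rate."""
    return minutes_per_page * pages / MINUTES_PER_WORKDAY


# Full proofread/correction: 30 minutes for AN23, i.e. 15 minutes/page.
proofread_days = workdays(30 / AN23_PAGES)

# OCR pass alone, at the 5 minutes/page quoted further down the thread.
ocr_days = workdays(5)

print(f"proofread: {proofread_days} days")  # 187.5, i.e. the 188 quoted
print(f"OCR only:  {ocr_days} days")        # 62.5, i.e. "months"
```

If the page count of AN23 is different, only `AN23_PAGES` needs changing; the per-page rate and the total scale with it.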
In the spirit of no good deed going unpunished, Rob, let me put you on the spot. Take your scan, find and fix the 25 errors. Let us know how easy/hard this was and the time it took, and show your results.
I will post the "solution" to the "find the errors" this evening if I get the chance.
Since there is no improvement in the figures/equations, and the text is a serious downgrade, tell me again (anyone) why an OCR/ebook is a good idea here.
Bernie
________________________________
From: Rob Kam <robkam at ymail.com>
Sent: Thursday, July 6, 2017 7:24 AM
To: Bernard Arthur Hutchins Jr
Cc: synth-diy at synth-diy.org
Subject: RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
There’s a second attempt at http://www.sdiy.info/AN23b.rtf, this time converting the equations to images instead (and still manually tweaking the OCR). It took six minutes to do from the scan/PDF, and the text still needs comparing and correcting against the original.
There are already experts at this sort of project at Archive.org, who have been doing it for a number of years: https://archive.org/details/texts&tab=about
To put my two cents in, the synth DIY community should see whether they are able to raise the funds to compensate (against unsold hardcopy, ebooks etc.) for releasing Electronotes under a non-commercial Creative Commons licence https://creativecommons.org/licenses/by-nc/2.0/uk/
Rob
From: Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu]
Sent: 06 July 2017 01:42
To: Rob Kam <robkam at ymail.com>; mskala at ansuz.sooke.bc.ca
Cc: synth-diy at synth-diy.org
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
Thanks Rob -
But manual identification at 5 minutes/page is no good for the small improvement. That is still months of 8-hour days to do 6000 pages. My PDF is still much better already. The equations are still unusable. It makes the same text errors, apparently. Why not just say it just can't do this? It wasn't intended to.
Thanks for trying - useful data point!
Bernie
________________________________
From: Rob Kam <robkam at ymail.com>
Sent: Wednesday, July 5, 2017 6:47 PM
To: Bernard Arthur Hutchins Jr; mskala at ansuz.sooke.bc.ca
Cc: synth-diy at synth-diy.org
Subject: RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
Hi Bernie,
The result is at http://www.sdiy.info/AN23.rtf. It took 10 minutes to OCR with ABBYY FineReader 12 <http://www.abbyy.com/en-gb/support/finereader-12/>, after first manually identifying the areas of text vs. images. Obviously it still needs further corrections.
Rob