[sdiy] Can anyone OCR the AN23.PDF File Here?

Mike HEQX mike at heqx.com
Fri Jul 7 00:07:22 CEST 2017


You're right about that Dave. You don't need to be able to search every 
single word per page. That is why a good taxonomical index is the way to go.

Mike


On 7/6/2017 4:32 PM, Dave Magnuson wrote:
>
> Is it possible to leave the scans as images and just have someone add 
> some sort of metadata to each page instead?  Then you’d be searching 
> through perhaps a dozen or two “keywords” per page instead of the 
> actual document text.
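
A rough sketch of the keywords-per-page idea, in Python. Everything specific here is made up for illustration: the page identifiers, the keywords, and the function names `build_index` and `search` do not come from AN23 or from any existing tool. It assumes someone has already tagged each scanned page by hand.

```python
from collections import defaultdict

def build_index(page_keywords):
    """Map each lowercased keyword to the sorted list of pages tagged with it."""
    index = defaultdict(set)
    for page, keywords in page_keywords.items():
        for kw in keywords:
            index[kw.strip().lower()].add(page)
    return {kw: sorted(pages) for kw, pages in index.items()}

def search(index, *terms):
    """Return the pages tagged with every one of the given terms."""
    sets = [set(index.get(t.strip().lower(), [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical hand-entered tags; the scans themselves stay as images.
tags = {
    "AN23-p1": ["op-amp", "gain", "feedback resistor"],
    "AN23-p2": ["gain", "input resistor"],
}
index = build_index(tags)
print(search(index, "gain"))            # pages tagged "gain"
print(search(index, "gain", "op-amp"))  # pages tagged with both terms
```

The search only ever touches the short tag lists, so OCR quality does not enter into it. The cost moves to whoever does the tagging.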
>
> Just a thought…  I’ve had terrible luck with OCR myself, except on the 
> most simple of scans.
>
> Dave
>
> *From:*Synth-diy [mailto:synth-diy-bounces at synth-diy.org] *On Behalf 
> Of *Bruno Afonso
> *Sent:* Thursday, July 6, 2017 3:42 PM
> *To:* Bernard Arthur Hutchins Jr; Rob Kam
> *Cc:* synth-diy at synth-diy.org
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Bernie,
>
> I'd be happy to have a go using open source software such as 
> tesseract. I feel you cannot tackle this problem without further 
> training the classifier on the nuances of this text and constraining 
> what the possibilities are. I'd like you to rescan a representative 
> item into tiffs at 300 or 600dpi. You stated that the tiffs would look 
> the same but that is not true. The tiff exports out of AN23.pdf do not 
> have the same quality as the original scanned image (likely tiffs), 
> and this does not help.
>
> Is the end goal to replace the original image text with just text 
> using a similar font? For me the most useful would be to have OCR'ed 
> most of it so it's searchable. But again, you have not set what you 
> find acceptable or what your goal is. It's ok if you don't know. You 
> either propose this as a challenge that will never be possible to 
> accomplish (some teachers never want to give students top score) or 
> you compromise and propose a solution that enhances your original pdfs 
> with value worth money for most people. I find most people would be 
> happy keeping the original scanned text and simply having it OCR'ed to 
> the best possible for their cursory searches. But I may see things 
> differently than most people :) In academic lingo, you should provide 
> some ground-truth examples of what you consider a perfect conversion 
> of AN23.pdf, an acceptable one, and one not worth the time.
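
One way to use a ground-truth transcription is to score OCR output against it word by word. The sketch below does that with Python's standard `difflib`. The function names `word_errors` and `word_accuracy` and the sample sentence are invented for illustration; this is not how tesseract or FineReader report accuracy.

```python
import difflib

def word_errors(truth, ocr):
    """List (expected, got) word spans where the OCR text departs from the truth."""
    t, o = truth.split(), ocr.split()
    matcher = difflib.SequenceMatcher(a=t, b=o, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((" ".join(t[i1:i2]), " ".join(o[j1:j2])))
    return errors

def word_accuracy(truth, ocr):
    """Fraction of ground-truth words that the OCR output reproduces in order."""
    t, o = truth.split(), ocr.split()
    matcher = difflib.SequenceMatcher(a=t, b=o, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(t) if t else 1.0

# Made-up example: a single-character misread that changes the meaning.
truth = "the gain is set by Rf and Ri"
ocr = "the gain is set by Rf and Rf"
print(word_errors(truth, ocr))    # [('Ri', 'Rf')]
print(word_accuracy(truth, ocr))  # 0.875
```

A score like this makes "acceptable" something you can put a number on. It weights every word equally, though, so a swapped component name counts the same as a harmless typo.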
>
> You can never rely on ONE volunteer, but you can certainly get many 
> excited so that, over time, something is accomplished as a group.
>
> Cheers
>
> b
>
> On Thu, Jul 6, 2017 at 3:11 PM Bernard Arthur Hutchins Jr 
> <bah13 at cornell.edu <mailto:bah13 at cornell.edu>> wrote:
>
>     Thanks Rob -
>
>     Really makes my point, and I guess I should not rely on
>     volunteers!  I don't blame you one bit - just does not work.
>
>     I expect no one else wants to try either.  If anyone does, don't
>     look at the crib below until after you try.   Errors located and
>     circled in red.
>
>     http://electronotes.netfirms.com/AN23Rob.PDF
>
>     Please all, let's agree that the OCR issue is bogus as applied here.
>
>     Bernie
>
>     ------------------------------------------------------------------------
>
>     *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
>     *Sent:* Thursday, July 6, 2017 1:51 PM
>
>
>     *To:* Bernard Arthur Hutchins Jr
>     *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
>     *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>     Thanks for the challenge Bernie but no thanks. I don't have the
>     patience to correct the OCR.
>
>     Rob
>
>     ------------------------------------------------------------------------
>
>     *From:*Bernard Arthur Hutchins Jr <bah13 at cornell.edu
>     <mailto:bah13 at cornell.edu>>
>     *To:* Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
>     *Cc:* "synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>"
>     <synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>>
>     *Sent:* Thursday, 6 July 2017, 18:30
>
>
>     *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>     Thanks Rob -
>
>     True - the equations are now usable, but slightly more blurred
>     than my original PDF. Likewise, the figures are now OK but of
>     slightly lower quality, which does NOT matter much for hand drawings.
>
>     I did note a lot of OCR misreads in the text. A careful proofing
>     of the text took me 18 minutes and there are 25 errors, some not
>     at all obscure, and about 13 of which I had to look at the
>     original to see what they were supposed to be.  (One was hard to
>     detect since it substituted an Rf for an Ri, a disaster).  A full
>     proofread/correction would take at least 30 minutes (188
>     eight-hour days for 6000 pages).  And I wrote this!  Almost
>     certainly a volunteer would have more trouble and miss errors.
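
The workload figures in this thread can be checked with a few lines of Python. The 188-day number corresponds to 15 minutes/page, which is what 30 minutes of proofreading works out to if AN23 is a two-page note. That page count is an assumption, since the thread never states it. The function name `workdays` is made up for this sketch.

```python
import math

def workdays(total_pages, minutes_per_page, hours_per_day=8):
    """Working days needed to process total_pages at a given per-page rate."""
    return total_pages * minutes_per_page / (hours_per_day * 60)

AN23_PAGES = 2                            # assumption, not stated in the thread
proof_minutes_per_page = 30 / AN23_PAGES  # 15 minutes/page

# Full proofread of 6000 pages at 15 minutes/page.
print(math.ceil(workdays(6000, proof_minutes_per_page)))  # 188

# The 5 minutes/page figure from earlier in the thread.
print(workdays(6000, 5))  # 62.5
```

At 5 minutes/page the total is still about 62 working days, roughly three months of full-time effort.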
>
>     In the spirit of no good deed going unpunished, Rob, let me put
>     you on the spot. Take your scan, find and fix the 25 errors.  Let
>     us know how easy/hard this was and the time it took, and show your
>     results.
>
>     I will post the "solution" to the "find the errors" this evening
>     if I get the chance.
>
>     Since there is no improvement in the figures/equations, and the
>     text is a serious downgrade, tell me again (anyone) why an
>     OCR/ebook is a good idea here.
>
>     Bernie
>
>     ------------------------------------------------------------------------
>
>     *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
>     *Sent:* Thursday, July 6, 2017 7:24 AM
>     *To:* Bernard Arthur Hutchins Jr
>     *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
>     *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>     There’s a second attempt at http://www.sdiy.info/AN23b.rtf
>     converting the equations to images instead, (and still manually
>     tweaking the OCR). It took six minutes to do from the scan/PDF and
>     the text still needs comparing and correcting against the original.
>
>     There are already experts at this sort of project, at Archive.org
>     who have been doing this for a number of years
>     https://archive.org/details/texts&tab=about
>
>     Free Books : Download & Streaming : eBooks and Texts ...
>     <https://archive.org/details/texts&tab=about>
>
>     archive.org <http://archive.org>
>
>     The Internet Archive offers over 12,000,000 freely downloadable
>     books and texts. There is also a collection of 550,000 modern
>     eBooks that may be borrowed by anyone ...
>
>     To put my two cents in, the synth DIY community should see whether
>     they are able to raise the funds to compensate (against unsold
>     hardcopy, ebooks etc.) for releasing Electronotes under a
>     non-commercial Creative Commons licence
>     https://creativecommons.org/licenses/by-nc/2.0/uk/
>
>     Rob
>
>     *From:*Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu
>     <mailto:bah13 at cornell.edu>]
>     *Sent:* 06 July 2017 01:42
>     *To:* Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>;
>     mskala at ansuz.sooke.bc.ca <mailto:mskala at ansuz.sooke.bc.ca>
>     *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
>     *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>     Thanks Rob -
>
>     But manual identification and 5 minutes/page are no good for the
>     small improvement. Still months of 8-hour days to do 6000 pages.
>     My PDF is still much better already.  The equations are still
>     unusable. It makes the same text errors, apparently.  Why not just
>     say it just can't do this? Wasn't intended to.
>
>     Thanks for trying - useful data point!
>
>     Bernie
>
>     ------------------------------------------------------------------------
>
>     *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
>     *Sent:* Wednesday, July 5, 2017 6:47 PM
>     *To:* Bernard Arthur Hutchins Jr; mskala at ansuz.sooke.bc.ca
>     <mailto:mskala at ansuz.sooke.bc.ca>
>     *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
>     *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>     Hi Bernie,
>
>
>     At http://www.sdiy.info/AN23.rtf this took 10 minutes to OCR with
>     ABBYY FineReader 12
>     <http://www.abbyy.com/en-gb/support/finereader-12/>,
>     first manually identifying areas of text vs. images. Obviously it
>     still needs further corrections.
>
>     Rob
>
>     _______________________________________________
>     Synth-diy mailing list
>     Synth-diy at synth-diy.org <mailto:Synth-diy at synth-diy.org>
>     http://synth-diy.org/mailman/listinfo/synth-diy
>
>
>
