[sdiy] Can anyone OCR the AN23.PDF File Here?

Quincas Moreira quincas at gmail.com
Fri Jul 7 00:24:39 CEST 2017


I agree that searching by title, issue number and a few content keywords
would already be super handy. We don't need all the files completely
digitized, just a high quality scan with some metadata, much easier! And
I'm all in for crowdfunding this endeavor if Bernie sees fit.
I think we can raise a good amount; synth diy is rapidly growing as a
community, as can be attested by the proliferation of kits and online
stores!
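
To make the metadata idea concrete, here is a minimal sketch in Python,
assuming each scan gets a small hand-entered record. The ScanRecord fields,
the search function and the sample entries are all made up for illustration;
nothing here reflects how Electronotes is actually organized.

```python
from dataclasses import dataclass, field


@dataclass
class ScanRecord:
    """Metadata for one scanned document; the scan itself stays an image/PDF."""
    filename: str
    title: str
    issue: str
    keywords: set = field(default_factory=set)  # a handful of hand-picked terms


def search(records, query):
    """Return records where every query word appears in the title, issue or keywords."""
    words = query.lower().split()
    hits = []
    for rec in records:
        # Flatten the record's metadata into one lowercase string to match against.
        haystack = f"{rec.title} {rec.issue} {' '.join(sorted(rec.keywords))}".lower()
        if all(word in haystack for word in words):
            hits.append(rec)
    return hits


# Made-up example entries, not real Electronotes metadata.
records = [
    ScanRecord("scan_001.pdf", "Example note on filters", "AN-1", {"vcf", "op-amp"}),
    ScanRecord("scan_002.pdf", "Example note on oscillators", "AN-2", {"vco", "sawtooth"}),
]

for rec in search(records, "vcf"):
    print(rec.filename, "-", rec.title)
```

Matching is plain substring matching on a dozen or so words per document,
so no OCR is involved at all. The cost is that someone has to type in the
keywords.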



On Thu, Jul 6, 2017 at 5:07 PM, Mike HEQX <mike at heqx.com> wrote:

> You're right about that Dave. You don't need to be able to search every
> single word per page. That is why a good taxonomical index is the way to go.
>
> Mike
>
> On 7/6/2017 4:32 PM, Dave Magnuson wrote:
>
> Is it possible to leave the scans as images and just have someone add some
> sort of metadata to each page instead? Then you’d be searching through
> perhaps a dozen or two “keywords” per page instead of the actual document
> text.
>
>
>
> Just a thought…  I’ve had terrible luck with OCR myself, except on the
> most simple of scans.
>
>
>
> Dave
>
>
>
> *From:* Synth-diy [mailto:synth-diy-bounces at synth-diy.org
> <synth-diy-bounces at synth-diy.org>] *On Behalf Of *Bruno Afonso
> *Sent:* Thursday, July 6, 2017 3:42 PM
> *To:* Bernard Arthur Hutchins Jr; Rob Kam
> *Cc:* synth-diy at synth-diy.org
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
> Bernie,
>
>
>
> I'd be happy to have a go using open source software such as tesseract. I
> feel you cannot tackle this problem without further training the classifier
> on the nuances of this text and constraining what the possibilities
> are. I'd like you to rescan a representative item into tiffs at 300 or
> 600dpi. You stated that the tiffs would look the same but that is not true.
> The tiff exports out of AN23.pdf do not have the same quality as the
> original scanned image (likely tiffs), and this does not help.
>
>
>
> Is the end goal to replace the original image text with just text using a
> similar font? For me the most useful would be to have OCR'ed most of it so
> it's searchable. But again, you have not set what you find acceptable or
> what your goal is. It's ok if you don't know. You either propose this as a
> challenge that will never be possible to accomplish (some teachers never
> want to give students top score) or you compromise and propose a solution
> that enhances your original pdfs with value worth money for most people. I
> find most people would be happy keeping the original scanned text and
> simply having it OCR'ed to the best possible for their cursory searches.
> But I may see things differently than most people :) In academic lingo,
> you should provide some ground-truth examples of what you consider a
> perfect conversion of AN23.pdf, an acceptable one, and one not worth the time.
>
>
>
> You can never rely on ONE volunteer, but you can certainly get many
> excited so that over time, as a group, something is accomplished.
>
>
>
> Cheers
>
> b
>
>
>
> On Thu, Jul 6, 2017 at 3:11 PM Bernard Arthur Hutchins Jr <
> bah13 at cornell.edu> wrote:
>
> Thanks Rob -
>
>
>
> Really makes my point, and I guess I should not rely on volunteers!  I
> don't blame you one bit - just does not work.
>
>
>
> I expect no one else wants to try either. If anyone does, don't look at
> the crib below until after you try. Errors are located and circled in red.
>
>
>
> http://electronotes.netfirms.com/AN23Rob.PDF
>
>
>
> Please all, let's agree that the OCR issue is bogus as applied here.
>
>
>
> Bernie
>
>
> ------------------------------
>
> *From:* Rob Kam <robkam at ymail.com>
> *Sent:* Thursday, July 6, 2017 1:51 PM
>
>
> *To:* Bernard Arthur Hutchins Jr
> *Cc:* synth-diy at synth-diy.org
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
> Thanks for the challenge Bernie but no thanks. I don't have the patience
> to correct the OCR.
>
> Rob
>
>
> ------------------------------
>
> *From:* Bernard Arthur Hutchins Jr <bah13 at cornell.edu>
> *To:* Rob Kam <robkam at ymail.com>
> *Cc:* "synth-diy at synth-diy.org" <synth-diy at synth-diy.org>
> *Sent:* Thursday, 6 July 2017, 18:30
>
>
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
> Thanks Rob -
>
>
>
> True - the equations are now usable, but slightly more blurred than my
> original PDF. Likewise, the figures are now OK but of slightly lower
> quality, which does NOT matter much for hand drawings.
>
>
>
> I did note a lot of OCR misreads in the text.  A careful proofing of the
> text took me 18 minutes and there are 25 errors, some not at all obscure,
> and about 13 of which I had to look at the original to see what they were
> supposed to be.  (One was hard to detect since it substituted an Rf for an
> Ri, a disaster).  A full proofread/correction would take at least
> 30 minutes (188 eight-hour days for 6000 pages). And I wrote this!
> Almost certainly a volunteer would have more trouble and miss errors.
>
>
>
> In the spirit of no good deed going unpunished, Rob, let me put you on the
> spot. Take your scan, find and fix the 25 errors.  Let us know how
> easy/hard this was and the time it took, and show your results.
>
>
>
> I will post the "solution" to the "find the errors" this evening if I get
> the chance.
>
>
>
> Since there is no improvement in the figures/equations, and the text is a
> serious downgrade, tell me again (anyone) why an OCR/ebook is a good idea
> here.
>
>
>
> Bernie
>
>
> ------------------------------
>
> *From:* Rob Kam <robkam at ymail.com>
> *Sent:* Thursday, July 6, 2017 7:24 AM
> *To:* Bernard Arthur Hutchins Jr
> *Cc:* synth-diy at synth-diy.org
> *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
> There’s a second attempt at http://www.sdiy.info/AN23b.rtf converting the
> equations to images instead, (and still manually tweaking the OCR). It took
> six minutes to do from the scan/PDF and the text still needs comparing and
> correcting against the original.
>
>
>
> There are already experts at this sort of project at Archive.org, who have
> been doing this for a number of years:
> https://archive.org/details/texts&tab=about
>
> Free Books : Download & Streaming : eBooks and Texts ...
> <https://archive.org/details/texts&tab=about>
>
> archive.org
>
> The Internet Archive offers over 12,000,000 freely downloadable books and
> texts. There is also a collection of 550,000 modern eBooks that may be
> borrowed by anyone ...
>
>
>
>
>
>
>
> To put my two cents in, the synth DIY community should see whether they
> are able to raise the funds to compensate (against unsold hardcopy, ebooks
> etc.) for releasing Electronotes under a non-commercial Creative Commons
> licence https://creativecommons.org/licenses/by-nc/2.0/uk/
>
>
>
> Rob
>
>
>
> *From:* Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu]
> *Sent:* 06 July 2017 01:42
> *To:* Rob Kam <robkam at ymail.com>; mskala at ansuz.sooke.bc.ca
> *Cc:* synth-diy at synth-diy.org
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
>
>
> Thanks Rob -
>
>
>
> But manual identification at 5 minutes/page is no good for such a small
> improvement. That is still months of 8-hour days to do 6000 pages. My PDF
> is still much better already. The equations are still unusable. It makes
> the same text errors, apparently. Why not just say it just can't do this?
> It wasn't intended to.
>
>
>
> Thanks for trying - useful data point!
>
>
>
> Bernie
> ------------------------------
>
> *From:* Rob Kam <robkam at ymail.com>
> *Sent:* Wednesday, July 5, 2017 6:47 PM
> *To:* Bernard Arthur Hutchins Jr; mskala at ansuz.sooke.bc.ca
> *Cc:* synth-diy at synth-diy.org
> *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
>
>
> Hi Bernie,
>
>
> At http://www.sdiy.info/AN23.rtf this took 10 minutes to OCR with ABBYY
> FineReader 12 <http://www.abbyy.com/en-gb/support/finereader-12/>,
> first manually identifying areas of text vs. images. Obviously it still
> needs further corrections.
>
> Rob
>
>
>
> _______________________________________________
> Synth-diy mailing list
> Synth-diy at synth-diy.org
> http://synth-diy.org/mailman/listinfo/synth-diy
>
>
>
>
>


-- 
Quincas Moreira
Test Pilot at VBrazil Modular