[sdiy] Can anyone OCR the AN23.PDF File Here?
Mike HEQX
mike at heqx.com
Fri Jul 7 00:07:22 CEST 2017
You're right about that, Dave. You don't need to be able to search every
single word per page. That is why a good taxonomical index is the way to go.
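
As a rough illustration of the keyword-per-page idea, here is a minimal sketch in Python. It assumes someone has already hand-tagged each scanned page; the file names, keywords, and the helper names `build_keyword_index` and `search` are all invented for the example and are not taken from any existing Electronotes index.

```python
from collections import defaultdict


def build_keyword_index(page_keywords):
    """Invert {page_id: [keywords]} into {keyword: sorted list of page_ids}."""
    index = defaultdict(set)
    for page_id, keywords in page_keywords.items():
        for kw in keywords:
            # Normalise so "Op-Amp" and "op-amp " land in the same bucket.
            index[kw.strip().lower()].add(page_id)
    return {kw: sorted(pages) for kw, pages in index.items()}


def search(index, *terms):
    """Return the pages tagged with every term (case-insensitive AND search)."""
    hits = [set(index.get(t.strip().lower(), ())) for t in terms]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical hand-entered tags: page IDs and keywords are made up.
page_keywords = {
    "scan-0001.tif": ["op-amp", "gain", "feedback"],
    "scan-0002.tif": ["op-amp", "filter"],
    "scan-0003.tif": ["Filter", "gain"],
}

index = build_keyword_index(page_keywords)
print(search(index, "op-amp", "gain"))  # pages tagged with both terms
print(search(index, "filter"))          # pages tagged with "filter"
```

The scans stay untouched as images; only the small tag table has to be typed and proofread, which is a far smaller job than correcting full OCR text.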
Mike
On 7/6/2017 4:32 PM, Dave Magnuson wrote:
>
> Is it possible to leave the scans as images and just have someone add
> some sort of metadata to each page instead? Then you’d be searching
> through perhaps a dozen or two “keywords” per page instead of the
> actual document text.
>
> Just a thought… I’ve had terrible luck with OCR myself, except on the
> most simple of scans.
>
> Dave
>
> *From:*Synth-diy [mailto:synth-diy-bounces at synth-diy.org] *On Behalf
> Of *Bruno Afonso
> *Sent:* Thursday, July 6, 2017 3:42 PM
> *To:* Bernard Arthur Hutchins Jr; Rob Kam
> *Cc:* synth-diy at synth-diy.org
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Bernie,
>
> I'd be happy to have a go using open source software such as
> tesseract. I feel you cannot tackle this problem without further
> training the classifier on the nuances of this text and constraining
> what the possibilities are. I'd like you to rescan a representative
> item into tiffs at 300 or 600dpi. You stated that the tiffs would look
> the same but that is not true. The tiff exports out of AN23.pdf do not
> have the same quality as the original scanned image (likely tiffs),
> and this does not help.
>
> Is the end goal to replace the original image text with just text
> using a similar font? For me the most useful would be to have OCR'ed
> most of it so it's searchable. But again, you have not set what you
> find acceptable or what your goal is. It's ok if you don't know. You
> either propose this as a challenge that will never be possible to
> accomplish (some teachers never want to give students top score) or
> you compromise and propose a solution that enhances your original pdfs
> with value worth money for most people. I find most people would be
> happy keeping the original scanned text and simply having it OCR'ed as
> well as possible for their cursory searches. But I may see things
> differently from most people :) In academic lingo, you should provide
> some ground-truth examples of what you consider a perfect conversion
> of AN23.pdf, an acceptable one, and one not worth the time.
>
> You can never rely on ONE volunteer, but you can certainly get many
> people excited so that, over time, the group accomplishes something.
>
> Cheers
>
> b
>
> On Thu, Jul 6, 2017 at 3:11 PM Bernard Arthur Hutchins Jr
> <bah13 at cornell.edu <mailto:bah13 at cornell.edu>> wrote:
>
> Thanks Rob -
>
> Really makes my point, and I guess I should not rely on
> volunteers! I don't blame you one bit - just does not work.
>
> I expect no one else wants to try either. If anyone does, don't
> look at the crib below until after you try. Errors located and
> circled in red.
>
> http://electronotes.netfirms.com/AN23Rob.PDF
>
> Please all, let's agree that the OCR issue is bogus as applied here.
>
> Bernie
>
> ------------------------------------------------------------------------
>
> *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
> *Sent:* Thursday, July 6, 2017 1:51 PM
>
>
> *To:* Bernard Arthur Hutchins Jr
> *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Thanks for the challenge Bernie but no thanks. I don't have the
> patience to correct the OCR.
>
> Rob
>
> ------------------------------------------------------------------------
>
> *From:*Bernard Arthur Hutchins Jr <bah13 at cornell.edu
> <mailto:bah13 at cornell.edu>>
> *To:* Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
> *Cc:* "synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>"
> <synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>>
> *Sent:* Thursday, 6 July 2017, 18:30
>
>
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Thanks Rob -
>
> True - the equations are now usable, but slightly more blurred
> than my original PDF. Likewise, the figures are now OK but of
> slightly lower quality, which does NOT matter much for hand drawings.
>
> I did note a lot of OCR misreads in the text. A careful proofing
> of the text took me 18 minutes and there are 25 errors, some not
> at all obscure, and about 13 of which I had to look at the
> original to see what they were supposed to be. (One was hard to
> detect since it substituted an Rf for an Ri, a disaster). A full
> proofread/correction would take at least 30 minutes (188
> eight-hour days for 6000 pages). And I wrote this! Almost
> certainly a volunteer would have more trouble and miss errors.
>
> In the spirit of no good deed going unpunished, Rob, let me put
> you on the spot. Take your scan, find and fix the 25 errors. Let
> us know how easy/hard this was and the time it took, and show your
> results.
>
> I will post the "solution" to the "find the errors" this evening
> if I get the chance.
>
> Since there is no improvement in the figures/equations, and the
> text is a serious downgrade, tell me again (anyone) why an
> OCR/ebook is a good idea here.
>
> Bernie
>
> ------------------------------------------------------------------------
>
> *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
> *Sent:* Thursday, July 6, 2017 7:24 AM
> *To:* Bernard Arthur Hutchins Jr
> *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
> *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> There’s a second attempt at http://www.sdiy.info/AN23b.rtf
> converting the equations to images instead, (and still manually
> tweaking the OCR). It took six minutes to do from the scan/PDF and
> the text still needs comparing and correcting against the original.
>
> There are already experts at this sort of project, at Archive.org
> who have been doing this for a number of years
> https://archive.org/details/texts&tab=about
>
> Free Books : Download & Streaming : eBooks and Texts ...
> <https://archive.org/details/texts&tab=about>
>
> archive.org <http://archive.org>
>
> The Internet Archive offers over 12,000,000 freely downloadable
> books and texts. There is also a collection of 550,000 modern
> eBooks that may be borrowed by anyone ...
>
> To put my two cents in, the synth DIY community should see whether
> they are able to raise the funds to compensate (against unsold
> hardcopy, ebooks etc.) for releasing Electronotes under a
> non-commercial Creative Commons licence
> https://creativecommons.org/licenses/by-nc/2.0/uk/
>
> Rob
>
> *From:*Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu
> <mailto:bah13 at cornell.edu>]
> *Sent:* 06 July 2017 01:42
> *To:* Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>;
> mskala at ansuz.sooke.bc.ca <mailto:mskala at ansuz.sooke.bc.ca>
> *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
> *Subject:* Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Thanks Rob -
>
> But manual identification at 5 minutes/page is no good for the
> small improvement. Still months of 8-hour days to do 6000 pages.
> My PDF is still much better already. The equations are still
> unusable. It makes the same text errors, apparently. Why not just
> say it just can't do this? Wasn't intended to.
>
> Thanks for trying - useful data point!
>
> Bernie
>
> ------------------------------------------------------------------------
>
> *From:*Rob Kam <robkam at ymail.com <mailto:robkam at ymail.com>>
> *Sent:* Wednesday, July 5, 2017 6:47 PM
> *To:* Bernard Arthur Hutchins Jr; mskala at ansuz.sooke.bc.ca
> <mailto:mskala at ansuz.sooke.bc.ca>
> *Cc:* synth-diy at synth-diy.org <mailto:synth-diy at synth-diy.org>
> *Subject:* RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
>
> Hi Bernie,
>
>
> At http://www.sdiy.info/AN23.rtf this took 10 minutes to OCR with
> ABBYY FineReader 12
> <https://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwiZhc6ZmPPUAhVG6RQKHRHpA1UQFggoMAA&url=http%3A%2F%2Fwww.abbyy.com%2Fen-gb%2Fsupport%2Ffinereader-12%2F&usg=AFQjCNHLOjsz219pjjTDqDytG2Cpm9N90w>,
> first manually identifying areas of text vs. images. Obviously it
> still needs further corrections.
>
> Rob
>
> _______________________________________________
> Synth-diy mailing list
> Synth-diy at synth-diy.org <mailto:Synth-diy at synth-diy.org>
> http://synth-diy.org/mailman/listinfo/synth-diy
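
The time estimates quoted in the thread (30 minutes to proofread AN23, "188 eight-hour days for 6000 pages", and "months of 8-hour days" at 5 minutes/page) can be checked with a few lines of Python. This is only a sketch of the arithmetic. It assumes AN23 is two pages long, which is inferred from the 10-minute OCR run being described as 5 minutes/page and is not stated outright; the function name is invented for the example.

```python
def proofreading_workdays(total_pages, minutes_per_page, hours_per_day=8):
    """Convert a per-page effort into full working days for the whole archive."""
    total_minutes = total_pages * minutes_per_page
    return total_minutes / 60 / hours_per_day


# Assumption: AN23 is 2 pages, so 30 minutes of proofreading is 15 minutes/page.
print(proofreading_workdays(6000, 30 / 2))  # 187.5, i.e. the "188 days" figure

# OCR with manual zoning alone, at the quoted 5 minutes/page.
print(proofreading_workdays(6000, 5))       # 62.5 working days
```

Under that two-page assumption the numbers in the thread are self-consistent: roughly 188 working days for proofreading and about 63 more for the OCR pass itself.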