[sdiy] Can anyone OCR the AN23.PDF File Here?
Rob Kam
robkam at ymail.com
Fri Jul 7 12:54:18 CEST 2017
Wiki = a collaboratively edited website which allows users to add, edit or change most content very quickly and easily.
With the mention of converting Electronotes to wiki format I’ve taken the time to wikify the sample PDF at http://www.sdiy.info/w/Electronotes/AN-23_-_The_CA3080_as_a_Voltage-Controlled_Resistor to show what can be done. This was about two hours’ work. The images can be improved later/by someone else.
However, going by experience with the SDIY wiki thus far - don’t expect anyone to collaborate on a project like this. It seems mostly people are happy to write quick answers on email lists and forums but cannot take the time and effort to create content.
Because I enjoy working on the SDIY wiki I’d be happy to add the Musical Engineers Handbook (362 pages) and Builder's Guide and Preferred Circuits Collection (270 pages) to the SDIY wiki. Although I guess working alone this might take a year. However wikifying 6000 pages is too daunting, unless I was being paid to do it.
Rob
From: Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu]
Sent: 07 July 2017 00:53
To: Ove Ridé <nitro2k01 at gmail.com>
Cc: Rob Kam <robkam at ymail.com>; synth-diy at synth-diy.org
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
_____
From: Ove Ridé <nitro2k01 at gmail.com>
Sent: Thursday, July 6, 2017 4:13 PM
To: Bernard Arthur Hutchins Jr
Cc: Rob Kam; synth-diy at synth-diy.org
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
OCR is not a panacea, but it's a useful tool since it can often do the bulk of the transcription, saving heavily on manual labour (typing). And in my humble opinion this is the way to go in the long run, both for preservation of a legacy in a format that's less likely to be infested by "bit rot" and for practical reasons of searchability. To give a practical example of searchability, although not necessarily identical to this use case, I have many times lost the link to some obscure web page. I've later been able to find it again, in part, by searching for some particular word or phrase used on that page, in addition to other keywords.
What I would suggest as an end goal, and we might disagree philosophically, is to convert the Electronotes to a "natively digital" format, such as HTML or wiki formatting, where the text, instead of being confined to lines of a certain length like on paper, would be formatted into paragraphs that flow freely to the size of the user's screen. Wiki formatting in particular would, with the use of an extension, provide formatting for formulae, in the form of a subset of LaTeX. For example, the definition of the Laplace transform can be written as:
<math>F(s) =\int_0^\infty f(t)e^{-st} \, dt</math>
********************************************************
I don't have any idea how one would convert to HTML - let alone how one would do 6000 pages of paper. No idea what wiki is. TeX I tried and gave up.
********************************************************************
In either case, the question is obviously, who's going to do the work? I could personally sink a few, or even many hours into a project like that, but obviously nowhere near enough time to finish all 6000 pages. But through collaboration it might be possible. For example, I might contact Jason Scott and the good volunteers at Archive Team. With enough luck, there might be people, among whom the job of scanning the papers could be divided, and people (not necessarily the same people as the first group) who could proofread and convert the scanned material.
*************************************************************************
I'm not going to do this work myself, obviously. Turning it over to a group of volunteers sounds like an ideal way to stop everything.
*****************************************************************
In the end, it's probably a pipe dream, though.
***************************************************************
Congratulations. Trust you are enjoying your tenure in the real world!
*******************************************************************
On 6 July 2017 at 20:59, Bernard Arthur Hutchins Jr <bah13 at cornell.edu> wrote:
Thanks Rob -
Really makes my point, and I guess I should not rely on volunteers! I don't blame you one bit - just does not work.
I expect no one else wants to try either. If anyone does, don't look at the crib below until after you try. Errors located and circled in red.
http://electronotes.netfirms.com/AN23Rob.PDF
Please all, let's agree that the OCR issue is bogus as applied here.
Bernie
_____
From: Rob Kam <robkam at ymail.com>
Sent: Thursday, July 6, 2017 1:51 PM
To: Bernard Arthur Hutchins Jr
Cc: synth-diy at synth-diy.org
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
Thanks for the challenge Bernie but no thanks. I don't have the patience to correct the OCR.
Rob
_____
From: Bernard Arthur Hutchins Jr <bah13 at cornell.edu>
To: Rob Kam <robkam at ymail.com>
Cc: "synth-diy at synth-diy.org" <synth-diy at synth-diy.org>
Sent: Thursday, 6 July 2017, 18:30
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
Thanks Rob -
True - the equations are now usable, but slightly more blurred than my original PDF. Likewise, the figures are now OK but of slightly lower quality, which does NOT matter much for hand drawings.
I did note a lot of OCR misreads in the text. A careful proofing of the text took me 18 minutes and there are 25 errors, some not at all obscure, and about 13 of which I had to look at the original to see what they were supposed to be. (One was hard to detect since it substituted an Rf for an Ri, a disaster). A full proofread/correction would take at least 30 minutes (188 eight-hour days for 6000 pages). And I wrote this! Almost certainly a volunteer would have more trouble and miss errors.
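The day counts quoted in this thread can be checked with a few lines of Python. This is only a rough sketch: the function name is made up for illustration, and it assumes AN-23 is two pages long, which the thread does not state outright but which is consistent with both "10 minutes to OCR" becoming "5 minutes/page" and "30 minutes" becoming "188 eight-hour days for 6000 pages".

```python
def eight_hour_days(pages: int, minutes_per_page: float) -> float:
    """Eight-hour working days needed to process `pages` at a fixed per-page cost."""
    return pages * minutes_per_page / 60 / 8


TOTAL_PAGES = 6000   # page count used throughout the thread
AN23_PAGES = 2       # ASSUMPTION: inferred from the thread's figures, not stated

# Full proofread/correction: at least 30 minutes for the whole note.
proofread_days = eight_hour_days(TOTAL_PAGES, 30 / AN23_PAGES)

# OCR pass with manual area identification: 5 minutes per page.
ocr_days = eight_hour_days(TOTAL_PAGES, 5)

print(f"proofreading: {proofread_days} days")  # 187.5, quoted as 188
print(f"OCR pass:     {ocr_days} days")        # 62.5, i.e. "months of 8-hour days"
```

Under that assumption the two estimates add up to roughly 250 working days for one person, before any work on figures or equations.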
In the spirit of no good deed going unpunished, Rob, let me put you on the spot. Take your scan, find and fix the 25 errors. Let us know how easy/hard this was and the time it took, and show your results.
I will post the "solution" to the "find the errors" this evening if I get the chance.
Since there is no improvement in the figures/equations, and the text is a serious downgrade, tell me again (anyone) why an OCR/ebook is a good idea here.
Bernie
_____
From: Rob Kam <robkam at ymail.com>
Sent: Thursday, July 6, 2017 7:24 AM
To: Bernard Arthur Hutchins Jr
Cc: synth-diy at synth-diy.org
Subject: RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
There’s a second attempt at http://www.sdiy.info/AN23b.rtf converting the equations to images instead (and still manually tweaking the OCR). It took six minutes to do from the scan/PDF and the text still needs comparing and correcting against the original.
There are already experts at this sort of project at Archive.org, who have been doing this for a number of years: https://archive.org/details/texts&tab=about
To put my two cents in, the synth DIY community should see whether they are able to raise the funds to compensate (against unsold hardcopy, ebooks etc.) for releasing Electronotes under a non-commercial Creative Commons licence https://creativecommons.org/licenses/by-nc/2.0/uk/
Rob
From: Bernard Arthur Hutchins Jr [mailto:bah13 at cornell.edu]
Sent: 06 July 2017 01:42
To: Rob Kam <robkam at ymail.com>; mskala at ansuz.sooke.bc.ca
Cc: synth-diy at synth-diy.org
Subject: Re: [sdiy] Can anyone OCR the AN23.PDF File Here?
Thanks Rob -
But manual identification and 5 minutes/page is no good for the small improvement. Still months of 8-hour days to do 6000 pages. My PDF is still much better already. The equations are still unusable. It makes the same text errors, apparently. Why not just say it just can't do this? Wasn't intended to.
Thanks for trying - useful data point!
Bernie
_____
From: Rob Kam <robkam at ymail.com>
Sent: Wednesday, July 5, 2017 6:47 PM
To: Bernard Arthur Hutchins Jr; mskala at ansuz.sooke.bc.ca
Cc: synth-diy at synth-diy.org
Subject: RE: [sdiy] Can anyone OCR the AN23.PDF File Here?
Hi Bernie,
At http://www.sdiy.info/AN23.rtf this took 10 minutes to OCR with ABBYY FineReader 12 (http://www.abbyy.com/en-gb/support/finereader-12/), first manually identifying areas of text vs. images. Obviously it still needs further corrections.
Rob
_______________________________________________
Synth-diy mailing list
Synth-diy at synth-diy.org
http://synth-diy.org/mailman/listinfo/synth-diy
--
/Ove
Blog: <http://blog.gg8.se/>
"Here is Evergreen City. Evergreen is the color of green forever."