[sdiy] Can anyone OCR the AN23.PDF File Here?
rsdio at audiobanshee.com
Thu Jul 6 22:15:18 CEST 2017
On Jul 5, 2017, at 5:21 PM, Bernard Arthur Hutchins Jr <bah13 at cornell.edu> wrote:
> Thanks Brian. Well there is no original scan. The original is typewritten text and pen drawings on white. It looks a GREAT deal like what you would get if you print out my 300 dpi PDF. The text is picked up quite well even from this scan of a copy. The figures and the equations are a complete failure.
I'm thinking like an engineer - sorry. At some point in the process of creating your 300 dpi PDF, there must have existed a digital bitmap of the original page. The fact that it was not saved as a separate file, distinct from the PDF, is a consequence of the new, simplified world we live in now where everything is made accessible to non-engineers, and quality suffers because the details are hidden from us or lost.
If you really want to make a fair challenge that a professional could live up to, then make a quality scan of those two original pages and submit it to the audience. Not all OCR software takes a PDF as an input, and as you complained in your response to Rob's efforts, the quality of the drawings is degraded by printing the PDF and then scanning it a second time. That's called generational loss.
> Who would proofread the OCRs? Except for the 6000 pages (!!!) I might be able to do the job and the average reader here might be of some use. Ultimately, it gets down to a word-by-word and sometimes character-by-character comparison. What? 3 months of 8 hour days? Something like that. Sorry.
I realize that this part of your response came before my original solution. Who would proofread the OCRs? A crowd-sourced group of people could both submit corrections and vet them. Admittedly, that would probably require coordinating software like a wiki that we don't have now, and the group would probably be much smaller than most crowd-sourced efforts.
> There have been plenty of volunteers for the automated scan and OCR steps. I'm wondering if it would be possible to use a sort of "cloud sourced" document review (to borrow some terminology) so that a handful of volunteers could submit corrections. Of course, even with plenty of volunteers, this would still be a significant effort, if only in coordinating all the submissions to make sure they're benefitting the total effort.
> ***********************************************************************************
> All this for a clearer font? Do you really think so?
> ***********************************************************************************
The goal is not a clearer font, but more accurate text. The easy part is scanning the pages into digital format because it can be automated. The hard part is training the OCR and correcting any remaining mistakes. The latter part could be distributed so that it's not all on one person.
> The least amount of effort is NOT doing it. Leave well-enough alone. ENWN-49.
Fair enough. I think that what we're dealing with here is a group - myself included - who are used to doing what it takes to maintain vintage analog electronics for the sake of posterity, and who don't want to lose valuable historical information. Doing nothing is certainly an option, but it's still disappointing.
> Which is why someone needs to submit a business plan and put up money (or admit that it is NOT viable). The surest way to assure that a project will not get done is to NOT try to make money and rely on volunteers. Not my time, and certainly not my money.
Also fair, but there are exceptions. Wikipedia is an example of what a huge number of volunteers can accomplish. Despite the vandalism, which wouldn't be possible if volunteers weren't part of the system, it still works because other volunteers clean up after the vandals. What I'm suggesting would not be as open as Wikipedia, but could still be successful.
> Did you note that my PDF is already searchable?
Yes!
There were only two or three mistakes in the OCR. One missed a space between two words, which created a fictitious word, "jumpingoff" (not likely to be a problem). Another OCR error turned "transconductance and can be seen" into "transconductand can be ance seen" - looks like the OCR was confused by the hyphenated word and also somehow lost track of which line it was on in the middle of a sentence. If you double-click on "transconduct-" in the PDF, Acrobat will highlight "and" on the second line instead of "ance" - this is the kind of cleanup that is important, but would take a lot of time.
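For what it's worth, the two simpler kinds of error could be flagged or fixed by a small script before a human ever looks at the text. The following is only a rough sketch in Python, assuming the OCR text has been exported as plain text with the line order intact; the function names and the tiny word list are made up for illustration. It would not repair the scrambled word order in the "transconductance" case, and it would wrongly join genuine compounds that happen to break at a hyphen, so a proofreader would still have to review its output.

```python
import re

# A word fragment ending in a hyphen at the end of a line, followed by the
# continuation fragment (and an optional single space) on the next line.
_SPLIT_WORD = re.compile(r"(\w+)-\n(\w+) ?")


def rejoin_hyphenated(text):
    """Move the tail of a line-end hyphenated word up to the first line.

    Assumption: the hyphen is a typesetting break, not part of the word.
    """
    return _SPLIT_WORD.sub(r"\1\2\n", text)


def split_fused(token, vocab):
    """Suggest a split for a token that looks like two known words run together.

    `vocab` is a hypothetical set of known words; returns a (left, right)
    pair as a suggestion for review, or None if no split is found.
    """
    if token in vocab:
        return None
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return left, right
    return None


if __name__ == "__main__":
    print(rejoin_hyphenated("the transconduct-\nance and can be seen"))
    print(split_fused("jumpingoff", {"jumping", "off"}))
```

The suggestions from split_fused are meant to be queued for a person to accept or reject, not applied automatically.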
Brian