[sdiy] Can anyone OCR the AN23.PDF File Here?
Andrew Simper
andy at cytomic.com
Sun Jul 16 07:17:38 CEST 2017
Brian,
I am guessing you have not understood what Joel has suggested. He is
saying to keep the articles as full images. No mistakes, unless the
original has a mistake; it's just an image.
You can then use any old OCR for a "rough" searchable text. It
doesn't have to be 100% accurate; even 50% would be totally fine,
since all you need to do is get a hit on the right document with one
or more words, and I doubt people will be searching for R_i or other
mathematical equations!!
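
To make that concrete, here is a minimal sketch of the idea in Python. It
assumes the OCR step has already produced one rough text string per scanned
page. The page file names, the garbled sample text, and the helper names
build_index and search are all made up for illustration. The index only maps
words to page images, so a reader always ends up looking at the original scan.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, ocr_text in pages.items():
        for word in re.findall(r"[a-z0-9]+", ocr_text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Rank pages by how many query words they contain; one matching word is enough."""
    hits = defaultdict(int)
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        for page_id in index.get(word, ()):
            hits[page_id] += 1
    # Most matching words first, ties broken by page id for a stable order.
    return sorted(hits, key=lambda p: (-hits[p], p))

# Hypothetical, deliberately garbled OCR output for two scanned pages.
pages = {
    "AN23_p1.png": "Appl1cation N0te 23: voltage contro11ed filter wlth expo converter",
    "AN23_p2.png": "Tbe integrator capac1tor sets the cutoff frequency",
}

index = build_index(pages)
print(search(index, "voltage controlled filter"))
```

Even though "controlled" was mangled by the pretend OCR, the query still
lands on the right page via "voltage" and "filter", which is all the index
needs to do.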
Cheers,
Andy
On 7 July 2017 at 03:47, <rsdio at audiobanshee.com> wrote:
> Thanks for the thought, Joel, but a highly-flawed OCR is actually worse than none at all. Bernie has already given a specific example of the disastrous results. I'm not saying that all of the text in the diagrams has to be converted, but that the places in the main body of the text that refer to the schematics must be accurate, or else the circuits won't make any sense at all. I'm also not saying that the OCR has to be 100% perfect - nothing we humans create is ever totally perfect - but it absolutely has to be a lot better than "come-what-may" in quality.
>
> By the way, I misused the term in my previous reply. It's not called "cloud sourcing" but is supposed to be "crowd sourcing." In other words, getting the OCR right takes a lot of work, but that work could be spread out over several people instead of requiring one person to do thousands of pages. Think about the way a wiki works or any other distributed system. Granted, it would be difficult to do this given the smaller number of electronics savvy volunteers and the commercial nature of the project, but I present it as an idea of how to think outside of the box for solutions to a difficult problem.
>
> Brian
>
>
> On Jul 5, 2017, at 8:41 PM, Joel B wrote:
>> Why not just scan, and do a come-what-may OCR just for full text indexing - if it picks up a diagram and thinks that is a word, who cares, maybe someone will find something cool via typo they didn't expect. No human intervention, just index words to the page. Highly Imperfect but still super useful.
>>
>
> _______________________________________________
> Synth-diy mailing list
> Synth-diy at synth-diy.org
> http://synth-diy.org/mailman/listinfo/synth-diy