[sdiy] re-publishing typewritten material

Brian Willoughby brianw at audiobanshee.com
Thu Nov 12 02:34:43 CET 2020


Hi Barry,

Before I get into any digital technology, I want to say that it's really uncool that someone uploaded your copyrighted book to the Internet Archive.


As for the tech, I worked on some OCR software and have spent a lot of time scanning and compressing images of text and schematics. Results vary drastically depending upon the techniques used.


The most important aspects of scanning are the resolution, bit depth, lighting, and especially the file format and compression type.

Lighting: You don't want a white background on the scan bed because that makes the text on the back side of each page higher contrast, allowing it to bleed through. I found it better to place a black object (like a binder) behind each page. This reduced the contrast ratio, since the brightest white was no longer very bright, but the text from the reverse of the page was rendered extremely low contrast. After scanning, the contrast ratio can be boosted so that the whites are full brightness, and the back side of the page no longer bleeds through. In other words, you want all of the light reflected by the front of the paper, not the backing.
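The contrast boost after scanning amounts to a linear levels stretch. Here is a minimal sketch in plain Python, assuming the scan is available as a list of rows of 8-bit gray values; the function name, the sample values, and the chosen black and white points are all made up for illustration, and any image editor's "levels" tool does the same job.

```python
def stretch_levels(pixels, black_point, white_point):
    """Linearly remap gray values so black_point -> 0 and white_point -> 255.

    pixels is a list of rows of 8-bit gray values (0 = black, 255 = white).
    Values outside the [black_point, white_point] range are clipped.
    """
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    span = white_point - black_point
    out = []
    for row in pixels:
        new_row = []
        for value in row:
            scaled = (value - black_point) * 255 / span
            new_row.append(max(0, min(255, round(scaled))))
        out.append(new_row)
    return out


# Hypothetical scan over a black backing: paper reads as dull gray (180),
# ink is dark (30), and show-through from the reverse side is faint (170).
scan = [[180, 170, 30, 180]]
boosted = stretch_levels(scan, black_point=30, white_point=170)
```

With the white point set at or just below the show-through level, both the paper and the faint reverse-side text land on full white, while the front-side ink stays black.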

Bit Depth: Pixels can be anywhere from 1-bit to 8-bit or even 10-bit to 16-bit. Text and schematics only need 1-bit. The reason you can get a whole book in 6.2 MB is that it uses 1 bit per pixel instead of 8. I disagree with Matthew about the "rendered" text - I'm quite certain that the 6.2 MB file contains an actual scan of the original pages, not text rendered from the OCR, just with proper bit depth (and compression). You'd be surprised how clear a tiny file can be when done correctly. I tend to scan at 8-bit (or the maximum my scanner can handle - it's now more than it was in the nineties) and then pick a threshold to drop that to 1-bit without losing character legibility.
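To put rough numbers on the bit-depth argument, here is a back-of-the-envelope calculation. The page size is an assumption (US Letter, 8.5 x 11 inches; the book's actual trim size isn't stated in this thread), and the helper name is made up. The 334 pages, 6.2 MB, and 40 MB figures are the ones quoted in the thread.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches (not stated in the thread).
for dpi in (300, 600):
    gray = raw_page_bytes(8.5, 11, dpi, 8)
    bilevel = raw_page_bytes(8.5, 11, dpi, 1)
    print(f"{dpi} dpi: 8-bit {gray / 1e6:.1f} MB, 1-bit {bilevel / 1e6:.2f} MB per page")

# Figures from the thread: 334 pages in 6.2 MB versus about 40 MB.
pages = 334
print(f"6.2 MB file: {6.2e6 / pages / 1000:.1f} kB per page")
print(f"40 MB file: {40e6 / pages / 1000:.1f} kB per page")
```

Dropping from 8-bit to 1-bit only buys a factor of eight on its own. Getting down to under 20 kB per page takes compression on top of that, which is where the file format comes in.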

Resolution: In general, more is better. Terry gave good examples. This is especially important when using 1-bit pixels. Low resolution requires gray scale to smooth out the edges, but 1-bit pixels just mean you need more, smaller pixels to get smooth edges.

File format and compression type: Folks are really headed in the wrong direction with 8-bit gray scale and JPEG compression. That's completely inappropriate for text and schematics. It looks horrible. As Terry pointed out, CCITT is a good choice here. That's an almost magic compression algorithm that has nothing to do with JPEG. It's not even lossy. It only works with 1-bit per pixel images, though. CCITT Group 4 compression, used for facsimile machines, can really make an image small without any artifacts - nothing at all like JPEG. Even when you consider the increased resolution required to make 1-bit look smooth, CCITT Group 4 can still make a high-resolution image look better - losslessly - than a grayscale JPEG of any quality level.
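The reason 1-bit pages compress so well without loss is that each scan line is mostly long runs of white with short runs of black. The sketch below is NOT the Group 4 algorithm (which codes each line relative to the line above it and uses Huffman code tables for the run lengths); it is only a simplified illustration of the run-length idea that the fax codings build on. The function names and the sample line are invented for the example.

```python
def run_lengths(row):
    """Return run lengths for a row of 1-bit pixels (0 = white, 1 = black).

    Runs alternate white, black, white, ... starting with white, as in fax
    coding, so a row that starts with black gets a leading white run of 0.
    """
    runs = []
    current = 0  # rows are assumed to start with white
    count = 0
    for pixel in row:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    runs.append(count)
    return runs


def expand_runs(runs):
    """Inverse of run_lengths: rebuild the exact pixel row (lossless)."""
    row = []
    color = 0
    for length in runs:
        row.extend([color] * length)
        color = 1 - color
    return row


# A made-up 40-pixel scan line: white margin, two pen strokes, white margin.
line = [0] * 12 + [1] * 3 + [0] * 5 + [1] * 2 + [0] * 18
runs = run_lengths(line)          # [12, 3, 5, 2, 18]
assert expand_runs(runs) == line  # nothing is lost
```

Forty pixels collapse to five numbers, and the round trip reproduces every pixel exactly. A grayscale scan has no such long identical runs, which is why this family of codings only applies to 1-bit images.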

I've had great luck scanning at 600 dpi or 1200 dpi at 8-bit, then hand-picking a threshold that preserves the content. The resulting 1-bit file is then CCITT Group 4 compressed.
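The threshold step itself is simple; the skill is in choosing the value. A minimal sketch, again in plain Python with invented names and sample values, that turns 8-bit gray rows into 1-bit rows and packs them eight pixels to a byte:

```python
def threshold_to_1bit(pixels, threshold):
    """Convert rows of 8-bit gray values to rows of 0/1 (1 = black ink).

    Anything darker than the threshold becomes ink; the rest becomes paper.
    """
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def pack_row(bits):
    """Pack a row of 0/1 pixels into bytes, most significant bit first."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


# Made-up gray row: paper around 230, soft character edges at 140-150, ink near 40.
gray_row = [230, 228, 140, 40, 35, 150, 231, 229, 230, 45]
for threshold in (100, 128, 160):
    bits = threshold_to_1bit([gray_row], threshold)[0]
    print(threshold, bits, pack_row(bits).hex())
```

Running the loop shows why the threshold is worth hand-picking: a low value drops the soft edge pixels and thins the strokes, while a high value keeps them and thickens the strokes, and pushed far enough it starts picking up paper texture and show-through.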

Unfortunately, I haven't done much of this in recent years. Now, copy machines will pick horrible settings automatically and email the terrible results to you. It's very difficult to get results now that are as good as those from a $1,200 glass flatbed scanner decades ago, when things were manual. However, if you know the magic formats and resolution to override the "for dummies" settings, then you can get good results.


OCR software can separate the images from the text. Part of OCR is finding how many columns of text are on a page, and often that includes finding diagrams. In the nineties, I worked on NeXTstep software that used a third-party software library to scan text. It had two modes. The more extensive mode would automatically find regions of text for you, but it often made mistakes. The other mode would let you manually specify rectangles within each page, specifying where the text is. The problem I saw is that this third-party developer didn't provide a third mode where you could have the software make its best guess about the text areas so that a human could tweak the outline. Their second mode required you to start from scratch, which was tedious considering that the automated mode was better than 90% correct. Anyway, I'm sure that OCR is far better given the three decades that have transpired since.
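For a feel of how automatic column finding can work, here is one common, simple approach: a vertical projection profile over the 1-bit page. This is not necessarily what that third-party library did (I don't know its internals), and the function name and the min_gutter parameter are made up for the sketch. It also assumes a deskewed page and will be confused by diagrams that span columns.

```python
def find_text_columns(bitmap, min_gutter=3):
    """Return (start, end) x-ranges of text columns in a 1-bit page.

    bitmap is a list of rows of 0/1 pixels (1 = black). A gutter is a run of
    at least min_gutter consecutive pixel columns containing no black at all.
    The end of each range is exclusive.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # Vertical projection: count of black pixels in each pixel column.
    profile = [sum(row[x] for row in bitmap) for x in range(width)]

    columns = []
    start = None      # x where the current text column began
    blank_run = 0     # consecutive empty pixel columns seen so far
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gutter:
                columns.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        columns.append((start, width - blank_run))
    return columns


# Tiny made-up page: two blocks of ink separated by a four-pixel gutter.
page = [
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
]
columns = find_text_columns(page)
```

The result is just a list of x-ranges, so a best-guess list like this could be handed to a person to nudge or delete individual entries rather than drawing every rectangle from scratch.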

Brian


On Nov 9, 2020, at 07:21, Barry Klein <barryklein at cox.net> wrote:
> Given this, I give up. What’s the point? They did not have my permission to publish it and I told them so. They did a better job than I have been able to do. Missing corrections but not the point. I wonder how they did it.
> I don’t enjoy writing books, I enjoy electronics.
> 
> Barry
> 
> On Nov 9, 2020, at 6:13 AM, Neil Johnson <neil.johnson71 at gmail.com> wrote:
>> Hi,
>> 
>> mskala at ansuz.sooke.bc.ca wrote:
>>> On Sun, 8 Nov 2020, Barry Klein wrote:
>>>> about to redraw the schematics.  As it stands, I have the book in PDF form
>>>> with a file size of about 40MB (334 pages).  Not something you can email.  I
>>>> am sure there are those out there that would take on the job of doing all this
>>>> for hundreds of dollars, but I don't believe the cost would be justified.
>>> 
>>> I'm surprised that it's only 40M for 334 pages of un-OCRed scanned
>>> material.
>> 
>> Or less (6.2MB):
>> https://archive.org/details/electronic-music-circuits
>> 
>> Neil
> 




