20110505

ocr - tesseract vs ocroscript (ocropus) - how to convert image to text

both tesseract and ocropus are backed by google, but for some reason ocropus misses the first line of the paragraph. see for yourself:

tesseract output:
The book is the synthesis of, on one hand, the no-nonsense
mathematical trader (sdf-styled “practmcnct of uncettaintfl who
spent his life trying to resist being fooled by randomness and mult the
emotions associated with uncertainty and, on the other, the aesthetically
obsessed, litcr:itun:·l0ving human being willing to be fooled by any form
of noruense that is polished, refined, original, and tasteful. I um not
capable of avoiding being the fool of randomness; what l can do is
confine it to where it brings some aesthetic gratification.

ocropus output:
mathematical trader (sdf-styled “practmcnct of uncettaintfl who spent his life trying to resist being fooled by randomness and mult the emotions associated with uncertainty and, on the other, the aesthetically obsessed, litcr:itun:·l0ving human being willing to be fooled by any form of noruense that is polished, refined, original, and tasteful. I um not capable of avoiding being the fool of randomness; what l can do is confine it to where it brings some aesthetic gratification.

compare the original (orig) image with the monochrome image. i had to convert it to monochrome to get a decent output.

this is how i did it in gimp:
1. first i enlarged the image so i could see the individual letters more clearly
2. colors > levels > edit settings as curves
i dragged the diagonal line down to the right just a little bit to make the letters look a little fuller (this reduces the cloudy anti-aliasing effect)
3. image > mode > indexed > use black and white (1 bit) palette
this turns the image monochrome
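
if you'd rather script steps 2 and 3 than click through gimp, here is a rough python sketch of the idea. it is not what gimp does internally. it assumes the image is already loaded as rows of 8-bit grayscale values (0 = black, 255 = white). the gamma curve stands in for the hand-dragged curve, and the gamma of 1.3, the cutoff of 128 and all the function names are made up for illustration:

```python
def darken_midtones(pixels, gamma=1.3):
    """Apply a gamma curve to rows of 8-bit grayscale values.

    gamma > 1 pulls the curve below the diagonal, so gray anti-aliased
    edge pixels get darker and the letters look fuller.
    The value 1.3 is an assumed starting point, not taken from the post.
    """
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[v] for v in row] for row in pixels]


def to_monochrome(pixels, threshold=128):
    """Map every pixel to 0 (black) or 255 (white) with a fixed cutoff.

    Plain thresholding without dithering is an assumption here.
    """
    return [[255 if v >= threshold else 0 for v in row] for row in pixels]


def prepare_for_ocr(pixels, gamma=1.3, threshold=128):
    """Darken the midtones first, then reduce the image to 1 bit."""
    return to_monochrome(darken_midtones(pixels, gamma), threshold)
```

with these values a mid-gray pixel of 130 ends up black, while plain thresholding without the curve would have turned it white. that is the "fuller letters" effect from step 2.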

now you have a png you can use to get a decent result using ocroscript:
ocroscript recognize surface02.png > surface02.html
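
ocroscript writes html, so to compare its result with plain text you need to strip the markup first. a minimal python sketch using only the standard library follows. the class and function names are made up, and it simply keeps every piece of text in the file, so a title element, if there is one, would end up in the result too:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an html string on a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join("".join(parser.chunks).split())
```

calling html_to_text on the contents of surface02.html should give one long line of text, similar to the ocropus output shown above.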

tesseract needs a tif file, so you have 1 more step to get the better result with tesseract (including the first line, which ocroscript doesn't manage to get):

4. layer > transparency > remove alpha channel
now save as *.tif and run this (tesseract adds .txt to the output name itself, so this writes surface02.txt):
tesseract surface02.tif surface02
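
if you have several images to process, the tesseract call can be wrapped in python. this is only a sketch: the function names are made up, it assumes tesseract is installed and on your PATH, and it relies on tesseract treating its second argument as an output base name that it appends ".txt" to:

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path):
    """Build the argument list for `tesseract <image> <output base>`."""
    image = Path(image_path)
    # Drop the extension; tesseract appends ".txt" to the base name itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base)]


def run_tesseract(image_path):
    """Run tesseract on one image and return the recognized text."""
    cmd = tesseract_command(image_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(cmd[2] + ".txt").read_text(encoding="utf-8")
```

run_tesseract("surface02.tif") would then run the same command as above and read the text back from surface02.txt.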

