
We selected several documents (two easy-to-read reports, a receipt, a historical document, a heavily redacted legal filing, a filled-in disclosure form, and a water-damaged page) to run through the OCR engines we are most interested in. There are a lot of OCR options available. Some are quite expensive; some are free and open source. Some are easy to use, some require a bit of programming to make them work, and some require a lot of programming. We couldn’t find a single side-by-side comparison of the most accessible OCR options, so we ran a handful of documents through seven different tools and compared the results. We have been testing the components that already exist so we can prioritize our own efforts.

One of our projects at Factful is to build tools that make state-of-the-art machine learning and artificial intelligence accessible to investigative reporters. OCR, or optical character recognition, allows us to transform a scan or photograph of a letter or court filing into searchable, sortable text that we can analyze.

Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any of the cutting-edge, neural-network-based OCR engines worth the time it takes to set them up?
