Hey all, we’re working on an app that needs to pull structured data out of documents like invoices, receipts, and forms. The tricky part is that we’re dealing with a bunch of different formats, and we really don’t want to write custom extraction logic for every document type we run into. Has anyone gone down this road? Which APIs have actually worked well for you in practice?
This space has changed a lot in the last couple of years, so it really depends on the volume you’re dealing with and how much engineering you want to take on.
On the open-source side, Tesseract and EasyOCR give you a lot of control — but getting them to production quality is a real project in itself. Not something I’d recommend if you’re trying to move fast. AWS Textract is genuinely solid for general document processing, handles tables and forms pretty well, though I’ve hit some walls with more complex nested structures. Google Document AI is enterprise-grade and handles a lot of edge cases, but the cost can sneak up on you at scale.
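For anyone curious what working with Textract’s forms output involves, here’s a rough sketch of turning its block list into plain key/value pairs. The helper names (`form_fields`, `analyze_file`) are my own, not part of any SDK. It assumes the documented `Blocks` layout, where KEY blocks point at VALUE blocks and both point at child WORD blocks. It’s not production code, and it ignores tables entirely.

```python
def _text_of(block, by_id):
    """Join the text of a block's CHILD words (or checkbox state)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def form_fields(blocks):
    """Map each detected form key to the text of its linked value."""
    by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_ids = [
            value_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        key_text = _text_of(block, by_id)
        fields[key_text] = " ".join(_text_of(by_id[v], by_id) for v in value_ids)
    return fields


def analyze_file(path):
    """Send one local file to Textract (assumes AWS credentials are set up)."""
    import boto3  # imported here so the parser above works without AWS

    with open(path, "rb") as f:
        response = boto3.client("textract").analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["FORMS", "TABLES"],
        )
    return form_fields(response["Blocks"])
```

Note that this flattens everything into one dict, so repeated keys overwrite each other. That is one of the places nested or repeating structures start to hurt.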
For invoice and receipt-heavy workflows specifically, Veryfi and Smaug are worth a look — accuracy on those document types is really good. Docsumo is another one I’ve seen come up a lot; managed service, reasonable pricing, decent breadth of document types.
We’ve also tried Lido. It’s template-free and uses AI to figure out the structure on its own, which was a big deal for us because our invoice layouts are all over the place. It has built-in connectors for Excel and Google Sheets, which made integration pretty painless. If you’re in .NET land, IronOCR is worth knowing about too.
Honestly, my biggest piece of advice: don’t just benchmark on demo docs. Get trial API access and run your actual documents through a few of these. Published accuracy numbers mean nothing until you’ve tested your specific layouts. The “right” choice is really just whatever performs best on your data.
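If it helps, here’s a minimal sketch of the kind of scoring harness I mean. Everything in it is a placeholder: the provider names, the field names, and the sample values are made up. It assumes you’ve already saved each API’s output as a dict of fields per document and hand-labeled the expected values for a sample of your own docs. It only does exact-match scoring after light normalization, so treat it as a starting point.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting doesn't count as a miss."""
    return " ".join(str(value).lower().split())


def field_accuracy(predictions, ground_truth):
    """Return (per-field accuracy, overall accuracy) using exact match.

    Both arguments look like {doc_id: {field_name: value}}.
    A field that is missing from a prediction counts as wrong.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for doc_id, expected in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        for field, want in expected.items():
            totals[field] += 1
            got = predicted.get(field)
            if got is not None and normalize(got) == normalize(want):
                hits[field] += 1
    per_field = {field: hits[field] / totals[field] for field in totals}
    overall = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return per_field, overall


def rank_providers(provider_outputs, ground_truth):
    """Sort providers by overall accuracy, best first."""
    scores = {
        name: field_accuracy(preds, ground_truth)[1]
        for name, preds in provider_outputs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-labeled values for a sample of your own documents (made-up data).
ground_truth = {
    "inv-001": {"vendor": "Acme Corp", "total": "1250.00", "date": "2024-03-01"},
    "inv-002": {"vendor": "Globex", "total": "89.90", "date": "2024-03-04"},
}

# What each trial API returned for the same documents (made-up data).
provider_outputs = {
    "provider_a": {
        "inv-001": {"vendor": "ACME Corp", "total": "1250.00", "date": "2024-03-01"},
        "inv-002": {"vendor": "Globex", "total": "89.90"},  # date not extracted
    },
    "provider_b": {
        "inv-001": {"vendor": "Acme Corp", "total": "1250.00", "date": "2024-03-01"},
        "inv-002": {"vendor": "Globex", "total": "89.90", "date": "2024-03-04"},
    },
}

if __name__ == "__main__":
    for name, score in rank_providers(provider_outputs, ground_truth):
        print(f"{name}: {score:.0%}")
```

The per-field breakdown tends to be more useful than the overall number, since one provider can be great on totals and weak on dates, and which of those matters depends on your workflow.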
Can confirm. I’ve been using Lido for about 4 months now, and while it’s not perfect, it’s way better than the manual process we had before. I’d guess we’re saving somewhere around 15-20 hours a week across the team, so it honestly paid for itself faster than I expected. There are still some edge cases that trip it up, but nothing deal-breaking.
Funny timing on this thread — we literally just wrapped up a 3-month pilot comparing a bunch of these solutions. ABBYY ended up winning for us, mainly because of how it integrates with spreadsheets. Our AP team basically lives in Google Sheets so that was kind of non-negotiable from the start. Might not be the right call for everyone but for our workflow it was a no-brainer.