"I know this goes against the current trend / state-of-the-art of using vision models to basically “see” the PDF like a human and “read” the text, but it would be really nice to be able to actually understand what a PDF file contains."
Some combination of this is what we're building at Tensorlake (full disclosure: I work there). You can "see" the PDF like a human and "understand" the contents, not JUST "read" the text, because the contents of PDFs are usually spread across tables, images, text, formulas, and handwriting.

Being able to actually "understand what a PDF file contains" is (I think) the key part. So we parse the PDF and run multiple models to extract markdown chunks and structured JSON, so that you can ingest the actual data into other applications (AI agents, LLMs, or frankly whatever you want).
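To make the "ingest into other applications" part concrete, here's a rough, vendor-agnostic sketch in Python: the Chunk shape, field names, and the prompt-building helper are made up for illustration (not our actual API), but this is roughly what it looks like to take markdown chunks back from a parse and drop them into an LLM prompt.

    # Hypothetical ingest-side sketch: markdown chunks + metadata from a
    # document-parsing service, packed into an LLM prompt. Field names are
    # illustrative only, not any particular vendor's API.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        page: int       # page the chunk came from
        kind: str       # "text", "table", "formula", ...
        markdown: str   # extracted content as markdown

    def build_prompt(question: str, chunks: list[Chunk], max_chars: int = 4000) -> str:
        """Concatenate chunks (with page/type provenance) up to a rough size budget."""
        context, used = [], 0
        for c in chunks:
            piece = f"[page {c.page}, {c.kind}]\n{c.markdown}\n"
            if used + len(piece) > max_chars:
                break
            context.append(piece)
            used += len(piece)
        return (
            "Answer using only the document excerpts below.\n\n"
            + "".join(context)
            + f"\nQuestion: {question}"
        )

    # Example: chunks as they might come back from a parse, already in markdown.
    chunks = [
        Chunk(page=1, kind="text", markdown="# Q3 Report\nRevenue grew 12% year over year."),
        Chunk(page=2, kind="table", markdown="| Region | Revenue |\n|---|---|\n| EMEA | $4.1M |"),
    ]

    prompt = build_prompt("How did EMEA perform?", chunks)
    print(prompt)  # hand this to whatever LLM / agent framework you're already using

The point is just that once the PDF is turned into typed chunks with provenance, the downstream application doesn't have to care whether the source was a table, a scan, or handwriting.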
https://tensorlake.ai