PDF to text conversion

cbreemer · March 13, 2024

Hey folks,
Just wondering if anyone here has experience with extracting text from PDF files in PHP and is willing to share it?

I tried a couple of different options but they all have their own quirks, and no two seem to produce the same output.

Cheers,

Chris

ginerjm · March 13, 2024

What tools have you tried? I did a quick Google search and see several packages available, but I have no interest in exploring them for you. FYI - when someone sends me something that I wish to process using PHP, I ask them to provide the same data in a proper data format rather than a PDF.

cbreemer · March 13, 2024

Thanks for your reply. I use PHP as the back end for a web application I wrote to organize all my documents (which are almost exclusively PDFs). To add search functionality, I was looking into ways to convert PDF to text. So far I have tried these PHP solutions:

pdfcrowd API ( https://pdfcrowd.com/api/pdf-to-text-php/ )

Looks reliable but is very slow (cloud-based) and either severely request-limited or quite expensive.

pdftotext.phpclass ( https://github.com/christian-vigh-phpclasses/PdfToText )

Looks ok, but I saw one text output where each character was on a line of its own.

Smalot Pdf Parser ( https://github.com/smalot/pdfparser )

Looks ok-ish but buggy. Lots of exceptions and errors, and 178 open issues, some of them recognized bugs that have been open for years.

pdftotext executable ( https://www.xpdfreader.com/download.html ) via shell_exec()

Clunky and slow of course, and it does not always produce output. I could improve things by not using shell_exec(), but it might still be too slow for searching a large number of PDFs.

I noticed inconsistent output between the four. Apparently there are different ways to parse a pdf into text. I'm starting to get the feeling that whatever I pick will come with its own set of errors and quirks.
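For reference, here is a minimal sketch of the pdftotext route without shell_exec(). It assumes PHP 7.4 or later (for the array form of proc_open()) and a pdftotext binary on the PATH. The function names, the cache directory, and the error-log file name are hypothetical and chosen only for illustration. The idea is to convert each PDF once and cache the text, so that a search never has to run the converter at query time.

```php
<?php
declare(strict_types=1);

/**
 * Build the argument list for pdftotext.
 * A trailing "-" tells pdftotext to write the text to stdout.
 */
function buildPdfToTextCommand(string $pdfPath, string $binary = 'pdftotext'): array
{
    return [$binary, '-layout', '-enc', 'UTF-8', $pdfPath, '-'];
}

/**
 * Return the text of a PDF, or null when extraction fails or yields nothing.
 * Results are cached by content hash, so each document is converted only once.
 */
function extractPdfText(string $pdfPath, string $cacheDir): ?string
{
    if (!is_readable($pdfPath)) {
        return null;
    }
    $hash = sha1_file($pdfPath);
    if ($hash === false) {
        return null;
    }

    $cacheFile = $cacheDir . '/' . $hash . '.txt';
    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        return $cached === false ? null : $cached;
    }

    // stdout is piped back to PHP; stderr goes to a log file so that
    // pdftotext can never block on a full, unread pipe.
    $descriptors = [
        1 => ['pipe', 'w'],
        2 => ['file', $cacheDir . '/pdftotext-errors.log', 'a'],
    ];
    // Passing the command as an array bypasses the shell, so no escaping is needed.
    $process = proc_open(buildPdfToTextCommand($pdfPath), $descriptors, $pipes);
    if (!is_resource($process)) {
        return null;
    }

    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $exitCode = proc_close($process);

    // pdftotext separates pages with form feeds, so strip those as well
    // before deciding that the output is empty.
    if ($exitCode !== 0 || $text === false || trim($text, " \n\r\t\v\0\f") === '') {
        return null;
    }

    file_put_contents($cacheFile, $text);
    return $text;
}
```

A null result is worth logging rather than ignoring: a PDF that contains only scanned images has no text layer, which is one possible reason for a run that produces no output. The search itself can then work on the cached .txt files instead of the PDFs.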

Just curious - you don't consider PDF a proper data format?

ginerjm · March 13, 2024

PDF is a presentation tool that gives people a view of things in a common format. Why would you think of it as a database format?

cbreemer · March 14, 2024

I don't, of course. I was just wondering about your suggestion that PDF is "not a proper data format". A large part of the data I download (manuals, invoices, digital letters, even sheet music) is in PDF format.

ginerjm · March 14, 2024

But none of those 'things' are true 'data providers'. They are documents. Data must be stored in a retrievable way. That means it can be easily read and analyzed and presented as the user wishes. A manual is meant to be visually read. Same for an invoice, a letter in digital format (or in printed format) or sheet music. A pdf is also meant for that purpose. If someone is trying to provide 'data' they should not be creating a pdf out of it. It's really pretty simple.

If you have no option other than to try and decipher these pdfs that you already have then you probably need to keep trying to find the proper tool that you can handle.

cbreemer · March 14, 2024

Yes, I guess I'll need to keep looking. Thanks anyway.

ginerjm · March 14, 2024

And try to get new sets of 'data' in a better format. A PDF is something that someone created to deliver data/info to you. Why can't they deliver it to you in a truly readable format? I mean, have you ever viewed the raw contents of a PDF document? It is unreadable to the human eye. It must be read by something that knows how to interpret the format that Adobe created years ago, so that anyone (not a data processor) could read the info with Adobe's product line or, now, with any of the many other tools written to interpret it.
