PDF to text conversion


cbreemer

Hey folks,
Just wondering if anyone here has experience with extracting text from PDF files in PHP and is willing to share it?

I tried a couple of different options but they all have their own quirks, and no two seem to produce the same output.

Cheers,

Chris

What tools have you tried? I did a quick Google search and see several packages available, but have no interest in exploring them for you. FYI - when someone sends me something that I wish to process using PHP, I ask them to provide the same data in a proper data format rather than a PDF.

2 hours ago, ginerjm said:

What tools have you tried? I did a quick Google search and see several packages available, but have no interest in exploring them for you. FYI - when someone sends me something that I wish to process using PHP, I ask them to provide the same data in a proper data format rather than a PDF.

Thanks for your reply. I use PHP as the back-end for a web application I wrote to organize all my documents (which are almost exclusively PDFs). In order to add search functionality I was looking into ways to convert PDF to text. So far I have tried these PHP solutions:

Looks reliable but is very slow (cloud based) and either severely request-limited or quite expensive.

Looks ok, but I saw one text output where each character was on a line of its own.

Looks ok-ish but buggy. Lots of exceptions and errors, and 178 open issues, some of them recognized bugs that have been open for years.

Clunky and slow of course, and it does not always produce output. I could improve it by not using shell_exec(), but it might still be too slow to search a large number of PDFs.

I noticed inconsistent output between the four. Apparently there are different ways to parse a PDF into text. I'm starting to get the feeling that whatever I pick will come with its own set of errors and quirks.
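On the speed concern: whichever converter ends up being used, the slow step can be run once per file and the search done against saved text afterwards. Below is a minimal sketch of that idea, not a tested solution. The names cachedPdfText(), searchPdfs(), $cacheDir and $extract are made up for illustration and are not part of any of the packages above; $extract is assumed to be any callable that takes a PDF path and returns its text.

```php
<?php
declare(strict_types=1);

/**
 * Return the text of a PDF, running the (slow) extractor only when the
 * cached copy is missing or older than the PDF itself.
 *
 * $extract is a placeholder (assumption) for whichever converter is chosen:
 * any callable that takes a PDF path and returns its text.
 */
function cachedPdfText(string $pdfPath, string $cacheDir, callable $extract): string
{
    // Hypothetical cache layout: one .txt file per PDF, named by a hash of the path.
    $cacheFile = $cacheDir . '/' . sha1($pdfPath) . '.txt';

    // Reuse the cached text if it is at least as new as the PDF.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($pdfPath)) {
        return (string) file_get_contents($cacheFile);
    }

    // Cache miss: run the converter once and store its output.
    $text = (string) $extract($pdfPath);
    file_put_contents($cacheFile, $text);

    return $text;
}

/**
 * Case-insensitive substring search over the cached text of several PDFs.
 * Returns the paths whose text contains $needle.
 */
function searchPdfs(array $pdfPaths, string $needle, string $cacheDir, callable $extract): array
{
    $hits = [];
    foreach ($pdfPaths as $path) {
        if (stripos(cachedPdfText($path, $cacheDir, $extract), $needle) !== false) {
            $hits[] = $path;
        }
    }

    return $hits;
}
```

Here $extract could wrap a shell_exec() call or a library call. Because it only runs when a PDF is new or has changed, a slow converter should hurt much less at search time, although a plain substring scan may still need replacing with a real index for a very large collection.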

Just curious - you don't consider PDF a proper data format?

16 hours ago, ginerjm said:

pdf is a presentation tool to provide a view of things to people using a common format.  Why would you think of it as a db format?

I don't, of course. I was just wondering about your suggestion that PDF is "not a proper data format". A large part of the data I download (manuals, invoices, digital letters, even sheet music) is in PDF format.

But none of those 'things' are true 'data providers'.  They are documents.  Data must be stored in a retrievable way.  That means it can be easily read and analyzed and presented as the user wishes.  A manual is meant to be visually read.  Same for an invoice, a letter in digital format (or in printed format) or sheet music.  A pdf is also meant for that purpose.  If someone is trying to provide 'data' they should not be creating a pdf out of it.  It's really pretty simple.

If you have no option other than to try and decipher these pdfs that you already have then you probably need to keep trying to find the proper tool that you can handle.

11 minutes ago, ginerjm said:

But none of those 'things' are true 'data providers'.  They are documents.  Data must be stored in a retrievable way.  That means it can be easily read and analyzed and presented as the user wishes.  A manual is meant to be visually read.  Same for an invoice, a letter in digital format (or in printed format) or sheet music.  A pdf is also meant for that purpose.  If someone is trying to provide 'data' they should not be creating a pdf out of it.  It's really pretty simple.

If you have no option other than to try and decipher these pdfs that you already have then you probably need to keep trying to find the proper tool that you can handle.

Yes, I guess I'll need to keep looking. Thanks anyway.

And try to get new sets of 'data' in a better format. A PDF is something that someone created to deliver data/info to you. Why can't they deliver it to you in a truly readable format? I mean, have you ever viewed the raw contents of a PDF document? It is unreadable by the human eye. It must be read by something that knows how to interpret the format that Adobe created years ago, so that anyone (not a data processor) could read the info with Adobe's product line or, now, with any one of the many other tools that have been written to interpret it.
