cbreemer (posted March 13):
Hey folks, just wondering if anyone here has experience extracting text from PDF files in PHP and is willing to share it? I tried a couple of different options, but they all have their own quirks, and no two seem to produce the same output. Cheers, Chris
ginerjm (posted March 13):
What tools have you tried? I did a quick Google search and see several packages available, but I have no interest in exploring them for you. FYI: when someone sends me something that I wish to process using PHP, I ask them to provide the same data in a proper data format rather than a PDF.
cbreemer (posted March 13):
Thanks for your reply. I use PHP as the back end for a web application I wrote to organize all my documents (which are almost exclusively PDFs). In order to add search functionality, I was looking into ways to convert PDF to text. So far I have tried these PHP solutions:
- pdfcrowd API ( https://pdfcrowd.com/api/pdf-to-text-php/ ): Looks reliable but is very slow (cloud-based) and either severely request-limited or quite expensive.
- pdftotext.phpclass ( https://github.com/christian-vigh-phpclasses/PdfToText ): Looks OK, but I saw one text output where each character was on a line of its own.
- Smalot Pdf Parser ( https://github.com/smalot/pdfparser ): Looks OK-ish but buggy. Lots of exceptions and errors, and 178 open issues, some of them recognized bugs that have been open for years.
- pdftotext executable ( https://www.xpdfreader.com/download.html ) via shell_exec(): Clunky and slow of course, and it does not always produce output. I could improve it by not using shell_exec(), but it might still be too slow to search a large number of PDFs.
I noticed inconsistent output between the four. Apparently there are different ways to parse a PDF into text. I'm starting to get the feeling that whatever I pick will come with its own set of errors and quirks.
Just curious: you don't consider PDF a proper data format?
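The last option in the post above, calling the pdftotext executable without shell_exec(), could look roughly like the sketch below. It is only an illustration under stated assumptions: pdftotext is installed and on the PATH, PHP is 7.4 or newer (so proc_open() accepts an argument array and starts the process without a shell), and the function names pdftotextArgs() and pdfToText() are made up for this example. It does not address the inconsistent output between tools, and it makes no claim about being fast enough for searching many files.

```php
<?php
declare(strict_types=1);

/**
 * Build the argument list for the pdftotext executable.
 * "-q" suppresses messages, "-enc UTF-8" sets the output encoding, and
 * "-" as the output file name sends the extracted text to stdout.
 * $binary is an assumption: adjust it to wherever pdftotext is installed.
 */
function pdftotextArgs(string $pdfPath, string $binary = 'pdftotext'): array
{
    return [$binary, '-q', '-enc', 'UTF-8', $pdfPath, '-'];
}

/**
 * Run pdftotext without going through a shell and return the text.
 * Returns null if the process could not be started or exited non-zero.
 */
function pdfToText(string $pdfPath): ?string
{
    $pipes = [];
    // Only stdout is captured; stdin and stderr are inherited from PHP.
    $descriptors = [1 => ['pipe', 'w']];

    // Passing an array (PHP 7.4+) bypasses the shell, so no escaping of
    // the file name is needed.
    $process = @proc_open(pdftotextArgs($pdfPath), $descriptors, $pipes);
    if (!is_resource($process)) {
        return null;
    }

    $text = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    $exitCode = proc_close($process);

    return ($exitCode === 0 && $text !== false) ? $text : null;
}
```

Usage would be something like `$text = pdfToText('/path/to/manual.pdf');` followed by a null check. Because a null return covers both "could not start" and "pdftotext reported an error", a real application would probably want to log which files failed.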
ginerjm (posted March 13):
PDF is a presentation tool that gives people a view of things in a common format. Why would you think of it as a db format?
cbreemer (posted March 14):
I don't, of course. I was just wondering about your suggestion that PDF is "not a proper data format". A large part of the data I download (manuals, invoices, digital letters, even sheet music) is in PDF format.
ginerjm (posted March 14):
But none of those 'things' are true 'data providers'. They are documents. Data must be stored in a retrievable way. That means it can be easily read, analyzed, and presented as the user wishes. A manual is meant to be read visually. The same goes for an invoice, a letter in digital (or printed) format, or sheet music. A PDF is also meant for that purpose. If someone is trying to provide 'data', they should not be creating a PDF out of it. It's really pretty simple. If you have no option other than to try to decipher the PDFs that you already have, then you probably need to keep looking for a proper tool that you can handle.
cbreemer (posted March 14):
Yes, I guess I'll need to keep looking. Thanks anyway.
ginerjm (posted March 14):
And try to get new sets of 'data' in a better format. A PDF is something that someone created to deliver data/info to you. Why can't they deliver it to you in a truly readable format? I mean, have you ever viewed the contents of a PDF document? It is unreadable by the human eye. It must be read by something that knows how to interpret the format Adobe created years ago. That format was meant to provide info that anyone (not a data processor) could read with Adobe's product line, and now with any one of the many tools that have since been written to interpret it.