grissom Posted June 5, 2013
Hi all,
Basically, I'm looking for a way in PHP to extract plain ASCII text from any .pdf file. I've Googled around a bit, but so far everything I've found suffers from one of two drawbacks:
1. It requires me to install something on the server, which I'd rather not do. Ideally I'm looking for some PHP classes.
2. The recommended code, despite its promises, just doesn't work in practice!
So basically, I'm asking: does anybody know of anything for reading text from a PDF file that they have used themselves and can recommend? Many thanks.
ginerjm Posted June 5, 2013
How much "text" is there in a PDF file? The ones I've managed to look at are pretty much gibberish.
grissom (Author) Posted June 5, 2013 (edited)
Sorry if I didn't make myself crystal clear. Of course, if you open a PDF file in an ASCII text editor, it will look like gibberish. What I am after is something to read the text that you would see if you opened it in a PDF reader.
Psycho Posted June 5, 2013
grissom said: "Sorry if I didn't make myself crystal clear. Of course, if you open a pdf file in an ascii text editor, it will look gibberish. What I am after is something to read the text that you would see if you opened it in a pdf reader."
And that is why your question is not accurate. Although the text is "displayed" as basic characters, that is not how it is stored within the PDF file. In fact, the characters you see within a PDF may not even be stored as "text" within the PDF. It all depends on how the PDF was created. Without knowing that, there is no way to give you an appropriate answer. If we translate this into an HTML equivalent, the problem is easier to understand. Here are three examples:
Example 1: <p>Can't get text</p>
Example 2 (some/all characters escaped): <p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; &#116;&#101;&#120;&#116;</p>
Example 3 (text stored as an image, or as line art in PDFs): <img src="cantget.jpg" />
Getting the text from the first example would be relatively easy. Getting the text from the second example would be more difficult and would require a process/function that "knows" the text is transformed and has a way to de-transform it. The third example would be very difficult and would require an OCR reader. But PDFs also have ways to create text on a page using graphical representations of characters. So "how" the text in the PDF was created is paramount to determining what you need in order to get the text. If you are dealing with one of the more complicated scenarios, you will likely not find a free solution, since it would take quite a bit of time and effort to build such a process using PHP only.
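To make the HTML analogy above concrete, here is a minimal PHP sketch of the three cases. It is an illustration only: the function name visibleText() is made up for this example, and the exact escaped form used for Example 2 is an assumption (every character written as a numeric entity). It uses only strip_tags() and html_entity_decode() from core PHP.

```php
<?php
// The three examples from the analogy, as PHP strings.
// Assumption: Example 2 escapes each character as a numeric entity.
$plain   = "<p>Can't get text</p>";
$escaped = '<p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; &#116;&#101;&#120;&#116;</p>';
$image   = '<img src="cantget.jpg" />';

// Hypothetical helper: remove the markup, then undo the entity escaping.
function visibleText(string $html): string
{
    return html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo visibleText($plain), "\n";   // case 1: the text is sitting there as-is
echo visibleText($escaped), "\n"; // case 2: needs the de-transform step
var_dump(visibleText($image));    // case 3: nothing to recover without OCR
```

The second case only works because the code knows which transformation was applied. In a PDF the equivalent knowledge is the stream filter and the font encoding, which is where simple scripts tend to fall over.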
ginerjm Posted June 5, 2013
Yes - you did say something else. You want to read a screen that is displaying the contents of a PDF. It took me all of 5 minutes to find out how to do it on Google. I'll let you do the same.
ginerjm Posted June 5, 2013
Egg on my face... Curious, I downloaded the item that I believed would do what you wanted. Sadly, it failed. So did the next two methods I located. So it seems that the problem is still in need of a solution.
grissom (Author) Posted June 6, 2013
Yes, which is why in my original post I clearly stated that I had already Googled around for an answer and had found that none of the methods I came across worked. Maybe there is no cast-iron method of doing this... which would be a shame.
ginerjm Posted June 6, 2013
I actually downloaded a script generally named PDF2Text.php and attempted to work through its processing. It seems to me that the script is putting out the formatting chars from the PDF and not the ASCII text, or something very close to that. My debugging followed the process of identifying and separating out the PDF objects, but my lack of knowledge of regex statements failed me in the last function, named getTextUsingTransformations. Perhaps if you know more about regex you can decipher what is supposed to be happening there.
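For anyone picking up where the debugging above stopped, here is a minimal pure-PHP sketch of the general idea: pull out the stream objects, inflate them, then collect the strings passed to the text-showing operators. This is not the code from PDF2Text.php and makes no claim about what getTextUsingTransformations does. All function names are hypothetical. It assumes the zlib extension is available (for gzuncompress()), that streams are either uncompressed or FlateDecode, and that text is stored as literal strings in parentheses shown with Tj or TJ. Hex strings, other filters, encrypted files and fonts with custom encodings are not handled, which would be one way to end up with gibberish output.

```php
<?php
// Sketch under stated assumptions; every function name here is hypothetical.

/** Return every stream body in the file, inflated when it is zlib data. */
function pdfStreams(string $pdf): array
{
    $streams = [];
    // Naive scan; a robust parser would honour each stream's /Length entry.
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $m);
    foreach ($m[1] as $raw) {
        $inflated = @gzuncompress($raw); // FlateDecode data is zlib-compressed
        $streams[] = ($inflated !== false) ? $inflated : $raw;
    }
    return $streams;
}

/** Resolve backslash escapes inside a PDF literal string. */
function pdfDecodeLiteral(string $s): string
{
    return preg_replace_callback(
        '/\\\\([nrtbf()\\\\]|[0-7]{1,3})/',
        function (array $m): string {
            $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08",
                    'f' => "\f", '(' => '(', ')' => ')', '\\' => '\\'];
            // Anything not in the map is an octal character code.
            return $map[$m[1]] ?? chr(octdec($m[1]) & 0xFF);
        },
        $s
    );
}

/** One entry per text-showing operator found in a decoded content stream. */
function pdfTextChunks(string $content): array
{
    // A literal string: "(...)" with backslash escapes, no nested parentheses.
    $lit = '\((?:[^()\\\\]|\\\\.)*\)';
    // Either "(string) Tj" or "[ (str) -20 (str) ... ] TJ".
    $pattern = '/(' . $lit . ')\s*Tj|\[((?:' . $lit . '|[^\]()])*)\]\s*TJ/s';
    preg_match_all($pattern, $content, $ops, PREG_SET_ORDER);

    $chunks = [];
    foreach ($ops as $op) {
        // Group 1 is filled for Tj; otherwise group 2 holds the TJ array body.
        $source = ($op[1] !== '') ? $op[1] : ($op[2] ?? '');
        preg_match_all('/' . $lit . '/s', $source, $strings);
        $text = '';
        foreach ($strings[0] as $s) {
            $text .= pdfDecodeLiteral((string) substr($s, 1, -1));
        }
        $chunks[] = $text;
    }
    return $chunks;
}

/** Join the chunks from every stream, one chunk per line. */
function pdfToText(string $pdf): string
{
    $lines = [];
    foreach (pdfStreams($pdf) as $stream) {
        foreach (pdfTextChunks($stream) as $chunk) {
            $lines[] = $chunk;
        }
    }
    return implode("\n", $lines);
}
```

Typical use would be echo pdfToText(file_get_contents('some.pdf')); where some.pdf is a placeholder. The numbers between the strings in a TJ array are kerning adjustments, which is probably the kind of "formatting chars" a script has to skip. If the output is still scrambled, the PDF most likely uses a font with its own character mapping (a ToUnicode table), and that needs considerably more work than a regex.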