read ascii text from pdf file

grissom · June 5, 2013

Hi all

Basically, I'm looking for a way in PHP to read plain ASCII text out of any .pdf file.

I've Googled around a bit, but so far everything I've found suffers from one of two drawbacks:

1. It requires me to install something on the server, which I'd rather not do. Ideally, I'm looking for some PHP classes.

2. The recommended code, despite its promises, just doesn't work in practice!

So basically, I'm asking: does anybody know of anything that reads text from a PDF file, which they have used themselves and can recommend? Many thanks.

ginerjm · June 5, 2013

How much "text" is there in a PDF file? The ones I've managed to look at are pretty much gibberish.

grissom · June 5, 2013

Sorry if I didn't make myself crystal clear. Of course, if you open a PDF file in an ASCII text editor, it will look like gibberish. What I am after is something to read the text that you would see if you opened it in a PDF reader.

Psycho · June 5, 2013

And that is why your question is not accurate. Although the text is "displayed" as basic characters, that is not how it is stored within the PDF file. In fact, the characters you see in a PDF may not even be stored as "text" at all. It all depends on how the PDF was created. Without knowing that, there is no way to give you an appropriate answer. If we translate this into an HTML equivalent, the problem is easier to understand. Here are three examples:

Example 1:

<p>Can't get text</p>

Example 2 (Some/all characters escaped):

<p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; &#116;&#101;&#120;&#116;</p>

Example 3 (Text stored as image - or line art in PDFs):

<img src="cantget.jpg" />

Getting the text from the first example would be relatively easy. Getting the text from the second example would be more difficult: it would require a process/function that "knows" the text is transformed and has a way to de-transform it. The third example would be very difficult and would require an OCR reader. On top of that, PDFs also have ways to draw text on a page using graphical representations of characters (line art).

So, "how" the text in the PDF was created is key to determining what you need in order to get at it. If you are dealing with one of the more complicated scenarios, you will likely not find a free solution, since it would take quite a bit of time and effort to build such a process using PHP only.
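To make the HTML analogy concrete, here is a minimal sketch of what "de-transforming" looks like for the first two examples. The function name extractText is just made up for illustration, and this works on HTML only; it says nothing about how a real PDF would have to be decoded.

```php
<?php
// Example 1: plain text inside markup.
$plain = "<p>Can't get text</p>";
// Example 2: the same text, with every character stored as a numeric entity.
$escaped = "<p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; &#116;&#101;&#120;&#116;</p>";
// Example 3: the "text" only exists as pixels in an image.
$image = '<img src="cantget.jpg" />';

// Hypothetical helper: drop the markup, then reverse the entity escaping.
function extractText(string $html): string
{
    return html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

foreach ([$plain, $escaped, $image] as $html) {
    // Examples 1 and 2 both yield the sentence; example 3 yields an empty string.
    var_dump(extractText($html));
}
```

The point is that example 2 only works because the code knows which transformation was applied. With a PDF you first have to work out which of several possible transformations you are facing, and example 3 gives this approach nothing to work with at all.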

ginerjm · June 5, 2013

Yes - you did say something else.

You want to read a screen that is displaying the contents of a PDF. It took me all of 5 minutes to find out how to do it on Google. I'll let you do the same.

ginerjm · June 5, 2013

Egg on my face....

Curious, I downloaded the item that I believed would do what you wanted. Sadly, it failed. So did the next two methods I located. So it seems that the problem is still in need of a solution.

grissom · June 6, 2013

Yes, which is why in my original post I clearly stated that I had already Googled around for an answer and had found that none of the methods I came across worked.

Maybe there is no cast-iron method of doing this ... which would be a shame.

ginerjm · June 6, 2013

I actually downloaded a script generally named PDF2Text.php and attempted to work through its processing. It seems to me that the script is putting out the formatting characters from the PDF and not the ASCII text, or something very close to that. My debugging followed the process of identifying and separating out the PDF objects, but my lack of knowledge of regex statements failed me in the last function, named getTextUsingTransformations. Perhaps if you know more about regex you can decipher what is supposed to be happening there.
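For anyone following along, here is a minimal sketch of the general idea (separate out the streams, decompress them, then pull the strings out of the text objects with regex). This is not the code from PDF2Text.php. The function name extractPdfText is made up, and it assumes the zlib extension is available, that streams are either uncompressed or Flate-compressed, and that the text is stored as simple literal strings in parentheses.

```php
<?php
declare(strict_types=1);

// Hypothetical helper, only a sketch: handles the "easy" kind of PDF.
function extractPdfText(string $pdfBytes): string
{
    $lines = [];
    // Content sits between the "stream" and "endstream" keywords.
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdfBytes, $streams);
    foreach ($streams[1] as $raw) {
        // Try Flate (zlib) decompression; fall back to the raw bytes if it fails.
        $data = @gzuncompress($raw);
        if ($data === false) {
            $data = $raw;
        }
        // Text objects are delimited by the BT ... ET operators.
        preg_match_all('/\bBT\b(.*?)\bET\b/s', $data, $blocks);
        foreach ($blocks[1] as $block) {
            // Literal strings look like (Hello); "\\\\." skips escaped characters.
            // Nested unescaped parentheses and hex strings <...> are not handled.
            preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $block, $strings);
            $joined = implode('', $strings[1]);
            if ($joined !== '') {
                // Undo the \( \) and \\ escapes; other escapes are left as-is.
                $lines[] = preg_replace('/\\\\([()\\\\])/', '$1', $joined);
            }
        }
    }
    return implode("\n", $lines);
}

// Usage, assuming a local file named sample.pdf exists:
// echo extractPdfText(file_get_contents('sample.pdf'));
```

Even when this finds strings, they may come out as gibberish: the bytes inside the parentheses are character codes in whatever encoding the font uses, which is not necessarily ASCII. That may be one reason such scripts appear to print formatting characters rather than readable text, though that is a guess and not something confirmed for PDF2Text.php.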
