
read ascii text from pdf file


grissom


Hi all

 

Basically, I'm looking for a way in PHP to read the plain ASCII text out of any .pdf file.

I've Googled around a bit, but so far everything I've found suffers from one of two drawbacks:

1. It requires me to install something on the server, which I'd rather not do. Ideally I'm looking for some PHP classes.

2. The recommended code, despite its promises, just doesn't work in practice!

So basically, I'm asking whether anybody knows of anything for reading text from a PDF file that they have used themselves and can recommend. Many thanks.


Sorry if I didn't make myself crystal clear. Of course, if you open a PDF file in an ASCII text editor, it will look like gibberish. What I am after is something to read the text that you would see if you opened it in a PDF reader.

Edited by grissom


And that is why your question is not accurate. Although the text is "displayed" as basic characters, that is not how it is stored within the PDF file. In fact, the characters you see within a PDF may not even be stored as "text" within the PDF. It all depends on how the PDF was created. Without knowing that, there is no way to give you an appropriate answer. If we translate this into an HTML equivalent, the problem is easier to understand. Here are three examples:

 

Example 1:

 

<p>Can't get text</p>

Example 2 (some/all characters escaped):

<p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; &#116;&#101;&#120;&#116;</p>

Example 3 (Text stored as image - or line art in PDFs)

 

<img src="cantget.jpg" />

 

Getting the text from the first example would be relatively easy. Getting the text from the second example would be more difficult and would require a process/function that "knows" the text is transformed and has a way to de-transform it. The third example would be very difficult and would require an OCR reader. On top of that, PDFs also have ways to create text on a page using graphical representations of characters.
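
To make the HTML analogy concrete, here is a minimal PHP sketch of what "getting the text" means for each of the three examples. The helper name visibleText() is made up for illustration, and the escaped string assumes Example 2 used numeric character references. Only strip_tags() and html_entity_decode() are real PHP functions here.

```php
<?php
// Hypothetical helper: pull the visible text out of a small HTML fragment.
function visibleText(string $html): string
{
    // Drop the tags first, then undo any character escaping.
    $text = strip_tags($html);
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

$plain   = "<p>Can't get text</p>";                      // Example 1
$escaped = '<p>&#67;&#97;&#110;&#39;&#116; &#103;&#101;&#116; '
         . '&#116;&#101;&#120;&#116;</p>';               // Example 2 (assumed form)
$image   = '<img src="cantget.jpg" />';                  // Example 3

echo visibleText($plain), "\n";    // Can't get text
echo visibleText($escaped), "\n";  // Can't get text
var_dump(visibleText($image));     // string(0) "" : nothing to read without OCR
```

The second case only works because the code knows which transformation to undo. A PDF has the same problem, except that the transformation depends on the fonts and encodings the file was created with.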

 

So, "how" the text in the PDF is created is paramount to determining what you need to get the text. If you are dealing with one of the more complicated scenarios, you will likely not find a free solution, since it would take quite a bit of time and effort to build such a process using PHP only.
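
One rough way to find out which scenario a given PDF falls into, using PHP only, is to decode its streams and look for text-showing operators. This is a sketch under several assumptions: the function names pdfStreams() and describePdfText() are hypothetical, only FlateDecode (zlib) streams are inflated, the zlib extension is assumed to be available, and encrypted files will not be reported accurately.

```php
<?php
// Hypothetical helper: return the decoded bytes of every stream found in $pdf.
function pdfStreams(string $pdf): array
{
    $streams = [];
    // Stream data starts after "stream" + end-of-line and runs up to "endstream".
    preg_match_all('/stream\r?\n(.*?)endstream/s', $pdf, $matches);
    foreach ($matches[1] as $raw) {
        $decoded = $raw; // fallback: unknown filter or no compression
        // Try the captured bytes as-is, then without the trailing end-of-line.
        foreach ([$raw, rtrim($raw, "\r\n")] as $candidate) {
            // FlateDecode data is zlib-wrapped, which gzuncompress() can read.
            $inflated = ($candidate === '') ? false : @gzuncompress($candidate);
            if ($inflated !== false) {
                $decoded = $inflated;
                break;
            }
        }
        $streams[] = $decoded;
    }
    return $streams;
}

// Hypothetical helper: crude report on how the PDF appears to store its text.
function describePdfText(string $pdf): array
{
    $streams = pdfStreams($pdf);
    $hasText = false;
    foreach ($streams as $content) {
        // BT ... Tj/TJ means some text is drawn with real text operators.
        if (preg_match('/\bBT\b.*?(?:Tj|TJ)\b/s', $content)) {
            $hasText = true;
            break;
        }
    }
    $all = $pdf . "\n" . implode("\n", $streams);
    return [
        'streams'       => count($streams),
        'textOperators' => $hasText,
        // Image XObjects hint at scanned pages, which would need OCR.
        'images'        => (bool) preg_match('/\/Subtype\s*\/Image\b/', $all),
    ];
}
```

Calling describePdfText(file_get_contents('example.pdf')) on a file of your own (the file name is a placeholder) gives a quick hint: text operators present suggests one of the first two scenarios, while images with no text operators suggests the third.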


Yes, which is why in my original post I clearly stated that I had already Googled around for an answer and had found that none of the methods I came across worked.

Maybe there is no cast-iron method of doing this ... which would be a shame.


I actually downloaded a script generally named PDF2Text.php and attempted to work through its processing. It seems to me that the script is outputting the formatting characters from the PDF and not the ASCII text, or something very close to that. My debugging followed the process of identifying and separating out the PDF objects, but my lack of knowledge of regex statements failed me in the last function, named getTextUsingTransformations. Perhaps if you know more about regex you can decipher what is supposed to be happening there.
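
For anyone picking this up: in a decoded PDF content stream, text is drawn between BT and ET by operators such as (string) Tj and [(str) -20 (ing)] TJ, so the regex work largely comes down to finding those operands. The following is my own minimal sketch of that idea, not the code from PDF2Text.php. The function names are hypothetical, it handles literal strings only (no hex strings, no nested parentheses), it assumes a simple single-byte font encoding, and it puts a space between show operations, which is a crude guess at word spacing.

```php
<?php
// Hypothetical helper: undo the backslash escapes allowed in a PDF literal string.
function unescapePdfString(string $s): string
{
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f"];
    $result = preg_replace_callback(
        '/\\\\([0-7]{1,3}|\r\n|.)/s',
        function (array $m) use ($map): string {
            $e = $m[1];
            if (strspn($e, '01234567') === strlen($e)) {
                return chr(octdec($e) & 0xFF);  // \ddd octal character code
            }
            if (isset($map[$e])) {
                return $map[$e];                // \n, \r, \t, \b, \f
            }
            if ($e === "\n" || $e === "\r" || $e === "\r\n") {
                return '';                      // backslash + newline: line continuation
            }
            return $e;                          // \( \) \\ : keep the character itself
        },
        $s
    );
    return $result ?? $s;
}

// Hypothetical helper: collect the strings shown by Tj, ', " and TJ operators.
function textFromContentStream(string $content): string
{
    // A literal string: "(" ... ")" with backslash escapes inside.
    $str = '\((?:\\\\.|[^\\\\()])*\)';
    // Group 1: body of a "[ ... ] TJ" array. Group 2: operand of Tj, ' or ".
    $pattern = '/\[((?:' . $str . '|[^\]()])*)\]\s*TJ|(' . $str . ')\s*(?:Tj|\'|")/s';

    $out = '';
    preg_match_all($pattern, $content, $ops, PREG_SET_ORDER);
    foreach ($ops as $op) {
        $body = $op[2] ?? $op[1];
        // Numbers inside a TJ array are kerning adjustments and are skipped.
        preg_match_all('/' . $str . '/s', $body, $strings);
        foreach ($strings[0] as $literal) {
            $out .= unescapePdfString(substr($literal, 1, -1));
        }
        $out .= ' ';
    }
    return trim($out);
}

echo textFromContentStream('BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET'), "\n";
// Hello World
```

If a PDF embeds subsetted or CID fonts, the bytes that come out of this will be glyph codes rather than readable characters, which would match the "formatting chars instead of text" symptom described above. Mapping those back needs the font's ToUnicode table, and this sketch does not attempt that.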

