samona Posted January 22, 2009
Is there a command to save a PDF as a text file so that I can extract data from it?
Link to comment https://forums.phpfreaks.com/topic/141966-is-there-a-command-to-save-a-pdf-file-as-text-in-php/
premiso Posted January 22, 2009
If it is an image, you would need to use OCR to convert it, and that is shaky at best. So there probably is a way, but whether it will work is a toss-up. Someone else may have done this and know, but as far as I know it is not possible.
samona Posted January 22, 2009 (Author)
It's just a report. I can open it in Adobe and Save As... a text file. I was just wondering if I can open it in PHP and save it as a text file.
Mchl Posted January 22, 2009
If it's not encrypted or compressed, it's a text file anyway. You just need to drop the PS control characters.
samona Posted January 22, 2009 (Author)
How would I do that? I don't understand what PS controls are.
Mchl Posted January 22, 2009
See the attached file (hello.pdf). It is an example of an uncompressed PDF file. If you open it in Acrobat, it displays 'Hello World'. If you open it in Notepad (or another text editor), you'll see lots of control characters, and on line 35 the words "Hello World". If your PDF files look like this, you can actually process them in PHP to remove all those control characters and leave just the content. If, however, your PDF files look nothing like this when opened in Notepad, they're probably compressed, in which case I'm afraid you can't extract text from them with PHP alone.
[attachment deleted by admin]
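The Notepad check described above can be roughed out in PHP. This is only a sketch under assumptions: the function name `pdfLooksLikePlainText` is made up for illustration, and the test is a simple heuristic (it looks for the `/Filter` and `/Encrypt` keys that compressed or encrypted PDFs normally contain), not a real PDF parser.

```php
<?php
// Hypothetical helper (not from the thread): a rough heuristic only.
// Returns true when the raw bytes look like a PDF whose content has
// not been compressed or encrypted.
function pdfLooksLikePlainText(string $raw): bool
{
    // A PDF file is expected to begin with a "%PDF-" header.
    if (strncmp($raw, '%PDF-', 5) !== 0) {
        return false;
    }
    // Compressed streams declare a /Filter (for example /FlateDecode);
    // encrypted files reference an /Encrypt dictionary.
    return strpos($raw, '/Filter') === false
        && strpos($raw, '/Encrypt') === false;
}
```

It would be called on the file contents, for example `pdfLooksLikePlainText(file_get_contents('report.pdf'))`, where `report.pdf` is a placeholder name.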
samona Posted January 22, 2009 (Author)
Yes, it does look like that. Even when I save it as a text file, it's all just characters. There are no images in the file.
Mchl Posted January 22, 2009
If so, extracting the text should be possible. You would need to take a look at the PDF Reference book and see which control characters enclose text objects. Or look for someone who can write this script for you. (I'm not going into it, too busy.)
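For an uncompressed file, a minimal sketch of that idea might look like the following. In PDF content streams a text object sits between the `BT` and `ET` operators, and the text itself is written as literal strings in parentheses that are shown with `Tj` (or `TJ` for an array of strings). The function name `extractPdfText` is an assumption made for illustration. The sketch ignores fonts, encodings, hex strings, nested unescaped parentheses, and any text that happens to contain the words BT or ET, so treat it as a starting point rather than a working converter.

```php
<?php
// Hypothetical helper (not from the thread): pulls literal strings out
// of BT ... ET text objects in an UNCOMPRESSED PDF. Sketch only.
function extractPdfText(string $raw): string
{
    // Find every text object: everything between BT and ET.
    if (!preg_match_all('/\bBT\b(.*?)\bET\b/s', $raw, $blocks)) {
        return '';
    }

    $lines = [];
    foreach ($blocks[1] as $block) {
        // Literal strings look like (text), with \( \) \\ as escapes.
        if (preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $block, $strings)) {
            // Join the pieces of one text object (this also merges
            // the fragments of a TJ array, dropping kerning numbers).
            $text = implode('', $strings[1]);
            // Undo the backslash escapes for parentheses and backslash.
            $lines[] = preg_replace('/\\\\([()\\\\])/', '$1', $text);
        }
    }

    // One output line per text object.
    return implode("\n", $lines);
}
```

Usage would be along the lines of `file_put_contents('report.txt', extractPdfText(file_get_contents('report.pdf')));`, with both file names as placeholders.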