mikebyrne Posted July 8, 2009

I'm using a piece of code I found to read the contents of a PDF file and put the output into a text file, but I can't get the contents to pass through. My code is:

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors     : Jonathan Beckett, 2005-05-02
//             : Sven Schuberth, 2007-03-29

function pdf2txt($filename){
    $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    $data = getFileData($filename);
    $s=strpos($data,"%")+1;
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);
}

// handles version 1.2
function handleV2($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){
        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0], strlen("stream\r\n"), strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }
    // decode the chunks
    foreach($a_chunks as $chunk){
        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
                    //$result_data .= "x";
                }
            }
        }
    }
    return $result_data;
}

// handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        // check if it is a string
        if(substr_count($obj,"/GS1")>0){
            // the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
    file_put_contents('C:\Users\Mike\Desktop\file.txt');
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
    file_put_contents('C:\Users\Mike\Desktop\file.txt');
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){
    $start = 0;
    $end = 0;
    unset($a_result);
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
    file_put_contents('C:\Users\Mike\Desktop\file.txt');
}
?>

ignace Posted July 8, 2009

Call the function from outside and write its return value to the text file:

file_put_contents('/path/to/txtfile.txt', pdf2txt('/path/to/pdfile.pdf'));

This line does not belong inside the pdf2txt() function:

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
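
To illustrate the point about where the call belongs: a function that calls itself unconditionally on its first line never reaches the rest of its body, so the call to pdf2txt() has to sit outside the function. Below is a minimal sketch of that layout. pdf2txt_demo() is a hypothetical stand-in that simply returns the raw file contents (it is not the real extractor), and the temporary files are only there so the snippet runs on its own.

```php
<?php
// Hypothetical stand-in for the real extractor, kept tiny so the sketch runs alone.
function pdf2txt_demo($filename) {
    // The body only does the work; it never calls pdf2txt_demo() again.
    $data = file_get_contents($filename);
    return $data === false ? '' : $data;
}

// Temporary input and output files, used here instead of real paths.
$source = tempnam(sys_get_temp_dir(), 'src');
$target = tempnam(sys_get_temp_dir(), 'out');
file_put_contents($source, 'sample text');

// The call and the write happen once, at the top level of the script.
$bytes = file_put_contents($target, pdf2txt_demo($source));
```

file_put_contents() returns the number of bytes written, or false on failure, so $bytes is worth checking when the output file does not appear.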

mikebyrne Posted July 9, 2009

So I just remove

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');

and replace it with

file_put_contents('/path/to/txtfile.txt', pdf2txt('/path/to/pdfile.pdf'));

?

ignace Posted July 9, 2009

I see you made several changes to the bodies of certain functions. Remove all of those lines; they don't belong there.

1) Name (or rename) the file that holds the declarations of the pdf functions pdf2txt.php.
2) Create a new file, pdf2txt_test.php. Make sure both files are in the same directory under your server root.
3) Add this content to your pdf2txt_test.php file:

<?php
error_reporting(E_ALL);
ini_set('display_errors', TRUE);

// load the function declarations from the other file
require_once 'pdf2txt.php';

// modify the paths
file_put_contents('/path/to/txtfile.txt', pdf2txt('/path/to/pdfile.pdf'));
?>

4) Start your web server and point your browser at the directory where you added both PHP files (pdf2txt.php and pdf2txt_test.php).
5) You now get a listing. Click pdf2txt_test.php; a blank page should show up, without any errors on screen.
6) Go to the location of the text file ('/path/to/txtfile.txt') and open it. It should now contain the text from the PDF file.
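
The steps above can also be sketched with a few extra checks, so that a blank page is never the only feedback. This is only a sketch under assumptions: save_extracted_text() is a hypothetical helper name, not part of the original script, and it takes the extractor as a callable so that it can be tried without the PDF code loaded.

```php
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: runs $extractor on $pdfPath and writes the result
 * to $txtPath. Returns the number of bytes written and throws a
 * RuntimeException that names the step that failed.
 */
function save_extracted_text($pdfPath, $txtPath, $extractor) {
    if (!is_readable($pdfPath)) {
        throw new RuntimeException("Cannot read input file: $pdfPath");
    }

    $text = call_user_func($extractor, $pdfPath);
    if (!is_string($text) || $text === '') {
        throw new RuntimeException('Extractor returned no text');
    }

    // file_put_contents() returns false when the write fails.
    $bytes = file_put_contents($txtPath, $text);
    if ($bytes === false) {
        throw new RuntimeException("Cannot write output file: $txtPath");
    }
    return $bytes;
}

// With the thread's files this would be used roughly as:
// require_once 'pdf2txt.php';
// echo save_extracted_text('/path/to/pdfile.pdf', '/path/to/txtfile.txt', 'pdf2txt'), " bytes written\n";
```

Each failure then shows up as an uncaught exception with a message, which separates "the PDF could not be read" from "nothing was extracted" and from "the text file could not be written".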

mikebyrne Posted July 9, 2009

I ran the test file and I get the blank screen with no errors, but the text file is not created on the desktop. My code looks like this:

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors     : Jonathan Beckett, 2005-05-02
//             : Sven Schuberth, 2007-03-29

function pdf2txt($filename){
    $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    $data = getFileData($filename);
    $s=strpos($data,"%")+1;
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);
}

// handles version 1.2
function handleV2($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){
        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0], strlen("stream\r\n"), strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }
    // decode the chunks
    foreach($a_chunks as $chunk){
        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
                    //$result_data .= "x";
                }
            }
        }
    }
    return $result_data;
}

// handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        // check if it is a string
        if(substr_count($obj,"/GS1")>0){
            // the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){
    $start = 0;
    $end = 0;
    unset($a_result);
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}

error_reporting(E_ALL);
ini_set('display_errors', TRUE);

// modify the paths
file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
?>

mikebyrne Posted July 10, 2009

Does anyone have any idea why this code doesn't produce an error report or pass the PDF contents to the text file?
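
Two things stand out in the code as posted, though neither is a confirmed diagnosis: pdf2txt() still calls itself on its first line, and handleV2() tests the filter with substr(), which takes an integer offset rather than a search string, so that test does not check whether the filter mentions FlateDecode. A minimal sketch of the filter test written with strpos() follows; the helper name filter_uses_flate() is hypothetical and the sample filter strings are made up for illustration.

```php
<?php
// Hypothetical helper name; the check itself relies only on strpos().
function filter_uses_flate($filter) {
    // strpos() returns false when the needle is absent, and may return 0
    // when it is found at the start, hence the strict comparison.
    return strpos($filter, 'FlateDecode') !== false;
}

// Made-up filter dictionaries, one compressed and one not.
$filters = array(
    '<< /Filter /FlateDecode /Length 42 >>',
    '<< /Length 10 >>',
);

foreach ($filters as $filter) {
    echo $filter, ' => ', filter_uses_flate($filter) ? 'FlateDecode' : 'other', "\n";
}
```

Note also that the @ in front of gzuncompress() hides any decompression warning, so removing it while debugging may make a silent failure visible.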

stavroch Posted February 27, 2013

I am using the code from https://code.google.com/p/lucene-silverstripe-plugin/source/browse/trunk/thirdparty/?r=19 but I get a charset error with Greek characters.