mikebyrne Posted August 10, 2009

I'm trying to modify a piece of code so that it takes the contents of a PDF document and sends the output to a text file. My code is:

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors     : Jonathan Beckett, 2005-05-02
//             : Sven Schuberth, 2007-03-29

function pdf2txt($filename){
    $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    $data = getFileData($filename);
    $s=strpos($data,"%")+1;
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);
}

// handles version 1.2
function handleV2($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    foreach($a_obj as $obj){
        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0], strlen("stream\r\n"), strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }
    // decode the chunks
    foreach($a_chunks as $chunk){
        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
                    //$result_data .= "x";
                }
            }
        }
    }
    return $result_data;
}

// handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        // check if it is a string
        if(substr_count($obj,"/GS1")>0){
            // the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){
    $start = 0;
    $end = 0;
    unset($a_result);
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}

error_reporting(E_ALL);
ini_set('display_errors', TRUE);
file_put_contents('txtfile.txt', pdf2txt('Athy Register.pdf'));
?>

When I run the code it just seems to freeze. Any idea why?

Link to comment https://forums.phpfreaks.com/topic/169567-help-with-pdf-to-text-code/
GingerRobot Posted August 10, 2009

Quote: "When I run the code it just seems to freeze, any idea why?"

Yes. You're calling the function pdf2txt() from within itself, with no condition for breaking out, so the function keeps calling itself recursively and never returns.
mikebyrne Posted August 10, 2009

I'm a newbie when it comes to functions. How should it be written?
GingerRobot Posted August 10, 2009

Well, I assume you should move this line:

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');

to outside of all the functions. Not having written or looked at the code, I can't guarantee that will work for you, however.
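To make the suggestion concrete, here is a minimal sketch of the call structure. pdf2txt_demo() is a hypothetical stand-in for the real pdf2txt(), used only so the example runs without a PDF file; the point is that the function body contains no call to itself and the call sits at the top level of the script.

```php
<?php
// Hypothetical stand-in for pdf2txt(); it only illustrates where the call goes.
function pdf2txt_demo($filename) {
    // No call to pdf2txt_demo() in here, so the function returns normally
    // instead of calling itself again and again.
    return "text extracted from " . basename($filename);
}

// The call lives at the top level of the script, outside every function body.
$pdftext = pdf2txt_demo('Athy Register.pdf');
echo $pdftext, "\n";
```

In the original script the same effect comes from deleting the `$pdftext = pdf2txt(...)` line inside pdf2txt() and relying on the `file_put_contents(...)` call at the bottom of the file, which already calls the function from outside.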
mikebyrne Posted August 10, 2009

That seems to have fixed the problem with the code not loading, but I'm now getting this error:

Notice: Undefined variable: j in C:\xampp\htdocs\pdf2txt_test.php on line 38

$j++;

Any idea how I can fix this?
Mark Baker Posted August 10, 2009

Quote: "That seems to have fixed the problem with the code not loading, but I'm now getting this error: Notice: Undefined variable: j in C:\xampp\htdocs\pdf2txt_test.php on line 38 $j++; Any idea how I can fix this?"

Initialize $j before you use it:

$j = -1;
foreach($a_obj as $obj){
    $a_filter = getDataArray($obj,"<<",">>");
    if (is_array($a_filter)){
        $j++;
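As a small illustration of that fix, the sketch below builds the chunk list twice: once with a counter initialized before the loop, and once by appending with `[]`, which avoids the counter altogether. The `$filters` array is made-up sample data standing in for what getDataArray() would return, and `$b_chunks` is a name introduced here for comparison only.

```php
<?php
// Hypothetical sample data standing in for the filters found in each PDF object.
$filters = array("<< /Filter /FlateDecode >>", "<< /Length 12 >>");

// Option 1: initialize the counter before the loop, as suggested above.
$j = -1;
$a_chunks = array();
foreach ($filters as $filter) {
    $j++;                               // first pass makes this 0
    $a_chunks[$j]["filter"] = $filter;
}

// Option 2: let PHP choose the next index by appending with [].
$b_chunks = array();
foreach ($filters as $filter) {
    $b_chunks[] = array("filter" => $filter);
}
```

Both loops end up with the same array, so which one to use is mostly a matter of taste; the counter version stays closer to the original code.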
mikebyrne Posted August 10, 2009

Thanks, that seemed to fix it. I'm now getting this error:

Notice: Undefined variable: a_result in C:\xampp\htdocs\pdf2txt_test.php on line 140

Line 140 is in this function:

function getDataArray($data,$start_word,$end_word){
    $start = 0;
    $end = 0;
    unset($a_result);
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
Mark Baker Posted August 10, 2009

Quote: "I'm now getting this error: Notice: Undefined variable: a_result in C:\xampp\htdocs\pdf2txt_test.php on line 140"

Quite logical. The line

unset($a_result);

basically says "delete the variable $a_result", but $a_result doesn't actually exist at that point, so the line accomplishes nothing. The variable only comes into existence when a match is appended to it inside the loop. If no match is found, the function reaches return $a_result; with the variable still undefined, and that is what produces the notice.

So simply removing the unset() line won't help. Replace it with

$a_result = array();

so that the function always has an array to return, even an empty one.
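A minimal sketch of the function with the result array created up front is shown below. extractBetween() is a hypothetical name chosen so it does not clash with the original getDataArray(), and the loop is written slightly differently from the original (it resumes searching after the end marker), so treat it as an illustration of the initialization idea and not as a drop-in replacement.

```php
<?php
// Sketch only: extractBetween() is a hypothetical name, not part of the original script.
function extractBetween($data, $start_word, $end_word) {
    $a_result = array();   // always defined, even when nothing matches
    $offset = 0;
    while (($start = strpos($data, $start_word, $offset)) !== false) {
        // Look for the end marker after the start marker.
        $end = strpos($data, $end_word, $start + strlen($start_word));
        if ($end === false) {
            break;         // start marker without a matching end marker
        }
        // Keep both markers in the result, as the original function does.
        $a_result[] = substr($data, $start, $end - $start + strlen($end_word));
        $offset = $end + strlen($end_word);
    }
    return $a_result;
}

print_r(extractBetween("a (one) b (two)", "(", ")"));
```

Because the function now returns an empty array when nothing matches, callers can loop over the result directly without first checking whether it is set.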
mikebyrne Posted August 10, 2009

I'm now getting these errors:

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 40
Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 44
Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 46

The lines are in this function:

function handleV2($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $j = -1;
    foreach($a_obj as $obj){
        $a_filter = getDataArray($obj,"<<",">>");
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0], strlen("stream\r\n"), strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
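The thread ends without an answer to this last question, so the following is only a likely explanation, not a confirmed one. Once getDataArray() returns an empty array instead of an undefined variable, is_array() is true even when nothing was found, and reading index 0 of an empty array raises "Undefined offset: 0". A hedged sketch of a guard using empty() is shown below; `$matches_per_object` is made-up sample data standing in for the results of the two getDataArray() calls per object.

```php
<?php
// Hypothetical stand-ins: one object with a filter and a stream, one with neither.
$matches_per_object = array(
    array(
        "filter" => array("<< /Filter /FlateDecode >>"),
        "data"   => array("stream\r\nabc\r\nendstream"),
    ),
    array("filter" => array(), "data" => array()),
);

$a_chunks = array();
foreach ($matches_per_object as $m) {
    // is_array() is true for an empty array too, so test for content instead.
    if (empty($m["filter"])) {
        continue;
    }
    $chunk = array("filter" => $m["filter"][0], "data" => "");
    if (!empty($m["data"])) {
        $prefix = "stream\r\n";
        // Strip the leading "stream\r\n" and the trailing "endstream".
        $chunk["data"] = substr(
            $m["data"][0],
            strlen($prefix),
            strlen($m["data"][0]) - strlen($prefix) - strlen("endstream")
        );
    }
    $a_chunks[] = $chunk;
}
```

In handleV2() the equivalent change would be to replace the two `is_array(...)` checks with `!empty(...)`, assuming the `$a_result = array();` fix above is already in place.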
Archived
This topic is now archived and is closed to further replies.