Help with pdf to text code


mikebyrne

I'm trying to modify a piece of code so that it takes the contents of a PDF document and writes the output to a text file.

 

My code is

 

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors     : Jonathan Beckett, 2005-05-02
//             : Sven Schuberth, 2007-03-29

function pdf2txt($filename){    

    $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');

    $data = getFileData($filename);
   
    $s=strpos($data,"%")+1;
   
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);

   
}
// handles version 1.2
function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = explode("/",$chunk["filter"]);
       
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used           
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
               
                    //$result_data .= "x";
                }
            }
        }
    }
   
    return $result_data;
}

//handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        //check if it is a string
        if(substr_count($obj,"/GS1")>0){
            //the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
    
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    unset($a_result);
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
error_reporting(E_ALL);
ini_set('display_errors', TRUE);

file_put_contents('txtfile.txt', pdf2txt('Athy Register.pdf'));
?>

 

When I run the code it just seems to freeze. Any idea why?

Well I assume you should move this line:

 

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');

 

To outside of all the functions. Not having written or looked at the code, I can't guarantee that will work for you, however.
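
For what it's worth, the freeze is consistent with where that line sits: inside pdf2txt() it makes the function call itself before doing anything else, so the call never returns. A minimal sketch of the intended layout follows, using a hypothetical stand-in function (pdf2txtStub is not part of the original script) so that it runs without a PDF:

```php
<?php
// Hypothetical stand-in for pdf2txt(): the body only works on its argument
// and never calls itself.
function pdf2txtStub($filename) {
    return "text from " . $filename;
}

// The call lives at the top level of the script, outside every function.
$pdftext = pdf2txtStub('Athy Register.pdf');
echo $pdftext, "\n";
```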

That seems to have fixed the problem with the code not loading, but I'm now getting this error:

 

Notice: Undefined variable: j in C:\xampp\htdocs\pdf2txt_test.php on line 38

 

$j++;

 

Any idea how I can fix this?

Initialize $j before you use it:

    $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
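
To see why -1 is the starting value: the loop increments $j before using it as an index, so the first chunk lands at index 0. Here is a small standalone sketch. The two placeholder strings are made up for illustration, and $a_chunks is also initialised here, which the original code does not do:

```php
<?php
$j = -1;              // start below zero so the first $j++ yields index 0
$a_chunks = array();  // define the array before writing to it

// Placeholder values standing in for the filters found in each object.
foreach (array("first", "second") as $value) {
    $j++;
    $a_chunks[$j]["filter"] = $value;
}

echo implode(",", array_keys($a_chunks)), "\n"; // prints 0,1
```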

Thanks, that seemed to fix it.

 

I'm now getting the error

 

Notice: Undefined variable: a_result in C:\xampp\htdocs\pdf2txt_test.php on line 140

 

Line 140 is in this function:

 

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    unset($a_result);
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;

 

 

 

Quite logical.

The line

unset($a_result);

basically says delete the variable $a_result, but $a_result doesn't actually exist at that point, so the call does nothing. unset() on an undefined variable is silent, so simply removing that line won't make the notice go away.

The notice comes from the return $a_result; at the end of the function. If the loop never finds a match, nothing is ever stored in $a_result, so the function tries to return a variable that was never defined.

Instead of simply removing the unset() line, replace it with

$a_result = array();
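
Put together, the function might look like the sketch below. It is not a drop-in copy of the original: getDataArraySafe is a hypothetical name, and it resumes searching after each end marker, whereas the original resumes at the end marker itself. It also assumes both markers are non-empty strings.

```php
<?php
// Hypothetical variant of getDataArray() that always returns an array.
function getDataArraySafe($data, $start_word, $end_word) {
    $a_result = array();  // defined up front, even if nothing matches
    $offset = 0;

    while (($start = strpos($data, $start_word, $offset)) !== false) {
        // Look for the end marker only after the start marker.
        $end = strpos($data, $end_word, $start + strlen($start_word));
        if ($end === false) {
            break;  // unterminated block: stop searching
        }
        // Keep the markers in the result, as the original does.
        $a_result[] = substr($data, $start, $end - $start + strlen($end_word));
        $offset = $end + strlen($end_word);
    }
    return $a_result;
}

print_r(getDataArraySafe("a(b)c(d)", "(", ")"));
```

Because this version always returns an array, is_array() on its result is always true. A caller that wants to know whether anything matched would test for an empty array instead.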

 

I'm now getting these errors:

 

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 40

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 44

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 46

 

The lines are in this function:

 

function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
    $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
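
A likely cause, assuming getDataArray() now begins with $a_result = array(): when nothing matches, it returns an empty array. is_array() is still true for an empty array, so the code goes on to read element 0, which raises "Undefined offset: 0". The sketch below guards with empty() instead. buildChunk is a hypothetical helper written for illustration and is not part of the original script.

```php
<?php
// Hypothetical helper: build one chunk from the two match lists, or return
// null when there is no element 0 to read.
function buildChunk(array $a_filter, array $a_data) {
    if (empty($a_filter)) {
        return null;  // no "<<...>>" block found in this object
    }
    $chunk = array("filter" => $a_filter[0], "data" => "");

    if (!empty($a_data)) {
        $prefix = "stream\r\n";
        $suffix = "endstream";
        // Strip the stream markers, as the original substr() call does.
        $chunk["data"] = substr(
            $a_data[0],
            strlen($prefix),
            strlen($a_data[0]) - strlen($prefix) - strlen($suffix)
        );
    }
    return $chunk;
}

var_dump(is_array(array()));             // bool(true)
var_dump(buildChunk(array(), array()));  // NULL
```

In handleV2() the same idea would mean replacing the two is_array() tests with !empty() tests, so that objects without a filter or a stream are skipped.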

 
