I'm working on a PDF-to-text-file program and I'm getting the following errors:

 

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 40

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 44

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 46

 

The lines are in the function

 

function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
     $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }

 

My complete code is:

 

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors      : Jonathan Beckett, 2005-05-02
//                            : Sven Schuberth, 2007-03-29

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Register.pdf');


function pdf2txt($filename){    

    $data = getFileData($filename);
   
    $s=strpos($data,"%")+1;
   
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);

   
}
// handles version 1.2
function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
     $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);
       
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used           
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
               
                    //$result_data .= "x";
                }
            }
        }
    }
   
    return $result_data;
}

//handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        //check if it is a string
        if(substr_count($obj,"/GS1")>0){
            //the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
    
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = array();
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
error_reporting(E_ALL);
ini_set('display_errors', TRUE);

file_put_contents('txtfile.txt', pdf2txt('Athy Register.pdf'));
?>

 

Any help on this would be great


I've replaced that line but I'm still getting two errors:

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 44

 

Notice: Undefined offset: 0 in C:\xampp\htdocs\pdf2txt_test.php on line 46

 

It's got rid of one error.

 

Line 44 is

 

$a_chunks[$j]["data"] = substr($a_data[0],

 

Line 46 is

strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
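
Both remaining notices point at $a_data[0]. getDataArray() starts from array() and always returns an array, so is_array() is true even when nothing matched, and in that case index 0 simply does not exist. A minimal sketch of a guard, assuming the data has the same shape as in the script; firstOrNull() is a hypothetical helper, not part of the original code:

```php
<?php
// Hypothetical helper (not in the original script): return the first
// element of an array, or null when the array is empty.
function firstOrNull(array $items) {
    return isset($items[0]) ? $items[0] : null;
}

$a_data = array();                // what getDataArray() returns when nothing matches
var_dump(is_array($a_data));      // bool(true): is_array() does not catch this case
var_dump(firstOrNull($a_data));   // NULL, and no "Undefined offset" notice

$a_data = array("stream\r\nabc\r\nendstream");
var_dump(firstOrNull($a_data) !== null);   // bool(true)
```

Checking count($a_data) > 0 before touching $a_data[0] would have the same effect inside handleV2().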

That's why I've been saying to dump the data. You've obviously improved the code, but from the sound of it, it's failing virtually every conditional, so the script isn't actually doing a great deal. You need to debug the code and find out exactly where it's going wrong.

 

 

No, just use print_r or var_dump within the script to show the contents of the array/variable. Testing different arrays in different areas of the script will build up a picture of what's happening behind the scenes. For example, after you call the getDataArray() function, print out the data it returned. You'll then see whether it returned the data you expected or an empty array / null / false, which tells you where the problem is. To be honest, I think it's that function causing the problems anyway.

 

If you still feel a little lost, have a read through this tutorial and see if it helps - particularly the 'logical errors' section:

 

http://www.phpfreaks.com/tutorial/debugging-a-beginners-guide
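
As a rough illustration of the dumping approach, here is a sketch that calls getDataArray() on a made-up object string and prints what comes back. The function is copied from the script in this thread; the sample string in $obj is invented for the example and deliberately uses "\n" rather than "\r\n" after the stream keyword, which is one possible reason (an assumption, not something confirmed for this PDF) for an empty result:

```php
<?php
// Copy of getDataArray() from the script in this thread, behaviour unchanged.
function getDataArray($data, $start_word, $end_word) {
    $start = 0;
    $end = 0;
    $a_result = array();
    while ($start !== false && $end !== false) {
        $start = strpos($data, $start_word, $end);
        if ($start !== false) {
            $end = strpos($data, $end_word, $start);
            if ($end !== false) {
                // data is between start and end
                $a_result[] = substr($data, $start, $end - $start + strlen($end_word));
            }
        }
    }
    return $a_result;
}

// Invented sample object: the stream keyword is followed by "\n" only.
$obj = "5 0 obj << /Length 3 >> stream\nabc\nendstream endobj";

$a_filter = getDataArray($obj, "<<", ">>");
$a_data   = getDataArray($obj, "stream\r\n", "endstream");

var_dump($a_filter);   // one element: the dictionary between << and >>
var_dump($a_data);     // empty array: "stream\r\n" never occurs in $obj
```

Seeing an empty array at that point tells you the search strings, not the later decoding, are where to look first.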

I've put the print_r after each function but I'm still getting a blank screen.

 

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors      : Jonathan Beckett, 2005-05-02
//                            : Sven Schuberth, 2007-03-29

$pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Register.pdf');


function pdf2txt($filename){    

    $data = getFileData($filename);
   
    $s=strpos($data,"%")+1;
   
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);

   
}
// handles version 1.2
function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
     $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if ((is_array($a_filter)) && (count($a_filter) > 0)) {
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if ((is_array($a_data)) && (count($a_data) > 0)) {
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        $a_filter = split("/",$chunk["filter"]);
       
        if (isset($chunk['data'])) {
            // look at the filter to find out which encoding has been used           
            if (substr($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
               
                    //$result_data .= "x";
                }
            }
        }
    }
   
    if (isset($result_data)) {
    return $result_data;
} else {
    return false;
}
}

//handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        //check if it is a string
        if(substr_count($obj,"/GS1")>0){
            //the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
    print_r ($a_result);
    
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
    print_r ($result);
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
    print_r ($data);
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = array();
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
    print_r ($a_result);
}
error_reporting(E_ALL);
ini_set('display_errors', TRUE);

file_put_contents('txtfile.txt', pdf2txt('Athy Register.pdf'));
?>

Mike,

 

You have your print_r after the return statement in each function. A return exits the function immediately, so the print_r will never run.

 

   return $data;
   print_r ($data);

 

Should be:

  print_r ($data);
return $data;

 

That will show data and allow you to check to make sure that everything is going through properly.

 

 

Thanks

 

I'm just getting a mountain of nonsense, e.g.:

 

%PDF-1.2 %âãÏÓ 1 0 obj << /Type /Catalog /Pages 2 0 R /PageMode /UseNone /ViewerPreferences << /FitWindow true /PageLayout /SinglePage /NonFullScreenPageMode /UseNone >> >> endobj 2 0 obj << /Type /Pages /Kids [ 8 0 R 30 0 R 40 0 R 44 0 R 48 0 R 52 0 R 56 0 R 60 0 R 64 0 R 68 0 R 72 0 R 76 0 R 80 0 R 84 0 R 88 0 R 92 0 R 96 0 R 100 0 R 104 0 R 108 0 R 112 0 R 116 0 R 120 0 R 124 0 R 128 0 R 132 0 R 136 0 R 140 0 R 144 0 R 148 0 R 152 0 R 156 0 R 160 0 R 164 0 R 168 0 R 172 0 R 176 0 R 180 0 R 184 0 R 188 0 R 192 0 R 196 0 R 200 0 R 204 0 R 208 0 R 212 0 R 216 0 R 220 0 R 224 0 R 228 0 R 232 0 R 236 0 R 240 0 R 244 0 R 248 0 R 252 0 R 256 0 R 260 0 R 264 0 R 268 0 R 272 0 R 276 0 R 280 0 R 284 0 R 288 0 R 292 0 R 296 0 R 300 0 R 304 0 R 308 0 R 312 0 R 316 0 R 320 0 R 324 0 R 328 0 R 332 0 R 336 0 R 340 0 R 344 0 R 348 0 R 352 0 R 368 0 R 372 0 R 376 0 R 380 0 R ] /Count 85 /MediaBox 3 0 R /CropBox 4 0 R >> endobj 3 0 obj [ 0 0 595 841 ] endobj 4 0 obj [ 0 0 595 841 ] endobj 5 0 obj << /Length 1701 /Filter [ /FlateDecode ] >> stream xœ [compressed binary stream data omitted]
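
That output is not really nonsense: the readable part is the PDF object structure, and what follows the stream keyword is zlib-compressed data, as the /Filter [ /FlateDecode ] entry says. A rough sketch of pulling such a stream out and inflating it, assuming the stream keyword may be followed by either "\r\n" or "\n". extractStreams() is a hypothetical helper, and the sample object is built inside the script rather than read from a real PDF:

```php
<?php
// Hypothetical helper: return the raw bytes of every stream found in $pdf.
// Accepts both "stream\r\n" and "stream\n" as the opening delimiter.
function extractStreams($pdf) {
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $m);
    return $m[1];
}

// Build a fake object whose stream is zlib-compressed, like FlateDecode data.
$text = "(Hello) Tj";
$obj  = "5 0 obj << /Filter [ /FlateDecode ] >> stream\n"
      . gzcompress($text)
      . "\nendstream endobj";

foreach (extractStreams($obj) as $raw) {
    $plain = @gzuncompress($raw);   // false when the bytes are not valid zlib data
    if ($plain !== false) {
        echo $plain, "\n";
    }
}
```

Searching for endstream is only an approximation. A proper parser would use the /Length entry of the object to decide where the stream ends, since compressed bytes can contain anything.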

OK, let's rearrange your code, put the call statements up top and remove the print_r calls.

 

<?php
// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors      : Jonathan Beckett, 2005-05-02
//                            : Sven Schuberth, 2007-03-29


error_reporting(E_ALL);
ini_set('display_errors', TRUE);

file_put_contents('txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Register.pdf'));

if (file_exists('txtfile.txt')) {
    echo "File txtfile.txt has been created successfully and contains <br><br><pre>";
    echo file_get_contents('txtfile.txt');
}else {
    echo "File txtfile.txt was not created due to an issue.";
}

function pdf2txt($filename){    

    $data = getFileData($filename);
   
    $s=strpos($data,"%")+1;
   
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);

   
}
// handles version 1.2
function handleV2($data){
       
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
   
    $a_chunks = array();
    $j = -1;
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if ((is_array($a_filter)) && (count($a_filter) > 0)) {
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if ((is_array($a_data)) && (count($a_data) > 0)) {
                $a_chunks[$j]["data"] = substr($a_data[0],
strlen("stream\r\n"),
strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    $result_data = "";
    foreach($a_chunks as $chunk){

        if (isset($chunk['data'])) {
            // look at the filter to find out which encoding has been used
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                $data =@ gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                }
            }
        }
    }

    return $result_data;
}

//handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        //check if it is a string
        if(substr_count($obj,"/GS1")>0){
            //the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;

}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = array();
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}
?>

 

Give that a try and see what comes of it.

The text file will be created in whatever directory the script was run from. You can define a path to place it on your desktop, if that is easier for you, like so:

 

error_reporting(E_ALL);
ini_set('display_errors', TRUE);
$path = 'C:\Users\Mike\Desktop\';
file_put_contents($path . 'txtfile.txt', pdf2txt($path . 'Athy Register.pdf'));

if (file_exists($path . 'txtfile.txt')) {
    echo "File txtfile.txt has been created successfully and contains <br><br><pre>";
    echo file_get_contents($path . 'txtfile.txt');
}else {
    echo "File txtfile.txt was not created due to an issue.";
}

 

That will create it on the desktop for you. But the file does exist, or else PHP would not say it did :)
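
One caveat with the file_exists() check: it is true even when the file is empty. file_put_contents() returns the number of bytes written (or false on failure), so that value says more about whether anything was actually extracted. A small sketch, using a temporary file and an empty string as a stand-in for what pdf2txt() returned; the file name txtfile_demo.txt is made up for the example:

```php
<?php
// Write to the system temp directory so the sketch runs anywhere.
$target = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'txtfile_demo.txt';
$text   = '';   // stand-in for the value returned by pdf2txt()

$bytes = file_put_contents($target, $text);
if ($bytes === false) {
    echo "Could not write $target\n";
} elseif ($bytes === 0) {
    echo "File created, but it is empty: nothing was extracted\n";
} else {
    echo "Wrote $bytes bytes to $target\n";
}
```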

Ok let's try this one more time.

 

error_reporting(E_ALL);
ini_set('display_errors', TRUE);
$path = 'C:\\Users\\Mike\\Desktop\\';
file_put_contents($path . 'txtfile.txt', pdf2txt($path . 'Athy Register.pdf'));

if (file_exists($path . 'txtfile.txt')) {
    echo "File txtfile.txt has been created successfully and contains <br><br><pre>";
    echo file_get_contents($path . 'txtfile.txt');
}else {
    echo "File txtfile.txt was not created due to an issue.";
}

 

Replace that portion (the trailing backslash in $path was escaping the closing ', so the string never ended) and see if that corrects the issue.
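
For reference, this comes down to how PHP treats backslashes in single-quoted strings: \' is an escaped quote and \\ is a literal backslash, while every other backslash is left alone. A short sketch of two ways to write the path; the directory is the one used in this thread, and the forward-slash form relies on PHP on Windows accepting / as a separator, which it normally does:

```php
<?php
// Doubling each backslash gives a literal backslash, so the string ends properly.
$path = 'C:\\Users\\Mike\\Desktop\\';
echo $path, "\n";                        // C:\Users\Mike\Desktop\

// Forward slashes avoid the escaping question altogether.
$altPath = 'C:/Users/Mike/Desktop/';
echo $altPath . 'txtfile.txt', "\n";     // C:/Users/Mike/Desktop/txtfile.txt
```

Only the final backslash strictly needs doubling in a single-quoted string, but doubling all of them keeps the path consistent.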
