mikebyrne

Posts posted by mikebyrne

  1. I've tried the following code, but I'm still not getting any results.

     

    <?php
    // Function    : pdf2txt()
    // Arguments   : $filename - Filename of the PDF you want to extract
    // Description : Reads a pdf file, extracts data streams, and manages
    //               their translation to plain text - returning the plain
    //               text at the end
    // Authors      : Jonathan Beckett, 2005-05-02
    //                            : Sven Schuberth, 2007-03-29
    
    function pdf2txt($filename){    
    
        $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    
        $data = getFileData($filename);
       
        $s=strpos($data,"%")+1;
       
        $version=substr($data,$s,strpos($data,"%",$s)-1);
        if(substr_count($version,"PDF-1.2")==0)
            return handleV3($data);
        else
            return handleV2($data);
    
       
    }
    // handles version 1.2
    function handleV2($data){
           
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
       
        foreach($a_obj as $obj){
           
            $a_filter = getDataArray($obj,"<<",">>");
       
            if (is_array($a_filter)){
                $j++;
                $a_chunks[$j]["filter"] = $a_filter[0];
    
                $a_data = getDataArray($obj,"stream\r\n","endstream");
                if (is_array($a_data)){
                    $a_chunks[$j]["data"] = substr($a_data[0],
    strlen("stream\r\n"),
    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
                }
            }
        }
    
        // decode the chunks
        foreach($a_chunks as $chunk){
    
            // look at each chunk and decide how to decode it - by looking at the contents of the filter
            $a_filter = split("/",$chunk["filter"]);
           
            if ($chunk["data"]!=""){
                // look at the filter to find out which encoding has been used           
                if (substr($chunk["filter"],"FlateDecode")!==false){
                    $data =@ gzuncompress($chunk["data"]);
                    if (trim($data)!=""){
                        $result_data .= ps2txt($data);
                    } else {
                   
                        //$result_data .= "x";
                    }
                }
            }
        }
       
        return $result_data;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
    }
    
    //handles versions >1.2
    function handleV3($data){
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
        $result_data="";
        foreach($a_obj as $obj){
            //check if it a string
            if(substr_count($obj,"/GS1")>0){
                //the strings are between ( and )
                preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
                if(is_array($field))
                    foreach($field as $data)
                        $result_data.=$data[1];
            }
        }
        return $result_data;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
    }
    
    function ps2txt($ps_data){
        $result = "";
        $a_data = getDataArray($ps_data,"[","]");
        if (is_array($a_data)){
            foreach ($a_data as $ps_text){
                $a_text = getDataArray($ps_text,"(",")");
                if (is_array($a_text)){
                    foreach ($a_text as $text){
                        $result .= substr($text,1,strlen($text)-2);
                    }
                }
            }
        } else {
            // the data may just be in raw format (outside of [] tags)
            $a_text = getDataArray($ps_data,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
        return $result;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));	
    }
    
    function getFileData($filename){
        $handle = fopen($filename,"rb");
        $data = fread($handle, filesize($filename));
        fclose($handle);
        return $data;
    }
    
    function getDataArray($data,$start_word,$end_word){
    
        $start = 0;
        $end = 0;
        unset($a_result);
       
        while ($start!==false && $end!==false){
            $start = strpos($data,$start_word,$end);
            if ($start!==false){
                $end = strpos($data,$end_word,$start);
                if ($end!==false){
                    // data is between start and end
                    $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
                }
            }
        }
        return $a_result;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
    }
    ?>
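
Independently of where the function is called from, one line in `handleV2()` is worth a second look: `substr($chunk["filter"],"FlateDecode")!==false` passes a string where `substr()` expects a numeric offset, so it does not test whether the filter mentions FlateDecode; `strpos()` does. `split()` was also removed in PHP 7, with `explode()` as the usual replacement. A minimal sketch of the check, assuming the zlib extension is loaded; `decodeChunk()` is a made-up helper and the chunk is built with `gzcompress()` purely for illustration:

```php
<?php
// Hypothetical helper: decode one chunk if its filter says FlateDecode.
function decodeChunk(array $chunk): ?string
{
    // strpos() returns false when the needle is absent.
    if (strpos($chunk['filter'], 'FlateDecode') === false) {
        return null;
    }
    $plain = @gzuncompress($chunk['data']);
    return $plain === false ? null : $plain;
}

// Made-up test chunk; real PDF streams are not guaranteed to decode this way.
$chunk = [
    'filter' => '<< /Length 42 /Filter /FlateDecode >>',
    'data'   => gzcompress('(Hello) Tj'),
];

// explode() as the replacement for split("/", ...).
$names = explode('/', $chunk['filter']);
echo decodeChunk($chunk), "\n";
```

A chunk whose filter does not mention FlateDecode comes back as `null`, so the caller can skip it rather than trying to decompress it.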
    

     

  2. I'm using a piece of code to pass text from a PDF to a text file, but when it runs the text file isn't created and I'm not getting any error reports.

     

    Any ideas why?

     

     

    <?php
    // Function    : pdf2txt()
    // Arguments   : $filename - Filename of the PDF you want to extract
    // Description : Reads a pdf file, extracts data streams, and manages
    //               their translation to plain text - returning the plain
    //               text at the end
    // Authors      : Jonathan Beckett, 2005-05-02
    //                            : Sven Schuberth, 2007-03-29
    
    function pdf2txt($filename){
    
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));    
    
        $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    
        $data = getFileData($filename);
       
        $s=strpos($data,"%")+1;
       
        $version=substr($data,$s,strpos($data,"%",$s)-1);
        if(substr_count($version,"PDF-1.2")==0)
            return handleV3($data);
        else
            return handleV2($data);
    
       
    }
    // handles version 1.2
    function handleV2($data){
           
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
       
        foreach($a_obj as $obj){
           
            $a_filter = getDataArray($obj,"<<",">>");
       
            if (is_array($a_filter)){
                $j++;
                $a_chunks[$j]["filter"] = $a_filter[0];
    
                $a_data = getDataArray($obj,"stream\r\n","endstream");
                if (is_array($a_data)){
                    $a_chunks[$j]["data"] = substr($a_data[0],
    strlen("stream\r\n"),
    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
                }
            }
        }
    
        // decode the chunks
        foreach($a_chunks as $chunk){
    
            // look at each chunk and decide how to decode it - by looking at the contents of the filter
            $a_filter = split("/",$chunk["filter"]);
           
            if ($chunk["data"]!=""){
                // look at the filter to find out which encoding has been used           
                if (substr($chunk["filter"],"FlateDecode")!==false){
                    $data =@ gzuncompress($chunk["data"]);
                    if (trim($data)!=""){
                        $result_data .= ps2txt($data);
                    } else {
                   
                        //$result_data .= "x";
                    }
                }
            }
        }
       
        return $result_data;
    }
    
    //handles versions >1.2
    function handleV3($data){
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
        $result_data="";
        foreach($a_obj as $obj){
            //check if it a string
            if(substr_count($obj,"/GS1")>0){
                //the strings are between ( and )
                preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
                if(is_array($field))
                    foreach($field as $data)
                        $result_data.=$data[1];
            }
        }
        return $result_data;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
    }
    
    function ps2txt($ps_data){
        $result = "";
        $a_data = getDataArray($ps_data,"[","]");
        if (is_array($a_data)){
            foreach ($a_data as $ps_text){
                $a_text = getDataArray($ps_text,"(",")");
                if (is_array($a_text)){
                    foreach ($a_text as $text){
                        $result .= substr($text,1,strlen($text)-2);
                    }
                }
            }
        } else {
            // the data may just be in raw format (outside of [] tags)
            $a_text = getDataArray($ps_data,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
        return $result;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));	
    }
    
    function getFileData($filename){
        $handle = fopen($filename,"rb");
        $data = fread($handle, filesize($filename));
        fclose($handle);
        return $data;
    }
    
    function getDataArray($data,$start_word,$end_word){
    
        $start = 0;
        $end = 0;
        unset($a_result);
       
        while ($start!==false && $end!==false){
            $start = strpos($data,$start_word,$end);
            if ($start!==false){
                $end = strpos($data,$end_word,$start);
                if ($end!==false){
                    // data is between start and end
                    $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
                }
            }
        }
        return $a_result;
        file_put_contents('C:\Users\Mike\Desktop\txtfile.txt', pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf'));
    }
    ?>
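
On the missing error reports: PHP may be configured to hide them, so it can help to switch reporting on at the top of the script while debugging. Separately, the first statement inside `pdf2txt()` calls `pdf2txt()` again, so the function would keep calling itself and never get as far as reading the file. A small sketch of the reporting part only; the `$problems` array and the missing-file path are invented for the example:

```php
<?php
// Show every notice, warning and error while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Optional: also collect the messages so the script can inspect them.
$problems = [];
set_error_handler(function (int $no, string $msg, string $file, int $line) use (&$problems) {
    $problems[] = "$msg (line $line)";
    return false; // let PHP's normal reporting run as well
});

// Reading a file that does not exist now produces a visible warning.
$data = file_get_contents(sys_get_temp_dir() . '/no-such-file.pdf');
if ($data === false) {
    echo "Read failed: ", $problems[0] ?? 'no message captured', "\n";
}
```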
    

  3. I'm using a piece of code I found to read the contents of a PDF file and put the output into a text file, but the contents aren't making it into the text file.

     

    My code is:

     

    <?php
    // Function    : pdf2txt()
    // Arguments   : $filename - Filename of the PDF you want to extract
    // Description : Reads a pdf file, extracts data streams, and manages
    //               their translation to plain text - returning the plain
    //               text at the end
    // Authors      : Jonathan Beckett, 2005-05-02
    //                            : Sven Schuberth, 2007-03-29
    
    function pdf2txt($filename){
    
        $pdftext = pdf2txt('C:\Users\Mike\Desktop\Athy Database\Athy Register.pdf');
    
        $data = getFileData($filename);
       
        $s=strpos($data,"%")+1;
       
        $version=substr($data,$s,strpos($data,"%",$s)-1);
        if(substr_count($version,"PDF-1.2")==0)
            return handleV3($data);
        else
            return handleV2($data);
    
       
    }
    // handles version 1.2
    function handleV2($data){
           
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
       
        foreach($a_obj as $obj){
           
            $a_filter = getDataArray($obj,"<<",">>");
       
            if (is_array($a_filter)){
                $j++;
                $a_chunks[$j]["filter"] = $a_filter[0];
    
                $a_data = getDataArray($obj,"stream\r\n","endstream");
                if (is_array($a_data)){
                    $a_chunks[$j]["data"] = substr($a_data[0],
    strlen("stream\r\n"),
    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
                }
            }
        }
    
        // decode the chunks
        foreach($a_chunks as $chunk){
    
            // look at each chunk and decide how to decode it - by looking at the contents of the filter
            $a_filter = split("/",$chunk["filter"]);
           
            if ($chunk["data"]!=""){
                // look at the filter to find out which encoding has been used           
                if (substr($chunk["filter"],"FlateDecode")!==false){
                    $data =@ gzuncompress($chunk["data"]);
                    if (trim($data)!=""){
                        $result_data .= ps2txt($data);
                    } else {
                   
                        //$result_data .= "x";
                    }
                }
            }
        }
       
        return $result_data;
    }
    
    //handles versions >1.2
    function handleV3($data){
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
        $result_data="";
        foreach($a_obj as $obj){
            //check if it a string
            if(substr_count($obj,"/GS1")>0){
                //the strings are between ( and )
                preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
                if(is_array($field))
                    foreach($field as $data)
                        $result_data.=$data[1];
            }
        }
        return $result_data;
        file_put_contents('C:\Users\Mike\Desktop\file.txt');
    }
    
    function ps2txt($ps_data){
        $result = "";
        $a_data = getDataArray($ps_data,"[","]");
        if (is_array($a_data)){
            foreach ($a_data as $ps_text){
                $a_text = getDataArray($ps_text,"(",")");
                if (is_array($a_text)){
                    foreach ($a_text as $text){
                        $result .= substr($text,1,strlen($text)-2);
                    }
                }
            }
        } else {
            // the data may just be in raw format (outside of [] tags)
            $a_text = getDataArray($ps_data,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
        return $result;
        file_put_contents('C:\Users\Mike\Desktop\file.txt');	
    }
    
    function getFileData($filename){
        $handle = fopen($filename,"rb");
        $data = fread($handle, filesize($filename));
        fclose($handle);
        return $data;
    }
    
    function getDataArray($data,$start_word,$end_word){
    
        $start = 0;
        $end = 0;
        unset($a_result);
       
        while ($start!==false && $end!==false){
            $start = strpos($data,$start_word,$end);
            if ($start!==false){
                $end = strpos($data,$end_word,$start);
                if ($end!==false){
                    // data is between start and end
                    $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
                }
            }
        }
        return $a_result;
        file_put_contents('C:\Users\Mike\Desktop\file.txt');
    }
    ?>
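
Two things stand out in the code above: `file_put_contents()` is called with only a filename, but it needs the data to write as its second argument, and each call sits after a `return`, so it is never reached. A sketch of the intended flow, with a hypothetical `extractText()` standing in for `pdf2txt()` and a temp file as an assumed output path:

```php
<?php
// Hypothetical stand-in for pdf2txt(); it only returns the text.
function extractText(string $source): string
{
    return strtoupper($source);
    // Anything placed here, after the return, would never execute.
}

$outFile = tempnam(sys_get_temp_dir(), 'txt'); // assumed output path
$text = extractText('athy register');

// file_put_contents(filename, data) returns the byte count or false.
$written = file_put_contents($outFile, $text);
if ($written === false) {
    echo "Could not write $outFile\n";
} else {
    echo "Wrote $written bytes\n";
}
```

Keeping the write outside the extraction function also means the function can be reused for other files without touching the disk.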
    

  4. Probably a stupid question, but I found the code below to play with. How do I point it to my PDF file?

     

    <?php
    // Function    : pdf2txt()
    // Arguments   : $filename - Filename of the PDF you want to extract
    // Description : Reads a pdf file, extracts data streams, and manages
    //               their translation to plain text - returning the plain
    //               text at the end
    // Authors      : Jonathan Beckett, 2005-05-02
    //                            : Sven Schuberth, 2007-03-29
    
    function pdf2txt($filename){
    
        $data = getFileData($filename);
       
        $s=strpos($data,"%")+1;
       
        $version=substr($data,$s,strpos($data,"%",$s)-1);
        if(substr_count($version,"PDF-1.2")==0)
            return handleV3($data);
        else
            return handleV2($data);
    
       
    }
    // handles version 1.2
    function handleV2($data){
           
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
       
        foreach($a_obj as $obj){
           
            $a_filter = getDataArray($obj,"<<",">>");
       
            if (is_array($a_filter)){
                $j++;
                $a_chunks[$j]["filter"] = $a_filter[0];
    
                $a_data = getDataArray($obj,"stream\r\n","endstream");
                if (is_array($a_data)){
                    $a_chunks[$j]["data"] = substr($a_data[0],
    strlen("stream\r\n"),
    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
                }
            }
        }
    
        // decode the chunks
        foreach($a_chunks as $chunk){
    
            // look at each chunk and decide how to decode it - by looking at the contents of the filter
            $a_filter = split("/",$chunk["filter"]);
           
            if ($chunk["data"]!=""){
                // look at the filter to find out which encoding has been used           
                if (substr($chunk["filter"],"FlateDecode")!==false){
                    $data =@ gzuncompress($chunk["data"]);
                    if (trim($data)!=""){
                        $result_data .= ps2txt($data);
                    } else {
                   
                        //$result_data .= "x";
                    }
                }
            }
        }
       
        return $result_data;
    }
    
    //handles versions >1.2
    function handleV3($data){
        // grab objects and then grab their contents (chunks)
        $a_obj = getDataArray($data,"obj","endobj");
        $result_data="";
        foreach($a_obj as $obj){
            //check if it a string
            if(substr_count($obj,"/GS1")>0){
                //the strings are between ( and )
                preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
                if(is_array($field))
                    foreach($field as $data)
                        $result_data.=$data[1];
            }
        }
        return $result_data;
    }
    
    function ps2txt($ps_data){
        $result = "";
        $a_data = getDataArray($ps_data,"[","]");
        if (is_array($a_data)){
            foreach ($a_data as $ps_text){
                $a_text = getDataArray($ps_text,"(",")");
                if (is_array($a_text)){
                    foreach ($a_text as $text){
                        $result .= substr($text,1,strlen($text)-2);
                    }
                }
            }
        } else {
            // the data may just be in raw format (outside of [] tags)
            $a_text = getDataArray($ps_data,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
        return $result;
    }
    
    function getFileData($filename){
        $handle = fopen($filename,"rb");
        $data = fread($handle, filesize($filename));
        fclose($handle);
        return $data;
    }
    
    function getDataArray($data,$start_word,$end_word){
    
        $start = 0;
        $end = 0;
        unset($a_result);
       
        while ($start!==false && $end!==false){
            $start = strpos($data,$start_word,$end);
            if ($start!==false){
                $end = strpos($data,$end_word,$start);
                if ($end!==false){
                    // data is between start and end
                    $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
                }
            }
        }
        return $a_result;
    }
    ?>
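
For what it's worth, the function takes the path as its `$filename` argument, so pointing it at a file is just a call placed after the function definitions, outside any function body. A minimal sketch, using a hypothetical stand-in `firstBytes()` in place of `pdf2txt()` so the snippet runs on its own; the file is created in the temp directory purely for illustration:

```php
<?php
// Hypothetical stand-in for pdf2txt(): takes a path, returns a string.
function firstBytes(string $filename, int $count = 8): string
{
    $data = file_get_contents($filename);
    if ($data === false) {
        throw new RuntimeException("Cannot read $filename");
    }
    return substr($data, 0, $count);
}

// Illustration only: create a small file so there is something to read.
$path = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($path, "%PDF-1.4\n...");

// The call lives here, at the top level, not inside the function body.
// Single quotes keep Windows backslashes literal, e.g. 'C:\folder\file.pdf'.
$header = firstBytes($path);
echo $header, "\n";
```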
    

     

  5. Each page is in a fairly standard format, for example:

     

    Polling Station: Athy Boys Nat. School -

    Room 01

    Ardreigh (ED Athy Rural) Athy

    1 Callan, Cthy Wxxxide

    2 Callan, Lam Waysxxxe

    3 Callan, Marget Wayde

    4 Callen, Cathne Bray Road

    5 Callen, Tithy Bray Road

    6 Carbery, Emma Farill

    7 Carbery, Jeriah Farmhill

    8 Carbery, Mry Farml

    9 Carbery, Sarh Fahill

    10 Cuy, Brian Bchlawn

    11 Cully-Wall, Brda Beechlawn

     

    There are 2 columns on each page.
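
As a rough illustration of how regular that layout is, each numbered entry can be matched with one pattern: a number, a surname up to the comma, then the remainder. This is only a sketch against the sample lines above; the function name `parseEntry()` is made up, and it does not try to separate first name from address, since the sample does not mark that boundary.

```php
<?php
// Hypothetical parser for lines such as "4 Callen, Cathne Bray Road".
function parseEntry(string $line): ?array
{
    if (!preg_match('/^(\d+)\s+([^,]+),\s*(.+)$/', trim($line), $m)) {
        return null; // headings such as "Room 01" fall through here
    }
    return ['number' => (int) $m[1], 'surname' => $m[2], 'rest' => $m[3]];
}

$lines = [
    'Polling Station: Athy Boys Nat. School -',
    'Room 01',
    '1 Callan, Cthy Wxxxide',
    '11 Cully-Wall, Brda Beechlawn',
];
foreach ($lines as $line) {
    $entry = parseEntry($line);
    if ($entry !== null) {
        echo $entry['number'], ' | ', $entry['surname'], ' | ', $entry['rest'], "\n";
    }
}
```

The two-column page layout is not handled here; that depends on how the text comes out of the PDF.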

     

  6. I have a comma-separated file in which every row ends with Co.Kildare, but on a large number of these rows it appears with a comma at the end, i.e. Co.Kildare,

     

    How could I code the replacement of Co.Kildare, with Co.Kildare?

     

    <?php
    $test = file_get_contents("C:\Users\Mike\Desktop\AthyDB.txt");
    
    // Replacement code??
    
    file_put_contents('C:\Users\Mike\Desktop\test.txt', $test);
    ?>
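
One way to do the replacement is `str_replace()`, but since only a trailing comma should go, a regular expression anchored to the end of each line is a bit safer. A sketch, assuming the stray comma is the last character on the row (optionally before a Windows line ending); `stripTrailingComma()` is a made-up name and the sample rows are invented:

```php
<?php
// Hypothetical helper: drop the comma after "Co.Kildare" at the end of a row.
function stripTrailingComma(string $text): string
{
    // /m makes $ match at the end of every line; \r? allows CRLF endings.
    $fixed = preg_replace('/Co\.Kildare,(?=\r?$)/m', 'Co.Kildare', $text);
    return $fixed ?? $text; // preg_replace() returns null on a regex error
}

$sample = "D,1,A,B,Athy,Co.Kildare,\r\nS,2,C,D,Athy,Co.Kildare\r\n";
echo stripTrailingComma($sample);
```

In the original script this would sit between `file_get_contents()` and `file_put_contents()`, applied to `$test`.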
    

  7. <?php
    $rewrote = "";
    $handle = fopen("C:\Users\Mike\Desktop\AthyDB.txt", "r+"); // Open the file for reading.
    
    if ($handle) {
    while (!feof($handle)) // Loop until end of file.
    {
    $currentline = fgets($handle, 4096); // Read a line.
    $currentline=preg_replace('/^D$/','Dail European Parliament and Local Elections only', $currentline);
    $currentline=preg_replace('/^S$/','Post or special arrangement only', $currentline);
    $currentline=preg_replace('/^L$/','Local Elections only', $currentline);
    $currentline=preg_replace('/^E$/','European Parliament and Local Elections only', $currentline);
    $rewrote .= $currentline;
    $rewrote .= "\n";
    }
    file_put_contents('C:\Users\Mike\Desktop\test.txt', $rewrote);
    fclose($handle);
    }
    ?>
    

     

    That seems to fix the repeating problem, but the code seems to stop replacing the letters after 2,000+.

     

    Very strange. Any idea why it stops replacing?
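
It is hard to say from the snippet alone why the replacement stops, but two details may be worth ruling out. The patterns now end in `$`, so they only match when the letter is the entire line, and a Windows line ending (`\r\n`) left on the line by `fgets()` would be enough to prevent a match. Also, `fgets()` keeps the newline, so appending `"\n"` again doubles the line breaks. A sketch of a loop that trims the line ending first; the temp file and its contents are invented for the example:

```php
<?php
// Invented sample with mixed line endings.
$in = tempnam(sys_get_temp_dir(), 'in');
file_put_contents($in, "D\nS\r\nL\r\n");

$out = '';
$handle = fopen($in, 'r');
// Testing fgets() directly avoids an extra pass after the last line.
while (($line = fgets($handle)) !== false) {
    $line = rtrim($line, "\r\n"); // strip LF or CRLF
    $line = preg_replace('/^D$/', 'Dail European Parliament and Local Elections only', $line);
    $line = preg_replace('/^S$/', 'Post or special arrangement only', $line);
    $line = preg_replace('/^L$/', 'Local Elections only', $line);
    $out .= $line . "\n"; // add exactly one newline back
}
fclose($handle);
echo $out;
```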

  8. I have a file that looks something like the following:

     

    D,2796,Son,Oler,13 Dun Bnn,Bch Road,Ahy,Co.Kire

    S,2797,Gerty,Laurce,15  Dun Brn,Bleach ad,Ahy,Co.Kilde

    L,2801,Mazse,Saras,17  Dn Brn,Blch Rod,Aty,Co.Kilre

    E,2808,Esjo,Leel,21  Dun Bnn,Blach Road,Ay,Co.Kilre

     

    What I am trying to do is replace the single-letter codes "D", "S", "L" and "E" with the following sentences:

     

    Dail European Parliament and Local Elections only (instead of D)

     

    European Parliament and Local Elections only (instead of E)

     

    Local Elections only (instead of L)

     

    Post or special arrangement only (instead of S)

     

    With lots of help, we compiled the following:

     

    <?php
    $rewrote = "";
    $handle = fopen("C:\Users\Mike\Desktop\AthyDB.txt", "r+"); // Open the file for reading.
    
    if ($handle) {
    while (!feof($handle)) // Loop until end of file.
    {
    $currentline = fgets($handle, 4096); // Read a line.
    $currentline=preg_replace('/^D/','Dail European Parliament and Local Elections only', $currentline);
    $currentline=preg_replace('/^S/','Post or special arrangement only', $currentline);
    $currentline=preg_replace('/^L/','Local Elections only', $currentline);
    $currentline=preg_replace('/^E/','European Parliament and Local Elections only', $currentline);
    $rewrote .= $currentline;
    $rewrote .= "\n";
    }
    file_put_contents('C:\Users\Mike\Desktop\test.txt', $rewrote);
    fclose($handle);
    }
    ?>
    

     

    It seems that replacing "D" produces "Dail European Parliament and Local Elections onlyail European Parliament and Local Elections only"

     

    Any idea what's gone wrong?

     

    It also seems to happen for "L" a few times, but not every time (strange).
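
The doubled text looks like what you would get if a row already starts with the full sentence: `/^D/` matches the leading `D` of `Dail ...` and replaces it again, and the same goes for `/^L/` and `Local ...`. That would happen if the script were run on a file it had already converted, though that is a guess. A sketch that only touches rows whose first field is exactly one of the four letters, so running it twice is harmless; `expandCode()` and the `$labels` table are names made up for the example:

```php
<?php
// Lookup table taken from the four codes described in the post.
$labels = [
    'D' => 'Dail European Parliament and Local Elections only',
    'E' => 'European Parliament and Local Elections only',
    'L' => 'Local Elections only',
    'S' => 'Post or special arrangement only',
];

// Hypothetical helper: expand the first comma-separated field if it is a code.
function expandCode(string $row, array $labels): string
{
    $parts = explode(',', $row, 2); // [first field, rest of row]
    if (isset($labels[$parts[0]])) {
        $parts[0] = $labels[$parts[0]];
    }
    return implode(',', $parts);
}

$row   = 'L,2801,Mazse,Saras,17  Dn Brn,Blch Rod,Aty,Co.Kilre';
$once  = expandCode($row, $labels);
$twice = expandCode($once, $labels); // second pass changes nothing
echo $once, "\n";
```

Matching the whole first field rather than just its first character is the design choice here: a row that starts with `Local Elections only` is no longer mistaken for the code `L`.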
