
Parsing word docs without COM object?


envytomdead


Because of some legal requirement, MS had to release the specifications for all of their file formats. (Word 03 is the one I mean; Word 07 is already a known format.) So you could write your own parser if you're desperate and have a ton of time. Other than that, I guess just try googling.
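For the Word 07 side, here is a minimal sketch of why the "known format" is the easy case: a .docx file is a zip container, and the body text lives in word/document.xml. The function name readDocxText is made up for illustration, and the sketch assumes PHP's zip extension (ZipArchive) is installed. It only pulls plain paragraph text; it ignores tables, headers, footnotes, and empty paragraphs.

```php
<?php
// Hypothetical helper (not from this thread): read the body text of a
// .docx (Word 2007+) file. Assumes the PHP zip extension is available.
function readDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // missing file or not a zip container
    }

    // The main document part of an Office Open XML word-processing file.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // a zip file, but not a Word document
    }

    // Turn each paragraph end into a newline, then drop all remaining tags.
    $xml = str_replace('</w:p>', "\n", $xml);

    // Decode XML entities such as &amp; back into plain characters.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Calling readDocxText('resume.docx') should give back the paragraphs separated by newlines, or null if the file can't be opened. None of this helps with the older binary .doc files, which is what the rest of the thread is about.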


By the way, a "LAMP" language doesn't make sense since LAMP is generally Linux Apache MySQL PHP.

Sometimes the P in LAMP can stand for Perl or Python. That is what I meant by another LAMP language. So I will rephrase that and ask: would I be better off trying to do this with a Perl or Python script?

 

You're right, but since you posted in the PHP Help section it's assumed you're using PHP.

OK, I found this bit of code that works for the most part. However, there are still some issues with the first line getting lost.

 

The code checks whether each line contains a NULL character or has a string length of 0. If a line meets neither condition, it is added to the output, and the output is cleaned up at the end.

The problem is that there is occasionally good data in a line that contains 0x00. I found a way to get rid of the garbage characters; see the commented-out $mytemp lines. However, there are a lot of random strings that would look like good data (z @YYYY0@UnknownGTimes New Roman5Symbol3Arial qh2fSF Nkg4XNx). Any ideas?

 

Thanks

 

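function parseWord($userDoc)
{
    // Read the whole file as raw bytes. Returning "" on failure is an
    // assumption; the original code did not handle read errors.
    $line = @file_get_contents($userDoc);
    if ($line === false) {
        return "";
    }

    // Split the file on carriage returns (0x0D).
    $lines = explode(chr(0x0D), $line);
    //var_dump($lines);

    $outtext = "";
    $count = 0;

    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));

        if (($pos !== false) || (strlen($thisline) == 0)) {
            // Skipped: the line is empty or contains a NULL byte.
            // Experiments for cleaning up these lines instead:
            // $mytemp = strtr($thisline, '', '');
            // $mytemp = preg_replace('/[^\b\s]/', '', $mytemp);
            // $mytemp = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline);

            // if ($count == 0) {
            //     echo "The position in line " . $count . " is " . $pos . "<br />";
            //     echo $mytemp . "<br />";
            // }
        } else {
            $outtext .= $thisline . " ";
        }

        $count++;
    }

    // Strip every character that is not in this small whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);

    return $outtext;
}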
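
One possible angle on the 0x00 problem, as a rough sketch rather than a fix: Word 97-2003 files can store text as 16-bit characters, so plain ASCII text shows up as each letter followed by a 0x00 byte. That may be why good data ends up in the lines parseWord() skips. The helper below pulls out runs of printable characters in both 8-bit and 16-bit form, much like the Unix strings tool. The name extractTextRuns and the $minLen parameter are hypothetical, and the 4-character minimum is an arbitrary guess.

```php
<?php
// Hypothetical helper (not from the original post): collect runs of
// printable ASCII text from a binary blob instead of discarding every
// line that contains a 0x00 byte.
function extractTextRuns(string $data, int $minLen = 4): array
{
    $runs = [];

    // First alternative: 16-bit runs, i.e. a printable ASCII byte followed
    // by 0x00, repeated at least $minLen times.
    // Second alternative: ordinary 8-bit runs of printable ASCII.
    $pattern = '/(?:[\x20-\x7E]\x00){' . $minLen . ',}'
             . '|[\x20-\x7E]{' . $minLen . ',}/';

    if (preg_match_all($pattern, $data, $matches)) {
        foreach ($matches[0] as $run) {
            // Remove the interleaved NULL bytes of a 16-bit run.
            $runs[] = str_replace("\x00", '', $run);
        }
    }

    return $runs;
}
```

You would call it as extractTextRuns(file_get_contents($userDoc)) and join the result with spaces. It will still return font names and other metadata such as "Times New Roman", so it does not solve the garbage-string problem on its own, and it ignores non-ASCII characters entirely. Reliably separating body text from metadata would need the file's internal structure from the published format specification.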

