Under some legal agreement, Microsoft had to release the specifications for all of its file formats (I mean the Word 2003 format here; Word 2007 is already a documented format), so you could write your own parser if you're desperate and have a ton of time. Other than that, I guess just try Googling.

 

 

 

By the way, a "LAMP" language doesn't make sense, since LAMP generally means Linux, Apache, MySQL, PHP.

Sometimes the P in LAMP can stand for Perl or Python. That is what I meant by another LAMP language. So let me rephrase: would I be better off trying to do this with a Perl or Python script?

 

You're right, but since you posted in the PHP Help section it's assumed you're using PHP.

OK, I found this bit of code that works for the most part. However, there are still some issues with the first line getting lost.

 

The code checks whether each line contains a NULL character or has a length of 0. If neither is true, it adds the line to the output, which is cleaned up at the end.

The problem is that there is occasionally good data in a line that contains 0x00. I found a way to get rid of the garbage characters; see the commented-out $mytemp lines. However, there are a lot of random strings that would look like good data (z @YYYY0@UnknownGTimes New Roman5Symbol3Arial qh2fSF Nkg4XNx). Any ideas?

 

Thanks

 

function parseWord($userDoc)
{
    // Read the whole file as raw bytes; bail out if it cannot be read.
    $line = file_get_contents($userDoc);
    if ($line === false) {
        return "";
    }

    // Split on carriage returns (0x0D), which this code treats as line breaks.
    $lines = explode(chr(0x0D), $line);
    //var_dump($lines);
    $outtext = "";
    $count = 0;

    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
            // The line is empty or contains a NULL byte, so it is skipped.
            // Experiments for rescuing good data from these lines:
            // $mytemp=strtr($thisline,'','');
            //$mytemp=preg_replace('/[^\b\s]/','',$mytemp);
            // $mytemp=preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$thisline);

            //  if($count==0 ){
            // echo "The position in line ".$count." is ".$pos."<br />";
            //echo $mytemp."<br />";
            // }
        } else {
            $outtext .= $thisline . " ";
        }
        $count++;
    }

    // Strip every character that is not on the whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}
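
One possible direction for the lines that contain 0x00, offered as a rough sketch rather than a tested fix: if some of the good data is stored as two-byte characters (an ASCII byte followed by 0x00, as in UTF-16LE), then dropping every line with a NULL in it throws that text away. That storage layout is an assumption here, not something confirmed against the format specification. The hypothetical helper below, extractTextRuns(), with its made-up $minLen parameter, collects runs of printable characters in both one-byte and two-byte form, much like the Unix strings tool.

```php
<?php
// Sketch only: pull readable runs out of a binary string.
// extractTextRuns() and $minLen are hypothetical names, not part of any library.
function extractTextRuns(string $data, int $minLen = 4): array
{
    $runs = [];

    // Runs of printable ASCII (and tabs) stored one byte per character.
    if (preg_match_all('/[\x20-\x7E\t]{' . $minLen . ',}/', $data, $m)) {
        $runs = $m[0];
    }

    // Runs of the same characters stored two bytes each: a printable byte
    // followed by 0x00. ASSUMPTION: the text in question is UTF-16LE.
    if (preg_match_all('/(?:[\x20-\x7E\t]\x00){' . $minLen . ',}/', $data, $m)) {
        foreach ($m[0] as $wide) {
            // Drop the padding NULL bytes to get plain ASCII back.
            $runs[] = str_replace("\x00", '', $wide);
        }
    }

    return $runs;
}
```

It could be called as extractTextRuns(file_get_contents($userDoc)). Note that the one-byte runs come first and the two-byte runs second, so the result is not in document order. It also will not remove strings such as font names, since those are real text inside the file; telling them apart from the body text would likely need the published format specification.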
