envytomdead Posted April 9, 2009
Has anyone done this or have any ideas? Would another LAMP language be better to attempt this? -Thanks
Topic link: https://forums.phpfreaks.com/topic/153391-parsing-word-docs-without-com-object/
corbin Posted April 9, 2009
Under some legal requirement, MS had to release the specifications for all of their file formats (Word 03 is what I mean... Word 07 is a known format), so you could write something yourself if you're desperate and have a ton of time. Other than that, I guess just try googling. By the way, a "LAMP" language doesn't make sense, since LAMP generally means Linux, Apache, MySQL, PHP.
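For the "Word 07 is a known format" case, a minimal sketch may help: a .docx file is a zip archive whose body text lives in word/document.xml, so PHP's ZipArchive class can read it without COM. The function name docxToText is hypothetical, the sketch assumes the zip extension is installed, and it ignores headers, footnotes and formatting.

```php
<?php
// Hypothetical helper (not from the thread): pull plain text out of a .docx file.
// Assumes the PHP zip extension is available.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    // The main document body is stored under this name inside the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }
    // Each paragraph ends with </w:p>; turn that into a newline before stripping tags.
    $xml = str_replace('</w:p>', "\n", $xml);
    // Remove all markup, then decode XML entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

This does nothing for the older binary .doc files the rest of the thread is about; it only covers the newer format.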
envytomdead (Author) Posted April 9, 2009
Sometimes the P in LAMP can stand for Perl or Python. That is what I meant by another LAMP language. So I will rephrase that and ask: would I be better off trying to do this with a Perl or Python script?
corbin Posted April 9, 2009
If I had to guess, I would say you're best off with PHP, then Python, then Perl. I can't imagine a Word class in Perl or Python, although I think finding one in Python would be more likely than in Perl.
Maq Posted April 10, 2009
Quoting envytomdead: "Sometimes the P in LAMP can stand for Perl or Python. That is what I meant by another LAMP language. So I will rephrase that and ask: would I be better off trying to do this with a Perl or Python script?"
You're right, but since you posted in the PHP Help section it's assumed you're using PHP.
Mark Baker Posted April 10, 2009
I've seen at least one library on phpClasses for reading Word documents in pure PHP. I can't comment on whether it's any good, and you'd have to search for it, but it might provide a starting point for you.
envytomdead (Author) Posted April 14, 2009
OK, I found this bit of code that works for the most part. However, there are still some issues with the first line getting lost. The code checks whether each line contains a NULL character or has a string length of 0; if neither is true, it adds the line to the output, which is cleaned up at the end. The problem is that there is occasionally good data in a line that contains 0x00. I found a way to get rid of the garbage characters; see the commented-out $mytemp lines. However, there are a lot of random strings that would look like good data (z @YYYY0@UnknownGTimes New Roman5Symbol3Arial qh2fSF Nkg4XNx). Any ideas? Thanks

    function parseWord($userDoc)
    {
        // Read the whole file as raw bytes ("rb" keeps the read binary-safe).
        $fileHandle = fopen($userDoc, "rb");
        $line = @fread($fileHandle, filesize($userDoc));
        fclose($fileHandle);

        // Word ends paragraphs with a carriage return (0x0D).
        $lines = explode(chr(0x0D), $line);
        //var_dump($lines);
        $outtext = "";
        $count = 0;
        foreach ($lines as $thisline) {
            $pos = strpos($thisline, chr(0x00));
            // Skip empty lines and any line containing a NUL byte.
            if (($pos !== FALSE) || (strlen($thisline) == 0)) {
                // $mytemp=strtr($thisline,'','');
                //$mytemp=preg_replace('/[^\b\s]/','',$mytemp);
                // $mytemp=preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$thisline);
                // if($count==0 ){
                // echo "The position in line ".$count." is ".$pos."<br />";
                //echo $mytemp."<br />";
                // }
            } else {
                $outtext .= $thisline . " ";
            }
            $count++;
        }
        // Strip everything except a small whitelist of characters.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }
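One likely reason good data shows up in lines containing 0x00: the old binary .doc format can store text as 16-bit characters, in which case every plain ASCII letter is followed by a NUL byte. Here is a rough sketch that collects readable runs in both forms instead of discarding any line with a NUL in it. The function name extractTextRuns and the minimum run length of 4 are assumptions for illustration, and it only handles ASCII-range characters.

```php
<?php
// Hypothetical helper (not from the thread): collect runs of readable text
// from raw bytes, whether stored one byte or two bytes per character.
function extractTextRuns(string $bytes, int $minLen = 4): array
{
    // First alternative: printable ASCII, each character followed by a NUL
    // (16-bit little-endian text). Second alternative: ordinary 8-bit runs.
    $pattern = '/(?:[\x20-\x7E]\x00){' . $minLen . ',}|[\x20-\x7E]{' . $minLen . ',}/';

    $runs = [];
    if (preg_match_all($pattern, $bytes, $matches)) {
        foreach ($matches[0] as $run) {
            // Dropping the NUL padding turns 16-bit runs into plain strings;
            // 8-bit runs contain no NULs and pass through unchanged.
            $runs[] = str_replace("\x00", '', $run);
        }
    }
    return $runs;
}
```

Calling extractTextRuns(file_get_contents($userDoc)) returns the runs in file order. It will still pick up things like font names (Times New Roman, Arial), because those are stored the same way as body text. Separating them out reliably probably means following the file's internal structure as described in the published specification rather than scanning bytes.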