Jocka Posted December 8, 2006

This code came from a script redbullmarky has. Its objective is to strip the text out of a doc file, but it isn't working for me. I've looked through everything and found where it goes wrong, but then I'm stuck.

Here's the code used to "decode":

[code]
function stripSpecial($input)
{
    $search  = array(chr(145), chr(146), chr(96), chr(132), chr(147), chr(148), chr(133), chr(150));
    $replace = array( "'", "'", "'", '"', '"', '"', '...', '-');
    $output  = addslashes(str_replace($search, $replace, $input));

    // now strip out all the junk/control chars, etc
    $output = stripslashes(substr($output, 0, strpos($output, '\0\0')));

    // get rid of any remaining control chars
    $output = preg_replace('/'.chr(19).'(.*?)'.chr(20).'/', '', $output);
    $output = str_replace(chr(21), '', $output);

    return $output;
}
[/code]

It goes wrong here:

[code]
$output = stripslashes(substr($output, 0, strpos($output, '\0\0')));
[/code]

After it goes through the hassle of stripping out most of the code and looking for the text, it comes to this point. This (I'm guessing) is supposed to find a significant point in the doc file where the text starts, but this isn't the case for MY doc files. From the looks of it, it's trying to find where this point starts and read after it, but when it does this, all it returns is: [b]x[/b]

I've tried everything I could think of, but since the top section randomly changes length, I can't use length, and since '\0\0' doesn't come up in my doc files, it doesn't find a place to stop.

Any ideas?
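The line that goes wrong can be tried in isolation. What follows is only a minimal sketch, not part of redbullmarky's script: truncateAtDoubleNull is a hypothetical helper name and the sample strings are made up. It relies on two documented behaviours: addslashes() turns every NUL byte into the two characters \0, so the single-quoted needle '\0\0' matches two consecutive NUL bytes, and strpos() returns false when the needle is missing, which substr() then reads as a length of 0.

```php
<?php
// Hypothetical helper (not from the original script): cut the escaped
// string at the first double NUL, but leave it untouched when none exists.
function truncateAtDoubleNull(string $escaped): string
{
    // Single quotes: the needle is the four literal characters \0\0,
    // which is what addslashes() produces for two NUL bytes.
    $pos = strpos($escaped, '\0\0');

    // strpos() returns false when the needle is missing; passing that
    // straight to substr() would be read as a length of 0.
    if ($pos === false) {
        return $escaped;
    }

    return substr($escaped, 0, $pos);
}

// Made-up inputs: one with a double NUL, one without.
$withMarker    = addslashes("Hello world\0\0trailing junk");
$withoutMarker = addslashes("Hello world, no marker here");

echo truncateAtDoubleNull($withMarker), "\n";
echo truncateAtDoubleNull($withoutMarker), "\n";
```

Returning the whole string when no marker is found is an assumption made for the sketch; depending on the file, it may be more useful to report the failure instead.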
Ninjakreborn Posted December 8, 2006

I am impressed red was even able to create that; that is amazing. I have spent my whole time thinking it was impossible. So I don't know how to help, I didn't even know that could be done.
Jocka Posted December 8, 2006

Yeah, it CAN be done. With a 3rd party module I can do it, but my server can't take anything (it's crap..)
roopurt18 Posted December 8, 2006

By doc file I'm assuming you mean MS Word. There could be a version problem with the code you have. This might help you out some: http://www.wotsit.org/search.asp?s=text

(EDIT) Looking at the code you have again, it looks like that particular function is looking for a double null char as the sentinel where the text ends, not begins.
roopurt18 Posted December 8, 2006 Share Posted December 8, 2006 I wrote up a little script to test this out myself.Here is the beginning of the raw contents of my doc file:[code]ÐÏࡱá>þÿ þÿÿÿ‹Œÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á!` ð¿X8bjbj\\ .^>Ç>ÇX0ÿÿÿÿÿÿ¤¼:¼:¼:¼:,è:dhX;X;X;X;X;X;X;X;ƒƒƒƒƒƒ$j‘hÒ“|§ÓǼ:%@HÒ0ቄN”A N”eN”eX;Z²;@ÉAò;4&<X;X;X;§§³AX;X;X;Ó<Ó<Ó<Ó<äø"Äø"ÿÿÿÿPharaoh Information SystemsTo doLast Update: Thu. Nov 9, 2006[/code]If I add the following to the function:[code] $output = addslashes(str_replace($search, $replace, $input)); echo "<pre>" . print_r($output, true) . "</pre>";[/code]then the beginning of $output looks like:[code]ÐÏࡱá\0\0[/code]So in my case the substr portion returns only:[code]ÐÏࡱá[/code] as my word document.However, there are many more sequences of \0\0 before the actual document begins. Chances are red means to capture the last of those sequences and return that with the substr function. Quote Link to comment Share on other sites More sharing options...
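The difference between stopping at the first \0\0 and starting after the last one can be sketched with strpos() and strrpos(). This is only an illustration of the guess in the post: the $raw string is a made-up stand-in for a doc file, and nothing here confirms that the text of a real Word document follows the last double NUL.

```php
<?php
// Made-up stand-in for a doc file: junk, NUL padding, junk, NUL padding, text.
$raw     = "HDR\0\0junk\0\0Pharaoh Information Systems";
$escaped = addslashes($raw);            // each NUL byte becomes the two characters \0

$needle = '\0\0';                       // four literal characters, matching two NUL bytes
$first  = strpos($escaped, $needle);    // offset of the first double NUL
$last   = strrpos($escaped, $needle);   // offset of the last double NUL

// What the original substr() call keeps: everything before the first match.
$beforeFirst = substr($escaped, 0, $first);

// The guess from the post: keep what follows the last match instead.
$afterLast = stripslashes(substr($escaped, $last + strlen($needle)));

echo $beforeFirst, "\n";
echo $afterLast, "\n";
```

On this toy input the first variant keeps only the leading junk, while the second keeps the readable text. A real file may well contain further NUL runs after the text, so the "last match" rule is an assumption to test rather than a fix.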
Jocka Posted December 8, 2006

The problem here is that there's really no telling where the document actually BEGINS (as far as I can see, anyway). There has to be a way to strip all that out though. Red said it works perfectly for him... I don't know, I'm lost. I'm doing the 3rd party searches again to see if I can find one that's compatible.
roopurt18 Posted December 8, 2006

If you look at the wotsit link I provided, you will get detailed information on how to strip the relevant information out of a .doc file. However, if you're looking for quick and dirty, you could create a regexp character class that contains all the characters you want to keep, negate it, and replace everything matched by the negated class with an empty string. Not as reliable, but probably quicker than dissecting the format.
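The quick and dirty regexp idea might look roughly like this. It is a sketch under assumptions: keepPrintable is a hypothetical name, the "keep" list (printable ASCII plus tab, CR and LF) is just one possible choice, and the sample bytes are invented.

```php
<?php
// Hypothetical helper: keep printable ASCII plus tab, CR and LF,
// and delete every other byte.
function keepPrintable(string $input): string
{
    // [^...] is the negated class: it matches any byte NOT in the keep list.
    // No /u modifier, so the pattern works byte by byte on binary data.
    return preg_replace('/[^\x20-\x7E\t\r\n]+/', '', $input);
}

// Invented sample: a few binary bytes, a stray readable word, then real text.
$sample = "\xD0\xCF\x11\xE0junk\0\0\xFF\xFFTo do\r\nLast Update: Thu. Nov 9, 2006";

echo keepPrintable($sample), "\n";
```

Note that the stray word "junk" survives in the output: any byte inside the binary structures that happens to be printable is kept, which is why this approach is less reliable than parsing the format.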
Jocka Posted December 8, 2006

Yeah, I tried the regex thing.. then I got stuck with a bunch of useless letters sitting in the middle.. I did look at that link btw, but I'm all for that "quick" way.
redbullmarky Posted December 8, 2006

Jocka, the rest of the code I sent you strips the header off the doc file before the stripping of stuff even begins. In the loadFile method, you'll notice this:

[code]
<?php
function loadFile($filename = '', $header_len = 2560)
{
    // ... check stuff
    $this->header = fread($fo, $header_len);
    // ... read rest of file
}
?>
[/code]

which whips off the first 2560 bytes from the DOC, so it's actually this bit that gets to the start of the 'real' content. I came up with this figure after testing out several doc files from Word 97/Office XP, etc. Then comes the StripSpecial method that cleans things up.

Part of the StripSpecial method came from Bartek's post in the PHP manual [url=http://uk2.php.net/htmlentities]here[/url], to which I added bits to clean things up even further.
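For anyone wanting to try the header-skipping step on its own, here is a rough standalone sketch. It is not the original loadFile method, whose full body isn't shown in the thread: readAfterHeader is a hypothetical name, and the demo file is a fake made of 2560 filler bytes followed by a short text. The 2560 default is the figure quoted in the post and may not suit every Word version.

```php
<?php
// Hypothetical standalone version of the idea; the original loadFile() is a
// class method whose full body is not shown in the thread.
function readAfterHeader(string $filename, int $headerLen = 2560): string
{
    $fo = fopen($filename, 'rb');          // binary-safe read
    if ($fo === false) {
        throw new RuntimeException("Cannot open $filename");
    }

    fread($fo, $headerLen);                // consume and discard the header bytes
    $body = stream_get_contents($fo);      // everything after the header
    fclose($fo);

    return $body === false ? '' : $body;
}

// Demo with a fake file: 2560 filler bytes followed by some text.
$tmp = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmp, str_repeat("\xFF", 2560) . 'To do');

echo readAfterHeader($tmp), "\n";
unlink($tmp);
```

The returned string would then be what gets passed on to the clean-up step, so the double-NUL search only ever sees the bytes after the header.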