I need to borrow a genius

Jocka · December 8, 2006

This code came from a script redbullmarky has. Its objective is to strip the text out of a doc file, but it isn't working for me. I've looked through everything and found where it all goes wrong, but then I'm stuck.

Here's the code used to "decode":

[code]
function stripSpecial($input)
{
    $search = array(chr(145),
                    chr(146),
                    chr(96),
                    chr(132),
                    chr(147),
                    chr(148),
                    chr(133),
                    chr(150));

    $replace = array("'",
                     "'",
                     "'",
                     '"',
                     '"',
                     '"',
                     '...',
                     '-');
    $output = addslashes(str_replace($search, $replace, $input));

    // now strip out all the junk/control chars, etc
    $output = stripslashes(substr($output, 0, strpos($output, '\0\0')));

    // get rid of any remaining control chars
    $output = preg_replace('/'.chr(19).'(.*?)'.chr(20).'/', '', $output);
    $output = str_replace(chr(21), '', $output);

    return $output;
}
[/code]

It goes wrong here:
[code]
$output = stripslashes(substr($output, 0, strpos($output, '\0\0')));
[/code]
After it goes through the hassle of stripping out most of the code and looking for the text, it comes to this point. This (I'm guessing) is supposed to find a significant point in the doc file where the text starts, but that isn't the case for MY doc files. From the looks of it, it's trying to find where this point starts and read after it, but when it does this, all it returns is: [b]x[/b]
I've tried everything I could think of, but since the top section randomly changes length, I can't use a fixed length, and since '\0\0' doesn't come up in my doc files, it doesn't find a place to stop.

Any ideas?

Ninjakreborn · December 8, 2006

I am impressed red was even able to create that; that is amazing.
I have spent my whole time thinking it was impossible.
So I don't know how to help. I didn't even know that could be done.

Jocka · December 8, 2006

Yeah, it CAN be done. With a 3rd-party module I can do it, but my server can't take anything (it's crap...).

roopurt18 · December 8, 2006

By doc file I'm assuming you mean MS Word. There could be a version problem with the code you have. This might help you out some:

http://www.wotsit.org/search.asp?s=text

(EDIT) Looking again at the code you have, it looks like that particular function is looking for a double null char as the sentinel where the text ends, not where it begins.
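To make that concrete, here is a minimal sketch of what that substr/strpos line does, using a made-up input string rather than a real .doc file. addslashes() turns each NUL byte into the two characters backslash and 0, so the single-quoted '\0\0' needle matches two escaped NULs in a row, and substr() keeps only what comes before them. The sample string and the explicit false check are assumptions for illustration, not part of red's code.

```php
<?php
// Hypothetical sample: "Hello", two NUL bytes, then trailing junk.
$raw = "Hello" . chr(0) . chr(0) . "junk";

// addslashes() rewrites each NUL byte as backslash + "0".
$escaped = addslashes($raw);

// In single quotes, '\0\0' is four literal characters, so it matches
// the two escaped NULs produced above.
$end = strpos($escaped, '\0\0');

// strpos() returns false when the marker is missing, so check before cutting.
$text = ($end === false) ? $escaped : substr($escaped, 0, $end);
$text = stripslashes($text);

echo $text, "\n";
```

Everything from the first double NUL onward is thrown away, which is why the marker acts as an end point rather than a start point.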

roopurt18 · December 8, 2006

I wrote up a little script to test this out myself.

Here is the beginning of the raw contents of my doc file:
[code]ÐÏà¡±á>þÿ þÿÿÿ‹Œÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á!` ð¿X8bjbj\\ .^>Ç>ÇX0ÿÿÿÿÿÿ¤¼:¼:¼:¼:,è:dhX;X;X;X;X;X;X;X;ƒƒƒƒƒƒ$j‘hÒ“|§ÓÇ¼:%@î™‰HÒ0á‰„N”A N”eN”eX;Z²;@ÉAò;4&<X;X;X;§§³AX;X;X;Ó<Ó<Ó<Ó<äø"Äø"ÿÿÿÿPharaoh Information Systems
To do
Last Update: Thu. Nov 9, 2006
[/code]

If I add the following to the function:
[code] $output = addslashes(str_replace($search, $replace, $input));
echo "<pre>" . print_r($output, true) . "</pre>";
[/code]
then the beginning of $output looks like:
[code]
ÐÏà¡±á\0\0
[/code]

So in my case the substr portion returns only:
[code]ÐÏà¡±á[/code] as my word document.

However, there are many more sequences of \0\0 before the actual document begins. Chances are red means to capture the last of those sequences and return that with the substr function.
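One possible reading of that guess, sketched on a made-up string: find the last escaped double NUL with strrpos() and keep what follows it. This is only an assumption about the intent, not red's actual code, and the sample input is hypothetical.

```php
<?php
// Hypothetical input: junk, a NUL pair, more junk, another NUL pair, then text.
$raw = "AB" . chr(0) . chr(0) . "CD" . chr(0) . chr(0) . "Real text";
$escaped = addslashes($raw);

$marker = '\0\0'; // literal backslash-zero twice, as produced by addslashes()

$first = strpos($escaped, $marker);  // where the current code cuts
$last  = strrpos($escaped, $marker); // last occurrence of the marker

// Keep whatever follows the last marker; fall back to the whole string
// when the marker is missing.
$after = ($last === false)
    ? $escaped
    : substr($escaped, $last + strlen($marker));
$after = stripslashes($after);

echo $after, "\n";
```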

Jocka · December 8, 2006

The problem here is, there's really no telling where the document actually BEGINS (as far as I can see, anyway). There has to be a way to strip all that out, though. Red said it works perfectly for him... I don't know, I'm lost. I'm doing the 3rd-party searches again to see if I can find a compatible one.

roopurt18 · December 8, 2006

If you look at the wotsit link I provided, you will get detailed information on how to strip the relevant information out of a .doc file. However, if you're looking for quick and dirty, you could create a regexp character class that contains all the characters you want to keep, negate it, and replace everything the negated class matches with an empty string. Not as reliable, but probably quicker than dissecting the format.
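A rough sketch of that quick-and-dirty approach, assuming you only want printable ASCII plus tabs and line breaks. The function name keepPrintable and the sample bytes are made up for illustration.

```php
<?php
// Hypothetical helper: keep printable ASCII plus tab, CR and LF.
function keepPrintable($input)
{
    // [^...] negates the class, so every byte NOT listed is removed.
    return preg_replace('/[^\x20-\x7E\t\r\n]/', '', $input);
}

// Made-up sample: a few binary bytes, some text, two NULs, more text.
$raw = "\xD0\xCF\x11\xE0" . "To do\n" . chr(0) . chr(0) . "Last Update";

echo keepPrintable($raw), "\n";
```

Binary bytes that happen to fall in the printable range (such as the "bjbj" run in the dump above) survive this filter, so expect some stray letters in the output.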

Jocka · December 8, 2006

Yeah, I tried the regex thing... then I got stuck with a bunch of useless letters sitting in the middle. I did look at that link, btw, but I'm for that "quick" way.

redbullmarky · December 8, 2006

Jocka, the rest of the code I sent you strips the header off the doc file before the stripping of stuff even begins.
In the loadFile method, you'll notice this:

[code]
<?php
function loadFile($filename = '', $header_len = 2560)
{
    // ... check stuff

    $this->header = fread($fo, $header_len);

    //...read rest of file
}
?>
[/code]

which whips off the first 2560 bytes from the DOC, so it's actually this bit that gets to the start of the 'real' content. I came up with this figure after testing several doc files from Word 97/Office XP, etc. Then comes the stripSpecial method that cleans things up.
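As a standalone illustration of the header-skipping step (a sketch only, not the actual class): the function name readDocBody, the use of stream_get_contents() for the rest of the file, and the temp-file demo are all assumptions.

```php
<?php
// Hypothetical standalone version of the idea: read and discard a
// fixed-size header, then return everything after it.
function readDocBody($filename, $header_len = 2560)
{
    $fo = fopen($filename, 'rb');
    if ($fo === false) {
        return false;
    }

    $header = fread($fo, $header_len); // first $header_len bytes, discarded
    $body   = stream_get_contents($fo); // the rest of the file

    fclose($fo);
    return $body;
}

// Demo with a fake file: 2560 filler bytes followed by some text.
$tmp = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmp, str_repeat("\xFF", 2560) . "Hello");

$body = readDocBody($tmp);
unlink($tmp);

echo $body, "\n";
```

The 2560 default is just the figure from my own test files; if your header is a different length, pass another value as the second argument.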

Part of the stripSpecial method came from Bartek's post in the PHP manual [url=http://uk2.php.net/htmlentities]here[/url], to which I added bits to clean things up even further.
