This code came from a script redbullmarky has. Its purpose is to strip the text out of a doc file, but it isn't working for me. I've gone through everything and found where it goes wrong, but I'm stuck on how to fix it.

Here's the code used to "decode":

[code]
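function stripSpecial($input)
{
    // Windows-1252 "smart" punctuation (plus the backtick) to normalise
    $search = array(chr(145),  // left single quote
                    chr(146),  // right single quote
                    chr(96),   // backtick
                    chr(132),  // low double quote
                    chr(147),  // left double quote
                    chr(148),  // right double quote
                    chr(133),  // ellipsis
                    chr(150)); // en dash

    $replace = array("'",
                     "'",
                     "'",
                     '"',
                     '"',
                     '"',
                     '...',
                     '-');

    // addslashes() also turns every NUL byte into the two characters \0
    $output = addslashes(str_replace($search, $replace, $input));

    // now strip out all the junk/control chars, etc
    $output = stripslashes(substr($output, 0, strpos($output, '\0\0')));

    // get rid of any remaining control chars
    $output = preg_replace('/'.chr(19).'(.*?)'.chr(20).'/', '', $output);
    $output = str_replace(chr(21), '', $output);

    return $output;
}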
[/code]

It goes wrong here:
[code]
$output = stripslashes(substr($output, 0, strpos($output, '\0\0')));
[/code]
By the time it gets to this point, it has gone through the hassle of stripping out most of the junk and is looking for the text. I'm guessing this line is supposed to find a significant point in the doc file where the text starts, but that isn't the case for MY doc files. From the looks of it, it tries to find where this point is and read after it, but all it returns is: [b]x[/b]
I've tried everything I could think of, but since the top section randomly changes length I can't use a fixed length, and since '\0\0' doesn't come up in my doc files, it doesn't find a place to stop.

Any ideas?

By doc file I'm assuming you mean MS Word.  There could be a version problem with the code you have.  This might help you out some:

http://www.wotsit.org/search.asp?s=text

(EDIT) Looking at your code again, it looks like that particular function treats a double null char as the sentinel where the text ends, not where it begins.
I wrote up a little script to test this out myself.

Here is the beginning of the raw contents of my doc file:
[code]ÐÏࡱá>þÿ þÿÿÿ‹Œÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿì¥Á!` ð¿X8bjbj\­\­ .^>Ç>ÇX0ÿÿÿÿÿÿ¤¼:¼:¼:¼:,è:dhX;X;X;X;X;X;X;X;ƒƒƒƒƒƒ$j‘hÒ“|§ÓǼ:%@HÒ0á‰„N”A N”eN”eX;Z²;@ÉAò;4&<­X;X;X;§§³AX;X;X;Ó<Ó<Ó<Ó<äø"Äø"ÿÿÿÿ Pharaoh Information Systems
To do
Last Update: Thu. Nov 9, 2006
[/code]

If I add the following to the function:
[code] $output = addslashes(str_replace($search, $replace, $input));
    echo "<pre>" . print_r($output, true) . "</pre>";
[/code]
then the beginning of $output looks like:
[code]
ÐÏࡱá\0\0
[/code]

So in my case the substr portion returns only:
[code]ÐÏࡱá[/code] as my word document.

However, there are many more sequences of \0\0 before the actual document begins.  Chances are red means to capture the last of those sequences and return that with the substr function.
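To make the first-versus-last idea concrete, here is a minimal sketch. It assumes the marker really is two consecutive NUL bytes; `cutAtDoubleNull` and its `$useLast` flag are made-up names, not part of red's script. The original line matches because `addslashes()` rewrites every NUL byte as the two characters `\0`, so the single-quoted `'\0\0'` matches two escaped NULs. The sketch searches the raw bytes with a double-quoted `"\0\0"` instead, and it also checks for `false`, because when `strpos()` finds nothing the original `substr()` call gets a length of 0 and returns an empty string.

```php
<?php
// Hypothetical helper, not from the original script.
// Cuts $input at a run of two NUL bytes: the first one by default,
// or the last one when $useLast is true.
function cutAtDoubleNull(string $input, bool $useLast = false): string
{
    $marker = "\0\0"; // double quotes: two real NUL bytes

    $pos = $useLast ? strrpos($input, $marker) : strpos($input, $marker);

    // strpos()/strrpos() return false when the marker is missing.
    // Leave the input untouched instead of truncating it to nothing.
    if ($pos === false) {
        return $input;
    }

    return substr($input, 0, $pos);
}
```

Calling `cutAtDoubleNull($data)` mimics what the posted line does when the marker is present, while `cutAtDoubleNull($data, true)` keeps everything up to the last double NUL. Whether either position is the right one for a given Word version is still a guess.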
The problem here is, there's really no telling where the document actually BEGINS (as far as I can see, anyway). There has to be a way to strip all that out, though. Red said it works perfectly for him... I don't know, I'm lost. I'm doing the third-party searches again to see if I can find a compatible one.
If you look at the wotsit link I provided, you will get detailed information on how to strip the relevant information out of a .doc file.  However, if you're looking for quick and dirty, you could create a regexp that contains all the characters you want to keep, negate it, and replace everything in the negated class with an empty string.  Not as reliable, but probably quicker than dissecting the format.
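A rough sketch of that quick-and-dirty approach, assuming the characters worth keeping are tab, newline, carriage return and printable ASCII. The function name `keepPrintable` and the exact character class are illustrative choices only:

```php
<?php
// Hypothetical helper: remove every byte that is NOT in the "keep" class.
// Kept: tab (0x09), LF (0x0A), CR (0x0D) and printable ASCII (0x20-0x7E).
function keepPrintable(string $raw): string
{
    // The leading ^ negates the class, so everything outside it is replaced.
    $clean = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $raw);

    // preg_replace() returns null on a regex error; fall back to an empty string.
    return $clean ?? '';
}
```

Binary junk that happens to be printable ASCII (the "bjbj" run in the dump above, for instance) survives this filter, which is why it is less reliable than parsing the format properly.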
Jocka, the rest of the code I sent you strips the header off the doc file before the stripping of stuff even begins.
In the loadFile method, you'll notice this:

[code]
<?php
function loadFile($filename = '', $header_len = 2560)
{
// ... check stuff

  $this->header = fread($fo, $header_len);

//...read rest of file
}
?>
[/code]

which whips off the first 2560 bytes from the DOC, so it's actually this bit that gets to the start of the 'real' content. I came up with this figure after testing several doc files from Word 97/Office XP, etc. Then comes the stripSpecial method that cleans things up.

Part of the stripSpecial method came from Bartek's post in the PHP manual [url=http://uk2.php.net/htmlentities]here[/url], to which I added bits to clean things up even further.
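For anyone without the full class, here is a standalone sketch of the header-skipping step. It is not the real loadFile method, whose body is elided above; `readBodyAfterHeader` is a made-up name, and it uses `fseek()` to jump past the header rather than `fread()` to store it. The 2560-byte default is the figure from this post and may not suit every Word version.

```php
<?php
// Hypothetical standalone version of the "skip the header" idea.
// Returns everything in the file after the first $headerLen bytes.
function readBodyAfterHeader(string $filename, int $headerLen = 2560): string
{
    $fo = fopen($filename, 'rb'); // binary-safe read
    if ($fo === false) {
        throw new RuntimeException("Cannot open $filename");
    }

    fseek($fo, $headerLen);           // jump past the assumed header
    $body = stream_get_contents($fo); // read the rest of the file
    fclose($fo);

    return $body === false ? '' : $body;
}
```

The returned string is what would then be passed to stripSpecial() for clean-up.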