
Clean Text before posting to Db


StefanRSA


I am parsing an XML file and importing the data into a MySQL database.

One of the nodes is text, but it contains random URLs, sometimes CSS and HTML, and sometimes runs of two, three or four <br>, <ul> or <li> tags.

I want to strip the URLs, CSS and HTML from the text before importing it, but keep it tidy enough to display as regular text on a web page with normal spacing. I also need to get rid of stray extra spaces and similar clutter.

The code I am using is rather dated and gives me broken text.

What would be a better approach than the following code?

$fullt=str_replace("\n","",$texta);

$fullt = preg_replace("/[\r\n]+/", "\n", $fullt);
$fullt=trim($fullt);
$fullt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $fullt);
$fullt = preg_replace('|www\.[a-z\.0-9]+|i', '', $fullt);
$fullt=str_replace('">',"",$fullt);
$fullt=str_replace('<"',"",$fullt);
$fullt=preg_replace('/\s*$^\s*/m', "\n", $fullt);
$fullt=str_replace('<BR/>',"<br>",$fullt);
$fullt=str_replace('<br><br><br>',"<br>",$fullt);
$fullt=str_replace('<br> <br>',"<br>",$fullt);
$fullt=str_replace('&nbsp',"",$fullt);
$fullt=str_replace('<br><br>',"<br>",$fullt);
$fullt=str_replace('a href',"",$fullt);
$fullt=str_replace('"',"",$fullt);
$fullt=str_replace(';',"",$fullt);
$fullt=str_replace(';',"",$fullt);
$fullt=str_replace('^',"",$fullt);
$qs='?';
$fullt = preg_replace( '@^(<br\\b[^>]*/'.$qs.'>)+@i', '', $fullt);               //// Text ready
$fullt=mysql_real_escape_string($fullt);

For years I've used the W3C markup validator: http://validator.w3.org/

and have virtually always been told to use

<br />
but it looks like it's open to debate: http://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

Oh, I only just noticed that the code in my earlier post isn't there. I wonder why that keeps happening here?

 

I was going to suggest using htmlentities(), after which you could use the corresponding decode function, html_entity_decode(), when displaying. But then I thought that you probably just want to strip the tags, for which strip_tags() should be the one.
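A minimal sketch of the strip_tags() route, assuming the goal is plain text with one newline per line break and no URLs. The function name clean_text(), the list of block tags and the URL pattern are illustrative assumptions, not something from the original code:

```php
<?php
// Hypothetical helper: reduce an HTML fragment to tidy plain text.
function clean_text(string $html): string
{
    // Drop <style> and <script> blocks together with their contents;
    // strip_tags() alone would leave the CSS/JS text behind.
    $text = preg_replace('~<(style|script)\b[^>]*>.*?</\1>~is', '', $html);

    // Turn <br> and common block tags into newlines so words do not run together.
    $text = preg_replace('~<br\s*/?>|</?(?:p|li|ul|ol|div)\b[^>]*>~i', "\n", $text);

    // Remove all remaining tags, then decode entities such as &nbsp; and &amp;.
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Remove bare URLs (a deliberately simple pattern).
    $text = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $text);

    // Treat non-breaking spaces as spaces, collapse runs of spaces and tabs,
    // and squeeze each group of line breaks into a single newline.
    $text = str_replace("\u{00A0}", ' ', $text);
    $text = preg_replace('~[ \t]+~', ' ', $text);
    $text = preg_replace('~ *\R\s*~', "\n", $text);

    return trim($text);
}
```

Because entities are decoded, the result is raw text; it would normally be passed through htmlspecialchars() (and nl2br() if the line breaks should show) when it is printed on a page.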

You could reduce the number of str_replace() calls by using regular expressions. For example, to replace all <br> and <br/> tags, no matter how many there are, you could do something like

<?php
$fullt = 'This is my<br><br>break text<br><br><BR/><br><BR><br/><br>hi<br><BR>kjsdf<br/><br/><br/>fsdffsd<br/><br/>df';
print htmlentities($fullt) . '<br>';
 
$fullt = preg_replace("~(<br>|<br/>)+~i", "<br>", $fullt);
print htmlentities($fullt);
?>
Note that the "i" at the end of the regular expression makes it case insensitive. That way it also replaces <BR> and <BR/>.
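One caveat: that pattern does not match <br /> written with a space, or breaks that are separated by spaces or newlines. A slightly more tolerant variant, as a sketch only (the sample string is made up):

```php
<?php
$fullt = "one<br><BR/> <br />\n<br>two<br >three";

// "<br", optional whitespace, optional slash, ">", then any trailing
// whitespace; one or more of these in a row become a single <br>.
$fullt = preg_replace('~(?:<br\s*/?>\s*)+~i', '<br>', $fullt);

print htmlentities($fullt);
```

Note that this also swallows whitespace directly after a run of breaks, and it still ignores <br> tags that carry attributes.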

Cool,

 

I tried to use this function given on php.net:

function strip_word_html($text, $allowed_tags = '')
    {
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
        $replace = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
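        if(mb_stripos($text, '/*') !== FALSE){
            //preg_replace understands the #...# delimiters and the s flag; mb_eregi_replace takes patterns without delimiters
            $text = preg_replace('#/\*.*?\*/#su', '', $text);
        }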
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        //on some of the newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments
        $num_matches = preg_match_all("/<!--/u", $text, $matches);
        if($num_matches){
              //non-greedy, so text between two separate comments is kept
              $text = preg_replace('/<!--.*?-->/isu', '', $text);
        }
        return $text;
    } 

I have two issues...
1. How light is this function if I want to parse a 2 GB XML file (an estimated 400 000 entries)? Will it have a bad impact on server load?

2. This also did not solve my problem with the "£;" issue, which is still left after doing the following:

$fullt=strip_word_html($texta);/// 
$fullt=str_replace('£;',"£",$fullt);

Will this then be safe to insert directly into a Db field?
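On the safety question: cleaning the text and escaping it for SQL are separate concerns. mysql_real_escape_string() belongs to the old mysql extension, which was deprecated in PHP 5.5 and removed in PHP 7, so a prepared statement through PDO or mysqli is the usual route now. A sketch assuming PDO; the helper name insert_texts(), the table "ads" and the column "body" are hypothetical:

```php
<?php
// Hypothetical helper: insert many cleaned text values through one
// prepared statement. Table "ads" and column "body" are assumptions.
function insert_texts(PDO $pdo, iterable $texts): int
{
    // Prepared once, executed per row: the value is bound as a parameter,
    // so quotes and semicolons in the text cannot alter the SQL.
    $stmt = $pdo->prepare('INSERT INTO ads (body) VALUES (:body)');

    $count = 0;
    foreach ($texts as $text) {
        $stmt->execute([':body' => $text]);
        $count++;
    }
    return $count;
}

// Usage (DSN and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// insert_texts($pdo, [$fullt]);
```

With bound parameters there is no need to strip quotes, semicolons or carets from the text just to make the query work.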

 

Thanks for all your help!!!
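On the server-load question for a 2 GB file: the per-entry regular expressions are probably not the main cost. Memory is the bigger risk if the whole file is loaded at once, for example with simplexml_load_file(). XMLReader reads the file node by node instead. A sketch, assuming each entry is an <item> element with a <description> child; both element names and the function name read_descriptions() are hypothetical:

```php
<?php
// Hypothetical generator: yield the text of one child node per entry
// without loading the whole document into memory.
function read_descriptions(string $path, string $entryTag = 'item', string $textTag = 'description'): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    // Advance to the first entry element.
    while ($reader->read() && $reader->name !== $entryTag) {
    }

    while ($reader->name === $entryTag) {
        // Only this single entry is expanded into a small in-memory tree.
        $entry = new SimpleXMLElement($reader->readOuterXml());
        yield (string) $entry->{$textTag};

        // Jump to the next entry element, skipping the current subtree.
        $reader->next($entryTag);
    }

    $reader->close();
}

// Usage: foreach (read_descriptions('feed.xml') as $texta) { /* clean and insert */ }
```

Each yielded string can then be cleaned and inserted one at a time, so memory use stays roughly constant regardless of file size.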
