Clean Text before posting to Db

StefanRSA · October 28, 2013

I am parsing an XML file and importing the data into a MySQL database.

One of the nodes is text but contains random URLs, sometimes CSS and HTML, and sometimes repeated <br>, <ul> and <li> tags (x2, x3, x4).

I want the imported text to be free of URLs, CSS and HTML, but still tidy enough to display as regular text on a web page with normal spacing. I also need to get rid of stray extra spaces and similar clutter.

The code I am using is not very good and gives me broken text.

What would be a better approach than my code below?

$fullt=str_replace("\n","",$texta);

$fullt = preg_replace("/[\r\n]+/", "\n", $fullt);
$fullt=trim($fullt);
$fullt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $fullt);
$fullt = preg_replace('|www\.[a-z\.0-9]+|i', '', $fullt);
$fullt=str_replace('">',"",$fullt);
$fullt=str_replace('<"',"",$fullt);
$fullt=preg_replace('/\s*$^\s*/m', "\n", $fullt);
$fullt=str_replace('<BR/>',"<br>",$fullt);
$fullt=str_replace('<br><br><br>',"<br>",$fullt);
$fullt=str_replace('<br> <br>',"<br>",$fullt);
$fullt=str_replace('&nbsp',"",$fullt);
$fullt=str_replace('<br><br>',"<br>",$fullt);
$fullt=str_replace('a href',"",$fullt);
$fullt=str_replace('"',"",$fullt);
$fullt=str_replace(';',"",$fullt);
$fullt=str_replace(';',"",$fullt);
$fullt=str_replace('^',"",$fullt);
$qs='?';
$fullt = preg_replace( '@^(<br\\b[^>]*/'.$qs.'>)+@i', '', $fullt);               //// Text ready
$fullt=mysql_real_escape_string($fullt);

Ch0cu3r · October 28, 2013

Maybe use strip_tags instead
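A minimal sketch of how a strip_tags-based cleanup might look, assuming UTF-8 input and PHP 7 or later. The helper name clean_text() and the choice of which tags become line breaks are made up for illustration, not taken from the original code:

```php
<?php
// Hypothetical helper: reduce an HTML fragment to tidy plain text.
function clean_text(string $html): string
{
    // Drop <style> and <script> blocks together with their contents.
    $text = preg_replace('~<(style|script)\b[^>]*>.*?</\1>~is', '', $html);
    // Turn <br>, <br/>, <br /> and a few closing block tags into newlines
    // so the line structure survives the tag stripping.
    $text = preg_replace('~<br\s*/?>|</(?:p|li|div)>~i', "\n", $text);
    // Remove the remaining tags, then decode entities such as &pound;.
    $text = strip_tags($text);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Remove bare URLs (http://, https:// or www.).
    $text = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $text);
    // Collapse runs of spaces, tabs and non-breaking spaces into one space.
    $text = preg_replace('~[ \t\x{00A0}]+~u', ' ', $text);
    // Trim spaces around line breaks, then squeeze repeated line breaks.
    $text = preg_replace('~ *\R *~u', "\n", $text);
    $text = preg_replace('~\n{2,}~', "\n", $text);
    return trim($text);
}
```

Tags are stripped before entities are decoded so that an encoded "&lt;" in the text is not mistaken for the start of a tag. The result is plain text, so on output it would still need something like nl2br(htmlspecialchars($text)) to display safely with its line breaks.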

mentalist · October 28, 2013

Also you should be replacing the line breaks with proper validated line breaks, e.g.

StefanRSA · October 28, 2013

Thanks for the reply.

How would I replace line breaks with proper validated line breaks?

When using plain strip_tags it still leaves me with non-text characters like:

’ for '

– for -

£; - for £

mentalist · October 28, 2013

For years I've used the W3C validator: http://validator.w3.org/

and virtually always been told to use

<br />

but it looks like it's open to debate: http://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

Oh, only just noticed that the code in my earlier post isn't there, I wonder why that keeps on happening here???

I was going to suggest using htmlentities() and then you could use the corresponding decode function when displaying... but I then thought that you probably actually want to just strip them, for which strip_tags() should be the one...

mentalist · October 28, 2013

I can't edit that other post now! But it appears that my code quoting issue is due to me using noscript... very odd!

cyberRobot · October 28, 2013

You could reduce the number of str_replace() calls by using regular expressions. For example, to replace all <br> and <br/> tags, no matter how many there are, you could do something like

<?php
$fullt = 'This is my<br><br>break text<br><br><BR/><br><BR><br/><br>hi<br><BR>kjsdf<br/><br/><br/>fsdffsd<br/><br/>df';
print htmlentities($fullt) . '<br>';
 
$fullt = preg_replace("~(<br>|<br/>)+~i", "<br>", $fullt);
print htmlentities($fullt);
?>

Note that the "i" at the end of the regular expression makes it case insensitive. That way it also replaces <BR> and <BR/>.
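As a possible extension of the example above (a sketch, with a made-up sample string), the pattern can also allow whitespace before the slash and between consecutive tags, so that "<br />" and "<br> <br>" are collapsed as well:

```php
<?php
$fullt = 'one<br> <br />two<BR/><br>three';
// "<br", optional whitespace, optional slash, ">", then any trailing
// whitespace; one or more of these in a row become a single <br>.
$fullt = preg_replace('~(?:<br\s*/?>\s*)+~i', '<br>', $fullt);
print htmlentities($fullt);
?>
```

Note that this version also swallows any whitespace that directly follows the last tag in a run.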

StefanRSA · October 28, 2013

Cool,

I tried to use this function given on php.net:

function strip_word_html($text, $allowed_tags = '')
    {
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
        $replace = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        $num_matches = preg_match_all("/\<!--/u", $text, $matches);
        if($num_matches){
              $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
        }
        return $text;
    }

I have two issues...
1. How heavy is this function if I want to parse a 2 GB XML file (an estimated 400 000 entries)? Will it have a bad impact on server load?

2. It also did not solve my "£;" problem, which is still left after doing the following:

$fullt=strip_word_html($texta);/// 
$fullt=str_replace('£;',"£",$fullt);

Will this then be safe to directly insert into a Db field?

Thanks for all your help!!!
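On the first question, one common way to keep memory use flat for a file of that size is to stream it with XMLReader instead of loading it whole. The following is only a sketch under assumptions: the element name "description" and the helper name stream_descriptions() are hypothetical, since the thread does not show the XML layout.

```php
<?php
// Hypothetical helper: yields the text of each <description> element
// one at a time, without loading the whole file into memory.
function stream_descriptions(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    while ($reader->read()) {
        // Only react to opening tags of the element we care about.
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->name === 'description') {
            // readString() returns the node's text content as a string.
            yield $reader->readString();
        }
    }
    $reader->close();
}
```

Each yielded string could then be cleaned and inserted in turn, so only one entry is held in memory at any time.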
