
Clean Text before posting to Db


I am parsing an XML file and importing the data into a MySQL database.

One of the nodes is text, but it contains random URLs, sometimes CSS and HTML, and sometimes runs of repeated <br>, <ul>, and <li> tags (two, three, or four in a row).

I want to clean the text before import so that it has no URLs, CSS, or HTML, but stays tidy enough to display as regular text on a webpage with normal spacing. I also need to get rid of stray extra whitespace and similar clutter.

The code I am using is dated and gives me broken text.

What would be a better approach than the following code?


$fullt = preg_replace("/[\r\n]+/", "\n", $fullt);
$fullt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $fullt);
$fullt = preg_replace('|www\.[a-z\.0-9]+|i', '', $fullt);
$fullt=preg_replace('/\s*$^\s*/m', "\n", $fullt);
$fullt=str_replace('<br> <br>',"<br>",$fullt);
$fullt=str_replace('a href',"",$fullt);
$fullt = preg_replace( '@^(<br\\b[^>]*/'.$qs.'>)+@i', '', $fullt);               //// Text ready

For years I've used the W3C validator: http://validator.w3.org/

and virtually always been told to use

<br />
but it looks like it's open to debate: http://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

Oh, I only just noticed that the code in my earlier post isn't there. I wonder why that keeps happening here?


I was going to suggest using htmlentities() and then the corresponding decode function, html_entity_decode(), when displaying. On reflection, though, you probably just want to strip the tags, and strip_tags() should be the right tool for that.
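A minimal sketch of the strip_tags() route, assuming the input is UTF-8 and that line breaks should survive as plain newlines. The helper name clean_text() and the choice of block tags are illustrative assumptions, not something from the original posts.

```php
<?php
// Hypothetical helper (name and rules are assumptions for illustration).
function clean_text(string $html): string
{
    // Drop <style>/<script> blocks together with their contents;
    // strip_tags() alone would leave the CSS/JS text behind.
    $text = preg_replace('~<(style|script)\b[^>]*>.*?</\1>~is', '', $html);
    // Turn <br> and common block-closing tags into newlines so the
    // paragraph structure survives tag removal.
    $text = preg_replace('~<br\s*/?>|</(?:p|li|div|h[1-6])>~i', "\n", $text);
    // Remove every remaining tag.
    $text = strip_tags($text);
    // Convert entities such as &amp; or &pound; to real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Remove bare URLs left in the text (greedy: also eats trailing punctuation).
    $text = preg_replace('~\b(?:https?://|www\.)\S+~i', '', $text);
    // Collapse runs of spaces/tabs, trim spaces around newlines,
    // then collapse repeated newlines into one.
    $text = preg_replace('~[ \t]+~', ' ', $text);
    $text = preg_replace('~ ?\n ?~', "\n", $text);
    $text = preg_replace('~\n{2,}~', "\n", $text);
    return trim($text);
}
```

Because entities are decoded after the tags are stripped, the result is plain text rather than HTML, so it should be passed through htmlspecialchars() when it is printed on a page.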

Edited by mentalist

You could reduce the number of str_replace() calls by using regular expressions. For example, to collapse any run of <br> and <br/> tags into a single <br>, no matter how many there are, you could do something like this:

$fullt = 'This is my<br><br>break text<br><br><BR/><br><BR><br/><br>hi<br><BR>kjsdf<br/><br/><br/>fsdffsd<br/><br/>df';
print htmlentities($fullt) . '<br>';
$fullt = preg_replace("~(<br>|<br/>)+~i", "<br>", $fullt);
print htmlentities($fullt);
Note that the "i" at the end of the regular expression makes it case insensitive. That way it also replaces <BR> and <BR/>.
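The pattern above does not match `<br />` (with a space), nor breaks separated by spaces or newlines, such as the `<br> <br>` in the original question. A slightly more tolerant variant might look like the sketch below; the function name collapse_breaks() is made up for illustration.

```php
<?php
// Hypothetical helper: collapse any run of <br>, <br/> or <br /> tags,
// including whitespace between them, into a single <br>.
function collapse_breaks(string $html): string
{
    // <br\s*/?>  matches <br>, <br/>, <br /> (case-insensitive via "i")
    // \s*        swallows spaces/newlines after each break
    return preg_replace('~(?:<br\s*/?>\s*)+~i', '<br>', $html);
}

print htmlentities(collapse_breaks("one<br> <br />\n<BR/>two<br>three"));
```

Note that this also removes whitespace directly after a break, which is usually harmless in HTML output.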



I tried to use this function from php.net:

function strip_word_html($text, $allowed_tags = '')
{
    //replace MS special characters first
    $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
    $replace = array('\'', '\'', '"', '"', '-');
    $text = preg_replace($search, $replace, $text);
    //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
    //in some MS headers, some html entities are encoded and some aren't
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    //try to strip out any C style comments first, since these, embedded in html comments, seem to
    //prevent strip_tags from removing html comments (MS Word introduced combination)
    if (mb_stripos($text, '/*') !== FALSE) {
        //preg_replace is used here because mb_eregi_replace patterns take no delimiters or trailing flags
        $text = preg_replace('#/\*.*?\*/#s', '', $text);
    }
    //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be stripped:
    //'<1' becomes '< 1' (note: somewhat application specific)
    $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
    $text = strip_tags($text, $allowed_tags);
    //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
    $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
    //strip out inline css and simplify style tags
    $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
    $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
    $text = preg_replace($search, $replace, $text);
    //on some of the newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
    //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
    //some MS Style Definitions - this last bit gets rid of any leftover comments
    $num_matches = preg_match_all("/\<!--/u", $text, $matches);
    if ($num_matches) {
        $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
    }
    return $text;
}

I have two issues...
1. How lightweight is this function if I want to parse a 2 GB XML file (an estimated 400,000 entries)? Will it have a bad impact on server load?

2. This also did not solve my problem with the "£;" issue, which is still left after doing the following:


Will the result then be safe to insert directly into a Db field?


Thanks for all your help!!!

