StefanRSA Posted October 28, 2013

I am parsing an XML file and importing the data into a MySQL database. One of the nodes is text, but it contains random URLs, sometimes CSS and HTML, and sometimes repeated <br>, <ul> and <li> tags (x2, x3, x4). I want to clean the text so that it is imported without URLs, CSS or HTML, but stays tidy enough to be displayed as regular text on a web page with regular spacing. I also need to get rid of untidy extra spaces and so on. The code I am using is not so fresh and gives me broken text. What would be a more acceptable way than the following code?

    $fullt=str_replace("\n","",$texta);
    $fullt = preg_replace("/[\r\n]+/", "\n", $fullt);
    $fullt=trim($fullt);
    $fullt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $fullt);
    $fullt = preg_replace('|www\.[a-z\.0-9]+|i', '', $fullt);
    $fullt=str_replace('">',"",$fullt);
    $fullt=str_replace('<"',"",$fullt);
    $fullt=preg_replace('/\s*$^\s*/m', "\n", $fullt);
    $fullt=str_replace('<BR/>',"<br>",$fullt);
    $fullt=str_replace('<br><br><br>',"<br>",$fullt);
    $fullt=str_replace('<br> <br>',"<br>",$fullt);
    $fullt=str_replace(' ',"",$fullt);
    $fullt=str_replace('<br><br>',"<br>",$fullt);
    $fullt=str_replace('a href',"",$fullt);
    $fullt=str_replace('"',"",$fullt);
    $fullt=str_replace(';',"",$fullt);
    $fullt=str_replace(';',"",$fullt);
    $fullt=str_replace('^',"",$fullt);
    $qs='?';
    $fullt = preg_replace( '@^(<br\\b[^>]*/'.$qs.'>)+@i', '', $fullt);
    //// Text ready
    $fullt=mysql_real_escape_string($fullt);
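For the URL and spacing part of the question, a smaller set of regular expressions may be easier to reason about than the long chain of str_replace() calls. The following is only a sketch: the function name tidy_text() is made up for illustration, and the URL pattern is a rough heuristic (anything starting with http://, https:// or www. up to the next whitespace, quote or angle bracket), not a full URL parser.

```php
<?php
// Hypothetical helper, not from the thread: remove URLs and tidy whitespace.
function tidy_text(string $text): string
{
    // Drop http(s):// and bare www. URLs, up to the next whitespace,
    // quote or angle bracket (a heuristic, not a complete URL grammar).
    $text = preg_replace('~\b(?:https?://|www\.)[^\s<>"\']+~i', '', $text);

    // Normalise Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('~[ \t]+~', ' ', $text);

    // Remove a single space left directly before or after a line break.
    $text = preg_replace('~ ?\n ?~', "\n", $text);

    // Allow at most one blank line between paragraphs.
    $text = preg_replace('~\n{3,}~', "\n\n", $text);

    return trim($text);
}
```

Calling tidy_text() before any tag handling keeps the two concerns (URLs and spacing versus markup) separate, which makes each step easier to test on its own.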
Ch0cu3r Posted October 28, 2013

Maybe use strip_tags instead.
mentalist Posted October 28, 2013

Also you should be replacing the line breaks with proper validated line breaks, e.g.
StefanRSA (Author) Posted October 28, 2013

Thanks for the reply. How would I replace line breaks with proper validated line breaks? When using pure strip_tags it still leaves me with non-text characters like: ’ for ', – for -, £; for £
mentalist Posted October 28, 2013 (edited)

For years I've used the W3C validator (http://validator.w3.org/) and have virtually always been told to use <br />, but it looks like it's open to debate: http://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br

Oh, I only just noticed that the code in my earlier post isn't there. I wonder why that keeps happening here?

I was going to suggest using htmlentities() and then using the corresponding decode function, html_entity_decode(), when displaying, but I then thought that you probably just want to strip the tags, for which strip_tags() should be the one.
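For the leftover entities mentioned earlier in the thread, one option is to run html_entity_decode() after strip_tags(), so that named entities such as &rsquo; or &pound; become real UTF-8 characters. This is a sketch under assumptions: the helper name plain_text() is invented here, the input is assumed to be UTF-8, and turning <br> into a newline is a choice made for the example rather than something the thread prescribes.

```php
<?php
// Hypothetical helper, not from the thread: HTML fragment to plain text.
function plain_text(string $html): string
{
    // Turn <br>, <br/> and <br /> into newlines so the line structure
    // survives once the tags are stripped.
    $text = preg_replace('~<br\s*/?>~i', "\n", $html);

    // Remove all remaining tags.
    $text = strip_tags($text);

    // Decode named and numeric entities into UTF-8 characters. Doing this
    // after strip_tags() means an encoded "&lt;" is not mistaken for a tag.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return trim($text);
}
```

The order matters: decoding first would turn encoded angle brackets into something strip_tags() might then remove along with the text that follows them.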
mentalist Posted October 28, 2013

I can't edit that other post now! But it appears that my code quoting issue is due to me using NoScript... very odd!
cyberRobot Posted October 28, 2013

You could reduce the number of str_replace() calls by using regular expressions. For example, to replace all <br> and <br/> tags, no matter how many there are, you could do something like

    <?php
    $fullt = 'This is my<br><br>break text<br><br><BR/><br><BR><br/><br>hi<br><BR>kjsdf<br/><br/><br/>fsdffsd<br/><br/>df';
    print htmlentities($fullt) . '<br>';
    $fullt = preg_replace("~(<br>|<br/>)+~i", "<br>", $fullt);
    print htmlentities($fullt);
    ?>

Note that the "i" at the end of the regular expression makes it case insensitive. That way it also replaces <BR> and <BR/>.
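The pattern above does not match <br /> written with a space, nor breaks separated by whitespace such as "<br> <br>" from the original code. A slightly more tolerant variant might look like the sketch below; the function name collapse_breaks() is made up for illustration.

```php
<?php
// Hypothetical helper, not from the thread: collapse repeated line breaks.
function collapse_breaks(string $html): string
{
    // <br\s*/?> matches <br>, <br/> and <br />, in any letter case.
    // The trailing \s* swallows whitespace after each break, so breaks
    // separated by spaces or newlines are treated as one run.
    return preg_replace('~(?:<br\s*/?>\s*)+~i', '<br>', $html);
}
```

One side effect to be aware of: whitespace directly after the last break in a run is removed as well, so "a<br> b" becomes "a<br>b".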
StefanRSA (Author) Posted October 28, 2013

Cool, I tried to use this function given on php.net:

    function strip_word_html($text, $allowed_tags = '')
    {
        mb_regex_encoding('UTF-8');
        //replace MS special characters first
        $search = array('/‘/u', '/’/u', '/“/u', '/”/u', '/—/u');
        $replace = array('\'', '\'', '"', '"', '-');
        $text = preg_replace($search, $replace, $text);
        //make sure _all_ html entities are converted to the plain ascii equivalents - it appears
        //in some MS headers, some html entities are encoded and some aren't
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        //try to strip out any C style comments first, since these, embedded in html comments, seem to
        //prevent strip_tags from removing html comments (MS Word introduced combination)
        if(mb_stripos($text, '/*') !== FALSE){
            $text = mb_eregi_replace('#/\*.*?\*/#s', '', $text, 'm');
        }
        //introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be
        //'<1' becomes '< 1'(note: somewhat application specific)
        $text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
        $text = strip_tags($text, $allowed_tags);
        //eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
        $text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
        //strip out inline css and simplify style tags
        $search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
        $replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
        $text = preg_replace($search, $replace, $text);
        //on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
        //that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
        //some MS Style Definitions - this last bit gets rid of any leftover comments */
        $num_matches = preg_match_all("/\<!--/u", $text, $matches);
        if($num_matches){
            $text = preg_replace('/\<!--(.)*--\>/isu', '', $text);
        }
        return $text;
    }

I have two issues:

1. How light is this function if I want to parse an XML file of 2 GB (an estimated 400 000 entries)? Will this have a bad impact on server load?
2. This also did not solve my problem with the "£;" issue, which is still left after doing the following:

    $fullt=strip_word_html($texta);///
    $fullt=str_replace('£;',"£",$fullt);

Will this then be safe to insert directly into a DB field? Thanks for all your help!!!
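On the two open questions (a 2 GB file and safe insertion), one possible approach is to stream the file with XMLReader instead of loading it whole, and to insert through a PDO prepared statement instead of escaping strings by hand. The sketch below makes several assumptions that the thread does not confirm: the element name "description", the table "ads" and its column "body" are placeholders, and the two function names are invented for illustration.

```php
<?php
// Sketch only: "description", "ads" and "body" are hypothetical names.

/** Yield the text of each <$element> node without loading the whole file. */
function stream_texts(XMLReader $reader, string $element): Generator
{
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $element) {
            // readString() returns the text content of the current element.
            yield $reader->readString();
        }
    }
}

/** Insert each text through one prepared statement; returns the row count. */
function store_texts(PDO $pdo, iterable $texts): int
{
    // The placeholder keeps the value out of the SQL string, so quotes and
    // semicolons in the text cannot change the query.
    $stmt = $pdo->prepare('INSERT INTO ads (body) VALUES (:body)');
    $count = 0;

    // One transaction for the batch avoids a commit per row.
    $pdo->beginTransaction();
    foreach ($texts as $text) {
        $stmt->execute([':body' => $text]);
        $count++;
    }
    $pdo->commit();

    return $count;
}
```

In use, the reader would be opened with $reader->open() on the XML file and the PDO object created with a mysql: DSN; any cleaning function can be applied to each yielded string before it is stored. For several hundred thousand rows it may be worth committing in smaller batches rather than in one transaction. Note also that mysql_real_escape_string() belongs to the old mysql extension, which was removed in PHP 7.0.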