dnguyen Posted July 31, 2008
So I'm scraping a webpage for values within <td> tags, and some of these values come out in the form of: 09311264&nbsp; Of course, on the actual webpage, through the browser, it appears as "09311264 ". When I move this value into my UTF-8 encoded table, I get something like: 09311264� (not only is there a gibberish character for the nbsp, but there is a trailing space afterwards). I'm sure there's an easy solution to this, but I've tried a lot of different things (iconv, trim, str_replace) and nothing seems to work. All I want is the numerical value.
Link to comment: https://forums.phpfreaks.com/topic/117451-non-breaking-space-translates-into-gibberishhow-to-fix-it/
btherl Posted July 31, 2008
How about this:
$str = str_replace('&nbsp;', '', $str);
$str = trim($str);
Then you've removed all nbsp, and also any spaces from the start and end. Another approach is
$str = preg_replace('|[^0-9]|', '', $str);
That removes anything that is not a digit.
dnguyen Posted July 31, 2008 (Author)
Yeah, I tried that, but the &nbsp; still appears. This is what I tried:
$txt = str_replace('&nbsp;', '', $txt);
The string is an output from SimpleXML... the encoding shouldn't matter, should it?
dnguyen Posted July 31, 2008 (Author)
And unfortunately, I can't just do a regexp for numbers. I'm also grabbing alphabetical strings from these fields.
cooldude832 Posted July 31, 2008
If you're reading it using file_get_contents(), then the &nbsp; entities are preserved as the literal strings they are, since PHP doesn't care that they are HTML special characters.
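A minimal sketch of the point above, assuming the cell arrives as the raw markup `<td>09311264&nbsp;</td>` (a made-up sample, not the real page). On the raw string the entity is still six literal characters, so a plain str_replace() finds it. Once something decodes it (html_entity_decode() stands in here for whatever a parser does), that literal text is gone and the same search no longer matches.

```php
<?php
// Hypothetical sample markup; the real page is not shown in the thread.
$raw = '<td>09311264&nbsp;</td>';

// In the raw HTML the entity is plain text, so str_replace() can see it.
$cleanedRaw = str_replace('&nbsp;', '', $raw);

// After decoding, the entity becomes the single character U+00A0
// (bytes 0xC2 0xA0 in UTF-8), and the literal '&nbsp;' no longer exists.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
$stillHasEntity = strpos($decoded, '&nbsp;') !== false;

var_dump($cleanedRaw);      // the cell with the entity removed
var_dump($stillHasEntity);  // whether the literal entity survived decoding
```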
dnguyen Posted July 31, 2008 (Author)
I'm reading it from a loadHTML call, which is then imported into SimpleXML, like so:
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Internet Explorer 6");
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// With CURLOPT_RETURNTRANSFER set, one curl_exec() call returns the body;
// no output buffering or second request is needed.
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// let's get all names and urls
$eval = "/html/body//tr";
$hrefs = $xpath->evaluate($eval);
for ($i = 0; $i < $hrefs->length; $i++) {
    $tr = $hrefs->item($i);
    $trs = simplexml_import_dom($tr);
    if (preg_match("/registry/i", $trs->td[0]->strong)) {
        $td2 = $trs->td[2];
        trace("HEY 1: " . $td2);
        $txt = $td2;
        $txt = str_replace('&nbsp;', '', $txt);
        ...
dnguyen Posted July 31, 2008 (Author)
This seems to work for my needs:
$txt = preg_replace('/[^0-9A-Za-z\-\,\'():#$\/_" ]/', "", $txt);
But seriously, what the hell do I need to do to get an html entity to be treated as just a part of a string that can be removed in a less complicated way?
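One possible less complicated route, sketched under the assumption that the stray character is a decoded non-breaking space (U+00A0) in a UTF-8 string and that PHP 7 or later is available. Instead of whitelisting every allowed character, it only trims whitespace and U+00A0 from the two ends, so letters and punctuation in the middle are left alone. The helper name trim_nbsp() and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper: strips ordinary whitespace and U+00A0 from both
// ends of a UTF-8 string, leaving everything in between untouched.
function trim_nbsp(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{A0} matches the non-breaking space as a single character.
    $result = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $text);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the original text in that case.
    return $result === null ? $text : $result;
}

// Assumed sample values: "\xC2\xA0" is U+00A0 encoded as UTF-8.
$number = trim_nbsp("09311264\xC2\xA0 ");
$name   = trim_nbsp("\xC2\xA0O'Brien, Pat (#12)\xC2\xA0");

var_dump($number, $name);
```

Unlike the whitelist regex, this does not need to be extended each time a field contains a new kind of character.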
btherl Posted August 1, 2008
I rather suspect that by the time you try to remove the &nbsp; it has already been converted to another encoding. Otherwise the str_replace() would pick it up. That fits with seeing a funny character in the database. In that case you either need to know what the funny character is so you can remove it (try printing urlencode($txt) to see the value in hex), or you need to remove it while it's still encoded as the string "&nbsp;". Alternatively, you can tell whichever process is converting the entities that you don't want it to convert them. I would guess it's SimpleXML converting them, as XML parsing typically involves entity conversion. I don't know how to do that with SimpleXML, or whether you can.
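The urlencode() check suggested above might look like the following sketch. It assumes the funny character is the decoded non-breaking space (U+00A0, bytes 0xC2 0xA0 in UTF-8) followed by an ordinary space; the thread does not confirm this, and $txt here is a stand-in for the real SimpleXML value.

```php
<?php
// Stand-in for the value that comes out of SimpleXML (assumed input):
// digits, then U+00A0 encoded as UTF-8, then an ordinary space.
$txt = "09311264\xC2\xA0 ";

// urlencode() renders non-alphanumeric bytes as %XX and a space as +,
// which makes otherwise invisible characters easy to identify.
$hex = urlencode($txt);

// Remove the decoded character itself rather than the entity text,
// then trim the ordinary trailing space.
$clean = trim(str_replace("\xC2\xA0", '', $txt));

echo $hex, "\n";    // 09311264%C2%A0+
echo $clean, "\n";  // 09311264
```

If the printed value shows bytes other than %C2%A0, the same str_replace() approach should apply with those bytes substituted.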
Archived
This topic is now archived and is closed to further replies.