dnguyen Posted July 31, 2008
So I'm scraping a webpage for values within <td> tags, and some of these values come out in the form of: 09311264&nbsp; Of course, on the actual webpage, through the browser, it appears as "09311264 ". When I move this value into my UTF-8 encoded table, I get something like: 09311264� (not only is there a gibberish character for the nbsp, but there is a trailing space afterwards). I'm sure there's an easy solution to this, but I've tried a lot of different things (iconv, trim, str_replace) and nothing seems to work. All I want is the numerical value.
btherl Posted July 31, 2008
How about this:
    $str = str_replace('&nbsp;', '', $str);
    $str = trim($str);
Then you've removed every &nbsp;, and also any spaces from the start and end. Another approach is:
    $str = preg_replace('|[^0-9]|', '', $str);
That removes anything that is not a digit.
dnguyen Posted July 31, 2008
Yeah, I tried that, but the &nbsp; still appears. This is what I tried:
    $txt = str_replace('&nbsp;', '', $txt);
The string is an output from SimpleXML. The encoding shouldn't matter, should it?
dnguyen Posted July 31, 2008
And unfortunately, I can't just do a regexp for numbers, because I'm also grabbing alphabetical strings from these fields.
cooldude832 Posted July 31, 2008
If you're reading it using file_get_contents(), then the &nbsp; entities are preserved as the literal string they are, since PHP doesn't care that they are HTML special characters.
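A minimal sketch of that point, assuming the page is still held as a raw string. The sample markup and variable names are made up for illustration. Before any parser touches it, the entity is just the six literal characters &nbsp;, so a plain str_replace() can find it.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<tr><td>09311264&nbsp; </td></tr>';

// In the raw string the entity is literal text, so str_replace() matches it.
$withoutEntity = str_replace('&nbsp;', '', $html);

// strip_tags() drops the markup; trim() removes the leftover ordinary space.
$value = trim(strip_tags($withoutEntity));

echo $value, "\n"; // prints 09311264
```

This only holds while the text is unparsed. Once an HTML or XML parser has decoded the entity, the literal string is gone and the same str_replace() call finds nothing.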
dnguyen Posted July 31, 2008
I'm reading it from a loadHTML call, which is then imported into SimpleXML, like so:
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, "Internet Explorer 6");
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    ob_start();
    curl_exec($ch);
    $html = curl_exec($ch);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    // lets get all names and urls
    $eval = "/html/body//tr";
    $hrefs = $xpath->evaluate($eval);
    for ($i = 0; $i < $hrefs->length; $i++) {
        $tr = $hrefs->item($i);
        $trs = simplexml_import_dom($tr);
        if (preg_match("/registry/i", $trs->td[0]->strong)) {
            $td2 = $trs->td[2];
            trace("HEY 1: " . $td2);
            $txt = $td2;
            $txt = str_replace('&nbsp;', '', $txt);
            ...
dnguyen Posted July 31, 2008
This seems to work for my needs:
    $txt = preg_replace('/[^0-9A-Za-z\-\,\'():#$\/_" ]/', "", $txt);
But seriously, what the hell do I need to do to get an HTML entity to be treated as just a part of a string that can be removed in a less complicated way?
btherl Posted August 1, 2008
I rather suspect that by the time you try to remove the &nbsp;, it has already been converted to another encoding. Otherwise the str_replace() would pick it up. That fits in with seeing a funny character in the database. In that case you either need to know what that funny character is so you can remove it (try printing urlencode($txt) to see the value in hex), or you need to remove it while it's still encoded as the string "&nbsp;". Alternatively, you can tell whichever process is converting the entities that you don't want it to convert them. I would guess it's SimpleXML converting them, as XML parsing typically involves entity conversion. I don't know how to do that with SimpleXML, or whether you can.
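A rough sketch of that diagnosis, assuming the parser has already turned &nbsp; into the no-break space character U+00A0, which is the two bytes C2 A0 in UTF-8. The sample string and the helper name strip_nbsp are hypothetical, not taken from the thread.

```php
<?php
// Assumed value after parsing: digits, a UTF-8 no-break space (C2 A0), a plain space.
$txt = "09311264\xC2\xA0 ";

// urlencode() exposes the hidden bytes, as suggested above.
echo urlencode($txt), "\n"; // prints 09311264%C2%A0+

// Hypothetical helper: turns both the decoded character and the literal entity
// into an ordinary space, then trims whitespace from the ends.
function strip_nbsp(string $s): string
{
    return trim(str_replace(["\xC2\xA0", '&nbsp;'], ' ', $s));
}

var_dump(strip_nbsp($txt));                  // string(8) "09311264"
var_dump(strip_nbsp('Registry&nbsp;No. 7')); // string(14) "Registry No. 7"
```

Replacing with a space rather than an empty string keeps words apart in the alphabetical fields; pass '' instead if the character should simply vanish. If the text were in a single-byte encoding such as ISO-8859-1, the no-break space would be the single byte A0, so it is worth checking the urlencode() output before choosing what to replace.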