
Non-breaking space translates into gibberish...how to fix it?


dnguyen


So I'm scraping a webpage for values within <td> tags...and some of these values come out in the form of:

 

09311264&nbsp;

 

Of course, on the actual webpage, through the browser, it appears as "09311264 "

 

When I move this value into my UTF-8 encoded table...I get something like: 09311264�

(not only is there a gibberish character for the nbsp, but there is a trailing space afterwards)

 

I'm sure there's an easy solution to this...but I've tried a lot of different things...iconv, trim, str_replace...and nothing seems to work. All I want is the numerical value.

 

How about this:

 

$str = str_replace('&nbsp;', '', $str);
$str = trim($str);

 

Then you've removed all nbsp, and also any spaces from the start and end.

 

Another approach is

 

$str = preg_replace('|[^0-9]|', '', $str);

 

That removes anything that is not a digit.
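As a rough check of the digit-only approach, here is a minimal sketch. The helper name `digits_only` and the sample value are made up for illustration; the sample assumes the non-breaking space has already been decoded to its UTF-8 bytes (`\xC2\xA0`) and is followed by a normal space.

```php
<?php
// Hypothetical helper (not from the thread): keep only ASCII digits.
function digits_only(string $value): string
{
    // Without the /u modifier the pattern works byte by byte, so both
    // bytes of a UTF-8 non-breaking space are removed as well.
    return preg_replace('/[^0-9]/', '', $value);
}

// Assumed sample: digits, a decoded non-breaking space, then a plain space.
$raw = "09311264\xC2\xA0 ";

echo digits_only($raw), "\n";
```

Note that this also drops signs, decimal points and letters, so it only fits fields that are purely numeric.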

I'm reading it from a loadHTML call, which is then imported into simplexml....like so:

 

$ch= curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer 6");
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// CURLOPT_RETURNTRANSFER is set, so curl_exec() returns the page body as a string.
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// let's get all table rows

$eval = "/html/body//tr";
$hrefs = $xpath->evaluate($eval);


	for ($i = 0; $i < $hrefs->length; $i++) {
		$tr = $hrefs->item($i);		
		$trs = simplexml_import_dom($tr);

		if(preg_match("/registry/i", $trs->td[0]->strong)){
			$td2 = $trs->td[2];		
			trace("HEY 1: ". $td2);
			$txt = $td2;

			$txt = str_replace('&nbsp;', '', $txt);

...

This seems to work for my needs:

$txt = preg_replace('/[^0-9A-Za-z\-\,\'():#$\/_" ]/', "", $txt);

 

But seriously, what the hell do I need to do to get an html entity to be treated as just a part of a string that can be removed in a less complicated way?

I rather suspect that by the time you try to remove the &nbsp;, it has already been converted to another encoding. Otherwise the str_replace() would pick it up. That fits with seeing a funny character in the database.

 

In which case you either need to know what that funny character is so you can remove it (try printing urlencode($txt) to see its bytes in hex), or you need to remove it while it's still encoded as the string "&nbsp;".
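To illustrate the first option, here is a small sketch. It assumes the value arrives as UTF-8 with the non-breaking space already decoded to U+00A0 (bytes `\xC2\xA0`) plus a trailing normal space; the helper name `strip_nbsp` and the sample string are hypothetical.

```php
<?php
// Assumed sample: digits, a decoded non-breaking space (U+00A0 in UTF-8),
// then an ordinary trailing space.
$txt = "09311264\xC2\xA0 ";

// urlencode() shows every non-alphanumeric byte as %XX, and a space as "+".
echo urlencode($txt), "\n";   // 09311264%C2%A0+

// Hypothetical helper: remove the UTF-8 non-breaking space, then trim.
// trim() alone does not treat \xC2\xA0 as whitespace by default.
function strip_nbsp(string $value): string
{
    return trim(str_replace("\xC2\xA0", '', $value));
}

echo strip_nbsp($txt), "\n";  // 09311264
```

If urlencode() shows a different byte sequence (for example a lone `%A0` in a Latin-1 string), the search string would have to be adjusted to match.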

 

Alternatively, you can tell whichever process is converting the entities that you don't want it to convert them. I would guess it's SimpleXML converting them, as XML parsing typically involves entity conversion. I don't know how to do that with SimpleXML, or whether you can.
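A simpler way to act on the "remove it while it's still an entity" idea is to replace the literal text `&nbsp;` in the raw HTML before it reaches the parser. The sketch below assumes PHP's DOM extension is available; the sample markup and the helper name `td_text` are invented for illustration and are not the original poster's page or code.

```php
<?php
// Assumed sample markup standing in for the scraped page.
$html = '<html><body><table><tr><td>09311264&nbsp;</td></tr></table></body></html>';

// Hypothetical helper: parse the HTML and return the text of the first <td>.
function td_text(string $html): string
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    return $dom->getElementsByTagName('td')->item(0)->textContent;
}

// Parsed as-is, the entity is decoded to U+00A0 (bytes C2 A0 in UTF-8).
$decoded = td_text($html);
echo bin2hex($decoded), "\n";

// Replacing the entity with a plain space first lets trim() finish the job.
$cleaned = trim(td_text(str_replace('&nbsp;', ' ', $html)));
echo $cleaned, "\n";
```

Replacing with a space rather than an empty string keeps words from running together where the entity was used as a separator.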

Archived

This topic is now archived and is closed to further replies.
