Non-breaking space translates into gibberish...how to fix it?


dnguyen

So I'm scraping a webpage for values within <td> tags...and some of these values come out in the form of:

 

09311264&nbsp;

 

Of course, on the actual webpage, through the browser, it appears as "09311264 "

 

When I move this value into my UTF-8 encoded table...I get something like: 09311264�

(not only is there a gibberish character for the nbsp, but there is a trailing space afterwards)

 

I'm sure there's an easy solution to this...but I've tried a lot of different things...iconv, trim, str_replace...and nothing seems to work. All I want is the numerical value.

 


How about this:

 

$str = str_replace('&nbsp;', '', $str);
$str = trim($str);

 

Then you've removed all nbsp, and also any spaces from the start and end.

 

Another approach is

 

$str = preg_replace('|[^0-9]|', '', $str);

 

That removes anything that is not a digit.
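
To make the difference between the two suggestions concrete, here is a minimal sketch. The sample strings and the helper names `stripEntity` and `digitsOnly` are made up for illustration: one value still contains the literal text `&nbsp;`, the other contains the already-decoded character U+00A0 (bytes C2 A0 in UTF-8).

```php
<?php
// Hypothetical sample values, not taken from the scraped page.
$withEntity  = "09311264&nbsp; ";      // entity still present as literal text
$withDecoded = "09311264\xC2\xA0 ";    // entity already decoded to U+00A0 (UTF-8)

// First approach: remove the literal entity, then trim ordinary whitespace.
function stripEntity(string $s): string {
    return trim(str_replace('&nbsp;', '', $s));
}

// Second approach: drop every byte that is not an ASCII digit.
function digitsOnly(string $s): string {
    return preg_replace('/[^0-9]/', '', $s);
}

var_dump(stripEntity($withEntity));                  // string(8) "09311264"
var_dump(stripEntity($withDecoded) === '09311264');  // bool(false): U+00A0 survives
var_dump(digitsOnly($withDecoded));                  // string(8) "09311264"
```

The first approach only helps while the entity is still literal text. The digit filter works either way, but it also throws away anything else you might have wanted to keep.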


I'm reading it from a loadHTML call, which is then imported into simplexml....like so:

 

$ch= curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer 6");
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// CURLOPT_RETURNTRANSFER is set, so a single curl_exec() call returns the page body
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

//lets get all names and urls

$eval="/html/body//tr";
$hrefs = $xpath->evaluate($eval);


	for ($i = 0; $i < $hrefs->length; $i++) {
		$tr = $hrefs->item($i);		
		$trs = simplexml_import_dom($tr);

		if(preg_match("/registry/i", $trs->td[0]->strong)){
			$td2 = $trs->td[2];		
			trace("HEY 1: ". $td2);
			$txt = $td2;

			$txt = str_replace('&nbsp;', '', $txt);

...


This seems to work for my needs:

$txt = preg_replace('/[^0-9A-Za-z\-\,\'():#$\/_" ]/', "", $txt);

 

But seriously, what the hell do I need to do to get an html entity to be treated as just a part of a string that can be removed in a less complicated way?


I rather suspect that by the time you try to remove the &nbsp;, it has already been converted to another encoding. Otherwise the str_replace() would pick it up. That fits in with seeing a funny character in the database.

 

In which case you either need to know what that funny character is so you can remove it (try printing urlencode($txt) to see the value in hex), or you need to remove it while it's still encoded as the string "&nbsp;".
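
A rough sketch of that diagnostic, assuming the value has already been decoded to U+00A0 and is held as UTF-8 (the sample string and the helper name `stripNbsp` are hypothetical). In UTF-8 a non-breaking space is the two bytes C2 A0, which is what urlencode() would show:

```php
<?php
// Assumed sample: what $txt might contain once the entity has been decoded.
$txt = "09311264\xC2\xA0 ";

// Print the raw bytes in a readable form.
echo urlencode($txt), "\n";   // 09311264%C2%A0+

// Once the bytes are known, remove exactly that sequence, then trim.
function stripNbsp(string $s): string {
    return trim(str_replace("\xC2\xA0", '', $s));
}

var_dump(stripNbsp($txt));    // string(8) "09311264"
```

If urlencode() shows something other than %C2%A0 for your data (for example a lone %A0, which would suggest a single-byte encoding), the byte string passed to str_replace() has to be adjusted to match.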

 

Or alternatively, you can tell whichever process is converting the entities that you don't want it to convert them. I would guess it's SimpleXML converting them, as XML parsing typically involves entity conversion. I don't know how to do that with SimpleXML, or whether you can.
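
As a way to check where the conversion happens, here is a small sketch that skips SimpleXML entirely and reads the cell text straight from the DOM. The helper name `firstCellText` and the sample markup are made up for illustration. As far as I know, the HTML parser behind DOMDocument::loadHTML() decodes entities while parsing and hands text back as UTF-8, so the entity may already be gone before SimpleXML ever sees the node; that is worth verifying on your own setup.

```php
<?php
// Hypothetical helper: return the text of the first <td>, with ordinary and
// non-breaking whitespace trimmed from both ends.
function firstCellText(string $html): string {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // entities such as &nbsp; are decoded during parsing

    $td = $dom->getElementsByTagName('td')->item(0);
    if ($td === null) {
        return '';
    }

    // textContent is UTF-8; the /u flag lets \x{00A0} match U+00A0 as one character.
    $clean = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $td->textContent);
    return $clean ?? '';
}

// Made-up sample markup resembling the scraped cell.
$html = '<table><tr><td>09311264&nbsp; </td></tr></table>';
var_dump(firstCellText($html));
```

Trimming on the decoded character keeps the rest of the value intact, which may be preferable to a whitelist regex when the cells are not purely numeric.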
