Non-breaking space translates into gibberish...how to fix it?

dnguyen · July 31, 2008

So I'm scraping a webpage for values within <td> tags...and some of these values come out in the form of:

09311264&nbsp;

Of course, on the actual webpage, through the browser, it appears as "09311264 "

When I move this value into my UTF-8 encoded table...I get something like: 09311264�

(not only is there a gibberish character for the nbsp, but there is a trailing space afterwards)

I'm sure there's an easy solution to this...but I've tried a lot of different things...iconv, trim, str_replace...and nothing seems to work. All I want is the numerical value.

btherl · July 31, 2008

How about this:

$str = str_replace('&nbsp;', '', $str);
$str = trim($str);

Then you've removed all nbsp, and also any spaces from the start and end.

Another approach is

$str = preg_replace('|[^0-9]|', '', $str);

That removes anything that is not a digit.

dnguyen · July 31, 2008

Yeah I tried that...but the &nbsp; still appears...this is what I tried:

$txt = str_replace('&nbsp;', '', $txt);

The string is an output from SimpleXML...the encoding shouldn't matter, should it?

dnguyen · July 31, 2008

and unfortunately, I can't just do a regexp for numbers...I'm also grabbing alphabetical strings from these fields

cooldude832 · July 31, 2008

If you're reading it using file_get_contents(), then the &nbsp; entities are preserved as the literal string they are, since PHP doesn't care that they are HTML special characters.

dnguyen · July 31, 2008

I'm reading it from a loadHTML call, which is then imported into simplexml....like so:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Internet Explorer 6");
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

// With CURLOPT_RETURNTRANSFER set, a single curl_exec() returns the page body
$html = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// let's get all table rows

$eval = "/html/body//tr";
$hrefs = $xpath->evaluate($eval);

	for ($i = 0; $i < $hrefs->length; $i++) {
		$tr = $hrefs->item($i);
		$trs = simplexml_import_dom($tr);

		if (preg_match("/registry/i", $trs->td[0]->strong)) {
			$td2 = $trs->td[2];
			trace("HEY 1: " . $td2);
			$txt = $td2;

			// this is the replacement that has no effect
			$txt = str_replace('&nbsp;', '', $txt);

...

dnguyen · July 31, 2008

This seems to work for my needs:

$txt = preg_replace('/[^0-9A-Za-z\-\,\'():#$\/_" ]/', "", $txt);

But seriously, what the hell do I need to do to get an html entity to be treated as just a part of a string that can be removed in a less complicated way?

btherl · August 1, 2008

I rather suspect that when you try to remove the &nbsp; it has already been converted to another encoding. Otherwise the str_replace() would pick it up. That fits in with seeing a funny character in the database.

In which case you either need to know what that funny character is so you can remove it (try printing urlencode($txt) to see the value in hex) or you need to remove it while it's still encoded as the string "&nbsp;".
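
To illustrate the first option, here is a minimal sketch. It assumes the funny character is a non-breaking space (U+00A0) that has already been decoded and stored as UTF-8, which is the two bytes C2 A0. The sample value in $txt is made up to mimic the one in the question.

```php
<?php
// Hypothetical sample: digits, then a UTF-8 encoded U+00A0, then a plain space.
$txt = "09311264\xC2\xA0 ";

// urlencode() shows every non-ASCII byte as %XX, so the hidden character becomes visible.
echo urlencode($txt), "\n";

// Remove the two-byte sequence, then trim ordinary whitespace from both ends.
$clean = trim(str_replace("\xC2\xA0", '', $txt));
echo $clean, "\n";
```

If urlencode() prints something other than %C2%A0 for your data, put those bytes into the str_replace() call instead.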

Or alternatively, you can tell whichever process is converting the entities that you don't want it to convert them. I would guess it's SimpleXML converting them, as XML parsing typically involves entity conversion. I don't know how to do that with SimpleXML, or whether it's possible at all.
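
A quick way to check where the conversion happens is to parse a small fragment and dump the bytes. This is only a sketch: the $html fragment below is invented for the test, not taken from the real page. It suggests the entity is already decoded by DOMDocument::loadHTML(), before SimpleXML is involved, and that replacing the literal "&nbsp;" in the raw HTML before parsing avoids the problem.

```php
<?php
// Hypothetical fragment standing in for the scraped page.
$html = '<html><body><table><tr><td>09311264&nbsp;</td></tr></table></body></html>';

// Parse as-is and inspect the raw bytes of the cell text.
$dom = new DOMDocument();
@$dom->loadHTML($html);
$td = $dom->getElementsByTagName('td')->item(0);
echo bin2hex($td->textContent), "\n"; // ends in c2a0 if the entity was decoded to U+00A0

// Alternative: drop the entity while it is still the literal string "&nbsp;".
$dom2 = new DOMDocument();
@$dom2->loadHTML(str_replace('&nbsp;', '', $html));
$stripped = $dom2->getElementsByTagName('td')->item(0)->textContent;
echo $stripped, "\n";
```

The second approach only catches the named entity. If the page uses &#160; or a raw non-breaking space, removing "\xC2\xA0" after parsing is the more general fix.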
