
Special Characters While Parsing HTML


jRiest


Hello, I am relatively new to PHP, and while I have found my experience thus far to be enjoyable, I seem to have hit a wall and I need help.

 

I am making a small web application for a personal site, and in it I am trying to parse a website to extract some data that will be stored in a MySQL database. To do this, I am using this basic setup:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = file_get_contents($site);
$content = str_replace("½", ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';
// loadHTML() is an instance method, so create the document first
$doc = new DOMDocument();
$doc->loadHTML($content);

 

The website uses the ½ (one-half) character. I am trying to replace every instance of that character with ".5" so that I can store the value as a decimal in the database, but str_replace() doesn't seem to be working. I'm pretty sure it has to do with encoding, because when I print out the textContent of the DOMNode that contains that character, it prints out as Â½. However, if I change my browser's text encoding to UTF-8, it prints out okay.

 

So, any suggestions on how I can replace all instances of the ½ character with .5?

 

Thanks in advance!

 


Try using utf8_encode with that function.

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_encode(file_get_contents($site));
$replacement_char = utf8_encode('½');
$content = str_replace($replacement_char, ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';
$doc = new DOMDocument();
$doc->loadHTML($content);

 

This works fine for me, and it removed the 'Â.5' problem.
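Whether a replacement like this matches probably depends on which bytes end up being compared: the ½ typed into the .php file is stored in whatever encoding the editor saved the file in, and that may differ from the encoding of the downloaded page. The sketch below spells the character out as raw bytes to show the mismatch. The sample string is made-up data, not content fetched from the site, and the assumption that the page is ISO-8859-1 is only a guess.

```php
<?php
// The one-half sign written as explicit bytes, so this file's own encoding does not matter.
$latin1Half = "\xBD";      // ISO-8859-1 / Windows-1252: one byte
$utf8Half   = "\xC2\xBD";  // UTF-8: two bytes

// Hypothetical page fragment, assumed to be ISO-8859-1.
$latin1Page = "Lakers -4" . $latin1Half;

// A UTF-8 needle does not match Latin-1 content...
$miss = str_replace($utf8Half, '.5', $latin1Page, $missCount);
// ...while the single-byte needle does.
$hit = str_replace($latin1Half, '.5', $latin1Page, $hitCount);

// The reverse mismatch: a single-byte needle applied to UTF-8 content
// strips only the second byte and leaves a stray 0xC2 behind,
// which shows up as "Â" when the result is viewed as Latin-1.
$leftover = str_replace($latin1Half, '.5', $utf8Half);

echo 'UTF-8 bytes: ', bin2hex($utf8Half), ', Latin-1 bytes: ', bin2hex($latin1Half), "\n";
echo "UTF-8 needle: $missCount replacement(s); Latin-1 needle: $hitCount replacement(s)\n";
echo 'Leftover bytes after partial replacement: ', bin2hex($leftover), "\n";
```

If the counts come out the other way round with the real page, the page is more likely UTF-8 and the needle is the single-byte form.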

That didn't seem to work for me either. To check whether it was working, I am using this:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_decode(file_get_contents($site));
$replacement_char = utf8_encode('½');
$content = str_replace($replacement_char, ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';

 

I also tried this and it doesn't work either:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_decode(file_get_contents($site));
$content = str_replace('½', ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';

 

 

It never makes any replacements.
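One way to take the source file's encoding out of the picture is to write the search string as explicit bytes and pick the byte form that fits the downloaded content. This is a minimal sketch, assuming the page is either valid UTF-8 or a single-byte encoding such as ISO-8859-1 or Windows-1252 (both place ½ at byte 0xBD). replaceOneHalf is a hypothetical helper name, not something from this thread, and the sample strings stand in for the result of file_get_contents().

```php
<?php
// Hypothetical helper: replace the one-half sign with ".5" whether $html
// is UTF-8 or a single-byte Latin-1-style encoding.
function replaceOneHalf(string $html, ?int &$count = null): string
{
    // With the /u modifier, preg_match() succeeds only on valid UTF-8 input.
    $isUtf8 = preg_match('//u', $html) === 1;

    // Raw bytes for the character, plus the HTML entity forms it may appear as.
    $needles = [$isUtf8 ? "\xC2\xBD" : "\xBD", '&frac12;', '&#189;'];

    return str_replace($needles, '.5', $html, $count);
}

// Made-up sample inputs in each encoding.
$samples = [
    'latin1' => "Lakers -4\xBD",
    'utf8'   => "Lakers -4\xC2\xBD",
    'entity' => 'Lakers -4&frac12;',
];

foreach ($samples as $label => $sample) {
    $fixed = replaceOneHalf($sample, $count);
    echo $label, ': ', $fixed, ' (replacements: ', $count, ")\n";
}
```

Checking for valid UTF-8 first matters because a blind single-byte replacement on UTF-8 text could also hit the second byte of other characters (for example ý is 0xC3 0xBD). After the replacement, the string can be passed to a DOMDocument instance with $doc->loadHTML($content).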

 

Archived

This topic is now archived and is closed to further replies.
