Special Characters While Parsing HTML


jRiest

Hello, I am relatively new to PHP, and while I have found my experience thus far to be enjoyable, I seem to have hit a wall and I need help.

 

I am making a small web application for a personal site, and in it I am trying to parse a website to extract some data that will be stored in a MySQL database. To do this, I am using this basic setup:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";

$content = file_get_contents($site);

$content = str_replace("½",".5", $content, $count);

echo 'Replacements: ' . $count . '<br />';

$doc = DOMDocument::loadHTML($content);

 

However, the website uses the ½ (one-half) character. I am trying to replace every instance of that character with ".5" so that I can store the value as a decimal in the database, but str_replace() doesn't seem to be working. I'm pretty sure it has to do with encoding, because when I print out the textContent of the DOMNode that contains that character, it prints out as Â½. However, if I change my browser's text encoding to UTF-8, it prints out okay.
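As a rough illustration of the symptom described above (a sketch, not a diagnosis of this particular page): in UTF-8 the ½ character is the two bytes C2 BD, while in ISO-8859-1 it is the single byte BD. Reading the UTF-8 bytes as ISO-8859-1 produces the "Â" prefix. The variable names are made up for the example, and it assumes the mbstring extension is available.

```php
<?php
// "½" (U+00BD) written as explicit bytes, so the result does not depend
// on the encoding this source file happens to be saved in.
$halfUtf8   = "\xC2\xBD"; // UTF-8 encoding of U+00BD
$halfLatin1 = "\xBD";     // ISO-8859-1 encoding of U+00BD

// Treat the two UTF-8 bytes as if they were two ISO-8859-1 characters,
// which is what a browser does when it guesses the wrong encoding.
$misread = mb_convert_encoding($halfUtf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($halfUtf8), "\n";   // c2bd
echo bin2hex($halfLatin1), "\n"; // bd
echo bin2hex($misread), "\n";    // c382c2bd, shown as "Â½" on a UTF-8 page
```

Since str_replace() compares raw bytes, a needle saved in one encoding will not match a haystack that holds the same character in the other encoding.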

 

So, any suggestions on how I can replace all instances of the ½ character with .5?

 

Thanks in advance!

 


Try using utf8_encode with that function.

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_encode(file_get_contents($site));
$replacement_char = utf8_encode('½');
$content = str_replace($replacement_char, ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';
$doc = DOMDocument::loadHTML($content);

 

This works fine for me, and it removed the 'Â.5' problem.
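One possible reason this approach can behave differently on another machine, sketched under assumptions: utf8_encode() treats its input as ISO-8859-1, so the needle it produces depends on the encoding the .php file itself is saved in. The example below assumes the page is served as ISO-8859-1 (not confirmed in this thread), uses hypothetical variable names, and uses mb_convert_encoding() as a stand-in for utf8_encode().

```php
<?php
// Stand-in for utf8_encode(): convert ISO-8859-1 bytes to UTF-8 (needs mbstring).
function latin1ToUtf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Assumption: the fetched page contains "½" as the single ISO-8859-1 byte BD.
$haystack = latin1ToUtf8("3\xBD"); // becomes "3" followed by C2 BD

// The literal '½' in the script is BD if the file is saved as ISO-8859-1,
// and C2 BD if the file is saved as UTF-8.
$needleFromLatin1File = latin1ToUtf8("\xBD");     // C2 BD
$needleFromUtf8File   = latin1ToUtf8("\xC2\xBD"); // C3 82 C2 BD (double-encoded)

str_replace($needleFromLatin1File, '.5', $haystack, $hitsLatin1File);
str_replace($needleFromUtf8File, '.5', $haystack, $hitsUtf8File);

echo $hitsLatin1File, "\n"; // 1
echo $hitsUtf8File, "\n";   // 0
```

Under these assumptions the same code reports one replacement when the script is saved as ISO-8859-1 and none when it is saved as UTF-8.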


That didn't seem to work for me either. To check whether it was working, I am using this:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_decode(file_get_contents($site));
$replacement_char = utf8_encode('½');
$content = str_replace($replacement_char, ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';

 

I also tried this and it doesn't work either:

 

$site = "http://www.usatoday.com/sports/gaming/sheridan.htm";
$content = utf8_decode(file_get_contents($site));
$content = str_replace('½', ".5", $content, $count);
echo 'Replacements: ' . $count . '<br />';

 

 

It never makes any replacements.
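A minimal sketch of one way to sidestep the problem, assuming the page is either valid UTF-8 or ISO-8859-1: normalize the content to UTF-8 first, then search for the character by its explicit bytes so the encoding of the script file no longer matters. The helper name replaceHalf() is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: normalize to UTF-8, then replace "½" with ".5".
function replaceHalf(string $content, ?int &$count = null): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($content, 'UTF-8')) {
        $content = mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1');
    }

    // "\xC2\xBD" is "½" in UTF-8, independent of how this file is saved.
    return str_replace("\xC2\xBD", '.5', $content, $count);
}

$fromLatin1 = replaceHalf("Spread: 3\xBD", $countLatin1);   // ISO-8859-1 input
$fromUtf8   = replaceHalf("Spread: 3\xC2\xBD", $countUtf8); // UTF-8 input

echo $fromLatin1, ' (', $countLatin1, ")\n"; // Spread: 3.5 (1)
echo $fromUtf8, ' (', $countUtf8, ")\n";     // Spread: 3.5 (1)
```

If the page writes the fraction as an HTML entity rather than a raw character, this byte search would not match it, so it may be worth inspecting the fetched source first.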

 

