flounder Posted November 26, 2015

Hello all,

With permission, I scraped a website using curl and simple_html_dom to retrieve 6342 links from 112 pages. While scraping, I converted the links to options for a select element. Most of the options display properly.

Here's the problem: there are some ISO 8859-1 hexadecimal encoded characters in the HTML source files, which display as string literals inside options.

    $input = "<option>Cr\E8me</option>";
    $input = str_replace("\E8", "è", $input);

does not work. How do I turn "<option>Cr\E8me</option>" into "<option>Crème</option>"?

Any suggestions? TIA.
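For the literal case described in the question, a minimal sketch of one way to decode such sequences follows. It assumes every sequence is a backslash followed by exactly two hex digits naming one ISO 8859-1 byte, and that the mbstring extension is available; the helper name decodeLatin1HexEscapes is hypothetical. If the real data contains backslashes for other reasons, this pattern would need tightening.

```php
<?php
// Hypothetical helper: turn "\E8"-style sequences into UTF-8 characters.
// Assumption: each sequence is a backslash plus two hex digits (one ISO 8859-1 byte).
function decodeLatin1HexEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\([0-9A-Fa-f]{2})/',           // a literal backslash, then two hex digits
        static function (array $m): string {
            $byte = chr(hexdec($m[1]));     // e.g. "E8" becomes the single byte 0xE8
            return mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
        },
        $text
    );
}

// Single quotes keep the backslash literal, matching the scraped text.
echo decodeLatin1HexEscapes('<option>Cr\E8me</option>'), PHP_EOL;
// prints <option>Crème</option> when the output is interpreted as UTF-8
```

The callback converts one byte at a time, so the rest of the string is left untouched and is assumed to be ASCII or already UTF-8.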
requinix Posted November 26, 2015

But wouldn't that mean the source you scraped from had "\E8" too? That would be odd. If not, then there was a problem with your scraper...
teynon Posted November 27, 2015

I would experiment with the string encodings. I've done this in C#, but not in PHP. Perhaps utf8_encode would help: http://php.net/manual/en/function.utf8-encode.php
flounder Posted December 3, 2015

Thanks for your response. The pages I scrape don't have a DTD or a declared character encoding. Firefox displays the pages OK in Quirks mode. My code editor identifies the encoding as windows-1252.

So I created a page with a few of the problem characters in it (è, ö, ü and ý), saved it in windows-1252 encoding, and attached it.

This works on my terminal:

    iconv -f WINDOWS-1252 -t UTF-8 input.html

outputting èüýö to my screen, but server-side:

    $file = fopen("input.html", "r");
    while (!feof($file)) {
        echo fgets($file);
    }

    $file = file('input.html');
    foreach ($file as $line_num => $line) {
        echo $line;
    }

    echo file_get_contents('input.html');

all return � � � �

As far as I can tell, all PHP file operations retrieve the contents of the file in ASCII, therefore

    $utf8 = iconv('windows-1252', 'utf-8', $input);

fails. I don't think it can be done programmatically server-side. Can anyone confirm this?

Attachment: input.html
Jacques1 Posted December 3, 2015 (Solution; edited December 3, 2015 by Jacques1)

So the strange hex sequences you talked about in your first post have somehow disappeared, and now all you want to do is convert the input document from ISO 8859-1 (or Windows-1252) to UTF-8?

How exactly does iconv() "fail"?

No, PHP's file functions are not limited to ASCII. They simply read bytes, so they work for any encoding.

This works just fine:

    <?php
    // Tell the browser that the response body is UTF-8.
    header('Content-Type: text/html; charset=utf-8');
    // Read the raw bytes of the file, then convert ISO 8859-1 to UTF-8.
    $input = file_get_contents('input.html');
    $output = utf8_encode($input);
    echo $output;
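The point about file functions returning raw bytes can be checked with a short sketch. It writes the four test characters as Windows-1252 bytes to a temporary file, reads them back, and converts them. This version uses mb_convert_encoding (assuming the mbstring extension is loaded) instead of utf8_encode, because utf8_encode always treats its input as ISO 8859-1 and is deprecated as of PHP 8.2. The two encodings agree for è, ö, ü and ý but differ for bytes 0x80 to 0x9F, such as the euro sign. The temporary file and variable names are illustrative only.

```php
<?php
// Write the Windows-1252 bytes for è ö ü ý to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($path, "\xE8\xF6\xFC\xFD");

// file_get_contents() returns exactly the bytes that were stored.
$raw = file_get_contents($path);
echo bin2hex($raw), PHP_EOL;    // e8f6fcfd

// Convert explicitly from Windows-1252 to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8), PHP_EOL;   // c3a8c3b6c3bcc3bd

// Byte 0x80 is the euro sign in Windows-1252 but a control character in ISO 8859-1.
$euro = mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252');
echo bin2hex($euro), PHP_EOL;   // e282ac

unlink($path);
```

The � symbols in the earlier post are what a browser shows when it is told (or guesses) UTF-8 but receives unconverted single-byte characters, which is why the header and the conversion go together.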
flounder Posted December 3, 2015

Hi Jacques1, thanks for your help.

The strange hex sequences were written to my output file after processing my curl input with simple_html_dom.

Replacing

    $html = new simple_html_dom();
    $html->load($result);

with

    $html = new simple_html_dom();
    header('Content-Type: text/html; charset=utf-8');
    $html->load(utf8_encode($result));

solved my problem. All options now have the right text.

Thank you VERY much!
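One caveat with converting every response unconditionally: if some of the scraped pages are already UTF-8, converting them again produces doubled characters. A hedged sketch of a guard is below. The helper name toUtf8 and the Windows-1252 default are assumptions, not part of the thread, and the validity check is only a heuristic, since a few single-byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not valid UTF-8 already.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function toUtf8(string $bytes, string $assumedEncoding = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;  // already UTF-8 (or plain ASCII): leave it alone
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedEncoding);
}

echo bin2hex(toUtf8("Cr\xE8me")), PHP_EOL;      // single-byte input gets converted
echo bin2hex(toUtf8("Cr\xC3\xA8me")), PHP_EOL;  // UTF-8 input passes through unchanged
```

In the code from this post, the result of the curl request would be passed through such a helper before being handed to `$html->load()`.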