converting escaped characters

Hello all,

With permission, I scraped a website using curl and simple_html_dom to retrieve 6342 links from 112 pages.

While scraping, I converted the links to options for a select element. Most of the options display properly.

Here's the problem: there are some ISO 8859-1 hexadecimal encoded characters in the HTML source files, which display as string literals inside options.

$input = "<option>Cr\E8me</option>";
$input = str_replace("\E8", "è", $input);

does not work.


How do I turn "<option>Cr\E8me</option>" into "<option>Crème</option>"?


Any suggestions?
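
One way to approach the conversion asked about above is a regular-expression callback that rewrites each backslash followed by two hex digits. This is only a sketch: the helper name decodeHexEscapes is hypothetical, and it assumes the escapes are ISO 8859-1 code points (as the post describes) and that the output should be UTF-8.

```php
<?php
// Hypothetical helper (not from the thread): turns literal "\XX" hex escapes,
// read as ISO 8859-1 code points, into UTF-8 characters.
function decodeHexEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\([0-9A-Fa-f]{2})/',   // a literal backslash followed by two hex digits
        static function (array $m): string {
            // chr() yields the single Latin-1 byte; iconv() re-encodes it as UTF-8.
            return iconv('ISO-8859-1', 'UTF-8', chr(hexdec($m[1])));
        },
        $text
    );
}

// Single quotes keep the backslash literal, as in the scraped output.
$input = '<option>Cr\E8me</option>';
echo decodeHexEscapes($input), PHP_EOL;
```

Under those assumptions the script prints <option>Crème</option>. Note that the pattern rewrites every backslash plus two hex digits it finds, so it is only safe on text where such sequences are always escapes.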




But wouldn't that mean the source you scraped from had "\E8" too? That would be odd. If not then there was a problem with your scraper...


Thanks for your response.


The pages I scrape don't have a DTD or a declared character encoding.

Firefox displays the pages OK in Quirks mode.

My code editor identifies the encoding as windows-1252.

So I created a page with a few of the problem characters in it (è, ö, ü and ý) and saved it in windows-1252 encoding (attached).

This works in my terminal and prints the characters correctly to my screen:

iconv -f WINDOWS-1252 -t UTF-8 input.html

But server-side:

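// 1. Line by line with fopen()/fgets()
$file = fopen("input.html", "r");
while (!feof($file)) { echo fgets($file); }
fclose($file);
// 2. As an array of lines with file()
$lines = file('input.html');
foreach ($lines as $line) { echo $line; }
// 3. As a single string with file_get_contents()
echo file_get_contents('input.html');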

All return � � � �

As far as I can tell, all PHP file operations retrieve the contents of the file in ASCII, therefore

$utf8 = iconv('windows-1252', 'utf-8', $input);

fails.

I don't think it can be done programmatically server-side.


Can anyone confirm this?



So the strange hex sequences you talked about in your first post have somehow disappeared, and now all you want to do is convert the input document from ISO 8859-1 (or Windows-1252) to UTF-8?


How exactly does iconv() “fail”? No, PHP's file functions are not limited to ASCII. They simply read bytes, so they work for any encoding. This works just fine:


header('Content-Type: text/html; charset=utf-8');

$input = file_get_contents('input.html');
$output = utf8_encode($input);

echo $output;
Edited by Jacques1
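
To illustrate the point that PHP reads raw bytes, here is a small sketch using iconv() with the windows-1252 source encoding the editor reported. The string literal is a stand-in for the contents of input.html (an assumption, since the attachment is not shown). Note also that utf8_encode() only handles ISO 8859-1 and has been deprecated since PHP 8.2, so iconv() may be the safer choice on current versions.

```php
<?php
// Stand-in for file_get_contents('input.html'): the four test characters
// (è, ö, ü, ý) as single windows-1252 bytes, separated by spaces.
$input = "\xE8 \xF6 \xFC \xFD";

// The bytes arrive untouched; nothing is converted to ASCII on read.
echo bin2hex($input), PHP_EOL;   // e820f620fc20fd

// Re-encode to UTF-8 so the output matches a "charset=utf-8" header.
$output = iconv('WINDOWS-1252', 'UTF-8', $input);

echo $output, PHP_EOL;           // è ö ü ý
```

The � characters in the earlier attempts are presumably those same single bytes being sent unconverted to a browser that decodes the page as UTF-8, where a lone 0xE8 is not a valid sequence.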


Hi Jacques1,


thanks for your help.


The strange hex sequences were written to my output file after processing my curl input with simple_html_dom.



Changing

$html = new simple_html_dom();

to

$html = new simple_html_dom();
header('Content-Type: text/html; charset=utf-8');

solved my problem. All options now have the right text.


Thank you VERY much!


