flounder

converting escaped characters


Hello all,

With permission, I scraped a website using curl and simple_html_dom to retrieve 6342 links from 112 pages.

While scraping, I converted the links to options for a select element. Most of the options display properly.

Here's the problem: there are some ISO 8859-1 hexadecimal encoded characters in the HTML source files, which display as string literals inside options.

$input = "<option>Cr\E8me</option>";
$input = str_replace("\E8", "è", $input);

does not work.

How do I turn "<option>Cr\E8me</option>" into "<option>Crème</option>"?

Any suggestions?

TIA.


But wouldn't that mean the source you scraped from had "\E8" too? That would be odd. If not then there was a problem with your scraper...
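
If the literal backslash sequence really is in the text, it can be decoded directly. Here is a minimal sketch, assuming each escape is a backslash followed by exactly two hex digits that stand for an ISO 8859-1 byte, and assuming the mbstring extension is available. The function name decodeHexEscapes is made up for illustration.

```php
<?php
// Hypothetical helper: turn literal "\XX" hex escapes into UTF-8 characters.
// Assumption: XX is always two hex digits naming an ISO 8859-1 byte.
function decodeHexEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\([0-9A-Fa-f]{2})/', // a literal backslash, then two hex digits
        static function (array $m): string {
            // Build the single byte, then convert it from ISO 8859-1 to UTF-8.
            return mb_convert_encoding(chr(hexdec($m[1])), 'UTF-8', 'ISO-8859-1');
        },
        $text
    );
}

// Single quotes keep the backslash as a literal character.
echo decodeHexEscapes('<option>Cr\E8me</option>'), PHP_EOL;
```

Note that this would also rewrite any other backslash followed by two hex digits, so it is only safe if such sequences never occur as ordinary text in the input.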


Thanks for your response.

The pages I scrape don't have a DTD or a declared character encoding.

Firefox displays the pages OK in Quirks mode.

My code editor identifies the encoding as windows-1252.

So I created a page with a few of the problem characters in it (è, ö, ü and ý) and saved it in windows-1252 encoding; it is attached.

This works on my terminal:

iconv -f WINDOWS-1252 -t UTF-8 input.html

outputting

è
ü
ý
ö

to my screen, but server-side:

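// 1. Line by line with fopen()/fgets()
$file = fopen("input.html", "r");
while (!feof($file)) { echo fgets($file); }
fclose($file);
// 2. As an array of lines with file()
$lines = file('input.html');
foreach ($lines as $line_num => $line) { echo $line; }
// 3. As a single string with file_get_contents()
echo file_get_contents('input.html');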

All return � � � �

As far as I can tell, all PHP file operations retrieve the contents of the file in ASCII, therefore

$utf8 = iconv('windows-1252', 'utf-8', $input);

fails.

I don't think it can be done programmatically server-side.

Can anyone confirm this?

input.html


So the strange hex sequences you talked about in your first post have somehow disappeared, and now all you want to do is convert the input document from ISO 8859-1 (or Windows-1252) to UTF-8?

 

How exactly does iconv() “fail”? No, PHP's file functions are not limited to ASCII. They simply read bytes, so they work for any encoding. This works just fine:

<?php

header('Content-Type: text/html; charset=utf-8');

$input = file_get_contents('input.html');
$output = utf8_encode($input);

echo $output;
Edited by Jacques1
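
To see that the file functions hand back raw bytes rather than ASCII, it helps to look at the bytes themselves. The sketch below uses a hard-coded string as a stand-in for the contents of input.html (an assumption, since the attachment is not reproduced here) and assumes the mbstring extension is available. It also uses mb_convert_encoding() rather than utf8_encode(): utf8_encode() always treats its input as ISO 8859-1, not Windows-1252, and it is deprecated as of PHP 8.2.

```php
<?php
// Stand-in for file_get_contents('input.html'):
// the characters è ü ý ö as raw Windows-1252 bytes.
$raw = "\xE8\xFC\xFD\xF6";

// The bytes arrive unchanged; bin2hex() makes them visible.
echo bin2hex($raw), PHP_EOL; // e8fcfdf6

// These bytes are not valid UTF-8, which is why a UTF-8 page shows them as the replacement character.
var_dump(mb_check_encoding($raw, 'UTF-8')); // bool(false)

// Convert explicitly from Windows-1252 to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8), PHP_EOL; // c3a8c3bcc3bdc3b6
```

Each character now takes two bytes, and a page served with charset=utf-8 displays it correctly.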


Hi Jacques1,

thanks for your help.

The strange hex sequences were written to my output file after processing my curl input with simple_html_dom.

 

Replacing

$html = new simple_html_dom();
$html->load($result);

with

$html = new simple_html_dom();
header('Content-Type: text/html; charset=utf-8');
$html->load(utf8_encode($result));

solved my problem. All options now have the right text.

Thank you VERY much!
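
For anyone finding this later: the conversion step can be wrapped in a small helper so that pages which are already UTF-8 do not get encoded twice. This is only a sketch. The function name toUtf8 and the default source encoding are assumptions for illustration, the sample string stands in for the curl response, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper: normalise scraped bytes to UTF-8 before parsing them.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function toUtf8(string $bytes, string $assumedEncoding = 'Windows-1252'): string
{
    // Already valid UTF-8: return unchanged to avoid double encoding.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $assumedEncoding);
}

// Stand-in for the curl response body ($result in the posts above).
$result = "<option>Cr\xE8me</option>";
$clean = toUtf8($result);

// $html->load($clean); // hand the converted string to simple_html_dom as before
echo $clean, PHP_EOL;
```

The validity check is a heuristic: a Windows-1252 string can occasionally also be valid UTF-8, so if the source encoding is known for certain it is safer to convert unconditionally.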

 
