
converting escaped characters


flounder
Solved by Jacques1

Hello all,

With permission, I scraped a website using curl and simple_html_dom to retrieve 6342 links from 112 pages.

While scraping, I converted the links to options for a select element. Most of the options display properly.

Here's the problem: there are some ISO 8859-1 hexadecimal encoded characters in the HTML source files, which display as string literals inside options.

$input = "<option>Cr\E8me</option>";
$input = str_replace("\E8", "è", $input);

 does not work.

 

How do I turn "<option>Cr\E8me</option>" into "<option>Crème</option>"

 

Any suggestions?

 

TIA.
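
For literal escapes like the one above, a regular-expression callback is one possible approach. This is a minimal sketch, assuming every escape is a backslash followed by exactly two hex digits naming an ISO 8859-1 byte; `decodeHexEscapes` is a hypothetical helper name, and the mbstring extension is assumed to be available for `mb_chr()`.

```php
<?php
// Hypothetical helper: turns literal "\E8"-style escapes into UTF-8 characters.
// Assumption: each escape is a backslash plus exactly two hex digits (ISO 8859-1).
function decodeHexEscapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\([0-9A-Fa-f]{2})/',            // a literal backslash, then two hex digits
        static function (array $m): string {
            // ISO 8859-1 byte values equal the Unicode code points U+0000..U+00FF.
            return mb_chr(hexdec($m[1]), 'UTF-8');
        },
        $text
    );
}

echo decodeHexEscapes('<option>Cr\E8me</option>'), PHP_EOL;
```

If the escapes actually name Windows-1252 bytes, values in the range 80 to 9F would need a real charset conversion instead of this direct code-point mapping.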


Thanks for your response.

 

The pages I scrape don't have a DTD or a declared character encoding.

Firefox displays the pages OK in Quirks mode.

My code editor identifies the encoding as Windows-1252.

So I created a page containing a few of the problem characters (è, ö, ü and ý) and saved it in Windows-1252 encoding (attached).

This works on my terminal:

iconv -f WINDOWS-1252 -t UTF-8 input.html

outputting

è
ü
ý
ö

to my screen, but server-side:

// 1. fopen() / fgets()
$file = fopen("input.html", "r");
while (!feof($file)) { echo fgets($file); }
fclose($file);
// 2. file()
foreach (file('input.html') as $line) { echo $line; }
// 3. file_get_contents()
echo file_get_contents('input.html');

All return � � � �

As far as I can tell, all PHP file operations retrieve the contents of the file in ASCII, therefore

$utf8 = iconv('windows-1252', 'utf-8', $input);

fails.

I don't think it can be done programmatically server-side.

 

Can anyone confirm this?

input.html
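
One way to test the ASCII assumption is to look at the raw bytes a file function returns. The sketch below writes the four sample characters as Windows-1252 bytes to a scratch file (a stand-in for input.html, created with `tempnam()`), reads them back and hex-dumps them before and after `iconv()`. It assumes the iconv and mbstring extensions are enabled.

```php
<?php
// The four sample characters (è, ü, ý, ö) as single Windows-1252 bytes.
$bytes = "\xE8\xFC\xFD\xF6";

// Scratch file standing in for input.html (assumption for this sketch).
$path = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($path, $bytes);
$read = file_get_contents($path);
unlink($path);

// The bytes come back unchanged; nothing is reduced to ASCII.
echo bin2hex($read), PHP_EOL;                 // e8fcfdf6

// They are not valid UTF-8, so a browser decoding the page as UTF-8 shows U+FFFD.
var_dump(mb_check_encoding($read, 'UTF-8'));  // bool(false)

// Converting the same bytes yields two-byte UTF-8 sequences.
$utf8 = iconv('WINDOWS-1252', 'UTF-8', $read);
echo bin2hex($utf8), PHP_EOL;                 // c3a8c3bcc3bdc3b6
```

If the converted output still shows replacement characters in the browser, the charset the page is served with is worth checking next.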


  • Solution

So the strange hex sequences you talked about in your first post have somehow disappeared, and now all you want to do is convert the input document from ISO 8859-1 (or Windows-1252) to UTF-8?

 

How exactly does iconv() “fail”? No, PHP's file functions are not limited to ASCII. They simply read bytes, so they work for any encoding. This works just fine:

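<?php

// Tell the browser that the output is UTF-8.
header('Content-Type: text/html; charset=utf-8');

// file_get_contents() returns the raw bytes of the file.
$input = file_get_contents('input.html');
// utf8_encode() converts ISO 8859-1 to UTF-8.
$output = utf8_encode($input);

echo $output;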
Edited by Jacques1
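
Note that utf8_encode() treats its input as ISO 8859-1 and has been deprecated since PHP 8.2. For readers on newer PHP versions, the following is a possible alternative rather than the code from the post above: a minimal sketch using `mb_convert_encoding()`, assuming the mbstring extension is available and the source really is Windows-1252. `win1252ToUtf8` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: converts Windows-1252 bytes to UTF-8.
// Assumption: the input really is Windows-1252 (mbstring required).
function win1252ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

// "\xE8" is è; "\x80" is the euro sign, which exists in Windows-1252
// but not in ISO 8859-1.
$sample = "Cr\xE8me \x80 5";

echo win1252ToUtf8($sample), PHP_EOL;
```

Naming the source encoding explicitly also covers the bytes 80 to 9F, where Windows-1252 and ISO 8859-1 differ.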

Hi Jacques1,

 

thanks for your help.

 

The strange hex sequences were written to my output file after processing my curl input with simple_html_dom.

 

Replacing

$html = new simple_html_dom();
$html->load($result);

with

$html = new simple_html_dom();
header('Content-Type: text/html; charset=utf-8');
$html->load(utf8_encode($result));

solved my problem. All options now have the right text.

 

Thank you VERY much!
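
One caveat when scraping many pages: if some of them are already UTF-8, encoding them a second time garbles the accented characters. A guard along these lines may help. It is a sketch only; `ensureUtf8` is a hypothetical helper, and Windows-1252 as the fallback encoding is an assumption based on what the code editor reported earlier in the thread.

```php
<?php
// Hypothetical helper: converts to UTF-8 only when the input is not valid UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function ensureUtf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;                       // already UTF-8 (or plain ASCII)
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$legacy = "Cr\xE8me";        // Windows-1252 bytes
$modern = "Cr\xC3\xA8me";    // the same text as UTF-8

var_dump(ensureUtf8($legacy) === $modern);   // bool(true)
var_dump(ensureUtf8($modern) === $modern);   // bool(true)
```

The result of `ensureUtf8($result)` could then be passed to `$html->load()` in place of `utf8_encode($result)`.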

 
