Thanks for your response.
The pages I scrape don't have a DTD or a declared character encoding.
Firefox displays the pages OK in Quirks mode.
My Code Editor identifies the encoding as windows-1252.
So I created a page containing a few of the problem characters (è, ö, ü and ý) and saved it in windows-1252 encoding (attached). This works on my terminal:
iconv -f WINDOWS-1252 -t UTF-8 input.html
outputting è ü ý ö to my screen, but server-side:
// 1. fopen() / fgets()
$file = fopen("input.html", "r");
while (!feof($file)) { echo fgets($file); }
fclose($file);
// 2. file()
$lines = file('input.html');
foreach ($lines as $line) { echo $line; }
// 3. file_get_contents()
echo file_get_contents('input.html');
All three return � � � �. As far as I can tell, all PHP file operations retrieve the contents of the file as ASCII, therefore
$utf8 = iconv('windows-1252', 'utf-8', $input);
fails. I don't think it can be done programmatically server-side.
Can anyone confirm this?
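One way to check what the read and the conversion actually produce is to dump the bytes as hex, which does not depend on how the browser or terminal decodes the output. This is only a minimal sketch, assuming the iconv extension is enabled; read_as_utf8() and the temporary file are hypothetical stand-ins for input.html, and the header() call is an assumption that was not part of the tests above.

```php
<?php
// Hypothetical helper (not from the tests above): read a file's bytes
// and convert them from windows-1252 to UTF-8.
function read_as_utf8(string $path): string
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $utf8 = iconv('WINDOWS-1252', 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Conversion failed for $path");
    }
    return $utf8;
}

// Stand-in for input.html: è ö ü ý written as windows-1252 bytes.
$sample = tempnam(sys_get_temp_dir(), 'cp1252');
file_put_contents($sample, "\xE8 \xF6 \xFC \xFD");

$converted = read_as_utf8($sample);
unlink($sample);

// Assumption: when served over HTTP, declare the charset before any output.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hex dump of the converted bytes, independent of the display encoding.
echo bin2hex($converted), "\n";
```

If the dump shows two-byte sequences such as c3a8 for è, the conversion itself worked and the � characters would come from how the output is being decoded for display; if it shows single bytes such as e8, the conversion never ran on the data.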
input.html