
Problem with PHP, XML and UTF-8


gladtobegrey


All the HTML, PHP and XML files on my website are encoded as 'UTF-8 without BOM' using Notepad++

 

All the HTML and PHP pages contain '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'

 

All the XML files contain '<?xml version="1.0" encoding="utf-8"?>'
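A quick way to double-check the encoding claims above is to inspect the raw bytes rather than trust the editor's status bar. The following is only a sketch: `inspect_utf8()` is a hypothetical helper name, and the sample strings stand in for the real file contents.

```php
<?php
// Hypothetical helper (not part of the site's code): reports whether a byte
// string starts with a UTF-8 BOM and whether it is well-formed UTF-8.
function inspect_utf8($bytes) {
    return array(
        // EF BB BF is the UTF-8 byte order mark
        'has_bom'    => substr($bytes, 0, 3) === "\xEF\xBB\xBF",
        // An empty pattern with the /u modifier only matches valid UTF-8 subjects
        'valid_utf8' => preg_match('//u', $bytes) === 1,
    );
}

$good     = "Only \xC2\xA34.99";       // pound sign as two UTF-8 bytes
$latin1   = "Only \xA34.99";           // pound sign as one ISO-8859-1 byte
$with_bom = "\xEF\xBB\xBF" . $good;    // same text saved "with BOM"

$r_good   = inspect_utf8($good);
$r_latin1 = inspect_utf8($latin1);
$r_bom    = inspect_utf8($with_bom);

// For a real file, something like: inspect_utf8(file_get_contents('offers.xml'))
```

If `has_bom` came back true or `valid_utf8` came back false for any of the site's files, the editor setting would not be telling the whole story.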

 

One of the webpages ('offers.php') contains a mix of HTML code plus some PHP code to read an XML file ('offers.xml') and generate a list of special offers with prices.

 

The first <offer> element of 'offers.xml' contains:

<offers>

<offer>
    <title>'Special Lunchtime Menu' Offer[^]Only £4.99</title>
    <image>board.png</image>
    <text>
    [p]Choose any of the following:[/p]
    [p][b]PIZZA and SALAD[/b][^]
    - Margherita[^]
    - Ham and Mushroom[^]
    - Pepperoni[^]
    - Vegetarian[/p]
    [p][i][b]OR[/b][/i][/p]
    [p][b]PASTA and GARLIC BREAD[/b][^]
    - Spaghetti Bolognaise[^]
    - Penne Arrabiata[^]
    - Spaghetti Carbonara[^]
    - Risotto[/p]
    [p][i][b]OR[/b][/i][/p]
    [p][b]GRILLED CHICKEN SALAD[/b][/p]
    </text>
</offer>

 

The PHP code parses the file and generates HTML output. The characters between square brackets are converted to HTML tags by a preg_replace() regex as part of the process, e.g. '[^]' becomes '<br />' (don't get hung up on this, I have my reasons).

 

The relevant chunk of the parser code is here:

 

function tag_contents($parser, $data) {
    global $source, $current_tag;
    // Bracket markup: [^] = line break, [~] = space, [ and ] = tag delimiters; tabs are dropped
    $patterns = array ("/\[\^\]/u","/\[\~\]/u","/\[/u","/\]/u","/\t/u");
    $replaces = array ("<br />"," ","<",">","");
    // Escape the raw text first, then turn the bracket markup into real tags
    $result = htmlentities($data, ENT_COMPAT, 'UTF-8');
    $newres = preg_replace($patterns, $replaces, $result);
    //echo '$data="'.$data.'"('.strlen($data).'), $result="'.$result.'"('.strlen($result).'"),     $newres="'.$newres.'"('.strlen($newres).')'."\n\r";
    // Emit one <div> per call, chosen by the element that is currently open
    switch ($current_tag) {
        case "IMAGE":
            echo '<div class="offerimg"><img src="images/'.$newres.'" alt="" /></div>';
            break;
        case "TITLE":
            echo '<div class="offertitle">'.$newres.'</div>'."\n\r";
            break;
        case "TEXT":
            echo '<div class="offertext">'.$newres.'</div>'."\n\r";
            break;
    }
}
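To rule out the conversion step itself, the same two calls can be run on the complete title string outside the parser. This is a sketch only: `convert_markup()` is a hypothetical wrapper name, and the '[~]' pattern is left out for brevity.

```php
<?php
// Hypothetical wrapper around the same two steps used in tag_contents():
// escape with htmlentities(), then turn the bracket markup into tags.
function convert_markup($data) {
    $patterns = array('/\[\^\]/u', '/\[/u', '/\]/u', '/\t/u');
    $replaces = array('<br />', '<', '>', '');
    $escaped  = htmlentities($data, ENT_COMPAT, 'UTF-8');
    return preg_replace($patterns, $replaces, $escaped);
}

// The title text from offers.xml, with the pound sign written as UTF-8 bytes
$title = "'Special Lunchtime Menu' Offer[^]Only \xC2\xA34.99";

$html   = convert_markup($title);
$breaks = substr_count($html, '<br />');

echo $html, "\n";    // 'Special Lunchtime Menu' Offer<br />Only &pound;4.99
echo $breaks, "\n";  // 1
```

Given the whole string in one go, the conversion produces only the one intended break, which suggests the extra break comes from how the text reaches tag_contents() rather than from the regex.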

 

My problem is that the output HTML always contains a spurious line-break immediately before the '£' character.

Previously it was outputting an A-umlaut before the '£'. That has gone away since I added the htmlentities() call, but I cannot work out how to get rid of the unwanted line break.
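A stray accented A in front of '£' is the usual symptom of the two UTF-8 bytes for the pound sign being read as two single-byte Latin-1 characters (the first byte, 0xC2, is normally shown as 'Â', an A-circumflex rather than an umlaut). The sketch below reproduces that effect with htmlentities() alone; the variable names are illustrative.

```php
<?php
$pound = "\xC2\xA3"; // the two UTF-8 bytes that encode the pound sign

// Told the input is UTF-8, htmlentities() sees one character
$as_utf8 = htmlentities($pound, ENT_COMPAT, 'UTF-8');

// Told the input is ISO-8859-1, it sees two characters: 0xC2 and 0xA3
$as_latin1 = htmlentities($pound, ENT_COMPAT, 'ISO-8859-1');

echo $as_utf8, "\n";    // &pound;
echo $as_latin1, "\n";  // &Acirc;&pound;
```

This would be consistent with the stray character disappearing once htmlentities() was called with 'UTF-8' as the charset.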

 

So, where I'd expect to see:

 

    'Special Lunchtime Menu Offer'

              Only £4.99

 

... I'm getting

 

    'Special Lunchtime Menu Offer'

                Only

                £4.99
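One possible explanation, offered as a sketch rather than a confirmed diagnosis: event-based XML parsers such as the one behind xml_parse() may deliver the text of a single element in several pieces, and a non-ASCII character like '£' is a common place for the split. If tag_contents() is called once for the text up to "Only " and again for "£4.99", it would echo two separate 'offertitle' divs, which the browser shows on separate lines. The sketch below buffers the pieces and emits the div once, in the end-element handler. The handler names `start_tag` and `end_tag`, the `$buffer`, `$chunks` and `$output` variables, and the inline sample XML are all assumptions for illustration.

```php
<?php
$buffer = '';   // text collected for the element currently open
$chunks = 0;    // how many times the character data handler was called
$output = '';   // generated HTML

function start_tag($parser, $name, $attrs) {
    global $buffer;
    $buffer = '';            // new element: start with an empty buffer
}

function tag_contents($parser, $data) {
    global $buffer, $chunks;
    $chunks++;               // may be called more than once per element
    $buffer .= $data;        // only collect here, do not echo yet
}

function end_tag($parser, $name) {
    global $buffer, $output;
    // Case folding is on by default, so element names arrive upper-cased
    if ($name === 'TITLE') {
        $html = htmlentities($buffer, ENT_COMPAT, 'UTF-8');
        $html = preg_replace(array('/\[\^\]/u', '/\[/u', '/\]/u'),
                             array('<br />', '<', '>'), $html);
        $output .= '<div class="offertitle">' . $html . '</div>' . "\n";
    }
    $buffer = '';
}

// Inline stand-in for offers.xml, with the pound sign as UTF-8 bytes
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . "<offers><offer><title>'Special Lunchtime Menu' Offer[^]Only \xC2\xA34.99</title></offer></offers>";

$parser = xml_parser_create('UTF-8');
// Ask for UTF-8 output explicitly, since the default has differed between PHP versions
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($parser, 'start_tag', 'end_tag');
xml_set_character_data_handler($parser, 'tag_contents');
xml_parse($parser, $xml, true);

echo $output;  // <div class="offertitle">'Special Lunchtime Menu' Offer<br />Only &pound;4.99</div>
```

The number of calls counted in `$chunks` depends on the PHP version and the underlying XML library, so the sketch does not rely on it; collecting the text and emitting it in the end handler gives a single div however the text is split.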

 

I've done quite a bit of browsing and trying various solutions, but am now beginning to tear my hair out.  I'm probably missing something stupidly obvious, but clearly cannot see the problem.

 

I'm testing on XAMPP under WinXP SP3, with PHP 5.3.1 and Apache 2.2.14.  Unfortunately I am constrained to running live under PHP 4.4.2.  However, I'm seeing the same issue in that environment as in my test environment.

 

Any help would be very gratefully received.

 

 
