Parsing XML Errors with Special Characters

dbol · June 2, 2010

Any time I try to extract the title of an article that contains special characters, such as & or ', I only get the text that comes after the last special character.

For instance, if I have this title:

<title>MarketTelegraph.com: PVSP &amp; NRDS</title>

Then my result is:

NRDS rather than MarketTelegraph.com: PVSP & NRDS

I need the entire title however.

My code is below:

<?php 

$xml_file = "http://www.meltwaternews.com/magenta/xml/html/37/10/150061.html.XML"; 

$xml_title_key = "*FEEDS*FEED*DOCUMENTS*DOCUMENT*TITLE";
$xml_url_key = "*FEEDS*FEED*DOCUMENTS*DOCUMENT*URL";
$xml_ingress_key = "*FEEDS*FEED*DOCUMENTS*DOCUMENT*INGRESS";
$xml_date_key = "*FEEDS*FEED*DOCUMENTS*DOCUMENT*CREATEDATE5";
$xml_source_key = "*FEEDS*FEED*DOCUMENTS*DOCUMENT*SOURCENAME";

$document_array = array(); 

$counter = 0; 
class xml_document{ 
    var $title, $url, $ingress, $date, $source, $dateMonth, $dateDay, $dateYear;
} 

function startTag($parser, $data){ 
    global $current_tag; 
    $current_tag .= "*$data"; 
} 

function endTag($parser, $data){ 
    global $current_tag; 
    $tag_key = strrpos($current_tag, '*'); 
    $current_tag = substr($current_tag, 0, $tag_key); 
} 

function contents($parser, $data){ 
    global $current_tag, $xml_title_key, $xml_url_key, $xml_ingress_key, $xml_date_key, $xml_source_key, $counter, $document_array; 
    switch($current_tag){ 
        case $xml_title_key: 
            $document_array[$counter] = new xml_document(); 
            $document_array[$counter]->title = $data; 
            break; 
        case $xml_url_key:  
            $document_array[$counter]->url = $data; 
            break; 
        case $xml_ingress_key: 
            $document_array[$counter]->ingress = $data; 
            break; 
        case $xml_date_key: 
            $document_array[$counter]->date = $data; 
            break;
        case $xml_source_key:
            $document_array[$counter]->source = $data;
            $counter++;
            break;
    } 
}

$xml_parser = xml_parser_create(); 

xml_set_element_handler($xml_parser, "startTag", "endTag"); 

xml_set_character_data_handler($xml_parser, "contents"); 

$fp = fopen($xml_file, "r") or die("Could not open file");

function remotefsize($furl) {
        $sch = parse_url($furl, PHP_URL_SCHEME);
        if (($sch != "http") && ($sch != "https") && ($sch != "ftp") && ($sch != "ftps")) {
            return false;
        }
        if (($sch == "http") || ($sch == "https")) {
            $headers = get_headers($furl, 1);
            if ((!array_key_exists("Content-Length", $headers))) { return false; }
            return $headers["Content-Length"];
        }
    }

$data = fread($fp, remotefsize("http://www.meltwaternews.com/magenta/xml/html/37/10/150061.html.XML")) or die("Could not read file"); 

if(!(xml_parse($xml_parser, $data, feof($fp)))){ 
    die("Error on line " . xml_get_current_line_number($xml_parser)); 
} 

xml_parser_free($xml_parser); 

fclose($fp); 

?> 

<html> 
<head> 
<title>Project: Parse XML 4</title> 
</head>
<body bgcolor="#FFFFFF">
<center><h1>Title</h1></center>
<br/>
<table align="center">
<tr>
<td width="600">
<?php
for($x=0;$x<count($document_array);$x++){ 
    echo "<b>\t" . $document_array[$x]->title . "</b>\n<br/>";
    $newdate = date('m/j/Y',strtotime($document_array[$x]->date));
    echo "\t" . $newdate . " | " . $document_array[$x]->source . "\n<br/>";
    echo "\t" . $document_array[$x]->ingress . "\n<br/>";
    echo "\t<a href='" . $document_array[$x]->url . "'>"  . $document_array[$x]->url . "</a>\n<br/><br/>";
} 
?>
</td>
</tr>
</table>

</body> 
</html>

Ideas?

pornophobic · June 2, 2010

htmlentities()

htmlspecialchars()

dbol · June 2, 2010

Looks promising, but perhaps I'm implementing incorrectly.

<?php
for($x=0;$x<count($document_array);$x++){
    $newtitle = htmlspecialchars_decode($document_array[$x]->title, ENT_QUOTES);
    echo "\t" . $newtitle . "\n<br/>";
    echo "<b>\t" . $document_array[$x]->title . "</b>\n<br/>";
    $newdate = date('m/j/Y',strtotime($document_array[$x]->date));
    echo "\t" . $newdate . " | " . $document_array[$x]->source . "\n<br/>";
    echo "\t" . $document_array[$x]->ingress . "\n<br/><br/>";
} 
?>

The output of lines 4 and 5 is identical. I've tried both htmlspecialchars and htmlspecialchars_decode. Both produce the same results, shown here next to what the title looks like in the XML element:

amp;T June 7th (after htmlspecialchars)

amp;T June 7th (before htmlspecialchars)

<title>Metered Data Plans, Tethering Coming To AT&amp;T June 7th</title> (the xml element)

Metered Data Plans, Tethering Coming To AT&T June 7th (what I'm seeking to output)

Additional info:

single quotes are displaying as ' (in xml)

double quotes are displaying as " (in xml)

ampersands are displaying as &amp; (in xml)

These three special characters cause the title to come out truncated: the output starts after the last occurrence of any of them in the title.

So if the title is: This & is "the title" now, only now is output as the title.
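A possible explanation, not confirmed anywhere in this thread: PHP's event-based XML parser may call the character data handler more than once for a single element, typically splitting the text around each entity. The contents() handler above assigns with =, so each call would overwrite the previous chunk and only the last one survives, which matches the symptom described. The sketch below assumes that is the cause and appends with .= instead. The collect_titles() function and its sample input are hypothetical, written only for illustration.

```php
<?php
// Hypothetical helper: returns the full text of every <title> element.
// Assumption: the parser may deliver one element's text in several chunks.
function collect_titles($xml)
{
    $titles  = array();
    $inTitle = false;
    $buffer  = '';

    $parser = xml_parser_create('UTF-8');

    xml_set_element_handler(
        $parser,
        // Start tag: names arrive upper-cased (case folding is on by default).
        function ($parser, $name, $attrs) use (&$inTitle, &$buffer) {
            if ($name === 'TITLE') {
                $inTitle = true;
                $buffer  = '';
            }
        },
        // End tag: the title is complete only now, so store it here.
        function ($parser, $name) use (&$inTitle, &$buffer, &$titles) {
            if ($name === 'TITLE') {
                $titles[] = $buffer;
                $inTitle  = false;
            }
        }
    );

    xml_set_character_data_handler(
        $parser,
        function ($parser, $data) use (&$inTitle, &$buffer) {
            if ($inTitle) {
                $buffer .= $data; // append each chunk instead of overwriting
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(
            'XML error on line ' . xml_get_current_line_number($parser)
        );
    }

    return $titles;
}
```

Calling collect_titles() on a feed string should give back each title whole, with entities already decoded. Pass each title through htmlspecialchars() before echoing it into HTML.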

codebyren · June 2, 2010

Have you tried something simple to make sure things are working as expected? Like this:

$test = "AT&amp;T caps data plan, introduces iPhone tethering";
$decoded_string = htmlspecialchars_decode($test);
echo $decoded_string; // outputs "AT&T caps data plan, introduces iPhone tethering"

Sorry for going a bit off track here, but do you not have access to PHP5? It would make things a LOT easier on you. For example:

<?php
// Get the XML source
$xml = file_get_contents("http://www.meltwaternews.com/magenta/xml/html/37/10/150061.html.XML");

// Make it PHP5 friendly
$simplexml = simplexml_load_string($xml);

// Get all news documents as a SimpleXML object that you can loop through
$documents = $simplexml->feed->documents->document; // the arrows (->) pretty much follow the XML nesting
?>

Then in the HTML:

<html>
<head>
<title>Project: Parse XML 4</title>
</head>
<body bgcolor="#FFFFFF">
<center><h1>Title</h1></center>
<br/>
<table align="center">
<tr>
<th>Title</th>
<th>Date</th>
<th>Source</th>
<th>Ingress</th>
<th>link</th>
</tr>
<?php foreach ($documents as $document) : ?>
<tr>
<td><?php echo $document->title; ?></td>
<td><?php echo date('m/j/Y', strtotime($document->createDate)); ?></td>
<td><?php echo $document->sourcename; ?></td>
<td><?php echo $document->ingress; ?></td>
<td><a href="<?php echo $document->url; ?>">read</a></td>
</tr>
<?php endforeach;?>
</table>
</body>
</html>

I hope this helps...
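For anyone who wants to try this without the remote feed, here is a minimal self-contained sketch. SimpleXML hands back the text of an element already decoded and in one piece, so a title containing & arrives whole; the remaining step is to escape it again when printing into HTML. The feed/documents/document/title nesting is taken from the code above, while the html_safe_titles() name and the sample XML are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns every document title, escaped for HTML output.
// Assumes the feeds > feed > documents > document > title nesting used above.
function html_safe_titles($xml)
{
    $feeds = simplexml_load_string($xml);
    if ($feeds === false) {
        throw new RuntimeException('Could not parse the XML');
    }

    $safe = array();
    foreach ($feeds->feed->documents->document as $document) {
        // Casting to string yields the decoded text, e.g. AT&T rather than AT&amp;T.
        $title = (string) $document->title;

        // Re-escape for HTML so & and quotes display literally in the page.
        $safe[] = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    }

    return $safe;
}
```

Each returned string can be echoed straight into a table cell. The same escaping would be worth applying to the ingress, source and URL fields.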

dbol · June 2, 2010

This helps immensely.

Thanks so much.

dbol · June 2, 2010

I modified the script to parse an RSS2 feed, because our RSS filters out duplicate articles, whereas the XML does not.

The only problem now is that single quotes in the RSS come out garbled in the PHP output. It doesn't happen with normal single quotes; it occurs when the single quotes are slanted, like ‘ and ’, not '. How can I adjust to account for those quotes?

Edit: The RSS is encoded in UTF-8.
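
The thread ends without an answer to this. One common cause, offered here as an assumption because the page's headers are not shown, is that the browser reads the UTF-8 bytes as ISO-8859-1 or Windows-1252, so each slanted quote (three bytes in UTF-8) shows up as several odd characters. Declaring the page as UTF-8 usually fixes that. If the page charset cannot be changed, a fallback is to turn the slanted quotes into numeric character references. Both function names below are hypothetical.

```php
<?php
// Option 1 (usually enough): tell the browser the page is UTF-8.
// Must run before any output is sent.
function send_utf8_header()
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
}

// Option 2 (fallback): replace UTF-8 slanted quotes with numeric character
// references, which display correctly whatever charset the page declares.
// Apply this after htmlspecialchars(), or the & of each reference gets escaped.
function curly_quotes_to_entities($utf8Text)
{
    return strtr($utf8Text, array(
        "\xE2\x80\x98" => '&#8216;', // left single quotation mark
        "\xE2\x80\x99" => '&#8217;', // right single quotation mark
        "\xE2\x80\x9C" => '&#8220;', // left double quotation mark
        "\xE2\x80\x9D" => '&#8221;', // right double quotation mark
    ));
}
```

Straight quotes and all other text pass through unchanged, so the helper can be applied to every title and ingress before output.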
