Hi everyone, I'm currently taking XML files that I download and I have to fix them before processing them, otherwise I get lots of errors from XMLReader.

 

One of the fixes is scanning for double-encoded entities, i.e. strings that look something like "&amp;amp;" or "&amp;quot;", and replacing them with what they should be, i.e. "&amp;" or "&quot;". I think this step is working fine and is not the problem (the function is ascii_fix_feed).
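For reference, the collapsing step described above can be sketched in a few lines. This is only a sketch, not the ascii_fix_feed function from this post: the helper name collapse_double_encoded_entities is made up for illustration, and it assumes that every double-encoded entity starts with "&amp;".

```php
<?php
// Hypothetical helper, not part of the original post.
// Assumption: a double-encoded entity always begins with "&amp;".
function collapse_double_encoded_entities($str)
{
    do {
        // "&amp;quot;" -> "&quot;", "&amp;#8217;" -> "&#8217;", "&amp;amp;" -> "&amp;"
        $str = preg_replace('/&amp;(#?\w+;)/', '&$1', $str, -1, $replaced);
        // Repeat until nothing changes, so triple encoding collapses as well.
    } while ($replaced > 0);

    return $str;
}

echo collapse_double_encoded_entities('Fish &amp;amp; chips, child&amp;#8217;s room'), "\n";
```

A properly encoded "&amp;" followed by ordinary text is left alone, because the pattern only matches when another entity body follows it directly.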

 

The next step is converting these to their numeric equivalents using the function convert_ascii_to_numeric_entity, so "&amp;" becomes "&#38;". Again, I think this step works fine, although I have only tested it by displaying the output in a web browser.

 

One of the lines in the XML is

<prod id="41462197"><pId>602-0015-01</pId><text><name>Pony Themed 3D Bedroom Wallpaper</name><desc>- Wow, have a complete pony themed bedroom    Transform your room into your very own horse and pony stables. With this must have magical mural, with beautiful and wonderful varieties of horses and the cutest prancing ponies and much, much more.  This Walltastic has been specially designed with a horse in a picture frame within the mural to help you learn all about the external anatomy of a horse.  Every child dreams of having their own stables, and now their dreams can come true with the perfect gift for any horse and pony mad child Walltastic&#39;s Horse and Pony Stables.  This wallpaper is a wipeable wall covering that covers any wall area up to 10ft x 8ft. Each product is in 12 pieces which means it is easily applicable and flexible according to how much space needs covering. It comes rolled up similar to wallpaper in a postal tube, includeing fitting instructions and is a great addition to any child&#8217;s bedroom, nursery, playroom or games room.</desc></text><price><buynow>31.25</buynow><delivery>4.95</delivery></price><cat><awCatId>631</awCatId><awCat>Novelty Gifts</awCat><mCat>Pony Bedroom Wallpaper for Girls who love ponies and who are horse and pony mad</mCat></cat><brand/></prod>

 

The part that matters, near the end of the description, is "&#8217;". When I run these functions in a separate test file and send the output to the browser, it works fine: the browser shows the character it is meant to be (a right single quotation mark). However, when I run the code that fixes the feed line by line and rewrites the whole file, the output is not the expected "&#8217;"; instead I get "&#226;&#128;&#153;".
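One observation that may help narrow this down: 226, 128 and 153 are exactly the three UTF-8 bytes (E2 80 99) of U+2019, the character that "&#8217;" refers to. That is what would come out if the raw character reached a conversion that works byte by byte. Whether that is what actually happens with this feed is an assumption, not a confirmed diagnosis. The sketch below only reproduces the two outputs; all function names in it are hypothetical, and the byte-range pattern is a simplified stand-in for the patterns used in this post.

```php
<?php
// Hypothetical helpers for illustration; none of these names come from the post.

// Byte-wise: without the /u flag the pattern sees single bytes,
// and ord() returns the value of one byte.
function bytes_to_entities($str)
{
    return preg_replace_callback('/[\x80-\xFF]/', function ($m) {
        return '&#' . ord($m[0]) . ';';
    }, $str);
}

// Decode one UTF-8 encoded character into its Unicode code point.
function utf8_code_point($char)
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0];
    }
    // Strip the length marker bits from the lead byte,
    // then append six payload bits from each continuation byte.
    $code = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }
    return $code;
}

// Character-wise: with /u each match is one whole UTF-8 character.
// Assumes $str is valid UTF-8 (e.g. already passed through iconv with //IGNORE).
function code_points_to_entities($str)
{
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        return '&#' . utf8_code_point($m[0]) . ';';
    }, $str);
}

$quote = "\xE2\x80\x99"; // U+2019 encoded as UTF-8
echo bytes_to_entities("child{$quote}s"), "\n";       // child&#226;&#128;&#153;s
echo code_points_to_entities("child{$quote}s"), "\n"; // child&#8217;s
```

If the mbstring extension is available, mb_encode_numericentity() should be able to do the character-wise conversion without the hand-written decoder.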

 

Functions that I am using

function ascii_fix_feed_return($str)
{
  $string = "&";
  if(is_numeric($str[4]))
  {
    $string .= "#";
  }
  $string .= $str[4] . ";";
  return $string;
}

function ascii_fix_feed($str)
{
  preg_match_all('/&(#)?([\w]+);(#)?([\w]+);/i', $str, $count);
  if(isset($count[0][0]))
  {
    $count = $count[0][0];
  }
  else
  {
    unset($count);
  }
  
  while(!empty($count))
  {
    $str = preg_replace_callback('/&(#)?([\w]+);(#)?([\w]+);/i','ascii_fix_feed_return', $str);
    preg_match_all('/&(#)?([\w]+);(#)?([\w]+);/i', $str, $count);
    if(isset($count[0][0]))
    {
      $count = $count[0][0];
    }
    else
    {
      unset($count);
    }
  }
  
  return $str;  
}

function convertAlphaEntitysToNumericEntity($entity)
{
  return '&#'.ord(html_entity_decode($entity[0])).';';
}

function convertAsciiOver127toNumericEntity($entity)
{
  if(($asciiCode = ord($entity[0])) > 127)
  {
    return '&#'.$asciiCode.';';
  }
  else
  {
    return $entity[0];
  }
}

function convert_ascii_to_numeric_entity($str)
{
  $str = preg_replace_callback('/&([\w]+);/i','convertAlphaEntitysToNumericEntity', $str);
  $str = preg_replace_callback('/[^\w]/i','convertAsciiOver127toNumericEntity', $str);
  
  return $str;
} 

 

Code that actually uses the functions

 

function xml_clean_up($file, $xml_full_file_name, $message, $fail_message, $display_safe = 0)
{
  global $log_file;

  $handle = @fopen($xml_full_file_name, "r");
  $handle2 = @fopen($xml_full_file_name.".tmp", "w");
  while (!feof($handle))
  {
    // Read the file line by line 
    $line = stream_get_line($handle, 10000, "\n");
    // Drop any byte sequences that are not valid UTF-8
    $line = iconv("UTF-8", "UTF-8//IGNORE", $line);
    // Fix the XML e.g. replace "&amp;rsquo;" with "&rsquo;"
    $line = ascii_fix_feed($line);
    // Convert named entities and characters above ASCII 127 to numeric entities
    $line = convert_ascii_to_numeric_entity($line);
    // If we want to convert the characters back then use this function as well
    if($display_safe)
    {
      // $line = advert_display_safe($line);
    }
    // End the line
    $line .= "\n"; 
    fwrite($handle2, $line, 10000);
  } 
  fclose($handle);
  fclose($handle2);

  // Rename the XML
  if(rename($xml_full_file_name . ".tmp", $xml_full_file_name))
  {      
    flog($log_file, $file . " " . $message . "\r\n");
    echo $file . " " . $message . ".\r\n";    
  }
  else
  {
    flog($log_file, $file . " " . $fail_message . "\r\n\r\n");
    exit($file . " " . $fail_message . ".\r\n\r\n");      
  }  
} 
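A side note on the read/write loop above: stream_get_line() is called with a length of 10000 and fwrite() with a length of 10000, so a line longer than that would be split on reading, and a line that grows past that size once entities are expanded would be cut short on writing. None of the lines shown here are that long, so this may never trigger. The sketch below is one way to avoid the fixed limits; rewrite_file_by_line and its $fix parameter are hypothetical names, not part of the code above.

```php
<?php
// Hypothetical helper; $fix is any callable that takes one line and returns it fixed.
function rewrite_file_by_line($source, $target, $fix)
{
    $in = fopen($source, 'r');
    if ($in === false) {
        return false;
    }
    $out = fopen($target, 'w');
    if ($out === false) {
        fclose($in);
        return false;
    }

    // fgets() without a length returns the whole line, newline included.
    while (($line = fgets($in)) !== false) {
        // Note: this normalises "\r\n" line endings to "\n".
        $body = rtrim($line, "\r\n");
        // fwrite() without a length writes the whole string.
        fwrite($out, call_user_func($fix, $body) . "\n");
    }

    fclose($in);
    fclose($out);
    return true;
}

// Possible usage with the functions from this post:
// rewrite_file_by_line($xml_full_file_name, $xml_full_file_name . '.tmp', 'convert_ascii_to_numeric_entity');
```

Checking the fopen() results also avoids looping on feof() with an invalid handle when a file cannot be opened.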

 

So if anyone knows what is going on I would be very grateful and if you need any more information just let me know.

Thanks in advance.

