
XML Parsing Problem, Data Cleaning


l008com


Hello

I'm writing a PHP XML parser that loads data into a MySQL database, using just the basic XML functions. The files I'm parsing are huge, so I can't use one of the easier XML parsing methods that load the whole file into memory. Instead, I read the file line by line and pass each line to the parser. I keep getting invalid character errors. Running each line through utf8_encode() got me a little further. Adding a function that replaces every & with &amp; (apparently the XML file has a bug and this isn't already done) got me a little further still. But it's still not making it anywhere near all the way through the file. The file I'm trying to parse right now is 256 MB, and once this is squared away I have an even bigger one that I want to build a parser for and run once a week.

 

So what I really need is a way to clean the XML data before giving it to the parser. With MySQL, I use the mysql_real_escape_string() function before passing it data, and that works nicely. Is there any function or technique I can use to do the same kind of cleaning on lines of an XML file? Even if it just deletes "illegal" characters?
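To make the question concrete, here is a rough sketch of the kind of cleaning step I have in mind. The function name clean_xml_line() is made up for illustration, and the sketch assumes the line is already valid UTF-8. It deletes code points that fall outside the character range allowed by XML 1.0, then escapes any & that does not already start an entity or character reference, so existing &amp; sequences are not escaped twice.

```php
<?php
// Hypothetical helper, not part of the original script.
// Assumption: $line is already valid UTF-8.
function clean_xml_line(string $line): string
{
    // Delete every code point outside the XML 1.0 "Char" range:
    // tab, LF, CR, U+0020..U+D7FF, U+E000..U+FFFD, U+10000..U+10FFFF.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $line
    );

    // With the /u modifier, preg_replace() returns null on invalid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Escape only ampersands that do not begin something shaped like
    // &name; or &#123; or &#x1F; so existing references stay intact.
    $escaped = preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $clean
    );

    return $escaped ?? $clean;
}
```

The idea would be to call this on each line just before xml_parse(). Two caveats with this sketch: working line by line, it would also rewrite ampersands inside CDATA sections, and it leaves named entities such as &nbsp; alone even though the parser may not know them.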

 

 

Here's my script, FYI.

The XML file is in a very simple format; it's just very, very long.

 

<?

set_time_limit(0); 

$insiderest = false; 

$tag = ""; 

$title = ""; 
$country = ""; 
$address = ""; 
$city = ""; 
$state = ""; 
$zip = ""; 
$phone = ""; 
$url = ""; 

// Create an XML parser 
$xml_parser = xml_parser_create(); 

// Set the functions to handle opening and closing tags 
xml_set_element_handler($xml_parser, "startElement", "endElement"); 

// Set the function to handle blocks of character data 
xml_set_character_data_handler($xml_parser, "characterData"); 

// Open the XML file for reading 
$fp = fopen("/path/to/zmlfile.txt","r") 
or die("Error opening xml file."); 

// Read the XML file one line at a time 
while ($data = fgets($fp,4096))
{
//Clean String for XML parser
$data = utf8_encode($data);
$data = str_replace("&","&amp;",$data);

// Parse each line with the XML parser created above 
xml_parse($xml_parser, $data, feof($fp))
// Handle errors in parsing 
or die("XML error: ".xml_error_string(xml_get_error_code($xml_parser))."(".xml_get_error_code($xml_parser).") at line ".xml_get_current_line_number($xml_parser)."++$data++");
}

// Close the XML file 
fclose($fp); 

// Free up memory used by the XML parser 
xml_parser_free($xml_parser);



function startElement($parser, $tagName, $attrs)
{
global $insiderest, $tag;

if ($insiderest)
{
$tag = $tagName;
}
elseif ($tagName == "RESTAURANT")
{
$insiderest = true;
}
}

function characterData ($parser, $data)
{
global $insiderest, $tag, $title, $country, $address, $city, $state, $zip, $phone, $url;

if ($insiderest)
{

switch ($tag)
{ 
case "D:TITLE": 
$title .= mysql_real_escape_string(trim($data)); 
break; 
case "COUNTRY": 
$country .= mysql_real_escape_string(trim($data)); 
break; 
case "ADDRESS": 
$address .= mysql_real_escape_string(trim($data)); 
break; 
case "CITY": 
$city .= mysql_real_escape_string(trim($data)); 
break; 
case "STATE": 
$state .= mysql_real_escape_string(trim($data)); 
break; 
case "ZIP": 
$zip .= mysql_real_escape_string(trim($data)); 
break; 
case "PHONE": 
$phone .= mysql_real_escape_string(trim($data)); 
break; 
case "URL": 
$url .= mysql_real_escape_string(trim($data)); 
break; 
} 
}
}

function endElement($parser, $tagName)
{
global $insiderest, $tag, $title, $country, $address, $city, $state, $zip, $phone, $url,$xml_parser;

if ($tagName == "RESTAURANT")
{

$query = "INSERT INTO `chefmoz_list` (name,country,address,city,state,zip,phone,url) VALUES('$title','$country','$address','$city','$state','$zip','$phone','$url')";
mysql_query($query);
echo "Insert restuarant `$title` into database [".mysql_insert_id()."]\n";

$insiderest = false;
$title = ""; 
$country = ""; 
$address = ""; 
$city = ""; 
$state = ""; 
$zip = ""; 
$phone = ""; 
$url = ""; 
}
}
?>


  • 3 months later...

Hi there,

 

I am having the exact same issue: I am getting the "Invalid character" error.

I just want to ignore those invalid characters and insert the rest of the text into the database.

I don't even know which characters are invalid. If I knew, I could replace them with an empty string and pass the result to the XML parser. How can I determine which characters are invalid?
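One way to find out might be to scan the text and list every character the XML 1.0 specification does not allow. The following is only a sketch under assumptions: find_illegal_xml_chars() is a hypothetical name, and the input is assumed to be valid UTF-8. It returns the byte offset and the hex bytes of each offending character, which should be enough to see what needs replacing.

```php
<?php
// Hypothetical diagnostic helper, for illustration only.
// Assumption: $text is valid UTF-8.
function find_illegal_xml_chars(string $text): array
{
    // Matches any single code point outside the XML 1.0 "Char" range.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $count = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    if ($count === false) {
        // With the /u modifier the match fails on invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $found = [];
    foreach ($matches[0] as [$char, $offset]) {
        $found[] = [
            'offset' => $offset,                      // byte offset in $text
            'hex'    => strtoupper(bin2hex($char)),   // raw bytes of the character
        ];
    }
    return $found;
}

// Example: print what was found in one suspicious line.
foreach (find_illegal_xml_chars("ab\x00cd\x1F") as $hit) {
    echo "offset {$hit['offset']}: 0x{$hit['hex']}\n";
}
```

If the function throws instead, the problem is more likely the file's encoding than individual illegal characters.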

 

Hi l008com,

Did you have any success with this issue or did you find any other solutions?

 

Please help!! I badly need to solve my issue ASAP.

 

Thanks !!
