l008com
Posted June 20, 2007

Hello,

I'm writing a PHP XML parser that loads data into a MySQL database, using just the basic XML functions. The files I'm parsing are huge, so I can't use one of the easier XML parsing methods that load the whole file into memory. Instead, I read the file line by line and pass each line to the parser.

I keep getting invalid character errors. Running each line through utf8_encode() got it a little further. Then I added a call that replaces every & with &amp; (the XML file apparently has a bug in that this isn't already done), and it got a little further again. But it's still not making it anywhere near all the way through the file. This particular file is 256 MB, and once this is squared away I have an even bigger one that I want to build a parser for and run once a week.

So what I really need is a way to clean the XML data before giving it to the parser. With MySQL, I run data through mysql_real_escape_string() before handing it over, and that works nicely. Is there a function or technique that does the same kind of cleaning for lines of an XML file? Even if it just deletes "illegal" characters?

Here's my script, FYI. The XML file is a very simple format; it's just very, very long.

    <?php
    set_time_limit(0);

    $insiderest = false;
    $tag = "";
    $title = "";
    $country = "";
    $address = "";
    $city = "";
    $state = "";
    $zip = "";
    $phone = "";
    $url = "";

    // Create an XML parser
    $xml_parser = xml_parser_create();

    // Set the functions to handle opening and closing tags
    xml_set_element_handler($xml_parser, "startElement", "endElement");

    // Set the function to handle blocks of character data
    xml_set_character_data_handler($xml_parser, "characterData");

    // Open the XML file for reading
    $fp = fopen("/path/to/zmlfile.txt", "r") or die("Error opening xml file.");

    // Read the XML file one line at a time
    while ($data = fgets($fp, 4096)) {
        // Clean string for XML parser
        $data = utf8_encode($data);
        $data = str_replace("&", "&amp;", $data);

        // Parse each line with the XML parser created above
        xml_parse($xml_parser, $data, feof($fp))
            // Handle errors in parsing
            or die("XML error: " . xml_error_string(xml_get_error_code($xml_parser))
                . "(" . xml_get_error_code($xml_parser) . ") at line "
                . xml_get_current_line_number($xml_parser) . "++$data++");
    }

    // Close the XML file
    fclose($fp);

    // Free up memory used by the XML parser
    xml_parser_free($xml_parser);

    function startElement($parser, $tagName, $attrs) {
        global $insiderest, $tag;
        if ($insiderest) {
            $tag = $tagName;
        } elseif ($tagName == "RESTAURANT") {
            $insiderest = true;
        }
    }

    function characterData($parser, $data) {
        global $insiderest, $tag, $title, $country, $address, $city, $state, $zip, $phone, $url;
        if ($insiderest) {
            switch ($tag) {
                case "D:TITLE":
                    $title .= mysql_real_escape_string(trim($data));
                    break;
                case "COUNTRY":
                    $country .= mysql_real_escape_string(trim($data));
                    break;
                case "ADDRESS":
                    $address .= mysql_real_escape_string(trim($data));
                    break;
                case "CITY":
                    $city .= mysql_real_escape_string(trim($data));
                    break;
                case "STATE":
                    $state .= mysql_real_escape_string(trim($data));
                    break;
                case "ZIP":
                    $zip .= mysql_real_escape_string(trim($data));
                    break;
                case "PHONE":
                    $phone .= mysql_real_escape_string(trim($data));
                    break;
                case "URL":
                    $url .= mysql_real_escape_string(trim($data));
                    break;
            }
        }
    }

    function endElement($parser, $tagName) {
        global $insiderest, $tag, $title, $country, $address, $city, $state, $zip, $phone, $url, $xml_parser;
        if ($tagName == "RESTAURANT") {
            $query = "INSERT INTO `chefmoz_list` (name,country,address,city,state,zip,phone,url) VALUES('$title','$country','$address','$city','$state','$zip','$phone','$url')";
            mysql_query($query);
            echo "Insert restaurant `$title` into database [" . mysql_insert_id() . "]\n";

            // Reset the per-restaurant state for the next record
            $insiderest = false;
            $title = "";
            $country = "";
            $address = "";
            $city = "";
            $state = "";
            $zip = "";
            $phone = "";
            $url = "";
        }
    }
    ?>
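One possible way to do the cleaning asked about above is to filter each line against the set of characters that XML 1.0 allows, and to escape only those ampersands that do not already start an entity reference. The following is a minimal sketch, not a definitive fix: cleanXmlLine() is a hypothetical helper name (it is not a built-in PHP function), and it assumes that any line which is not valid UTF-8 is ISO-8859-1, which is the same assumption utf8_encode() makes.

```php
<?php
// Hypothetical helper (not part of the original script): make one line of
// text safe to hand to xml_parse().
function cleanXmlLine(string $line): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8, so it can
    // serve as a validity check. ASSUMPTION: anything that is not valid
    // UTF-8 is ISO-8859-1. If iconv() fails, the line becomes empty.
    if (preg_match('//u', $line) !== 1) {
        $line = (string) iconv('ISO-8859-1', 'UTF-8', $line);
    }

    // Delete every character outside the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
    $line = (string) preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $line
    );

    // Escape "&" only when it does not already begin something shaped like
    // an entity or character reference (&name; &#38; &#x26;), so that an
    // existing &amp; is not turned into &amp;amp;.
    $line = (string) preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $line
    );

    return $line;
}
```

In the script above, the two cleaning lines inside the while loop could then be replaced by $data = cleanXmlLine($data);. Two caveats: fgets($fp, 4096) can cut a multibyte character in half on lines longer than the buffer, in which case the fragment would be misread as ISO-8859-1, and named entities that XML does not predefine (for example &nbsp;) are left untouched, so the parser may still reject them.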
faheemhameed
Posted October 3, 2007

Hi there,

I am having exactly the same issue: I am getting the "Invalid character" error. I just want to ignore the invalid characters and insert the rest of the text into the database. I don't even know which characters are invalid. If I knew, I could replace them with an empty string before passing the text to the XML parser. How can I determine which characters are invalid?

Hi l008com, did you have any success with this issue, or did you find another solution?

Please help! I badly need to solve this ASAP.

Thanks!
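To see which characters are invalid, one option is to scan the text for anything outside the XML 1.0 character set and report where it occurs. This is only a sketch under stated assumptions: findInvalidXmlChars() is a hypothetical helper name, it expects UTF-8 input, and it reports byte offsets within the string it is given (not line numbers in the file).

```php
<?php
// Hypothetical helper: list the characters in $text that XML 1.0 forbids.
// Each result has the byte offset, the character's bytes in hex, and a reason.
function findInvalidXmlChars(string $text): array
{
    // The /u patterns below only work on valid UTF-8, so check that first.
    if (preg_match('//u', $text) !== 1) {
        return [['offset' => null, 'hex' => null, 'reason' => 'not valid UTF-8']];
    }

    // Anything outside the XML 1.0 "Char" production is illegal.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $found = [];
    foreach ($matches[0] as [$char, $offset]) {
        $found[] = [
            'offset' => $offset,                     // byte offset in $text
            'hex'    => strtoupper(bin2hex($char)),  // raw bytes, e.g. "0B"
            'reason' => 'not allowed in XML 1.0',
        ];
    }
    return $found;
}
```

Calling it on the line that xml_get_current_line_number() points at, for example print_r(findInvalidXmlChars($data));, shows what to strip. An empty result means the line contains no forbidden characters, in which case the error is more likely caused by a bare & or an undefined entity.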