Jump to content

Editing PHP Script to Enable Parsing of Large XML File


maghnom

Recommended Posts

I'm trying to utilize a PHP script to parse a large XML file (around 450 MB) to MYSQL database into certain structure and definitions of included XML elements. The problem is that the original script uses file_get_contents and SimpleXMLElement to get it done, but the corn job executed by the server halts due to the volume of the XML file. I'm no PHP expert, so I bought this XMLSplit software and divided the XML into 17 separated XML files each at size of 30 MB, parsed them one by one using the same script. However, the output database was missing a lot of input, and I have serious doubts whether this would be the same output of the original file if left not divided automatically and parsed one by one.

So, I've decided to use XMLReader with this exact PHP script to parse this big XML file, but so far I couldn't manage to simply replace the parsing code and keep other functionality intact.

I'm including the script below, I'd really appreciate if someone helps me to do so.

<?php
set_time_limit(0);
ini_set('memory_limit', '1024M');

include_once('../db.php');
include_once(DOC_ROOT.'/include/func.php');

mysql_query("TRUNCATE screenshots_list");
mysql_query("TRUNCATE pages");
mysql_query("TRUNCATE page_screenshots");

// This is the part I need help with to change into XMLReader instead of utilized function, to enable parsing of the large XML file correctly (while keeping rest of the script code as is if possible):

$xmlstr = file_get_contents('t_info.xml');
$xml = new SimpleXMLElement($xmlstr);
foreach ($xml->template as $item)

{
//print_r($item);
$sql = sprintf("REPLACE INTO templates SET id = %d, state = %d, price = %d, exc_price = %d, inserted_date = '%s', update_date = '%s', downloads = %d, type_id = %d, type_name = '%s', is_flash = %d, is_adult = %d, width = '%s', author_id = %d, author_nick = '%s', package_id = %d, is_full_site = %d, is_real_size = %d, keywords = '%s', sources = '%s', description = '%s', software_required = '%s'", $item->id, $item->state, $item->price, $item->exc_price, $item->inserted_date, $item->update_date, $item->downloads, $item->template_type->type_id, $item->template_type->type_name, $item->is_flash, $item->is_adult, $item->width, $item->author->author_id, $item->author->author_nick, $item->package->package_id, $item->is_full_site, $item->is_real_size, $item->keywords, $item->sources, $item->description, $item->software_required);
//echo '<br>'.$sql;
mysql_query($sql);
//print_r($item->screenshots_list->screenshot);
foreach ($item->screenshots_list->screenshot as $scr) {
$main = (!empty($scr->main_preview)) ? 1 : 0;
$small = (!empty($scr->small_preview)) ? 1 : 0;
insert_data($item->id, 'screenshots_list', 0, $scr->uri, $scr->filemtime, $main, $small); 
}
foreach ($item->styles->style as $st) {
insert_data($item->id, 'styles', $st->style_id, $st->style_name); 
}
foreach ($item->categories->category as $cat) {
insert_data($item->id, 'categories', $cat->category_id, $cat->category_name); 
}
foreach ($item->sources_available_list->source as $so) {
insert_data($item->id, 'sources_available_list', $so->source_id, ''); 
}
foreach ($item->software_required_list->software as $soft) {
insert_data($item->id, 'software_required_list', $soft->software_id, ''); 
}
//print_r($item->pages->page);
if (!empty($item->pages->page)) {
foreach ($item->pages->page as $p) {
mysql_query(sprintf("REPLACE INTO pages SET tpl_id = %d, name = '%s', id = NULL ", $item->id, $p->name));
$page_id = mysql_insert_id();
if (!empty($p->screenshots->scr)) {
foreach ($p->screenshots->scr as $psc) {
$href = (!empty($psc->href)) ? (string)$psc->href : '';
mysql_query(sprintf("REPLACE INTO page_screenshots SET page_id = %d, description = '%s', uri = '%s', scr_type_id = %d, width = %d, height = %d, href = '%s'", $page_id, $psc->description, $psc->uri, $psc->scr_type_id, $psc->width, $psc->height, $href));
}
}
}
}}?>

I'd appreciate your help with that...

Link to comment
Share on other sites

I've not used XMLReader, however one of the user comments points to a SimpleXMLReader class that provides a simple call-back style interface for parsing a large XML file. You could give that a try and see if it works out for you.

 

As for your current file-splitting solution, so long as the file splits correctly it shouldn't cause any issues as you don't seem to have an inter-dependencies during the parsing.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.