Jump to content
Klyx99

Help Faster way to create a dynamic website with HUGE XML files in PHP

Recommended Posts

Simply put, I every night I have to pull a HUGE xml file (5mb 4500+ records).
with this data ( small sales site), I create a webpage using the data provided. Pretty simple.

However, I first used XML and converted it to arrays. This was problematic for functions and manipulating etc.. or picking out discounts and  specific items etc.

So Now I use the xml file, and convert it to sql - this as you know is very time consuming (longest time is about 10-12min using INSERT UPDATE ON DUPLICATE). So I have to use a cronjob to perform this. I would rather have the page dynamicly loaded when user is visiting.

Now SQL works nicely as all the manipulation features are fast. However, I still think it feels sluggish. And it feels sloppy.
I do not want to learn XSLT (LAZY?), I am comfortable with PHP. I can already parse the xml file fast. Just need a way to manipulate the data. mainly sorting, and picking out specific items in the xml - then sort those results.

One method I did use was to cache what I manipulated and then overnight I pull the new feed, I delete the cache.
Hopefully this is not answered already, gonna feel like an idiot. What would you pro's recommend? Leave it at the SQL? Or keep pursuing the XML?

Share this post


Link to post
Share on other sites

Cron and a database. Cron to deal with the updated XML and add it to the database, database for the massively better performance and much better searching capabilities compared to reading an XML file.

Yeah, okay, it might take a few minutes overnight to deal with the file (something I'm 95% sure can be sped up) but you'll more than make up for the time during the rest of the day when people are hitting the database.

Share this post


Link to post
Share on other sites
1 minute ago, requinix said:

Cron and a database. Cron to deal with the updated XML and add it to the database, database for the massively better performance and much better searching capabilities compared to reading an XML file.

Yeah, okay, it might take a few minutes overnight to deal with the file (something I'm 95% sure can be sped up) but you'll more than make up for the time during the rest of the day when people are hitting the database.

thaaaannnkkk you. I built this thing from scratch, and out of the two methods the sql was incredibly faster! But it does not update instantly when a user visits. But I agree with you, the 10- 15min delay is ok for the rest of the day of the user hits.

I guess I just need to hear from a pro that I was on the right track. And when I find it, there is a massively faster method to import xml to sql - I forgot to bookmark it. but apparently the script imports a 20mb xml file in a few seconds - wow, when I find it, I'll post the link here - might help others.

Love to hear anyone else's thoughts on this. Or if you use a similar method.

Share this post


Link to post
Share on other sites

What is there to update? Is this XML coming from a user upload and not from some automatic nightly thing?

Share this post


Link to post
Share on other sites
9 hours ago, requinix said:

What is there to update? Is this XML coming from a user upload and not from some automatic nightly thing?

Parent company. New Items(usually 2 or 3 a night, mini 1 a night), update popularity of other products etc..

Share this post


Link to post
Share on other sites
6 hours ago, Barand said:

Fastest way is with mysql's LOAD XML command

hmm I might write a quick script to test that - the xml file is up to 5k records. The bottleneck comes in when migrating the data to the sql. That's why I just do the following to import it. Fairly fast, but not as fast as if I was reading it from xml straight to the page. I post that yesterday for someone looking for a way to just import XML to SQL.

What I might tinker with is a way to pull a little and cache it, but that could be problematic as well with a high level of users. what do you think?

Share this post


Link to post
Share on other sites

the code you linked to in the post above at SOF is the slowest method possible, because it's repeatedly producing, preparing, and executing the same single-row sql statement inside of a loop. at a minimum, it should only produce and prepare the query once, before the start of the loop, then only get each row of data and execute the query inside of the loop. you also need to insure that you are using a true prepared query (PDO emulates prepared queries by default and therefor doesn't actual save the query planning done with a true prepared query when executing the same sql statement more than once.)

the fastest method, as Barand posted, is to use a LOAD ... command.

the next fastest is to use a multi-value(multi-row) query that INSERTs ... UPDATEs or REPLACEs  (depending on how many fields you are dealing with, a REPLACE query may be better than in INSERT ... ON DUPLICATE UPDATE... query) as many rows as possible at one time (a few thousand is typical, based on the maximum packet size that the database server is set up to use - 1Mbytes is the default.)

the next fastest would be a singe-row prepared query, done correctly, by only producing and preparing the query once before the start of any loop.

the next fastest would be to run a non-prepared or an emulated prepared query inside the loop.

the slowest would be a true prepared query, being prepared and executed inside of the loop, because both the prepare() and execute() cause a communication between php and the database server.

Edited by mac_gyver

Share this post


Link to post
Share on other sites
29 minutes ago, mac_gyver said:

the code you linked to in the post above at SOF is the slowest method possible, because it's repeatedly producing, preparing, and executing the same single-row sql statement inside of a loop. at a minimum, it should only produce and prepare the query once, before the start of the loop, then only get each row of data and execute the query inside of the loop. you also need to insure that you are using a true prepared query (PDO emulates prepared queries by default and therefor doesn't actual save the query planning done with a true prepared query when executing the same sql statement more than once.)

 

Im still in the process of learning PDO,  so my question then would be:

The reason I have the repeated preparing, is because the XML is dynamic. will preparing ONCE account for the changes? Can you give an example of the repeated code and how you would improve it? That code posted is very old, and since it works, I have never changed it.

I use a combination of 2 XML functions to read that file since it is fairly large

$xml  = new XMLReader();            
 $xml->open($xml_file); // file is your xml file you want to parse
 while($xml->read() && $xml->name != 'game') { ; } // get past the header to your first record (game in my case)

 

Unless you are referring to this section: I just loop through the fields and create the prepare statement. If this is the section you are referring to, if I put this outside of the loop, will the execute command function as needed? Maybe this is my bottle neck. I am still new to the prepare and how they respond to an execute command. But if this is the bottle nicek, and the execute will work fine, I could grab the field names and create this string OUTSIDE of the while loop and then run it, might speed it up?

 /* this looks confusing, but it is not. There are over 20 fields, and instead of typing them all out, I just made a string. */
                $sql = "INSERT INTO $table (";
                foreach($gameRec as $field=>$game){
                $sql .= " $field,";
                }
                $sql = rtrim($sql, ",");
                $sql .=") values (";

                foreach($gameRec as $field=>$game) {
                    $sql .= " :$field,";               
                }
                $sql = rtrim($sql,",");
                $sql .= ") ON DUPLICATE KEY UPDATE "; // online game doesn't have a gamerank - not my choice LOL, so I adjust that for here

                switch ($os) {
                    case 'pc' : $sql .= "gamerank = ".$gameRec['gamerank']        ; break;
                    case 'mac': $sql .= "gamerank = ".$gameRec['gamerank']        ; break;
                    case 'pl' : $sql .= "playercount = ".$gameRec['playercount']  ; break;
                    case 'og' :
                        $playercount = $this->getPlayerCount($gameRec['gameid']);
                        $sql .= "playercount = ".$playercount['playercount']  ;
                        break;

                }

 

Also, MY POST is under JAKE STONE. I am not the OP there. Just in case you read a different post. thanks for checking that out.

Edited by Klyx99

Share this post


Link to post
Share on other sites
4 minutes ago, Barand said:

Canyou provide a sample of the XML you are trying to load?

sure! Here you go, there are about 5,000 others of them but felt one was enough LOL

<gamexml>
    <title>Big Fish Games - Game Archive</title>
    <description>A New Game Every Day</description>
    <copyright>© 2018, Big Fish Games, Inc.</copyright>
    <pubdate>Fri, 16 Nov 2018 05:15:32 -0800</pubdate>
    <game>
      <gameid>17915</gameid>
      <gamename>1 Moment of Time: Silentville</gamename>
      <family>1 Moment of Time: Silentville</family>
      <familyid>7237</familyid>
      <productid>13231</productid>
      <genreid name="Puzzle">4</genreid>
      <allgenreid>4,15,21,27,119,123,124,125,126,143,147</allgenreid>
      <shortdesc>Uncover a mystery and save a town from a curse.</shortdesc>
      <meddesc>Once upon a time in the small, undistinguished town of Silentville people started disappearing. Help uncover the mystery behind these disappearances and save the town from a curse!</meddesc>
      <longdesc>Uncover a mystery and save a town from a curse in 1 Moment Of Time: Silentville! Once upon a time in the small, undistinguished town of Silentville people started disappearing - one by one. Even the visitors suddenly disappeared without a trace. Airplanes and birds, trains and pedestrians, drivers, tourists and even bicyclers - all of them have vanished! Will you be able to solve the mystery and release Silentville from its cursed non-existence?</longdesc>
      <bullet1>Beautiful graphics</bullet1>
      <bullet2>Original mini-games</bullet2>
      <bullet3>Intriguing story</bullet3>
      <bullet4>Check out our Blog Walkthrough</bullet4>
      <foldername>en_1-moment-of-time-silentville</foldername>
      <price>$ 6.99 USD</price>
      <hasdownload>yes</hasdownload>
      <macgameid>27210</macgameid>
      <hasvideo>yes</hasvideo>
      <hasflash>no</hasflash>
      <hasdwfeature>no</hasdwfeature>
      <gamerank>2401</gamerank>
      <releasedate>2012-06-17 00:00:00</releasedate>
      <gamesize>528691151</gamesize>
      <systemreq>
        <pc>
          <sysreqos>Windows XP/Windows Vista/Windows 7/8</sysreqos>
          <sysreqmhz>1.8 GHz</sysreqmhz>
          <sysreqmem>512</sysreqmem>
          <sysreqdx>8.1</sysreqdx>
          <sysreqhd>518</sysreqhd>
        </pc>
      </systemreq>
    </game>

 

Share this post


Link to post
Share on other sites

Been an affiliate for 9yrs, last few years been tinkering with creating my own site vs the ugly stock one they have.. Great learning process.

Share this post


Link to post
Share on other sites

WELL I just tried the LOAD comand - and BAM 9 seconds on my OLD THINK PAD!!!!

Think that might be the way to go!

EDIT:

forgot to mention i did that at the cmd prompt, but that would be an easy batch file to make and run at midnight

 

Edited by Klyx99

Share this post


Link to post
Share on other sites

I was experimenting too. This worked (at least the one record you provided):

CREATE TABLE `gameinput` (
  `gameid` int(11) NOT NULL,
  `gamename` varchar(50) DEFAULT NULL,
  `family` varchar(50) DEFAULT NULL,
  `familyid` int(11) DEFAULT NULL,
  `productid` int(11) DEFAULT NULL,
  `genreid` int(11) DEFAULT NULL,
  `allgenreid` varchar(50) DEFAULT NULL,
  `shortdesc` varchar(50) DEFAULT NULL,
  `meddesc` varchar(250) DEFAULT NULL,
  `longdesc` text,
  `bullet1` varchar(50) DEFAULT NULL,
  `bullet2` varchar(50) DEFAULT NULL,
  `bullet3` varchar(50) DEFAULT NULL,
  `bullet4` varchar(50) DEFAULT NULL,
  `foldername` varchar(50) DEFAULT NULL,
  `price` varchar(50) DEFAULT NULL,
  `hasdownload` varchar(50) DEFAULT NULL,
  `macgameid` int(11) DEFAULT NULL,
  `hasvideo` varchar(50) DEFAULT NULL,
  `hasflash` varchar(50) DEFAULT NULL,
  `hasdwfeature` varchar(50) DEFAULT NULL,
  `gamerank` int(11) DEFAULT NULL,
  `releasedate` datetime DEFAULT NULL,
  `gamesize` int(11) DEFAULT NULL,
  `sysreqos` varchar(50) DEFAULT NULL,
  `sysreqmhz` varchar(50) DEFAULT NULL,
  `sysreqmem` int(11) DEFAULT NULL,
  `sysreqdx` varchar(10) DEFAULT NULL,
  `sysreqhd` int(11) DEFAULT NULL,
  PRIMARY KEY (`gameid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


LOAD XML LOCAL INFILE 'C:/inetpub/wwwroot/test/game.xml'
INTO TABLE gameinput
ROWS IDENTIFIED BY '<game>';

You have some work to normalising the data (eg the comma delimited allgenreid field) and the bulletN fields.

Share this post


Link to post
Share on other sites

yes that is exactly what I did. Very surprised at how fast that processed it. Im at work, so using the old tried and true Thinkpad w530 (20gbRAM) lol  and both times took roughly 9 - 15seconds.

 

So thinking outside the box, Instead of using cron.php to run a long and grueling import, I'll write a batch file (will post it here for others) to process all the data.

 

Thanks for that tip!! Was extremely helpful!!! Not sure if there is a SOLVED button, but if not, i'll edit the title

Share this post


Link to post
Share on other sites
54 minutes ago, Barand said:

I was experimenting too. This worked (at least the one record you provided):


CREATE TABLE `gameinput` (
  `gameid` int(11) NOT NULL,
  `gamename` varchar(50) DEFAULT NULL,
  `family` varchar(50) DEFAULT NULL,
  `familyid` int(11) DEFAULT NULL,
  `productid` int(11) DEFAULT NULL,
  `genreid` int(11) DEFAULT NULL,
  `allgenreid` varchar(50) DEFAULT NULL,
  `shortdesc` varchar(50) DEFAULT NULL,
  `meddesc` varchar(250) DEFAULT NULL,
  `longdesc` text,
  `bullet1` varchar(50) DEFAULT NULL,
  `bullet2` varchar(50) DEFAULT NULL,
  `bullet3` varchar(50) DEFAULT NULL,
  `bullet4` varchar(50) DEFAULT NULL,
  `foldername` varchar(50) DEFAULT NULL,
  `price` varchar(50) DEFAULT NULL,
  `hasdownload` varchar(50) DEFAULT NULL,
  `macgameid` int(11) DEFAULT NULL,
  `hasvideo` varchar(50) DEFAULT NULL,
  `hasflash` varchar(50) DEFAULT NULL,
  `hasdwfeature` varchar(50) DEFAULT NULL,
  `gamerank` int(11) DEFAULT NULL,
  `releasedate` datetime DEFAULT NULL,
  `gamesize` int(11) DEFAULT NULL,
  `sysreqos` varchar(50) DEFAULT NULL,
  `sysreqmhz` varchar(50) DEFAULT NULL,
  `sysreqmem` int(11) DEFAULT NULL,
  `sysreqdx` varchar(10) DEFAULT NULL,
  `sysreqhd` int(11) DEFAULT NULL,
  PRIMARY KEY (`gameid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


LOAD XML LOCAL INFILE 'C:/inetpub/wwwroot/test/game.xml'
INTO TABLE gameinput
ROWS IDENTIFIED BY '<game>';

You have some work to normalising the data (eg the comma delimited allgenreid field) and the bulletN fields.

use this... sorry LOL

Quote

view-source:http://cdn-rss.bigfishgames.com/src.php?feedtype=game&username=KlyxGaming&content=glrelease&gametype=pc&type=6&locale=en

 

Share this post


Link to post
Share on other sites

I just ran it with your data

mysql> LOAD XML LOCAL INFILE 'C:/inetpub/wwwroot/test/game.xml'
    -> INTO TABLE gameinput
    -> ROWS IDENTIFIED BY '<game>';
Query OK, 4975 rows affected (3.54 sec)

 

Share this post


Link to post
Share on other sites

Ya I think that is the way to go. with that speed I can now do all 9+ rss feeds that pertain to the overall collection. My test showed about 5min 8 tables, where before it was around 15min  top.. HUGE diff, as when I put it on my server (little more beefier) I am expecting all the tables to process around 1-3min tops.

That was a very nice tip! Currently making a batchfile to run it all (cept having issues with wget grabbing the xml file) but that is for a different thread. I'll figure it out LOL

Share this post


Link to post
Share on other sites

If it helps, this puts it all together in a PHP script

<?php
include('db_inc.php');
$db = pdoConnect("test");   // connect to "test" database

function get_data($path){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL,$path);
    curl_setopt($ch, CURLOPT_FAILONERROR,1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}

$url = "http://cdn-rss.bigfishgames.com/src.php?feedtype=game&username=KlyxGaming&content=glrelease&gametype=pc&type=6&locale=en";

$xml = get_data($url);
file_put_contents('game1.xml', $xml);

$db->exec("DROP TABLE gameinput");

$db->exec("CREATE TABLE `gameinput` (
              `gameid` int(11) NOT NULL,
              `gamename` varchar(250) DEFAULT NULL,
              `family` varchar(250) DEFAULT NULL,
              `familyid` int(11) DEFAULT NULL,
              `productid` int(11) DEFAULT NULL,
              `genreid` int(11) DEFAULT NULL,
              `allgenreid` varchar(250) DEFAULT NULL,
              `shortdesc` varchar(250) DEFAULT NULL,
              `meddesc` text,
              `longdesc` text,
              `bullet1` varchar(250) DEFAULT NULL,
              `bullet2` varchar(250) DEFAULT NULL,
              `bullet3` varchar(250) DEFAULT NULL,
              `bullet4` varchar(250) DEFAULT NULL,
              `foldername` varchar(250) DEFAULT NULL,
              `price` varchar(50) DEFAULT NULL,
              `hasdownload` varchar(50) DEFAULT NULL,
              `macgameid` int DEFAULT NULL,
              `hasvideo` varchar(50) DEFAULT NULL,
              `hasflash` varchar(50) DEFAULT NULL,
              `hasdwfeature` varchar(50) DEFAULT NULL,
              `gamerank` int(11) DEFAULT NULL,
              `releasedate` datetime DEFAULT NULL,
              `gamesize` long DEFAULT NULL,
              `sysreqos` varchar(250) DEFAULT NULL,
              `sysreqmhz` varchar(50) DEFAULT NULL,
              `sysreqmem` int(11) DEFAULT NULL,
              `sysreqdx` varchar(10) DEFAULT NULL,
              `sysreqhd` int(11) DEFAULT NULL,
              PRIMARY KEY (`gameid`)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8"
         );

$db->exec("LOAD XML LOCAL INFILE 'C:/inetpub/wwwroot/test/game1.xml'
            INTO TABLE gameinput
            ROWS IDENTIFIED BY '<game>'"
         );
         
// CHECK DOWNLOAD/LOAD TABLE

$res = $db->query("SELECT COUNT(*) FROM gameinput");
printf("Game records loaded: %d<br>", $res->fetchColumn());
?>

 

Share this post


Link to post
Share on other sites
6 minutes ago, Barand said:

$db->exec("LOAD XML LOCAL INFILE 'C:/inetpub/wwwroot/test/game1.xml' INTO TABLE gameinput ROWS IDENTIFIED BY '<game>'" );

omg I am so dumb I forgot I could do all this in php der!!

LOL

I found a similar one from a few of my old sites (2010)using curl - prety fast on the download, but bottle necked with the old way of processing.

Now I'll put it todgether with the LOAD command. minor change in my maint job

Share this post


Link to post
Share on other sites

ok, CRON done.7 feeds into 7 tables 43 seconds (missing one table I will add later for Mac - can't leave them out - they're worse than linux users when it comes to games LOL)

Incredible  - and here I thought I got it down to 9 min and I was rockin!!! LOL
 

Quote


Processing Comprehensive... Done!

Processing Online Games... Done!

Processing Daily Deal... Done!

Processing Weekly Special... Done!

Processing Features... Done!

Processing Player count for online Games... Done!

Maint Completed Successfully!

Completion Time: 43 seconds


 

 

Share this post


Link to post
Share on other sites

The question now is "What are you going to do with that 8m 17secs that you just saved? 😄

Share this post


Link to post
Share on other sites
8 minutes ago, Barand said:

The question now is "What are you going to do with that 8m 17secs that you just saved? 😄

chill with a red bull?

Share this post


Link to post
Share on other sites

Well this is goofy. I realized I need to add a player count (which is from another feed) to the online games. I currently store the gameid and player count in another table. the player count from that table should be added to the onlinetable.playercount

this script taks over 3 - 5min vs the 30seconds for the other 7 tables. Any ideas what I could do to speed this up. Normally this is not an issue, but it is during a search as I want to use a player count to sort it by (note player count is NOT in the online games feed, it is in a separate feed for some weird reason, I say weird because only the online games use the player count)

 

 $sql = "SELECT * FROM playercount";
        $stmt = $db->connect()->query($sql);
        $stmt->execute();
        $onlineCount = $stmt->fetchAll(PDO::FETCH_ASSOC);
        
        foreach($onlineCount as $onlinegame) {
            $online_update = "UPDATE onlinegames SET playercount = :playercount WHERE gameid = :gameid";
            $check = $db->connect()->prepare($online_update);
            $check->execute($onlinegame);
           
        }

 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

×

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.