I am parsing through a .csv file to translate it into a cleaner-looking .csv file.  I first take the input .csv file and put it into a database.  I just tested an input file that was about 50,000 lines and that took about 14 minutes to complete.  This doesn't sound good, especially considering the input files that users will be submitting are about 1,600,000 lines.  Here's my code.  Does anyone see how my algorithm could be improved to increase the speed?  Or should I be looking at doing this in another language?  If so, what language do you propose?

 

if($flag == true)
{
    $con = mysql_connect('localhost', 'root', '') or die('cannot connect');
    $db = mysql_select_db('translate') or die('cannot select DB');
    $table = 'nursinghomes';

    if ($file = fopen($fileloc, "r"))
    {
        while(!feof($file))
        {
            $line = trim(fgets($file));
            if ($line === '') { continue; }
            $linearray = explode(',', $line);
            $id = (int)$linearray[0];
            $formcolumn = $linearray[1] . "_" . $linearray[2] . "_" . $linearray[3];
            $content = mysql_real_escape_string($linearray[4]);

            if(checkColumn($formcolumn, $table) == 0)
            {
                $query = "ALTER TABLE $table ADD COLUMN $formcolumn CHAR(100)";
                $result = mysql_query($query);
            }
            if(checkID($id, $table) == 1)
            {
                $query = "UPDATE $table SET $formcolumn='$content' WHERE id=$id";
                $result = mysql_query($query);
            }
            else
            {
                $query = "INSERT INTO $table (id, $formcolumn) VALUES('$id', '$content')";
                mysql_query($query);
            }
        }
        fclose($file);
        echo "Your file has been translated.  Click <a href=\"?p=another.php\">here</a> to export the file.";
    }
}

https://forums.phpfreaks.com/topic/238337-file-processing-time/

Never EVER run queries in loops. Instead, in your processing code determine all the DB queries you will need to make and then combine them (as applicable). For example, you could run one query to add multiple columns, and you can run one query to add multiple records (both of which will be much more efficient). You can also run one query to update multiple records with different values for each - but it is not straightforward and I'm not sure how efficient it will be.
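To make "combine them" concrete, here is a minimal sketch of building one ALTER and one multi-row INSERT as strings. The table name and column names are lifted from the example input later in this thread; the data is made up for illustration, and in real code the values would need escaping against a live connection.

```php
<?php
// Hypothetical sketch: collect work during the loop, build each query ONCE after it.
$newColumns = array('Z012345_00100_0200', 'Z012345_00100_0300');
$rows = array(
    array('id' => 111111, 'Z012345_00100_0200' => 'Arizona'),
);

// One ALTER TABLE adds every missing column in a single statement.
$alterSql = "ALTER TABLE nursinghomes ADD COLUMN `"
          . implode("` CHAR(100), ADD COLUMN `", $newColumns) . "` CHAR(100)";

// One multi-row INSERT replaces one query per record.
$values = array();
foreach ($rows as $row) {
    $values[] = "(" . (int)$row['id'] . ", '" . addslashes($row['Z012345_00100_0200']) . "')";
}
$insertSql = "INSERT INTO nursinghomes (id, Z012345_00100_0200) VALUES " . implode(", ", $values);
```

Two statements total, no matter how many columns or rows accumulated during the loop.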

 

I'll review the code and give you something to work with.

Why are you adding columns to a table?  There has to be a structure that doesn't involve that.

 

I would be willing to bet that most of the time taken in this is in the mysql aspects of the code.

 

You can greatly improve that by batching up all your inserts.  You do not need to separate the insert from the update if you use the INSERT ... ON DUPLICATE KEY ... syntax.

 

You might start with batches of 500 rows at a time.  If you're not familiar, you simply repeat the part after VALUES

 

INSERT INTO table (id, ) VALUES ( ), ( ), ( ) .....
ON DUPLICATE KEY ...
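A small sketch of that batching in PHP, using the column name from the example input below. The batch data here is invented, and this assumes `id` is the table's PRIMARY (or a UNIQUE) key so the ON DUPLICATE KEY clause actually fires.

```php
<?php
// Build one INSERT ... ON DUPLICATE KEY UPDATE for a whole batch of rows.
$batch = array(
    array('id' => 111111, 'value' => 'Arizona'),
    array('id' => 222222, 'value' => 'Nevada'),
);
$values = array();
foreach ($batch as $row) {
    $values[] = "(" . (int)$row['id'] . ", '" . addslashes($row['value']) . "')";
}
// New ids are inserted; existing ids have the column updated instead.
$sql = "INSERT INTO nursinghomes (id, Z012345_00100_0200) VALUES "
     . implode(", ", $values)
     . " ON DUPLICATE KEY UPDATE Z012345_00100_0200 = VALUES(Z012345_00100_0200)";
```

With batches of 500, a 1,600,000-line file becomes roughly 3,200 queries instead of millions.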

 

an example line of input would be like:

 

111111,Z012345,00100,0200,Arizona

 

111111 would be the client ID

Z012345,00100,0200 would be combined to form the column name, so it would produce Z012345_00100_0200

Arizona would be the content of the field

 

The input file has multiple lines like this; there could be 10 lines with the same ID number but different columns and content.  There is no way of knowing the structure of the database beforehand, so I have to build the structure as I go as well as filling in the content.


 

Concatenating 3 values together and turning that into a column is a bizarre design that is going to perform poorly.
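As a hypothetical alternative (an assumption about the goal, not something the poster confirmed): store each input line as a row in a fixed three-column table, keyed on client and code, instead of minting a column per code triple. Table and column names here are invented.

```php
<?php
// Sketch of a normalized schema: one row per (client, code, value) triple.
// No ALTER TABLE is ever needed, regardless of what codes appear in the file.
$ddl = "CREATE TABLE nursinghome_values (
            client_id INT NOT NULL,
            code CHAR(20) NOT NULL,   -- e.g. 'Z012345_00100_0200'
            content CHAR(100),
            PRIMARY KEY (client_id, code)
        )";

// Each input line maps directly to one upsert:
$sql = "INSERT INTO nursinghome_values (client_id, code, content)
        VALUES (111111, 'Z012345_00100_0200', 'Arizona')
        ON DUPLICATE KEY UPDATE content = VALUES(content)";
```

The "cleaner" output .csv can then be produced by pivoting at export time, rather than by reshaping the table on every import.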

 

Other than loading in this data into mysql we don't really have an idea of what you're trying to accomplish, but I can say with some certainty that you are on the wrong track here. 

 

If we understood the goal, we might be able to advise you on a good way to approach your problem.

 

OK, following your original process I rewrote this to do the exact same thing, but more efficiently. Although it is a lot more code it should run faster due to reducing the queries being done. Right now your script runs an individual query for each and every field of each record. The below script does the following:

 

1. Gets a list of all the current fields in the table, then while processing the data generates a list of new fields to be added to the table. Then ONE query is run to add the fields.

2. Creates an array of all complete records to be added/updated

3. Splits the array into two arrays: records to be added and records to be updated

4. Processes the records to be added array into ONE single query to add all the records

5. Processes the records to be updated into one complete query per record (not per field).

 

With that there are a total of 4 main queries plus one query each for each record update. That is probably less than 10% of the queries you have now. Also, #5 can be further updated into a single query to update all the records in one query. It takes a MySQL function similar to a switch statement in PHP. But, I forget the structure and I'm too lazy to look it up. That would mean only 5 total queries for the entire file.
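The "switch-like" MySQL construct alluded to above is the CASE expression. A rough sketch of collapsing all the per-record updates into one query, reusing the thread's table and example column names with made-up data:

```php
<?php
// One UPDATE for many rows: CASE picks a different value per id,
// much like a switch statement in PHP.
$updates = array(111111 => 'Arizona', 222222 => 'Nevada');

$cases = '';
foreach ($updates as $id => $content) {
    $cases .= " WHEN " . (int)$id . " THEN '" . addslashes($content) . "'";
}
$ids = implode(', ', array_keys($updates));

$sql = "UPDATE nursinghomes SET Z012345_00100_0200 = CASE id{$cases} END WHERE id IN ({$ids})";
```

With multiple columns you would build one CASE expression per column; rows not matched by a WHEN keep their value only if you add an ELSE, so the `WHERE id IN (...)` guard matters.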

 

I wrote all this code on-the-fly, so I'm positive there are some syntax errors and I probably have a logic problem or two. I would suggest commenting out the actual mysql_query() calls, add some debugging code to display the queries and other pertinent data, then run it against a small file to validate the results.

 

<?php

if($flag == true)
{
    $con = mysql_connect('localhost', 'root', '') or die('cannot connect');
    $db = mysql_select_db('translate') or die('cannot select DB');
    $table = 'nursinghomes';

    //Get list of existing columns from table
    $query = "SHOW COLUMNS FROM {$table}";
    $result = mysql_query($query) or die(mysql_error());
    $existingColumns = array();
    while($row = mysql_fetch_assoc($result))
    {
        $existingColumns[] = $row['Field'];
    }

    //Variables to hold processed data
    $newColumnsSQL = array();
    $data = array();

    //Process the file into an array
    if ($file=fopen($fileloc,"r+"))
    {
        while(($linearray = fgetcsv($file, 1000, ",")) !== FALSE)
        {
            $id = $linearray[0];
            $formcolumn = $linearray[1] . "_" . $linearray[2] . "_" . $linearray[3];
            $content = $linearray[4];

            //Determine if field exists in table (or has already been added)
            if(!in_array($formcolumn, $existingColumns))
            {
                $existingColumns[] = $formcolumn;
                $newColumnsSQL[] = "ADD COLUMN {$formcolumn} CHAR(100)";
            }
            
            //Add record to data variable
            $data[(int)$id][$formcolumn] = mysql_real_escape_string($content);
        }
        fclose($file);

    }
    //Finish processing file, now run queries

    //Run ONE query to add all new columns
    if(count($newColumnsSQL) > 0)
    {
        $query = "ALTER TABLE $table " . implode(', ', $newColumnsSQL);
        $result = mysql_query($query) or die(mysql_error());
    }

    //Run query to find all existing records from the processed data
    $query = "SELECT id FROM {$table} WHERE id IN (" . implode(',', array_keys($data)) . ")";
    $result = mysql_query($query) or die(mysql_error());
    //Dump results into array
    $existingIDs = array();
    while($row = mysql_fetch_assoc($result))
    {
        $existingIDs[$row['id']] = false;
    }

    //Split data array into two arrays (one for new records and one for existing records)
    $newData = array_diff_key($data, $existingIDs);
    $existingData = array_diff_key($data, $newData);

    //Create array of all fields (except ID field) with empty value
    $columnList = array_fill_keys($existingColumns, '');

    //Process insert data
    $valuesSQL = array();
    foreach($newData as $id => $recordData)
    {
        //Ensure data is mapped for all fields in correct order,
        //then set the id and build ONE value group per record
        $insertData = array_merge($columnList, $recordData);
        $insertData['id'] = $id;
        $valuesSQL[] = "('" . implode("', '", $insertData) . "')";
    }
    
    //Create ONE insert query for all new records (skip if there are none)
    if(count($valuesSQL) > 0)
    {
        $query = "INSERT INTO $table (`" . implode('`, `', array_keys($columnList))  . "`)
                  VALUES " . implode(', ', $valuesSQL);
        $result = mysql_query($query) or die(mysql_error());
    }

    //Process update data
    foreach($existingData as $id => $recordData)
    {
        $setValuesSQL = array();
        foreach($recordData as $fieldName => $fieldValue)
        {
            $setValuesSQL[] = "`{$fieldName}` = '{$fieldValue}'";
        }
        $query = "UPDATE {$table} SET " . implode(', ', $setValuesSQL) . " WHERE id = " . (int)$id;
        $result = mysql_query($query) or die(mysql_error());
    }

    echo "Your file has been translated.  Click <a href=\"?p=another.php\">here</a> to export the file.";
}

?>
