cyberbob1uk Posted January 24, 2012

Hi guys,

I currently receive a large text file (> 500 MB) once per week, which I have been manually splitting and then processing to obtain the required CSV files. However, this is taking in the region of 2 to 3 hours. Very soon these files will be sent daily, and I really don't have the time to split and process them every day.

I have been experimenting for a while, trying to parse everything properly/automatically with fopen, feof and fgets (and the other 'f' functions), but the script never seems to read the file all the way to the end - I assume this is due to memory usage.

The data in the file follows a strict pattern throughout, which is:

BSNY990141112271112270100000 POO2C35 122354000 DMUS 075 O
BX NTY
LOLANCSTR 1132 11322 TB
LIMORCMSJ 1135 00000000
LICRNFNJN 1140 00000000 H
LICRNF 1141H1142H 11421142 T
LISDAL 1147H1148H 11481148 T
LIARNSIDE 1152H1153 11531153 T
LIGOVS 1158 1159 11581159 T
LIKTBK 1202 1202H 12021202 T
LICARK 1206 1207 12061207 T
LIULVRSTN 1214H1215H 12151215 T
LIDALTON 1223 1223H 12231223 T
LIDALTONJ 1225 00000000
LIROOSE 1229 1229H 12291229 T 2
LTBAROW 1237 12391 TF

That is just one record of information (1 of around 140,000 records). Each record has no fixed number of lines, but each line within a record is fixed at 80 characters, and all lines in a record need to share the same unique 'id' - at present I'm using an md5 hash of microtime. The first line of every record starts with 'BS' and the last line of each record starts with 'LT', terminating with 'TF'. Everything in between also follows a pattern which I can break down effectively. The record above shows one train service schedule, hence why each line in a record needs the same unique id.

Anyone got any ideas on how I could process such a file effectively?

Many thanks

Dave
trq Posted January 24, 2012

Are you trying to do this via a web interface? Do you need to? You shouldn't have any problem processing this via the command line with any scripting language.
cyberbob1uk Posted January 24, 2012

This is all being done on my web server, and it does need to be, as the daily files will be placed directly on the server ready for processing.
trq Posted January 24, 2012

That doesn't really answer the question. Are you executing this script by making a request to the script via a browser?
cyberbob1uk Posted January 24, 2012

Actually, after replying, I think I caught on to your train of thought. While testing, I have been executing the script via a browser (and not getting the expected results). However, I assume that if the browser isn't used to execute the script, everything 'should' work as expected?
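For what it's worth (an assumption about the setup, not something confirmed in the thread): a script requested through a browser is normally subject to the max_execution_time limit, so even before moving to the command line it is worth removing that limit and raising the memory ceiling at the top of the script. A minimal sketch:

<?php
set_time_limit(0);                 // remove the execution time limit for this run
ini_set('memory_limit', '512M');   // allow more memory if needed (value is a guess)

// When run from the command line (php process_file.php), PHP applies
// no max_execution_time by default, so the CLI route avoids this entirely.
?>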
PFMaBiSmAd Posted January 24, 2012

Define: "manually splitting then processing to obtain the required CSV files".

Why and what are you splitting the data for? What processing are you doing? Why and what are the 'required' CSV files?
cyberbob1uk Posted January 24, 2012

"Define: manually splitting then processing to obtain the required CSV files"

The weekly files I have been doing on my local machine, through the night at work. The original file was split into manageable chunks (usually 10-12 separate files). Then each record in each smaller file was made into an array, using an md5 hash of microtime to create a unique array key. In the next script I manually changed the name of the smaller file (now a PHP file containing arrays) and included it in the script which put the info into CSV files.

"Why and what are you splitting the data for? What processing are you doing? Why and what are the 'required' CSV files?"

Each record in the file contains the stopping pattern for one train service operating between date_a and date_b on day(s)_x (this info is held within the first line of each record) - one train can have more than one pattern depending on the date and day. Because of the number of services and service variations, each record needs a unique id so that different schedules for the same train on different days don't get mixed up.

From this, the main CSV files - service details and stopping patterns - are created. Service details contains the necessary parts for that service; stopping patterns contains the timing points for each record in order of passing (these are the lines starting 'LO', 'LI' and 'LT'); then each timing point has its own CSV containing every service that passes/stops at it. The CSVs are then transferred to the database using LOAD DATA INFILE.
SergeiSS Posted January 24, 2012

I think you could try loading it directly into the database. Every database has a command that allows this, and your file is well structured and ready to be loaded. You could do the following while loading:

1. Assign automatic, auto-incremented ids.
2. Analyse the information and mark certain records as the start/end of a block - you could do this inside a trigger. You could also skip some lines, or save additional info into other tables based on the information in certain lines... In other words, you can do whatever you need.

500 MB / 80 chars ≈ 6.25 million lines. That's not too much for a good server and a good SQL server. Right now the CSV step is costing you a lot of additional processing time.
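As a rough illustration of the idea (a minimal sketch only - it assumes MySQL with local_infile enabled and a staging table you would have to create yourself; none of these names come from the thread):

<?php
// Staging table assumed to exist, e.g.:
//   CREATE TABLE raw_lines (
//       id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
//       line CHAR(80) NOT NULL
//   );
$db = mysqli_init();
$db->options(MYSQLI_OPT_LOCAL_INFILE, true);   // required for LOCAL INFILE
$db->real_connect('localhost', 'user', 'pass', 'trains');

// Each 80-character line becomes one row; the auto-increment id preserves
// the original order, so the BS...LT blocks stay together for later SQL
// processing (or for a trigger, as suggested above).
$db->query("LOAD DATA LOCAL INFILE '/path/to/schedule.txt'
            INTO TABLE raw_lines
            LINES TERMINATED BY '\\n'
            (line)");
?>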
cyberbob1uk Posted January 24, 2012

Originally I did try sending the data directly to the database, but it was taking an unbelievably long time. For that reason I decided to create CSV files, load those into the database, then destroy the files, as they would no longer be needed.
SergeiSS Posted January 24, 2012

OK. Could you show the code you use for the file processing? Maybe something is wrong there.

You see, I also load a lot of statistics data every day. One part is loaded automatically overnight - I don't know its volume. Another part is loaded under my control in the morning. The second part is around 400-500 MB; it takes roughly 15-20 minutes to prepare (a C++ program does that) and then around 10-15 minutes to load into the database, across several tables. The database is PostgreSQL. As you can see, the time is not that big. I mention this only to show you that I know something about this kind of work.
cyberbob1uk Posted January 24, 2012

First I check whether the new file is available; if it is, then:

while (!feof($handle)) {
    $line   = stream_get_line($handle, 81);
    $lineID = substr($line, 0, 2);   // record type: TI, BS, BX, LO, LI or LT

    if ($lineID == 'TI') {
        ti($line);
    }
    if ($lineID == 'BS') {
        // start of a new record: generate the shared id used for every
        // line until the next 'BS' line
        $time    = md5(microtime());
        $time    = "a$time";
        $details = bs($line, $time);
    }
    if ($lineID == 'BX') {
        bx($line, $details);
    }
    if ($lineID == 'LO') {
        lo($line, $details);
    }
    if ($lineID == 'LI') {
        li($line, $details);
    }
    if ($lineID == 'LT') {
        lt($line, $details);
    }
}

Each custom function opens the relevant CSV file, appends the needed data and closes the CSV file again. $details holds the md5 hash of the microtime taken when 'BS' was detected at the start of a line, therefore giving each line the same id until the next 'BS' line start is encountered.
SergeiSS Posted January 24, 2012

"each custom function opens the relevant csv file, appends the needed data, closes csv file."

I'm quite sure that this is your problem! You have millions of lines, and you open and close files millions of times - that's very, very bad, because opening and closing a file takes time, and you are doing it millions of times over. Open the files at the beginning, before your loop starts, and close them once the loop has finished. Just try it - the preparation time might drop from hours to minutes, or tens of minutes.
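Something along these lines, for example (a minimal sketch - the helper function, output paths and $fields variable are placeholders for illustration, not the OP's actual code):

<?php
// Keep one handle per output CSV open for the whole run,
// instead of reopening the file for every line written.
$handles = array();

function csv_handle($name)
{
    global $handles;
    if (!isset($handles[$name])) {
        $handles[$name] = fopen("output/$name.csv", 'a');
    }
    return $handles[$name];
}

// Inside the processing loop, instead of fopen/fwrite/fclose each time
// ($fields being whatever data was parsed from the current line):
fputcsv(csv_handle('service_details'), $fields);

// Once the whole input file has been processed:
foreach ($handles as $h) {
    fclose($h);
}
?>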
SergeiSS Posted January 26, 2012

Did you solve your problem, or not?
cyberbob1uk Posted January 28, 2012

Sorry, I've been at work for the last 4 nights. I have been trying to get it to work, but to no avail. I'm now starting to think of another database structure that could work, in the hope it will hold all the data as needed.
PFMaBiSmAd Posted January 28, 2012

There are some things in the posted code, and in the implied code, that are not efficient for processing a large amount of data:

1) Reading the file one line at a time - you should read a large chunk of the file at a time, such as 100k-200k characters.
2) As SergeiSS pointed out - opening and closing files repeatedly.
3) Once you do have the data preprocessed, you should either get the database engine to import the data (see the LOAD DATA LOCAL INFILE query for MySQL) or, if you must use PHP code to perform each insert/replace query, use the multi-value version of INSERT/REPLACE statements with a few thousand records in each query.

We can only directly help if we know what portion of that data you are actually keeping and what processing you are doing (i.e. we need your existing code to reproduce the problem). Also, if this data comes from a standard/published source and you mention the source, someone might know of an existing script that deals with it.

Just an FYI: I created a test data file with 140k copies of the sample data you posted and was able to read through the ~176 MB / ~2 million line file in blocks of 200k characters at a time - breaking the data into lines in an array, ready to be processed - in 4 seconds for the whole file when using an explode statement to break up the lines (14 seconds when using a preg_split statement).
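To illustrate point 3 (a sketch only - the table, columns and parsed variables are invented for the example, not taken from the OP's code; $db is assumed to be a mysqli connection):

<?php
// Collect a few thousand rows per INSERT instead of one query per row.
$batch     = array();
$batchSize = 2000;

function flush_batch($db, &$batch)
{
    if (!$batch) {
        return;
    }
    $db->query("INSERT INTO stop_times (service_id, station, arrive, depart) VALUES "
        . implode(',', $batch));
    $batch = array();
}

// Inside the processing loop, for each parsed line:
$batch[] = sprintf("('%s','%s','%s','%s')",
    $db->real_escape_string($serviceId),
    $db->real_escape_string($station),
    $db->real_escape_string($arrive),
    $db->real_escape_string($depart));

if (count($batch) >= $batchSize) {
    flush_batch($db, $batch);
}

// After the loop, write out whatever is left:
flush_batch($db, $batch);
?>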
silkfire Posted January 29, 2012

PFMa, you say it's better to read in 200k chars at a time, but how do you deal with the situation where you don't get the full last row? Do you somehow glue it onto the next chunk you read, or...? The lines are a max of 80 chars, but could they be shorter, unless padding is used to reach 80 chars/line? And what does the ~2M refer to?
PFMaBiSmAd Posted January 29, 2012

"Do you somehow glue it together to the next row you read"

Yes. Any partial line left over from the end of one read is carried over and prepended to the data from the following read.
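Roughly like this (a sketch of the technique only, not the exact code used for the timing test above; the file name is a placeholder):

<?php
$handle = fopen('schedule.txt', 'r');
$carry  = '';                                   // partial line left over from the previous block

while (!feof($handle)) {
    $block = $carry . fread($handle, 200000);   // read roughly 200k characters
    $lines = explode("\n", $block);
    $carry = array_pop($lines);                 // last element may be an incomplete line

    foreach ($lines as $line) {
        // each $line here is a complete record line, ready to process
    }
}

if ($carry !== '') {
    // process the final line if the file does not end with a newline
}
fclose($handle);
?>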
laffin Posted January 29, 2012

But you don't really need that - I think the manual splitting of the file is unnecessary. The processing loop is a conglomeration of ifs, which should be optimised: use a switch/case, or continue, to avoid any further if comparisons once a line has matched. You should be looking at your manual work and trying to automate that.

<?php
$fh        = fopen('sample.txt', 'rt');
$details   = array();
$insection = FALSE;
$ctr       = 0;

while (!feof($fh)) {
    $line = fgets($fh);
    $id   = substr($line, 0, 2);

    // ignore anything that isn't inside a BS...LT block
    if ($id !== 'BS' && !$insection) continue;

    switch ($id) {
        case 'BS':
            // Initial processing - generate the shared record id here
            // (kept in its own variable so it survives until the LT line)
            $recordId  = md5(microtime());
            $insection = TRUE;
            break;
        case 'TI':
        case 'BX':
        case 'LO':
        case 'LI':
            // Each section should have its own processor
            break;
        case 'LT':
            // Do final processing here
            $ctr++;
            $insection = FALSE;
    }
}
fclose($fh);
echo "Processed $ctr sections";
?>

This should handle each section of the big file - just add your processing code.
cyberbob1uk Posted January 30, 2012

After typing and testing for a number of hours, the problem lies with the number of files that need creating. Each line in the data that starts with LO, LI or LT is a station or timing point; removing the LO, LI or LT gives the unique reference to that station:

LOLANCSTR becomes LANCSTR
LIMORCMSJ becomes MORCMSJ
LTBAROW becomes BAROW

Each of these timing points needs its own table (containing the rest of the data on the same row), as users on the site will search by station/timing point initially, and then extra info is collected from the other tables using the unique id.

I ran the file creation script so that all LO, LI and LT lines were written to one file, and this produced 2.2 million lines of data (in a matter of 5 seconds). After eventually getting that file into a database (LOAD DATA LOCAL INFILE failed and the table I was trying to insert into crashed, with no lines inserted), I performed a couple of test searches, but they were taking 10 seconds to return just the bare basic info - and that was before I collected the additional info from the other tables and echoed the results to the browser.

There are approximately 5,500 different timing points for which tables are needed - surely there has to be a better way for me to do this?
PFMaBiSmAd Posted January 30, 2012

We cannot specifically help you unless you show us specifically what you are doing that you need help with. However, it sounds like you are creating multiple tables to hold the same kind of data. That is a bad database design and won't ever produce results efficiently.
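For example (a sketch only - the table, column names and sizes are invented for illustration, not taken from the OP's schema), a single table with an index on the station code supports the search-by-timing-point case without needing 5,500 separate tables:

<?php
// One table for every LO/LI/LT row, indexed by station code,
// instead of one table per timing point.
$db->query("CREATE TABLE stop_times (
                service_id CHAR(33)    NOT NULL,  -- the shared per-record id ('a' + md5)
                station    VARCHAR(10) NOT NULL,  -- e.g. LANCSTR, MORCMSJ, BAROW
                line_type  CHAR(2)     NOT NULL,  -- LO, LI or LT
                line_data  CHAR(78)    NOT NULL,  -- the rest of the 80-character line
                KEY idx_station (station),
                KEY idx_service (service_id)
            )");

// Searching by timing point is then a single indexed lookup:
$stmt = $db->prepare("SELECT * FROM stop_times WHERE station = ?");
$stmt->bind_param('s', $station);
$stmt->execute();
?>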
laffin Posted January 30, 2012

Avoid the creation of the files and let the script do the inserts. Look at what you are doing manually and see if it can be automated.