cyberbob1uk Posted January 24, 2012

Hi guys,

I currently receive a large text file (> 500 MB) once per week, which I have been manually splitting and then processing to obtain the required CSV files. However, this is taking in the region of 2 to 3 hours. Very soon these files will be sent daily, and I really don't have the time to split and process them every day.

I have been experimenting for a while, trying to parse everything properly/automatically with fopen, feof and fgets (and the other 'f' functions), but the script never seems to read the file all the way to the end - I assume this is due to memory usage.

The data in the file follows a strict pattern throughout, which is:

BSNY990141112271112270100000 POO2C35 122354000 DMUS 075 O
BX NTY
LOLANCSTR 1132 11322 TB
LIMORCMSJ 1135 00000000
LICRNFNJN 1140 00000000 H
LICRNF 1141H1142H 11421142 T
LISDAL 1147H1148H 11481148 T
LIARNSIDE 1152H1153 11531153 T
LIGOVS 1158 1159 11581159 T
LIKTBK 1202 1202H 12021202 T
LICARK 1206 1207 12061207 T
LIULVRSTN 1214H1215H 12151215 T
LIDALTON 1223 1223H 12231223 T
LIDALTONJ 1225 00000000
LIROOSE 1229 1229H 12291229 T 2
LTBAROW 1237 12391 TF

That is just one record of information (1 of around 140,000 records). Each record has no fixed number of lines, but each line within a record is fixed at 80 characters, and all lines in a record need to share the same unique 'id' - at present I'm using an md5 hash of microtime. The first line of every record starts with 'BS' and the last line of each record starts with 'LT', terminating with 'TF'. Everything in between also follows a pattern which I can break down effectively. The record above shows one train service schedule, hence why each line in a record needs the same unique id.

Anyone got any ideas on how I could process such a file effectively?

Many thanks

Dave
trq Posted January 24, 2012

Are you trying to do this via a web interface? Do you need to? You shouldn't have any problem processing this via the command line with any scripting language.
cyberbob1uk Posted January 24, 2012

This is all being done on my web server, and it does need to be, as the daily files will be placed directly on the server ready for processing.
trq Posted January 24, 2012

That doesn't really answer the question. Are you executing this script by making a request to the script via a browser?
cyberbob1uk Posted January 24, 2012

Actually, after replying, I think I caught on to your train of thought. While testing, I have been executing the script via a browser (and not getting the expected results). However, I assume that if the browser isn't used to execute the script, everything 'should' work as expected?
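For what it's worth (an assumption about the setup, not something confirmed in the thread): a script requested through a browser is normally subject to the max_execution_time limit, so even before moving to the command line it is worth removing that limit and raising the memory ceiling at the top of the script. A minimal sketch:

<?php
set_time_limit(0);                 // remove the execution time limit for this run
ini_set('memory_limit', '512M');   // allow more memory if needed (value is a guess)

// When run from the command line (php process_file.php), PHP applies
// no max_execution_time by default, so the CLI route avoids this entirely.
?>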
PFMaBiSmAd Posted January 24, 2012

Define: "manually splitting then processing to obtain the required CSV files".

Why and what are you splitting the data for? What processing are you doing? Why and what are the 'required' CSV files?
cyberbob1uk Posted January 24, 2012

"Define: manually splitting then processing to obtain the required CSV files"

The weekly files I have been doing on my local machine, through the night at work. The original file was split into manageable chunks (usually 10-12 separate files). Then each record in each smaller file was made into an array, using an md5 hash of microtime to create a unique array key. In the next script I manually changed the name of the smaller file (now a PHP file containing arrays) and included it in the script which put the info into CSV files.

"Why and what are you splitting the data for? What processing are you doing? Why and what are the 'required' CSV files?"

Each record in the file contains the stopping pattern for one train service operating between date_a and date_b on day(s)_x (this info is held within the first line of each record) - one train can have more than one pattern depending on the date and day. Because of the number of services and service variations, each record needs a unique id so that different schedules for the same train on different days don't get mixed up.

From this, the main CSV files - service details and stopping patterns - are created. Service details contains the necessary parts for that service; stopping patterns contains the timing points for each record in order of passing (these are the lines starting 'LO', 'LI' and 'LT'); then each timing point has its own CSV containing every service that passes/stops at it. The CSVs are then transferred to the database using LOAD DATA INFILE.
SergeiSS Posted January 24, 2012

I think you could try loading it directly into the database. Every database has a command that allows this, and your file is well structured and ready to be loaded. You could do the following while loading:

1. Assign automatic, auto-incremented ids.
2. Analyse the information and mark certain records as the start/end of a block - you could do this inside a trigger. You could also skip some lines, or save additional info into other tables based on the information in certain lines... In other words, you can do whatever you need.

500 MB / 80 chars ≈ 6.25 million lines. That's not too much for a good server and a good SQL server. Right now the CSV step is costing you a lot of additional processing time.
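As a rough illustration of the idea (a minimal sketch only - it assumes MySQL with local_infile enabled and a staging table you would have to create yourself; none of these names come from the thread):

<?php
// Staging table assumed to exist, e.g.:
//   CREATE TABLE raw_lines (
//       id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
//       line CHAR(80) NOT NULL
//   );
$db = mysqli_init();
$db->options(MYSQLI_OPT_LOCAL_INFILE, true);   // required for LOCAL INFILE
$db->real_connect('localhost', 'user', 'pass', 'trains');

// Each 80-character line becomes one row; the auto-increment id preserves
// the original order, so the BS...LT blocks stay together for later SQL
// processing (or for a trigger, as suggested above).
$db->query("LOAD DATA LOCAL INFILE '/path/to/schedule.txt'
            INTO TABLE raw_lines
            LINES TERMINATED BY '\\n'
            (line)");
?>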
cyberbob1uk Posted January 24, 2012

Originally I did try sending the data directly to the database, but it was taking an unbelievably long time. For that reason I decided to create CSV files, load those into the database, then destroy the files, as they would no longer be needed.
SergeiSS Posted January 24, 2012

OK. Could you show the code you use for the file processing? Maybe something is wrong there.

You see, I also load a lot of statistics data every day. One part is loaded automatically overnight - I don't know its volume. Another part is loaded under my control in the morning. The second part is around 400-500 MB; it takes roughly 15-20 minutes to prepare (a C++ program does that) and then around 10-15 minutes to load into the database, across several tables. The database is PostgreSQL. As you can see, the time is not that big. I mention this only to show you that I know something about this kind of work.
cyberbob1uk Posted January 24, 2012

First I check whether the new file is available; if it is, then:

while (!feof($handle)) {
    $line   = stream_get_line($handle, 81);
    $lineID = substr($line, 0, 2);   // record type: TI, BS, BX, LO, LI or LT

    if ($lineID == 'TI') {
        ti($line);
    }
    if ($lineID == 'BS') {
        // start of a new record: generate the shared id used for every
        // line until the next 'BS' line
        $time    = md5(microtime());
        $time    = "a$time";
        $details = bs($line, $time);
    }
    if ($lineID == 'BX') {
        bx($line, $details);
    }
    if ($lineID == 'LO') {
        lo($line, $details);
    }
    if ($lineID == 'LI') {
        li($line, $details);
    }
    if ($lineID == 'LT') {
        lt($line, $details);
    }
}

Each custom function opens the relevant CSV file, appends the needed data and closes the CSV file again. $details holds the md5 hash of the microtime taken when 'BS' was detected at the start of a line, therefore giving each line the same id until the next 'BS' line start is encountered.
SergeiSS Posted January 24, 2012

"each custom function opens the relevant csv file, appends the needed data, closes csv file."

I'm quite sure that this is your problem! You have millions of lines, and you open and close files millions of times - that's very, very bad, because opening and closing a file takes time, and you are doing it millions of times over. Open the files at the beginning, before your loop starts, and close them once the loop has finished. Just try it - the preparation time might drop from hours to minutes, or tens of minutes.
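Something along these lines, for example (a minimal sketch - the helper function, output paths and $fields variable are placeholders for illustration, not the OP's actual code):

<?php
// Keep one handle per output CSV open for the whole run,
// instead of reopening the file for every line written.
$handles = array();

function csv_handle($name)
{
    global $handles;
    if (!isset($handles[$name])) {
        $handles[$name] = fopen("output/$name.csv", 'a');
    }
    return $handles[$name];
}

// Inside the processing loop, instead of fopen/fwrite/fclose each time
// ($fields being whatever data was parsed from the current line):
fputcsv(csv_handle('service_details'), $fields);

// Once the whole input file has been processed:
foreach ($handles as $h) {
    fclose($h);
}
?>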
SergeiSS Posted January 26, 2012

Did you solve your problem, or not?
cyberbob1uk Posted January 28, 2012

Sorry, I've been at work for the last 4 nights. I have been trying to get it to work, but to no avail. I'm now starting to think of another database structure that could work, in the hope it will hold all the data as needed.
PFMaBiSmAd Posted January 28, 2012

There are some things in the posted code, and in the implied code, that are not efficient for processing a large amount of data:

1) Reading the file one line at a time - you should read a large chunk of the file at a time, such as 100k-200k characters.
2) As SergeiSS pointed out - opening and closing files repeatedly.
3) Once you do have the data preprocessed, you should either get the database engine to import the data (see the LOAD DATA LOCAL INFILE query for MySQL) or, if you must use PHP code to perform each insert/replace query, use the multi-value version of INSERT/REPLACE statements with a few thousand records in each query.

We can only directly help if we know what portion of that data you are actually keeping and what processing you are doing (i.e. we need your existing code to reproduce the problem). Also, if this data comes from a standard/published source and you mention the source, someone might know of an existing script that deals with it.

Just an FYI: I created a test data file with 140k copies of the sample data you posted and was able to read through the ~176 MB / ~2 million line file in blocks of 200k characters at a time - breaking the data into lines in an array, ready to be processed - in 4 seconds for the whole file when using an explode statement to break up the lines (14 seconds when using a preg_split statement).
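To illustrate point 3 (a sketch only - the table, columns and parsed variables are invented for the example, not taken from the OP's code; $db is assumed to be a mysqli connection):

<?php
// Collect a few thousand rows per INSERT instead of one query per row.
$batch     = array();
$batchSize = 2000;

function flush_batch($db, &$batch)
{
    if (!$batch) {
        return;
    }
    $db->query("INSERT INTO stop_times (service_id, station, arrive, depart) VALUES "
        . implode(',', $batch));
    $batch = array();
}

// Inside the processing loop, for each parsed line:
$batch[] = sprintf("('%s','%s','%s','%s')",
    $db->real_escape_string($serviceId),
    $db->real_escape_string($station),
    $db->real_escape_string($arrive),
    $db->real_escape_string($depart));

if (count($batch) >= $batchSize) {
    flush_batch($db, $batch);
}

// After the loop, write out whatever is left:
flush_batch($db, $batch);
?>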
silkfire Posted January 29, 2012

PFMa, you say it's better to read in 200k chars at a time, but how do you deal with the situation where you don't get the full last row? Do you somehow glue it onto the next chunk you read, or...? The lines are a max of 80 chars, but could they be shorter, unless padding is used to reach 80 chars/line? And what does the ~2M refer to?
PFMaBiSmAd Posted January 29, 2012

"Do you somehow glue it together to the next row you read"

Yes. Any partial line left over from the end of one read is carried over and prepended to the data from the following read.
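Roughly like this (a sketch of the technique only, not the exact code used for the timing test above; the file name is a placeholder):

<?php
$handle = fopen('schedule.txt', 'r');
$carry  = '';                                   // partial line left over from the previous block

while (!feof($handle)) {
    $block = $carry . fread($handle, 200000);   // read roughly 200k characters
    $lines = explode("\n", $block);
    $carry = array_pop($lines);                 // last element may be an incomplete line

    foreach ($lines as $line) {
        // each $line here is a complete record line, ready to process
    }
}

if ($carry !== '') {
    // process the final line if the file does not end with a newline
}
fclose($handle);
?>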
laffin Posted January 29, 2012

But you don't really need that - I think the manual splitting of the file is unnecessary. The processing loop is a conglomeration of ifs, which should be optimised: use a switch/case, or continue, to avoid any further if comparisons once a line has matched. You should be looking at your manual work and trying to automate that.

<?php
$fh        = fopen('sample.txt', 'rt');
$details   = array();
$insection = FALSE;
$ctr       = 0;

while (!feof($fh)) {
    $line = fgets($fh);
    $id   = substr($line, 0, 2);

    // ignore anything that isn't inside a BS...LT block
    if ($id !== 'BS' && !$insection) continue;

    switch ($id) {
        case 'BS':
            // Initial processing - generate the shared record id here
            // (kept in its own variable so it survives until the LT line)
            $recordId  = md5(microtime());
            $insection = TRUE;
            break;
        case 'TI':
        case 'BX':
        case 'LO':
        case 'LI':
            // Each section should have its own processor
            break;
        case 'LT':
            // Do final processing here
            $ctr++;
            $insection = FALSE;
    }
}
fclose($fh);
echo "Processed $ctr sections";
?>

This should handle each section of the big file - just add your processing code.
cyberbob1uk Posted January 30, 2012

After typing and testing for a number of hours, the problem lies with the number of files that need creating. Each line in the data that starts with LO, LI or LT is a station or timing point; removing the LO, LI or LT gives the unique reference to that station:

LOLANCSTR becomes LANCSTR
LIMORCMSJ becomes MORCMSJ
LTBAROW becomes BAROW

Each of these timing points needs its own table (containing the rest of the data on the same row), as users on the site will search by station/timing point initially, and then extra info is collected from the other tables using the unique id.

I ran the file creation script so that all LO, LI and LT lines were written to one file, and this produced 2.2 million lines of data (in a matter of 5 seconds). After eventually getting that file into a database (LOAD DATA LOCAL INFILE failed and the table I was trying to insert into crashed, with no lines inserted), I performed a couple of test searches, but they were taking 10 seconds to return just the bare basic info - and that was before I collected the additional info from the other tables and echoed the results to the browser.

There are approximately 5,500 different timing points for which tables are needed - surely there has to be a better way for me to do this?
PFMaBiSmAd Posted January 30, 2012

We cannot specifically help you unless you show us specifically what you are doing that you need help with. However, it sounds like you are creating multiple tables to hold the same kind of data. That is a bad database design and won't ever produce results efficiently.
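For example (a sketch only - the table, column names and sizes are invented for illustration, not taken from the OP's schema), a single table with an index on the station code supports the search-by-timing-point case without needing 5,500 separate tables:

<?php
// One table for every LO/LI/LT row, indexed by station code,
// instead of one table per timing point.
$db->query("CREATE TABLE stop_times (
                service_id CHAR(33)    NOT NULL,  -- the shared per-record id ('a' + md5)
                station    VARCHAR(10) NOT NULL,  -- e.g. LANCSTR, MORCMSJ, BAROW
                line_type  CHAR(2)     NOT NULL,  -- LO, LI or LT
                line_data  CHAR(78)    NOT NULL,  -- the rest of the 80-character line
                KEY idx_station (station),
                KEY idx_service (service_id)
            )");

// Searching by timing point is then a single indexed lookup:
$stmt = $db->prepare("SELECT * FROM stop_times WHERE station = ?");
$stmt->bind_param('s', $station);
$stmt->execute();
?>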
laffin Posted January 30, 2012

Avoid the creation of the files and let the script do the inserts. Look at what you are doing manually and see if it can be automated.