Lectrician Posted August 25, 2014

Hi. I am working with some rather large files. Sometimes I am just opening a file and trying to find an entry on a single line. Sometimes I am opening a file and working on each line. What's the most efficient way to work with a large file? If I open a file and place all lines into an array, I will use a lot of memory. If I open a file and work on it line by line, I should use much less memory, but will the process take longer? Which is more efficient? I have underlined my main question above. Thanks :-)
mac_gyver Posted August 25, 2014

"Whats the most efficient way to work with a large file?"

Store the data in a database. a) You can set up indexes that allow information to be found quickly, without reading through the whole set of data. b) The database engine is compiled code that can find information at least 10x faster than PHP's interpreted code can.
mogosselin Posted August 25, 2014

Yes, you can read it line by line without exhausting memory:

```php
<?php
$fh = fopen("inputfile.txt", "r");
if ($fh) {
    while (($line = fgets($fh)) !== false) {
        echo $line;
        // do something with $line..
    }
    fclose($fh);
}
?>
```

But yes, it will take time (depending on how big your file is). More info on fgets is in the official PHP guide.

Depending on what information is in your files (like mac_gyver said), it would probably be a good idea to use a database. You could create a script that reads your text file and puts the lines in the database each hour or day, depending on what you need. That's a usual 'load' process. If you need something more robust, think about using a queue system (RabbitMQ, for example) to load your files into a database. Or, depending on what you need your data for, you could also load it into a search system (Solr, Elasticsearch) so that you can issue very fast search queries.

If you want more precise tips, tell us what you're using your data for, what the average volume is (number of lines, etc.) and whether there are relations between those files.
Lectrician Posted August 25, 2014

Thanks. I was looking at two different scenarios.

I have a log file which is appended to and rotated daily. I can't alter how this is written. Every night, my PHP script runs to extract data from each line. The files have become too large to open in one go into an array, so I altered the script to open the file and extract data line by line, using code similar to the above (thanks). This now works fine.

It got me thinking. I have a log file which I append to as users pass through a captive portal. It stores a user's MAC, IP, HOSTNAME, HTTP HEADER, NAME, POSTCODE, EMAIL etc. on a line. The file can be up to 400 lines long sometimes. When a user is presented with the captive portal, if they have passed through before, their data is auto-filled. Their cookies are checked first, but if these have been cleared, up to 30 days of log files are opened, each line being checked for the user's MAC. If a match is found, their details are auto-filled.

I can't decide if it's best to read a whole log file into an array and work through each entry, or to read the file line by line. This is where my speed vs memory query came from. I want it done quickly, but don't want to kill the process due to lack of memory (I don't think it has ever died from lack of memory yet). Cheers.
mogosselin Posted August 25, 2014

So, if I understand correctly, you store user information in a log file. One line equals one user? So you have a lot of data on the same line (MAC address, IP, etc.) for each user. Then you need to open the file, parse the lines and find the user quickly to fill in their information. If that's what you're doing, this would normally be done with a database.

What's your key for your user? IP address? MAC address? Remember that the IP and MAC address could be the same for multiple users. For example, at my work, we all have the same IP and MAC address.
trq Posted August 25, 2014

You probably want to look at generators for the most efficient way to loop through a file's contents.

```php
<?php
function getLinesFromFile($file) {
    $f = fopen($file, 'r');
    if (!$f) {
        return;
    }
    try {
        // Compare against false so a final line of just "0" (no trailing
        // newline) does not end the loop early.
        while (($line = fgets($f)) !== false) {
            yield $line;
        }
    } finally {
        // Runs even if the caller breaks out of the foreach early.
        fclose($f);
    }
}

foreach (getLinesFromFile("yourlog.txt") as $line) {
    // do something with $line
}
```

This will only ever load a single line into memory at a time.
Lectrician Posted August 25, 2014

Thanks. Yes, the line for each user looks like this:

timestamp | name | email | postcode | IP given | their MAC | hostname | HTTP header | a few other bits of info

There can be 400-odd lines on busy days. I am running pfSense and using its captive portal, altering the PHP to suit me. I check to make sure no cookie exists holding the name, email and postcode (these are auto-filled into the HTTP form fields when they go to connect). If no cookie exists, I open yesterday's logs, search each line for the user's MAC address, and then auto-fill. I think it is safe to assume the MACs will be different, as this is how pfSense deals with identifying users. If yesterday's logs don't produce a match, the script will go back up to 30 days.

This whole log search could be omitted, but it is preferred to have the form fields auto-filled, as customers are less annoyed! I am not sure I can set up a database on pfSense. I would need to have a look. Everything else on my version of pfSense runs as a flat file system. Newer versions do use an SQL database.
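A minimal sketch of the lookup described in this post, reading each file line by line and stopping at the first day that contains the MAC. The function name `findUserByMac`, the returned array keys and the zero-based field positions (MAC in field 5) are assumptions based on the line layout quoted above, not code from the actual portal. It also assumes the name, email and postcode fields never contain a `|` character.

```php
<?php
// Hypothetical helper: scan pipe-delimited portal logs for a MAC address.
// Assumed field order: timestamp | name | email | postcode | IP | MAC | ...
function findUserByMac(array $logFiles, string $mac): ?array
{
    $mac = strtolower(trim($mac));
    foreach ($logFiles as $file) {              // pass the newest file first
        if (!is_readable($file)) {
            continue;                           // skip days with no log
        }
        $fh = fopen($file, 'r');
        if ($fh === false) {
            continue;
        }
        $match = null;
        while (($line = fgets($fh)) !== false) {
            $fields = array_map('trim', explode('|', $line));
            if (count($fields) >= 6 && strtolower($fields[5]) === $mac) {
                // Keep scanning: a later line in the same file is more recent.
                $match = [
                    'name'     => $fields[1],
                    'email'    => $fields[2],
                    'postcode' => $fields[3],
                ];
            }
        }
        fclose($fh);
        if ($match !== null) {
            return $match;                      // older files are never opened
        }
    }
    return null;                                // MAC not seen in any file
}
```

Only one line is held in memory at a time, and with files of a few hundred lines the extra read calls are unlikely to be noticeable.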
kicken Posted August 25, 2014

Reading a file line by line may be slightly slower because it runs more read operations. I don't think the difference in speed would be all that significant, though.

If reading line by line does result in a significant decrease in performance, you can improve things by reading in chunks rather than line by line. For example, read 10MiB of data at a time, then split that into lines and process each line. If the last line is incomplete, prepend it to the next chunk of data that gets read. Reading in larger chunks reduces the number of read operations, while the chunk size you choose keeps memory usage at a manageable level.
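The chunked approach described above might be sketched as follows. The function name `readLinesInChunks` and the 1 MiB default chunk size are illustrative assumptions. Tune the size against your own files.

```php
<?php
// Sketch of chunked reading: yields complete lines while holding at most
// one chunk plus one partial line in memory.
function readLinesInChunks(string $file, int $chunkSize = 1048576): Generator
{
    $fh = fopen($file, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $file");
    }
    try {
        $carry = '';                              // incomplete tail of the previous chunk
        while (!feof($fh)) {
            $chunk = fread($fh, $chunkSize);
            if ($chunk === false || $chunk === '') {
                break;
            }
            $lines = explode("\n", $carry . $chunk);
            $carry = array_pop($lines);           // last piece may be a partial line
            foreach ($lines as $line) {
                yield rtrim($line, "\r");         // tolerate Windows line endings
            }
        }
        if ($carry !== '') {
            yield rtrim($carry, "\r");            // final line without a trailing newline
        }
    } finally {
        fclose($fh);                              // also runs if the caller stops early
    }
}
```

It is used the same way as a line-by-line loop: `foreach (readLinesInChunks('yourlog.txt') as $line) { ... }`. Each yielded line has its line ending stripped, unlike `fgets`, which keeps it.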
mogosselin Posted August 25, 2014

"I think it is safe to assume the MACs wil be different, as this is how PFsense deals with identifying users."

I don't know your exact use case or what pfSense does, and I'm not a hardware expert, but I'm pretty sure that if you pass through a modem of some sort to get on the Internet, you'll have this modem's MAC and IP address. So two people on two computers going through the same modem would get the same MAC address. It means that I could get my colleague's information filled in. You should check whether that's a problem for your use case.

Also, if you want to check more than just yesterday's log, you could load that information into a database once a day. Add a cron job at midnight to load the last log file into a DB.
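The nightly load suggested above could look roughly like this. It is only a sketch under several assumptions: the `pdo_sqlite` extension is available on the box (which may not hold on an older pfSense install), the table and column names are invented for illustration, and the field positions follow the pipe-delimited layout described earlier in the thread. Because `mac` is the primary key, SQLite indexes it, so a lookup does not scan every row.

```php
<?php
// Hypothetical schema: one row per MAC, overwritten by the most recent visit.
function openPortalDb(string $dsn): PDO
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec('CREATE TABLE IF NOT EXISTS portal_users (
        mac TEXT PRIMARY KEY, name TEXT, email TEXT, postcode TEXT, seen_at TEXT)');
    return $db;
}

// Reads the log line by line and returns the number of rows written.
function importLog(PDO $db, string $file): int
{
    $fh = fopen($file, 'r');
    if ($fh === false) {
        return 0;
    }
    $stmt = $db->prepare('INSERT OR REPLACE INTO portal_users
        (mac, name, email, postcode, seen_at) VALUES (?, ?, ?, ?, ?)');
    $count = 0;
    $db->beginTransaction();                     // one transaction keeps the import fast
    while (($line = fgets($fh)) !== false) {
        $f = array_map('trim', explode('|', $line));
        if (count($f) < 6) {
            continue;                            // skip malformed lines
        }
        $stmt->execute([strtolower($f[5]), $f[1], $f[2], $f[3], $f[0]]);
        $count++;
    }
    $db->commit();
    fclose($fh);
    return $count;
}

function lookupByMac(PDO $db, string $mac): ?array
{
    $stmt = $db->prepare('SELECT name, email, postcode FROM portal_users WHERE mac = ?');
    $stmt->execute([strtolower(trim($mac))]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

The portal page would then call `lookupByMac()` once instead of opening up to 30 log files.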
kicken Posted August 25, 2014

MAC addresses are not maintained across the network path like IP addresses are. A computer only ever sees the MAC of whatever the previous computer was in the path. Typically this is whatever router the computer is connected to. I'm guessing in this case they are dealing with a router, in which case the MACs of each computer connecting to the router will be available and will be unique.
Lectrician Posted August 25, 2014

Hi, yes, pfSense is a firewall, and clients connect to it, receiving an IP by DHCP from it. pfSense is able to read the MAC and HOSTNAME, and use these to identify a user uniquely. Obviously they could be spoofed.

I have not run into issues opening the files to search yet. They are not huge. No errors are reported. I was just wondering about the best method. Reading the file in chunks may be the best option. Thanks.