Archived

This topic is now archived and is closed to further replies.

Koobi

Need advice: Access times

Recommended Posts

This isn't really PHP-specific, so I thought I'd post it here.

I'm reading a CSV file into my script (this CSV may well contain up to 3 million records, with up to 53 fields per record) and I do a number of things to the data, like mathematical calculations and insertions into a database.

The database I'm using is Postgres.
I'm wondering whether it would be better for me to:
A. read the entire CSV into an array (via fgetcsv), INSERT it into a temporary table in the database, then SELECT based on a LIMIT clause and process around 500 records at a time; or
B. read the CSV record by record (via fgets or fscanf) and process each line as I come across it.


There are other options too, such as running a system command. What I'm looking for is the fastest method of processing these records.
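For what it's worth, the bulk-load half of option A is usually done with Postgres's COPY command rather than millions of individual INSERTs. A minimal sketch, where the table layout, column names, and file paths are all hypothetical, and the psql invocation is only shown as a comment:

```shell
# Write the bulk-load SQL to a file. COPY parses the CSV server-side
# in a single pass, which is typically far faster than row-by-row INSERTs.
cat > load.sql <<'EOF'
CREATE TEMP TABLE staging (id integer, name text, amount numeric);
COPY staging FROM '/tmp/data.csv' WITH CSV;
EOF

# It would then be run with something like:
#   psql -d mydb -f load.sql
cat load.sql
```

Note that COPY FROM needs a path readable by the server process; psql's \copy variant reads the file client-side instead.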

thanks for your time.

I have no data to back this up, but I'd imagine that option A would be faster.

I think B would be faster and more efficient because you're going line by line, whereas loading the whole thing into an array immediately sucks up memory. Do you have the option to use Perl?

Ahh... but if he's dumping it into a table within the database, eating up memory isn't an issue.  Plus, you have all those reads from the file (slow), instead of queries on the table (faster).  I think you're putting too much weight on the fact that the array only exists for a short amount of time before the data is moved to the database for processing.

The problem with method A is that it immediately takes up memory, like effigy mentioned.
I would have to change PHP's default memory allocation. Currently it's 8 and 2 MB; would 20 and 12 MB be sufficient?

The problem with method B is that I will be accessing the disk over 3 million times. That's a lot :/
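If those limits do need raising, the relevant php.ini directives would look something like this (assuming the two figures refer to memory_limit and post_max_size; the values below are just the ones mentioned, not recommendations):

```ini
; php.ini -- raise these only as far as testing shows is necessary
memory_limit = 20M
post_max_size = 12M
```

memory_limit can also be changed at runtime with ini_set(), or per-invocation with php -d memory_limit=20M.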

[quote author=effigy link=topic=101767.msg403174#msg403174 date=1153836381]
Do you have the option to use Perl?
[/quote]

Not at the moment... but it's running on a dedicated server which I have full access to, so I suppose I could install whatever I want. I don't think it would be a problem... any ideas, effigy? Unfortunately, I don't know Perl, but I'm willing to learn what I need to (for now) to get this running efficiently.

I'm also considering using the shell because I'm guaranteed that this application will only be run on a Linux/UNIX, Apache, Postgres, PHP environment.



I could bypass PHP entirely: use awk to parse the CSV and write the records out to a text file as SQL INSERTs, then use the Postgres CLI to run the queries from that file.
From there on, I could access a few hundred records at a time from the database and process them.

That would be more efficient, I imagine, since it's the OS's shell that would be handling everything.

I just thought of this... sounds like a good idea, right? Except that I know very little awk (though its syntax seems relatively simple), and I have no idea how to work Postgres via the CLI.
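The awk half of that idea could be sketched like this. The table name, the three-field layout, and the file names are all made up for illustration, and note that a naive comma split breaks on quoted fields that themselves contain commas:

```shell
# A tiny sample CSV, 3 fields per record.
printf '1,foo,2.5\n2,bar,3.7\n' > sample.csv

# -F, splits each line on commas; printf emits one INSERT per record.
awk -F, '{ printf "INSERT INTO records VALUES (%s, '\''%s'\'', %s);\n", $1, $2, $3 }' \
    sample.csv > inserts.sql

cat inserts.sql
# The resulting file could then be fed to Postgres with:
#   psql -d mydb -f inserts.sql
```

Wrapping the generated file in a single transaction (BEGIN; ... COMMIT;) would also speed up the inserts considerably.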

I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use [url=http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm]DBD::CSV[/url], while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's [b]time[/b] command to see who gets the job done quicker.
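A timing run along those lines might look like this; the script names in the comments are placeholders, and wc -l stands in for a real candidate just to show the shape of the comparison:

```shell
# Generate a modest test file: 100,000 three-field records.
seq 1 100000 | awk '{ print $1 "," $1*2 "," $1*3 }' > bench.csv

# time reports real/user/sys for whatever command follows it.
time wc -l bench.csv
# time ./csv.pl bench.csv      # the Perl candidate
# time php process.php        # the PHP candidate
```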

[quote author=effigy link=topic=101767.msg403374#msg403374 date=1153855496]
I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use [url=http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm]DBD::CSV[/url], while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's [b]time[/b] command to see who gets the job done quicker.
[/quote]

Thanks for asking at perlmonks.net :)
I'll have a look at that option if I go with that idea, thanks.

It's a comma-delimited CSV.

Yeah, I really should have used time to do my research instead of posting here, but unfortunately time is exactly what I'm short of. We're almost at the deadline, there's still some work to be done, and I didn't want to have to try out a bunch of options, so I thought I'd ask here first.

Thanks so far, guys.

Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. [tt]./csv.pl in.csv[/tt].

[code]
#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
    chomp;
    my @fields = split ',' => $_;
}
[/code]

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.

[quote author=effigy link=topic=101767.msg403385#msg403385 date=1153856324]
Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. [tt]./csv.pl in.csv[/tt].

[code]
#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
    chomp;
    my @fields = split ',' => $_;
}
[/code]

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.
[/quote]

Thanks for that.
So on the first line you have the shebang line, followed by the inclusion of libraries? And what's the condition in that while loop? What is chomp? I don't understand the next line either. What does $_ do?

Can you direct me to the official Perl manual?
Is this it: http://www.cs.cmu.edu/cgi-bin/perl-man ?

Wouldn't mind some beginner tutorials too, if you have any handy links.

One of the beauties of Perl is its shortcuts, which make it both fun and painful to learn :)

[code]
#!/usr/local2/bin/perl
### The above line should point to your perl executable.
### The following 2 lines are called "pragmata"--they
### instruct the compiler what to do. In this case, they
### instruct it to (1) report warnings; and (2) use
### strict (good) programming practices. Always,
### always use these 2 lines.
use warnings;
use strict;
### The diamond operator (<>) has a lot of "magic" to
### it. By default it reads from the filehandle ARGV,
### which is either (1) a list of file names if they were
### supplied, or (2) STDIN. This allows programs to
### easily be included in pipelines.
while (<>) {
    ### chomp removes the trailing newline character,
    ### if any, from the string.
    chomp;
    ### "my" is declaring a lexical (not global) variable;
    ### the "@" indicates that it is an array;
    ### and split is splitting the variable $_ by a comma.
    my @fields = split ',' => $_;
}[/code]

[b]$_[/b] is Perl's default variable. In most cases it is used when a variable name is not specified. The [tt]while (<>)[/tt] construct "magically" stores each line in $_, which chomp also operates on by default.

There should be some tutorials at www.perlmonks.net. Documentation can be found via the perldoc command, or @ http://perldoc.perl.org/, http://www.perl.com/pub/q/documentation, and http://www.perl.org/docs.html for starters.

Let me know if you have any other questions.

Thanks for taking the time to explain :)

Are warnings in Perl similar to the ones in PHP? I.e., do they have similar error levels?

So I could pass the CSV file name to Perl via the terminal as an argument and @ARGV would grab it? Or would I have to use "<" or "|" to direct the CSV contents via cat?

Also, how is Perl with DB connectivity? Postgres specifically.

Sort of. In older versions they were either on or off, but in newer versions the pragma can now be scoped and turned on or off based on the type; have a look at [url=http://perldoc.perl.org/perllexwarn.html]this[/url].

Yes. Any arguments passed on the command line are in the @ARGV array which can be used like any other array. When the [tt]while (<>) { }[/tt] construct is used, it knows to automatically read each argument in @ARGV and treat it as a file. No redirection or cat-ing is needed.

I haven't used Perl with any databases yet, but I know it can. I found [url=http://www.postgresql.org/files/documentation/books/aw_pgsql/node156.html]this[/url] example.

OK, great. I understand it so far.

I have another question with regard to argv.
Suppose I want to pass 3 CSV files; they would be in argv[1], argv[2] and argv[3], right, assuming I used it like I've posted below?
[code]
$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl
[/code]



Do you know of any issues Apache has with Perl? I'd have to use mod_perl, right?

I haven't used Perl with Apache yet either :) but yes, mod_perl is what I've heard of.

[tt]$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl[/tt] would fail, unless I'm missing something. You'd really want [tt]$ myScript.pl 1.csv 2.csv 3.csv[/tt], or even [tt]$ myScript.pl *.csv[/tt].

What's the status of this? I'm interested in your results.

Well, I messed about with the basics of Perl a bit. I didn't really get much done, because there are other aspects of the project that need to be completed as well.

I was wondering: since I'm in a hurry and this needs to be done properly and securely, do you think I'd be able to outsource this module to you?

I could give you the pseudo code as well as the PHP code. If you're interested, let me know, effigy.
My contact details can be found in the [b]Staff Contact Info.[/b] sticky thread in the Admin Cafe.

Thanks, I just replied :)

Any developments on this, effigy?

If you've replied to my mail, I haven't received it yet.

Hey,

I was waiting on the:

[quote]
I'll send you more details this week because I don't have all the
files with me at this moment. I could send you the PHP we have
completed for this process as well as the rough pseudo code plus a
sample CSV with about 10 records.
[/quote]

I should have sent you a follow-up e-mail, but I've been swamped.

Overall, based on your requirements, it [i]sounds[/i] simple; it's the details and testing that need to be fine-tuned.

Ah, OK, I was waiting for your reply.

I'll send you the necessary data this week.

Thanks.

