Jump to content

Need advice: Access times


Koobi

Recommended Posts

This isn't really PHP specific so i thought i'd post here.

I'm reading a CSV file into my script (and this CSV may well contain upto 3 million records with upto 53 fields per record) and I do a number of things to the data like mathematical calculations and insertions into a database.

The database i'm using is Postgres.
I'm wondering if it would be better for me to:
A. read the entire CSV into an array (via fgetcsv) and INSERT it into a temporary table in the database and then select based on a LIMIT clause and process around 500 records at a time
B. read each CSV record, line by line (via fgets or fscanf) and process each line as i come accross it.


there are other options such as running a system command. what i'm looking for is the fastest method for process these records.

thanks for your time.
Link to comment
Share on other sites

Ahh... but if he's dumping it into a table within the database, eating up memory isn't an issue.  Plus, you have all those reads from the file (slow), instead of queries on the table (faster).  I think you're putting too much weight on the fact that the array only exists for a short amount of time before the data is moved to the database for processing.
Link to comment
Share on other sites

the problem with method A is that it immediately takes up memory like effigy mentioned.
i would have to change PHP's default memory allocation. Currenty it's 8 and 2MB. would 20 and 12MB be sufficient?

the problem with method B is that i will be accessing the disk over 3 million times. thats a lot :/
Link to comment
Share on other sites

[quote author=effigy link=topic=101767.msg403174#msg403174 date=1153836381]
Do you have the option to use Perl?
[/quote]

Not at the moment...but it's running on a dedicated server which i have full access to so i could install what i want i suppose.I don't think it would be a problem...any ideas, effigy? unfortunately, i don't know Perl but i'm willing to learn what i need to (for now) to get this running efficiently.

I'm also considering using the shell because I'm guaranteed that this application will only be run on a Linux/UNIX, Apache, Postgres, PHP environment.



i could bypass PHP entirely and use awk and parse the CSV and write them to a text file as SQL inserts and use CLI for Postgres to refer to the text file for the queries.
from thereon, i could access a few hundred records at a time from the database and process them.

that would be more efficient, i imagine, since it's the OS's shell that will be handling everything.

i just thought about this...sounds like a good idea right? except that i know very little awk but it's syntax seems relatively simple but i have no idea how to work Postgres via CLI.
Link to comment
Share on other sites

I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use [url=http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm]DBD::CSV[/url], while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's [b]time[/b] command to see who gets the job done quicker.
Link to comment
Share on other sites

[quote author=effigy link=topic=101767.msg403374#msg403374 date=1153855496]
I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use [url=http://search.cpan.org/~jzucker/DBD-CSV-0.22/lib/DBD/CSV.pm]DBD::CSV[/url], while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's [b]time[/b] command to see who gets the job done quicker.
[/quote]

thanks for asking at perlmonks.net :)
i will have a look at that option if i go for that idea, thanks.

it's a comma delimited CSV.

yeah, actually i should have used time to do my research instead of posting here but unfortunately, time is a big problem. we are almost at the deadline and there's still some work to be done and i didn't want to have to try a bunch of options so i thought i would ask here first.

thanks so far guys.
Link to comment
Share on other sites

Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. [tt]./csv.pl in.csv[/tt].

[code]
#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
chomp;
my @fields = split ',' => $_;
}
[/code]

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.
Link to comment
Share on other sites

[quote author=effigy link=topic=101767.msg403385#msg403385 date=1153856324]
Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. [tt]./csv.pl in.csv[/tt].

[code]
#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
chomp;
my @fields = split ',' => $_;
}
[/code]

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.
[/quote]

thanks for that.
so on the first line you have the shebang line, followed by the inclusion of libraries? and what's the condition in that while loop? and what is chomp? i dont understand the next line either. what does $_ do?

can you direct me to the official perl manual?
is this it: http://www.cs.cmu.edu/cgi-bin/perl-man ?

wouldn't mind some beginner toutorials too, if you have any handy links
Link to comment
Share on other sites

One of the beauties of Perl are shortcuts, which make it fun/painful to learn :)

[code]
#!/usr/local2/bin/perl
### The above line should point to your perl executable.
### The following 2 lines are called "pragmata"--they
### instruct the compiler what to do. In this case, they
### instruct it to (1) report warnings; and (2) use
### strict (good) programming practices. Always,
### always use these 2 lines.
use warnings;
use strict;
### The diamond operator (<>) has a lot of "magic" to
### it. By default it reads from the filehandle ARGV,
### which is either (1) a list of file names if they were
### supplied, or (2) STDIN. This allows programs to
### easily be included in pipelines.
while (<>) {
### chomp removes the trailing new line character,
### if any, from the string.
chomp;
### "my" is declaring a lexical (not global) variable;
### the "@" indicates that it is an array;
### and split is splitting the variable $_ by a comma.
my @fields = split ',' => $_;
}[/code]

[b]$_[/b] is Perl's default variable. In most cases it is used when a variable name is not specified. The [tt]while(<>)[/tt] construct is "magically" storing each line into $_, which chomp works off by default also.

There should be some tutorials at www.perlmonks.net. Documentation can be found via the perldoc command, or @ http://perldoc.perl.org/, http://www.perl.com/pub/q/documentation, and http://www.perl.org/docs.html for starters.

Let me know if you have any other questions.
Link to comment
Share on other sites

thanks for taking the time to explain :)

are warnings in Perl similar to the ones in PHP? i.e., do they have similar error levels?

so i could pass the csv file name to perl via terminal as an option and argv would grab it? or would i have to use ">" or "|" to direct the csv contents via cat?



also, how is Perl with db connectivity? Postgres specifically.
Link to comment
Share on other sites

Sort of. In older versions they were either on or off, but in newer versions the pragma can now be scoped and turned on or off based on the type; have a look at [url=http://perldoc.perl.org/perllexwarn.html]this[/url].

Yes. Any arguments passed on the command line are in the @ARGV array which can be used like any other array. When the [tt]while (<>) { }[/tt] construct is used, it knows to automatically read each argument in @ARGV and treat it as a file. No redirection or cat-ing is needed.

I haven't used Perl with any databases yet, but I know it can. I found [url=http://www.postgresql.org/files/documentation/books/aw_pgsql/node156.html]this[/url] example.
Link to comment
Share on other sites

ok great. i understand it so far.

i have another questions with regards to argv.
suppose i want to pass 3 csv files, they would be in argv[1], argv[2] and argv[3] right, assuming i used it like i have posted below?
[code]
$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl
[/code]



do you know of any issues Apache has with perl? I'd have to use mod_perl, right?
Link to comment
Share on other sites

I haven't used Perl with Apache yet either :) but yes, mod_perl is what I've heard of.

[tt]$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl[/tt] would fail, unless I'm missing something. You'd really want [tt]$ myScript.pl 1.csv 2.csv 3.csv[/tt], or even [tt]$ myScript.pl *.csv[/tt].
Link to comment
Share on other sites

well i messed about with the basics of perl a bit. didn't really get much done because there are other aspects of the project that need to be completed as well.

i was wondering, since i'm in a hurry and this needs to be done properly and securely, do you think i'd be able to outsource this module to you?

i could give you the pseudo code as well as the PHP code. if you are interested, let me know, effigy.
my contact details can be found in the [b]Staff Contact Info.[/b] sticky thread in the Admin Cafe.
Link to comment
Share on other sites

  • 3 weeks later...
Hey,

I was waiting on the:

[quote]
I'll send you more details this week because I don't have all the
files with me at this moment. I could send you the PHP we have
completed for this process as well as the rough pseudo code plus a
sample CSV with about 10 records.
[/quote]

I should have sent you a follow-up e-mail, but I've been swamped.

Overall, based on your requirements, it [i]sounds[/i] simple; it's the details and testing that need to be fine tuned.
Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.