Jump to content


Photo

Need advice: Access times


  • Please log in to reply
21 replies to this topic

#1 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 25 July 2006 - 07:38 AM

This isn't really PHP specific so i thought i'd post here.

I'm reading a CSV file into my script (and this CSV may well contain upto 3 million records with upto 53 fields per record) and I do a number of things to the data like mathematical calculations and insertions into a database.

The database i'm using is Postgres.
I'm wondering if it would be better for me to:
A. read the entire CSV into an array (via fgetcsv) and INSERT it into a temporary table in the database and then select based on a LIMIT clause and process around 500 records at a time
B. read each CSV record, line by line (via fgets or fscanf) and process each line as i come accross it.


there are other options such as running a system command. what i'm looking for is the fastest method for process these records.

thanks for your time.

#2 ober

ober
  • Staff Alumni
  • Advanced Member
  • 5,337 posts
  • LocationEast Coast, USA

Posted 25 July 2006 - 12:31 PM

I have no data to back this up, but I'd imagine that option A would be faster.

Info: PHP Manual


#3 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 25 July 2006 - 02:06 PM

I think B would be faster and more efficient because you're going line by line, whereas loading the whole thing into an array immediately sucks up memory. Do you have the option to use Perl?
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#4 ober

ober
  • Staff Alumni
  • Advanced Member
  • 5,337 posts
  • LocationEast Coast, USA

Posted 25 July 2006 - 02:23 PM

Ahh... but if he's dumping it into a table within the database, eating up memory isn't an issue.  Plus, you have all those reads from the file (slow), instead of queries on the table (faster).  I think you're putting too much weight on the fact that the array only exists for a short amount of time before the data is moved to the database for processing.

Info: PHP Manual


#5 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 25 July 2006 - 02:28 PM

the problem with method A is that it immediately takes up memory like effigy mentioned.
i would have to change PHP's default memory allocation. Currenty it's 8 and 2MB. would 20 and 12MB be sufficient?

the problem with method B is that i will be accessing the disk over 3 million times. thats a lot :/

#6 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 25 July 2006 - 07:14 PM

Do you have the option to use Perl?


Not at the moment...but it's running on a dedicated server which i have full access to so i could install what i want i suppose.I don't think it would be a problem...any ideas, effigy? unfortunately, i don't know Perl but i'm willing to learn what i need to (for now) to get this running efficiently.

I'm also considering using the shell because I'm guaranteed that this application will only be run on a Linux/UNIX, Apache, Postgres, PHP environment.



i could bypass PHP entirely and use awk and parse the CSV and write them to a text file as SQL inserts and use CLI for Postgres to refer to the text file for the queries.
from thereon, i could access a few hundred records at a time from the database and process them.

that would be more efficient, i imagine, since it's the OS's shell that will be handling everything.

i just thought about this...sounds like a good idea right? except that i know very little awk but it's syntax seems relatively simple but i have no idea how to work Postgres via CLI.

#7 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 25 July 2006 - 07:24 PM

I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use DBD::CSV, while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's time command to see who gets the job done quicker.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#8 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 25 July 2006 - 07:33 PM

I know Perl has a reputation for being the champ at text processing. I asked some of the guys on www.perlmonks.net how they would accomplish this. One recommendation was to use DBD::CSV, while another mentioned that going line-by-line shouldn't be an issue due to input buffering by either readline or the OS itself; I know nothing about either.

I think the shell would be nasty for this, but this could depend on the content and how much processing is needed.

awk seems like a viable option; I know very little myself.

Is the data using quotes as delimiters also, or just commas?

You can always use Unix's time command to see who gets the job done quicker.


thanks for asking at perlmonks.net :)
i will have a look at that option if i go for that idea, thanks.

it's a comma delimited CSV.

yeah, actually i should have used time to do my research instead of posting here but unfortunately, time is a big problem. we are almost at the deadline and there's still some work to be done and i didn't want to have to try a bunch of options so i thought i would ask here first.

thanks so far guys.

#9 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 25 July 2006 - 07:38 PM

Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. ./csv.pl in.csv.

#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
	chomp;
	my @fields = split ',' => $_;
}

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#10 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 26 July 2006 - 04:47 AM

Here's a basic example to start with. Simply run it with the csv file as the argument, e.g. ./csv.pl in.csv.

#!/usr/local2/bin/perl
use warnings;
use strict;
while (<>) {
	chomp;
	my @fields = split ',' => $_;
}

If the commas are surrounded by whitespace, replace ',' with /\s*,\s*/.


thanks for that.
so on the first line you have the shebang line, followed by the inclusion of libraries? and what's the condition in that while loop? and what is chomp? i dont understand the next line either. what does $_ do?

can you direct me to the official perl manual?
is this it: http://www.cs.cmu.edu/cgi-bin/perl-man ?

wouldn't mind some beginner toutorials too, if you have any handy links

#11 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 26 July 2006 - 04:59 AM

One of the beauties of Perl are shortcuts, which make it fun/painful to learn :)

#!/usr/local2/bin/perl
### The above line should point to your perl executable.
### The following 2 lines are called "pragmata"--they
### instruct the compiler what to do. In this case, they
### instruct it to (1) report warnings; and (2) use
### strict (good) programming practices. Always,
### always use these 2 lines.
use warnings;
use strict;
### The diamond operator (<>) has a lot of "magic" to
### it. By default it reads from the filehandle ARGV,
### which is either (1) a list of file names if they were
### supplied, or (2) STDIN. This allows programs to
### easily be included in pipelines.
while (<>) {
	### chomp removes the trailing new line character,
	### if any, from the string.
	chomp;
	### "my" is declaring a lexical (not global) variable;
	### the "@" indicates that it is an array;
	### and split is splitting the variable $_ by a comma.
	my @fields = split ',' => $_;
}

$_ is Perl's default variable. In most cases it is used when a variable name is not specified. The while(<>) construct is "magically" storing each line into $_, which chomp works off by default also.

There should be some tutorials at www.perlmonks.net. Documentation can be found via the perldoc command, or @ http://perldoc.perl.org/, http://www.perl.com/...q/documentation, and http://www.perl.org/docs.html for starters.

Let me know if you have any other questions.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#12 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 26 July 2006 - 01:45 PM

thanks for taking the time to explain :)

are warnings in Perl similar to the ones in PHP? i.e., do they have similar error levels?

so i could pass the csv file name to perl via terminal as an option and argv would grab it? or would i have to use ">" or "|" to direct the csv contents via cat?



also, how is Perl with db connectivity? Postgres specifically.

#13 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 26 July 2006 - 01:57 PM

Sort of. In older versions they were either on or off, but in newer versions the pragma can now be scoped and turned on or off based on the type; have a look at this.

Yes. Any arguments passed on the command line are in the @ARGV array which can be used like any other array. When the while (<>) { } construct is used, it knows to automatically read each argument in @ARGV and treat it as a file. No redirection or cat-ing is needed.

I haven't used Perl with any databases yet, but I know it can. I found this example.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#14 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 26 July 2006 - 02:09 PM

ok great. i understand it so far.

i have another questions with regards to argv.
suppose i want to pass 3 csv files, they would be in argv[1], argv[2] and argv[3] right, assuming i used it like i have posted below?
$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl



do you know of any issues Apache has with perl? I'd have to use mod_perl, right?

#15 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 26 July 2006 - 02:56 PM

I haven't used Perl with Apache yet either :) but yes, mod_perl is what I've heard of.

$ perlExecutable -file=1.csv 2.csv 3.csv myScript.pl would fail, unless I'm missing something. You'd really want $ myScript.pl 1.csv 2.csv 3.csv, or even $ myScript.pl *.csv.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#16 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 28 July 2006 - 07:23 PM

What's the status of this? I'm interested in your results.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#17 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 30 July 2006 - 03:34 PM

well i messed about with the basics of perl a bit. didn't really get much done because there are other aspects of the project that need to be completed as well.

i was wondering, since i'm in a hurry and this needs to be done properly and securely, do you think i'd be able to outsource this module to you?

i could give you the pseudo code as well as the PHP code. if you are interested, let me know, effigy.
my contact details can be found in the Staff Contact Info. sticky thread in the Admin Cafe.

#18 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 30 July 2006 - 05:22 PM

I have e-mailed you.
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#19 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 30 July 2006 - 06:57 PM

thanks, i just replied :)

#20 Koobi

Koobi
  • Staff Alumni
  • Advanced Member
  • 419 posts
  • LocationColombo, Sri Lanka | South Asia

Posted 17 August 2006 - 03:19 PM

any developments on this, effigy?

if you've replied to my mail, i haven't recieved it yet.




0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users