dilbertone

May 17, 2011

Hello, many thanks for the answer!

I want to run this task with cURL. This is my approach for this task.

Well, I will come back with some lines of code in the next few days. If anybody can give a helping hand I would be more than happy.

May 17, 2011

Hello dear Community, hello dear Andy

I want to parse a site called the Foundation Finder. My Perl knowledge is pretty small!

I have tried various tutorials (examples for Mechanize that I found on CPAN). Not all of them work; some of them are broken!

Now I want to try a real-world task!

The Foundation Finder task has several steps. Especially interesting for me as a PHP/Perl beginner is this site in Switzerland: http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=3221

which has a dataset of 2700 foundations. All the data are free to use, with no copyright limitations on them.

I mused about a starting point: could I use a Perl module from CPAN and do the job with Perl? I guess that Mechanize or LWP could do a great job, or HTML::Parser. Well, I am just musing about which is the best way to do the job. I guess that I am in front of a nice learning curve. This task will give me some nice PHP or Perl lessons.

Or can we do this with Python as well? I guess so! So here I am!

So here is a sample page for the real-world task, a governmental site in Switzerland with more than 2700 foundations:

http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=3221

Can I do this with Mechanize?

love to get a hint

thx matz

May 14, 2011

hello dear community,

I am currently working on an approach to parse some sites that contain data on foundations in Switzerland, with some details like goals, contact e-mail and the like.

See http://www.foundationfinder.ch/ which has a dataset of 790 foundations. All the data are free to use, with no copyright limitations on them.

I have tried it with PHP Simple HTML DOM Parser, but I have seen that it is difficult to get all the data that is needed to get it up and running.

Who wants to jump in and help with creating this scraper/parser? I would love to hear from you.

Please help me to get up to speed with this approach.

regards

Dilbertone

February 26, 2011

howdy myarro

interesting thing -

I am using file_get_contents to grab HTML pages. Seems to work fine as a part of a 1-shot function. But as soon as I include the function as part of a loop, it doesn't return anything...

for ($i = 0; $i < 10; $i++) {

...

getFile($urlArray[$i];

...

What's the deal?

Well, I am eager to see what the full code will look like. I am currently working on the same thing...

Looking forward to hearing from you

cheers

db1

BTW - ever tried to do it with CURL ...

February 26, 2011

good evening dear community! Howdy,

at the moment i am debugging some lines of code...

Purpose: I want to process multiple web pages, kind of like a web spider/crawler might. I have some bits, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page has more than 6000 results! How do I get all of them? I use the module LWP::Simple and I need some improved arguments that I can use in order to get all the 6150 records.

Attempt: Here are the first 5 page URLs:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
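
As a rough sketch of the pattern in these URLs, here is the same idea in Python (only because Python came up as an option in this thread). The helper name `page_urls` and its default values are made up for illustration, and nothing is fetched here; the list is only built:

```python
BASE_URL = "http://192.68.214.70/km/asps/schulsuche.asp"


def page_urls(query="e", per_page=50, last_offset=6100):
    """Build one result-page URL per offset: s=0, 50, 100, ... up to last_offset."""
    return [
        f"{BASE_URL}?q={query}&a={per_page}&s={offset}"
        for offset in range(0, last_offset + 1, per_page)
    ]


urls = page_urls()
print(len(urls))   # number of pages that would have to be fetched
print(urls[0])
print(urls[-1])
```

Each URL in the list could then be fetched and parsed in turn, one page at a time.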

We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:


#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  
  
my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  
  
my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   
   
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
$html =~ tr/r//d;     # strip the carriage returns  
$html =~ s/ / /g; # expand the spaces  
  
my $te = new HTML::TableExtract();  
$te->parse($html);  
  
my $csv = Text::CSV->new({ binary => 1 });  
  
foreach my $ts ($te->table_states) {  
	foreach my $row ($ts->rows) {  
			#trim leading/trailing whitespace from base fields  
		s/^s+//, s/\s+$// for @$row;  

		#load the fields into the hash using a "hash slice"  
		my %h;  
		@h{@cols} = @$row;  
  
		#derive some fields from base fields, again using a hash slice  
		@h{qw/name street postal town/} = split /n+/, $h{name};  
		@h{qw/phone fax/} = split /n+/, $h{phone};  
  
		#trim leading/trailing whitespace from derived fields  
		s/^s+//, s/\s+$// for @h{qw/name street postal town/};  
  
		$csv->combine(@h{@fields});  
		print $csv->string, "\n";  
	}  
} 
}

I tested the code and got the following results (see below for the error messages shown on the command line)...

btw: here the lines 57 and 58:

	#trim leading/trailing whitespace from derived fields  
		s/^s+//, s/\s+$// for @h{qw/name street postal town/};

what do you think?

Sta�e
    PLZ                                                                                                                                                                                                  
    Ot",,,Telefo,Fax,Schulat,Webseite                                                                                                                                                                    
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
"lfd. N.",Schul-numme,Schul,"ame                                                                                                                                                                         
    Sta�e                                                                                                                                                                                                
    PLZ                                                                                                                                                                                                  
    Ot",,,Telefo,Fax,Schulat,Webseite                                                                                                                                                                    
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                                                     
"lfd. N.",Schul-numme,Schul,"ame                                                                                                                                                                         
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
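
One guess, not confirmed yet: the backslashes may have been lost when the code was copied. Without them, `tr/r//d` deletes every letter "r" and `split /n+/` splits on the letter "n" instead of on line breaks. That would fit "Schul-numme", "Telefo" and the "Schul" / "ame" pieces in the output above, and the warnings would appear whenever a cell yields fewer parts than expected. A small Python sketch of the difference (`split_cell` is a hypothetical helper, and the header text is retyped from the page, not fetched):

```python
import re


def split_cell(cell, pattern, width):
    """Split one table cell with `pattern`, then pad or trim to `width` parts."""
    parts = [part.strip() for part in re.split(pattern, cell) if part.strip()]
    return (parts + [""] * width)[:width]


header = "Schulname\nStraße\nPLZ\nOrt"

broken = split_cell(header, r"n+", 4)    # splits on the letter "n"
fixed = split_cell(header, r"\n+", 4)    # splits on real line breaks

print(broken)
print(fixed)

# The same kind of slip with the carriage-return cleanup:
print("Schulart".replace("r", ""))       # deletes the letter "r"
print("Schulart\r".replace("\r", ""))    # deletes only the carriage return
```

If this guess is right, the Perl lines would need `\r`, `\n` and `\s` with the backslash in front.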

February 25, 2011

Good evening, here I am back again!

I ran into trouble... I guess that I made some mistakes while applying some code in the above-mentioned script...

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/ / /g; # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

trim leading/trailing whitespace from base fields
        s/^s+//, s/\s+$// for @$row;

load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /n+/, $h{name};
        @h{qw/phone fax/} = split /n+/, $h{phone};

trim leading/trailing whitespace from derived fields
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}

there have been some issues - i have made a mistake

i guess that the error is here:

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/ / /g; # expand the spaces

I have written some kind of duplicate code. I need to leave out one part, this one here. What do you think about this?

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/ / /g; # expand the spaces

I get these kind of errors - it looks very very nasty!

martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl>

what do you think!?

i look forward to hear from you!

February 25, 2011

good day, hello dear community!

I am currently ironing out a little parser script. I have some bits, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page has got more than 6000 results!

Well how do i get all the results?

I tried out several things, but they did not help. I always got bad results.

See, I have good CSV data, but unfortunately no spider logic... I need some bits to get there! How do I get there?

I use the module LWP::Simple and I need some improved arguments that I can use in order to get all the 6150 records.



#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the non-breaking spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}

Well, with this I get good CSV output, but unfortunately no spider logic.

How do I add the spider logic here?

Well, I need some help.

Love to hear from you

February 24, 2011

hi dear Abracadaver,

many many thanks - i am very very happy to hear from you.

Why not try a Perl board?

I am pretty sure that this can be done in PHP as well, and CSV-formatted output is also well known in the PHP world. But the best argument is: I am a big, big fan of this site here.

And yes, you have helped me for years and years... your code is a lifesaver!

I know you from AutoTheme, and I have been a user of your site from the early beginning in 2003...

So I would be glad if you could help me here...

February 24, 2011

hello good day dear community,

I like this place. It is a great place for sharing ideas and knowledge! But by far the most impressive thing I learned is that this community here is so supportive. I am overwhelmed by this experience. This forum has so many great folks.

I have a little parser that parses a site with 6150 records. But I need to have this in CSV format. First of all, see the target site here: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

I need all the data, separated into the following fields:

    number
    schoolnumber
    school-name
    Address
    Street
    Postal Code
    phone
    fax
    School-type
    website

BTW - see here the target site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 and compare!

Well, I have a script, and I am very interested in what you think about it. Not all the fields are extracted yet; I need more of them!

    #!/usr/bin/perl
    use strict;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    
    my $total_records = 0;
    my $alpha = "x";
    my $results = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $percent = 0;
    
    workDir();
    chdir $processdir;
    processURL();
    print "\nPress <enter> to continue\n";
    <>;
    my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
    processData();
    close $outfile;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
    unlink 'processing.html';
    
    sub processURL() {
    print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
    getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';
    
       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }
    
    sub processData() {
       while ( $range <= $total_records) {
          my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
          getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          foreach my $ts ($te->table_states) {
             foreach my $row ($ts->rows) {
                cleanup(@$row);
    	    # Add a table column delimiter in this case ||
                print $outfile join("||", @$row)."\n";
                }
             }
          $| = 1;  
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
       }
    }
    
    sub cleanup() {
       for ( @_ ) {
          s/\s+/ /g;
       }
    }
    
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }

output:


    1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule|| 
    2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)|| 
    3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de 
    4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

With the || being the delimiter.

My problem is: i need to have more fields - i need to have the following divided:

    name: Volksschule Abenberg (Grundschule)
    street: Güssübelstr. 2
    postal-code and town: 91183 Abenberg
    fax and telephone: 09178/215 09178/905060
    type of school: Volksschulen
    website: home.t-online.de/home/vs-abenberg

Well, how do I add more fields?

This obviously has to be done in this line here, doesn't it?

my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how? I tried out several things, but they did not help. I always got bad results. By the way, I played around and tried another solution. With it I get good CSV data, but unfortunately no spider logic...

    #!/usr/bin/perl
    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TableExtract;
    use Text::CSV;
    
    my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
    $html =~ tr/\r//d;     # strip the carriage returns
    $html =~ s/&nbsp;/ /g; # expand the non-breaking spaces
    
    my $te = new HTML::TableExtract();
    $te->parse($html);
    
    my @cols = qw(
        rownum
        number
        name
        phone
        type
        website
    );
    
    my @fields = qw(
        rownum
        number
        name
        street
        postal
        town
        phone
        fax
        type
        website
    );
    
    my $csv = Text::CSV->new({ binary => 1 });
    
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
    
            # trim leading/trailing whitespace from base fields
            s/^\s+//, s/\s+$// for @$row;

            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /\n+/, $h{name};
            @h{qw/phone fax/} = split /\n+/, $h{phone};

            # trim leading/trailing whitespace from derived fields
            s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
    
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }

Well, with this other solution I get good CSV data, but unfortunately no spider logic.

How do I add the spider logic here?

I look forward to any and all help!

February 19, 2011

hi all - i need some ideas here. it is so frustrating to do the job without a script. I can do it manually - but this takes about 7 hours .....

February 19, 2011

hello dear all - hello all freaks of this great community,

One question regarding a parser. Note: it is a Perl parser, but believe me, I need some help with that. And I guess that many experts here know the Perl bits so well that this is no problem here...

Here we go! Is there any chance to get some separators between the table columns? The parser script already runs nicely. Note: I want to store the data in a MySQL database, so it would be great to have some separators (commas, tabs or something else). Tab-separated values or comma-separated values are handy formats to work with...

here the data out of the following site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20

lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de

2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg

6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg Regensburger Straße 60 93326 Abensberg 09443/709191

09443/709193 Berufsschulen zur sonderpädog. Förderung www.berufsschule-abensberg.de

Well, I need to have those lines divided into at least three columns. Take the second record:

name: Volksschule Abenberg (Grundschule)

street: Güssübelstr. 2

postal-code and town: 91183 Abenberg

fax and telephone: 09178/215 09178/905060

type of school: Volksschulen

website: home.t-online.de/home/vs-abenberg

Or even better: could I have the postal code and town divided into two separate columns?

Question: is this possible?

By the way: see the first record: (here i only show the names of the school)

1 0401 Mädchenrealschule Marienburg, Abenberg,

6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg

Note that those have some commas inside the name; does this make it difficult to create a parser that produces CSV format?
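
As far as I know, CSV handles this by putting such fields in double quotes, so commas inside a name should not be a problem as long as a real CSV writer is used instead of a plain join. A minimal sketch with Python's standard `csv` module (Python only because it was mentioned as an option elsewhere; the way the record is cut into ten fields is my own reading of the first row, and the empty fax field is an assumption):

```python
import csv
import io

row = [
    "1", "0401",
    "Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt",
    "Marienburg 1", "91183", "Abenberg",
    "09178/509210", "",            # empty fax field is assumed
    "Realschulen", "mrs-marienburg.homepage.t-online.de",
]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)   # quotes only the fields that need it
line = buffer.getvalue()
print(line)

# Reading the line back gives the original ten fields again.
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))
```

Text::CSV in Perl should do the same kind of quoting when its combine method is used.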

Any idea how to do this in Perl... If possible it would be just great!!

many many thx for a hint regarding this little issue - besides this all is great and fascinating!

dilbertone...

Here the code:

  #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    my $te = HTML::TableExtract->new;
    my $total_records = 0;
    my $suchbegriffe = "e";
    my $treffer = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $displaydate = "";
    my $percent = 0;

    &workDir();
    chdir $processdir;
    &processURL();
    print "\nPress <enter> to continue\n";
    <>;
    $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
    &processData();
    close OUTFILE;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
    unlink 'processing.html';
    die "\n";

    sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer <b>)(\d+)( - )(\d+)(<\/b> \w+ \w+ <b>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }

    sub processData() {
       while ( $range <= $total_records) {
          getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          for my $row ( $table->rows ) {
             cleanup(@$row);
             print OUTFILE "@$row\n";
          }
          $| = 1; 
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
          $te = HTML::TableExtract->new;
       }
    }

    sub cleanup() {
       for ( @_ ) {
          s/\s+/ /g;
       }
    }

    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }

February 13, 2011

Hello TLG

Yes, I want to split that information into three cells or columns (in MySQL).

BTW see the dataset: here you have an overview: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

Well, I have loaded the data of the online sheet into a Calc spreadsheet, and from there I imported it into MySQL.

In only one column (the third one!) I have the full address with

1. name of the school

2. name of the street

3. postal code and town

Well - i guess that your code hits the point. I take all the (almost 6000 ) records and apply your code below.

If I understand correctly, you would want something like this:

$opt = explode("\n", $record); // Don't forget to mysql_real_escape_string
mysql_query("INSERT INTO my_table (name, street, postal) VALUES ('{$opt[0]}','{$opt[1]}','{$opt[2]}')");

I will have a closer look at what explode does exactly. But I am pretty sure that you have given exactly the right hint...
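
As far as I understand it so far, explode simply cuts the string at every line break and returns the pieces as an array. A small Python sketch of the same idea, using placeholders instead of building the SQL string by hand. Note the assumptions: the standard-library `sqlite3` module only stands in for MySQL here, the table and column names are taken from the snippet above, and the sample record is the one from my first post:

```python
import sqlite3

record = "Ecole Saint-Exupery\nRue Saint-Malo 24\n67544 Paris"

# Rough equivalent of explode("\n", $record), plus trimming of each piece.
name, street, postal = [part.strip() for part in record.split("\n")]

conn = sqlite3.connect(":memory:")   # stand-in for the real MySQL connection
conn.execute("CREATE TABLE my_table (name TEXT, street TEXT, postal TEXT)")

# Placeholders let the driver do the escaping.
conn.execute(
    "INSERT INTO my_table (name, street, postal) VALUES (?, ?, ?)",
    (name, street, postal),
)

rows = conn.execute("SELECT name, street, postal FROM my_table").fetchall()
print(rows)
```

This only works if every record has exactly three lines; records with more or fewer lines would need extra handling.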

Best regards

db1

February 13, 2011

hello dear The Little Guy

many many thanks for the hints! GREAT!

$opt = explode("\n", $record);
var_dump($opt);

Well, did I get you right: I take the information of the third column in this huge spreadsheet that can be found here http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

(or the information that is derived from this dataset in a Calc spreadsheet) and apply your code to the third (!!!) column of the spreadsheet?

Note: I need to have the information not only in one cell but in three or four!?

Question: should i take the information of the huge (note the table contains almost 6000 records)

The Little Guy - i love to hear from you again. ..

best regards

db1

February 13, 2011

good day dear community,

Well, I am in big trouble: I need some regex to solve a problem! Can you help me a bit? That would be great! I mused a lot about what to call the subject. Finally I came to: "Regex or explode to array: I need some help with a simple string!"

I have a spreadsheet in Calc with some records. There is a column that contains the following information:

Ecole Saint-Exupery

Rue Saint-Malo 24

67544 Paris

Well i need to have those lines divided into at least three columns

name: Ecole Saint-Exupery

street: Rue Saint-Malo 24

postal code and town 67544 Paris

Or even better: could I have the postal code and town divided into two separate columns? Question: is this possible? Can (or should) I do this in Calc (OpenDocument format)? Do I need to use a regex and Perl, or am I able to solve this issue without a regex?

Note: finally I need to transfer the data into a MySQL database...

I look forward to a tip...

greetings

BTW: you can see all of this in a real-world live demo: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 - see the field

Schulname

Straße

PLZ Ort

This field contains three things: the name, the street, and the postal code with the town! Question: can this be divided into parts? If you copy and paste the information and drop it into Calc, then you get all the information in only one cell. How can I divide and separate all this information into three cells, or even four?

BTW, I tried to translate the information into hex code, see the following:

Staatl. Realschule Grafenau

Rachelweg 20

94481 Grafenau

00000000: 53 74 61 61 74 6C 2E 20  52 65 61 6C 73 63 68 75
00000010: 6C 65 20 47 72 61 66 65  6E 61 75 20 0A 52 61 63
00000020: 68 65 6C 77 65 67 20 32  30 0A 39 34 34 38 31 20
00000030: 20 47 72 61 66 65 6E 61  75 20 20

but i do not know if this helps here!??
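
Looking at the dump again myself: the byte 0A is a line feed, so the three parts seem to be separated by plain line breaks, and the postal code and town by spaces. If that is right, no complicated regex should be needed. A minimal Python sketch (just an illustration; it assumes every cell has exactly three lines, which the real records may not):

```python
hex_dump = (
    "53 74 61 61 74 6C 2E 20 52 65 61 6C 73 63 68 75 "
    "6C 65 20 47 72 61 66 65 6E 61 75 20 0A 52 61 63 "
    "68 65 6C 77 65 67 20 32 30 0A 39 34 34 38 31 20 "
    "20 47 72 61 66 65 6E 61 75 20 20"
)
cell = bytes.fromhex(hex_dump).decode("ascii")

# Split at the line feeds and trim the stray spaces around each part.
lines = [line.strip() for line in cell.split("\n")]
name, street, postal_town = lines

# The last line is "postal code, spaces, town": split at the first run of spaces.
postal, town = postal_town.split(None, 1)

print([name, street, postal, town])
```

The same two steps (split on the line break, then split the last line on whitespace) should be possible in Perl or PHP as well.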

Can you help me to solve the problem. Do i need to have a regex!?

Many thanks in advance for any and all help!

January 16, 2011

Hello Ignace

Many, many thanks for the idea. That sounds very good.

BTW:

As I already have the addresses in tab-separated format,

what about this: I can create 10 different tables (or fewer, according to the different formats) and load them into the database using the LOAD DATA INFILE command (MySQL 5.1 Reference Manual, 12.2.6 LOAD DATA INFILE Syntax). After this I can use the commands you posted to create a new table with the new address book format.

what do you think about this!

look forward to hear from you

best

db1

see also: http://dev.mysql.com/doc/refman/5.1/en/load-data.html

January 15, 2011

Hi there - hello BlueSkyIS

you can add more fields to any table you like. perhaps i don't understand the question?

Well, what if I want to migrate 10 (address book) DBs into one?

They look a bit different:

Address book 1: 	name	adress 	eMail	tel		Telefax	       portrait	

Address book 2: 	name	Company address: postalcode 	Telefon: 	Fax: 	E-Mail:	Internet: 

Address book 3: 	name	address	tel	fax	email	homepage

All ten look a bit different. How should I handle this migration of ten tables into one big DB?
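
To make the question more concrete, here is how I imagine the idea, as a rough Python sketch: every source column name is mapped to one common target column, and missing columns are filled with an empty string. Everything in it is hypothetical (the target column names, the mapping and the two sample rows are invented for illustration), so please correct me if this is the wrong way to think about it:

```python
# Hypothetical mapping from each source's column names to one target schema.
COLUMN_MAP = {
    "name": "name",
    "adress": "address", "address": "address",
    "eMail": "email", "E-Mail": "email", "email": "email",
    "tel": "phone", "Telefon": "phone",
    "Telefax": "fax", "Fax": "fax", "fax": "fax",
    "Internet": "homepage", "homepage": "homepage",
    "portrait": "portrait",
}
TARGET_COLUMNS = ["name", "address", "email", "phone", "fax", "homepage", "portrait"]


def normalize(row):
    """Rename known columns and fill columns missing in this source with ''."""
    unified = {column: "" for column in TARGET_COLUMNS}
    for source_column, value in row.items():
        target = COLUMN_MAP.get(source_column.rstrip(":"))
        if target is not None:
            unified[target] = value
    return unified


# Invented sample rows in the style of address book 1 and address book 3.
book1_row = {"name": "A. Example", "adress": "Main St 1", "eMail": "a@example.org", "tel": "123"}
book3_row = {"name": "B. Example", "address": "Side St 2", "tel": "456", "homepage": "example.org"}

merged = [normalize(book1_row), normalize(book3_row)]
for entry in merged:
    print(entry)
```

The unified rows could then go into one big table whose columns are the target columns.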

Hope i was able to make clear what i want. If i have to be more precise - just lemme know

Many thanks in advance

regards

db1

January 15, 2011

Hi dear freaks

I want to create an address book with MySQL. At the moment I do not know how many fields I need.

I want to be flexible with that, at least for the next few days, until I am sure how many fields I really need.

I have found a dump that is already built for an address book; I found this one on the internet.

http://www.apachefriends.org/f/viewtopic.php?f=14&t=26305&start=0&sid=633a3f317b08dc6d8e555a81ed10538f&view=print

# phpMyAdmin SQL Dump
# version 2.5.7-pl1
# http://www.phpmyadmin.net
#
# Host: localhost
# Erstellungszeit: 04. September 2007 um 16:37
# Server Version: 4.0.20
# PHP-Version: 4.3.7
#
# Datenbank: `joels`
#
# --------------------------------------------------------
#
# Tabellenstruktur für Tabelle `address_book`
#
CREATE TABLE `address_book` (
  `address_book_id` int(11) NOT NULL auto_increment,
  `customers_id` int(11) NOT NULL default '0',
  `entry_gender` char(1) NOT NULL default '',
  `entry_company` varchar(32) default NULL,
  `entry_firstname` varchar(32) NOT NULL default '',
  `entry_lastname` varchar(32) NOT NULL default '',
  `entry_street_address` varchar(64) NOT NULL default '',
  `entry_suburb` varchar(32) default NULL,
  `entry_postcode` varchar(10) NOT NULL default '',
  `entry_city` varchar(32) NOT NULL default '',
  `entry_state` varchar(32) default NULL,
  `entry_country_id` int(11) NOT NULL default '0',
  `entry_zone_id` int(11) NOT NULL default '0',
  PRIMARY KEY (`address_book_id`),
  KEY `idx_address_book_customers_id` (`customers_id`)
) TYPE=MyISAM AUTO_INCREMENT=2 ;

Can I use this - and can I easily add more fields?

Looking forward to hearing from you

Regards

db1

December 29, 2010

Hello BlueSkyIs,

many thanks for your answer!!

That function is so simple, i would just skip writing the function and put the function code within the loop.

Hmm - how do I apply it?

Hmm - as I am new to PHP I ask myself what the point of the function is. Does the $numbers loop go inside the function definition?

I guess not - I guess the loop goes outside the function definition, where the function is called multiple times.

function multiload() {
    /* Inside, define the function. */
}
multiload(); /* <-- Outside, call the function. */

Love to hear from you

best regards

db1

December 29, 2010

Hello dear community,

The following code is a solution that returns the labels and values in a formatted array ready for input to mysql. Very nice;-)

<?php

$dom = new DOMDocument();
@$dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=60119');
$divElement = $dom->getElementById('wfqbeResults');

$innerHTML= '';
$children = $divElement->childNodes;
foreach ($children as $child) {
$innerHTML = $child->ownerDocument->saveXML( $child );

$doc = new DOMDocument();
$doc->loadHTML($innerHTML);
//$divElementNew = $dom->getElementsByTagName('td');
$divElementNew = $dom->getElementsByTagname('td');

    /*** the array to return ***/
    $out = array();
    foreach ($divElementNew as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

echo '<pre>';
print_r($out);
echo '</pre>';

} 

?>

That bit of code works fine, and it performs an operation that I intend to call multiple times. Therefore it makes sense to wrap it in a function. We can name it whatever we want - let us just name it "multiload".

I tried to do this with the following code - but it does not run... I am still not sure where to put the uid - inside or outside the function...

<?php

function multiload ($uid) {
/*...*/
//  $uid = '60119';

$dom = new DOMDocument();

         $dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=' . $uid);


     }

multiload ('60089');
multiload ('60152');
multiload ('60242');
/*...*/

$divElement = $dom->getElementById('wfqbeResults');

$innerHTML= '';
$children = $divElement->childNodes;
foreach ($children as $child) {
$innerHTML = $child->ownerDocument->saveXML( $child );

$doc = new DOMDocument();
$doc->loadHTML($innerHTML);
//$divElementNew = $dom->getElementsByTagName('td');
$divElementNew = $dom->getElementsByTagname('td');

    /*** the array to return ***/
    $out = array();
    foreach ($divElementNew as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

echo '<pre>';
print_r($out);
echo '</pre>';

}


?>

Where should I put the following lines?

multiload('60089');
multiload('60152');
multiload('60242');
/*...*/

This is still repetitive, so we can put the numbers in an array - can't we?

Then we can loop through the array.

$numbers = array ('60089', '60152', '60242' /*...*/);
foreach ($numbers as $number) {
    multiload($number);
}

But the question is - how and where to put the loop?
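A minimal sketch of one way this could fit together, assuming the whole fetch-and-extract step moves inside the function and the loop stays outside it. The helper names `extractCells()` and `multiloadAll()` are made up for illustration. The URL and the id "wfqbeResults" come from the script above. Unlike the original script, this version reads the td cells only from inside the result container.

```php
<?php
// Base URL from the script above; the uid is appended to it.
const BASE_URL = 'http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=';

/** Collect the text of every <td> inside the element with id "wfqbeResults". */
function extractCells(DOMDocument $dom): array
{
    $out = array();
    $div = $dom->getElementById('wfqbeResults');
    if ($div === null) {
        return $out; // page without the expected container
    }
    foreach ($div->getElementsByTagName('td') as $td) {
        $out[] = trim($td->nodeValue);
    }
    return $out;
}

/** Fetch one page by uid and return its table cells (needs network access). */
function multiload(string $uid): array
{
    $dom = new DOMDocument();
    // @ suppresses warnings about malformed HTML, as in the original script
    if (!@$dom->loadHTMLFile(BASE_URL . $uid)) {
        return array();
    }
    return extractCells($dom);
}

/** The loop lives outside multiload(): one call per uid. */
function multiloadAll(array $numbers): array
{
    $results = array();
    foreach ($numbers as $number) {
        $results[$number] = multiload($number);
    }
    return $results;
}
```

Calling `print_r(multiloadAll(array('60089', '60152', '60242')));` would then print one array of cells per uid. The uid is passed in as a parameter, so it is set outside the function and used inside it.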

Can anybody give me a starting point...

BTW - if I need to be more descriptive, I will try to explain more - just let me know...

it is no problem to explain more

greetings

December 25, 2010

Hello dear friends,

found out the following:

$dom->getElementById('floatbox');

...in original html it's not an id, it's a class.

So i have to rewrite like so:

$divElement = $dom->getElementByClass('floatbox');

Well, I will try out this solution.
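A note on the attempt above: PHP's DOMDocument has no getElementByClass() method, so that call would fail. A common workaround is an XPath query on the class attribute. The following is a minimal sketch. The helper name `getElementsByClass()` is made up for illustration and is not part of the DOM extension.

```php
<?php
/** Return all elements whose class attribute contains the given class name. */
function getElementsByClass(DOMDocument $dom, string $className): array
{
    $xpath = new DOMXPath($dom);
    // Pad the class attribute with spaces so "floatbox" does not match "floatboxes".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $className
    );
    $nodes = $xpath->query($query);
    return $nodes === false ? array() : iterator_to_array($nodes);
}
```

With the document loaded as before, `getElementsByClass($dom, 'floatbox')` returns an array of matching nodes. Several elements can share a class, so the result is a list rather than a single element.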

December 25, 2010

hello dear revraz, hello dear litebearer, good day

Many many thanks to you both! Great to hear from you.

The idea with an array is convincing!

BTW: this uses the dot operator, doesn't it? One of two solutions for string or URL concatenation?

Perhaps like this?

$orig_string = "http://www.somesite.com?page=";
$number_array = array ("123", "43567", "9287","3323");
for($i=0; $i<count($number_array); $i++) {
$new_url = $orig_string . $number_array[$i];
/* do something with the new url */
}

many many thanks for the hint!!

BTW: this is an absolutely great forum - I love it! Many thanks for the ideas and hints.

@ you both - you are very very supportive. GREAT To have you here!

Have a great season break and merry merry Christmas

greetings

Dilbertone!

Update: one last question: I want to integrate the loop solution with the array into my basic script ....

<?php

$dom = new DOMDocument();
@$dom->loadHTMLFile('http://schulen.bildung-rp.de/gehezu/startseite/einzelanzeige.html?tx_wfqbe_pi1%5buid%5d=60119');
$divElement = $dom->getElementById('wfqbeResults');
$innerHTML= '';
$children = $divElement->childNodes;
foreach ($children as $child) {
$innerHTML = $child->ownerDocument->saveXML( $child );

$doc = new DOMDocument();
$doc->loadHTML($innerHTML);
//$divElementNew = $dom->getElementsByTagName('td');
$divElementNew = $dom->getElementsByTagname('td');

    /*** the array to return ***/
    $out = array();
    foreach ($divElementNew as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

echo '<pre>';
print_r($out);
echo '</pre>';

}

.......like so:

<?php

$orig_string = "http://www.somesite.com?page=";
$number_array = array ("123", "43567", "9287","3323");

for($i=0; $i<count($number_array); $i++) {

$new_url = $orig_string . $number_array[$i];

/* load the page for the current number */
$dom = new DOMDocument();
@$dom->loadHTMLFile($new_url);

$divElement = $dom->getElementById('wfqbeResults');
if ($divElement === null) {
    continue; /* skip pages without the expected container */
}
$innerHTML= '';
$children = $divElement->childNodes;
foreach ($children as $child) {
$innerHTML = $child->ownerDocument->saveXML( $child );

$doc = new DOMDocument();
$doc->loadHTML($innerHTML);
//$divElementNew = $dom->getElementsByTagName('td');
$divElementNew = $dom->getElementsByTagname('td');

    /*** the array to return ***/
    $out = array();
    foreach ($divElementNew as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

echo '<pre>';
print_r($out);
echo '</pre>';

}

} /* end of the for loop over $number_array */

December 25, 2010

good evening dear Community,

Well, first of all: Feliz Navidad - I want to wish you a Merry Christmas!

Today I'm trying to debug a little DOMDocument object in PHP. Ideally it would be nice if I could get DOMDocument to output in an array-like format, to store the data in a database!

My example: head over to the URL -

see the example: the target http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880

I investigated the source code:

I want to filter out the data that is in the following class <div class="floatbox">

See the source code:

<span class="grey"> <span style="font-size:x-small;">></span></span>
<a class="navLink" href="http://dms-schule.bildung.hessen.de/suchen/index.html" title="Suchformulare zum hessischen schulischen Bildungssystem">suche</a>
              </div>
            </div>
          <!-- begin of text -->
           <h3>Siegfried-Pickert Schule</h3>
<div class="floatbox">

See my approach: here is the solution that should return the labels and values in a formatted array ready for input to MySQL!

<?php

$dom = new DOMDocument();
@$dom->loadHTMLFile('http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=8880');
$divElement = $dom->getElementById('floatbox');

$innerHTML= '';
$children = $divElement->childNodes;
foreach ($children as $child) {
$innerHTML = $child->ownerDocument->saveXML( $child );

$doc = new DOMDocument();
$doc->loadHTML($innerHTML);
//$divElementNew = $dom->getElementsByTagName('td');
$divElementNew = $dom->getElementsByTagname('td');

    /*** the array to return ***/
    $out = array();
    foreach ($divElementNew as $item)
    {
        /*** add node value to the out array ***/
        $out[] = $item->nodeValue;
    }

echo '<pre>';
print_r($out);
echo '</pre>';

}

Well, duh: this outputs a lot of garbage. The code spits out a lot of HTML anyway.

What can I do to get a cleaner result?

What is wrong with the idea of using this call:

 $dom->getElementById('floatbox');

Any idea?

Any and all help will be greatly appreciated.

season-greetings

db1

December 25, 2010

Hello dear community, good day!

first of all: Merry Christmas to all of you!!

How can I combine / concatenate a *divided* string in order to use the combined string in a loop where I run the

$dom = new DOMDocument();
@$dom->loadHTMLFile('<- path to the file-> =60119');

and the following numbers? Note: they replace the ending!

60299

64643

62958

63678

60419

60585

60749

60962

and so on.

Question: How can I combine the string (in fact the string is a URL) so that I can build the URLs automatically and run all of that in a loop - e.g. with foreach [probably this is the right way to do that]?
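A minimal sketch of the idea, assuming the fixed part of the URL ends in "=" and each number is appended to it. The base URL below is a placeholder, since the real path is not given here, and the helper name `buildUrls()` is made up for illustration.

```php
<?php
/** Append each number to the base URL and return the resulting list of URLs. */
function buildUrls(string $base, array $numbers): array
{
    $urls = array();
    foreach ($numbers as $number) {
        $urls[] = $base . $number; // the dot operator concatenates strings
    }
    return $urls;
}

// Placeholder base URL; replace it with the real path that ends in "=".
$base = 'http://www.example.com/einzelanzeige.html?uid=';
$urls = buildUrls($base, array('60299', '64643', '62958'));
print_r($urls);
```

Each entry of `$urls` could then be passed to `$dom->loadHTMLFile()` inside a second foreach loop.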

I hope that i was able to explain the question so that you understand it. If i have to be more descriptive - just let me know!

Many many thanks for a hint!

db1

December 24, 2010

hello good evening!

i was able to run the test:

suse-linux:~ # php -r "echo class_exists('DOMDocument') ? 'It exists' : 'It Does NOT exist';"
It existssuse-linux:~ #

Well, I am glad - now I can continue with the work on the parser.

@DJ Kat: I will continue with some tests on the parser script (the one you gave me in another thread - further below).

December 24, 2010

hi there - hello dear DJ Kat

well - i am a bit confused! what to say? ;-)

Actually that what you showed doesn't mean DOMDocument doesn't exist. It means you ran the command wrong. The dollar sign just indicates it's on the command line you shouldn't run that.

hmmm - i want to run the DOM-Document-code you suggested to me: so i am trying my best to get the Linux-box up and running with all that i need to have.

Lemme know if i did something wrong!?

well i try it on the shell:

Question: should i run this:

$ php -r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n";'

or this: on the

-r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n";'

or so:

-r 'echo (class_exists("DOMDocument")) ? "It exists \n" : "It Does NOT exist \n"

hmmm - i am a bit confused...

love to hear from you!

db1


Posts posted by dilbertone

Parserscript with perl or php

PHP Simple HTML DOM Parser - how to get up to speed with this approach?

file_get_contents loop problem

how to fetch a page with a parser [live demo]

Html::tableExtract: how to optimize the CSV-Output?

Parser runs nicely: how to apply some separators of the table

Regex or explode to array: I need some help in a simple string!

Adressbook - how to add more fileds to this one

repetitive use of a function - how to perform with an array in a loop

PHP DOMDocument, finding specific tags in a very easy example [here my approach]

How to combine a string to get a foreach & run in a loop

how to test if the DOMdocument [class] exists?
