PHP HTML DOM Parser : parse, exclude and export


aviju

Hello! I'm new on this board and I need your help!

Here is my problem. I'd like to collect the content of around 1000 URLs into a text file (I use wget in a bash script), and then parse this text file to extract one type of content into a CSV file.

 

1) My bash script is this one:

 



file=/home/julien/tests/file.txt
# Read one URL per line and append each downloaded page to songs_t.txt
while IFS= read -r url
do
    wget -O - "$url" >> songs_t.txt
done < "$file"


It works perfectly and the text file songs_t.txt is created correctly. It contains the content of the 1000 URLs.
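For comparison, here is a rough sketch of a variant of the loop that keeps each page in its own file rather than appending everything to songs_t.txt, which might make it easier to see which page the parser has trouble with. The function names fetch_url and fetch_all, the pages directory and the page_N.html naming are assumptions for illustration, not part of my current script.

```shell
#!/usr/bin/env bash
# Sketch only: save each URL to its own numbered file.
# fetch_url, fetch_all and the page_N.html naming are hypothetical.

fetch_url() {
    # Download one URL ($1) to the given output path ($2).
    wget -q -O "$2" "$1"
}

fetch_all() {
    local url_file="$1"
    local out_dir="$2"
    local n=0
    local url
    mkdir -p "$out_dir"
    # Read the list line by line; also handle a last line without newline.
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue      # skip blank lines
        n=$((n + 1))
        fetch_url "$url" "$out_dir/page_$n.html" || echo "failed: $url" >&2
    done < "$url_file"
    echo "$n"                          # number of URLs processed
}

# Example call (output directory name is an assumption):
# fetch_all /home/julien/tests/file.txt pages
```

Each page_N.html could then be parsed separately, and the results appended to the CSV file one page at a time.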

 

2) Then I wrote a PHP script to parse songs_t.txt. I only want to get the concert setlists (the setlist is only one part of the content of each URL). So my approach is to remove tags such as 'a', 'h4', 'title' and so on, and save the rest in a CSV file called 'SONGS.csv'. An example URL can be seen here: http://members.tripod.com/~fun_fun_fun/8-17-63.html

The part of my PHP script that does the parsing is this one:

 



// file_get_html() comes from the PHP Simple HTML DOM Parser library
$html = file_get_html('songs_t.txt');

// Remove the unwanted elements by blanking their outer HTML
foreach ($html->find('title, script, div, center, style, img, noscript, h4, a') as $es) {
    $es->outertext = '';
}

// Save what is left to the CSV file
$f = fopen('SONGS.csv', 'w');
fwrite($f, $html);
fclose($f);


 

The script works for the first 35 URLs (I get almost only the setlists), but as soon as it has to deal with more than 35 URLs, I get the following error message:

 



Call to a member function find() on a non-object in /home/julien/tests/boys2.php on line 23.


Line 23 corresponds to:

 



foreach ($html->find('title, script, div, center, style, img, noscript, h4, a') as $es)


 

3) To test whether my HTML object is valid, I use this code:

 



$html = file_get_html('songs_t.txt');
// file_get_html() did not return a parser object
if (!is_object($html)) {
    echo "invalid object";
}


And the result is "invalid object". This test was run on a text file made of the content of 50 URLs. But when I run the same test on a text file made of 30 URLs, I get no error! So how can I parse my HTML even if it's not an entirely valid object?
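Since the only difference between the two tests is how many pages are concatenated, one thing that may be worth checking is the size of each text file. Below is a minimal sketch; the helper names, the file name in the usage comment and the 600000-byte threshold are assumptions for illustration (some versions of the parser library appear to refuse input above a fixed size, but that has not been confirmed here).

```shell
#!/usr/bin/env bash
# Sketch only: compare a file's size against a threshold.
# file_size and is_larger_than are hypothetical helper names.

file_size() {
    # Print the size of file $1 in bytes (strip padding some wc versions add).
    wc -c < "$1" | tr -d ' '
}

is_larger_than() {
    # Succeed when file $1 is larger than $2 bytes.
    [ "$(file_size "$1")" -gt "$2" ]
}

# Example use (the threshold value is an assumption):
# is_larger_than songs_t.txt 600000 && echo "songs_t.txt is over the threshold"
```

If the 50-URL file is over such a limit and the 30-URL file is under it, that would point to file size rather than invalid HTML.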

 

 

Could you help me, please?

Thanks!
