
PHP HTML DOM Parser : parse, exclude and export

parse html csv

No replies to this topic

#1 aviju


Posted 07 April 2014 - 07:34 AM

Hello! I'm new on this board and I need your help.
 
Let me explain my problem. I'd like to collect the content of around 1000 URLs in a text file (I use wget in a bash script), and then parse this text file to extract one type of content into a CSV file.
 
1) My bash script is this one:
 
file=/home/julien/tests/file.txt
# Read one URL per line and append each downloaded page to songs_t.txt
while IFS= read -r url
do
    wget "$url" -O - >> songs_t.txt
done < "$file"
It works perfectly and the text file songs_t.txt is created correctly. It contains the content of the 1000 URLs.
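A variation on the loop above, in case it is useful: saving each page to its own numbered file instead of appending everything to songs_t.txt would let each page be parsed on its own. This is only a sketch; the pages directory, the page_name and fetch_all helpers and the file naming are made up for illustration.

```shell
#!/bin/bash
# Sketch: one output file per URL instead of one big songs_t.txt.
# The "pages" directory and both helper names are hypothetical.

# Build a zero-padded file name for the n-th URL, e.g. 7 -> pages/page_0007.html
page_name() {
    printf 'pages/page_%04d.html' "$1"
}

# Download every URL listed in the file given as $1 (one URL per line).
fetch_all() {
    local n=0
    local url
    mkdir -p pages
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue      # skip empty lines
        n=$((n + 1))
        wget -q "$url" -O "$(page_name "$n")"
    done < "$1"
}

# Usage: fetch_all /home/julien/tests/file.txt
```

With this layout the PHP side would loop over the files in pages/ and parse them one at a time.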
 
2) Then I wrote a PHP script to parse songs_t.txt. I only want to get the concert setlists (the setlist is only a part of the content of each URL). So my approach is to remove tags such as 'a', 'h4', 'title' and so on, and save the rest in a CSV file called 'SONGS.csv'. An example of a URL can be seen here: http://members.tripo...un/8-17-63.html
 
The part of my PHP script that deals with the parsing is this one:
 
// Load the downloaded pages (file_get_html() comes from the HTML DOM parser library)
$html = file_get_html('songs_t.txt');

// Blank out every element that is not part of the setlist
foreach ($html->find('title, script, div, center, style, img, noscript, h4, a') as $es) {
    $es->outertext = '';
}

// Save what is left to the CSV file
$f = fopen('SONGS.csv', 'w');
fwrite($f, $html);
fclose($f);
 
The script works for the first 35 URLs (I get almost only the setlists), but as soon as it has to deal with more than 35 URLs, I get the following error message:
 
Call to a member function find() on a non-object in /home/julien/tests/boys2.php on line 23.
That line 23 corresponds to :
 
foreach ($html->find('title, script, div, center, style, img, noscript, h4, a') as $es)
 
3) To test whether my HTML object is valid, I use this code:
 
$html = file_get_html('songs_t.txt');
if (!is_object($html)) {
    echo "invalid object";
}
And the result is "invalid object". This test was run on a text file made up of the content of 50 URLs. But when I run the same test on a text file made up of 30 URLs, I get no error! So how can I parse my HTML even if it is not an entirely valid object?
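Since the test passes with 30 URLs and fails with 50, one thing that might be worth checking is simply the size of the text file. A minimal sketch of that check follows. The 600000-byte default is an assumption on my side (I believe it matches the MAX_FILE_SIZE constant in simple_html_dom.php, but I have not verified it for my version), and exceeds_limit is a made-up helper name.

```shell
#!/bin/bash
# Sketch: report whether a file is larger than a given number of bytes.
# The default of 600000 is an assumption, not a verified library limit.

exceeds_limit() {
    local file="$1"
    local limit="${2:-600000}"
    local size
    size=$(( $(wc -c < "$file") ))   # size in bytes, whitespace stripped
    [ "$size" -gt "$limit" ]
}

# Usage:
# if exceeds_limit songs_t.txt; then
#     echo "songs_t.txt is over the limit"
# fi
```

If the file with 50 URLs is over the limit and the one with 30 URLs is under it, the size would be a likely suspect rather than the HTML itself.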
 
 
Could you help me, please?
Thanks!




