Parse HTML Save to XML


codeinphp


I am attempting to parse an HTML page and save the results to XML files. The purpose of the script is to create a TV program guide. The script first reads each 'img' tag; if the image matches the criteria, it then gets the program name, program time, and description. All of that works, but every XML file ends up containing every channel's programs instead of only its own. For example, the script reads the HTML, finds the first channel (say A&E), then parses the program info for all the channels in the HTML and saves it to that file. I end up with 25 XML files, each named after a different channel, but each containing the program info for all channels. I suspect something is wrong in the loop, but I can't locate it. Any help is appreciated. The code below leaves out the cURL call that fetches $html, since it isn't relevant to the problem.

<?php
#CREATE DOM PARSER
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$dom->formatOutput = true;
$dom->preserveWhiteSpace = true;

$images = $dom->getElementsByTagName('img');
$childprogram = $xpath->query('//span[@class="prog_name"]');
$childtime = $xpath->query('//div[@class="prog_time"]');
$childdescrip = $xpath->query('//div[@class="prog_desc"]');

foreach ($images as $img) {
    $xml = new DOMDocument("1.0");
    $root = $xml->createElement("programme");
    $book = $xml->createElement("tvprogram");
    $icon = $img->getAttribute('src');
    if (preg_match('/\.(jpg|jpeg|gif)(?:[\?\#].*)?$/i', $icon)) { //only matching types
        $channel = $img->getAttribute('alt');

        foreach ($childprogram as $programname) {
            foreach ($childtime as $programtime) {
                foreach ($childdescrip as $descrip) {
                    $xml->appendChild($root);
                    $title = $xml->createElement("Channel"); //CHANNEL NAME
                    $showname = $xml->createElement("programname"); //PROGRAM NAME
                    $showtime = $xml->createElement("programtime"); //PROGRAM TIME
                    $descriptime = $xml->createElement("description"); //PROGRAM DESCRIPTION

                    $titleText = $xml->createTextNode($channel);
                    $shownameText = $xml->createTextNode($programname->nodeValue);
                    $showtimeText = $xml->createTextNode($programtime->nodeValue);
                    $showdescripText = $xml->createTextNode($descrip->nodeValue);

                    $title->appendChild($titleText);
                    $showname->appendChild($shownameText);
                    $showtime->appendChild($showtimeText);
                    $descriptime->appendChild($showdescripText);

                    $book->appendChild($title);
                    $book->appendChild($showname);
                    $book->appendChild($showtime);
                    $book->appendChild($descriptime);
                }
            }
        }
        $root->appendChild($book);
    } //END OF LOOP

    $xml->formatOutput = true;

    $xml->save(dirname(__FILE__)."/streamguideXML/".$channel.".xml") or die("Error");

} //END OF FUNCTION
?>

At the very end of the script you have this:

$xml->save(dirname(__FILE__)."/streamguideXML/".$channel.".xml") or die("Error");


} //END OF FUNCTION

However, there is no function in your script. That final brace closes the first foreach loop, so yes, you are creating a file on each iteration of that loop.


Thank you, that //END OF FUNCTION comment was left over from something else. I removed it. I can get the script to create a new XML file for each channel name (a&e.xml, abc.xml, cbs.xml, and so forth), but each file has program info for all channels, not just the specific one. For example, I run the script and all the XML files are created. If I open a&e.xml, I not only have programs for A&E, but also for ABC, CBS, etc. It's not limiting each saved file to its specific channel.

Thanks again


I would build an associative array keyed by channel inside the loop, then loop over the unique channels afterwards and save each one as XML.

Nothing in the current code distinguishes one channel from the others inside the loop. Each query matches every program on the page, so every file gets everything.

It would be easier for us to help with an example of the HTML.

Something along the lines of this:

//before loop
$array = array();

//inside loop
$array[$channel][] = array(
    "name" => $programname->nodeValue,
    "time" => $programtime->nodeValue,
    "description" => $descrip->nodeValue
);

//later on, outside the loop
foreach ($array as $key => $value) {
    //make your xml and save
}
Edited by QuickOldCar
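
A minimal sketch of the second half of that idea, assuming the grouped array has already been filled as in the snippet above. The helper name buildChannelXml and the sample data (including the placeholder "Channel B" entry) are made up for illustration; the element names follow the original script.

```php
<?php
// Hypothetical helper: builds one XML document holding only one channel's programs.
function buildChannelXml(string $channel, array $programs): DOMDocument
{
    $xml = new DOMDocument('1.0', 'UTF-8');
    $xml->formatOutput = true;

    $root = $xml->createElement('programme');
    $xml->appendChild($root);

    foreach ($programs as $program) {
        // One <tvprogram> element per program, not one shared element.
        $entry = $xml->createElement('tvprogram');
        $root->appendChild($entry);

        $fields = [
            'Channel'     => $channel,
            'programname' => $program['name'],
            'programtime' => $program['time'],
            'description' => $program['description'],
        ];
        foreach ($fields as $tag => $text) {
            $node = $xml->createElement($tag);
            // createTextNode escapes characters such as & for us.
            $node->appendChild($xml->createTextNode($text));
            $entry->appendChild($node);
        }
    }

    return $xml;
}

// Sample data in the shape produced by $array[$channel][] = array(...).
$array = [
    'A&E' => [
        ['name' => 'Parking Wars', 'time' => 'May 27, 2015, 7:00 am - 8:00 am', 'description' => 'First description.'],
        ['name' => 'Dog the Bounty Hunter', 'time' => 'May 27, 2015, 8:00 am - 10:00 am', 'description' => 'Second description.'],
    ],
    // Placeholder channel, not taken from the real page.
    'Channel B' => [
        ['name' => 'Placeholder Show', 'time' => 'May 27, 2015, 7:00 am - 7:30 am', 'description' => 'Placeholder.'],
    ],
];

$documents = [];
foreach ($array as $channel => $programs) {
    $documents[$channel] = buildChannelXml($channel, $programs);
    // To write files as in the original script:
    // $documents[$channel]->save(dirname(__FILE__) . '/streamguideXML/' . $channel . '.xml');
}
```

Because each document is built from a single key of the array, a channel's file can only ever contain that channel's entries.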

Here's an example of the HTML. This is what should be parsed for A&E.xml:

<div class="row">
    <div class="col th">
        <a class="channel_sched_link" href="javascript:void(0)" title="View A&E full schedule" data-channelid="9">
            <img src="http://static.ilive.to/images/tv/AE.JPG" width="30" height="20" alt="A&E" />A&E
        </a>
    </div>
    <div class="prog_cols">
        <div class="col ts ts_1 prog_907477  ps_0" data-catid="" >
            <span class="prog_name">Parking Wars</span>
            <div class="prog_time">May 27, 2015, 7:00 am - 8:00 am</div>
            <a class="btn_watchlist " href="javascript:void(0)" data-progid="907477">(+) add to watchlist</a>
            <div class="prog_desc">
                An angry mother and daughter confront a booter in Detroit; and an irate Philadelphia citizen says he got a ticket while trying to help his physically disabled son.<br/>
                <a class="watchnow" href="http://www.streamlive.to/channels/?q=A%26E">Watch Now</a>
            </div>
        </div>
        <div class="col ts ts_3 prog_907478  ps_1" data-catid="" >
            <span class="prog_name">Dog the Bounty Hunter</span>
            <div class="prog_time">May 27, 2015, 8:00 am - 10:00 am</div>
            <a class="btn_watchlist " href="javascript:void(0)" data-progid="907478">(+) add to watchlist</a>
            <div class="prog_desc">
                Dog pursues two fugitives whose drug problems have hurt their families.<br/>
                <a class="watchnow" href="http://www.streamlive.to/channels/?q=A%26E">Watch Now</a>
            </div>
        </div>
    </div>
    <a class="watchnow" href="http://www.streamlive.to/channels/?q=A%26E">Watch Now</a>
</div>
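
With markup like this, one way to keep channels apart is to run the program queries relative to each channel's row instead of against the whole document. A minimal sketch, assuming every channel sits in its own div with class="row" as in the sample; the function name parseGuide and the second channel in the test markup are made up for illustration, and the ampersands are written as &amp;amp; to keep the test markup valid.

```php
<?php
// Hypothetical helper: returns [channel => [[name, time, description], ...]].
function parseGuide(string $html): array
{
    $previous = libxml_use_internal_errors(true); // scraped HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $guide = [];

    foreach ($xpath->query('//div[@class="row"]') as $row) {
        // The second argument makes the query relative to this row only.
        $channel = $xpath->evaluate('string(.//img/@alt)', $row);
        if ($channel === '') {
            continue; // row without a channel image
        }

        $slots = $xpath->query('.//div[@class="prog_cols"]/div[span[@class="prog_name"]]', $row);
        foreach ($slots as $slot) {
            $guide[$channel][] = [
                'name' => trim($xpath->evaluate('string(span[@class="prog_name"])', $slot)),
                'time' => trim($xpath->evaluate('string(div[@class="prog_time"])', $slot)),
                // First non-empty text node only, which skips the "Watch Now" link text.
                'description' => trim($xpath->evaluate(
                    'string(div[@class="prog_desc"]/text()[normalize-space()][1])',
                    $slot
                )),
            ];
        }
    }

    return $guide;
}

// Reduced test markup: the A&E row from the post plus one invented row.
$html = <<<'HTML'
<div class="row">
  <div class="col th"><a class="channel_sched_link"><img src="http://static.ilive.to/images/tv/AE.JPG" alt="A&amp;E" />A&amp;E</a></div>
  <div class="prog_cols">
    <div class="col ts ts_1">
      <span class="prog_name">Parking Wars</span>
      <div class="prog_time">May 27, 2015, 7:00 am - 8:00 am</div>
      <div class="prog_desc">An angry mother and daughter confront a booter in Detroit.<br/><a class="watchnow" href="#">Watch Now</a></div>
    </div>
    <div class="col ts ts_3">
      <span class="prog_name">Dog the Bounty Hunter</span>
      <div class="prog_time">May 27, 2015, 8:00 am - 10:00 am</div>
      <div class="prog_desc">Dog pursues two fugitives.<br/><a class="watchnow" href="#">Watch Now</a></div>
    </div>
  </div>
</div>
<div class="row">
  <div class="col th"><a class="channel_sched_link"><img src="placeholder.jpg" alt="Channel B" />Channel B</a></div>
  <div class="prog_cols">
    <div class="col ts ts_1">
      <span class="prog_name">Placeholder Show</span>
      <div class="prog_time">May 27, 2015, 7:00 am - 7:30 am</div>
      <div class="prog_desc">Placeholder description.<br/><a class="watchnow" href="#">Watch Now</a></div>
    </div>
  </div>
</div>
HTML;

$guide = parseGuide($html);
```

Each program's name, time, and description are read from the same slot element, so they stay paired, and the result has the same channel-keyed shape suggested earlier in the thread.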

Having spent way too much time parsing and creating XML, I have to ask why XML would be anyone's first choice. I prefer to just serialize the data. If the goal is to store data so you can use it later, then not creating XML means not parsing it later. Serialization retains the data types, which is handy. If you don't need to retain data types, then json encoding is a good option.
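
A short sketch of both options, assuming the guide data is already grouped by channel in an array; the variable names and sample data are illustrative only.

```php
<?php
// Sample data grouped by channel (illustrative).
$guide = [
    'A&E' => [
        ['name' => 'Parking Wars', 'time' => 'May 27, 2015, 7:00 am - 8:00 am'],
        ['name' => 'Dog the Bounty Hunter', 'time' => 'May 27, 2015, 8:00 am - 10:00 am'],
    ],
];

// Option 1: JSON. Readable and portable; decode with true to get arrays back.
$json = json_encode($guide, JSON_PRETTY_PRINT);
$fromJson = json_decode($json, true);

// Option 2: PHP serialization. Only unserialize data you wrote yourself;
// allowed_classes => false refuses to instantiate objects.
$blob = serialize($guide);
$fromBlob = unserialize($blob, ['allowed_classes' => false]);

// Either string could be stored with file_put_contents() and read back later.
```

Either way, reading the data back is a single function call rather than a second round of XML parsing.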


Well, I am using the XML later. My PHP script will run every hour or so to update the programming information. Each time it runs, it overwrites the existing XML files. So when I go to a particular channel, it displays the info from the XML for that channel. Either way, I have to group each channel's programs and times together, and that's where I am not successful.

