Help needed for data scraping using cURL and DOMDocument


jk2010

Hi guys,

 

I'm learning website querying and data extraction, and I have built a script that worked perfectly on one site. I'm now testing it on another site, but I'm running into some problems.

 

The script goes to a category >> finds all the products listed on that page >> follows the links to the products >> extracts data for each product (title, price, description, etc.) >> puts the results in an array >> inserts the results into the database.
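
The flow above could be sketched roughly as follows. This is only a minimal illustration, not the actual script: the helper names `loadXPath` and `extractItemLinks` and the sample category markup are hypothetical, and the HTML is parsed from a string instead of being fetched with cURL so that the sketch runs offline.

```php
<?php
// Hypothetical helper: parse an HTML string into a DOMXPath object.
function loadXPath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    return new DOMXPath($doc);
}

// Hypothetical helper: collect product links from a category page.
// The "item-details.asp" filter is taken from the URLs in the log output.
function extractItemLinks(DOMXPath $xpath): array
{
    $links = [];
    foreach ($xpath->query('//a[contains(@href, "item-details.asp")]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return array_values(array_unique($links));
}

// Sample category markup (made up for illustration).
$categoryHtml = '<html><body>
  <a href="/applications/SearchTools/item-details.asp?EdpNo=336830&amp;CatId=0">Scanner</a>
  <a href="/applications/SearchTools/item-details.asp?EdpNo=259445&amp;CatId=0">Printer</a>
  <a href="/about.asp">About</a>
</body></html>';

$results = [];
foreach (extractItemLinks(loadXPath($categoryHtml)) as $link) {
    // In the real script each $link would be fetched with cURL here,
    // parsed, and the extracted fields stored before the database insert.
    $results[] = ['url' => $link];
}
print_r($results);
```

Keeping the link extraction separate from the per-product extraction makes it easier to see which stage stops producing data.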

 

ISSUES:

 

1. Unable to follow the link to page 2: the script stops after requesting the product links on page 1, although it does get all the products from that starting page.

2. Only 1 product is inserted into the database; nothing is inserted for the others.

3. Most of the fields are missing from the results.

 

 

POSSIBLE ERRORS

 

Could it be the XPath locations of the elements?

An error in the script?

But I know the script works, because I tested it on another site.
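
One way to check whether the XPath locations are the problem is to count what each expression matches on the new site. A minimal sketch under assumptions: the helper name `countMatches` and the sample markup are hypothetical and not part of the original script.

```php
<?php
// Hypothetical helper: report how many nodes each XPath expression matches.
function countMatches(DOMXPath $xpath, array $expressions): array
{
    $counts = [];
    foreach ($expressions as $field => $expr) {
        $nodes = @$xpath->query($expr);
        // query() returns false for a malformed expression; flag that as -1.
        $counts[$field] = ($nodes === false) ? -1 : $nodes->length;
    }
    return $counts;
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Made-up sample page containing a title and a price, but no VAT price.
$doc->loadHTML('<html><body><table class="maintbl"><tr><td>
  <h1>Canon CS5600F Film Scanner</h1>
  <div class="details-pricebold">97.99 ex VAT</div>
</td></tr></table></body></html>');
$xpath = new DOMXPath($doc);

$counts = countMatches($xpath, [
    'title' => '//table[@class="maintbl"]/descendant::h1',
    'price' => '//table[@class="maintbl"]/descendant::div[@class="details-pricebold"]',
    'vat'   => '//div[@class="details-subprice"]',
]);
foreach ($counts as $field => $n) {
    echo $field, ': ', $n, " match(es)\n";
}
```

Any field that reports 0 matches on a real product page is a candidate for a wrong or site-specific expression.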

 

Please help; I will be grateful, since this has been driving me crazy for about 3 days. Thanks in advance.

 

RESULT OF MY PAGE IN WEB BROWSER

Requesting page: http://www.misco.co.uk/applications/category/category_slc.asp?Nav=|c:727|&Sort=0&Recs=30&page=1
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=336830&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=259445&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=288289&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=311689&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=290889&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=11290&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=236875&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=295541&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=291735&CatId=0
|__ Requesting item: http://www.misco.co.uk/applications/SearchTools/item-details.asp?EdpNo=236273&CatId=0
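
Since the log stops after page 1, one thing worth verifying is how the URL for the next page is built. A minimal sketch, assuming the site paginates through the `page` query parameter seen in the category URL above; `nextPageUrl` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the URL of the following results page by
// incrementing the "page" query parameter, or null if it is absent.
function nextPageUrl(string $url): ?string
{
    $count = 0;
    $next = preg_replace_callback(
        '/([?&]page=)(\d+)/',
        function (array $m): string {
            return $m[1] . ((int) $m[2] + 1);
        },
        $url,
        1,
        $count
    );
    return $count === 1 ? $next : null;
}

$url = 'http://www.misco.co.uk/applications/category/category_slc.asp?Nav=|c:727|&Sort=0&Recs=30&page=1';
echo nextPageUrl($url), "\n";
```

A loop over pages would also need a stop condition, for example ending when a page yields no item links.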

 

 

THIS IS WHERE I GET ELEMENTS

			// get product details
			$prod_categs              = $xpath->query('//table[@class="maintbl"]/descendant::a[@class="idBreadCrumb mainframe"]');
			$prod_title_node          = $xpath->query('//table[@class="maintbl"]/descendant::h1');
			$prod_desc_node           = $xpath->query('//div[@class="i2htmlcontent"]/span/div[@class="normal"]');
			$prod_price_node          = $xpath->query('//table[@class="maintbl"]/descendant::div[@class="details-pricebold"]')->item(0);
			$prod_price_as_image_node = $xpath->query('//div[@class="s"]/span[@class="lprice"]/img[@src]')->item(0);
			// the sub-price <div> has no child <div> in the page source below, so select the div itself
			$prod_vat_price_node      = $xpath->query('//div[@class="details-subprice"]')->item(0);
			// a <td> is never a direct child of a <div>, so match the cell directly
			$prod_quicklinx_node      = $xpath->query('//table[@class="maintbl"]/descendant::td[@class="textblacksmblue"]');
			// "path and not(...)" is a boolean expression in XPath 1.0, not a node selection;
			// take the value cell that follows the "Manufacturer:" label instead
			$prod_mfr_node            = $xpath->query('//table[@class="maintbl"]/descendant::td[@class="textblacksm"][normalize-space(.)="Manufacturer:"]/following-sibling::td[1]');
			$prod_images_nodes        = $xpath->query('//div[@id="imagegalleryscroller"]/a/img');
			$prod_features_nodes      = $xpath->query('//html/body/table[1]/tr/td/table[2]/tr[1]/td[3]/div[2]/table/tr/td[2]/div[1]/span/div[4]/ul');
			$prod_spec_nodes          = $xpath->query('//div[@class="normal"]/table[@class="table5pxPadLightGreyBottomBorder"]/*');
			$accesses_nodes           = $xpath->query('//div[@id="accessoriessection"]/descendant::div/table/descendant::tr/td[@class="la"]');
			$resource_nodes           = $xpath->query('//div[@id="resourcessection"]/descendant::ul/li/a[@href]');
			// the main image is wrapped in an <a> inside td.prodimg, so search descendants
			$main_image_node          = $xpath->query('//table[@class="maintbl"]/descendant::td[@class="prodimg"]/descendant::img[contains(@src, "uploadedimages")]')->item(0);
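
Label/value cells such as the "Misco No" and "Manufacturer" rows can be selected by matching the label text and stepping to the following sibling cell. A minimal sketch run against a fragment of the product page markup from this post; the variable names are assumptions, and the `maintbl` wrapper is left out because it does not appear in the posted excerpt.

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Fragment based on the product page source in this post.
$doc->loadHTML('<html><body><table>
  <tr><td class="textblacksm" nowrap>Misco No: </td>
      <td class="textblacksmblue"><b>Q151273</b></td></tr>
  <tr><td class="textblacksm" width=110>Manufacturer:</td>
      <td class="textblacksm" nowrap> <strong>Canon </strong> </td></tr>
  <tr><td class="textblacksm" width=110>Manufacturer Part No:</td>
      <td class="textblacksm" nowrap> <strong>2925B008AA </strong> </td></tr>
</table></body></html>');
$xpath = new DOMXPath($doc);

// Match the label cell exactly, then take the cell right after it.
$mfrNode = $xpath->query(
    '//td[@class="textblacksm"][normalize-space(.)="Manufacturer:"]/following-sibling::td[1]'
)->item(0);
// The product number cell has its own class, so it can be matched directly.
$quicklinxNode = $xpath->query('//td[@class="textblacksmblue"]')->item(0);

// item(0) returns null when nothing matched, so guard before reading text.
$manufacturer = $mfrNode ? trim($mfrNode->textContent) : null;
$quicklinx    = $quicklinxNode ? trim($quicklinxNode->textContent) : null;

var_dump($manufacturer, $quicklinx);
```

Matching on the label text is less brittle than a positional path, but it does depend on the label wording staying the same.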

 

THIS IS THE SOURCE CODE OF PRODUCT DETAILS PAGE

 

<table width="100%" cellpadding="0" cellspacing="0" border="0">
                    <tr> 
                        <td nowrap align="left">

                            
                                <a href="javascript:MM_openBrWindow('http://img.misco.co.uk/images/uploadedimages/large/20091027141548.jpg','LargeImage','scrollbars=no,width=350,height=350')" class="details"> 
                                <img src="/images/itemdetails/icon-enlarge.gif" border="0" alt="" align="absmiddle" hspace="0"></a>
                                <!--<img src="http://img3.misco.co.uk/images/misc/pixel-clr.gif" width="2" height="1" alt="">-->
                            
                                <a href="/applications/email/emailafriend.asp"> <img src="/images/itemdetails/icon-email.gif"  border="0" alt="" align="absmiddle" hspace="6"></a><a href='http://www.misco.co.uk/applications/SearchTools/item-details-print.asp?EdpNo=336830&Sku=Q151273'><img src="/images/itemdetails/icon-print.gif" border="0" align="absmiddle" hspace="4" alt=""></a> 
                            
                        </td>
                        <td width="44" align="right" valign="top"><img src="http://img1.misco.co.uk/images/itemdetails/itemtitle_yellowleft.gif" width="44" height="24" alt=""></td>
                        <td width="340" valign="middle" style="background-image: url('http://img.misco.co.uk/images/itemdetails/itemtitle_yellow_bg.gif'); background-repeat: repeat-x;" class="textblackmed">
                            <table width="100%" height="18" border="0" cellpadding="0" cellspacing="0">
                            <tr valign="top"> 
                                <td width="35" class="textblacksm" nowrap>Misco No: </td>

                                
                                    <td width="40%" class="textblacksmblue"><b>Q151273</b></td>
                                
                                <td align="right" nowrap>
                                   <table border=0 cellspacing=0 cellpadding=3>
                                   <TR Valign=top>                                   
                                     <TD><div style="position: relative; top:-3px;"><a href="javascript:void(0);" onclick="postReview();" alt="Add Review"  style="font-size: 12px;">
						                 <img src="/images/itemdetails/ADD_REVI.GIF" alt="Add Review"  border=0  />
						                 </a></div></TD></tr></table></td></tr>
                                <!--</td>
                            </tr>-->
                            </table>

                        </td>
                        <td width="18" align="right" valign="top"><img src="http://img3.misco.co.uk/images/itemdetails/itemtitle_yellowright1.gif" width="18" height="24" alt=""></td>
                    </tr>
                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left" >
                    </td>
                    <td></td>                        
                    </tr>

                    <tr> 
                    <td></td>
                    <td></td>
                    <td align="left" >
                    <table>
                    <tr><td class="textblacksm" width=110>Manufacturer:</td><td class="textblacksm" nowrap> <strong>Canon </strong> </td></tr>
                    <tr><td class="textblacksm" width=110>Manufacturer Part No:</td><td class="textblacksm" nowrap> <strong>2925B008AA </strong> </td>

                    </tr>
                    </table>
                    </td>
                    <td></td>                        
                    </tr>
      
                    </table>    
                               </td>       
            </tr>             

      <form method="post" name="itemdets" style="margin:0" action="/cgi-bin/order.asp" onSubmit="return validateForm(document.itemdets)">
        
        <tr> 
          <td>  
            <input type="hidden" name="CatCode" id="CatCode" value="" />

          </td>
        </tr>
      </table>

<table width="100%" border="0" cellspacing="0" cellpadding="0"><input type="hidden" name="EdpNo" value="336830"><input type="hidden" name="LineGroup" value="0"><input type="hidden" name="AssocEdpNo" value="0"><input type="hidden" name="clicksource" value="ITD"><tr>
<td width="250" valign="top" class="cartfont" align="left">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tr>
<td class="prodimg"><a href="javascript:MM_openBrWindow('http://img.misco.co.uk/images/uploadedimages/large/20091027141548.jpg','LargeImage','scrollbars=no,width=350,height=350')"><img src="http://img.misco.co.uk/images/uploadedimages/med/20091027141548.jpg" border="0" title="Canon CS5600F Film Scanner" alt="Canon CS5600F Film Scanner" onError="this.src='http://img1.misco.co.uk/images/SearchTools/no_image.jpg'"></a></td>
</tr>
</table>
<table width="200" border="0" cellspacing="0" cellpadding="0">
<tr>

<td align="center" valign="middle"><a href="/applications/factfinder/search.asp?manufacturer=400"><img src="http://img.systemaxdev.com/manflogos/CAN.jpg" border="0" alt="View All Products by Canon"></a></td>
</tr>
<tr>
<td align="center" valign="middle">
<div id="ccslogos"></div><script src="http://logo.cnetcontentsolutions.com/hook/?h=2a79ea27&mf=Canon&pn=2925B008AA&locale=en&style=1&layout=1x1&locationId=ccslogos" type="text/javascript"></script></td>
</tr>
<tr>
<td><script src="http://content.webcollage.net/miscouk/smart-button"></script></td>
</tr>
</table>
</td>
<td>
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="details">
<tr width="400">
<td align="left" width="100%" colspan="1" nowrap="">

<table width="100%" cellpadding="0" cellspacing="0" border="0"><br><tr>
<tr class="details-pricebold">
<td valign="top" align="left" width="60"><strong nowrap="">Price:   
			</strong></td>
<td><strong nowrap=""><div align="left" nowrap="" class="details-pricebold">£97.99 ex VAT   
				</div>
<div align="left" nowrap="" class="details-subprice">£115.14 inc VAT
					         

				</div></strong></td>
</tr>
</tr><br></table>
</td>
<td></td>
</tr>
<tr>
<td colspan="2" align="left" style="padding-top: 5px;"><strong>Availability:</strong><span class="availab">
				 <font color="#157908"><b>In stock</b></font></span></td>

</tr>
<tr>
<td colspan="2"><img src='http://img3.misco.co.uk/imagesmisc/pixel-clr.gif' width='1' height='5' border='0'><br/></td>
</tr>
<tr>
<td valign="top">
<table>
<tr>
<td align="left" valign="middle" class="textblackmed" colspan="2" nowrap=""><strong>Order Qty:</strong> <input name="Qty" type="text" id="mainQty" value="1" size="4" maxlength="3" onChange="SetWarrantyQty();" style="border: solid 1px #7f9db9"><img src="/images/spacer.gif" width="18px" height="1" alt=""><a href="javascript:void(0);" onclick="VisBasket("EdpNo=336830&QTY=" + document.getElementById("mainQty").value ,"336830" , document.getElementById("mainQty").value);"><img src="http://img1.misco.co.uk/images/main/buynow_lg.gif" align="absmiddle" style="border:none" alt="Buy Now" vspace="4" hspace="4"></a><br><script type="text/javascript">
				rvo_retailer_ref='msc';
				rvo_manufacturer='Canon';
				rvo_model='2925B008AA';
				rvo_format='image_horizontal310x70';
			</script><div id="VisBasketItem336830" name="VisBasketItem336830" CurrQTY="0"></div>
</td>
</tr>
</table>
</td>
<td valign="top"></td>

</tr>
<Tr>
<td colspan="2"></td>
</Tr>
<tr align="right">
<td colspan="2"></td>
</tr>
<Tr>
<td colspan="2">
<tr>
<td colspan="2"></td>
</tr>
<tr>
<td colspan="2"><A href="javascript:void window.open('/applications/SearchTools/AddCartfromGallery.asp?EdpNo=336830&Sku=Q151273&imgCounter=0&WhichImage=0&MfrId=400&MfrName=Canon&MPN=2925B008AA&imageHost=http://img1.misco.co.uk/images&RefurbOpenBox=&Price=97.99&RebateAmt=-1&TaxCode=2&DESC=Canon CS5600F Film Scanner&TaxCode=2&PriceInc=115.1383','_blank','toolbar=no, location=no, directories=no, status=no, menubar=no, scrollbars=yes, resizable=yes, width=550, height=800')"> Larger image
        </A></td>
</tr>

</td>
</Tr>
</table>
</td>
</tr>
</table>

</form>

  
      <img src="http://img1.misco.co.uk/images/misc/pixel-clr.gif" width="20" height="10" alt=""> 
      <!-- Tabs -->
      <a name="ProReview"></a>
      <table width="100%" border="0" cellpadding="0" cellspacing="0">
        <tr> 
            <td width="10" height="32"><img src="http://img.misco.co.uk/images/misc/pixel-clr.gif" width="10" height="32" alt=""></td>

                    <td background="http://img2.misco.co.uk/images/itemdetails/tab-bkgd.gif" align="left">
         <img src="http://img1.misco.co.uk/images/itemdetails/0_3.gif" height="41" border="0"><a href="item-details.asp?EdpNo=336830&Tab=1&NoMapp=0" onMouseOut="MM_swapImgRestore()" onMouseOver="MM_swapImage('Image1','','http://img2.misco.co.uk/images/itemdetails/1_2.gif',1)"><img src="http://img2.misco.co.uk/images/itemdetails/1_1.gif" name="Image1" height="41" border="0"></a><a href="item-details.asp?EdpNo=336830&Tab=2&NoMapp=0" onMouseOut="MM_swapImgRestore()" onMouseOver="MM_swapImage('Image2','','http://img2.misco.co.uk/images/itemdetails/2_2.gif',1)"><img src="http://img.misco.co.uk/images/itemdetails/2_1.gif" name="Image2" height="41" border="0"></a><a href="item-details.asp?EdpNo=336830&Tab=7&NoMapp=0" onMouseOut="MM_swapImgRestore()" onMouseOver="MM_swapImage('Image7','','http://img.misco.co.uk/images/itemdetails/7_2.gif',1)"><img src="http://img2.misco.co.uk/images/itemdetails/7_1.gif" name="Image7" height="41" border="0"></a><a href="item-details.asp?EdpNo=336830&Tab=11&NoMapp=0" onMouseOut="MM_swapImgRestore()" onMouseOver="MM_swapImage('Image11','','http://img2.misco.co.uk/images/itemdetails/11_2.gif',1)"><img src="http://img1.misco.co.uk/images/itemdetails/11_1.gif" name="Image11" height="41" border="0"></a></td> 
                  <td width="80"><a href="item-details.asp?EdpNo=336830&Tab=14&NoMapp=0" onMouseOver="MM_swapImage('whybuy','','http://img1.misco.co.uk/images/itemdetails/14_2.gif',1)" onMouseOut="MM_swapImgRestore()"><img src="http://img3.misco.co.uk/images/itemdetails/14_0.gif" name="whybuy" width="69" height="41" border="0" id="whybuy" alt=""></a><img src="http://img2.misco.co.uk/images/itemdetails/infotable_rightside1.gif" width="7" height="41" border="0" alt=""></td>
      
        </tr>      
      </table>
      <!-- End Tabs -->
<div align="center">
  
<table width="100%" border="0" cellpadding="0" cellspacing="0" class="arial2">
   <tr> 
      <td width="10"><img src="http://img1.misco.co.uk/images/misc/pixel-clr.gif" width="10" height="1" border="0" alt=""></td>
      <td valign="top" align="left">

      
<!-- Compare this deal begin -->

<!-- Compare this deal end -->
      
      <br /><span style="text-align: left"></span>
        <!-- Begin I2 HTML content -->
      <div class="i2htmlcontent"><span style="text-align: left"><img alt="Canon CanoScan 5600F Scanner" width="150" height="132" src="http://img.systemaxdev.com/productmedia/htmlimages/cten/can/can-Q/Q151273-small.jpg" style="float:right;padding-left:20px;">

<div class="subheader" style="margin-bottom:10px"><b>Canon CanoScan 5600F Scanner</b></div>

<div class="normal">
<b>Outstanding photo, film and document scanning with advanced CCD technology</b>

<br><br>
Professional-quality CCD scanner with 35mm film/slide holder delivering exceptional 4800x9600dpi resolution and 11-second 300dpi scans. Zero warm-up time offers instant operation for reflective scans.
<ul>
<li>Advanced CCD technology photo, film and document scanner 
<li>Premium quality 4800x9600dpi resolution and 48-bit colour 
<li>Film strip and slide scanning 
<li>Zero warm-up time 
<li>Fast 300dpi A4 scanning 
<li>7 EZ buttons 
<li>Auto Scan Mode 
<li>USB connection 
<li>Built-in AC adaptor 
</li>
</ul>
</div>


<div class="subheader" style="clear:both;margin-bottom:10px;margin-top:20px;"><b>Main Specifications</b></div>
<div class="normal">
<table cellspacing="0"  cellpadding="4" border="0" width="100%" class="table5pxPadLightGreyBottomBorder">
	<tr>
		<td width="200"><b>Product Description</b></td>
		<td>Canon CanoScan 5600F Scanner</td>
	</tr>
<tr>
		<td><b> Type</b></td>

		<td>Desktop Colour Flatbed Scanner with Film Adaptor Unit</td>
	</tr>
	<tr>
		<td><b>Scanning Element</b></td>
		<td>CCD 6-line colour</td>
	</tr>
	<tr>

		<td><b>Light Source</b></td>
		<td>White LED - Reflective / Cold cathode fluerescent lamp</td>
		</tr>

		<tr>
		<td><b>Resolution</b></td>
		<td>Optical: 4800 dpi x 9600 dpi<br>
		Selectable: 25 - 19200 dpi</td>

		</tr>
		<tr>
		<td><b>Interface</b></td>
		<td>Hi-Speed USB</td>
	</tr>
		<tr>
		<td><b>Scanning Graduation</b></td>

		<td>Colour: 48bit input -> 48/24 bit output<br>
		GreyScale: 48 bit input - > 16 bit (Film scanning) / 8 bit output</td>
	</tr>	
		<tr>
		<td><b>Maximum document size</b></td>
		<td>A4 / Letter [216 x 297mm]</td>
	</tr>	

		<tr>

		<td><b>EZ-Scan Buttons</b></td>
		<td>7 buttons (PDF x 4, COPY, PHOTO/FILM, E-MAIL)</td>
	</tr>		

				<tr>
		<td><b>Preview Speed</b></td>
		<td>Approx. 3 sec</td>
	</tr>

		<tr>
		<td><b>Scanning speed</b></td>
		<td>Colour: 1.8 msec./line (300 dpi), 14.6 msec/line (4800dpi)<br>
		GreyScale: 1.8 msec./line (300 dpi), 14.6 msec/line (4800dpi) <br>
		Mono: 1.8 msec./line (300 dpi), 14.6 msec/line (4800dpi)</td>
	</tr>
			<tr>

		<td><b>Scan speed</b></td>
		<td>A4, 300 dpi, Colour: Approx. 11 sec</td>
	</tr>
					<tr>
		<td><b>Film Handling</b></td>
		<td>35 mm strip (negative/positive)/6 frames,<br>
Slide (negative /positive) /4 frames</td>

	</tr>
		<tr>
		<td><b>Software included</b></td>
		<td>ScanGear, MP Navigator EX, ArcSoft PhotoStudio</td>
		</tr>
			<tr>
		<td><b>Operaing System Requirements</b></td>

		<td>Windows Vista™, XP SP2, 2000 Professional SP4 / Internet Explorer 6.0 / CD-ROM drive / Display 1024x768
Mac OS X v.10.3.9, v.10.4,v.10.5 / Safari / CD-ROM drive / Display 1024x768</td>
	</tr>
		<tr>
		<td><b>Dimensions (WxDxH)</b></td>
		<td>272 x 491 x 97 mm</td>
	</tr>
	<tr>

		<td><b>Weight</b></td>
		<td>approx 4.3 kg</td>
	</tr>	

</table>
</div>

<br><br>

<div class="subheader" style="clear:both;margin-bottom:10px;margin-top:20px;"><b>A Closer Look</b></div>
<div align="center" style="margin-bottom:20px;">
<img alt="Canon CanoScan 5600F Scanner" src="http://img.systemaxdev.com/productmedia/htmlimages/cten/CAN/CAN-Q/Q151273-left.jpg">

	</div>


<div class="subheader" style="clear:both;margin-bottom:10px;margin-top:20px;"><b>Features</b></div>
<div class="normal">
<b>High-resolution scanning</b> <br>
With a maximum resolution of 4800x9600dpi, this advanced CCD technology scanner has the capacity to deliver scans with very high levels of clarity, photographic detail and with accurate colour reproduction. The 48-bit input ensures the accuracy of colours and brings the best out of your images every time.
<br><br>
The scanner has the capacity to create 300dpi A4 scans in a remarkably fast time of approximately 11 seconds.
<br><br>
<b>Superb functionality</b><br>
Both 35mm film and slide scanning are delivered with exceptional quality. It is possible to scan film straight into your PC or Mac.

<br><br>
Those using film scanning will benefit from six-frames support for 35mm strip (negative/positive) and four-frames support for 35mm slide (negative/positive).
<br><br>
The inclusion of a high-brightness white LED for reflective scans equals fast operation. This is superbly demonstrated via zero warm-up, so there is no need for the user to wait around before they can begin scanning.
<br><br>
<b>Easy to use</b><br>
This is a sophisticated scanner with many creative functions, but it remains easy to use. Seven EZ buttons are provided – they are all on the front of the scanner – and these allow the most common tasks to be performed with one-click. The buttons cover Copy, Scan, Email and creating PDFs. The buttons can also be configured by the user to suit their individual preferences.
<br><br>
Auto Scan Mode is another creative, yet straightforward feature. The scanner automatically senses what is to be scanned – document or photo – and scans and saves this using settings appropriate to the original. This can be completed from the One-click software menu.
<br><br>
<b>Connectivity</b><br>
The scanner features a Hi-Speed USB interface that helps to increase productivity and the speed of scans. For power, an AC adaptor is built into the product.
<br><br>
<b>Software</b><br>

A comprehensive range of software is supplied with the scanner and includes MP Navigator EX (for easy operation and to perform the most complex scanning operations), ScanGear and also ArcSoft PhotoStudio
</div></span></div>
        <!-- End I2 HTML content -->
   
     
	<span style="text-align: left"></span><br>

      <!-- /font -->

	   


         <table width="100%" border="0" cellspacing="0" cellpadding="0">
            <tr>                                 
               <td><td>                        
                  <table width="100%" border="0" cellspacing="0" cellpadding="0">
                     <tr>                              
                        <td width="60">                                 
                           
                       </td>

                     </tr>                                                                  
                  </table>
				</td>
            </tr>                                      
         </table>
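
The specification table in the source above is a two-column label/value layout, so it could be turned into an associative array before the database insert. A minimal sketch, assuming each useful row has exactly two cells; `$specs` and the shortened sample rows are for illustration only.

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Shortened copy of the "Main Specifications" table from the page source.
$doc->loadHTML('<html><body><div class="normal">
<table class="table5pxPadLightGreyBottomBorder">
  <tr><td width="200"><b>Product Description</b></td><td>Canon CanoScan 5600F Scanner</td></tr>
  <tr><td><b>Interface</b></td><td>Hi-Speed USB</td></tr>
  <tr><td><b>Weight</b></td><td>approx 4.3 kg</td></tr>
</table></div></body></html>');
$xpath = new DOMXPath($doc);

$specs = [];
// "//tr" rather than "/tr" so the query does not depend on a <tbody> being present.
$rows = $xpath->query('//table[@class="table5pxPadLightGreyBottomBorder"]//tr');
foreach ($rows as $row) {
    $cells = $xpath->query('./td', $row);
    if ($cells->length !== 2) {
        continue; // skip rows that do not fit the label/value layout
    }
    $label = trim($cells->item(0)->textContent);
    // Collapse line breaks and repeated spaces inside the value cell.
    $value = trim(preg_replace('/\s+/', ' ', $cells->item(1)->textContent));
    $specs[$label] = $value;
}
print_r($specs);
```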

 

Once again, any help will be appreciated.

 

Thanks guys.

 

 

jari

 

 

 

 

 

 

 

 

 

 
