dflow Posted December 11, 2010

I want to parse all the URLs from a page using preg_match. I get results, but I want them to be distinct: if there is more than one link to the same URL, I only want one entry for it in the array. I want to find all URLs matching this pattern: "products/productpage1.htm"

I'm using this code:

<?php
$data = file_get_contents("http://example.com");
$pattern = "/href=[\"']?([^\"']?.*(htm))[\"']?/i";
preg_match_all($pattern, $data, $urls);
print_r($urls);
?>
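One possible reason the pattern above returns unexpected results is that `.*` is greedy, so a single match can run from the first `href` on a line to the last `htm` on that line. A minimal sketch of a tighter pattern combined with `array_unique` follows. The function name `extract_product_links` and the sample HTML are made up for illustration, and the sketch assumes every `href` value is quoted and that the wanted links contain `products/` and end in `.htm`.

```php
<?php
// Hypothetical helper: collect distinct .htm links under products/ from an HTML string.
function extract_product_links(string $html): array
{
    // [^"']* cannot cross a quote, so each match stays inside one href value.
    $pattern = '~href=["\']([^"\']*products/[^"\']*\.htm)["\']~i';
    preg_match_all($pattern, $html, $matches);

    // $matches[1] holds the captured URLs. array_unique drops repeats and
    // array_values renumbers the remaining entries from 0.
    return array_values(array_unique($matches[1]));
}

// Made-up sample input with one repeated link.
$html = '<a href="products/productpage1.htm">A</a> '
      . '<a href="products/productpage2.htm">B</a> '
      . '<a href="products/productpage1.htm">A again</a>';

print_r(extract_product_links($html));
```

In a real script the `$html` string would come from `file_get_contents`, as in the question.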
MMDE Posted December 11, 2010

Sorry if it is a bit shabby, I'm not the best at regex. Feel free to change it; in fact, I'd recommend it if you can.

<?php
$site='siteurl';
if($sitecont=@file_get_contents($site)){
    $links=array();
    $templinks=array();
    $templink=array();
    // match href="..." values of 1 to 50 non-whitespace characters
    preg_match_all('/href="[\S]{1,50}"/',$sitecont,$templinks);
    foreach($templinks AS $templink){
        foreach($templink AS $tlink){
            // strip the leading href=" (6 characters) and the trailing quote
            $links[]=substr($tlink,6,strlen($tlink)-7);
        }
    }
    foreach($links AS $link){
        echo $link.'<br />';
    }
}
?>
MMDE Posted December 11, 2010

<?php
$site='http://localhost';
$domain=$_SERVER['HTTP_HOST'];
if($sitecont=@file_get_contents($site)){
    $templinks=array();
    $templink=array();
    $links=array();
    // non-greedy match, so each href="..." is captured separately
    preg_match_all('/href=".*?"/',$sitecont,$templinks);
    foreach($templinks AS $templink){
        foreach($templink AS $tlink){
            // remove the host name; preg_quote keeps dots in it from acting as wildcards
            $tlink=preg_replace('/'.preg_quote($domain,'/').'/','',$tlink);
            // strip the leading href=" (6 characters) and the trailing quote
            $links[]=substr($tlink,6,strlen($tlink)-7);
        }
    }
    foreach($links AS $link){
        echo $link.'<br />';
    }
}
?>

Added some more functionality.
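Removing the host name with a regex leaves the `http://` scheme behind in absolute links. As an alternative, here is a rough sketch that uses PHP's built-in `parse_url` to reduce an absolute link on the same host to its path. The function name `strip_domain` and the sample links are assumptions made for illustration, and the sketch drops any `#fragment` part.

```php
<?php
// Hypothetical helper: return the path (plus query string) of a link, dropping
// the scheme and host only when the host equals $domain.
function strip_domain(string $link, string $domain): string
{
    $parts = parse_url($link);
    if ($parts === false
        || !isset($parts['host'])
        || strcasecmp($parts['host'], $domain) !== 0) {
        // Relative link, link to another site, or unparsable: leave it as is.
        return $link;
    }

    $path = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    return $path;
}

// Made-up sample links.
echo strip_domain('http://localhost/products/productpage1.htm', 'localhost'), "\n";
echo strip_domain('products/productpage2.htm', 'localhost'), "\n";
```

The helper would be called on the extracted link text, that is, after the `substr` step rather than before it.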
MMDE Posted December 11, 2010

<?php
$site='http://localhost';
$domain=$_SERVER['HTTP_HOST'];
if($sitecont=@file_get_contents($site)){
    preg_match_all('/href=".*?"/',$sitecont,$templinks);
    $links=array();
    foreach($templinks AS $templink){
        foreach($templink AS $tlink){
            // remove the host name; preg_quote keeps dots in it from acting as wildcards
            $tlink=preg_replace('/'.preg_quote($domain,'/').'/','',$tlink);
            // strip the leading href=" (6 characters) and the trailing quote
            $nlink=substr($tlink,6,strlen($tlink)-7);
            // check whether this link has already been stored
            $dupelink=0;
            foreach($links AS $ulink){
                if($nlink==$ulink){
                    $dupelink=1;
                }
            }
            if($dupelink==0){
                $links[]=$nlink;
            }
        }
    }
    foreach($links AS $link){
        echo $link.'<br />';
    }
}
?>

Now it should only echo/store a link once (only unique links).
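The flag-and-inner-loop check above can be written more compactly with PHP's built-in `in_array`. The sketch below is one way to do it, not MMDE's code: the function name `unique_links` and the sample data are hypothetical. Passing `true` as the third argument makes the comparison strict, which avoids `==` treating numeric-looking strings such as "10" and "1e1" as equal.

```php
<?php
// Hypothetical helper: keep the first occurrence of each link, in original order.
function unique_links(array $found): array
{
    $links = [];
    foreach ($found as $link) {
        // Strict comparison: strings must match exactly to count as duplicates.
        if (!in_array($link, $links, true)) {
            $links[] = $link;
        }
    }
    return $links;
}

// Made-up sample data with one repeated link.
$found = ['products/a.htm', 'products/b.htm', 'products/a.htm'];
echo implode("\n", unique_links($found)), "\n";
```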
dflow Posted December 16, 2010 (Author)

MMDE wrote: "Now it should only echo/store a link once (only unique links)."

For some reason it still outputs 3 links.

Now, how would I loop through each result and parse each web page?

foreach($links AS $link){
    foreach($link->find('span[id=apartmentname]') as $partmentname)
        echo $partmentname->plaintext.'<br><br>';
    }
}

The foreach isn't correct. How should it be written to loop through the resulting array? I know I'm making a mess, but I'm new to looping over arrays. Thanks.
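In the loop above, `$link` is a plain string, so `->find()` cannot be called on it; each page has to be downloaded and parsed before it can be searched. The following is a minimal sketch of that idea using PHP's bundled DOM extension rather than whichever parser library the snippet above was written for. The function names `apartment_names` and `names_for_links`, the `$fetch` callable, and the canned pages are all assumptions made so the example runs offline.

```php
<?php
// Hypothetical helper: text of every <span id="apartmentname"> in an HTML string.
function apartment_names(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $names = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//span[@id="apartmentname"]') as $span) {
        $names[] = trim($span->textContent);
    }
    return $names;
}

// Hypothetical driver: $fetch is any callable returning a page's HTML or false.
function names_for_links(array $links, callable $fetch): array
{
    $result = [];
    foreach ($links as $link) {
        $html = $fetch($link);
        if ($html === false || $html === '') {
            continue; // skip pages that could not be loaded
        }
        $result[$link] = apartment_names($html);
    }
    return $result;
}

// Made-up canned pages standing in for real downloads.
$pages = [
    'products/productpage1.htm' => '<html><body><span id="apartmentname">Sea View</span></body></html>',
    'products/productpage2.htm' => '<html><body><span id="apartmentname">Garden Flat</span></body></html>',
];
$fetch = function (string $link) use ($pages) {
    return $pages[$link] ?? false;
};

foreach (names_for_links(array_keys($pages), $fetch) as $link => $names) {
    echo $link, ': ', implode(', ', $names), "\n";
}
```

For live pages, `$fetch` could wrap `file_get_contents`, with relative links first joined onto the site's base URL.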