
Trying to use preg match to get all links in html source code...


physaux


Hey guys, here is what I have so far:

$regex = '/<a(.+?)\/a>/';
preg_match($regex,$htmlcode,$output);
echo $output[1] . '<br>';

 

But I think I am doing it wrong. Here is what I want to do: I have all the HTML code from a page. I want to extract all the links into an array, and preferably get the anchor text too. So I want my output to look like this:

 

$finaloutput[1]['url']="http://google.com";

$finaloutput[1]['anchor']="google";

$finaloutput[2]['url']="http://phpfreaks.com";

$finaloutput[2]['anchor']="phpfreaks";

...

Could anyone please point me in the right direction to do this? Thank you!

OK, it is not working; it just prints out "Array ()".

I tried changing echo print_r to just print_r, and then nothing was output at all. Please help!

 

<?php
if($_POST){

$domains = explode("\n", $_POST['domains']);

$ch = curl_init();

foreach($domains as $url)
{
	curl_setopt ($ch, CURLOPT_URL, $url);
	curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
	curl_setopt ($ch, CURLOPT_TIMEOUT, 60);
	curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
	curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt ($ch, CURLOPT_REFERER, 'http://www.google.com/search?q=best+community+forum');
	$AskApache_result = curl_exec ($ch);

	$pattern = '%<a [^>]+href="(?P<url>[^"]+)"[^>*]*>(?P<text>[^< ]+)</a>%si';
	preg_match_all($pattern, $AskApache_result, $matches);
	$urls = array();
	foreach($matches['url'] as $k=>$v) {
	    $urls[$k] = array('url' => $v,'text' => $matches['text'][$k]);
	}
	echo print_r($urls, true);

	flush();
	//ob_flush();
}
}
?>

 

What is wrong?

Thank you!
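
One possible cause, offered as a sketch rather than a confirmed diagnosis: the pattern above requires at least one character between `<a ` and `href` (the `[^>]+` part) and forbids spaces in the anchor text (the `[^< ]+` part), so plain links such as `<a href="...">` never match. The snippet below compares it with a relaxed variant on two made-up sample strings. The `$relaxed` pattern and the `countLinks()` helper are hypothetical names introduced only for this illustration, and a regex like this still misses single-quoted attributes and anchors that contain nested tags.

```php
<?php
// Pattern copied from the post above.
$strict = '%<a [^>]+href="(?P<url>[^"]+)"[^>*]*>(?P<text>[^< ]+)</a>%si';

// Hypothetical relaxed variant: attributes before href are optional,
// and the anchor text may contain spaces.
$relaxed = '%<a\s[^>]*?href="(?P<url>[^"]+)"[^>]*>(?P<text>[^<]+)</a>%si';

// Hypothetical helper: number of full matches of $pattern in $html.
function countLinks($pattern, $html) {
    return preg_match_all($pattern, $html, $m);
}

// Made-up sample markup, not taken from any real page.
$samples = array(
    '<a href="http://google.com">google</a>',
    '<a class="x" href="http://phpfreaks.com">PHP Freaks</a>',
);

foreach ($samples as $html) {
    printf("strict: %d, relaxed: %d\n",
        countLinks($strict, $html),
        countLinks($relaxed, $html));
}
```

On both samples the strict pattern finds nothing while the relaxed one finds the link, which would be consistent with an empty array being printed. It is also worth checking that `curl_exec` did not return `false`, since lines split on "\n" can still carry a trailing "\r".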

My suggestion would be to use something like DOM / DOMXPath.

So for example, suppose I wanted to fetch all the links from, say, http://www.cs.sfu.ca/. This would be one way to do it:

 

$dom = new DOMDocument;
libxml_use_internal_errors(true); // keep warnings about malformed HTML out of the output
@$dom->loadHTMLFile('http://www.cs.sfu.ca/'); // insert url of choice
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$aTag = $xpath->query('//a[@href]'); // search for all anchor tags that provide an href attribute

$finaloutput = array(); // declaring array $finaloutput
foreach($aTag as $url){
    // one entry per link: the href value and the anchor text
    $finaloutput[] = array('url' => $url->getAttribute('href'), 'anchor' => $url->nodeValue);
}

echo '<pre>'.print_r($finaloutput, true).'</pre>';
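
If the HTML is already in a string (for example the result of `curl_exec` in the earlier post), the same idea can be applied with `loadHTML` instead of `loadHTMLFile`. The following is a minimal sketch under that assumption; `extractLinks()` is a hypothetical helper name, the sample markup is made up, and the array is indexed from 1 only because the original question asked for that layout.

```php
<?php
// Hypothetical helper: returns array(1 => array('url' => ..., 'anchor' => ...), ...)
// for every <a> element in $html that has an href attribute.
// Assumes $html is a non-empty string.
function extractLinks($html) {
    $dom = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $xpath = new DOMXPath($dom);
    $links = array();
    $i = 1; // the question asked for indices starting at 1
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[$i++] = array(
            'url'    => $a->getAttribute('href'),
            'anchor' => trim($a->nodeValue), // text of the link, nested tags flattened
        );
    }
    return $links;
}

// Made-up sample markup: two real links and one anchor without href.
$html = '<p><a href="http://google.com">google</a> '
      . '<a name="top">no href</a> '
      . '<a href="http://phpfreaks.com"> php<b>freaks</b> </a></p>';

$links = extractLinks($html);
print_r($links);
```

Unlike the regex approach, this picks up anchor text that contains spaces or nested tags, and it skips anchors that have no href.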

Archived

This topic is now archived and is closed to further replies.
