
Limiting results using Regexp from a URL scraping query


DJphp


Hello,

 

Whilst scraping a web page for URLs with this expression:

$urlLink = "/<a[^>]+href=\"(showthread\.php\?s=[^\"]+)/i";

 

and using preg_match_all and printf to display the results

 

I get the results (showing a subset):

showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041

showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041&page=2

showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&p=274081#post274081

and so on.

 

What I would like to do is capture only URLs like the first line:

  showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041

 

 

and exclude results like lines 2 and 3:

showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041&page=2

showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&p=274081#post274081

 

 

That is, I want to exclude anything like the following:

 

showthread.php?s=****&p=****

or

  showthread.php?s=****&t=****&page=****

 

and only display results with:

showthread.php?s=****&t=****

 

My thought was to exclude any results that include "&p=" or "&page=", but I just cannot seem to do that.

 

Any help would be appreciated.

 

DJphp

 

Try something like this:

 

<pre>
<?php
$data = <<<DATA
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041"></a>
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041&page=2"></a>
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&p=274081#post274081"></a>
DATA;
// Consume each character only if "&p=" or "&page=" does not start at that position.
preg_match_all('/<a[^>]+href="(showthread\.php\?s=(?:(?!&p(?:age)?=)[^"])+)"/i', $data, $matches);
array_shift($matches);
print_r($matches);
?>
</pre>
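
If the single-pattern approach is hard to follow, the idea from the question (exclude anything containing "&p=" or "&page=") can also be done in two steps. This is only a sketch, assuming the same three sample links as above; the variable name $threads is made up for illustration. It captures every link with the original expression and then filters the captured list with preg_grep and the PREG_GREP_INVERT flag:

```php
<?php
// Sample input: the three links from the question.
$data = <<<DATA
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041"></a>
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&t=26041&page=2"></a>
<a href="showthread.php?s=21be590fe8a4b4e317a7fa54b3ff8230&p=274081#post274081"></a>
DATA;

// Step 1: capture every showthread.php link, using the expression from the question.
preg_match_all('/<a[^>]+href="(showthread\.php\?s=[^"]+)/i', $data, $matches);

// Step 2: drop any captured URL that contains "&p=" or "&page=".
// PREG_GREP_INVERT makes preg_grep return the entries that do NOT match.
// array_values() reindexes the result from 0.
$threads = array_values(preg_grep('/&p(?:age)?=/', $matches[1], PREG_GREP_INVERT));

print_r($threads);
```

Note that if the page source writes the ampersands as &amp;amp; then "&p=" will not appear literally, and either pattern would need adjusting to match.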
