
Some help with scraping and regular expressions


sloth456


Hi,

 

I hate to ask, I feel like such a leecher rather than a contributor.

 

I was wondering if anyone would help with the following problem.

 

This is my code:

 

preg_match('/<a href="(.*?)".*?<\/a>/',$googleresult,$matches);

 

I've scraped one page of listings from Google using file_get_contents and put it in $googleresult.

 

I have stripped out all domains with "google" in them which leaves me with just the URLs for the actual listings.

 

I'm trying to pull out JUST the first URL between anchor tags. I'm really useless with regular expressions, so I used http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial for help.

 

Trouble is, I seem to be pulling out more than just the URL:

 

<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>%20-site:<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>

 

You're actually getting exactly what you're asking for (I'm assuming the last code block is an example of what preg_match returned).

Your regex describes the entire <a ...>...</a> tag, so the full match contains the whole tag, not just the URL. If you only want the URL, try something like:

preg_match('/href="(.*?)"/',$googleresult,$matches);
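To show where the URL ends up, here is a minimal sketch. It assumes $googleresult holds markup like the sample above; the string below is a shortened, made-up stand-in, and the second link is purely hypothetical. After a successful preg_match, $matches[0] is everything the pattern matched and $matches[1] is only what the parentheses captured.

```php
<?php
// Hypothetical, shortened stand-in for the scraped page held in $googleresult.
$googleresult = '<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson</a>'
              . '<a href="http://example.com/second" class=l>Second</a>';

// preg_match stops after the first match; the parentheses capture the URL.
if (preg_match('/href="(.*?)"/', $googleresult, $matches)) {
    $wholeMatch = $matches[0]; // full match, including the href="..." wrapper
    $firstUrl   = $matches[1]; // only the captured URL
    echo $firstUrl, "\n";
}
```

The same indexing applies to the original pattern: there $matches[0] would be the whole anchor tag, while $matches[1] would still be the URL.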

 

I'm pretty useless with regex myself, so success is not guaranteed. ;)
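If the regex turns out to be brittle (single-quoted attributes, extra whitespace and so on), an alternative worth trying is PHP's DOM extension instead of a pattern. This is only a sketch and a different technique from the one in the original post: it assumes the dom extension is available, and firstLinkHref is a hypothetical helper name made up for illustration.

```php
<?php
// Hypothetical helper, not from the original post:
// returns the href of the first <a> element that has one, or null if none is found.
function firstLinkHref(string $html): ?string
{
    $doc = new DOMDocument();

    // Scraped markup is rarely valid HTML, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Walk the anchors in document order and return the first href seen.
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            return $anchor->getAttribute('href');
        }
    }

    return null;
}
```

Calling firstLinkHref($googleresult) on non-empty markup would then give the first listing URL without any regex involved.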

Archived

This topic is now archived and is closed to further replies.
