Some help with scraping and regular expressions

sloth456 · October 31, 2008

Hi,

I hate to ask, I feel like such a leecher rather than a contributor.

I was wondering if anyone would help with the following problem.

This is my code:

preg_match('/<a href="(.*?)".*?<\/a>/',$googleresult,$matches);

I've basically scraped one page of listings from Google using file_get_contents and put it in $googleresult.

I have stripped out all domains with "google" in them which leaves me with just the URLs for the actual listings.

I'm basically trying to pull out JUST the first URL between the anchor tags. I'm really useless with regular expressions, so I used http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial for help.

Trouble is, I seem to be pulling out more than just the URL:

<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>%20-site:<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>

bobbinsbro · October 31, 2008

you're actually getting exactly what you're asking for (i'm assuming the last code block is an example of the return of the preg_match).

your regex matches the entire <a ...>...</a> element, not just the url. if you only want the url, try something like:

preg_match('/href="(.*?)"/',$googleresult,$matches);

i'm pretty useless with regex myself, so success is not guaranteed.
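
For readers following along, here is a minimal sketch of what preg_match puts into $matches for the pattern suggested above. The sample string is a shortened copy of the anchor quoted in the first post (the link text "Lesson" is abbreviated for illustration), and the variable names $fullMatch and $url are made up for this example. The point it tries to show is that index 0 holds the whole match while index 1 holds only what the parentheses captured.

```php
<?php
// Shortened sample based on the anchor quoted earlier in the thread.
$googleresult = '<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson</a>';

// preg_match() returns 1 if the pattern matched and 0 if it did not.
if (preg_match('/href="(.*?)"/', $googleresult, $matches)) {
    $fullMatch = $matches[0]; // whole match, including href=" and the closing quote
    $url       = $matches[1]; // only the text captured by the parentheses
    echo $url, "\n";
}
```

If that reading is right, printing $matches[1] rather than $matches[0] (or the whole array) should give just the URL.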

samshel · October 31, 2008

try preg_match_all and print $matches
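
A rough sketch of that suggestion, using the pattern from the first post. The two example.com / example.org anchors are invented stand-ins for the scraped page, and $count, $urls and $firstUrl are hypothetical names. With the default flags, preg_match_all groups results by pattern part, so $matches[1] should be the list of captured URLs.

```php
<?php
// Hypothetical stand-in for the scraped page; these URLs are made up.
$googleresult = '<a href="http://example.com/first" class=l>First</a> '
              . '<a href="http://example.org/second" class=l>Second</a>';

// Returns the number of full matches found (or false on error).
$count = preg_match_all('/<a href="(.*?)".*?<\/a>/', $googleresult, $matches);

// Dump the whole structure: $matches[0] holds the full anchors,
// $matches[1] holds only the captured URLs.
print_r($matches);

$urls     = $matches[1];
$firstUrl = $count > 0 ? $urls[0] : null; // first listing only
```

Taking $urls[0] gives the first URL on the page, which is what the original question asked for.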

sloth456 · October 31, 2008

Thanks bobbinsbro, but that didn't quite work: I still pulled out the href=" part.

@samshel: I think you found me the solution.

bobbinsbro · October 31, 2008

i know you pulled the "href=". you were supposed to cut that bit off once the results returned...
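
One way to do that trimming by hand, sketched under the assumption that the returned string looks like the full match of the href pattern (href=", the URL, then a closing quote). The names $fullMatch and $prefix are invented for this example.

```php
<?php
// Assumed shape of the returned full match: href="...".
$fullMatch = 'href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html"';

$prefix = 'href="';

// Start after the prefix and stop one character early to drop the closing quote.
$url = substr($fullMatch, strlen($prefix), -1);
echo $url, "\n";
```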
