
Some help with scraping and regular expressions


sloth456


Hi,

 

I hate to ask, I feel like such a leecher rather than a contributor.

 

I was wondering if anyone would help with the following problem.

 

This is my code:

 

preg_match('/<a href="(.*?)".*?<\/a>/',$googleresult,$matches);

 

I've scraped one page of listings from Google using file_get_contents and put it in $googleresult.

I have stripped out all domains with "google" in them, which leaves me with just the URLs for the actual listings.

I'm trying to pull out JUST the first URL between anchor tags. I'm really useless with regular expressions, so I used http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial for help.

Trouble is, I seem to be pulling out more than just the URL:

 

<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>%20-site:<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look at the "<em>Hello World</em>!" Application (The Java <b>...</b></a>

 


You're actually getting exactly what you asked for (I'm assuming the last code block is an example of what preg_match returned).

Your regex describes the entire contents of the <a...></a> tags. If you only want the URL, try something like:

preg_match('/href="(.*?)"/',$googleresult,$matches);

 

I'm pretty useless with regex myself, so success is not guaranteed. ;)
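
For what it's worth, here is a minimal sketch of how the two parts of a match can be told apart, assuming the output in the question was $matches[0]. preg_match stores the full matched text in $matches[0] and the first parenthesised group in $matches[1], so the original pattern may already have captured the bare URL. The shortened sample string and the names $fullMatch and $firstUrl are made up for illustration and are not from the original post.

```php
<?php
// Hypothetical sample standing in for $googleresult (shortened from the question).
$googleresult = '<a href="http://java.sun.com/docs/books/tutorial/getStarted/application/index.html" class=l>Lesson: A Closer Look</a>';

// Same pattern as in the question: it matches a whole anchor and captures the href value.
if (preg_match('/<a href="(.*?)".*?<\/a>/', $googleresult, $matches)) {
    $fullMatch = $matches[0]; // the entire <a ...>...</a> element
    $firstUrl  = $matches[1]; // only the text captured by (.*?)
    echo $firstUrl, "\n";
}
```

Since preg_match stops after the first match, $matches[1] should hold only the first URL in the page. This assumes the href attribute comes directly after "<a " and is double-quoted, as in the sample.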

