
PHP Regexp file crawling challenge



clarencek

Posted 02 July 2006 - 08:13 AM

Hi, I'm fairly new to PHP but am learning quite a bit.

I am trying to grab some information from a web page: http://en.wikipedia....adway_theatre. What I want is just the table data: for each row, the theater, the show, the address, and the opening date.

Is it possible to do this with PHP and regular expressions?

This is what I have so far:

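<?php

// Show errors while debugging; error_reporting(0) would hide a failed download.
error_reporting(E_ALL);

$url = 'http://en.wikipedia.org/wiki/Broadway_theatre';
$page = file_get_contents($url);
if ($page === false) {
    die("Could not fetch $url");
}

// Capture the link text of the first <a ... title="..."> found after each <td>.
if (preg_match_all('/<td>.+?title=".+?">(.+?)<\/a>/im', $page, $links, PREG_SET_ORDER)) {
    // Each $link holds the full match at index 0 and the captured text at index 1.
    foreach ($links as $link) {
        print_r($link);
        echo "<br />";
    }
}

?>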

The problem is that the pattern matches everything that starts with <td>, includes a title attribute, and ends in </a>, not just the theater name. As for the address and opening date, I have no idea how to separate them out, since they are enclosed in plain <td> </td> tags. Ultimately I want to store these values in a database. I know how to do the database part; the challenge is extracting the values.

Thanks for any help,

Clarence
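
One possible direction for the question above, offered only as a sketch: instead of a single regular expression, PHP's DOM extension (DOMDocument and DOMXPath) can walk the table row by row and return each cell's text, which also covers the plain <td> cells. The function name extractTableRows and the sample markup below are made up for illustration, the column order (theater, show, address, opened) is an assumption about the page, and the syntax assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns one array of trimmed cell texts per table row.
function extractTableRows(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // textContent drops nested tags such as <a>, leaving only the visible text.
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) { // skip header rows, which hold <th> cells rather than <td>
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<table>
<tr><th>Theater</th><th>Show</th><th>Address</th><th>Opened</th></tr>
<tr><td><a href="/wiki/A" title="Example Theatre">Example Theatre</a></td><td><a href="/wiki/B" title="Example Show">Example Show</a></td><td>123 W. 45th St.</td><td>1 January 2000</td></tr>
</table>';

// Assumes every data row has exactly four cells in this order.
foreach (extractTableRows($sample) as [$theater, $show, $address, $opened]) {
    echo "$theater | $show | $address | $opened\n";
}
```

On the real page, the string returned by file_get_contents($url) would take the place of $sample, and the XPath expression would probably need narrowing to the one table of interest, since '//table//tr' matches rows from every table in the document.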



