PHP Regexp file crawling challenge


8 replies to this topic

#1 clarencek

clarencek
  • Members
  • PipPip
  • Member
  • 21 posts

Posted 02 July 2006 - 05:03 PM

Hi, I'm fairly new to php but am learning quite a bit.

I am trying to grab some information from a web page: http://en.wikipedia....adway_theatre.  What I want to get is just the table information by storing the theater, the show, the address, and when it opened.

Is it possible to do this with PHP and regular expressions?

This is what I have so far:

<?php

error_reporting(0);

$url = 'http://en.wikipedia.org/wiki/Broadway_theatre';
$page = file_get_contents($url);
if (preg_match_all('/<td>.+?title=".+?">(.+?)<\/a>/im', $page, $links, PREG_SET_ORDER)) {
  for ($i = 0; $i < count($links); $i++) {
    print_r($links[$i]);
    echo "\n";
  }
}


?>

The problem is that it picks up everything that starts with <td>, includes title, and ends in </a>, not just the theater name.  As for how to pick up the address and opening date by themselves, I have no idea how to separate them out, since they are enclosed in plain <td> </td> tags.  Ultimately I want to take these variables and put them in the database.  I know how to do the database part.  The challenge is getting all the variables.

Thanks for any help,

Clarence

#2 Wildbug

Wildbug
  • Members
  • PipPipPip
  • Advanced Member
  • 1,149 posts

Posted 04 July 2006 - 02:22 PM

If you only need to do this one time, an easier solution might be to highlight and copy the text from your web browser, paste it into your favorite macro-enabled text editor (Crimson Editor, for instance), and use it to add the SQL necessary to insert the data into the database.

Otherwise, this should work (untested):
/<tr><td><a.+?>(.+?)<\/a><\/td>\s+<td><i><a.+?>(.+?)<\/a><\/i><\/td>\s+<td>(.+?)<\/td><td><a.+?>(.+?)<\/a>,\s+<a.+?>(\d+)<\/a><\/td>\s+<\/tr>\s+/s

[1] Theater
[2] Show
[3] Address
[4] Date
[5] Year
Twice a day my clock works PERFECTLY!  I can't figure out what's wrong with it.

#3 Wildbug

Wildbug
  • Members
  • PipPipPip
  • Advanced Member
  • 1,149 posts

Posted 04 July 2006 - 02:54 PM

Disregard the earlier example; this one actually works.
preg_match_all('/<tr>\n<td><a.+?>(.+?)<\/a>.*?\n.*?<a.+?>(.+?)<\/a>.*?\n.*?<td>(.+?)<\/td>.*?\n.*?<a.+?>(.+?)<\/a>, <a.+?>(\d+)/i',$broadway,$matches);
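
A minimal sketch of how the five capture groups in this pattern line up with one table row. The row below is a made-up sample shaped the way the pattern expects (one cell per line, Unix line endings); the theater name, show, address, and link targets are hypothetical, and the real Wikipedia markup may differ.

```php
<?php
// Hypothetical sample row; joined with "\n" so the line endings are known.
$sampleRow = implode("\n", [
    '<tr>',
    '<td><a href="/wiki/Example_Theatre" title="Example Theatre">Example Theatre</a></td>',
    '<td><i><a href="/wiki/Example_Show" title="Example Show">Example Show</a></i></td>',
    '<td>123 W. 45th St.</td>',
    '<td><a href="/wiki/January_1" title="January 1">January 1</a>, <a href="/wiki/1900" title="1900">1900</a></td>',
    '</tr>',
]);

// The pattern from the post above, unchanged.
$pattern = '/<tr>\n<td><a.+?>(.+?)<\/a>.*?\n.*?<a.+?>(.+?)<\/a>.*?\n.*?<td>(.+?)<\/td>.*?\n.*?<a.+?>(.+?)<\/a>, <a.+?>(\d+)/i';

$count = preg_match_all($pattern, $sampleRow, $matches);

// In the default PREG_PATTERN_ORDER layout, $matches[N] lists every capture of group N.
$labels = [1 => 'Theater', 2 => 'Show', 3 => 'Address', 4 => 'Date', 5 => 'Year'];
$row = [];
foreach ($labels as $group => $label) {
    $row[$label] = $matches[$group][0];
}

print_r($row);
```

Because the pattern has no /s modifier, `.` never crosses a line break, so each `\n` in the pattern has to line up with a real newline in the input.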


#4 clarencek

clarencek
  • Members
  • PipPip
  • Member
  • 21 posts

Posted 04 July 2006 - 05:06 PM

Hmm...I'm not sure I'm getting this to work.  I have put my code below with my comments and questions in the //Comments.

------------------------------
error_reporting(E_ALL);

$url = 'http://en.wikipedia.org/wiki/Broadway_theatre';
$page = file_get_contents($url);

// This part below grabs just the table HTML, excluding the first <table> and the last </table>, and stores it in $links[0][1]

preg_match_all('/<table.class="wikitable">.+?<\/tr>(.+?)<\/table>/si', $page, $links, PREG_SET_ORDER);


// Then I put in your code

preg_match_all('/<tr>\n<td><a.+?>(.+?)<\/a>.*?\n.*?<a.+?>(.+?)<\/a>.*?\n.*?<td>(.+?)<\/td>.*?\n.*?<a.+?>(.+?)<\/a>, <a.+?>(\d+)/i', $links[0][1], $matches);


// But when I print it out, I get a blank array.  Am I doing something wrong with your snippet?

echo '<pre>' . print_r($matches, true) . '</pre>';

// Also, I was trying to keep date and year in one field, although I don't know if that matters here.

#5 Wildbug

Wildbug
  • Members
  • PipPipPip
  • Advanced Member
  • 1,149 posts

Posted 05 July 2006 - 06:29 PM

That preg_match_all() worked for me on that webpage when I tried it (although I had to save it locally -- file_get_contents() didn't work with that URL for some reason).  You don't need the first preg_match_all() function.  And don't worry about bringing out the date in one variable; just concatenate it when you put it in your database.

Does your program execution make it to the regex with the correct contents in those variables?  Really, it worked like a champ the other day.  :)
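
A rough sketch of the "just concatenate it" suggestion. The `$matches` array here is filled with made-up values in the layout preg_match_all() produces by default, where group 4 is the day and month and group 5 is the year; the `$opened` name is only for illustration.

```php
<?php
// Hypothetical capture results (default PREG_PATTERN_ORDER layout):
// $matches[4] holds the day/month captures, $matches[5] the matching years.
$matches = [
    4 => ['January 1', 'March 15'],
    5 => ['1900', '1925'],
];

// Join each day/month with the year at the same index into one string.
$opened = [];
foreach ($matches[4] as $i => $dayMonth) {
    $opened[] = $dayMonth . ', ' . $matches[5][$i];
}

print_r($opened);
```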

#6 clarencek

clarencek
  • Members
  • PipPip
  • Member
  • 21 posts

Posted 05 July 2006 - 09:41 PM

Can you show me your complete code?

Here is what I have

error_reporting(E_ALL);

$url = 'http://LocalCopyOftheHTMLPage';
$page = file_get_contents($url);

preg_match_all('/<tr>\n<td><a.+?>(.+?)<\/a>.*?\n.*?<a.+?>(.+?)<\/a>.*?\n.*?<td>(.+?)<\/td>.*?\n.*?<a.+?>(.+?)<\/a>, <a.+?>(\d+)/i', $page, $matches);

echo '<pre>' . print_r($matches, true) . '</pre>';

But the output I get on the HTML page is this:

Array
(
    [0] => Array
        (
        )

    [1] => Array
        (
        )

    [2] => Array
        (
        )

    [3] => Array
        (
        )

    [4] => Array
        (
        )

    [5] => Array
        (
        )

)


Results are blank.  Am I doing something wrong here?  I believe I'm using the exact code that you typed up.


#7 Wildbug

Wildbug
  • Members
  • PipPipPip
  • Advanced Member
  • 1,149 posts

Posted 06 July 2006 - 01:52 PM

<?php
$broadway = file_get_contents("Broadway_theatre.html");
preg_match_all('/<tr>\n<td><a.+?>(.+?)<\/a>.*?\n.*?<a.+?>(.+?)<\/a>.*?\n.*?<td>(.+?)<\/td>.*?\n.*?<a.+?>(.+?)<\/a>, <a.+?>(\d+)/i',$broadway,$matches);

echo '<pre>',print_r(array_slice($matches,1),TRUE),'</pre>';

?>
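
The code above prints array_slice($matches, 1), which drops the full-match entry and leaves one array per capture group. If one array per table row is more convenient (for example, to insert one row at a time), the PREG_SET_ORDER flag is another option. This is only a sketch on a simplified, made-up two-cell pattern, not the Broadway pattern itself.

```php
<?php
// Hypothetical input: two "rows" of two cells each.
$html = "<td>A</td><td>B</td>\n<td>C</td><td>D</td>";
$pattern = '/<td>(.+?)<\/td><td>(.+?)<\/td>/';

// Default layout (PREG_PATTERN_ORDER): grouped by capture group.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: grouped by match, so each entry is one row.
preg_match_all($pattern, $html, $byRow, PREG_SET_ORDER);

$rows = [];
foreach ($byRow as $match) {
    // Index 0 is the full match; drop it and keep only the captures.
    $rows[] = array_slice($match, 1);
}

print_r($byGroup[1]); // first cell of every row
print_r($rows);       // one array of cells per row
```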


#8 clarencek

clarencek
  • Members
  • PipPip
  • Member
  • 21 posts

Posted 07 July 2006 - 05:23 AM

Hmm... I don't get it.  I'm still getting the same blank array even after cutting and pasting your exact code.  I guess this works on your end?  Not sure why I'm not getting anything.

#9 Wildbug

Wildbug
  • Members
  • PipPipPip
  • Advanced Member
  • 1,149 posts

Posted 07 July 2006 - 02:09 PM

Did you check your file contents variable to be sure it's full of the right thing?
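
One way to do that check, as a sketch. The describeContents() helper is hypothetical, not something from this thread. It reports whether the read succeeded, how many bytes came back, whether a <tr> tag is present at all, and whether the text uses Windows line endings. That last check is a guess at one possible cause, not a confirmed diagnosis: the pattern requires a bare "\n" immediately after <tr>, so a file saved with "\r\n" line endings would not match.

```php
<?php
// Hypothetical helper: summarize what file_get_contents() returned.
function describeContents($contents)
{
    if ($contents === false) {
        // file_get_contents() returns false when the read fails.
        return ['loaded' => false];
    }

    return [
        'loaded'   => true,
        'bytes'    => strlen($contents),
        'has_tr'   => stripos($contents, '<tr>') !== false,
        'has_crlf' => strpos($contents, "\r\n") !== false,
    ];
}

// Made-up sample with Windows line endings; in practice, pass in the
// result of file_get_contents() instead.
$sample = "<tr>\r\n<td>Example</td>\r\n</tr>";
print_r(describeContents($sample));
```

If has_crlf comes back true, replacing "\r\n" with "\n" via str_replace() before running the regex, or writing \r?\n in the pattern, would be things to try.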