Data scraping, preg_match_all/regex questions


memoryproblems

First off, I'm pretty new at this, so please try not to laugh (too hard) at me.

 

I'm trying to put together a script to scrape some data out of a page's source for me. This is for an online game, and I'm looking to pull out everything that appears in place of DATA in the code shown below.

title="Ruler: DATA">

 

I've looked around the web (again, I'm very new), found a few tutorials that look interesting, and went about doing this with regex and preg_match_all.

 

Here is my code:

 

<?php
$data = file_get_contents('scrapedata.html');
$regex = '/title="Ruler: (.+?)">/';
preg_match_all($regex,$data,$match);
var_dump($match);
echo ($match);
?>

 

I've got two problems:

1) var_dump($match) spits out the entire array, but echo ($match) says only "Array".

If I change preg_match_all to simply preg_match, echo ($match) shows the first item that I'm looking for, but obviously it doesn't go through the entire source to find all the instances of what I'm looking for. (each page has roughly 20 items that I'm looking to collect)

 

My main question here is: how do I take the results of preg_match_all (which is an array) and echo the items in that array one by one?
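A minimal sketch of one way to do that, assuming the short HTML string below stands in for the contents of scrapedata.html (the names Alice and Bob are made up). preg_match_all fills $match[0] with the full matches and $match[1] with whatever the first pair of parentheses captured, so looping over $match[1] echoes each item on its own line. echo on the whole array only ever prints "Array".

```php
<?php
// Hypothetical sample standing in for the contents of scrapedata.html.
$data = '<img title="Ruler: Alice"><img title="Ruler: Bob">';

$regex = '/title="Ruler: (.+?)">/';
preg_match_all($regex, $data, $match);

// $match[0] holds the full matches, $match[1] the text captured by (.+?).
$rulers = $match[1];

// Echo each captured value on its own line.
foreach ($rulers as $ruler) {
    echo $ruler, "\n";
}
```

When viewing the result in a browser rather than on the command line, "<br>" in place of "\n" is probably what you want.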

 

2) For what I'm doing, I need to do two different versions, one just like I coded above, and another that modifies the $regex line. In the source code, there is a variable that can be listed among the data, and I want to skip over any listing that has that variable.

 

For example,

 

I want to collect it if it's like this:

 

	<td>
<p align="center">
<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a>


</td>

 

But if it's like this, I want to skip over it:

 

	<td>
<p align="center">
<a href="send_message.asp?Nation_ID=XXXXXX"><img border="0" src="assets/compose_message.png" width="16" height="16" title="Ruler: DATA"></a>


<a href="stats_alliance_stats_custom.asp?Alliance=Rapture"><img src="images/alliance_statistic.gif" border="0" title="Alliance: DATA"></a>


</td>

 

I figured that the way to do this would be to change the $regex line to

 

$regex = '/title="Ruler: (.+?)"></a></td>/';

 

but it returns a warning (shown below), and var_dump($match) prints null:

 

Warning: preg_match_all() [function.preg-match-all]: Unknown modifier 'a' in /home/virtual/site80/fst/var/www/html/scraper/scraper.php on line 4

 

null

 

Is there some way to put the </a></td> into the $regex line and have that work?
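The warning comes from the pattern delimiter: the regex starts with /, so the / in </a> is read as the end of the pattern and the a after it as a modifier. Either escape each slash as \/ or pick a delimiter that doesn't occur in the pattern, such as #. Below is a rough sketch of the second option, not a definitive answer. The sample HTML and the names in it are made up, \s* is an assumption to cover the blank lines that sit between </a> and </td> in the page source, and [^"]+ is swapped in for (.+?) so the capture cannot run past the end of the title attribute.

```php
<?php
// Hypothetical sample: the first cell has only a ruler, the second also has an alliance link.
$data = <<<HTML
<td>
<a href="send_message.asp?Nation_ID=1"><img title="Ruler: Alice"></a>

</td>
<td>
<a href="send_message.asp?Nation_ID=2"><img title="Ruler: Bob"></a>

<a href="stats_alliance_stats_custom.asp?Alliance=Rapture"><img title="Alliance: Rapture"></a>
</td>
HTML;

// '#' as the delimiter means the '/' in </a> and </td> needs no escaping.
// [^"]+ keeps the capture inside the title attribute; \s* allows line breaks before </td>.
$regex = '#title="Ruler: ([^"]+)">\s*</a>\s*</td>#';
preg_match_all($regex, $data, $match);

// Only rulers whose cell closes right after the message link are captured.
$unaligned = $match[1];

foreach ($unaligned as $ruler) {
    echo $ruler, "\n";
}
```

With this sample, Bob is skipped because his cell continues with the alliance link instead of closing with </td>.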

 

Sorry if my questions are a little dumb, been trying to find answers to this all day (and fighting off the inevitable heart attack from all the frustration) with little luck.

 

Thanks for any insight you might have

mp
