
Sorry first post here, help request for preg_match page scrape



Hi everyone,

 

I am making a screen scraper in PHP that scrapes the usernames from forum posts and stores them in an SQL database. Could I get some help with the preg_match part of the code, please?

 

The code and pseudocode I have so far is below (the pseudocode is the part I am having trouble with, but I will try to solve it myself if possible).

Edit: sorry, I edited the post as it was confusing to read. Please ask for clarification if there is anything I have failed to explain properly. Thank you.

 

//I will be placing the following PHP in the confirmation page people see after making a new post, so for this example let's say the referrer header says: http://www.mysite.com/showthread.php?tid=1

 

$threadurl=$_SERVER['HTTP_REFERER'];

 

// scrape the page

$content = file_get_contents($threadurl);

 

// find the pattern in the source that makes it easy to find the username; the only things that change are the uid and the color

if (preg_match("/\b uid=792"><span style="color:#ffcc00">fapafap</span></a>\b /i", $content)) {

 

//extract username from this string search

Don't know!

 

//copy the username (in this case 'fapafap') to the database along with the referral ID

 

$query_insert="INSERT INTO newpostersdatabase(username,referrerurl) VALUES('$username','$threadurl')" ;
$result=mysql_query ( $query_insert);
if(!$result){
die(mysql_error());
}

 

 

Thank you so much for any guidance. I know that I have totally messed up the string search as well, but my brain is too small, it seems!

if (preg_match("/\b uid=[color=red]792[/color]"><span style="color:[color=red]#ffcc00[/color]">fapafap</span></a>\b /i", $content)) {

 

That will not work: you need to escape the square brackets and the delimiter character, in this case the forward slash. You also need to escape your escape characters (the backslashes) if that's what you're trying to match.

 

e.g.

 

<?php

if (preg_match("/\\b uid=\[color=red\]792\[\/color\]\"><span style=\"color:\[color=red\]#ffcc00\[\/color\]\">fapafap<\/span><\/a>\\b /i", $content)) {

}
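As an alternative to escaping everything by hand, preg_quote() can do the escaping for a literal string. This is only a rough sketch: it assumes the page source contains the plain HTML from the first post, and the sample $content string (including the member.php link) is made up for illustration.

```php
<?php
// Hypothetical sample of the page source, for illustration only.
$content = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// The literal text to look for, exactly as it appears in the source.
$literal = 'uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// preg_quote() escapes regex metacharacters; passing '/' as the second
// argument also escapes the delimiter used around the pattern.
$pattern = '/' . preg_quote($literal, '/') . '/i';

// preg_match() returns 1 on a match and 0 when nothing matches.
$found = preg_match($pattern, $content);
```

This only helps for the fixed parts of the string. The parts that change (uid, color, username) still need a proper pattern.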

 

I would narrow down what you're trying to match, because this seems overly complicated. I've done 3 site scrapes recently, which took me a couple of days due to the complexity of the sites. The key was to identify small but unique strings to match on and to leave as much of the pattern variable as you can. E.g. if there is misc text between HTML tags, just match [^<]+, which will get everything up to the next tag.
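To make that concrete, here is a minimal sketch of pulling the username out with a capture group. It assumes the markup looks like the snippet in the first post. The sample $content string is invented for illustration, and the ~ delimiter is just a choice that avoids escaping the forward slashes.

```php
<?php
// Hypothetical sample of the page source, for illustration only.
$content = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// \d+     : the uid (digits only)
// [^"]+   : the color value, whatever it is, up to the closing quote
// ([^<]+) : the username, captured, up to the next tag
$pattern = '~uid=\d+"><span style="color:[^"]+">([^<]+)</span></a>~i';

$username = null;
if (preg_match($pattern, $content, $matches)) {
    // $matches[0] is the whole match, $matches[1] is the first capture group.
    $username = $matches[1];
}
```

If the real page wraps the name differently, the fixed parts of the pattern would need adjusting, but the idea of one capture group around [^<]+ stays the same.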


Hi, thanks for answering. I had to edit it as I didn't realise that you couldn't highlight things within code brackets. Basically, yes, I understand what you are saying, but as I say, the only things that change in the string are the UID, name and color code. On that basis I was thinking of something like:

 

if (preg_match("/\b >**</span>"><span style="color:#**">**</span></a>\b /i", $content))

 

So that it would find the pattern, accepting anything in between the stars as being variable, if you know what I mean.

Why do you have to scrape pages from your own website?

 

Hello! I know it seems daft, but I have a larger project in mind for these skills and thought I would take the time to implement some simpler stuff now as a learning exercise.

With regexp it's very much trial and error.

 

What I find easier is throwing in $matches after $content, which will show you what you've matched (anything wrapped in parentheses, e.g. (), will show up in $matches).

 

And if your regexp is not matching, just keep making your pattern simpler until you get a match, then work forward from there.

 

Also, * is not a wildcard in Perl-style regexp; it is a quantifier meaning zero or more of whatever comes before it. The '.' is the wildcard, so if you wanted a wild match you would use (.*), but this can be sketchy as it will match anything and everything. It's better to use a "not this character" match as mentioned before, e.g. ([^"]+).
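A quick way to see the difference is to dump $matches for both versions. This is just a sketch with a made-up $html string, not anything from the real page:

```php
<?php
// Hypothetical input with two quoted attributes, for illustration only.
$html = '<a href="one.php" title="two">link</a>';

// Greedy .* runs on to the LAST double quote in the string.
preg_match('/href="(.*)"/', $html, $greedy);

// [^"]+ stops at the FIRST closing double quote.
preg_match('/href="([^"]+)"/', $html, $negated);

// Index 0 is the full match, index 1 is the part inside the parentheses.
print_r($greedy);   // [1] => one.php" title="two
print_r($negated);  // [1] => one.php
```

The greedy version swallows the title attribute as well, which is usually not what you want when scraping.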


I see, OK, thank you. (God, this stuff is hard!)
