fapapfap Posted December 22, 2011

Hi everyone,

I am making a screen scraper in PHP which scrapes the usernames from forum posts and stores them in an SQL database. I need some help with part of the preg_match code, if possible, please. The code and pseudocode I have so far are below (the pseudocode is the part I am having trouble with, but I will try to solve it myself if possible).

Edit: sorry, I edited the post as it was confusing to read. Please ask for clarification if there is anything I have failed to explain properly. Thank you.

// I will be placing the following PHP in the confirmation page people see after making a new post,
// so for this example let's say the referrer header says: http://www.mysite.com/showthread.php?tid=1
$threadurl = $_SERVER['HTTP_REFERER'];

// scrape the page
$content = file_get_contents($threadurl);

// find the pattern in the source which makes it easy to find the username;
// the only things that change are the uid and the color
if (preg_match("/\b uid=792"><span style="color:#ffcc00">fapafap</span></a>\b /i", $content)) {

    // extract username from this string search
    // Don't know!

    // copy the username (in this case 'fapafap') to the database along with the referral ID
    $query_insert = "INSERT INTO newpostersdatabase(username,referrerurl) VALUES('$username','$threadurl')";
    $result = mysql_query($query_insert);
    if (!$result) {
        die(mysql_error());
    }
}

Thank you so much for any guidance. I know that I have totally messed up the string search as well, but my brain is too small, it seems!

Link to comment https://forums.phpfreaks.com/topic/253715-sorry-first-post-here-help-request-for-preg_match-page-scrape/
AGuyWithAthing Posted December 22, 2011

if (preg_match("/\b uid=[color=red]792[/color]"><span style="color:[color=red]#ffcc00[/color]">fapafap</span></a>\b /i", $content)) {

That will not work. You need to escape the square brackets and the delimiter character, in this case the forward slash. You also need to escape your escape characters (the backslashes), if that's what you're trying to match, as well as the double quotes inside the double-quoted string. e.g.

<?php
// Every "/" inside the pattern is escaped, including those in the closing tags,
// because "/" is also the pattern delimiter.
if (preg_match("/\\b uid=\[color=red\]792\[\/color\]\"><span style=\"color:\[color=red\]#ffcc00\[\/color\]\">fapafap<\/span><\/a>\\b /i", $content)) {
}

I would narrow down what you're trying to match, because this seems overly complicated. I've done three site scrapes recently, which took me a couple of days due to the complexity of the sites. The key was to identify small but unique strings to match on and to allow for as many variable characters as you can. For example, if there is miscellaneous text between HTML tags, just match [^<]+, which will get everything up to the next tag.
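As a rough illustration of the escaping advice above, here is a minimal sketch. It assumes the page markup looks like the sample in the first post; the `$html` string (including the `member.php` link) and the variable names are made up for the example. It uses `~` as the delimiter so the slashes in closing tags need no escaping, lets `preg_quote()` escape a literal string, and then swaps the parts that vary for character classes such as `[^<]+`.

```php
<?php
// Hypothetical sample of the markup described in the first post (an assumption).
$html = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// Literal text to find, exactly as it appears in the page source.
$literal = 'uid=792"><span style="color:#ffcc00">fapafap</span></a>';

// preg_quote() backslash-escapes regex metacharacters; passing the delimiter
// as the second argument escapes that character as well.
$pattern = '~' . preg_quote($literal, '~') . '~i';
$found = preg_match($pattern, $html) === 1;

// Same idea, but the uid, color and username are allowed to vary:
//   \d+           one or more digits (the uid)
//   [0-9a-f]{3,6} a hex color code
//   [^<]+         any text up to the next "<" (the username)
$flexible = '~uid=\d+"><span style="color:#[0-9a-f]{3,6}">[^<]+</span></a>~i';
$foundFlexible = preg_match($flexible, $html) === 1;

echo $found ? "literal pattern matched\n" : "literal pattern did not match\n";
echo $foundFlexible ? "flexible pattern matched\n" : "flexible pattern did not match\n";
```

Writing the pattern in a single-quoted string also avoids having to escape the double quotes and double up the backslashes.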
requinix Posted December 22, 2011

Why do you have to scrape pages from your own website?
fapapfap (Author) Posted December 22, 2011

Hi, thanks for answering. I had to edit my post as I didn't realise that you couldn't highlight things within code brackets. Basically, yes, I understand what you are saying, but as I say, the only things that change in the string are the UID, the name and the color code. On that basis I was thinking of something like:

if (preg_match("/\b >**</span>"><span style="color:#**">**</span></a>\b /i", $content))

so that it would find the pattern, accepting anything in between the stars as being variable, if you know what I mean.
fapapfap (Author) Posted December 22, 2011

"Why do you have to scrape pages from your own website?"

Hello! I know it seems daft, but I have a larger project in mind for these skills and thought I would take the time to implement some simpler stuff now as a learning exercise.
AGuyWithAthing Posted December 22, 2011

With regexp it's very much trial and error. What I find easier is passing $matches as the third argument, after $content, which will show you what you've matched (anything wrapped in parentheses, i.e. (), will show up in $matches). And if your regexp is not matching, just keep making the pattern simpler until you get a match, then work forward from there.

Also, * is not a wildcard in Perl-style regexp; it means zero or more of whatever precedes it. The '.' is the wildcard (by default it matches any single character except a newline), so if you wanted a wild match you would use (.*). This can be sketchy, though, because it is greedy and will match anything and everything, so it's better to use a negated character class as stated before, e.g. ([^"]+).
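To make the difference between (.*) and a negated character class concrete, here is a small sketch. The `$content` string is a made-up fragment with two posters on one line (the second name and the `member.php` links are assumptions), and `$greedy` and `$narrow` are just illustrative names for the `$matches` arrays.

```php
<?php
// Hypothetical page fragment containing two posters on a single line.
$content = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a> '
         . '<a href="member.php?uid=15"><span style="color:#00ff00">someoneelse</span></a>';

// Greedy wildcard: (.*) keeps going until the LAST "</span>" on the line.
preg_match('~<span style="color:#[0-9a-f]+">(.*)</span>~i', $content, $greedy);

// Negated character class: ([^<]+) stops at the first "<" it meets.
preg_match('~<span style="color:#[0-9a-f]+">([^<]+)</span>~i', $content, $narrow);

// Index 0 holds the whole match; index 1 holds the first parenthesised group.
echo "greedy capture: ", $greedy[1], "\n";
echo "narrow capture: ", $narrow[1], "\n";
```

Dumping the whole array with `print_r($narrow)` while experimenting shows exactly which part of the page each group captured.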
fapapfap (Author) Posted December 22, 2011

I see, OK, thank you. (God, this stuff is hard!)
fapapfap (Author) Posted December 23, 2011

Hi, sorry, this is my last attempt to see if anyone can come up with a solution to this; I am still struggling. Or maybe someone could point me in the direction of some good resources?
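One possible way to put the advice in this thread together is sketched below. It is not a definitive solution: it assumes the markup really has the form uid=NUMBER"><span style="color:#HEX">NAME</span></a>, the function names `extractUsername()` and `storePoster()` are hypothetical, and it swaps the original `mysql_query()` call (the mysql extension was removed in PHP 7) for a PDO prepared statement, which also keeps the scraped username out of the SQL string.

```php
<?php
/**
 * Return the first username found in $html, or null if nothing matches.
 * Assumed markup: uid=NUMBER"><span style="color:#HEX">NAME</span></a>
 */
function extractUsername(string $html): ?string
{
    // ([^<]+) captures the text between the <span> tags into $matches[1].
    $pattern = '~uid=\d+"><span style="color:#[0-9a-f]{3,6}">([^<]+)</span></a>~i';
    if (preg_match($pattern, $html, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

/**
 * Store the username and thread URL using a prepared statement.
 * Table and column names are taken from the first post; $pdo is assumed
 * to be an already connected PDO instance.
 */
function storePoster(PDO $pdo, string $username, string $threadUrl): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO newpostersdatabase (username, referrerurl) VALUES (?, ?)'
    );
    $stmt->execute([$username, $threadUrl]);
}

// Hard-coded sample instead of file_get_contents($threadurl), so the
// extraction step can be tried without a live page or a database.
$sample = '<a href="member.php?uid=792"><span style="color:#ffcc00">fapafap</span></a>';
$username = extractUsername($sample);
echo $username ?? 'no username found', "\n";
```

In the real page, `$sample` would be replaced by the fetched `$content`, and `storePoster()` would only be called when `extractUsername()` returns something other than null.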