djtozz Posted September 30, 2009

Hi, I'm creating a crawler for megaupload.com download links.

Sample link: http://www.megaupload.com/?d=SFMTFBRV

Currently I'm not using the correct pattern; I'm only getting part of the URL: 'http://www.megaupload.com/?d'

    get_urls_by_kwd("\"megaupload.com/?d=\" ".$row[1],"/megaupload\.com\/\?(\d+)/");

Can somebody advise me on the correct pattern to use? Thanks

djtozz Posted September 30, 2009

Anybody? Thanks

thebadbad Posted September 30, 2009

    '~(?:http://)?(?:www\.)?megaupload\.com/\?d=[0-9a-z]{8}~i'

Assuming the ID consists of a-z, A-Z and/or 0-9, and that it's always 8 characters long.

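For anyone who wants to try the pattern above in isolation, here is a minimal sketch. The sample text and the capture group around the ID are additions for illustration (only the first link comes from this thread), and it calls preg_match_all() directly rather than the OP's get_urls_by_kwd(), whose internals aren't shown here.

```php
<?php
// Hypothetical sample text: only the first link appears in the thread.
$text = 'See http://www.megaupload.com/?d=SFMTFBRV and megaupload.com/?d=abc12345 '
      . 'but not megaupload.com/?d=short';

// Using ~ as the delimiter means the slashes in the URL need no escaping.
// The parentheses around the ID are an addition so the ID can be read separately.
$pattern = '~(?:http://)?(?:www\.)?megaupload\.com/\?d=([0-9a-z]{8})~i';

preg_match_all($pattern, $text, $matches);

$links = $matches[0]; // full matched links
$ids   = $matches[1]; // captured 8-character IDs

print_r($links);
print_r($ids);
```

The third candidate is skipped because its ID is shorter than 8 characters, which is exactly the assumption stated above.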
dreamwest Posted September 30, 2009

    $html = file_get_contents('http://www.megaupload.com');
    // Looking for IDs such as d=SFMTFBRV
    preg_match_all('~d\s?=\s?([0-9a-z]{8})~i', $html, $matches);
    foreach ($matches[1] as $link) {
        $link = trim($link);
        echo "http://www.megaupload.com/?d={$link}<br>";
    }

djtozz Posted September 30, 2009

thebadbad wrote:
    '~(?:http://)?(?:www\.)?megaupload\.com/\?d=[0-9a-z]{8}~i'
    Assuming the ID consists of a-z, A-Z and/or 0-9, and that it's always 8 characters long.

Thanks for the help! I think I made a little typo in line 3 while integrating it into my script, because I'm getting the following error:

    Warning: preg_match_all() [function.preg-match-all]: Unknown modifier '\'

The others are working fine!

    get_urls_by_kwd("\"rapidshare.com/files\" ".$row[1],"/rapidshare\.com\/files\/(\d+)\/([^\'^\"^\s^>^<^\\^\/]+)/",1);
    get_urls_by_kwd("\"badongo.com/file\" ".$row[1],"/badongo\.com\/file\/(\d+)/",2);
    get_urls_by_kwd("\"megaupload.com/?d=\" ".$row[1],"/megaupload\.com/\?d=[0-9a-z]{8}~i/",3);
    get_urls_by_kwd("\"sendspace.com/file\" ".$row[1],"/sendspace\.com\/file\/(\w+)/",4);
    get_urls_by_kwd("\"4shared.com/file\" ".$row[1],"/4shared\.com\/file\/(\d+)\/(\w+)\/([^\'^\"^\s^>^<^\\^\/]+)/",5);

djtozz Posted September 30, 2009

dreamwest wrote:
    $html = file_get_contents('http://www.megaupload.com');
    // Looking for IDs such as d=SFMTFBRV
    preg_match_all('~d\s?=\s?([0-9a-z]{8})~i', $html, $matches);
    foreach ($matches[1] as $link) {
        $link = trim($link);
        echo "http://www.megaupload.com/?d={$link}<br>";
    }

Thanks for the feedback, but I'm not sure how to integrate it into my current code. Since the code is already working for the other file sharing sites, I guess I only need to change the pattern in line 3:

    get_urls_by_kwd("\"rapidshare.com/files\" ".$row[1],"/rapidshare\.com\/files\/(\d+)\/([^\'^\"^\s^>^<^\\^\/]+)/",1);
    get_urls_by_kwd("\"badongo.com/file\" ".$row[1],"/badongo\.com\/file\/(\d+)/",2);
    get_urls_by_kwd("\"megaupload.com/?d=\" ".$row[1],"/megaupload\.com/\?d=[0-9a-z]{8}~i/",3);
    get_urls_by_kwd("\"sendspace.com/file\" ".$row[1],"/sendspace\.com\/file\/(\w+)/",4);
    get_urls_by_kwd("\"4shared.com/file\" ".$row[1],"/4shared\.com\/file\/(\d+)\/(\w+)\/([^\'^\"^\s^>^<^\\^\/]+)/",5);

I'm not sure how to do that, though. Thanks

thebadbad Posted October 1, 2009

    get_urls_by_kwd("\"megaupload.com/?d=\" ".$row[1],"/megaupload\.com\/\?d=[0-9a-z]{8}/i",3);

I was using ~ as the pattern delimiter, and you're using /. Fixed that. The i modifier makes the search case-insensitive.

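To illustrate the delimiter point, here is a small sketch comparing the broken pattern from the OP's line 3 with the two working spellings. The variable names are made up for the example, and the @ operator is used only to keep the warning out of the output.

```php
<?php
$subject = 'http://www.megaupload.com/?d=SFMTFBRV';

// Broken: with / as the delimiter, the unescaped / after "com" ends the
// pattern, so the "\" that follows is read as a modifier and rejected.
$broken = '/megaupload\.com/\?d=[0-9a-z]{8}~i/';
$brokenResult = @preg_match($broken, $subject); // false; @ hides the warning

// Working, slash-delimited: every literal / inside the pattern is escaped.
$slashes = '/megaupload\.com\/\?d=[0-9a-z]{8}/i';

// Working, tilde-delimited: no slash escaping needed.
$tildes = '~megaupload\.com/\?d=[0-9a-z]{8}~i';

$slashResult = preg_match($slashes, $subject, $m1);
$tildeResult = preg_match($tildes, $subject, $m2);

var_dump($brokenResult, $slashResult, $tildeResult, $m1[0] === $m2[0]);
```

Both working versions match the same text; the choice of delimiter only changes which characters have to be escaped.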
redarrow Posted October 1, 2009

I think {0,10} would be better for the end... And ~ as pattern delimiters, why was that changed?

thebadbad Posted October 1, 2009

redarrow wrote:
    I think {0,10} would be better for the end... And ~ as pattern delimiters, why was that changed?

Why?? Megaupload IDs are always 8 chars long AFAIK. I swapped to slashes since the OP is using them in the rest of the patterns (to be less confusing).

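A rough comparison of the two quantifiers, assuming (as stated above) that a valid ID is exactly 8 characters. The truncated and empty sample strings are hypothetical; they are there only to show what a looser quantifier would let through.

```php
<?php
$fixed = '/megaupload\.com\/\?d=[0-9a-z]{8}/i';    // exactly 8 characters
$loose = '/megaupload\.com\/\?d=[0-9a-z]{0,10}/i'; // anything from 0 to 10

$samples = [
    'megaupload.com/?d=SFMTFBRV', // complete ID from the thread
    'megaupload.com/?d=SFM',      // hypothetical truncated ID
    'megaupload.com/?d=',         // hypothetical link with no ID at all
];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        'fixed' => preg_match($fixed, $s) === 1,
        'loose' => preg_match($loose, $s) === 1,
    ];
}

print_r($results);
```

With {0,10} all three samples match, including the link with no ID, while {8} accepts only the complete one.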
djtozz Posted October 1, 2009

Thank you guys for the help! It works like a charm!