miniramen Posted May 18, 2010

Hello guys, I'm a new member and I'm in desperate need of help. I learned some PHP and other languages (C++, SQL) but never in detail. I'm trying to understand a crawling script that takes important information from a website and puts it all into a MySQL database. It's a nice script, but I've been asked to improve it. While reading it, I found many PHP statements that I cannot find on PHP.net, which has made my life very difficult. Would anyone mind explaining these?

$sql = new MySQL(); // why is it "new MySQL()"?
--------------------------------------------------
$qry = 'DROP TABLE IF EXISTS TEMP_tblBusiness;';
$sql->Query($qry); // I've never seen -> anywhere before, can anyone please tell me?
--------------------------------------------------
$scraper->items = array(
    'items' => '#<div class="business-data">'.
        '\n\s*\n\n\n\s*<div class="clearfix">\n.*Category.*\n\s*<div class="business-value">\n\s*(.*?)\s*</div>.*\n\s*</div>'.
// what is \n\s(.*?)\s* ? I really want to understand
// and what is clearfix?
--------------------------------------------------
$description = $scraper->getMatch('items', $i, 7); // what does getMatch('items', $i, 7) do?
--------------------------------------------------

I've searched on PHP.net and nothing came up. If anyone would be kind enough to clear this up, thank you very, very much.

.Stealth Posted May 18, 2010

Everything you're asking about is related to classes. Look up OOP in PHP. As for (.*?), I haven't got a clue.

ignace Posted May 18, 2010

// why is it "new MySQL()"?
Because you are creating an object. MySQL here is a class defined by the script (or by a library it includes), not something built into PHP, which is why it does not show up on PHP.net.

// I've never seen -> anywhere before, can anyone please tell me?
It's the object operator: it accesses a property or calls a method of an object.

// what is \n\s(.*?)\s* ? I really want to understand
RegEx (regular expressions).

// and what is clearfix?
clearfix is a CSS class, more info: http://www.webtoolkit.info/css-clearfix.html

// what does getMatch('items', $i, 7) do?
getMatch() is a method of the object $scraper, so it is defined in the scraper's class rather than in PHP itself.
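
To make the answers above concrete, here is a minimal sketch of how `new`, `->`, the regex pieces and a `getMatch()`-style method can fit together. `TinyScraper`, its `scrape()` method and the sample HTML are made up for illustration; the real scraper class in the script may work differently, and reading the third argument of `getMatch()` as a capture-group number is only an assumption. In the pattern, `\n` matches a newline, `\s*` matches zero or more whitespace characters, and `(.*?)` captures as few characters as possible up to whatever follows it.

```php
<?php
// Hypothetical stand-in for the script's scraper class; the real one may differ.
class TinyScraper
{
    public $items = array();      // name => regex pattern
    private $matches = array();   // name => list of matches

    public function scrape($html)
    {
        foreach ($this->items as $name => $pattern) {
            // PREG_SET_ORDER: $found[$i] is match number $i with its capture groups
            preg_match_all($pattern, $html, $found, PREG_SET_ORDER);
            $this->matches[$name] = $found;
        }
    }

    // Assumed meaning: capture group $group of the $index-th match of pattern $name
    public function getMatch($name, $index, $group)
    {
        return isset($this->matches[$name][$index][$group])
            ? $this->matches[$name][$index][$group]
            : null;
    }
}

$scraper = new TinyScraper();              // "new" creates an object from the class
$scraper->items = array(                   // "->" reaches a property of that object...
    'items' => '#<div class="business-value">\n\s*(.*?)\s*</div>#',
);

$html = "<div class=\"business-value\">\n   Ramen shop  </div>";   // made-up sample
$scraper->scrape($html);                   // ...or calls one of its methods
echo $scraper->getMatch('items', 0, 1);    // prints "Ramen shop"
```

Because `(.*?)` is lazy and is surrounded by `\s*`, the captured text comes back without the leading and trailing whitespace.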

miniramen Posted May 19, 2010

Wow, thank you for the fast reply! May I add a question? The script I'm looking at was written to crawl one specific website, so its structure is tied to that site, and I'm trying to find a generic way to do the same thing. Is there a generic way to visit all the pages inside a website by following its hyperlinks, without going to external links? It would be useful if this has already been done, so I can refer to it and customize it a bit. Again, the help is very much appreciated. Thank you!

ignace Posted May 19, 2010

$queue = new SplQueue();
$dom = new DOMDocument();

if ($dom->loadHTMLFile('http://path/to/html/file')) {
    foreach ($dom->getElementsByTagName('a') as $a) {
        if ($a->hasAttributes() && $node = $a->attributes->getNamedItem('href')) {
            $queue->enqueue($node->nodeValue);
        }
    }
}

foreach ($queue as $uri) {
    print $uri;
}

miniramen Posted May 20, 2010

Thanks! I actually used something I found, and it also lets me obtain all the URLs from the whole website. Now I've moved on to the part where I use regex to find a generic pattern for the things I'll be searching for. For example, to find all the postal codes I did:

$Regex = "/[a-zA-Z]{1}[0-9]{1}[a-zA-Z]{1}(\-| |){1}[0-9]{1}[a-zA-Z]{1}[0-9]{1}/";
preg_match_all($Regex, $f_data, $matches, PREG_PATTERN_ORDER);
echo $matches[0][0] . ", " . $matches[0][1] . "\n";
echo $matches[1][0] . ", " . $matches[1][1] . "\n";

But I want all of the matches to be displayed, not just indexes [0][0] to [1][1].

miniramen Posted May 21, 2010

Responding to my own question: it's fixed XD
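
The thread does not say what the fix was, so the following is only one plausible way to print every match. With PREG_PATTERN_ORDER, `$matches[0]` holds all full matches, while `$matches[1]` holds what the first parenthesized group captured (in the pattern above, just the separator), so indexing `$matches[1]` does not yield more postal codes. The sample text and the simplified pattern are made up for this sketch.

```php
<?php
// Letter-digit-letter, optional space or hyphen, digit-letter-digit.
// (?:...) groups without capturing, so only $matches[0] gets filled.
$regex  = '/[a-zA-Z][0-9][a-zA-Z](?:-| )?[0-9][a-zA-Z][0-9]/';
$f_data = 'Offices: H3Z 2Y7, K1A-0B1 and m5v3l9.';   // made-up sample text

preg_match_all($regex, $f_data, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds every full match, however many there are
foreach ($matches[0] as $postalCode) {
    echo $postalCode, "\n";
}
```

Looping over `$matches[0]` avoids hard-coding indexes, so the output grows with the number of codes found in the page.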

miniramen Posted May 21, 2010

First of all, thanks for the help, this forum is extremely resourceful. To crawl all the pages of a website, I'll need to search recursively through all the links. I heard that cURL's FOLLOWLOCATION option might do this. Is that true? If so, how is it actually done?

*Ignace: I tried your code. It's useful, but it's not quite what I want: I need it to keep searching, even on newly found pages, through all the pages inside the website, while skipping external links. This does seem very complicated...
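
On the FOLLOWLOCATION question: `CURLOPT_FOLLOWLOCATION` only tells cURL to follow HTTP redirects, that is, `Location:` headers the server sends back. It does not visit the hyperlinks inside a page, so on its own it will not crawl a site. A rough sketch of what it does do is below; the function name `fetchFollowingRedirects` is made up for this example, and the code assumes the cURL extension is installed.

```php
<?php
// Sketch: fetch ONE page, letting cURL follow HTTP redirects (Location headers).
// This does NOT visit the <a href="..."> links inside the page.
function fetchFollowingRedirects($url, $maxRedirects = 5)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);      // follow Location: redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);  // but not endlessly
    $html     = curl_exec($ch);                          // false on failure
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    return array($finalUrl, $html);
}
```

A crawler could use a function like this to download each page, but finding and queueing the links still has to be done separately.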

ignace Posted May 21, 2010

set_time_limit(0);

class Crawler implements IteratorAggregate
{
    private $urlList = null;

    public function __construct()
    {
        $this->urlList = new ArrayObject();
    }

    public function getUrlList()
    {
        return $this->urlList;
    }

    public function getIterator()
    {
        return $this->urlList->getIterator();
    }

    public function crawl($url)
    {
        $this->urlList->append($url);

        // Fresh document per page: reloading a shared one while looping
        // over its links would replace the nodes mid-iteration.
        $dom = new DOMDocument();
        if ($dom->loadHTMLFile($url)) {
            foreach ($dom->getElementsByTagName('a') as $a) {
                $href = $a->getAttribute('href'); // '' when the attribute is missing
                if (!$this->_isUrl($href)) continue; // trail ends here
                $this->crawl($href);
            }
        }
    }

    private function _isUrl($url)
    {
        return $url !== '' && FALSE !== parse_url($url);
    }
}

Note that this follows every link it finds, including external ones and pages it has already visited, so let's hope none lead to external sources or this script may run forever.
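
One way to keep a crawl like the one above from running forever is to remember which URLs were already seen and to skip links whose host differs from the starting page. The sketch below is not ignace's code: `crawlSite`, `resolveLink` and the `$fetchHtml` callback are names invented for this example, it needs PHP 7 or later, and it resolves only absolute URLs and root-relative paths such as `/about` (other relative forms are simply skipped).

```php
<?php
// Hypothetical helper: handles absolute http(s) URLs and root-relative paths only.
function resolveLink(string $base, string $href): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#' || substr($href, 0, 2) === '//') {
        return null;
    }
    $b = parse_url($base);
    if (preg_match('#^https?://#i', $href)) {
        $abs = $href;
    } elseif ($href[0] === '/' && isset($b['scheme'], $b['host'])) {
        $abs = $b['scheme'] . '://' . $b['host'] . $href;
    } else {
        return null; // mailto:, javascript:, other relative forms: skipped in this sketch
    }
    return explode('#', $abs, 2)[0]; // drop any fragment
}

// Breadth-first crawl that stays on the start URL's host and visits each URL once.
// $fetchHtml is a callback: given a URL, it returns the page's HTML (or '' on failure).
function crawlSite(string $startUrl, callable $fetchHtml, int $maxPages = 100): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = new SplQueue();
    $visited = array($startUrl => true);
    $pages   = array();
    $queue->enqueue($startUrl);

    while (!$queue->isEmpty() && count($pages) < $maxPages) {
        $url     = $queue->dequeue();
        $pages[] = $url;

        $html = $fetchHtml($url);
        if (!is_string($html) || $html === '') {
            continue;
        }
        $dom = new DOMDocument();
        if (!@$dom->loadHTML($html)) { // @ hides warnings about sloppy markup
            continue;
        }
        foreach ($dom->getElementsByTagName('a') as $a) {
            $link = resolveLink($url, $a->getAttribute('href'));
            if ($link === null || isset($visited[$link])) {
                continue; // unusable or already seen
            }
            if (parse_url($link, PHP_URL_HOST) !== $host) {
                continue; // external link
            }
            $visited[$link] = true;
            $queue->enqueue($link);
        }
    }
    return $pages;
}
```

For a real site the callback could be something like `function ($u) { return @file_get_contents($u); }`. Passing the fetcher in as a callback also makes it possible to try the crawl logic on a few in-memory pages before pointing it at a live server.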