[HELP] Don't understand script

miniramen · May 18, 2010

Hello guys,

I'm a new member and I'm in desperate need of help. I've learned some PHP and other kinds of coding (C++, SQL), but never in detail.

I'm trying to understand a crawling script that takes important information from a website and puts it all into a MySQL database. It's a nice script, but I've been asked to improve it.

While going through the script, I found many PHP statements that I cannot find on PHP.net. I don't know why, but it has made my life very difficult.

Would anyone mind telling me:

$sql = new MySQL();

//why is it "new MySQL();"?//

--------------------------------------------------

$qry = 'DROP TABLE IF EXISTS TEMP_tblBusiness;';

$sql->Query($qry);

//I've never seen -> anywhere before, can anyone please tell me?//

--------------------------------------------------

$scraper->items = array(

'items' => '#<div class="business-data">'.

'\n\s*\n\n\n\s*<div class="clearfix">\n.*Category.*\n\s*<div class="business-value">\n\s*(.*?)\s*</div>.*\n\s*</div>'.

//what is \n\s(.*?)\s* .......I really want to understand//

//and what is clearfix//

-------------------------------------------------

$description = $scraper->getMatch('items', $i, 7);

//what does getMatch('items', $i, 7) do?//

------------------------------------------------

I've searched on PHP.net and nothing came up.

If anyone would be kind enough to clear this up, thank you very very much.

.Stealth · May 18, 2010

Everything you're asking about is related to classes. Look up OOP PHP.

As for (.*?), I haven't got a clue.

ignace · May 18, 2010

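//why is it "new MySQL();"?//

Because you are creating an object. "new" instantiates the class MySQL, which is defined somewhere in the script's own files and is not a PHP built-in. That's why PHP.net has nothing on it.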

//I've never seen -> anywhere before, can anyone please tell me?//

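It's the object operator: it accesses a property or calls a method on an object. $sql->Query($qry) calls the Query() method of the object $sql.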
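
To make the two answers above concrete, here is a minimal sketch. Greeter and greet() are made-up names, not part of your script or of PHP; the point is only that "new" builds an object from a class somebody wrote, and "->" reaches into that object.

```php
<?php
// Hypothetical class for illustration only. The real MySQL class should be
// somewhere in the crawler's own source files (search for "class MySQL").
class Greeter {
    public $name;                      // a property, read with $obj->name

    public function __construct($name) {
        $this->name = $name;
    }

    public function greet() {          // a method, called with $obj->greet()
        return 'Hello, ' . $this->name;
    }
}

$g = new Greeter('miniramen');         // "new" creates an object of the class
$greeting = $g->greet();               // "->" calls a method on that object
echo $greeting, "\n";
echo $g->name, "\n";                   // "->" also reads a property
```

Searching the script's files for "class MySQL" and "function Query" should turn up the real definitions.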

//what is \n\s(.*?)\s* .......I really want to understand//

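RegEx (regular expressions). \n matches a newline, \s* matches any amount of whitespace (including none), and (.*?) captures any characters, as few as possible. On PHP.net it's documented under PCRE (the preg_* functions).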
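
A small sketch of how those pieces work together. The HTML string below is made up to resemble the markup your pattern targets, and the pattern is a shortened version of yours, not the original.

```php
<?php
// Made-up HTML fragment, shaped like the markup the scraper pattern targets.
$html = "<div class=\"business-value\">\n   Restaurants   \n</div>";

// \n     a newline
// \s*    zero or more whitespace characters (spaces, tabs, newlines)
// (.*?)  capture group: any characters, as few as possible ("lazy")
// The # characters are delimiters marking the start and end of the pattern.
$pattern = '#<div class="business-value">\n\s*(.*?)\s*</div>#';

$category = null;
if (preg_match($pattern, $html, $m)) {
    $category = $m[1];   // $m[0] is the whole match, $m[1] the first group
    echo $category, "\n";
}
```

The surrounding \s* parts swallow the indentation, so the captured group holds only the text itself.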

//and what is clearfix//

clearfix is a CSS class, more info: http://www.webtoolkit.info/css-clearfix.html

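//what does getMatch('items', $i, 7) do?//

getMatch() is a method of the object $scraper. It's defined in the scraper's class inside the script, not in PHP itself, so you'll have to read that class to see what the arguments 'items', $i and 7 mean.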

miniramen · May 19, 2010

Wow! Thank you for the fast reply. May I add a question?

The script I'm looking at was made to crawl one specific website, so it is structured around that site, and I'm working toward a generic way to do it.

So I would like to ask: is there a generic way to visit all the pages inside a website by following its hyperlinks, without going out to external links? It would be useful if this has already been done, so I can refer to it and customize it a bit.

Again, the help is very much appreciated.

Thank you!

ignace · May 19, 2010

// Collects every link found on a single page
$queue = new SplQueue();

$dom = new DomDocument();
// @ silences the warnings libxml raises on real-world, malformed HTML
if (@$dom->loadHtmlFile('http://path/to/html/file')) {
  foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttributes() && $node = $a->attributes->getNamedItem('href')) {
      $queue->enqueue($node->nodeValue);
    }
  }
}

foreach ($queue as $uri) {
  print $uri . "\n";
}

miniramen · May 20, 2010

Thanks! I actually used something I found, and it also lets me obtain all the URLs from the whole website.

Now I have moved on to the part where I'm using regex to find the right generic pattern for the things I'll be searching for.

For example, I did:

$Regex = "/[a-zA-Z]{1}[0-9]{1}[a-zA-Z]{1}(\-| |){1}[0-9]{1}[a-zA-Z]{1}[0-9]{1}/";

preg_match_all ($Regex, $f_data, $matches, PREG_PATTERN_ORDER);

echo $matches[0][0] . ", " . $matches[0][1] . "\n";

echo $matches[1][0] . ", " . $matches[1][1] . "\n";

to find all the postal codes. But I want all of the matches to be displayed, not just the ones at indexes [0][0] to [1][1].

miniramen · May 21, 2010

Answering my own question: it's fixed XD
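
For anyone who finds this later: the exact fix isn't posted here, but one common way to do it is to loop over $matches[0]. With PREG_PATTERN_ORDER, $matches[0] holds every complete match, while $matches[1] holds only what the first pair of parentheses captured (here, the separator). A sketch, with made-up sample text standing in for $f_data and a slightly simplified version of the pattern:

```php
<?php
// Sample text standing in for $f_data (the downloaded page); made up here.
$f_data = 'Offices: H2X 1Y4, K1A-0B1 and M5V3L9.';

// Same idea as the pattern above: letter digit letter, an optional
// space or hyphen, then digit letter digit.
$regex = '/[a-zA-Z][0-9][a-zA-Z][- ]?[0-9][a-zA-Z][0-9]/';

preg_match_all($regex, $f_data, $matches, PREG_PATTERN_ORDER);

// $matches[0] is the list of complete matches, however many there are.
foreach ($matches[0] as $postalCode) {
    echo $postalCode, "\n";
}
```

count($matches[0]) gives the number of postal codes found, so no index needs to be hard-coded.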

miniramen · May 21, 2010

Oh, first of all, thanks for the help; this forum is extremely resourceful.

Again, in order to crawl all the pages of a website, I'll need to search recursively through all the links.

I heard that cURL's FOLLOWLOCATION option might actually do this. Is that true?

If so, how is it actually done?

*Ignace: I tried your code. It's useful, but it's not what I want: I need it to keep searching nonstop, even on newly found pages, through all the pages inside the website, while skipping external links. This does seem very complicated...

ignace · May 21, 2010

set_time_limit(0);

class Crawler implements IteratorAggregate {
  private $urlList = null;
  
  public function __construct() {
    $this->urlList = new ArrayObject();
  }
  
  public function getUrlList() {
    return $this->urlList;
  }
  
  public function getIterator() {
    return $this->urlList->getIterator();
  }
  
  public function crawl($url) {
    // Skip pages already visited, otherwise circular links recurse forever
    if (in_array($url, $this->urlList->getArrayCopy(), true)) return;
    $this->urlList->append($url);
    
    // A fresh document per page: reloading a shared one would invalidate
    // the node list being iterated below
    $dom = new DomDocument();
    if (@$dom->loadHtmlFile($url)) {
      foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');// '' when there is no href
        if (!$this->_isUrl($href)) continue;//trail ends here
        
        $this->crawl($href);
      }
    }
  }
  
  private function _isUrl($url) {
    // Only absolute http(s) URLs are followed; relative links are skipped
    return (bool) preg_match('#^https?://#i', $url);
  }
}

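Nothing here stops it from wandering off to external sites, so let's hope none of the links lead to external sources or this script may run forever. And no, cURL's CURLOPT_FOLLOWLOCATION won't do this for you: it only follows HTTP redirects (Location: headers), not the links inside a page.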
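
To keep a crawl on one site, each link can be checked before it is followed. The sketch below is one possible check, not part of the class above: isInternalLink() is a hypothetical helper name, and it assumes $baseUrl is an absolute URL such as the page you started from.

```php
<?php
// Hypothetical helper: true when $href points at the same host as $baseUrl.
// Relative links (no host part) count as internal.
function isInternalLink($href, $baseUrl) {
    $baseHost = (string) parse_url($baseUrl, PHP_URL_HOST);
    $parts = parse_url($href);
    if ($parts === false) {
        return false;                       // seriously malformed URL
    }
    if (isset($parts['scheme'])
        && !in_array(strtolower($parts['scheme']), array('http', 'https'), true)) {
        return false;                       // mailto: and other non-web schemes
    }
    if (!isset($parts['host'])) {
        return true;                        // relative link such as /about.html
    }
    // Host names are case-insensitive
    return strcasecmp($parts['host'], $baseHost) === 0;
}

$base = 'http://example.com/index.html';
$links = array('/about.html', 'http://example.com/contact', 'http://other.org/');
foreach ($links as $link) {
    echo $link, ' => ', isInternalLink($link, $base) ? 'internal' : 'external', "\n";
}
```

Relative links that pass the check still have to be turned into absolute URLs (resolved against the page they were found on) before they can be loaded; that step is left out here.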
