
[HELP] Don't understand script


miniramen


Hello guys,

 

I'm a new member and I'm in desperate need of help. I have learned some PHP and other kinds of coding (C++, SQL), but never in any detail.

 

I was trying to understand a crawling script that takes important information from a website and puts it all into a MySQL database. It's a nice script, but I've been asked to improve it.

 

While going through this script, I found many PHP statements that I cannot find on PHP.net. I do not know why, but it has made my life very difficult.

Would anyone mind telling me:

 

$sql = new MySQL(); 

 

//why is it "new mysql(); ?//

--------------------------------------------------

$qry = 'DROP TABLE IF EXISTS TEMP_tblBusiness;';

$sql->Query($qry);

 

//I've never seen -> anywhere before, can anyone please tell me?//

 

--------------------------------------------------

$scraper->items = array(

    'items' => '#<div class="business-data">'.

    '\n\s*\n\n\n\s*<div class="clearfix">\n.*Category.*\n\s*<div class="business-value">\n\s*(.*?)\s*</div>.*\n\s*</div>'.

 

//what is \n\s(.*?)\s* .......I really want to understand//

//and what is clearfix//

-------------------------------------------------

$description = $scraper->getMatch('items', $i, 7);

//what does getMatch('items', $i, 7) do?//

------------------------------------------------

 

I've searched on PHP.net and nothing came up.

If anyone would be kind enough to clear this up, thank you very very much.

 

Link to comment
Share on other sites

//why is it "new mysql(); ?//

 

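Because you are creating an object. "new" instantiates a class, and MySQL here is presumably a class defined by the script (or a file it includes) rather than a PHP built-in, which is why PHP.net does not list it.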

 

//I've never seen -> anywhere before, can anyone please tell me?//

 

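It's the object operator: it accesses a property or calls a method on an object, so $sql->Query($qry) calls the Query() method of $sql.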
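Both answers can be seen together in a small sketch. The QueryLog class below is a hypothetical stand-in for the script's own MySQL class (whose real code is not shown in this thread); it only records the queries it is given instead of talking to a database.

```php
<?php
// Hypothetical stand-in for the script's MySQL class: it records queries only.
class QueryLog {
  public $queries = array(); // a property, reached with ->

  public function Query($qry) { // a method, called with ->
    $this->queries[] = $qry;
    return count($this->queries);
  }
}

$sql = new QueryLog(); // "new" creates an object from the class
$sql->Query('DROP TABLE IF EXISTS TEMP_tblBusiness;');
echo count($sql->queries), "\n"; // prints 1
```

Searching the script's files for "class MySQL" should turn up the real definition in the same way.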

 

//what is \n\s(.*?)\s* .......I really want to understand//

 

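That is RegEx (regular expressions): \n matches a newline, \s* matches any run of whitespace (including none), and (.*?) captures as few characters as possible.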
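A minimal sketch of how those pieces behave, using a shortened piece of the pattern from the question. The $html sample is made up for illustration, and preg_match() is assumed here; the thread does not show which function the scraper actually calls.

```php
<?php
// Made-up sample input: a value surrounded by newlines and spaces.
$html = "<div class=\"business-value\">\n   Restaurants  \n</div>";

// # is the delimiter; \n = newline, \s* = optional whitespace,
// (.*?) = capture group that takes as little text as possible.
$pattern = '#<div class="business-value">\n\s*(.*?)\s*</div>#';

if (preg_match($pattern, $html, $m)) {
  echo $m[1], "\n"; // prints Restaurants
}
```

The surrounding \s* parts swallow the padding, so only the text itself ends up in the capture group.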

 

//and what is clearfix//

 

clearfix is a CSS class, more info: http://www.webtoolkit.info/css-clearfix.html

 

//what does getMatch('items', $i, 7) do?//

 

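getMatch() is a method of the object $scraper, so it is defined in the scraper's class rather than in PHP itself; look there to see what the arguments mean.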

Link to comment
Share on other sites

wow!! Thank you for the fast reply. Is it possible to add a question?

The script that I'm looking at was made to crawl one specific website, so it is structured around that site, and I'm working toward a generic way to do it.

Therefore, I would like to ask whether there is a generic way to visit all the pages of a website by following its hyperlinks, without following external links. It would be useful if this has already been done, so I can refer to it and customize it a bit.

 

Again, the help is very much appreciated.

Thank you !!!!

Link to comment
Share on other sites

$queue = new SplQueue();

$dom = new DOMDocument();
// @ silences the warnings libxml raises on real-world, non-valid HTML
if (@$dom->loadHTMLFile('http://path/to/html/file')) {
  // collect the href of every <a> element on the page
  foreach ($dom->getElementsByTagName('a') as $a) {
    if ($a->hasAttribute('href')) {
      $queue->enqueue($a->getAttribute('href'));
    }
  }
}

foreach ($queue as $uri) {
  print $uri . "\n";
}

Link to comment
Share on other sites

Thanks!!! I actually used something I found, and it also lets me obtain all the URLs from the whole website.

Now I have moved on to the part where I'm using regex to find the right generic pattern for the things I'll be searching for.

 

For example I did:

 

$Regex = "/[a-zA-Z]{1}[0-9]{1}[a-zA-Z]{1}(\-| |){1}[0-9]{1}[a-zA-Z]{1}[0-9]{1}/";

    preg_match_all ($Regex, $f_data, $matches, PREG_PATTERN_ORDER);

    echo $matches[0][0] . ", " . $matches[0][1] . "\n";

    echo $matches[1][0] . ", " . $matches[1][1] . "\n";

 

That is meant to find all the postal codes. But I want all of the matches to be displayed, not just [0][0] to [1][1].
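One way to display every match is to loop over $matches[0]. With PREG_PATTERN_ORDER, $matches[0] holds all full matches, while $matches[1] holds only what the first parenthesised group captured (here, the separator). The sketch below assumes made-up sample text in $f_data and shortens the pattern to an equivalent form with a non-capturing optional separator.

```php
<?php
// Made-up sample text standing in for the downloaded page.
$f_data = 'Offices at H3A 1B9, K1A-0B1 and M5V2T6.';

// letter digit letter, optional "-" or space, digit letter digit
$Regex = '/[a-zA-Z][0-9][a-zA-Z][- ]?[0-9][a-zA-Z][0-9]/';

preg_match_all($Regex, $f_data, $matches, PREG_PATTERN_ORDER);

// $matches[0] is the list of complete matches, however many there are.
foreach ($matches[0] as $postalCode) {
  echo $postalCode, "\n";
}
```

Because the loop runs over the whole array, it does not matter how many postal codes the page contains.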

 

Link to comment
Share on other sites

Oh, first of all, thanks for the help; this forum is extremely resourceful.

Again, in order to crawl all the pages of a website, I'll need to search recursively through all the links....

I heard that cURL's FOLLOWLOCATION option might actually do this. Is that true?

If so, how is it actually done?

*Ignace: I tried your code. It's useful, but it's not what I want: I need it to keep searching, even on newly found pages, until it has covered all the pages inside the website, while still skipping external links :(. This does seem very complicated....

Link to comment
Share on other sites

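set_time_limit(0);

class Crawler implements IteratorAggregate {
  private $host = null;    // host the crawl is restricted to
  private $urlList = null; // every URL visited so far

  public function __construct() {
    $this->urlList = new ArrayObject();
  }

  public function getUrlList() {
    return $this->urlList;
  }

  public function getIterator(): Traversable {
    return $this->urlList->getIterator();
  }

  public function crawl($url) {
    if ($this->host === null) {
      $this->host = parse_url($url, PHP_URL_HOST); // the first URL fixes the site
    }
    if (in_array($url, $this->urlList->getArrayCopy(), true)) return; // already visited
    $this->urlList->append($url);

    // A fresh document per page, so recursion cannot clobber the list being iterated.
    $dom = new DOMDocument();
    if (@$dom->loadHTMLFile($url)) {
      foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href'); // '' when the attribute is missing
        if (!$this->_isInternalUrl($href)) continue; // trail ends here

        $this->crawl($href);
      }
    }
  }

  // Only absolute http(s) URLs on the starting host count; relative links are skipped.
  private function _isInternalUrl($url) {
    $parts = parse_url($url);
    return is_array($parts) && isset($parts['scheme'], $parts['host'])
      && in_array($parts['scheme'], array('http', 'https'), true)
      && $parts['host'] === $this->host;
  }
}

Only absolute links on the starting host are followed, and each URL is visited once; without those checks the script may run forever.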
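Deciding which links stay inside the site can also be sketched separately from the fetching, which makes it easy to try without touching the network. The helper below is hypothetical (extractInternalLinks is not part of PHP or of the original script). It assumes that "internal" means "same host as the base URL", and it resolves only root-relative links such as /about; other relative forms are ignored to keep the sketch short.

```php
<?php
// Hypothetical helper: returns the unique same-host links found in $html.
function extractInternalLinks($html, $baseUrl) {
  $base = parse_url($baseUrl);
  if ($html === '' || !is_array($base) || !isset($base['scheme'], $base['host'])) {
    return array();
  }
  $origin = $base['scheme'] . '://' . $base['host'];

  $dom = new DOMDocument();
  // @ silences the warnings libxml raises on real-world, non-valid HTML.
  if (!@$dom->loadHTML($html)) {
    return array();
  }

  $links = array();
  foreach ($dom->getElementsByTagName('a') as $a) {
    $href = trim($a->getAttribute('href'));
    if ($href === '') continue;
    if ($href[0] === '/' && substr($href, 0, 2) !== '//') {
      $href = $origin . $href; // root-relative link
    }
    $parts = parse_url($href);
    if (is_array($parts) && isset($parts['host']) && $parts['host'] === $base['host']) {
      $links[$href] = true; // array keys de-duplicate
    }
  }
  return array_keys($links);
}

// Made-up sample page: two internal links (one repeated) and one external.
$html = '<html><body><a href="/about">About</a> '
      . '<a href="http://example.com/contact">Contact</a> '
      . '<a href="http://other.example.org/">Elsewhere</a> '
      . '<a href="/about">About again</a></body></html>';

print_r(extractInternalLinks($html, 'http://example.com/index.html'));
```

A crawler could feed each downloaded page through a function like this and queue only the URLs it has not visited yet.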

Link to comment
Share on other sites
