miniramen

Posts posted by miniramen

  1. Hello,

    I've been working on this for ages and I just cannot figure out how to deal with it.

    My crawler script ran on localhost without any problem, but when I run it from the shell on a private server, it doesn't work anymore :(.

    The only error message I got was an undefined variable notice. I defined the variable and it still doesn't work; the code now runs without giving any error messages at all.

    If there's any way I can change the PHP settings to match my localhost environment, I think I should be fine. The problem is that nobody is willing to show me the php.ini, and access to it is prohibited!
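
    When php.ini is off limits, a script can usually still report its own runtime configuration and raise error reporting for itself. The following is a minimal sketch, assuming the server has not disabled `ini_set()`; the list of settings in `$settings` is only a guess at what might differ between the two environments.

    ```php
    <?php
    // Show every notice and warning for this script only (does not edit php.ini).
    error_reporting(E_ALL);
    ini_set('display_errors', '1');

    // Assumed list of settings worth comparing against localhost.
    $settings = ['allow_url_fopen', 'memory_limit', 'max_execution_time', 'display_errors'];

    $report = [];
    foreach ($settings as $name) {
        $report[$name] = ini_get($name); // returns false if the setting is unknown
    }

    echo 'PHP version: ' . PHP_VERSION . "\n";
    echo 'SAPI: ' . PHP_SAPI . "\n"; // "cli" when run from the shell
    echo 'Loaded php.ini: ' . var_export(php_ini_loaded_file(), true) . "\n";
    foreach ($report as $name => $value) {
        echo $name . ' = ' . var_export($value, true) . "\n";
    }
    ```

    Running the same file on localhost and on the server, then comparing the two outputs, should show which settings differ. Note that the shell (CLI) and the web server may load different php.ini files.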

  2. Hello,

    People have been telling me that if I want to do any crawling, I need to know the sitemap, and that I need to do XML parsing.

    Then I realized that a sitemap is written in XML. Sorry, my noobness is unbearable even for me sometimes.

    So if I need to find the sitemap of a website, how do I go about doing it?
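
    One common approach is to read the site's robots.txt, which may list sitemaps in `Sitemap:` lines, and to fall back to the conventional `/sitemap.xml` location otherwise. The sketch below assumes the robots.txt text has already been downloaded; the function names and the example.com URLs are hypothetical, and not every site publishes a sitemap at all.

    ```php
    <?php
    // Collect the URLs from "Sitemap:" lines in a robots.txt body.
    function extractSitemapUrls(string $robotsTxt): array
    {
        $urls = [];
        foreach (preg_split('/\R/', $robotsTxt) as $line) {
            if (preg_match('/^\s*sitemap:\s*(\S+)/i', $line, $m)) {
                $urls[] = $m[1];
            }
        }
        return $urls;
    }

    // Use the listed sitemaps, or guess the conventional location if none are listed.
    function candidateSitemaps(string $baseUrl, string $robotsTxt): array
    {
        $found = extractSitemapUrls($robotsTxt);
        return $found ?: [rtrim($baseUrl, '/') . '/sitemap.xml'];
    }

    // Hypothetical robots.txt content used as sample input.
    $sample = "User-agent: *\nDisallow: /admin/\nSitemap: http://www.example.com/sitemap.xml\n";
    print_r(candidateSitemaps('http://www.example.com', $sample));
    ```

    Once a sitemap URL is known, its XML can be loaded with `simplexml_load_string()` and the `<loc>` elements read out as the list of pages.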

  3. Oh, sorry about the class tag that doesn't exist; I replaced it with a div tag. Now, as for the error messages I get without the "@" in the code:


    Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: I/O warning : failed to load external entity " <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en-ca"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <title>Business Directory » index</title> <link href="/favicon.ico" type="image/x-icon" rel="icon" /><link href="/favicon.ico" type="image/x-icon" rel="shortcut icon" /> <meta name="keywords" content=""/> <meta name="description" content=""/> <link rel="stylesheet" type="text/css" href="/css/blueprint/screen.css" media="screen, projection" / in C:\xampp\htdocs\xampp\new.php on line 8

     

    Fatal error: Call to undefined method DOMXPath::getElementsByTagName() in C:\xampp\htdocs\xampp\new.php on line 11


    The code right now:

    <?php

    $url = "http://www.village.consort.ab.ca/business-directory/";
    $data = file_get_contents($url);
    $dom = new DOMDocument();

    $dom->loadHTMLFile($data);

    $xpath = new DOMXPath($dom);
    $divtags = $xpath->getElementsByTagName("div");
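
    Both messages point at API mismatches: `loadHTMLFile()` expects a file path or URL, but the code passes it the downloaded markup, and `getElementsByTagName()` is a method of `DOMDocument`, not of `DOMXPath` (which offers `query()` instead). A minimal sketch of one way to fix this, using a hypothetical inline `$html` string in place of the fetched page:

    ```php
    <?php
    // Hypothetical sample markup standing in for file_get_contents($url).
    $html = '<html><body><div class="a">One</div><div>Two</div></body></html>';

    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html); // loadHTML() takes markup; loadHTMLFile() takes a path or URL
    libxml_clear_errors();

    // Option 1: ask the document itself for all <div> elements.
    $divs = $dom->getElementsByTagName('div');

    // Option 2: use an XPath expression through DOMXPath::query().
    $xpath = new DOMXPath($dom);
    $divsViaXpath = $xpath->query('//div');

    foreach ($divs as $div) {
        echo trim($div->textContent), "\n";
    }
    ```

    Alternatively, `$dom->loadHTMLFile($url)` can be called with the URL directly, in which case the separate `file_get_contents()` call is not needed.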

  4. Solved my own problem... but now the script sometimes gives the right value and sometimes does not. Can someone please tell me whether this code is wrong?


    foreach ($site_array as $redirectId => $site)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $site);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        $site_content = curl_exec($curl);
        curl_close($curl);

        echo strpos(htmlentities($site_content), $redirectId) . ",    ";
        echo $redirectId;

        /*
          $site_content = file_get_contents($site);
          preg_match('/$redirectId/', $site_content, $matches);
          print_r ($matches);
        */

        if (strpos(htmlentities($site_content), $redirectId) != 0)
            echo $site . ", " . "true" . "<br />" ;
        else
            echo $site . ", " . "false" . "<br />" ;

    }

    fclose($file_open);
    ?>


    Sometimes the script is unable to search through the website :(
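
    A few things in the loop could plausibly explain the inconsistent results. `strpos()` returns `false` when nothing is found and `0` when the match is at the very start, so comparing with `!= 0` cannot tell those apart. Array keys that look like numbers become integers, and before PHP 8.0 an integer needle was treated as a character code rather than as text. A failed or slow request also makes `curl_exec()` return `false`. The sketch below addresses these under stated assumptions; `containsId()` and `fetchPage()` are hypothetical helper names, and the 15-second timeout is arbitrary.

    ```php
    <?php
    // True if $id occurs anywhere in $content, including at position 0.
    // The string type on $id turns an integer array key back into text.
    function containsId(string $content, string $id): bool
    {
        return strpos($content, $id) !== false;
    }

    // Download a page body, following HTTP redirects; returns null on failure.
    function fetchPage(string $url): ?string
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow Location: redirects
        curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // assumed limit, in seconds
        $body = curl_exec($curl);
        if ($body === false) {
            error_log('cURL error for ' . $url . ': ' . curl_error($curl));
            return null;
        }
        return $body;
    }

    // Offline demonstration with hypothetical content.
    var_dump(containsId('12345 is the first thing on the page', '12345')); // match at position 0
    var_dump(containsId('nothing relevant here', '12345'));
    ```

    Inside the loop, this would be used as `$body = fetchPage($site);` followed by `$found = $body !== null && containsId($body, (string) $redirectId);`. Searching the raw body rather than the `htmlentities()` output also avoids mismatches when the ID contains characters that get encoded.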

  5. Hello, yes, I decided to use cURL, and with it I wanted to search for an ID number to see whether the site is redirected to the right place.

    Sadly, it seems that my code is unable to retrieve the content of the site it's redirected to. Actually, it can't get any site content at all. Here's my code:


    fclose($file_open);

    foreach ($site_array as $site)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $site);
        $site_content = curl_exec($curl);

        echo $site;
        echo $sitecontent;

        // $site_content = file_get_contents($site);

        if (strpos($site_content, $redirectId) != false)
            echo $site . "," . "true" . "<br />" ;
        else
            echo $site . "," . "false" . "<br />" ;

    }

  6. Hello,

    Recently, I was asked to check the scripts behind a list of URLs for a redirect code, to see whether other people have done the job or not.

    So let's say I have 100 URLs to check.

    The thing is, how can I check multiple URLs, go inside each script, and find out whether there's a redirect code in it? Then I have to echo a list that tells me whether the code is there or not (so I guess I get a Boolean value for each URL).

    Do I need regex for this? I know that I'll need something like:

    While (READ URL)
    {
        Check for redirect code;
        Echo if it exists or not;
    };

    Any help to point me in the right direction would be amazing :)
    Thank you again!
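
    The loop described above can be sketched without regex if the redirect code is a fixed piece of text, since a plain substring search is enough in that case. This is a minimal sketch under that assumption: `checkUrls()` is a hypothetical name, the `window.location` snippet and example.com URLs are made-up sample data, and the fetch step is passed in as a callable so the example runs without network access.

    ```php
    <?php
    // For each URL, fetch its content and record whether $redirectCode appears in it.
    function checkUrls(array $urls, string $redirectCode, callable $fetch): array
    {
        $results = [];
        foreach ($urls as $url) {
            $content = $fetch($url);
            $results[$url] = is_string($content)
                && strpos($content, $redirectCode) !== false;
        }
        return $results;
    }

    // Hypothetical page contents standing in for real downloads.
    $pages = [
        'http://example.com/a' => '<script>window.location = "http://example.com/new";</script>',
        'http://example.com/b' => '<p>No redirect here</p>',
    ];
    $fetch = function (string $url) use ($pages) {
        return $pages[$url] ?? null;
    };

    $results = checkUrls(array_keys($pages), 'window.location', $fetch);
    foreach ($results as $url => $found) {
        echo $url, ', ', $found ? 'true' : 'false', "\n";
    }
    ```

    For real use, `$fetch` could wrap `file_get_contents()` or cURL. Regex only becomes necessary if the redirect code varies from page to page and has to be matched as a pattern.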

  7. Oh, first of all, thanks for the help; this forum is extremely resourceful.
    Again, in order to crawl all the pages of a website, I'll need to search recursively through all the links.

    I heard that cURL's FOLLOWLOCATION option might actually do this. Is that true?
    If so, how is it actually done?

    *Ignace: I tried your code. It's useful, but it's not what I want. I need it to keep searching nonstop, even on newly found pages, through all the pages inside the website, while skipping external links :(. This does seem very complicated.
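
    For context, `CURLOPT_FOLLOWLOCATION` only makes cURL follow HTTP redirects (`Location:` headers); it does not visit the links inside a page. Crawling every page usually means keeping a queue of URLs still to visit and a set of URLs already seen. A minimal sketch of that loop, assuming a hypothetical `$getLinks` callable that returns the internal links found on a page (here backed by a small made-up link table instead of real downloads):

    ```php
    <?php
    // Breadth-first crawl: visit each URL once, queueing newly discovered links.
    function crawl(string $startUrl, callable $getLinks, int $maxPages = 100): array
    {
        $queue = [$startUrl];
        $visited = [];
        while ($queue && count($visited) < $maxPages) {
            $url = array_shift($queue);
            if (isset($visited[$url])) {
                continue; // already processed, avoids infinite loops
            }
            $visited[$url] = true;
            foreach ($getLinks($url) as $link) {
                if (!isset($visited[$link])) {
                    $queue[] = $link;
                }
            }
        }
        return array_keys($visited);
    }

    // Hypothetical site structure: page => links found on that page.
    $site = [
        '/'  => ['/a', '/b'],
        '/a' => ['/', '/c'],
        '/b' => [],
        '/c' => ['/a'],
    ];
    $order = crawl('/', function ($url) use ($site) {
        return $site[$url] ?? [];
    });
    print_r($order);
    ```

    The `$maxPages` cap is a safety limit chosen for the sketch. In a real crawler, `$getLinks` would download the page and return only the links that stay on the same host.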

  8. Thanks! I actually used something I found, and it also lets me obtain all the URL links from the whole website.

    Now I have advanced to the part where I'm using regex to find the right generic pattern for the things I'll be searching for.

    For example, I did:

    $Regex = "/[a-zA-Z]{1}[0-9]{1}[a-zA-Z]{1}(\-| |){1}[0-9]{1}[a-zA-Z]{1}[0-9]{1}/";
    preg_match_all($Regex, $f_data, $matches, PREG_PATTERN_ORDER);
    echo $matches[0][0] . ", " . $matches[0][1] . "\n";
    echo $matches[1][0] . ", " . $matches[1][1] . "\n";

    to find all the postal codes. But the thing is that I want all of the matches to be displayed, not just the entries from [0][0] to [1][1].
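
    With `PREG_PATTERN_ORDER` (the default), `$matches[0]` holds every full match and `$matches[1]` holds what the first parenthesized group captured, which in the pattern above is only the separator. Looping over `$matches[0]` therefore prints all postal codes. A minimal sketch with a slightly simplified, equivalent pattern and a made-up `$f_data` string:

    ```php
    <?php
    // Hypothetical sample text in place of the crawled page data.
    $f_data = 'Offices: T0C 1B0, K1A-0B1 and V6B4Y8.';

    // Letter-digit-letter, optional space or hyphen, digit-letter-digit.
    $regex = '/[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]/';

    preg_match_all($regex, $f_data, $matches);

    // $matches[0] is the list of all full matches, however many there are.
    foreach ($matches[0] as $postalCode) {
        echo $postalCode, "\n";
    }
    ```

    Because the simplified pattern has no capturing group, `$matches` contains only index 0 here.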

     

  9. Wow! Thank you for the fast reply. May I add a question?
    The script that I'm looking at was made to crawl a specific website, so it is structured around crawling that one site, and I'm trying to find a generic way to do it.

    Therefore, I would like to ask whether there's a generic way to check all the pages inside a website by following its hyperlinks without going to external links. It would be useful if this has already been done, so I can refer to it and customize it a bit.

    Again, the help is very much appreciated.
    Thank you!
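
    The "without going to external links" part comes down to comparing each link's host with the host of the page being crawled. A minimal sketch, assuming the page HTML is already downloaded; `internalLinks()` is a hypothetical helper, the example.com markup is made up, and document-relative links such as `contact.html` are skipped for brevity.

    ```php
    <?php
    // Return the unique links in $html that stay on the same host as $pageUrl.
    function internalLinks(string $html, string $pageUrl): array
    {
        $host = parse_url($pageUrl, PHP_URL_HOST);
        $scheme = parse_url($pageUrl, PHP_URL_SCHEME) ?: 'http';

        $dom = new DOMDocument();
        libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
        $dom->loadHTML($html);
        libxml_clear_errors();

        $links = [];
        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            if ($href === '' || $href[0] === '#') {
                continue; // empty link or same-page anchor
            }
            $linkHost = parse_url($href, PHP_URL_HOST);
            if ($linkHost === null && $href[0] === '/') {
                $href = $scheme . '://' . $host . $href; // root-relative link
            } elseif ($linkHost !== $host) {
                continue; // external host, mailto:, or an unhandled relative form
            }
            $links[$href] = true; // array keys remove duplicates
        }
        return array_keys($links);
    }

    $html = '<a href="/about">About</a>'
          . '<a href="http://www.example.com/contact">Contact</a>'
          . '<a href="http://other.example.org/">Elsewhere</a>'
          . '<a href="#top">Top</a>'
          . '<a href="/about">About again</a>';
    print_r(internalLinks($html, 'http://www.example.com/index.html'));
    ```

    Feeding each returned link back into the same function, while remembering which URLs were already visited, gives a generic crawl that never leaves the site.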

  10. Hello guys,

    I'm a new member and I'm in desperate need of help. I've learned some PHP and other kinds of coding (C++, SQL), but never went into detail.

    I was trying to understand a crawling script that takes important information from a website and puts it all into a MySQL database. It's a nice script, but I've been asked to improve it.

    While going through this script, I found many PHP statements that I cannot find on PHP.net. I do not know why, but it has made my life very difficult.
    Would anyone mind explaining these:

    $sql = new MySQL();

    //why is it "new MySQL()"?//
    --------------------------------------------------
    $qry = 'DROP TABLE IF EXISTS TEMP_tblBusiness;';
    $sql->Query($qry);

    //I've never seen -> anywhere before, can anyone please tell me what it is?//

    --------------------------------------------------
    $scraper->items = array(
        'items' => '#<div class="business-data">'.
        '\n\s*\n\n\n\s*<div class="clearfix">\n.*Category.*\n\s*<div class="business-value">\n\s*(.*?)\s*</div>.*\n\s*</div>'.

    //what is \n\s*(.*?)\s* ? I really want to understand//
    //and what is clearfix?//
    -------------------------------------------------
    $description = $scraper->getMatch('items', $i, 7);
    //what does getMatch('items', $i, 7) do?//
    ------------------------------------------------

    I've searched on PHP.net and nothing came up.
    If anyone would be kind enough to clear this up, thank you very, very much.
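
    Names such as `MySQL`, `Query`, and `getMatch` are most likely classes and methods defined somewhere in the script itself, which would explain why PHP.net does not list them. `new` creates an object from a class, and `->` accesses a property or calls a method on that object. In the regex, `\n` matches a newline, `\s*` matches any run of whitespace, and `(.*?)` captures as few characters as possible. A small sketch of both ideas, where the `Greeter` class and the sample HTML are invented purely for illustration:

    ```php
    <?php
    // A hypothetical user-defined class; it is not part of PHP itself.
    class Greeter
    {
        public $name = 'world';

        public function greet()
        {
            return 'Hello, ' . $this->name; // $this refers to the current object
        }
    }

    $g = new Greeter();      // "new" builds an object from the class
    $g->name = 'miniramen';  // "->" sets a property on the object
    echo $g->greet(), "\n";  // "->" calls a method on the object

    // Regex demo on made-up markup: capture the text inside the div.
    $html = "<div class=\"business-value\">\n   Plumbing   \n</div>";
    preg_match('#<div class="business-value">\n\s*(.*?)\s*</div>#', $html, $m);
    echo $m[1], "\n"; // the captured group, without surrounding whitespace
    ```

    `clearfix` is not PHP at all; it is a CSS class name that happens to appear in the HTML the pattern is matching against.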

     
