
j5uh


Posts posted by j5uh

  1. Violating a TOS is not the same as violating the law. With that said, it's still a bad idea. You could get banned from the site, or they could browbeat your ISP into booting you (it happened to me once).

     

    I see. Well, if this is against the phpfreaks forums' TOS, please delete this thread. I don't want to cause any trouble.

  2. Or the script times out.

     

    And are you aware this is against the yellowpages TOS?

     

    HOW YOU MAY USE OUR MATERIALS: We use a diverse range of information, text, photographs, designs, graphics, images, sound and video recordings, animation and other materials and effects on the YELLOWPAGES.COM Web site.

     

    We provide the information, content or advertisements (which we collectively call the "Materials") on the YELLOWPAGES.COM site FOR YOUR PERSONAL, NON-COMMERCIAL USE ONLY.

     

    Accordingly, You may view, use, copy, and distribute the Materials found on YELLOWPAGES.COM Web sites for internal, noncommercial, informational purposes only. You are prohibited from data mining, scraping, crawling, or using any process or processes that send automated queries to the YELLOWPAGES.COM Web site. You may not use the YELLOWPAGES.COM Web sites to compile a collection of listings, including a competing listing product or service. You may not use the Site or any Materials for any unsolicited commercial e-mail. Except as authorized in this paragraph, you are not being granted a license under any copyright, trademark, patent or other intellectual property right in the Materials or the products, services, processes or technology described therein. All such rights are retained by YELLOWPAGES.COM, its subsidiaries, parent companies, and/or any third party owner of such rights.

     

    Ooh, I did not know this. But there is actual software being sold that does the scraping. How are they getting away with that?

  3. I found this AWESOME yellowpages scraper online for free instead of paying someone to scrape the pages... http://www.scrapingpages.com/

     

    I've tested the code here:

    
    <?php
    ini_set('memory_limit', '99999M');

    // Build one URL per results page by inserting "page=N" right after the "?".
    // $lastnum is exclusive: pages 1 .. $lastnum - 1 are generated.
    function createUrl($url, $lastnum)
    {
        $urls = array();
        for ($number = 1; $number < $lastnum; $number++) {
            $urls[] = str_replace("?", "?page=" . $number . "&", $url);
        }
        return $urls;
    }

    // Download the HTML of every URL in the list, skipping failed requests.
    function createList($urls)
    {
        $pages = array();
        foreach ($urls as $value) {
            $html = file_get_contents($value);
            if ($html !== false) {
                $pages[] = $html;
            }
        }
        return $pages;
    }

    $url = "http://www.yellowpages.com/TX/Internet-Marketing-Advertising?search_mode=all&search_terms=seo";
    $lastnum = 1 + 1; // last page to fetch, plus one
    $urls = createUrl($url, $lastnum);
    $list = createList($urls);

    foreach ($list as $value) {
        echo "<span style='width:8px; background:blue'> </span>";
        preg_match_all("/<div class=\"description\">([^`]*?)<\/div>/", $value, $matches);
        foreach ($matches[0] as $match) {
            preg_match("/<h2>([^`]*?)<\/h2>/", $match, $temp);
            preg_match("/<p>([^`]*?)<\/p>/", $match, $desc);
            preg_match("/<ul>([^`]*?)<\/ul>/", $match, $num);

            // A listing may lack a field, so fall back to an empty string.
            $title = isset($temp[1]) ? strip_tags(trim($temp[1])) : "";
            $description = isset($desc[1]) ? strip_tags(trim($desc[1])) : "";
            $phone = isset($num[1]) ? strip_tags(trim($num[1])) : "";

            print "<b>$title</b>
    <br>$description<br>
    $phone<br>
    <br>";
        }
    }
    ?>
    

     

    Works great, but how do I get it to search more than 50 pages? I want to scrape all the Houston businesses, but it times out at 50 or so pages. Is there a way to modify this code to search maybe 50 pages at a time? Like scrape pages 1-50, then 51-100, etc.

  4. Be aware that HTTP_REFERER can be modified by the user. But generally it would work (if a few users getting "unauthorized" access is OK). If you want to match someone coming from paypal.com, with or without possible sub domains and/or pages aside from the front page, you can use preg_match():

     

    <?php
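    // HTTP_REFERER is optional, so fall back to an empty string when it is missing
    $referal = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    // the dot in "paypal.com" is escaped so that it only matches a literal dot
    if (preg_match('~^https?://(.*?\.)?paypal\.com/.*?$~D', $referal)) {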
    //they come from paypal.com
    } else {
    //they don't
    }
    ?>

     

    I don't think the other script posted will work, since the URLs are short of a trailing slash and the "https" scheme. But I guess you were supposed to fill in the exact URLs yourself :)

     

    So this script here is better with preg_match?

    So if someone made a payment on PayPal, they would be forwarded to this page and it should allow them to access it, right?

     

    I have no problem with just a few people sneaking by... I will review the list every couple weeks to make sure people have paid...
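
    For illustration, here is a minimal sketch of the referrer check discussed above, wrapped in a hypothetical helper `came_from_paypal()` (not from the original posts) so it can be tried against sample URLs. The pattern is an assumption: it is a bit stricter than the quoted one, allowing only host-name characters in front of `paypal.com`. Since the header is user-controlled, this is a convenience filter rather than proof of payment.

    ```php
    <?php
    // Hypothetical helper: returns true when the referrer URL is on paypal.com
    // or one of its subdomains. The pattern is an assumption, not a PayPal spec.
    function came_from_paypal($referer)
    {
        return preg_match('~^https?://([a-z0-9-]+\.)*paypal\.com(/|$)~iD', $referer) === 1;
    }

    // The header may be missing entirely, so default to an empty string.
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    if (came_from_paypal($referer)) {
        echo "Referrer looks like paypal.com\n";
    } else {
        echo "Referrer is not paypal.com\n";
    }
    ```

    Look-alike hosts such as `paypal.com.evil.example` are rejected because the host must end right after `paypal.com`.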

  5. I've finally figured it out. Here's the final code, to share it with the world.

     

    #!/usr/bin/php
    <?php
    
    $db_host = "xxx";
    $db_user = "xxx";
    $db_pwd = "xxx";
    $db_name = "xxx";
    $db_table = "users";
    $db_emailfield = "email";
    
    mysql_connect($db_host, $db_user, $db_pwd);
    mysql_select_db($db_name);
    
    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    
    function get_string_between($string, $start, $end){
            $string = " ".$string;
            $ini = strpos($string,$start);
            if ($ini == 0) return "";
            $ini += strlen($start);   
            $len = strpos($string,$end,$ini) - $ini;
            return substr($string,$ini,$len);
    }
    
    $email = get_string_between($email, "<div class=3DSection1>", "</div>");
    
    
    // handle email
    $lines = explode("\n", $email);
    
    // empty vars
    $from = "";
    $subject = "your subject here";
    $headers = "";
    $message = "";
    $splittingheaders = true;
    
    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";
    
            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";
        }
    
        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
    
    $sql = "SELECT `$db_emailfield` FROM `$db_table`;";
    $result = mysql_query($sql);
    while($row = mysql_fetch_assoc($result)){
    // use the configured column name instead of a hard-coded 'email'
    $emails = $row[$db_emailfield];
    
    $headers = 'MIME-Version: 1.0' . "\n";
    $headers .= 'Content-type: text/html; charset=UTF-8' . "\n";
    $headers .= "From: your address.com";
    $ForwardTo = $emails;
    mail ($ForwardTo,$subject,$message,$headers);
    }
    ?>
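
    The `=3D` in the marker passed to `get_string_between()` is there because the HTML part of the incoming mail is quoted-printable encoded: `=3D` stands for `=`, and a `=` at the end of a line marks a soft line break. As a possible refinement, the text could be decoded with PHP's built-in `quoted_printable_decode()` before searching, so the plain marker can be used. A minimal sketch with a made-up sample string:

    ```php
    <?php
    // Sample quoted-printable HTML fragment (made up for illustration).
    $encoded = "<div class=3DSection1>\n<p class=3DMsoNormal>Hello =\nworld</p>\n</div>";

    // Decode "=3D" back to "=" and join soft-wrapped lines.
    $decoded = quoted_printable_decode($encoded);

    // Pull out the contents of the first Section1 div and drop the tags.
    if (preg_match('~<div class=Section1>(.*?)</div>~s', $decoded, $m)) {
        $body = trim(strip_tags($m[1]));
    } else {
        $body = '';
    }

    echo $body, "\n";
    ```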

  6. OK, one more issue. Now I've stuck in some code to pull the email addresses from the DB, but I'm getting this error:

     

    Fatal error: Cannot redeclare get_string_between() (previously declared in /home/newhost/public_html/asd/asd/mailer3.php:28) in /home/newhost/public_html/asd/asd/mailer3.php on line 28

     

    here's the code I am using:

     

    #!/usr/bin/php
    <?php
    
    $db_host = "asd";
    $db_user = "asd";
    $db_pwd = "asd";
    $db_name = "asd";
    $db_table = "users";
    $db_emailfield = "email";
    
    mysql_connect($db_host, $db_user, $db_pwd);
    mysql_select_db($db_name);
    
    $sql = "SELECT `$db_emailfield` FROM `$db_table`;";
    $result = mysql_query($sql);
    while($row = mysql_fetch_assoc($result)){
    $emails = $row['email'];
    $ChangeTo = 'asd';
    
    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    
    function get_string_between($string, $start, $end){
            $string = " ".$string;
            $ini = strpos($string,$start);
            if ($ini == 0) return "";
            $ini += strlen($start); 
            $len = strpos($string,$end,$ini) - $ini;
            return substr($string,$ini,$len);
    }
    
    $email = get_string_between($email, "<div class=3DSection1>", "</div>");
    
    
    // handle email
    $lines = explode("\n", $email);
    
    // empty vars
    $from = "";
    $subject = "Expert Advisors";
    $headers = "";
    $message = "";
    $splittingheaders = true;
    
    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";
    
            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";
        }
    
        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
    
    $headers = 'MIME-Version: 1.0' . "\n";
    $headers .= 'Content-type: text/html; charset=UTF-8' . "\n";
    $headers .= "From: asd";
    $ForwardTo = $emails;
    mail ($ForwardTo,$subject,$message,$headers);
    }
    ?>
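
    The fatal error comes from the `function get_string_between()` declaration sitting inside the `while` loop: PHP tries to declare it again when the second row is processed. A minimal sketch of the usual arrangement, with the function declared once at the top level and only the per-recipient work kept in the loop. The `$rows` array and the `$raw` string are made-up stand-ins for the database result and the piped mail, and the function body is a slightly stricter variant, not the original.

    ```php
    <?php
    // Declared once, outside any loop, so PHP never sees a second declaration.
    function get_string_between($string, $start, $end)
    {
        $ini = strpos($string, $start);
        if ($ini === false) {
            return "";
        }
        $ini += strlen($start);
        $stop = strpos($string, $end, $ini);
        if ($stop === false) {
            return "";
        }
        return substr($string, $ini, $stop - $ini);
    }

    // Stand-in for the rows returned by the database query (made-up addresses).
    $rows = array(
        array('email' => 'first@example.com'),
        array('email' => 'second@example.com'),
    );

    // Stand-in for the mail read from stdin; it is read and parsed only once.
    $raw = "<div class=3DSection1>Testing 1234</div>";
    $body = get_string_between($raw, "<div class=3DSection1>", "</div>");

    // Only the per-recipient work stays inside the loop.
    foreach ($rows as $row) {
        echo "Would send '" . $body . "' to " . $row['email'] . "\n";
    }
    ```

    Reading stdin once before the loop also matters: after the first pass the stream is exhausted, so a second read inside the loop would return nothing.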

  7. OK, sweet... now I've modified it even more and this is what I have:

     

    #!/usr/bin/php
    <?php
    
    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    
    function get_string_between($string, $start, $end){
            $string = " ".$string;
            $ini = strpos($string,$start);
            if ($ini == 0) return "";
            $ini += strlen($start);   
            $len = strpos($string,$end,$ini) - $ini;
            return substr($string,$ini,$len);
    }
    
    
    // handle email
    $lines = explode("\n", $email);
    
    // empty vars
    $from = "";
    $subject = "";
    $headers = "";
    $message = "";
    $splittingheaders = true;
    
    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";
    
            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";
        }
    
        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
    
    $headers = 'MIME-Version: 1.0' . "\n";
    $headers .= 'Content-type: text/html; charset=UTF-8' . "\n";
    $headers .= "From: xxx";
    $ForwardTo = 'xxx';
    mail ($ForwardTo,$subject,$message,$headers);
    ?>

     

    And I'm getting this:

     

    This is a multipart message in MIME format. ------=_NextPart_000_010F_01C8C663.3932C670 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Testing 1234 ------=_NextPart_000_010F_01C8C663.3932C670 Content-Type: text/html; charset="us-ascii" Content-Transfer-Encoding: quoted-printable
    
    Testing 1234
    ------=_NextPart_000_010F_01C8C663.3932C670--

     

    Which means I've stripped away most of the junk using these lines:

    $headers = 'MIME-Version: 1.0' . "\n";
    $headers .= 'Content-type: text/html; charset=UTF-8' . "\n";

     

    But how can I get rid of that other mess?
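
    The leftover text is the raw multipart body: the original mail carries a `text/plain` part and a `text/html` part, separated by the boundary string declared in its `Content-Type` header. One way to get rid of it is to split the mail on that boundary and keep a single part. A minimal sketch, assuming a simple, non-nested `multipart/alternative` message; `extract_plain_part()` is a hypothetical helper, and the sample mail is shortened from the output above.

    ```php
    <?php
    // Hypothetical helper: returns the text/plain part of a simple multipart mail,
    // or an empty string if none is found. Nested multiparts are not handled.
    function extract_plain_part($rawEmail)
    {
        // The boundary is announced in the Content-Type header.
        if (!preg_match('~boundary=\s*"?([^"\r\n;]+)"?~i', $rawEmail, $m)) {
            return '';
        }
        // Each part starts after a line made of "--" plus the boundary.
        $parts = explode('--' . $m[1], $rawEmail);
        array_shift($parts); // drop the headers and preamble before the first part

        foreach ($parts as $part) {
            // Part headers and part body are separated by a blank line.
            $split = preg_split('~\r?\n\r?\n~', ltrim($part, "\r\n"), 2);
            if (count($split) === 2 && stripos($split[0], 'text/plain') !== false) {
                return trim($split[1]);
            }
        }
        return '';
    }

    $sample = "Subject: test\n"
        . "Content-Type: multipart/alternative;\n"
        . "\tboundary=\"----=_NextPart_000\"\n"
        . "\n"
        . "This is a multipart message in MIME format.\n"
        . "\n"
        . "------=_NextPart_000\n"
        . "Content-Type: text/plain;\n"
        . "\tcharset=\"us-ascii\"\n"
        . "Content-Transfer-Encoding: 7bit\n"
        . "\n"
        . "Testing 1234\n"
        . "\n"
        . "------=_NextPart_000\n"
        . "Content-Type: text/html;\n"
        . "\tcharset=\"us-ascii\"\n"
        . "\n"
        . "<p>Testing 1234</p>\n"
        . "\n"
        . "------=_NextPart_000--\n";

    echo extract_plain_part($sample), "\n";
    ```

    The returned text could then be passed to `mail()` as `$message`, with a `text/plain` content type instead of `text/html`.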

  8. ;D

     

    OK... so the result is getting better, but some junk still gets emailed. Here it is:

     

    From asd@a.com Wed Jun 04 16:19:47 2008
    Received: from 123123123.dsl.hs123123obal.net ([123123]:1421 helo=prexxix)
           by gator465.hostgator.com with esmtpa (Exim 4.68)
           (envelope-from <asd@aa.com>)
           id 1K40OY-0001MR-T5
           for oskdpsk@aol.com; Wed, 04 Jun 2008 16:19:47 -0500
    From: "John" <12323@ao.com>
    To: <asdsd@aol.com>
    Subject: test
    Date: Wed, 4 Jun 2008 16:19:54 -0500
    Message-ID: <123434##.com>
    MIME-Version: 1.0
    Content-Type: multipart/alternative;
           boundary="----=_NextPart_000_00E7_01C8C65E.D3062580"
    X-Mailer: Microsoft Office Outlook 12.0
    Thread-Index: AcjGiLnI28Lp3X5tSLuLhc6u9F0ABA==
    Content-Language: en-us
    x-cr-hashedpuzzle: AB0P A5cy CTO+ CX7e CxSy DvBH ECBa HdzE Htgh Ic06 JKPY Jjka Jk2A KtuP LsxO L6XP;1;cwBpAGcAbgBhAGwAQABmAGkAbgBhAG4AYwBpAGEAbAAtAHIAbwBiAG8AdABpAGMAcwAuAGMAbwBtAA==;Sosha1_v1;7;{3A797A8D-B98B-4371-A084-67C4021C6B09};agAuAHMAdQBoAEAAZgBpAG4AYQBuAGMAaQBhAGwALQByAG8AYgBvAHQAaQBjAHMALgBjAG8AbQA=;Wed, 04 Jun 2008 21:19:51 GMT;dABlAHMAdAA=
    x-cr-puzzleid: {3A797A8D-B98B-4371-A084-67C4021C6B09}
    
    
    
    This is a multipart message in MIME format.
    
    ------=_NextPart_000_00E7_01C8C65E.D3062580
    Content-Type: text/plain;
           charset="us-ascii"
    Content-Transfer-Encoding: 7bit
    
    Asidjasoijd asodj asod j
    
    
    ------=_NextPart_000_00E7_01C8C65E.D3062580
    Content-Type: text/html;
           charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable
    
    <html xmlns:v=3D"urn:schemas-microsoft-com:vml" =
    xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
    xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
    xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" =
    xmlns=3D"http://www.w3.org/TR/REC-html40">
    
    <head>
    <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
    charset=3Dus-ascii">
    <meta name=3DGenerator content=3D"Microsoft Word 12 (filtered medium)">
    <style>
    <!--
    /* Font Definitions */
    @font-face
           {font-family:"Cambria Math";
           panose-1:2 4 5 3 5 4 6 3 2 4;}
    @font-face
           {font-family:Calibri;
           panose-1:2 15 5 2 2 2 4 3 2 4;}
    /* Style Definitions */
    p.MsoNormal, li.MsoNormal, div.MsoNormal
           {margin:0in;
           margin-bottom:.0001pt;
           font-size:11.0pt;
           font-family:"Calibri","sans-serif";}
    a:link, span.MsoHyperlink
           {mso-style-priority:99;
           color:blue;
           text-decoration:underline;}
    a:visited, span.MsoHyperlinkFollowed
           {mso-style-priority:99;
           color:purple;
           text-decoration:underline;}
    span.EmailStyle17
           {mso-style-type:personal-compose;
           font-family:"Calibri","sans-serif";
           color:windowtext;}
    .MsoChpDefault
           {mso-style-type:export-only;}
    @page Section1
           {size:8.5in 11.0in;
           margin:1.0in 1.0in 1.0in 1.0in;}
    div.Section1
           {page:Section1;}
    -->
    </style>
    <!--[if gte mso 9]><xml>
    <o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
    </xml><![endif]--><!--[if gte mso 9]><xml>
    <o:shapelayout v:ext=3D"edit">
    <o:idmap v:ext=3D"edit" data=3D"1" />
    </o:shapelayout></xml><![endif]-->
    </head>
    
    <body lang=3DEN-US link=3Dblue vlink=3Dpurple>
    
    <div class=3DSection1>
    
    <p class=3DMsoNormal>Asidjasoijd asodj asod j<o:p></o:p></p>
    
    </div>
    
    </body>
    
    </html>
    
    ------=_NextPart_000_00E7_01C8C65E.D3062580--

  9. This is what I get:

     

    From xxx@asd.com Wed Jun 04 14:44:08 2008
    Received: from adsl-02020202002.dsl.hstntx.sbcglobal.net ([00.00.00.00]:1234 helo=prexxix)
          by gator465.hostgator.com with esmtpa (Exim 4.68)
          (envelope-from <123@cs.com>)
          id 1K3yu0-0005O6-Hn
          for 123@cs.com; Wed, 04 Jun 2008 14:44:08 -0500
    From: "john" <123@cs.com>
    To: <aasl@cs.com>
    Subject: 12323213123
    Date: Wed, 4 Jun 2008 14:44:15 -0500
    Message-ID: <asdioj@sokd.com>
    MIME-Version: 1.0
    Content-Type: multipart/alternative;
          boundary="----=_NextPart_000_00E2_01C8C651.766A7CC0"
    X-Mailer: Microsoft Office Outlook 12.0
    Thread-Index: AcjGe11qMOjrcJG4R7amtFB1Cvy/Uw==
    Content-Language: en-us
    x-cr-hashedpuzzle: yYE= AdPi Axu3 BAAg ECQt EaW5 EbYl Ey7J E3EF FHiq GPV0 HF4a IH/L I8i8 JXAS KX8S;1;cwBpAGcAbgBhAGwAQABmAGkAbgBhAG4AYwBpAGEAbAAtAHIAbwBiAG8AdABpAGMAcwAuAGMAbwBtAA==;Sosha1_v1;7;{FFA2519B-E285-46B0-92BB-9425F2DC2D68};agAuAHMAdQBoAEAAZgBpAG4AYQBuAGMAaQBhAGwALQByAG8AYgBvAHQAaQBjAHMALgBjAG8AbQA=;Wed, 04 Jun 2008 19:44:13 GMT;MQAyADMAMgAzADIAMQAzADEAMgAzAA==
    x-cr-puzzleid: {FFA2519B-E285-46B0-92BB-9425F2DC2D68}
    
    
    
    This is a multipart message in MIME format.
    
    ------=_NextPart_000_00E2_01C8C651.766A7CC0
    Content-Type: text/plain;
          charset="us-ascii"
    Content-Transfer-Encoding: 7bit
    
    1231231232
    
    
    ------=_NextPart_000_00E2_01C8C651.766A7CC0
    Content-Type: text/html;
          charset="us-ascii"
    Content-Transfer-Encoding: quoted-printable
    
    <html xmlns:v=3D"urn:schemas-microsoft-com:vml" =
    xmlns:o=3D"urn:schemas-microsoft-com:office:office" =
    xmlns:w=3D"urn:schemas-microsoft-com:office:word" =
    xmlns:m=3D"http://schemas.microsoft.com/office/2004/12/omml" =
    xmlns=3D"http://www.w3.org/TR/REC-html40">
    
    <head>
    <META HTTP-EQUIV=3D"Content-Type" CONTENT=3D"text/html; =
    charset=3Dus-ascii">
    <meta name=3DGenerator content=3D"Microsoft Word 12 (filtered medium)">
    <style>
    <!--
    /* Font Definitions */
    @font-face
          {font-family:"Cambria Math";
          panose-1:2 4 5 3 5 4 6 3 2 4;}
    @font-face
          {font-family:Calibri;
          panose-1:2 15 5 2 2 2 4 3 2 4;}
    /* Style Definitions */
    p.MsoNormal, li.MsoNormal, div.MsoNormal
          {margin:0in;
          margin-bottom:.0001pt;
          font-size:11.0pt;
          font-family:"Calibri","sans-serif";}
    a:link, span.MsoHyperlink
          {mso-style-priority:99;
          color:blue;
          text-decoration:underline;}
    a:visited, span.MsoHyperlinkFollowed
          {mso-style-priority:99;
          color:purple;
          text-decoration:underline;}
    span.EmailStyle17
          {mso-style-type:personal-compose;
          font-family:"Calibri","sans-serif";
          color:windowtext;}
    .MsoChpDefault
          {mso-style-type:export-only;}
    @page Section1
          {size:8.5in 11.0in;
          margin:1.0in 1.0in 1.0in 1.0in;}
    div.Section1
          {page:Section1;}
    -->
    </style>
    <!--[if gte mso 9]><xml>
    <o:shapedefaults v:ext=3D"edit" spidmax=3D"1026" />
    </xml><![endif]--><!--[if gte mso 9]><xml>
    <o:shapelayout v:ext=3D"edit">
    <o:idmap v:ext=3D"edit" data=3D"1" />
    </o:shapelayout></xml><![endif]-->
    </head>
    
    <body lang=3DEN-US link=3Dblue vlink=3Dpurple>
    
    <div class=3DSection1>
    
    <p class=3DMsoNormal>1231231232<o:p></o:p></p>
    
    </div>
    
    </body>
    
    </html>
    
    ------=_NextPart_000_00E2_01C8C651.766A7CC0--

  10. OK, so now I have stripped it down and followed the directions at http://www.evolt.org/article/Incoming_Mail_and_PHP/18/27914/index.html

     

    Here is the code that I am using now:

    #!/usr/bin/php
    <?php
    
    // read from stdin
    $fd = fopen("php://stdin", "r");
    $email = "";
    while (!feof($fd)) {
        $email .= fread($fd, 1024);
    }
    fclose($fd);
    
    // handle email
    $lines = explode("\n", $email);
    
    // empty vars
    $from = "";
    $subject = "";
    $headers = "";
    $message = "";
    $splittingheaders = true;
    
    for ($i=0; $i < count($lines); $i++) {
        if ($splittingheaders) {
            // this is a header
            $headers .= $lines[$i]."\n";
    
            // look out for special headers
            if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
                $subject = $matches[1];
            }
            if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
                $from = $matches[1];
            }
        } else {
            // not a header, but message
            $message .= $lines[$i]."\n";
        }
    
        if (trim($lines[$i])=="") {
            // empty line, header section has ended
            $splittingheaders = false;
        }
    }
    
    $ForwardTo = 'aaaa@gmail.com';
    mail ($ForwardTo,$subject,$message,$headers);
    ?>

     

    The email piping WORKS! But I'm getting a bunch of MS Word HTML mess along with the email. Is there a way to filter all that out and just have the body of the email?

  11. OK, I've tested $message = "testing msg"; and it sends the msg through.

     

    The following part of the script:

     

    $splittingheaders = true;
    
    for ($i=0;$i<count($lines);$i++) {
       if ($splittingheaders) {
          if (preg_match("/^From: (.*)/",$lines[$i],$matches)) {
             if (strpos($lines[$i],"<")) {
                // The name is before the email   
                $data = explode ("<",$lines[$i]);
                $from = substr(trim($data[1]),0,-1);
             } else {
                $from = $matches[1];
             }
          }
         
          if (preg_match("/^Subject: (.*)/",$lines[$i],$matches)) {
             $subject = $matches[1];
          }
       } else {
          $message .= $lines[$i]."\n";
       }
       
       if (trim($lines[$i])=="") {
          $splittingheaders = false;
       }
    }
    
    $message = <<< EOF
    $message
    EOF;
    
    $headers = "Content-type: text/html\n";
    $headers .= "From: $from\n";
    $headers .= "Return-Path: $from\n";
    //$headers .= "To: $to\n";

     

    I don't understand at all...  ???
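
    In outline, the loop above walks the raw mail line by line: until the first blank line every line is treated as a header (with `From:` and `Subject:` picked out), and everything after it is appended to `$message`. The same idea can be sketched more compactly by splitting once on the first blank line. `split_mail()` is a hypothetical helper and the sample mail is made up.

    ```php
    <?php
    // Hypothetical helper: splits a raw mail into sender address, subject and body.
    function split_mail($rawEmail)
    {
        // Headers end at the first blank line; the rest is the body.
        $halves = preg_split('~\r?\n\r?\n~', $rawEmail, 2);
        $headerBlock = $halves[0];
        $body = isset($halves[1]) ? $halves[1] : '';

        $from = '';
        $subject = '';
        foreach (preg_split('~\r?\n~', $headerBlock) as $line) {
            if (preg_match('~^From:\s*(.*)$~i', $line, $m)) {
                // Prefer the address inside <...> when a display name is present.
                $from = preg_match('~<([^>]+)>~', $m[1], $addr) ? $addr[1] : trim($m[1]);
            } elseif (preg_match('~^Subject:\s*(.*)$~i', $line, $m)) {
                $subject = trim($m[1]);
            }
        }
        return array('from' => $from, 'subject' => $subject, 'body' => $body);
    }

    $sample = "From: \"John\" <john@example.com>\nSubject: test\n\nTesting 1234\n";
    $mail = split_mail($sample);
    echo $mail['from'], " | ", $mail['subject'], " | ", trim($mail['body']), "\n";
    ```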
