I am building a spider that will crawl various white-pages sites (e.g. anywho.com, switchboard.com, whitepages.com, etc.), collect the information on the people found there, and store it in a database. So far I've only made this little prototype, but after trying to run it I've run into a bunch of problems. I fixed a lot of them, but there are some with the regular expressions that I can't figure out.

 

Here are the errors:

Warning: preg_match_all() [function.preg-match-all]: Compilation failed: missing ) at offset 57 in /home/public_html/spider/inc/anywho.class.php on line 51

Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /home/public_html/spider/inc/anywho.class.php on line 72

Warning: preg_match_all() [function.preg-match-all]: No ending delimiter '^' found in /home/public_html/spider/inc/anywho.class.php on line 73

Warning: preg_match() [function.preg-match]: No ending delimiter '^' found in /home/public_html/spider/inc/anywho.class.php on line 76

Warning: preg_replace() [function.preg-replace]: No ending delimiter '.' found in /home/public_html/spider/inc/anywho.class.php on line 92

Warning: preg_replace() [function.preg-replace]: No ending delimiter '^' found in /home/public_html/spider/inc/anywho.class.php on line 93

Warning: preg_replace() [function.preg-replace]: No ending delimiter '.' found in /home/public_html/spider/inc/anywho.class.php on line 94

Warning: preg_replace() [function.preg-replace]: No ending delimiter '^' found in /home/public_html/spider/inc/anywho.class.php on line 95

Warning: preg_replace() [function.preg-replace]: No ending delimiter '*' found in /home/public_html/spider/inc/anywho.class.php on line 96

 

 

Along with these, it isn't printing out the info like it is supposed to on line 56 of anywho.class.php.

Since these are two files and a little bigger than the normal "snippet", I posted them both on a pin board. The links are below.

 

Spider Class: http://www.coderprofile.com/networks/code-pin-board/258/spiderclassphp

Anywho Class: http://www.coderprofile.com/networks/code-pin-board/257/anywhospiderclassphp

 

And here is the source of the form page:

 

<?php
require("spider.class.php");
require("anywho.class.php");

$spider=new spider("Lorem Ipsum","Lorem Ipsum","Lorem Ipsum","localhost",15);
$any=new anywho;

if(isset($_POST['submit'])){
    $state=$_POST['state'];
    $last=$_POST['last'];
    $first = (isset($_POST['first'])) ? $_POST['first'] : null;
    $street = (isset($_POST['street'])) ? $_POST['street'] : null;
    // The form below has no "city" field, so this stays null unless one is added.
    $city = (isset($_POST['city'])) ? $_POST['city'] : null;
    $zip = (isset($_POST['zip'])) ? $_POST['zip'] : null;

    $any->initialize($last,$state,$first,$street,$city,$zip);
    $any->any_crawl($any->url,0,1);
}
?>
<form action="index.php" method="post">
Last Name: <input type="text" name="last">*<br>
First Name: <input type="text" name="first"><br>
Street: <input type="text" name="street"><br>
Zip: <input type="text" name="zip"><br>
State:
<select name="state" style="height:17px; font-size:9px;">
<option value="">Select a State</option>
<option value="AL" selected="selected" >Alabama</option>
...........................
...........................
<option value="WY">Wyoming</option>
</select>*<br><br>
<input type="submit" value="Crawl" name="submit">
</form>

 

 

 

I'm really sorry about the messy code and poor documentation.

 

Also I really appreciate any and all replies! 
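
For reference, warnings like "No ending delimiter '^' found" and "Delimiter must not be alphanumeric or backslash" usually mean the pattern string was passed to a preg_* function without enclosing delimiters, so PCRE treats the first character of the pattern as the delimiter. The "missing )" warning is often an unbalanced parenthesis, or a delimiter character appearing unescaped inside the pattern. The sketch below shows the difference on a made-up string; the markup and the class name are assumptions, not taken from anywho.class.php.

```php
<?php
// Hypothetical sample markup, used only to demonstrate delimiters.
$html = '<span class="name">Doe, John</span>';

// No delimiters: "^" is taken as the opening delimiter and no closing "^"
// exists, so the call warns and returns false. The @ only hides the warning
// to keep this demo's output clean.
$bad = @preg_match('^<span class="name">(.*)</span>', $html);

// Same pattern wrapped in "#" delimiters. "#" is convenient for HTML because
// the "/" in closing tags then needs no escaping.
$good = preg_match('#^<span class="name">(.*)</span>#', $html, $m);

var_dump($bad);    // bool(false)
var_dump($good);   // int(1)
echo $m[1], "\n";  // Doe, John
```

If the patterns on lines 72 to 96 of anywho.class.php look like the first call, wrapping each one in a delimiter pair is likely the fix.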

 


Also, file_get_contents() is a poor choice if you don't know what pages you're going after. If the page redirects, you can end up with nothing. cURL is better for this.

If you definitely know the page is http://site.com then file_get_contents() is OK, but if it redirects http://site.com -> http://www.site.com then you're screwed.
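
A minimal sketch of the cURL approach, assuming the curl extension is installed. fetch_page() is a hypothetical helper name, and the timeout and redirect limits are arbitrary choices rather than values from the original classes.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, following redirects.
function fetch_page($url, $maxRedirects = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow 3xx redirects
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop on redirect loops
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
    ));

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    // The URL actually served after any redirects.
    $finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return array('url' => $finalUrl, 'body' => $body);
}
```

Calling fetch_page('http://site.com') would then return the body along with the final URL, so a redirect to http://www.site.com shows up in the 'url' entry instead of silently failing.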

Oh no... when the form is completed it finishes the URL, e.g.

 

 

And this is going to spider the people (names, phone numbers, addresses...etc) from the pages to follow.

 

Yes I understand, but I know for a fact that it is going to be anywho.com as I have designed it only to follow the links to another anywho.com page.

Well, forgetting about the class, this will get the info you want. You're not getting multiple items per page, so you won't need preg_match_all().

 

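$u = 'http://whitepages.anywho.com/results.php?ReportType=34&qi=0&qk=10&qn=A&qs=AK';

$g = file_get_contents($u); // returns false on failure
if ($g === false) {
    die('Could not fetch ' . $u);
}

// Keep what sits between the first singleName span and the closing divs
$a = explode('<span class="singleName">', $g);
if (!isset($a[1])) {
    die('Marker not found in page');
}
$a = explode('</div></div>', $a[1]);
$a = $a[0];

echo $a;
echo '<hr>Done!';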

 

Just str_replace() the divs so you can put it into the database
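
A rough sketch of that cleanup step, assuming the extracted fragment looks something like the made-up markup below (the real page structure may differ):

```php
<?php
// Hypothetical fragment standing in for the extracted $a.
$a = '<div class="line">123 Main St</div><div class="line">Anchorage, AK</div>';

// Turn closing divs into line breaks, then drop whatever tags remain.
$text = str_replace('</div>', "\n", $a);
$text = trim(strip_tags($text));

// One array entry per non-empty line.
$lines = array_values(array_filter(array_map('trim', explode("\n", $text)), 'strlen'));

print_r($lines);
```

Each entry in $lines is then plain text; when inserting into the database, a prepared statement is the safer route than building the query string by hand.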

*bump* This time it made it to the middle of the fourth page....so am I really on my own on this one?

 

Probably because of the type of content you're after: people's private info. I'm not going to ask what you intend to do with it.

 

preg_match_all() will definitely work with a foreach loop.
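
Something along these lines, as a sketch only. The sample markup is invented; the singleName class is borrowed from the explode() example earlier in the thread, and the actual pages may be structured differently.

```php
<?php
// Hypothetical markup with several entries on one page.
$html = '<span class="singleName">Entry One</span>'
      . '<span class="singleName">Entry Two</span>';

// "#" delimiters avoid escaping the "/" in closing tags;
// (.*?) is lazy so each match stops at the nearest </span>.
$count = preg_match_all(
    '#<span class="singleName">(.*?)</span>#',
    $html,
    $matches,
    PREG_SET_ORDER
);

$names = array();
if ($count) {
    foreach ($matches as $match) {
        // $match[0] is the full match, $match[1] the first capture group.
        $names[] = trim($match[1]);
    }
}

print_r($names);
```

With PREG_SET_ORDER each element of $matches is one complete match, which keeps the foreach body simple.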


Well technically it's not private info, considering if you were to just look in a phone book you could easily find the same info (name, phone number, and address)...but I guess you are right.

 

Thank you anyways, I guess I'll work on it some more on my own and if I can't figure it out I'll just hire someone to fix it for me.
