
Crawl a site's data?



What would be a good way to parse a site's data daily?

 

Basically, the site I am trying to gather data from lists newly sold items every day. Each item gets its own page (for example '6110.html'), and the number goes up with each new entry.

 

What I hope to accomplish is to parse the values for:

Retail price:

Sold for:

 

Eventually I want to use all the data to graph the average price per product, which will help immensely when using the site.

 

Any ideas will be appreciated.

https://forums.phpfreaks.com/topic/162313-crawl-sites-data/

How to get the values depends on how the site is set up. If you want help with that part, post the HTML source from the area around the values you want to extract. For running the script once a day, you'll probably want to look into cron jobs.
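To make the daily run more concrete, here is a rough sketch of a script that cron could call once a day. It is only an illustration: the example.com URL layout, the function names itemUrl() and fetchNewItems(), and the crontab paths are all assumptions, and it assumes the first missing page means there are no newer items yet.

```php
<?php
// Sketch of a script meant to be run once a day by cron, for example with a
// crontab line like this (both paths are placeholders):
//   0 3 * * * /usr/bin/php /path/to/crawl.php

// Hypothetical URL layout: one page per item, numbered upwards.
function itemUrl(int $id): string
{
    return "http://example.com/items/{$id}.html";
}

// Fetches pages with increasing ids, starting after $lastId, until a fetch
// fails or $maxPages is reached. $fetch is any callable that takes a URL and
// returns the page body, or false on failure.
function fetchNewItems(int $lastId, callable $fetch, int $maxPages = 50): array
{
    $pages = [];
    for ($id = $lastId + 1; $id <= $lastId + $maxPages; $id++) {
        $html = $fetch(itemUrl($id));
        if ($html === false) {
            break; // assumption: the first missing page means we have caught up
        }
        $pages[$id] = $html;
    }
    return $pages;
}

// Stand-in fetcher so the sketch runs without network access. Against a real
// site this could be something like: fn($url) => @file_get_contents($url)
$fake = function (string $url) {
    $id = (int) basename($url, '.html');
    return $id <= 6112 ? "<html>item {$id}</html>" : false;
};

$pages = fetchNewItems(6110, $fake);
echo implode(', ', array_keys($pages)), "\n";
```

The script would also need to remember the last id it saw between runs (a small file or database row would do), which is left out here.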


You can accomplish this with either file_get_contents or cURL to fetch the page, and preg_match_all to pull out the values. If you search this board or the PHP Regex section, you should find threads with similar ideas that will help. Good luck.
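As a minimal sketch of the extraction step, something like the following could work. The function name extractPrices() and the sample markup are made up for illustration, and the pattern assumes each label is followed, possibly after a few tags, by an amount such as $1,299.99. The real page may be laid out differently.

```php
<?php
// Hypothetical helper: pulls the two prices out of the HTML of one item page.
function extractPrices(string $html): array
{
    $prices = [];
    foreach (['retail' => 'Retail price:', 'sold' => 'Sold for:'] as $key => $label) {
        // preg_quote() escapes regex metacharacters in the label.
        // (?:\s|<[^>]*>)* skips whitespace and tags between label and amount.
        $pattern = '~' . preg_quote($label, '~')
                 . '(?:\s|<[^>]*>)*\$?\s*([\d,]+(?:\.\d+)?)~i';
        if (preg_match($pattern, $html, $m)) {
            $prices[$key] = (float) str_replace(',', '', $m[1]);
        } else {
            $prices[$key] = null; // label not found: layout differs from the assumption
        }
    }
    return $prices;
}

// Made-up sample markup, used here instead of a live request.
$sample = '<td>Retail price:</td><td><b>$1,299.99</b></td>'
        . '<td>Sold for:</td><td>$45.10</td>';
print_r(extractPrices($sample));
```

Against the real site, the HTML would come from file_get_contents($url) (which returns false on failure and needs allow_url_fopen enabled) or from cURL.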

 

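<?php
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

// Groups: 1 = opening tag, 2 = tag name, 3 = tag contents, 4 = closing tag.
// \\2 is a backreference, so the closing tag must match the opening tag's name.
// (.*?) is lazy, so a match stops at the first matching closing tag.
preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
    echo "matched: " . $val[0] . "\n";
    echo "part 1: " . $val[1] . "\n"; // opening tag
    echo "part 2: " . $val[3] . "\n"; // contents ($val[2], the tag name, is skipped)
    echo "part 3: " . $val[4] . "\n\n"; // closing tag
}
?>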

So based off of this example would you just do something like this?

 

change

$html = "<b>bold text</b><a href=howdy.html>click me</a>";

to

$html = "<b>bold text</b><a href=http://bidstick.com/latest/d/6401.html>click me</a>";

 

Fairly confused, any ideas on making this work?


Kind of.  You would have to match on those characters and patterns.

 

It would look something like:

 

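<?php
// Note: the markup in this snippet was lost in the archived copy of the post;
// the tags and pattern here are reconstructed to fit the notes below.
$html = "<b>bold text</b><a href=http://bidstick.com/latest/d/6401.html>click me</a>";
$pattern = "~^<b>(.*)</b>\s*<a href=(http://bidstick\.com/.*\.html)>click me</a>~i";
preg_match_all($pattern, $html, $matches);
echo "Bold: " . $matches[1][0] . "\nhref: " . $matches[2][0];

?>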

 

- The (.*) captures zero or more of any character in that position and adds the result to the $matches array.

- The '\s' takes care of whitespace.

- I had to escape each '.' because the dot is a special character (a wildcard). Once escaped, the pattern treats the dots as literal dots.

- The tildes (~) are my delimiters and you need them around your pattern.

- Finally the 'i' flag is for case-insensitivity.
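A small illustration of the escaping and case-insensitivity points above. The sample string is made up for the demonstration.

```php
<?php
$text = "HOWDY.html and howdyXhtml";

// Unescaped dot: '.' matches any character, so both words match.
preg_match_all('~howdy.html~i', $text, $loose);

// Escaped dot: only a literal '.' matches, so "howdyXhtml" is rejected.
preg_match_all('~howdy\.html~i', $text, $strict);

// Without the 'i' flag, the upper-case "HOWDY.html" no longer matches either.
preg_match_all('~howdy\.html~', $text, $caseSensitive);

// Prints the number of matches for each pattern: 2 1 0
echo count($loose[0]), ' ', count($strict[0]), ' ', count($caseSensitive[0]), "\n";
```

If part of a pattern comes from a variable (a file name, say), preg_quote() will do the escaping for you.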

 

For more information, read the tutorial here on PHP Freaks: Regular Expressions (Part1) - Basic Syntax. You should also look at the manual page for preg_match_all.

