What would be the best way to parse a site's data daily?

 

Basically, the site I am trying to gather data from has new sold items every day (for example '6110.html'; with each new entry the number goes up).

 

What I hope to accomplish is to parse the values for:

Retail price:

Sold for:

 

Eventually I want to use all of that data to graph the average price per product, which will help immensely when using the site.

 

Any ideas will be appreciated.

Link to comment
https://forums.phpfreaks.com/topic/162313-crawl-sites-data/
Share on other sites

How you get the values out depends on how the site's pages are set up. If you want help with that part, post the HTML source from the area around the values you want to extract. For running it once a day, you'll probably want to look into cron jobs.
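
To make the cron suggestion a bit more concrete, here is a minimal sketch of a script that could be run once a day and appends that day's prices to a CSV file. The crontab line, the paths, the function name append_daily_row and the sample numbers are all placeholders, not anything taken from the site in question.

```php
<?php
// Hypothetical crontab entry (paths are placeholders) that would run
// this script at 06:00 every day:
//   0 6 * * * /usr/bin/php /path/to/fetch_prices.php

/** Append one dated line of prices to a CSV file. */
function append_daily_row(string $csvPath, string $date, float $retail, float $sold): void
{
    // %.2F formats with a "." decimal point regardless of locale.
    $line = sprintf("%s,%.2F,%.2F\n", $date, $retail, $sold);
    file_put_contents($csvPath, $line, FILE_APPEND | LOCK_EX);
}

// Demo with made-up numbers, written to a temporary file.
$csv = tempnam(sys_get_temp_dir(), 'prices');
append_daily_row($csv, date('Y-m-d'), 19.99, 3.42);
echo file_get_contents($csv);
```

In a real run the two price arguments would come from whatever parsing code you end up with, and the CSV path would be a fixed file so the rows accumulate from day to day.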


You can accomplish this by fetching the page with either file_get_contents or cURL and then extracting the values with preg_match_all. If you search here or in the PHP Regex section, you should find threads with similar ideas that will help. Good luck.

 

<?php
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

preg_match_all("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $val) {
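    echo "matched: " . $val[0] . "\n"; // the whole match
    echo "part 1: " . $val[1] . "\n";  // the opening tag
    echo "part 2: " . $val[2] . "\n";  // the tag name
    echo "part 3: " . $val[3] . "\n";  // the text between the tags
    echo "part 4: " . $val[4] . "\n\n"; // the closing tag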
}
?>

So, based on this example, would you just do something like this?

 

change

$html = "<b>bold text</b><a href=howdy.html>click me</a>";

to

$html = "<b>bold text</b><a href=http://bidstick.com/latest/d/6401.html>click me</a>";

 

I'm fairly confused. Any ideas on making this work?


Kind of. You would have to write a pattern that matches the specific characters and tags surrounding the values you want.

 

It would look something like:

 

<?php
$html = "<b>bold text</b><a href=howdy.html>click me</a>";
$pattern = "~^<b>(.*)</b><a\shref=(.*)\.html>click me</a>~i";
preg_match_all($pattern, $html, $matches);
echo "Bold: " . $matches[1][0] . "\nhref: " . $matches[2][0];

?>

 

- Each (.*) captures zero or more characters at that position and adds them to the $matches array.

- The '\s' matches a whitespace character.

- I had to escape the '.' because an unescaped dot is a special character (a wildcard). Once escaped, the pattern treats it as a literal dot.

- The tildes (~) are my delimiters; you need a delimiter at each end of your pattern.

- Finally, the 'i' flag makes the match case-insensitive.
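
A quick way to see the notes above in action is to run a few tiny patterns side by side. This is only an illustrative sketch; the test strings are made up for the demo.

```php
<?php
// Unescaped dot: matches any single character, so "howdyXhtml" also matches.
$loose = preg_match('~howdy.html~', 'howdyXhtml');

// Escaped dot: only a literal "." is accepted, so this one fails.
$strict = preg_match('~howdy\.html~', 'howdyXhtml');

// \s matches a whitespace character; \s+ allows one or more of them.
$space = preg_match('~<a\s+href=~', '<a   href=howdy.html>');

// The "i" flag ignores case, so an upper-case tag still matches.
$tag = preg_match('~<b>(.*)</b>~i', '<B>bold text</B>', $m);

echo $loose, ' ', $strict, ' ', $space, ' ', $tag, ' ', $m[1], "\n";
// prints: 1 0 1 1 bold text
```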

 

For more information, read the tutorial here on phpfreaks, "Regular Expressions (Part 1) - Basic Syntax". You should also take a look at the manual's documentation for preg_match_all.
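
Pulling the thread together for the original question, here is a rough sketch of extracting the "Retail price:" and "Sold for:" values. The real page source was never posted, so the sample markup, the dollar sign, the number format and the function name extract_prices are all assumptions; the pattern would need adjusting to whatever the page actually contains.

```php
<?php
/**
 * Pull the "Retail price" and "Sold for" amounts out of a page's HTML.
 * A value that cannot be found is returned as null.
 */
function extract_prices(string $html): array
{
    // Drop the tags so labels and amounts sit next to each other as text.
    $text = strip_tags($html);

    $prices = ['retail' => null, 'sold' => null];
    $labels = ['retail' => 'Retail price', 'sold' => 'Sold for'];

    foreach ($labels as $key => $label) {
        // Label, colon, optional whitespace and "$", then the amount.
        $pattern = '~' . preg_quote($label, '~')
                 . ':\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)~i';
        if (preg_match($pattern, $text, $m)) {
            // Remove thousands separators before converting to a number.
            $prices[$key] = (float) str_replace(',', '', $m[1]);
        }
    }
    return $prices;
}

// Hypothetical sample markup standing in for a fetched page.
$sample = '<td><b>Retail price:</b> $1,299.00</td>'
        . '<td><b>Sold for:</b> $27.45</td>';
print_r(extract_prices($sample));
```

For a live page, the sample string would be replaced by the result of file_get_contents (or cURL) on a URL such as the one posted earlier in the thread, with a check that the fetch did not return false.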

