
Hi guys,

 

What I'm trying to accomplish sounds like a fairly easy task, but due to my poor knowledge of PHP it has turned out to be quite a challenge.

 

What I'm trying to do is make a PHP script that will search for keywords on multiple websites.

The websites I will search are all web shops selling spare parts for home appliances, and the keywords used are usually original spare part codes.

 

When searching, the script uses each website's own search function, not Google or other search engines.

 

The input for this script should be a CSV file containing a list of keywords and the URLs of the web shops that need to be searched for all these keywords.

Here is the example:
http://prntscr.com/4ebhxh

 

The script should perform like this:

It picks up the 1st keyword, browses to URL1, uses that site's search to look for the product and, if it finds it, copies its price and writes it back to the original input CSV. If it doesn't find a match (the search results come up empty), it should write "no match found" and continue to URL2, URL3, and so on...

When all URLs on the list have been checked for the 1st keyword, the script picks up the 2nd keyword and continues until all keywords have been checked.
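To make that flow concrete, here is a minimal sketch of the outer loop, assuming the input CSV has the shop URLs in its first row (after a header cell) and one keyword per following row; search_site() is a hypothetical stub standing in for the real per-site fetch-and-parse logic:

    <?php
    // Minimal sketch of the keyword/URL loop described above.
    // ASSUMPTIONS: input.csv row 1 = header cell + shop URLs, each
    // following row starts with a keyword. search_site() is a
    // hypothetical stub for the real per-site search logic.

    function search_site($url, $keyword) {
        // real fetch-and-parse code goes here; should return an
        // array of price strings found for $keyword on $url
        return array();
    }

    $in  = fopen('input.csv', 'r');
    $out = fopen('results.csv', 'w');

    $urls = array_slice(fgetcsv($in), 1);   // skip the keyword header cell
    fputcsv($out, array_merge(array('keyword'), $urls));

    while (($row = fgetcsv($in)) !== false) {
        $result = array($row[0]);           // the keyword itself

        foreach ($urls as $url) {
            $prices = search_site($url, $row[0]);

            if (count($prices) === 0)   $result[] = 'no match found';
            elseif (count($prices) > 1) $result[] = 'More than one match';
            else                        $result[] = $prices[0];

            sleep(2);                   // small delay between requests
        }
        fputcsv($out, $result);
    }

    fclose($in);
    fclose($out);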

 

This is what the resulting CSV file would look like after the 1st keyword has been checked:
http://prntscr.com/4ebj52

After all the data from the input CSV file has been processed, the script should display a message and create a download link for the resulting CSV file.

If there are multiple matches, in other words if a website's search finds 2 or more products for one keyword, something like "More than one match" should be written in the file. Example:

http://prntscr.com/4ebkcx

Please note that none of the websites use SSL and none of them require a login in order to display prices.
This fact should make the script easier to build.

It's not important for this script to run fast (I think it's actually better to run it with some timeouts, because of server glitches and bottlenecks). What is more important is to make it automatic, so one can start it overnight or over the weekend.

The number of URLs would be around 10, and the list of keywords would range from a few dozen to a few hundred.

 

If I can provide any additional clarification or info, I'm available.
Of course, I would be willing to pay someone to help me accomplish this task.

Cheers
Dean

Edited by dolke022

Uhhh, not a PHP question per se here, but let me ask you: just how do you know exactly how a search is conducted on each and every website? Is there some "standard" for web designers to follow when adding a search button to their site that I have not noticed when I browse sites daily? When I'm trying to find something, I have to scan the whole page (or pages) looking for how to do a search on that specific site, and I'm using my eyes and intelligence to find it. Are you talking about a non-AI application that will do this for you? Or are these sites all identical, so you already know how to engage their search protocol?

Hi ginerjm,

 

No, they are not identical, although some of them use the same engines, like PrestaShop, Magento, and so on...

 

From what I've heard from people I talked to, the search function uses an HTML search box or some similar element, so I guess the script I'm trying to make will go through the whole webpage (the homepage) and try to find these search-specific elements in order to locate it.

This is also an open question, and that is one of the reasons I need help from this community :)


 

I made a site like this at one of my first jobs.  You should not be doing these queries in real-time.  You need to write a spider for each of the sites you want to search, and have it constantly running and indexing each of these sites.  Each spider will be custom-tuned to the site in question to properly pull item titles, product codes, descriptions, prices, and images (or whatever set of data you need).  Cache all this information locally in a database along with a raw HTML copy of the page you got it from and links out to the original URL where the information was found.

 

Then you need to work on fine-tuning your search database and algorithms locally to produce acceptable search speeds and results.  Once people find what they're looking for on your site, they can click over to the indexed site directly (much like an actual search engine works).  There is no drop-and-go solution to this, it's a multi-step process of spidering, locally caching, indexing, search optimization, and results presentation.


Thank you for replying, but I believe you didn't quite catch what I'm trying to make.

 

The sites I want to search are not owned by me. They are various websites from my country as well as from other countries.

What they all have in common is their field of business (spare parts for home appliances).

 

I don't want to index them, nor do I need all the information they contain.

 

All I want is to check whether these websites have an item/product with the keyword somewhere in its name or description, and to record its price in CSV format. The reason for using CSV files is that I want to automate this process in order to check the same thing for multiple keywords on multiple websites.

 

I don't need the results to be stored in a DB; this is nowhere near that complex.

 

Please note: I'm not trying to make a website, just a single web page that will execute this script. It actually doesn't need to be a webpage at all. If I can make it work as a standalone desktop application, that solution works for me too...

Edited by dolke022

I know they're not your sites; that's why you need to make a copy of them yourself first.

 

If you don't want to do that, and you're OK with each request taking a few minutes to return (rather than a few seconds, like a normal website would), then you still need to write one script for each of these pages which will perform the search (use the PHP class Snoopy for this), parse the results (using either regular expressions or the DOMDocument extension), and output them to a CSV (using fputcsv).
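As a rough illustration of that pipeline for one site, here is a minimal sketch. It uses plain cURL in place of the Snoopy class mentioned above and takes the regular-expression route for parsing; the example URL, the "search" query parameter, and the price markup are all placeholders that would need adjusting to the real site.

    <?php
    // Sketch of one per-site script: fetch search results, parse with
    // a regex, append to a CSV. Uses cURL instead of the Snoopy class
    // suggested above. The URL, the "search" parameter, and the price
    // markup are placeholders; inspect the real site for the actual ones.

    $keyword = '481946078134';  // example spare part code

    $ch = curl_init('http://example-shop.example/?search=' . urlencode($keyword));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);

    // placeholder pattern: whatever element wraps prices on this shop
    preg_match_all('#<span class="price">([^<]+)</span>#', $html, $matches);

    $out = fopen('results.csv', 'a');
    fputcsv($out, array($keyword, implode(' / ', $matches[1])));
    fclose($out);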

Well, making local copies of these websites doesn't make much sense, because I'm comparing prices, right...

Let's say I'm making these comparisons twice a week. That means I would need to download local copies of 10 websites twice a week before running the script, because the prices I want to extract might have changed in the meantime...

Can you explain to me why a request would take so long? I'm really new to PHP and my knowledge is quite limited.
Also, why the need for multiple scripts?

 

You wouldn't be "copying the website". You'd be storing a local copy of the HTML results page from the other site so you can parse it locally. What you are trying to do isn't a simple task.

 

You wouldn't need "multiple scripts". You'd need a different parser for each type of site, which can just be different methods in the same class.  For instance, if you wanted to search google remotely it is different than searching MSN. The query strings in the URL are different and the results are also in a different HTML format, so you'd need to know that in order to scrape the data/parse the DOM structure properly. So you'd need a way to identify each "type" of site, like you mentioned Magento, and parse it with the proper parser designed for each type. So you'd need to know ahead of time what each site uses and then use the proper parser for each site.

 

Site A uses Magento

Site B uses ClearCart

Site C uses OpenCart

Site D uses Magento
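A minimal sketch of that idea, with one parser method per engine in a single class; the site-to-engine map and the method bodies are placeholders:

    <?php
    // Sketch of one class with a parser method per shop engine.
    // The URL-to-engine map and the parse methods are placeholders;
    // each engine's real markup differs and needs its own logic.

    class PriceScraper
    {
        private $engines = array(
            'http://site-a.example' => 'Magento',
            'http://site-b.example' => 'ClearCart',
            'http://site-c.example' => 'OpenCart',
            'http://site-d.example' => 'Magento',
        );

        // dispatch the raw results HTML to the right engine parser
        public function parse($siteUrl, $html)
        {
            $method = 'parse' . $this->engines[$siteUrl];
            return $this->$method($html);
        }

        private function parseMagento($html)   { /* Magento-specific parsing */   return array(); }
        private function parseClearCart($html) { /* ClearCart-specific parsing */ return array(); }
        private function parseOpenCart($html)  { /* OpenCart-specific parsing */  return array(); }
    }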

 

In order to speed up the process, you'd want to use asynchronous requests, so that you don't have to wait for the result from Site A before getting the next one from Site B. That way they can all be executed at the same time, basically.
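In PHP that would typically mean curl_multi; a minimal sketch, with placeholder search URLs, might look like this:

    <?php
    // Sketch of firing the same search at several sites at once with
    // curl_multi, so no request waits on another. URLs are placeholders.

    $urls = array(
        'http://site-a.example/?search=481946078134',
        'http://site-b.example/?search=481946078134',
    );

    $mh      = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // run all transfers until every one has finished
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);  // full response body for this site
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // hand $html to the matching engine parser here
    }
    curl_multi_close($mh);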

Yeah, none of this is extremely difficult, but it will be a LOT of work. I have built a screen scraper previously (i.e. a script that reads a remote web page and extracts certain data). It is a very laborious, tedious process. You will need to manually run searches and inspect the HTML results for each site to identify how the results pages are built. Then you need to build code for each site in order to extract the information you want.

And the output for a given site will not always be consistent, so you need to go through a lot of repetition to determine what differences there may be and account for them. For example, you need to account for what the output looks like when there are no matches vs. actual matches. What if there are a lot of matches spread across separate pages? You may need to build logic to traverse those pages to get all the data. Or, in some cases, the actual HTML format could differ between searches, such as when some products have images and others do not. Depending on how the HTML is built, the parsing may need to be different. Or, what if the product is out of stock?

Then, after you go and do all of that for one single site - BAM!, they change their layout and all your work is down the drain.

 

Plus, you state you want to do this for possibly hundreds of products? That's crazy. That would take a long time and could even get identified as malicious activity. You should definitely build a database and update it with all the products from each site on a regular basis. Then run the searches against your own database. I would also suggest reaching out to these sites to see if they have a service providing their current product list and pricing, rather than building a screen scraper.

Edited by Psycho

Guys, thank you for detailed explanation.

 

I find this whole thing much clearer now.

Psycho asked: what if there are a lot of matches spread across separate pages?

This can't happen. There are no separate pages: the search results page is the only page from which data (the price) is extracted, so at most there can be multiple matches on that single results page. Also, it doesn't matter whether the product is out of stock or not. The script will try to find the price (I'm guessing each shop has this price in some specific tag or format, I don't know...), so if the product is out of stock and the price next to it is gone, the script will return nothing, a blank field in the CSV file...
 

Or am I getting this all wrong...?

Btw, you said that I need to build a DB and then run through it manually. This is not an option.
Also, asking my competitors for a full list of their products with prices is mission impossible.

 

Edited by dolke022

Building a database on your end, with constantly-running spiders, is still the proper solution. If you're using this for competitive intel so you can set your own pricing, that means you're likely doing LOTS of searches against all these sites, and you'll likely be pulling down most of their product catalogs anyway. Therefore, my initial advice still applies.

 

If a database isn't an option for reasons you cannot share, then you still need to write multiple scripts, one for each site you wish to search, and have those scripts take in arguments on what to search for.  These scripts will have to manually query the page, manually retrieve the search results, and manually look for the pricing.  There are no magic functions here, and based on my experience doing exactly this same thing the sites will either block you outright or start screwing with your spiders.  You will need constant vigilance on your data with extremely flexible scraping activity in order to maintain this project.

 

You also will likely need a lawyer.  

Psycho asked: what if there are a lot of matches spread across separate pages?

This can't happen. There are no separate pages: the search results page is the only page from which data (the price) is extracted, so at most there can be multiple matches on that single results page. Also, it doesn't matter whether the product is out of stock or not. The script will try to find the price (I'm guessing each shop has this price in some specific tag or format, I don't know...), so if the product is out of stock and the price next to it is gone, the script will return nothing, a blank field in the CSV file...

 

What do you mean it can't happen? How do you know? What if there are 26 matches for your search and the results are displayed in a paginated fashion with 25 results per page? Unless you have already done that analysis and know that no particular search will pull many records, you are just making assumptions.

 

As for the "out of stock" scenario, you are again assuming that the price would be displayed or, if not displayed, the layout of the HTML would be the same so you can 'check' the price field. But, what if the HTML layout for an out of stock product does not match in-stock products? The logic to analyze the page could fail. Like I said, this is a very laborious, tedious process because you have to go through a lot of trial and error to produce all the possible outcome and then analyze the HTML source to determine how to build the logic. And, as stated above, if any site makes any changes your script will likely break.

 

Why is a DB out of the question? It would definitely be easier. Plus, you could use it to "run" the scripts, determining which products to search, rather than entering them manually.

Edited by Psycho

I appreciate everyone's help and arguments, don't get me wrong, and I'm glad this discussion has gone so far and so wide.

Still, I'm not sure you guys are viewing this from the right perspective.

I will try to clarify even further...

 

Dan is right when he says that I need this competitive intel so I can set my own prices accordingly. On the other hand, I'm not running searches against them, and I'm not pulling any kind of catalogs, but sometimes I do use some valuable intel for my website. They all do the same, because my website is the first in the field atm.

I noticed that you are mostly from the US. Well, I'm from Serbia, and due to the lack of certain rules and regulations, especially when it comes to internet sales, this is legit.

I browse these websites on a daily basis, so I could manually search for all these products, find their prices, and update my own accordingly. So basically I'm trying to find an automatic way of doing that, without needing to browse each of these sites, check whether they even have the product, check the price, and then finally set my own accordingly. I want to automate the process, not necessarily make it faster. If this looks like spammer activity, maybe some timeouts between each request to the server would be a good solution.

 

It's legit even in the States, at least if you're doing it manually, and believe me, I won't be needing a lawyer here, at least not yet :)

Actually, I don't care how this gets implemented: DB or not, desktop application or web, PHP or some other language.

In short, I need the results to be similar to this:
http://prntscr.com/4edui6

 

Also, I know 25 results per page can't happen, because my keywords are original spare part codes, so the results can only be 0 or 1 product.

The layout in the out-of-stock scenario is the same on all these websites. I have checked multiple times.
 

Edited by dolke022

Start by analyzing the HTML content of the search results and building the code to extract the content you want from the page. But trying to do this without a DB AND creating a combined output in CSV will be problematic. Performing requests to many sites for many products will take a long time. So either you would have to run the script from the command line (so it won't time out), or you would have to set it up to refresh after each request. And doing the latter would complicate creating the CSV file, since you would have to open it, write to it, and then close it on each execution of the script.
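For what it's worth, the open-write-close pattern itself is small; a minimal sketch of a command-line run that appends one row per invocation (the script name, arguments, and gather_prices() helper are all hypothetical) could look like this:

    <?php
    // Sketch: run as "php check_one.php <keyword>" from the command line
    // so no web-server timeout applies; appends one row per invocation.
    // gather_prices() is a hypothetical stand-in for the per-site searches.

    if (!isset($argv[1])) {
        die("usage: php check_one.php <keyword>\n");
    }

    function gather_prices($keyword) {
        // per-site search and parse logic goes here
        return array('no match found');
    }

    $row = array_merge(array($argv[1]), gather_prices($argv[1]));

    $out = fopen('results.csv', 'a');   // open, append a single row, close
    fputcsv($out, $row);
    fclose($out);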

 

So, start by creating the process for a single site to perform a search, get the results and parse the results. Then you can move to the process of creating a framework to run the code for each site and output the results.

 

I would suggest using simple_html_dom.
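A minimal sketch with simple_html_dom, assuming its simple_html_dom.php file has been downloaded next to the script, and using a placeholder ".price" selector and example URL:

    <?php
    // Sketch using simple_html_dom as suggested above. ASSUMPTIONS:
    // simple_html_dom.php sits next to this script, and the URL and
    // ".price" selector are placeholders for the real site's markup.

    include 'simple_html_dom.php';

    $html = file_get_html('http://example-shop.example/?search=481946078134');

    $prices = array();
    foreach ($html->find('.price') as $el) {   // selector differs per site
        $prices[] = trim($el->plaintext);
    }

    print_r($prices);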

Edited by Psycho

Guys, from what you have had to say on this topic, I'm quite sure that I'm not equipped to deal with this type of task.

 

Would any of you care to help me make this script? Of course, I would be willing to discuss an honorarium.

 

Also, we don't need to fulfill all the requirements I listed. Let's start with 1 or 2 URLs, and then we can add more methods to the script.

 

Anyone interested?

Cheers
Dean

After all the advice you've been given, you still believe that this is a good idea? You have to realize that you are talking about trying to write code that examines an incoming webpage that can change at any time, and that your code won't change unless you find out that happened. How are you going to do that? Plus, each website is most likely (definitely?) going to be different, and you will have to have a different method of reading each of them. This is a job that involves flying blind most of the time, and you want to recruit people to do it for you? Great if you can, but what happens once they have been paid (a lot, probably!) and you need those changes I mentioned? Do you think they'll come back, or that you will be able to maintain the code?

 

Doing this for just one website is quite possible - I've done it myself.  Doing it for many sites is just a monumental task with tons of pitfalls.

Yeah, I'm not even going to touch that. I already stated it will be a LOT of work and, as ginerjm just said, any change in the layout of a site would break the code. If you want to do this, you need to learn to do it yourself, because you are going to need to support it as those changes occur.

 

I already suggested that you do this for one site first. Once you've done that you will have an idea of the full scope of work. Here are the basic steps you would want to accomplish:

 

1. Determine how a search is done on the site: POST or GET. If it is GET, doing a search is easy: just append the search string to the URL using the appropriate name/value pairs. If it is POST, you can use cURL, or (with PHP5) you can use something such as shown in this thread: http://stackoverflow.com/questions/5647461/how-do-i-send-a-post-request-with-php
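A minimal sketch of both cases, where the search URL and the field name "q" are assumptions to be replaced with whatever the site's search form actually uses:

    <?php
    // Sketch of step 1. The URL and the "q" field name are placeholders;
    // inspect the site's search form to find the real action and fields.

    $keyword = '481946078134';

    // GET-based search: just build the URL
    $getHtml = file_get_contents('http://example-shop.example/search?q=' . urlencode($keyword));

    // POST-based search: submit the form fields with cURL
    $ch = curl_init('http://example-shop.example/search');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('q' => $keyword)));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $postHtml = curl_exec($ch);
    curl_close($ch);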

 

2. Next, run some searches through your browser on the site and inspect the HTML source code of the results. Make sure you do a LOT of searches to try to detect any variability in the results page and how the content is structured: no results, one result, multiple results. You say there would only ever be one result. Go ahead and use that assumption, but I would not. What if a search did not find a match, but the site instead returned something that was "close"?

 

3. Once you have analyzed all the variations in the output of the search results, build the logic for that site to extract the data you want: i.e. the price.
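For step 3, a minimal sketch with DOMDocument/DOMXPath, mapping the match count onto the CSV values from earlier in the thread; the XPath expression is a per-site assumption:

    <?php
    // Sketch of step 3: extract the price(s) from a results page with
    // DOMXPath. The //span[@class='price'] expression is a placeholder
    // that must be tailored to each site's actual markup.

    function price_for_csv($resultsHtml) {
        $doc = new DOMDocument();
        @$doc->loadHTML($resultsHtml);       // quiet warnings from messy HTML

        $xpath = new DOMXPath($doc);
        $nodes = $xpath->query("//span[@class='price']");

        if ($nodes->length === 0) return 'no match found';
        if ($nodes->length > 1)   return 'More than one match';
        return trim($nodes->item(0)->textContent);
    }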

 

If you get that far for one site, you will then understand how much work would be involved for multiple sites. Of course, that doesn't cover the process of taking the results from multiple sites and putting them into a combined output.
