
I will keep the description general.

 

 

Let's say there's a third party database website that holds default information about an object, and every object has a unique ID.

 

I want to create a review website where:

 

1. people submit their chosen object along with its unique ID

2. a bot gathers the default information about the object from the third-party database, using the user-entered unique ID to know which object to look up

3. the user writes their review and gives it a rating

4. the user submits their submission

5. (finally) other users can also add their review + rating to that already-submitted object

 

 

Everything from user submission to rating and writing reviews is, to me, simple PHP & MySQL, but the critical point is the part where a bot goes and gathers default information about the object from a third-party database website. I don't know much about bots and I'd like to have the experts weigh in. Are there any critical problems that could occur with such a design? I imagined that the bot could be programmed with PHP/cURL.
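For what it's worth, the fetching part really can be just a few lines of PHP/cURL. A minimal sketch, assuming the third-party site exposes object pages at a URL like `https://example-database.test/object/{id}` (a placeholder, not a real service):

```php
<?php
// Minimal PHP/cURL fetcher sketch. The URL pattern is an assumption;
// a real third-party database site will have its own lookup URL.
function fetch_object_page(string $id): ?string
{
    $ch = curl_init('https://example-database.test/object/' . urlencode($id));
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects to the description page
        CURLOPT_TIMEOUT        => 10,    // don't hang the review submission forever
    ]);
    $body = curl_exec($ch);
    $ok   = ($body !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200);
    curl_close($ch);
    return $ok ? $body : null;           // null signals "could not gather defaults"
}
```

Returning `null` on failure lets the submission flow degrade gracefully instead of breaking when the remote site is down.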

 

When I talk about a third-party database website, I mean those typical lookup sites, of which there are many out there: music licensing websites, ISBN websites, movie credentials websites, and the list goes on.

 

 

A good thing is that those websites do not require a CAPTCHA: you simply enter the ID and it gives you all the default information about the object.

 

So is there anything that I should worry about with this type of design?

 

Do you think I could manage to program a bot with PHP/CURL to do this task?

 

If you were an investor would you like such an idea?

If I were an investor I'd be asking 'how will this make me money?' From a developer's point of view, though, the idea sounds feasible, although I have a few thoughts...

 

How would the user know this object ID? It sounds like you're asking them to provide the... whatever the equivalent of a product's manufacturer code is, just to access it. You can't really expect the user to go off to Google and spend five minutes trying to find it, because they just wouldn't do it. Also, I'm not really familiar with the websites you mentioned, but do they allow you to crawl their sites? All of them? How would you know which web service to use for each object ID?

 

When writing a bot you have to consider that if they notice your requests to their website (i.e. you fire several per second, or several over a couple of seconds), they'll block you. At work we occasionally get requests to block this IP or that, either because they're causing performance issues or because they've just been noticed crawling the site a lot. You have to limit your connections to stay 'under the radar'. This may be different, however, if it's just a single request as the user enters the content. Though if you do get blocked, would certain content on your website fail to work?
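Limiting connections can be as simple as a shared timestamp that every bot run must respect. A minimal sketch, where the lock-file path and interval are assumptions for illustration:

```php
<?php
// Simple throttle sketch: never fire more than one outbound request per
// $minIntervalSeconds, tracked via a lock file's modification time.
function throttle(string $lockFile, int $minIntervalSeconds): void
{
    $last = @filemtime($lockFile) ?: 0;       // 0 if the file doesn't exist yet
    $wait = $last + $minIntervalSeconds - time();
    if ($wait > 0) {
        sleep($wait);                         // back off until the interval has passed
    }
    touch($lockFile);                         // record this request's timestamp
}
```

Called before each cURL request, this keeps a naive bot from ever bursting, which is usually what gets an IP noticed.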

Do you think I could manage to program a bot with PHP/CURL to do this task?

 

Sure, just make sure you set:

 

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

 

That helps avoid being blocked; only very few sites actually check the IP address :)

Sure, and even then you can still use proxy servers, or shared hosting servers across the world (at least if the data you fetch brings in lots of money), or ..
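Routing through a proxy is a one-option change in cURL. A sketch, with a hypothetical proxy address standing in for whatever you'd actually rent:

```php
<?php
// Sketch: route the bot's requests through an HTTP proxy as a fallback
// if its own IP gets blocked. Proxy host/port are placeholders.
$ch = curl_init('https://example-database.test/object/12345');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_PROXY          => 'proxy.example.test:8080',  // hypothetical proxy
    CURLOPT_PROXYTYPE      => CURLPROXY_HTTP,
]);
// $body = curl_exec($ch);  // request now originates from the proxy's IP
```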

 

PS: I am not trying to be smart here; I hate this just as much as you do. Not only are we crawling websites every day, we also have to protect against it. And sometimes we make arrangements/hold meetings to avoid a lawsuit :D

OK, you guys gave excellent suggestions and thoughts which will make me think this through much more deeply; that was the whole reason for this thread. Thanks so far.

 

MrAdam, the ID is on the object itself (think of an ISBN on a book). And to answer the question of which service to use: there are multiple database websites out there, but I personally like only one of them. The input box is right on the start page of every one of those websites, so the bot only has to go to the start page and enter the ID; after that, the description page of that object immediately comes up. All the bot has to do then is fetch the simple text data and bring it back to my website.
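Pulling "simple text data" out of the description page is straightforward with DOMDocument. A sketch that grabs the first two paragraphs; the real site's markup will differ, so treating `<p>` tags as the payload is an assumption:

```php
<?php
// Sketch: extract the first $limit paragraphs of text from fetched HTML.
// Assumes the description text lives in <p> elements, which a real site
// may or may not do.
function extract_paragraphs(string $html, int $limit = 2): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);       // suppress warnings from imperfect markup
    $out = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $out[] = trim($p->textContent);
        if (count($out) >= $limit) {
            break;
        }
    }
    return $out;
}
```

A DOM parser is far more robust here than regexes, since real-world pages rarely have tidy markup.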

 

Both of you mentioned something very critical: what if the bot gets blocked? That could break the whole design of the website and make it useless, which makes it very risky. All I could do from that point on is ask the owners of the website for permission and maybe even offer them a deal, perhaps a share of the profits.

 

The way I've understood you guys, if all else fails I can still use a proxy server as a last resort; is that correct? Does this mean that even if they block the bot, I can still have the service run through a proxy?

 

Bottom line is that this idea really can work, but it has critical points.

 

 

When you guys talk about "abusive bots", what magnitude are we talking about, just so I can relate to it? Say the bot goes to the start page, enters the ID, and fetches 2 paragraphs of text on the next page, and that's it. And let's say this procedure happens up to 10 times a day; would that be too much?

 

When the Google Bot reaches my website it takes around 6 MB of bandwidth, but the Google Bot accesses every bit of the website, whereas my bot would only access two pages and fetch 2 paragraphs of text data, and those pages are text-only on a white background, which means there's not too much to load.

 


MrAdam, the ID is on the object itself (think of an ISBN on a book). And to answer the question on which service to use, there are multiple database websites out there, but I personally like only one of them.

 

You keep referring to "object" - what exactly are you talking about here? Giving it a generic name like that suggests it could be a whole mix of things: books, movies, etc. What I meant before about which service to use was: how will you know which type of object the ID is, so you can collect the data from the right website? Also, how will the user know what the ID is?

 

When you guys talk about "abusive bots", in which dimensions are we talking about, just so I can relate to it. If a bot goes to the start page, enters the id and fetches 2 paragraphs of text on the next page, and that's it. And let's say this procedure happens up to 10 times a day, would that be too much?

 

In that case I highly doubt you have anything to worry about, although you may need to re-think things if the site becomes popular and there are a lot more requests.

Nobody can answer your questions. It entirely depends on the type of information you're pulling, what it means to the company providing it, what type of copyright they assert, and what business model they have.

 

Maybe I don't get it, but I can't see how you could have any business based on the premise of 10 visits a day. I'm guessing you mean 10 visits per day per ID, and there may be hundreds or thousands of these IDs. If you don't think you will get noticed by them (assuming they care), you're probably going to be wrong.

 

If your entire business is going to depend on their data, you should look into making a deal to license it from them at some point.

The number 10 is not referring to visits, it's referring to SUBMISSIONS, since it's a review website. The bot only goes and fetches the default data about an object if a correct ID has been entered and submitted together with the user's review + rating.

 

I named the number 10 because I'd consider 10 submissions a day quite successful. Remember that once a submission has ALREADY been made and the object is registered and listed on the website, following users will add their review to that existing submission; they will not be able to re-submit something that has already been submitted.
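That "fetch only on first submission" rule is just a local lookup before the bot ever runs. A sketch, where the table and column names are assumptions:

```php
<?php
// Sketch: only dispatch the bot when the object isn't already registered
// locally. Table/column names ("objects", "external_id") are assumptions.
function needs_fetch(PDO $db, string $objectId): bool
{
    $stmt = $db->prepare('SELECT 1 FROM objects WHERE external_id = ? LIMIT 1');
    $stmt->execute([$objectId]);
    return $stmt->fetchColumn() === false;   // true => bot must fetch defaults
}
```

This guarantees repeat reviewers of the same object never cost the third-party site another request.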

 

If you're saying that 10 submissions per day is in the green area, then this is not that big a worry as of now. But we should also not forget the unsuccessful queries: when, for example, somebody enters a wrong ID, the bot will still go and try to fetch, which wastes bandwidth, and the same happens if they enter an ID that the database website does not contain.
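Many of those wasted requests can be killed locally. Since the thread's running example is an ISBN, here is a sketch of an ISBN-10 check-digit validation that rejects obviously bad IDs before the bot fires anything; a different ID scheme would of course need its own rule:

```php
<?php
// Sketch: validate an ISBN-10 locally so malformed IDs never cost a request.
// An ISBN-10 is valid when sum(weight_i * digit_i) with weights 10..1 is
// divisible by 11; 'X' in the last position stands for the value 10.
function is_valid_isbn10(string $isbn): bool
{
    $isbn = str_replace('-', '', $isbn);
    if (!preg_match('/^\d{9}[\dX]$/', $isbn)) {
        return false;                       // wrong shape: reject immediately
    }
    $sum = 0;
    for ($i = 0; $i < 10; $i++) {
        $digit = ($isbn[$i] === 'X') ? 10 : (int) $isbn[$i];
        $sum  += (10 - $i) * $digit;        // weights run 10 down to 1
    }
    return $sum % 11 === 0;
}
```

This only filters malformed IDs; a well-formed ID that the remote database simply doesn't contain still has to be discovered by asking, so caching negative results locally would help there.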

 

So let me formulate my question differently: when do I really have to start worrying? Remember, I'm just trying to get a feel for this as of now. It is good to know that even if the website gets a lot of submissions a day I can still fall back on a proxy. If all else fails I'll have to offer the database website a deal. And of course I'll have to check that I'm not violating any laws, as gizmola mentioned.

 

 

 

If you're talking about writing a script that takes a site's intellectual property (IP) and redisplays it on your own site without any permission from the IP's owner, you're at least in a gray area, or way over the line of what's considered "Fair Use", when you're grabbing more than a couple of sentences to quote in a review. And the cryptic "like an ISBN of a book" - but not that - is raising all kinds of red flags for me.

 

If you're over the line there, it's not a matter of being flagged as a bot, but waiting around for the cease and desist letter and/or lawsuit regardless of how you're gathering that data.
