Over the past month, visitors to one of my sites have grown exponentially, currently over 100K per day and increasing. I'm getting worried that the servers can't handle it, and it's only a matter of time before they crash.

 

I've split my CSS, images, and JavaScript onto subdomains, with plenty of caching and zlib.output_compression. I'm using 14 separate databases, now over 250MB each, and 2 servers. That's all I can think of to do.

 

 

Load balancing is your next route. Are you sure your visitors are actually human and you're not being hit by robots? Most of our adult projects get hit by scrapers etc. all the time, and trapping them is key, as they cause huge amounts of server load.
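
If you do go the load balancing route with Apache, mod_proxy_balancer on a front-end box can spread requests across your two servers. This is only a rough sketch; the backend hostnames are placeholders, and you'd need mod_proxy, mod_proxy_http and mod_proxy_balancer loaded:

# httpd.conf on the front-end box - backend names are examples only
<Proxy balancer://appcluster>
    BalancerMember http://web1.internal
    BalancerMember http://web2.internal
</Proxy>

ProxyPass / balancer://appcluster/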

It's a tube site, so membership is free, with up to 30-minute vids to download and a convert/resize mod thrown in as well.

 

The conversion takes some server power, but that runs on the second server; even so, I've noticed a large drop in site performance.

 

I'm allowing other sites to embed videos as well, served from the main server.

 

I don't want to spend much on this because it's a free site, so buying more servers is out.

 

There have got to be more advanced caching techniques out there, right?

 

Here's what I'm using; this is all I can find on caching:

 

# increase speed and preserve bandwidth
<IfModule mod_php4.c>
    php_value zlib.output_compression 16386
</IfModule>

# 1 WEEK
<FilesMatch "\.(jpg|jpeg|gif|ico|png)$">
    Header set Cache-Control "max-age=604800, public"
</FilesMatch>

# 2 DAYS
<FilesMatch "\.(css|js|swf|flv)$">
    Header set Cache-Control "max-age=172800, private, proxy-revalidate"
</FilesMatch>

Header unset ETag
FileETag None

# stop people from browsing indexes
Options -Indexes

 

Half the hits are curl bots. I have no idea how to block them.

 

This is from AWStats on my server, under Browsers:

Browser | Grabber | Hits    | Percent
Curl    | Yes     | 4756433 | 43.3 %

What is your greatest concern? CPU time and memory consumption, or bandwidth?

 

If you can tell that it's being done using curl, then why can't you block it? I suppose you could check the speed at which a given IP address makes requests; if it makes requests too fast, it's probably a bot. You can use something like DenyHosts and add bot IP addresses to /etc/hosts.deny. Doing that will block connections from those IP addresses.

You know that they are bots because of the user-agent string?  If so, simply refuse them based on that.  Smarter bot writers can trick you, but it will cut out the dumb ones.
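
If the user-agent is how you spot them, a few lines of PHP at the top of your front controller will turn them away before any heavy work happens. This is just a sketch; the list of agent substrings is illustrative, so match it to what AWStats shows you:

<?php
// Refuse obvious bots by user-agent before doing any real work.
// The substrings below are examples only - tune them to your own logs.
$blockedAgents = array('curl', 'wget', 'libwww-perl', 'python-urllib');

$userAgent = isset($_SERVER['HTTP_USER_AGENT'])
    ? strtolower($_SERVER['HTTP_USER_AGENT'])
    : '';

foreach ($blockedAgents as $agent) {
    if (strpos($userAgent, $agent) !== false) {
        header('HTTP/1.1 403 Forbidden');
        exit('Automated clients are not allowed.');
    }
}
?>

You could do much the same thing in .htaccess with SetEnvIfNoCase User-Agent and a matching Deny rule, if you'd rather keep it out of PHP.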

 

As for performance, here are a couple of tricks the big boys use.

 

- Serve your static images etc. using a lightweight web server like lighttpd.
- Make sure you've minified your JavaScript.
- To lessen the load on your DBs, implement memcache to cache your query results.
- Go through your Apache config and make sure you are loading the fewest Apache modules possible.
- Use a PHP opcode cache like APC.

 


OK, that'll get me started. I'm already using lighttpd. JavaScript is cached via .htaccess. I'm also using Smarty's template caching on pages that don't need recurring {section} blocks.

 

Will blocking those IPs affect embedding? I get a large amount of incoming traffic from other sites via embedded videos.

 

No. The request will come from the end user, not the site that embeds it (that is, if you're talking about YouTube-style embedding). So unless there is a page somewhere with a lot of embedded videos on it, it shouldn't be a problem.

Other things you will want to do on the frontend are to use CSS sprites and to combine your CSS and JavaScript into single files. You will also benefit greatly from design principles such as separation of concerns, because that ensures all your CSS and JavaScript is cacheable.
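
A combiner script is one simple way to get the "single file" part without changing how you work. This is only a sketch, and the directory and file names are made up:

<?php
// combine_js.php - serve several JavaScript files as one cacheable response,
// so browsers make a single request instead of one per file.
// Directory and file names below are examples only.
$dir   = '/var/www/js/';
$files = array('jquery.js', 'player.js', 'site.js');

header('Content-Type: application/javascript');
header('Cache-Control: max-age=604800, public');

foreach ($files as $file) {
    readfile($dir . $file);
    echo "\n;\n"; // guard against files missing a trailing semicolon
}
?>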

 

Also, why are you unsetting ETags? They help with caching. I also noticed "<IfModule mod_php4.c>" in one of your previous posts. I haven't done any tests, and I can't be bothered to search for any, but I'm pretty sure PHP 5 is faster than PHP 4.

Have you profiled your script yet to see if you have any code bottlenecks?  Also, turn on your MySQL query cache and set it to a reasonable size if you haven't yet.
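
If you don't have a profiler set up, even crude wall-clock timing will show where the time goes. A minimal sketch, assuming PHP 5 for microtime(true); the section names are placeholders:

<?php
// Rough timing of suspect sections of a page.
$timings = array();

$start = microtime(true);
// ... run the page's database queries here ...
$timings['db'] = microtime(true) - $start;

$start = microtime(true);
// ... render the Smarty templates here ...
$timings['templates'] = microtime(true) - $start;

// On a busy site, log rather than echo.
error_log('page timings: ' . print_r($timings, true));
?>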

 

And just checking: when you said you split the CSS/images/JavaScript onto subdomains, did you offload those onto another server or a CDN, or are they still on the same box?  And are your video files on subdomains too?

Please tell me that you're using adverts (adult ones) on this site, and not just doing all of this out of the goodness of your heart?

 

It's a hobby, so I don't mind if I make nothing from it; my payment is building apps that people actually use. I want it to have no advertising at all, just content.

 

The CSS, images, and JavaScript are on subdomains of the main site:

 

http://js.main.com

http://img.main.com

http://css.main.com

 

Video files are on the remote server.

 

There's one IP that has been hitting every page and stealing large amounts of bandwidth, 61.41.172.241, and I can't seem to trace it.

 

 

Xylex reminded me of one other item: turn on the MySQL slow query log and look through it for queries that are hammering the DBs.  This could indicate that you have queries that need rewriting or indexes that are missing.  You can also run EXPLAIN on those queries to see what's going on with them.  I agree with his query cache tuning, although that assumes you're using the MyISAM engine.  There are similar adjustments you can make if you're using InnoDB, but you have to read the InnoDB docs for those.  There's an excellent performance tuning tool called innotop you should check out if you're using InnoDB.
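
For reference, on a MySQL 5.0-era server the slow query log and query cache are switched on in my.cnf under [mysqld] with something like the lines below; the path and sizes are only examples:

# my.cnf, [mysqld] section - example values only
log-slow-queries              = /var/log/mysql/slow.log
long_query_time               = 2      # log anything slower than 2 seconds
log-queries-not-using-indexes

query_cache_type              = 1
query_cache_size              = 64M
query_cache_limit             = 2M

Running EXPLAIN on whatever shows up in the log will tell you whether an index is actually being used.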

 

For caching, I think you jumped over that suggestion far too quickly.  There's a huge difference between Smarty (which I'm completely fine with) and using memcache.  Memcache is distributed, in-memory caching for your data, not for rendered pages.  If you want to know what is used by MySpace, Facebook, LiveJournal and countless others to allow them to scale, it's memcache.  It's very simple to implement and can drastically lower the load caused by your DBs, especially in a read-heavy environment like the one I'm assuming you have.
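
The pattern with the PHP memcache extension looks roughly like this; the host, key name, and query are placeholders, and it assumes a MySQL connection is already open:

<?php
// Cache a query result in memcached so repeat page views skip MySQL.
$memcache = new Memcache();
$memcache->connect('127.0.0.1', 11211);

$key    = 'latest_videos';
$videos = $memcache->get($key);

if ($videos === false) {
    // Cache miss: hit the database once, then cache the result for 5 minutes.
    $result = mysql_query('SELECT id, title FROM videos ORDER BY added DESC LIMIT 20');
    $videos = array();
    while ($row = mysql_fetch_assoc($result)) {
        $videos[] = $row;
    }
    $memcache->set($key, $videos, 0, 300);
}
// $videos is now ready to hand to the template.
?>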

 

As for the curl bots, I'd suggest you start by dropping them onto a page with some contact info; if you want to handle individual exceptions, you can always code that in via a custom user-agent string you agree upon, or an IP range.
