Maximising Website Runtime on Host Servers Running FreeBSD


November 30, 2013

We’ve all seen the damage caused by web traffic spikes. During this year’s Super Bowl, the websites of 13 companies that advertised during the game went down within five minutes of their adverts airing. With advertising slots selling for up to $5,840,000 (£3.6m, €4.3m), and run precisely to drive traffic to those websites, that’s one (well, 13) costly website failure.


Another very high-profile crash came just last month, when the website for the US’s new healthcare insurance programme, Obamacare, launched. Exceptional traffic and various bottlenecks took the site out almost instantly.

Over 2 per cent of the world’s websites run on BSD (roughly 14 million websites), and a quick Google and forum search highlights that OpenBSD users are not immune to this problem. Traffic spikes therefore need to be part of any business continuity plan when working with BSD or any other Unix variant. But it’s not only traffic spikes: as we move to cloud services we add extra dimensions to business continuity planning, and we have to plan for disaster striking when a major cable is cut – or for a myriad of other major system failures.

Business continuity planning
There are two key metrics used by industry to evaluate available disaster recovery (DR) solutions. These are called recovery point objective (RPO) and recovery time objective (RTO). A typical response to DR is to have a primary site and a DR site, where data is replicated from the primary to the DR site at a certain interval.
RPO is the amount of data lost in a disaster (such as the failure of a server or data center). It depends on the backup or replication frequency, since the worst case is a disaster that strikes just before the next scheduled replication.

RTO defines the amount of time it takes an organization to react to a disaster (whether automatically or manually; typically there will be at least some manual element, such as changing IP addresses) and perform the reconfiguration necessary to recreate the primary site at the DR site. For example, if there is a fire at the primary site, you may need to order new hardware and re-provision your servers from backups. For most web hosting companies, retaining an exact replica of every primary-site server at the DR site is not economically viable.
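As a rough illustration of how replication frequency drives worst-case RPO, here is a minimal sketch in Python. The interval and timestamps are hypothetical, chosen only to make the arithmetic concrete:

from datetime import datetime, timedelta

# Hypothetical figures for illustration only.
replication_interval = timedelta(hours=24)          # nightly replication to the DR site
last_replication = datetime(2013, 11, 29, 2, 0)     # last successful sync
disaster_time = datetime(2013, 11, 30, 1, 30)       # primary site lost

# Worst-case RPO equals the replication interval: anything written since
# the last successful replication is gone.
data_lost_window = disaster_time - last_replication
print(f"Data written in the last {data_lost_window} is unrecoverable")

# RTO is measured separately: the time from the disaster until the DR site
# is serving traffic again (re-provisioning, restoring, repointing DNS).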

Scaling Websites When There Are Spikes In Traffic


The traditional model

In the common shared hosting model, a web hosting company will install a lot of websites on a single server without any high availability (HA) or redundancy, and set up a nightly backup via rsync.
In this model, when one website gets very popular, the server hosting it is also busy serving requests for a lot of other websites and becomes overloaded. The typical consequence is that the server starts to respond very slowly as the required number of I/O operations per second exceeds its capacity. The server soon runs out of memory as web requests stack up, starts swapping to disk, and “thrashes itself to death”. This results in everyone’s websites going offline.
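For completeness, the nightly backup mentioned above is often just a cron-driven rsync to a backup host. A minimal sketch in Python might look like the following; the host name and paths are hypothetical:

import subprocess

# Hypothetical source and destination; in practice this would run from cron each night.
SOURCE = "/usr/local/www/"                     # document roots on the shared server
DEST = "backup01.example.com:/backups/www/"    # backup host

def nightly_backup():
    # -a preserves permissions and ownership, -z compresses, --delete mirrors removals.
    subprocess.run(
        ["rsync", "-az", "--delete", SOURCE, DEST],
        check=True,
    )

if __name__ == "__main__":
    nightly_backup()

Note that a backup like this only addresses RPO (you lose up to a day of changes); it does nothing for the overload scenario described above.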

The CloudLinux model

An alternative is the CloudLinux model, which contains the spike of traffic by imposing OS-level restrictions on the site experiencing heavy traffic. This is an improvement because the other sites on the server stay online.
However, the disadvantage of this approach is that the website gaining the traffic is necessarily slowed down or stopped completely. If the server were to try to fully service all incoming requests for that site, it would crash, as above.
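To illustrate the general idea of per-site OS-level limits – this is a simplified sketch, not CloudLinux’s actual mechanism – a worker process for the busy site could have its memory and CPU capped before it serves requests:

import resource

# Hypothetical caps for one site's worker processes; a real shared host would
# enforce these per tenant (e.g. via LVE on CloudLinux, or rctl on FreeBSD).
MEMORY_LIMIT = 256 * 1024 * 1024   # 256 MiB of address space
CPU_SECONDS = 30                   # CPU time per worker before it is killed

def apply_limits():
    # Cap the address space so a traffic spike cannot exhaust the whole server's RAM.
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT, MEMORY_LIMIT))
    # Cap CPU time so runaway requests are terminated rather than starving other sites.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))

# apply_limits() would be called in each worker just after fork(), before it
# begins handling requests for the restricted site.

The other sites stay responsive, but the limited site sheds or slows its own traffic – which is exactly the drawback described above.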

A better model

Rather than strangling the site experiencing the spike in traffic, it is possible to dynamically live-migrate the other websites on that server to other servers in the cluster. This eradicates downtime on any of the cluster’s sites, and therefore allows a host to enable full and automatic scalability.

Using this method it’s possible to deliver three (or even four) orders of magnitude greater scalability than shared hosting solutions. Assuming just 500 websites per server, a site that normally has 1/500th of one server can burst onto two dedicated servers – 1,000x its usual share of resources – by intelligently and transparently migrating websites between hosts.
This is the method used for HybridCluster’s Site Juggler Live Migration.
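The arithmetic behind that scalability claim is straightforward; a quick sketch using the figures from the paragraph above:

# Figures from the example above: 500 sites sharing one server.
sites_per_server = 500
normal_share = 1 / sites_per_server   # fraction of a server a site normally gets

# If the other sites are migrated away and the busy site bursts onto
# two dedicated servers, its available capacity grows accordingly.
burst_servers = 2
scalability_factor = burst_servers / normal_share

print(f"Burst capacity is {scalability_factor:.0f}x the normal share")  # 1000x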

Keeping you online in a disaster
When building distributed (cloud) systems we face a trade-off between three properties, commonly known as the CAP theorem:
• Consistency – if a system is consistent, then queries to different nodes for the same data will always result in the same answer
• Availability – the system always responds to requests with a valid response
• Partition tolerance – if parts of a distributed system become disconnected from each other, they can continue to operate

In reality you can only guarantee two, and for web hosting these should be availability and partition tolerance, at the expense of strict consistency. Doing this allows websites to stay online in a disaster scenario – for example, if an undersea cable is cut and the European and US components of a cluster can no longer communicate with each other.
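As a minimal sketch of what choosing availability and partition tolerance looks like in practice – the replica objects and the error type here are hypothetical – a read can fall back to a possibly stale local copy rather than failing when the other side of the cluster is unreachable:

class PartitionError(Exception):
    """Raised when the remote side of the cluster cannot be reached."""

def read_page(path, local_replica, remote_master):
    # Prefer the authoritative copy while the link is up...
    try:
        return remote_master.fetch(path)
    except PartitionError:
        # ...but during a partition, serve the local (possibly slightly stale)
        # replica instead of returning an error: availability over consistency.
        return local_replica.fetch(path)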
If we look at a typical web request – for example, a user uploading a photo to a WordPress blog – the content can be replicated across multiple continents to prevent loss in a disaster (natural or man-made).

But once disaster strikes, it is essential to elect new masters for all the sites on both sides of the partition in order to keep the websites online on both sides of the Atlantic – something we worked into the HybridCluster protocols.
Using this approach, when traffic is re-routed or the undersea cable is repaired, the cluster can rejoin and the masters negotiate which version of the website is more valuable, based on how many changes have been made on each side of the partition. This keeps your websites online all the time, everywhere in the world.
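A highly simplified sketch of that election and negotiation – the node and version objects are hypothetical, not the actual HybridCluster protocol – might look like this:

def elect_master(nodes_in_partition):
    # Each side of the partition picks a new master from the nodes it can
    # still reach, so its copies of the sites stay writable and online.
    return min(nodes_in_partition, key=lambda node: node.node_id)

def merge_after_partition(version_a, version_b):
    # When the partition heals, keep the copy that accumulated more changes
    # while the sides were separated; the other copy is kept as a snapshot.
    if version_a.changes_since_split >= version_b.changes_since_split:
        return version_a
    return version_b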

Summary

Regardless of whether you’re running a huge multinational organization or an open source software site, downtime is costly.
New techniques need to be applied both to cope with increased demand and, as we shift to cloud computing, to factor in disaster. That means handling data intelligently: moving sites between servers in the cluster, automatically assigning new masters after a partition forms, and merging again once it is repaired. By using an integrated suite of storage, replication and clustering technologies – such as HybridCluster – it’s possible to shift to a true cloud computing model and enable intelligent auto-scaling, as well as integrated backup and recovery.

Author
Luke Marsden
CEO HybridCluster
www.hybridcluster.com
This article was re-published with the permission of BSD Magazine. To learn more about iXsystems’ commitment to open source, check us out here: https://www.ixsystems.com/about-ix/
