Highly available web service design?

Question

I was wondering if any design gurus out there could help me understand how large web companies design their services to be highly available. The scenario I'm thinking of is:

Client A connects to Service A
Client A sends N requests
On the N+1 request, Service A blows up
Client A reconnects to Service B
Service B services requests N+1 onwards

The only design I could think of uses a "metadata"/"discovery" service whose address the Client knows statically. This service would report the best available Service; the Client would connect to that Service, begin sending requests, and re-query the "metadata" service when it notices that Service A went down. The application service is now highly available, but...

...the glaring problem is that the "metadata"/"discovery" service is static, will come under high load, and is not highly available, which kind of defeats the whole purpose. I suppose I could throw a lot of hardware under this service, but that's not a very good solution.
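To make the idea concrete, here is a minimal in-memory sketch of the client/discovery interaction described above. Everything in it (`Discovery`, `Client`, `ServiceDown`, `make_service`) is a hypothetical name invented for illustration; a real setup would make network calls instead, and the single `Discovery` object here has exactly the availability problem just mentioned.

```python
class ServiceDown(Exception):
    """Raised by a simulated service instance that has failed."""


class Discovery:
    """Hypothetical in-memory stand-in for the "metadata"/"discovery" service."""

    def __init__(self, instances):
        self.instances = dict(instances)  # name -> callable that handles a request
        self.down = set()

    def best_available(self):
        # "Best" is simply the first instance not reported down (an assumption).
        for name in self.instances:
            if name not in self.down:
                return name
        raise RuntimeError("no service instance available")

    def report_down(self, name):
        self.down.add(name)


class Client:
    def __init__(self, discovery):
        self.discovery = discovery
        self.current = discovery.best_available()

    def send(self, request):
        while True:
            handler = self.discovery.instances[self.current]
            try:
                return handler(request)
            except ServiceDown:
                # Re-query discovery, then retry the same request elsewhere.
                self.discovery.report_down(self.current)
                self.current = self.discovery.best_available()


def make_service(name, fail_after=None):
    """Build a fake service that starts failing after `fail_after` requests."""
    count = 0

    def handle(request):
        nonlocal count
        if fail_after is not None and count >= fail_after:
            raise ServiceDown(name)
        count += 1
        return f"{name} handled {request}"

    return handle


discovery = Discovery({"A": make_service("A", fail_after=2), "B": make_service("B")})
client = Client(discovery)
replies = [client.send(i) for i in range(1, 4)]
print(replies)
```

With N = 2, the first two requests are served by A and the third is transparently retried against B.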

How should I go about designing a truly highly available web service?

Sure, could be used for load balancing, but the main scenario I'm thinking of is when a service instance goes completely down, and there's another one up somewhere that could, potentially, service the client, if the client knew about it. You could, of course, just keep statically defining each service instance, but that's not scalable to deploying more and more instances. — user109533, Nov 22 '13 at 18:54
Load balancing and replication are not the same thing. I think he's talking more about the latter. Algorithms like Paxos and its more modern descendants like Raft are closer to what you have in mind for building a distributed, fault-tolerant system. Try googling these keywords and you'll find a bunch of existing software that builds on one of these algorithms and helps you accomplish that task. Another option could be to have a good, replicated DB backend server. It depends on your use case. — JensG, Nov 22 '13 at 19:25

score 5 · Answer 1 · answered Nov 22 '13 at 19:26

I think it is important to note that the client should not really have to react to the failure of Service A. The web service itself must handle this situation.

You could provide a set of redundant load balancers. These distribute incoming requests across their pool of available machines.

In the scenario you described above, the load balancer would hand Client A off to Service A. Once Service A is unavailable, the load balancer hands Client A off to Service B, and so on.
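As a rough sketch of that hand-off (not how any particular load balancer is implemented), the class below keeps a client on its backend while that backend is healthy and reassigns it otherwise. `LoadBalancer`, `mark`, and `route` are hypothetical names, and health is set by hand here rather than by real health checks.

```python
class LoadBalancer:
    """Toy load balancer: keeps a client on its backend until that backend fails."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = {b: True for b in self.backends}
        self.assigned = {}  # client -> backend currently serving it

    def mark(self, backend, is_healthy):
        # In a real system this would be driven by periodic health checks.
        self.healthy[backend] = is_healthy

    def route(self, client):
        backend = self.assigned.get(client)
        if backend is None or not self.healthy[backend]:
            candidates = [b for b in self.backends if self.healthy[b]]
            if not candidates:
                raise RuntimeError("no healthy backend")
            backend = candidates[0]  # simplest possible choice (an assumption)
            self.assigned[client] = backend
        return backend


lb = LoadBalancer(["Service A", "Service B"])
first = lb.route("Client A")
lb.mark("Service A", False)  # Service A blows up
second = lb.route("Client A")
print(first, "->", second)
```

The client never learns that Service A died; it keeps talking to the load balancer's address throughout.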

Of course, load balancers can die too. There are a few solutions I'm aware of.

Use round-robin DNS. This is not great as some clients will be serviced and others won't.
Have the load balancers check each other's statuses and then take over when the other is gone.
Use a cloud service, like EC2 Elastic Load Balancer.
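The second option, load balancers watching each other, can be sketched as a heartbeat with a timeout. This is only an illustration under assumptions: `Standby`, `heartbeat`, `check`, and the 3-second timeout are made up, timestamps are passed in explicitly to keep the example deterministic, and the actual takeover step (typically claiming a shared address) is left out.

```python
class Standby:
    """Standby load balancer that takes over when the primary goes quiet."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None  # time of the last heartbeat from the primary
        self.active = False

    def heartbeat(self, now):
        # Called whenever a heartbeat from the primary arrives.
        self.last_seen = now

    def check(self, now):
        # The primary counts as alive only if it was heard from recently.
        # With no heartbeat ever received, the standby also becomes active.
        peer_alive = (
            self.last_seen is not None
            and now - self.last_seen <= self.timeout_s
        )
        self.active = not peer_alive
        return self.active


standby = Standby(timeout_s=3.0)
standby.heartbeat(now=0.0)
before = standby.check(now=2.0)  # primary heard from 2s ago
after = standby.check(now=6.0)   # primary silent for 6s
print(before, after)
```

If the primary's heartbeats resume, the next `check` call makes the standby step back down.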

We use AWS, so I've learned a lot from their architecture page, which is publicly available. Of course, if you don't use AWS, you will need to adapt the ideas to your own setup.
