This is a presentation by Josh Berkus called Scale Fail, given at the O’Reilly MySQL CE 2011, that goes into detail on how building “scalable” applications and services can go very wrong (thanks, Ivan, for the link).
The six key points from the presentation are (from http://highscalability.com):
- Be trendy. Use the tool that has the most buzz: NoSQL, Cloud, MapReduce, Rails, RabbitMQ. It helps you not scale, and the VCs like it. Use Reddit to decide what tool to use. Whatever is getting the most points this week is what you should use.
- Troubleshoot after the barn door has closed. Math is not sexy. Statistics are not sexy. Forget resource monitoring, performance testing, traffic monitoring, load testing, tuning analysis. They are all boring. Be more intuitive. Let history be your guide. Whatever problems you had on your last job are the ones you’ll have on this job.
- Don’t worry about it. Parallel programming is not sexy. Erlang can parallel program a 1000 node cluster, but it’s not sexy. Be hot and shoot from the hip. Ignore details about memory and management. Don’t worry about it. Use single-thread programming, lots of locks, ignore scope and memory contexts, have frequently-updated single-row tables, have a single master queue that controls everything, and blocking threads are your friend.
- Hit the database with every operation. Caching is not your friend. Every single query should go directly to the database. Ignore caches completely.
- Scale the impossible things. Scaling easy things is for wimps. There’s no hotness there. There are no speaking engagements in scaling web servers, caches, shared-nothing hosts and simple app servers. Scaling the impossible things is where the hotness is: transactions, queues, shared file systems, web frameworks. This is how you get the long nights and weekends and the war stories that will get you up on stage.
- Create single points of failure. No matter how large your software is, you must have a couple of places where a single point of failure will bring down your entire infrastructure. Like a single load balancer, or a single queue, or load balancers that run at 100% capacity.
Even though the focus here is mostly dev-related, the parallels with other IT areas are self-evident. Let’s talk about the relevant points from a networking perspective:
- Be trendy. How many times have you seen some odd box from an unknown vendor that someone purchased just because it sounded cool? (*cough* Cisco MARS *cough*) Or someone rolling an untested technology into production just because they saw it in the latest issue of Trendy Mag? (*cough* Most DPI technologies *cough*)
- Troubleshoot after the barn door has closed. You get a lovely support ticket: “Customer XYZ reports the network is slow”. You start asking the typical questions: what’s your baseline? Where are you seeing the issue? Is this related to a single application? And so on. No answers, no graphs, no NetFlow, nothing. For bonus points, you might have to deal with the spaghetti chucker or a battle-scarred soldier. Another classic is the customer who claims to need a network speed upgrade, say from dual 1 GbE uplinks to dual 10 GbE uplinks. The upgrade is done, the customer complains the network is still “too slow”, tests are run, everything is running at wire speed, and at the end of the day it turns out that the problem that prompted the upgrade was not related to network speed at all.
- Scale the impossible things. This reminds me of the shift we are seeing toward stretched L2 domains across data centers and all the problems that will come with it. But hey, that’s what happens when the gurus fall asleep at the wheel.
- Create single points of failure. Quite self-explanatory: most networks have a lot of SPoFs. Others have them but don’t know it, simply because they have some sort of redundancy in place but have never bothered to test it, so they can’t really answer questions such as: is the backup firewall really going to take over seamlessly? What happens if the STP root goes down? Are we sure the backup fiber works? Is HSRP OK on the standby? I’ve had customers with NMS systems that report things such as “R-ISP-ABC-1 is down” and, seconds later, “R-ISP-ABC-HSRP is down”. One of my favorites is when everything is redundant but someone decides to add a new non-redundant device, let’s say an inline IDS; in the best scenario this results in a performance hit, and in the worst case you have just created a new and shiny SPoF. After all, complex systems have complex failures.
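The “is the standby actually going to take over?” question is exactly the kind of thing that can be exercised ahead of time instead of during an outage. A minimal sketch (Python; the `Firewall` class and `forward_with_failover` helper are invented purely for illustration, not any real network API) of an active/standby pair and the failover check you would want to run before trusting it:

```python
class Firewall:
    """Toy stand-in for a redundant network device (hypothetical, for illustration)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def forward(self, packet):
        # A down device can't forward traffic.
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} forwarded {packet}"


def forward_with_failover(packet, primary, standby):
    """Try the primary first; on failure, fail over to the standby."""
    try:
        return primary.forward(packet)
    except RuntimeError:
        return standby.forward(packet)


# The test you should run *before* the outage: kill the primary on purpose
# and confirm the standby really carries the traffic.
primary = Firewall("fw-primary", healthy=False)  # simulated primary failure
standby = Firewall("fw-standby", healthy=True)
print(forward_with_failover("pkt-1", primary, standby))  # fw-standby forwarded pkt-1
```

If both devices are down (or the “redundant” pair secretly shares a non-redundant dependency, like that inline IDS), the call raises, which is the toy equivalent of discovering your SPoF in production.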
The interesting paradox here, just as Josh points out in his presentation, is that if you prevent these issues beforehand, you will never be the organizational hero. But fix them at the moment of crisis and voilà! Folks will sing your name and write of your adventures.
Perhaps this is why so many IT folks are always in “firefighter mode”.
The Scale Fail by CCIE Blog, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.