LARGE x RARE == DIFFERENT: Why scaling companies is harder than it looks


Something interesting happens when you run more than 1,000 servers, as we do at WP Engine.

Suppose I told you that, on average, each of our servers experiences one fatal failure every three years: the kernel panics (the Linux equivalent of the Blue Screen of Death), or both the main and redundant power supplies fail, or some other rare event causes an outage. Does that sound like a bad batting average?

Think about the laptops and desktops you’ve owned. Some last longer than others, but it’s probably something like 2-4 years before the OS locks up or a battery can’t hold a charge anymore. Now consider that your laptop is idle the vast majority of the time, whereas our servers are getting pounded multiple times per second, 24/7. Even in the wee hours of American timezones, when our company-wide traffic dips to its nadir, we’re running malware scanners, off-site backups, and other maintenance.

So, even if our servers are more hardy than your MacBook Pro, they’re taking 100x the beating, so one failure every three years seems pretty reasonable.

Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.
–Haiku from FSF

But remember, we have 1,000 servers. Three years is about 1,000 days. So that means, on average, every single day we have a fatal server error.
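The arithmetic above can be sketched in a few lines. This is only an illustration: the function name is made up, and the "chance of at least one failure" figure assumes failures are independent and arrive at a constant rate (a Poisson model), which the post does not claim.

```python
import math

def fleet_failure_stats(servers: int, years_between_failures: float) -> tuple[float, float]:
    """Return (expected failures per day, probability of at least one failure on a given day)."""
    days_between_failures = years_between_failures * 365.0
    expected_per_day = servers / days_between_failures
    # Assumption: independent failures at a constant rate (Poisson arrivals).
    p_at_least_one = 1.0 - math.exp(-expected_per_day)
    return expected_per_day, p_at_least_one

expected, p_any = fleet_failure_stats(servers=1000, years_between_failures=3)
print(f"expected failures per day: {expected:.2f}")        # 0.91
print(f"chance of at least one failure today: {p_any:.0%}")  # 60%
```

Under those assumptions the fleet averages just under one fatal failure a day, which matches the "about 1,000 days" shortcut in the text.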

Not to mention 10 minor incidents with degraded performance, or a DDoS attack somewhere in the data center affecting our network traffic, or some other thing that sets pagers a-buzzing in our Tech Ops team and mobilizes our Customer Support team to notify and help customers.

“Well sure,” you say, “that’s normal as you grow. If you had just 10 servers and 100 customers, you’d have far fewer problems and many fewer employees. Today you have more customers, more servers, and more employees. What’s so hard about that?”

The insight is that scale causes rare events to become common. Things happen with 2,000 servers that you literally never once saw with 50 servers, and things which used to happen once in a blue moon, where a shrug and a manual reboot every six months was in fact an appropriate “process,” now happen every week, or even every day.
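To put rough numbers on "once in a blue moon" becoming "every week," here is a minimal sketch. It assumes the event rate grows in direct proportion to server count, and it pairs the six-month interval with the 50-server fleet purely for illustration; `rescale_interval` is a hypothetical helper, not anything WP Engine uses.

```python
def rescale_interval(interval_days: float, old_servers: int, new_servers: int) -> float:
    """Average days between fleet-wide events after the fleet changes size.

    Assumes every server has the same independent chance of hitting the event,
    so the fleet-wide rate scales linearly with the number of servers.
    """
    return interval_days * old_servers / new_servers

six_months = 365 / 2  # days between events on the small fleet (assumed)
for servers in (50, 2000):
    days = rescale_interval(six_months, old_servers=50, new_servers=servers)
    print(f"{servers:>5} servers: one event every {days:.1f} days")
```

With these assumed inputs, a twice-a-year annoyance at 50 servers shows up roughly every 4.6 days at 2,000 servers, the same shift the paragraph describes.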

Things as rare as, well, you know…

[Cartoon]

Also, it’s not just the problems that morph with scale; your ability to handle them morphs too.

For example, a dozen minor and major events every day means 20-50 customers affected every day. Now consider what happens as we try to inform 50 customers. For some we won’t have current email addresses, so they don’t get notified. Some of those will notice the problem and, at minimum, create extra customer support load; at worst they’ll post on Twitter about how their website was slow today and WP Engine didn’t even know it. Then our social media team has to piece all this together, attempt to respond, maybe put together a special phone call with that customer, etc.

Or, consider the scale-ramifications of on-boarding 1,000 new customers a month. In that case, it’s likely that any given server issue will affect a customer who has only been with us for 30-60 days. Thus the issue causes a “bad first impression,” which is harder to address than the same issue with a customer who has been with us for three years and therefore has built up a “bank account of patience.”

All of these aspects of the “solution” side of the process are affected by the same rule: rare events happen regularly, and cause much more work to solve than when the company was small.

The usual response to this is “automate everything.”

As with most knee-jerk responses, there’s truth in it, but it’s not the whole story.

Sure, without automated monitoring we’d be blind, and without automated problem-solving we’d be overwhelmed. So yes, “automate everything.”

But some things you can’t automate. You can’t “automate” a knowledgeable, friendly customer support team. You can’t “automate” responding to a complaint on social media, which as our Twitter meister Austin Gunter says is usually a customer’s last resort and thus should always be treated as the very legitimate issue that it is. You can’t “automate” the recruiting, training, rapport, culture, and downright caring of teams of human beings who are awake 24/7/365, with skills ranging from multi-tasking on support chat to communicating clearly and professionally over the phone to logging into servers and identifying and fixing issues as fast as (humanly?) possible.

And you can’t “automate” away the rare things, even the technical ones. By their nature they’re difficult to define, hence difficult to monitor, and difficult to repair without the forensic skills of a human engineer.

Does this mean all our customers have a worse experience? No, just the opposite. Any one customer of ours has fewer problems per month today than a year ago, because we’re constantly improving our processes, automation, hardware, human service, etc. It’s when you look across the entire company, and the non-linear additional effort it takes to not just improve the average experience, but to manage the worst-case experience, that you appreciate the difficulties.

Does that give high-scale companies like WP Engine an excuse to have problems? No way! In fact, if we’re not constantly improving on all fronts, the scale itself will catch up and overtake us, so we have to adhere to the laws of automation and diligence even more than smaller, slower-growing companies do.

But for those of you in the earlier stages of your companies, when you project 5x growth coupled to just 5x the costs (or only 3x the costs because you’ll get cost-savings at scale), you’re guessing low. When you show 5x growth in projections but don’t budget for new hires in areas like security, technical automation, specialized customer service areas, and managers and executives who have trod this path before and come battle-hardened with playbooks on how to tackle all this, you’re heading for an ugly surprise.

And with high growth, the surprise appears quickly, and recovery means acting twice as fast again to claw back ahead of the effect.

 

  • jasonstoddard

    Great post. Mo Scaley, Mo problems. Great opportunity for enterprise software startups: Lead with product to support automation, support with service to enhance the customer’s experience (and create maximum value.) Sidebar: Is Twitter Meister on Austin’s bcard? Does he even have bcards? :-)

    • http://austingunter.com/ Austin Gunter

      Lord, I hope that’s not the only thing I’m known for

      • http://iammike.co/ Mike Zielonka

        @austingunter:disqus How do you keep up with WP Engine’s social media on weekends, overnight, etc?

  • Mike Christensen

    Ideally, any one server should be able to crash and burn and zero customers should be affected. No one piece of hardware should be important enough to cause any downtime. Hardware needs to be redundant and the entire network should be fault tolerant. It’s fine if one in 1,000 servers goes down on any given day; it’s not fine if your network topology is not designed to handle that seamlessly.

    • http://blog.asmartbear.com Jason Cohen

      It’s a *lot* more complicated than that. For example, often a failure isn’t “going down,” but some kind of degradation that’s harder to detect and act on. Also remember that all our customers run their own code — it’s not one codebase that we control — which means any behavior you see in terms of slowness or even outright failures could be caused by code rather than any problem with infrastructure, which makes diagnosing infrastructure-related issues more tricky.

  • Shannon Lewis

    Good post. Love this quote: scale causes rare events to become regular.

  • brian piercy

    “A flock of black swans…” I agree. Rare events become regular.

  • Assaf Lavie

    Listen, I love this blog. It’s one of my few must-reads. But this post sounds so much like self-congratulating, look at us and our big-boy problems. Very little actual content. Read a high-scalability blog post if you want to see what actual insight into scalability reads like. Very disappointing this post…

    • pauldwaite

      “Read a high-scalability blog post if you want to see what actual insight into scalability reads like.”

      I don’t think that’s what this post was aiming for though. I think it meant to broadly describe what dealing with scale problems can be like, so that startup founders not yet there don’t get unrealistic expectations about how easy it will be to deal with growth.

    • http://blog.asmartbear.com Jason Cohen

      Thank you for your frank criticism.

      This was written in reaction to a startup I’m an investor in, which is (hopefully!) about to enter this phase of its business, and which was not including and appreciating the costs and investments it will need to make, and was thus building an unreasonable plan to take to Series B investors.

      This is a distillation of what I told them, of course using WP Engine as an example, since we’re about 2 years into this phase. That’s why at the end I caution other startups — in my mind I was still speaking to them.

      However, I do appreciate your take on it — you’re right it could also sound like we’re just boasting.

      I agree about reading a high-scalability blog (or really a *book*) on *solving* technical scale, which of course this post and this blog aren’t for. But there’s also the people and process scale, don’t forget! And in both cases, reading will take you only so far: I don’t know of any company that went through high scale that didn’t have challenges and problems. It’s just hard!

      I’m trying to warn startup leaders entering this phase that it’s going to be hard in ways that are difficult to prepare for, and thus they need to assume they’ll be over-investing in people and technology. Sometimes (often?), they don’t.

  • nathanlatka1

    Fantastic post Jason – thanks!


  • Marco

    This post made my day.
    It basically means that smaller hosting companies have lower costs per sale, hence they can be more profitable.
    I’ve always thought a restaurant can remain small and profitable, but a hosting business has to scale to remain in business.
    After this post I’m glad to learn that a hosting business can also remain small and still be profitable.

    • http://blog.asmartbear.com Jason Cohen

      You are correct, assuming you’re charging enough. Meaning, if your prices are too low, you need so many customers to be profitable, that then you need the scale to support them, which costs more, etc etc., until you have significant scale. That’s the trap of the shared hosting companies.

      But if you’re able to charge a premium because you’re genuinely providing a premium service, then you’re right that you might not need scale to be profitable, and that can be an extremely happy place to be indeed!


  • Bruce Hansen

    Again: scale causes rare events to become regular. Isn’t this a good thing? Once you reach a certain scale the black swans all even out and it becomes part of the regular process. Whereas for smaller shops the black swans are major irritants that cause disruption of process.

  • Dawn Simpson

    Coming from a data center, hosting business that is transitioning to being a cloud provider – I applaud the spirit of this article. I appreciate this perspective on growth, I also think it can apply to mature organizations. Especially those who have become complacent and need to be reminded of what it was like to scale and grow. Even mature hosting companies need reminding from time to time that it takes ongoing investments in both infrastructure and talent to scale – and with scale there can be growing pains.

  • igostartup

    Good post!!!
