“We Sweat the Seconds”: An Interview with the LimeLight DevOps Team

At LimeLight, we spend a lot of time thinking about our platform stability and uptime, and for good reason. Our 99.9% average platform uptime is something we’re very proud of, and since uptime matters a lot to advertisers and business owners, we’re always looking at ways to improve not only our product offerings but our reliability as well.

To get an inside look at how we’ve achieved one of the highest platform stability ratings in the industry, we caught up with two of the people responsible for making the LimeLight platform one of the most reliable in the business: VP of Engineering and Operations Clark Huang, and Senior Systems Administrator Gustavo Folga.

From sweating every second to investing in both people and performance, Clark and Gustavo shine a light on what really matters in today’s fast-paced digital economy.


What exactly is platform stability, and why is it so important to advertisers and eCommerce brands today?

Clark: Platform stability, which is sometimes referred to as uptime, basically means how reliably the tools and features on an online platform work the way they’re intended to. And for every one of our clients and virtually every eCommerce business owner, uptime directly impacts your business and your bottom line.

In this industry, uptime equals money. Our clients run offers and pay affiliate marketers a lot of money to drive traffic to their sites. Once customers are on their site and are ready to complete a transaction, they’re looking for total transaction response times of three, four or maybe five seconds maximum. Higher transaction times equate to drop-offs since people just don’t want to wait for a page to load.

There are two things at play here. First, there’s the speed factor: how quickly can you process a transaction? But there’s also a reliability element: how reliably can customers complete a transaction? A lot of this comes down to how well advertisers and the platforms you rely on to do business can withstand certain failures.

Failures are inevitable, but the way you build in platform reliability is with the appropriate amount of redundancy. One of the reasons we are able to achieve significant platform stability is that we’ve put certain redundancies in place. There are rarely any single points of failure [on our platform].

What are redundancies, and why do they matter to platform speed and reliability?

Clark: Redundancy is having multiple servers connected to multiple networks and presented out of multiple physical data centers. For example, we’re hosted in AWS (Amazon Web Services), and Amazon’s overall infrastructure is made up of multiple data centers in the United States and throughout the world.

These data centers are grouped into geographical regions, so you have US West 1, US West 2, so on and so forth. As of today, we are not in multiple geographical locations, or what they call regions. The only reason is that, when you have servers in different regions that are separate data centers in separate networks, they’re running over generally public lines. The latency between two regions — so West coast and East coast, for example — can affect the speed [of transaction] component.

What has been LimeLight’s approach to redundancy? How does that impact our speed?

Clark: We’ve chosen our redundancies very carefully and strategically based on what we know about latency across different regions. So yes, we could make elements of our platform redundant in two regions, but in order for all of that data to sync up in a consistent and reliable way, it takes time.

What a lot of people don’t understand about AWS is that within the different regions, there’s a split called “availability zones”. These are physically separate buildings, so they’re not all in one place. They could be miles away from one another, but usually in the same city or general area. By working in multiple availability zones within the same region, you get many of the benefits of cross-region redundancies but without the negatives of slower speeds.

So it could take eight seconds to process a transaction in a highly redundant environment. That doesn’t sound like a lot, but we and our customers simply can’t afford an eight-second response time. We want a three to four second response time, so we have chosen to stay in one region while working across multiple availability zones. This is how we balance speed with reliability.
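A back-of-the-envelope sketch of the reliability side of this trade-off, in Python. The per-zone availability figure is an illustrative assumption, not a LimeLight or AWS measurement, and the formula assumes zones fail independently of one another, which real outages do not always respect. Latency is not modeled here.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    return 1.0 - (1.0 - per_zone) ** zones


# Hypothetical figure: each zone on its own is up 99% of the time.
PER_ZONE = 0.99

for n in (1, 2, 3):
    print(f"{n} zone(s): {combined_availability(PER_ZONE, n):.4%}")
```

Under these assumptions, each added zone cuts the chance of a total outage sharply, which is why spreading across availability zones inside one region can buy a lot of reliability without paying the cross-region latency cost Clark describes.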

So how different is a four-second vs. an eight-second response time? Do a few seconds really matter?

Clark: Even within the technical team at LimeLight, we’re all consumers. We know what it’s like to have a slow transaction and how much of a hindrance that is to the customer experience. So for example, you go to some small business sites that are hosted on a WordPress eCommerce system. You click on something, the wheel spins, you’re not sure if it went through so you click again… it’s a very aggravating experience.

Now in certain industries and for certain products that have limited availability, [a consumer] might be willing to put up with this slow and sub-par online experience, but for [our clients’ industries] where the customer journey is relatively brief and the attention span is equally short, you can’t really afford an extra few seconds.

So for example, if someone watches a 30-second clip about a product, clicks on that ad and submits their credit card details, only to have that transaction take even just a few seconds longer than they’re used to, they’ll sometimes say “oh forget it, I don’t want it that bad”. In our work, we have to sweat the seconds, because, for our clients, just a few seconds is a very big deal.

What’s changed in the consumer demand for online shopping experiences? Are people really that much more impatient?

Clark: Consumers today are spoiled: they’re used to unlimited server capacity and lightning-fast speed. So really, that puts a lot more pressure on performance, which carries over to the product development side. So when I work with our team to launch new features, we always make sure that performance is a key element we look at. That sometimes means pushing back at the right times and putting [very robust] quality assurance processes in place.

It’s saying “let’s do things in a way that’s performant”, especially when we want to add all the bells and whistles to our platform. Sometimes, those bells and whistles don’t equate to fast performance, and striking that balance is something we always focus on.

This really comes back to redundancies. So yes, [we] could expand into different AWS regions in order to launch a bunch of new features, but then speed suffers… performance suffers. It’s a tightrope walk of getting the right features but also putting performance at a very high standard.

LimeLight is at 99.9% uptime, while some of the other guys are down around 95% uptime or lower. What makes uptime and stability so hard to achieve?

Clark: I think it’s the investment. Some of our competitors put their data centers in areas that save cost but also put performance and speed at a disadvantage. So for example, one of our major competitors put their data center in Iceland. That probably happened for a number of reasons, but first and foremost it was likely due to cost.

So why does it even matter that it’s in Iceland? Well, I’ve seen latency differences between just the East coast and West coast of the United States — so just imagine what that lag might be like between here and Iceland!

Gustavo: And it’s not only about choosing the right cloud provider, it’s really about choosing the right tools to maintain the reliability of the system. That [comes down to using the right] monitoring tools and leveraging the right automation to make things repeatable, from pushing code to your web servers to running tests in an automated fashion, et cetera. There really are so many other elements that go into how we maintain our platform stability. We’ve made a significant investment in these types of things, especially very advanced monitoring tools to enhance reliability and accessibility.
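To put the 99.9% and 95% uptime figures from the question above in perspective, here is a small Python sketch that converts an uptime percentage into hours of downtime per year. The only inputs are the two percentages quoted in this interview; the function name is ours and purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a platform is unavailable at the given uptime."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


for uptime in (99.9, 95.0):
    hours = downtime_hours_per_year(uptime)
    print(f"{uptime}% uptime -> about {hours:.1f} hours of downtime per year")
```

At 99.9% that works out to just under nine hours a year, while 95% is about 438 hours, or more than 18 days.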

We’ve got an exceptionally robust technical team for what is a pretty lean operation overall. How many of our competitors take the time to have an Engineering/Ops person who spends the time on these things?

Clark: Many of our competitors, they’re down to two or three developers. So no, often they don’t have a dedicated person like Gustavo who puts a significant amount of time and energy into these elements. Instead, they have three technical people wearing all kinds of hats doing all of these things.

So again, it comes down to investment. Not only are we investing in AWS to have the right amount of redundancy, but we’re able to keep a staff that’s big enough that there is a separation of duties. That kind of separation across a robust technical team equates to higher performance, better reliability and, down the line, significantly higher uptime.

What are some of the issues we’ve had on the LimeLight platform, and how have we changed up the way we’re doing things to prevent these types of issues in the future?

Clark: A while ago I was having a conversation with someone who said to me, “the word on the street is that LimeLight is more buggy than usual.” I think that was true at the time, because we were trying to move very fast in getting features up and running for our clients, and sometimes that comes at the expense of performance.

I think, through that process, we learned to put additional focus and effort into quality assurance and testing. Since then, I think we’ve pivoted to focus more on performance as well as capability — really understanding how to better balance risk and reward. Plus, just recently we split our infrastructure and our services and made a shift to something called microservices, which is essentially separating sub-systems so that any one failure doesn’t bring down the whole system.

So recently, we actually had an outage in our analytics platform. But it only affected the analytics and reporting — it wasn’t affecting transactions and getting money from consumers. Sure, accountants might have had to go for a coffee break or step away for lunch and run reports afterwards, but it wasn’t affecting those mission-critical, money-making elements of the business.

This is important because, for our clients, the services and systems that support revenue growth are most critical. We need to be able to prioritize these things for our clients to safeguard against the inevitability of short-term failures and outages.
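The idea of keeping a reporting failure away from the revenue path can be sketched in a few lines of Python. This is a simplified illustration of failure isolation in general, not LimeLight’s actual code: `charge_card`, `record_analytics`, and `process_order` are hypothetical stand-ins, and the analytics call is hard-coded to fail so the effect is visible.

```python
import logging

logger = logging.getLogger("checkout")


def charge_card(order: dict) -> str:
    """Hypothetical stand-in for the revenue-critical payment step."""
    return f"charged {order['amount']} for order {order['id']}"


def record_analytics(order: dict) -> None:
    """Hypothetical stand-in for a separate analytics service that is down."""
    raise ConnectionError("analytics service unavailable")


def process_order(order: dict) -> str:
    # The payment step is critical: if it fails, the error propagates.
    receipt = charge_card(order)
    # Analytics is best-effort: a failure is logged, not passed to the customer.
    try:
        record_analytics(order)
    except ConnectionError as exc:
        logger.warning("analytics skipped for order %s: %s", order["id"], exc)
    return receipt


print(process_order({"id": 42, "amount": "19.99"}))
```

The order still completes even though the analytics call fails. In a real microservices setup the two pieces would typically run as separate services, but the principle is the same one Clark describes: reports can wait, transactions cannot.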

So what is it about the model we have at LimeLight that works for so many of our clients? What makes LimeLight the right platform for such a wide range of business owners and advertisers?

Clark: The way I see it, we’re a small company, but we’re big enough to have the right people in the right roles that are properly segmented. That said, we’re still small and nimble, and we work extremely fast. At this point, we’re doing weekly releases while some of our much larger competitors are more on monthly or even quarterly releases. So if you file some bug and that company determines that it’s not a mission-critical fix, you’ll likely have to wait quite a while to see that improvement in your platform.

There are also a lot of guys out there that put a huge focus on getting new features out there before anybody else — but what I often see is that this comes at a huge cost to performance and reliability. If the focus is entirely on new features and speed to market, then suddenly you see a significant dip in how often these features work reliably.

I think that’s the beauty of LimeLight — our systems are just as robust as these major companies but we’re able to tailor things on a much more personalized level for our customers. I think we’re in the sweet spot. We’ve adopted all the best practices that any large organization would do, but in a way that still makes sense to a small or mid-sized company.

So as LimeLight continues to grow, how do you plan to continue to raise an already high bar for platform stability?

Gustavo: As our business grows, we really look at where to invest, and at having a plan in place for how we innovate and improve down the line. Plus, we’re looking at ways to streamline at the same time, so we can move faster and do so in a way that’s economical for us and, in turn, our clients.

We’re constantly looking a few steps ahead so that we’re able to create a strategy for not just being “good enough”, but being exponentially better and better as we grow.

Contact our team to find out more about how and why platform stability is important for the success of your business.