AWS Lambda: What We’ve Learned
By: Joe Awad
Last October, we published a blog post on the business case for moving our main API to AWS Lambda and on some of the benefits it provides. Now that we have been using Lambda for two years and its scope has expanded significantly, we want to revisit some of the things we’ve learned over that time.
There were a number of benefits that we expected to receive from using Lambda: the ability to scale, cost savings, client isolation, and increased security. All of them have come to fruition and exceeded our expectations going in.
On scaling, Lambda can handle thousands of simultaneous requests, while our old story generation process could not handle more than 150. The time it takes to scale when facing sudden bursts of traffic is also vastly improved. In our load testing, Lambda took only about 5 seconds to spin up more capacity, even when we hit it with a sudden large burst of traffic. This stands in contrast to the 4 to 8 minutes it took us to spin up new EC2 instances.
Despite the huge increase in scale and throughput, our AWS bill for powering that API end to end is less than a tenth of what we used to pay for the EC2 instances before migrating to Lambda. Instead of pursuing other options to reduce costs even more, we decided to turn our focus toward optimizing performance, improving reliability, and increasing test coverage.
Over the two years that we have been using Lambda, no client has caused an outage. Even during our internal load testing, in which we flooded the API with orders of magnitude more requests than our peak traffic volume, we were unable to trigger an outage or substantially degrade performance. Previously, an unexpected burst of traffic from one customer had the potential to degrade performance for other customers in our multitenant environment. In unusual cases, this led to API service outages.
Finally, because the code that processes each client’s requests is divided into separate Lambda functions and the data is never saved to disk, it has been substantially easier to get sign-off from potential new customers’ security teams. There are no concerns about data retention periods or the risk of having a database breached.
There were also a number of benefits that we thought we might get from the transition but were unsure about. These include improved individual request performance as well as improved maintainability of both the system and individual endpoints.
Since we had to refactor the architecture of our API to put it on Lambda, we were also able to re-evaluate which features we absolutely needed for production requests and reduce the number of steps that each API request goes through. Because each Lambda endpoint serves a single use case and we do not need to save the data payloads that clients send us, we no longer have a database associated with the API, which eliminates a scaling bottleneck and substantially speeds up response times. We also no longer hand off the request between multiple services, since all of the code used to process the request is already loaded into the Lambda function.
Lambda calls into a single service that processes the request. This means that we do not have services taking up processing power while simply waiting on a response from a different service, which increases our utilization of the available resources. Additionally, each endpoint can have its own customized amount of memory and processing power, allowing us to rightsize the hardware to the client’s needs. Together, these factors have decreased average API response time by more than 50 percent.
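As a rough illustration of per-endpoint sizing, here is a minimal sketch in Python. It only builds the arguments that boto3's Lambda client method `update_function_configuration` accepts (`FunctionName`, `MemorySize`, `Timeout`). The `EndpointSizing` class, the function names, and the numbers are hypothetical and are not taken from our system. Lambda allocates CPU in proportion to the configured memory, so a single memory setting effectively controls both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSizing:
    """Hypothetical per-endpoint settings; names and values are illustrative."""
    function_name: str
    memory_mb: int
    timeout_seconds: int


def to_update_kwargs(sizing: EndpointSizing) -> dict:
    """Build kwargs in the shape expected by boto3's
    lambda_client.update_function_configuration(**kwargs)."""
    if sizing.memory_mb < 128:
        # 128 MB is the smallest memory size Lambda allows.
        raise ValueError("Lambda memory must be at least 128 MB")
    if sizing.timeout_seconds < 1:
        raise ValueError("timeout must be at least 1 second")
    return {
        "FunctionName": sizing.function_name,
        "MemorySize": sizing.memory_mb,
        "Timeout": sizing.timeout_seconds,
    }


# Assumed example: a heavy endpoint gets more memory than a light one.
SIZINGS = [
    EndpointSizing("client-a-story", memory_mb=2048, timeout_seconds=120),
    EndpointSizing("client-b-summary", memory_mb=512, timeout_seconds=30),
]
```

With AWS credentials in place, each entry could be applied with something like `boto3.client("lambda").update_function_configuration(**to_update_kwargs(s))`.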
Running our API on Lambda has made it significantly easier to maintain. The code base is a fraction of its previous size, and requests no longer hit services that are shared with other requests or get passed between services. Because each API endpoint has its own Lambda function with its own copy of our shared libraries, we have more granularity and control over testing and rolling out changes. Additionally, each Lambda function writes logs and tracks metrics in its own individual CloudWatch log group. This separates out the logs for each client endpoint, which also allows for easier monitoring, debugging, and cleanup out of the box.
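To make the per-function logging concrete: by default, Lambda writes each function's logs to a CloudWatch log group named `/aws/lambda/<function-name>`. The sketch below derives those names and builds the arguments for the CloudWatch Logs call `put_retention_policy`, which is one way cleanup could be automated. The function names and the 30-day retention value are assumptions for illustration, not our actual settings.

```python
from typing import Iterable, List


def log_group_for(function_name: str) -> str:
    """Default CloudWatch log group name for a Lambda function."""
    return f"/aws/lambda/{function_name}"


def retention_requests(function_names: Iterable[str], retention_days: int) -> List[dict]:
    """Build kwargs for boto3's logs_client.put_retention_policy(**kwargs),
    one request per function's log group."""
    return [
        {"logGroupName": log_group_for(name), "retentionInDays": retention_days}
        for name in function_names
    ]


# Hypothetical function names; 30 days is an assumed retention period.
REQUESTS = retention_requests(["client-a-story", "client-b-summary"], 30)
```

Each dictionary could then be passed to `boto3.client("logs").put_retention_policy(**request)`.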
We have not experienced any downsides as a result of our migration to Lambda. When we started building the system, we investigated the recommended method of using API Gateway to trigger the Lambda invocations. However, API Gateway has a number of limitations related to timeouts and request size that do not meet our requirements for this use case. To get around these limitations, we built our own routing system that connects the API with Lambda. We also built automation that fully handles the building, deployment, and testing of new Lambda functions to ensure that everything is running properly and up to date.
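The details of our routing system are beyond the scope of this post, but the general idea of routing a request to a Lambda function without API Gateway can be sketched as follows. This is a simplified illustration, not our implementation: the route table, paths, and function names are hypothetical. The `invoke` argument is expected to behave like boto3's Lambda client method `invoke`, which takes `FunctionName`, `InvocationType`, and `Payload` and returns a response whose `Payload` can be read as bytes.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical route table mapping API paths to Lambda function names.
ROUTES: Dict[str, str] = {
    "/v1/client-a/story": "client-a-story",
    "/v1/client-b/summary": "client-b-summary",
}


def resolve_function(path: str, routes: Dict[str, str]) -> str:
    """Look up the Lambda function registered for an API path."""
    try:
        return routes[path]
    except KeyError:
        raise LookupError(f"no Lambda function registered for {path!r}")


def build_invoke_kwargs(function_name: str, body: Any) -> dict:
    """Build kwargs for a synchronous invocation via lambda_client.invoke."""
    return {
        "FunctionName": function_name,
        "InvocationType": "RequestResponse",  # wait for the function's result
        "Payload": json.dumps(body).encode("utf-8"),
    }


def route_request(path: str, body: Any, invoke: Callable[..., dict],
                  routes: Dict[str, str] = ROUTES) -> Any:
    """Resolve the target function, invoke it, and decode the JSON response."""
    kwargs = build_invoke_kwargs(resolve_function(path, routes), body)
    response = invoke(**kwargs)
    return json.loads(response["Payload"].read())
```

In a real service, `invoke` would be `boto3.client("lambda").invoke`. Passing it in as a parameter keeps the routing logic testable without AWS access.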
Overall, Lambda has exceeded our hopes here at Narrative Science. We improved performance across the board and decreased our costs by more than 90 percent. We firmly believe that the key to that success has been investing appropriate time in building the tooling and automation to manage our Lambda usage, so that it runs entirely on its own with extremely high reliability.