When it comes to delivering a delightful user experience, every second matters. Studies have shown that users have a very low tolerance for slow-loading sites, so it’s critical for us as a software company to understand the performance of our products and how it changes over time. There are a variety of factors that impact the speed of writing a data story in our product Lexio, and Lexio has multiple releases a day as a part of our CI/CD process, so we want to be the first to know when and why performance degradations occur. This starts with collecting granular, actionable data about Lexio’s performance. We are working towards storing and presenting this information in a way that is easy to access and interpret across our business.
How do we do Observability in Lexio?
In New Relic, we track endpoint-level timing information along with lower-level function timings. Because recording timing information adds some overhead, we only instrument functions that have a large effect on performance. Along with raw run time, we attach metadata to each event. Combined, this data gives a granular picture of Lexio’s performance and of what is contributing to any performance problem.
Why doesn’t New Relic solve all our problems?
We make heavy use of New Relic’s custom event framework, which makes it easy to time code blocks and attach arbitrary custom parameters to each timing event. This model is quite powerful: it allows us to generate a rich dataset with arbitrarily detailed performance information for Lexio. However, we need to strike a balance between extremely granular data (which can be very noisy and difficult to interpret) and very high-level data (which is often not detailed enough to provide actionable insights). We’ve found that tracking only at the endpoint level is too coarse, so we instead look for meaningful places to put instrumentation within our library code.
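To make the "time a code block and attach metadata" pattern concrete, here is a minimal sketch. It assumes a Python codebase, and the names timed_event, collect, and StoryWrite, along with the attribute names, are hypothetical and not taken from Lexio. The recorder is passed in as a plain function so the sketch runs without any agent installed; with the New Relic Python agent, newrelic.agent.record_custom_event(event_type, params) should be able to fill that role.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_event(event_type, recorder, **attributes):
    """Time the wrapped block and report one event with duration and metadata."""
    start = time.perf_counter()
    try:
        # Yield the attribute dict so the block can add metadata it discovers.
        yield attributes
    finally:
        params = dict(attributes)
        params["duration"] = time.perf_counter() - start
        recorder(event_type, params)


# Hypothetical recorder that just collects events in memory.
events = []


def collect(event_type, params):
    events.append((event_type, params))


with timed_event("StoryWrite", collect, organization="acme", cached=False) as attrs:
    attrs["story_type"] = "metric"  # metadata only known inside the block
    sum(range(1000))  # stand-in for the real work being timed
```

Because the event is recorded in a finally clause, a block that raises still produces a timing event, which is usually what you want when investigating slow or failing requests.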
The New Relic user interface is designed for software engineers and other technical users. Data exploration is done using New Relic Query Language (NRQL), a SQL-like query language with custom syntax and functions used to select and aggregate event data from the New Relic database. New Relic can also present data in dashboards of graphs, tables, and tiles, but they require the user to know how to interpret the array of visualizations and where exactly to look for key takeaways. So while the data collected in New Relic is valuable to everyone who builds, supports, or sells Lexio, only a handful of engineers have direct access to the platform, and its modes of data exploration and presentation fail to communicate performance insights to a wider audience.
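For readers who have not seen NRQL, the sketch below builds the kind of query an engineer might run. The event type StoryWrite and the attributes duration and cached are assumptions made for illustration; only the NRQL keywords and aggregate functions are standard.

```python
def story_write_nrql(event_type="StoryWrite", facet="cached", since="2 days ago"):
    """Build an NRQL query summarizing timings, grouped by one attribute."""
    return (
        f"SELECT average(duration), max(duration), count(*) "
        f"FROM {event_type} FACET {facet} SINCE {since}"
    )


query = story_write_nrql()
print(query)
```

Even a short query like this assumes the reader knows which event type and attributes exist, which is part of why the data stays out of reach for non-engineers.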
Our product specializes in surfacing easy-to-understand data insights in plain English, so we embarked on a hackathon project to make Lexio’s performance metrics accessible to our internal users using the product itself.
How do we get data out of New Relic?
The first step in our project was to get data out of New Relic and into our system. We achieved this using the Singer framework, which defines a standard for data extraction via a “tap” script that sends data to be consumed by a “target” script. We wrote a Singer tap to pull data from the New Relic API and then set up an integration in Stitch, the ETL vendor we use, which can act as a Singer target to consume the data. Because we track performance data for a wide variety of requests, we ran into limits on how much historical data we could process for our project. Two days’ worth of data amounted to nearly 2 million rows, but only a fraction of these requests are relevant to most stakeholders, so filtering during extraction or transformation could allow us to work with data over a longer period.
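As a rough illustration of what a tap does, the sketch below writes Singer-style SCHEMA, RECORD, and STATE messages as JSON lines, and applies a filter during extraction. It is not our actual tap: the stream name, schema fields, and sample events are hypothetical, and the call to the New Relic API is replaced by an in-memory list.

```python
import io
import json
import sys

STREAM = "story_writes"  # hypothetical stream name

SCHEMA = {
    "type": "object",
    "properties": {
        "timestamp": {"type": "integer"},
        "name": {"type": "string"},
        "duration": {"type": "number"},
    },
}


def emit(message, out):
    """Write one Singer message as a single JSON line."""
    out.write(json.dumps(message) + "\n")


def run_tap(events, keep=lambda event: True, out=sys.stdout):
    """Emit a SCHEMA, then a RECORD per kept event, then a STATE bookmark."""
    emit({"type": "SCHEMA", "stream": STREAM, "schema": SCHEMA, "key_properties": []}, out)
    last_seen = None
    for event in events:
        last_seen = event["timestamp"]  # bookmark advances past filtered events too
        if keep(event):
            emit({"type": "RECORD", "stream": STREAM, "record": event}, out)
    if last_seen is not None:
        emit({"type": "STATE", "value": {"last_timestamp": last_seen}}, out)


# Stand-in for events fetched from the New Relic API.
sample = [
    {"timestamp": 1, "name": "run_story", "duration": 0.8},
    {"timestamp": 2, "name": "other_function", "duration": 0.1},
]
buffer = io.StringIO()
run_tap(sample, keep=lambda e: e["name"] == "run_story", out=buffer)
messages = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Filtering inside the tap, as the keep argument does here, is one way to reduce row counts before the data ever reaches the target.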
For the purposes of our hackathon project, we simply ran the Singer tap from a laptop. Since then, we have started using AWS Step Functions to orchestrate scheduled executions of a data extraction script based on the New Relic Singer tap, so that the extraction process is reusable across organizations in Lexio. Once the Stitch data integration was receiving data from New Relic, we made use of existing infrastructure that links a data connection in Lexio to a Stitch integration. Stitch sends data from the integration to our S3 data lake, and the data is then transformed and loaded through our data pipeline.
What story do we tell?
Now that the New Relic data was in Lexio, we had to decide what we wanted to communicate. The aspect of performance that is most visible to Lexio’s end users, and thus the most significant to internal stakeholders, is story write time: the time it takes for a data story to show up on the screen. We track this by measuring the request duration for a function called run_story, which generates story language given story parameters.
Using Lexio, we could read about changes in average and maximum story write time and number of written stories per day, and see how different data connections and organizations compared in terms of these metrics. An additional insight Lexio provided that wasn’t available in New Relic was driver analysis, which identifies factors across multiple dimensions that drove change in a metric over time. For instance, when tracking average story write time, we were able to see which data connection, organization, and user were most responsible for driving an increase in story write times over a few days.
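To give a feel for what a driver analysis computes, here is a simplified sketch that splits the change in an overall average across the values of one dimension. This is an illustrative decomposition, not Lexio’s actual algorithm, and the function name, field names, and sample rows are hypothetical.

```python
from collections import defaultdict


def average_change_drivers(before, after, dimension, metric="duration"):
    """Attribute the change in the overall average to each value of a dimension.

    Each value's contribution is its share of the overall average in the
    later period minus its share in the earlier period, so the contributions
    sum to the total change in the average.
    """
    contributions = defaultdict(float)
    for rows, sign in ((after, 1.0), (before, -1.0)):
        if not rows:
            continue
        for row in rows:
            contributions[row[dimension]] += sign * row[metric] / len(rows)
    # Largest absolute contribution first.
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)


before = [
    {"organization": "a", "duration": 1.0},
    {"organization": "b", "duration": 1.0},
]
after = [
    {"organization": "a", "duration": 1.0},
    {"organization": "b", "duration": 3.0},
]
drivers = average_change_drivers(before, after, "organization")
```

Here the overall average rises from 1.0 to 2.0 seconds, and organization "b" accounts for the whole increase. Note that this simple split mixes two effects, a segment getting slower and a segment making up more of the traffic, which a fuller analysis would separate.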
Lexio also broke down story write times and number of written stories along the dimension of story cache usage. To improve story-write performance, Lexio maintains a cache of story language for recently viewed stories and pre-caches frequently accessed stories like bookmarked or pinned metric stories, and we track in New Relic whether each story write request was able to use a cached result. Unsurprisingly, cached stories load much more quickly than uncached ones, but we also found that organizations differ widely in their proportions of cached story writes. This could be accounted for by different usage patterns between organizations. An organization where users’ interaction with Lexio is confined to bookmarked and pinned stories will encounter more cache hits, while a more exploratory mode of interaction, such as opening recommended follow-up stories, will require writing more uncached stories.
With upcoming support for writing about calculated fields in Lexio, we would be able to see a single story analyzing the story cache hit rate broken down by organization to gain more clarity on these discrepancies and potentially identify ways to improve cache warming.
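The calculated field in question, cache hit rate per organization, is straightforward to compute by hand. The sketch below assumes each story write event carries hypothetical organization and cached attributes; the sample rows are made up.

```python
from collections import Counter


def cache_hit_rate_by_org(story_writes):
    """Return the fraction of story writes served from cache, per organization."""
    totals, hits = Counter(), Counter()
    for write in story_writes:
        totals[write["organization"]] += 1
        if write["cached"]:
            hits[write["organization"]] += 1
    return {org: hits[org] / totals[org] for org in totals}


writes = [
    {"organization": "a", "cached": True},
    {"organization": "a", "cached": True},
    {"organization": "a", "cached": False},
    {"organization": "b", "cached": False},
]
rates = cache_hit_rate_by_org(writes)
```

An organization with a low rate in this output would be a natural candidate for a closer look at usage patterns or cache warming.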
What would we do next?
Even though Lexio stories are easily accessible to our team, the nature of performance data means that many of them still require a lot of domain knowledge to act on. To make sure our team is able to stay informed, we’re planning to write up recurring performance updates consisting of Lexio story snapshots along with some extra contextualizing comments by taking advantage of Lexio’s upcoming commenting functionality.
New Relic data is a great pressure test for Lexio. It shows off our team’s flexibility, both in working with large datasets and in extending Lexio to a very wide range of use cases.