Plotting all your data can be hard. Some think it’s pointless. Some think it’s a waste of time. Some think generic dashboards are better. Some think logging is more precise.

Nothing can replace a well-designed scatter plot as a critical diagnostic tool.

Let’s start with a traditional time series view of a large dataset. We’ve all seen these plastered on dashboards, PowerPoint presentations, and benchmarking documents. In this case I have:

  • a Kubernetes cluster,
  • running a microservices-driven architecture,
  • with auto-scaling enabled on search services,
  • spinning up to 12 parallel pods,
  • servicing randomized fuzzy searches,
  • with a 250 ms target latency,
  • and a linear scalability target as we start more pods.

This test launched 96 parallel threads, each conducting continuous searches, with a new thread spun up every minute until the 96-minute mark.
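(The harness itself isn’t the point of this post, but if the ramp is hard to picture, here is a minimal sketch of that schedule. run_fuzzy_search is a made-up stand-in for the real search client call, not anything from my actual tooling.)

```python
import threading
import time

def run_fuzzy_search():
    # Stand-in for the real search client call (hypothetical).
    time.sleep(0.25)  # pretend each search takes roughly the 250 ms target

def search_worker(stop_event):
    # Each worker fires searches back to back until told to stop.
    while not stop_event.is_set():
        run_fuzzy_search()

stop = threading.Event()
workers = []
for minute in range(96):
    w = threading.Thread(target=search_worker, args=(stop,), daemon=True)
    w.start()
    workers.append(w)
    time.sleep(60)  # one new concurrent searcher per minute, 96 in total
```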

most performance engineers start here…

With this in mind, my default metrics spit out a familiar transactions per minute summary (because counting per minute is easy!):

If this were the only information I had access to, I’d explore the event at 12:00, where I see a drop in throughput, and then focus on 12:45, where the workload peaks.
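(Counting per minute really is easy. Given a table of transactions with a timestamp column, the entire rollup is one resample in pandas. The file and column names below are assumptions for illustration, not my actual schema.)

```python
import pandas as pd

# Assumed schema: one row per completed search, with a completion timestamp
# and the measured latency in milliseconds.
df = pd.read_csv("search_transactions.csv", parse_dates=["timestamp"])

# Transactions per minute: count how many rows land in each one-minute bucket.
tpm = df.resample("1min", on="timestamp").size()
print(tpm.describe())
```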

when we should start with raw logs

But I have the raw logs too. About a million transactions. Why not have a look?

By taking my analysis a step further, I can probe for additional insights before diving into the logs around these two time periods. For one, LogDNA on this cluster is recording about a billion lines a day – that’s a lot of probing.

What more can we see by manipulating the latency data?

Let’s create a scatter plot of ALL the data points, then look for patterns. Driving the marker size down yields a big old gray mass, but we get an early indication that something has shifted at the front end of the test. We can see a second population of high-latency transactions develop:

If we increase the transparency of the marker fill, we can start to see a better frequency distribution, as the overlapping transparent markers give us a heat map:
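In matplotlib terms, both of the steps above amount to two arguments on a single scatter call: a tiny marker size for the big gray mass, and a low alpha for the heat-map effect. A sketch, reusing the assumed DataFrame from earlier (the exact s and alpha values are workload-dependent guesses):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 6))
# s=1 shrinks each transaction to a dot; an alpha well below 1 means
# overlapping dots stack up into darker regions, so dense latency bands
# read like a heat map.
ax.scatter(df["timestamp"], df["latency_ms"], s=1, alpha=0.03)
ax.set_xlabel("time")
ax.set_ylabel("latency (ms)")
plt.show()
```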

Now we can see many things going on within our data!

insights!

We can quickly see we have at least four sustained latency bands in the dataset. These indicate that some percentage of the transactional population is being serviced at a reasonable latency, and another population is much slower.

This type of pattern is consistent with a saturated context pool somewhere in our stack.
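The bands are easy to confirm outside of the scatter plot, too: a latency histogram over the full run shows the same multi-modal shape, one peak per band (again a sketch against the assumed latency_ms column):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 4))
# Each sustained latency band shows up as its own peak; a log-scaled count
# axis keeps the smaller, slower populations visible next to the main one.
ax.hist(df["latency_ms"], bins=200, log=True)
ax.set_xlabel("latency (ms)")
ax.set_ylabel("transaction count (log scale)")
plt.show()
```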

Here’s a closer look at the bands:

As we zoom in, we can also see there may be double rainbows …er…bands within the lower region that we didn’t see before. We can also see the upper bands appear later in the plot than the lower bands:

This indicates that latency populations are shifting higher as the test is running.
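One quick way to check that the population really is drifting, and not just my eyes pattern-matching, is to slice the run into time buckets and compare latency percentiles bucket by bucket. A sketch, again assuming the timestamp and latency_ms columns:

```python
# Slice the run into 15-minute buckets; if the upper bands really do appear
# later, the p95/p99 columns should climb from one slice to the next.
df["slice"] = df["timestamp"].dt.floor("15min")
summary = df.groupby("slice")["latency_ms"].quantile([0.50, 0.95, 0.99]).unstack()
print(summary)
```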

now for the cool stuff

You could argue that we can use other statistical methods, such as a percentile time series analysis, to characterize these latency shifts, and I would not argue with you. Our end goal should be numbers that represent what these scatter plots show, but a complete visualization of the latency populations is a great starting point for familiarizing yourself with a new workload in a new environment.
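For completeness, here is roughly what that percentile time series alternative might look like. It is a perfectly good summary; it just tends to compress away the band structure that the raw scatter plot makes obvious. (Same assumed columns as before.)

```python
import matplotlib.pyplot as plt

# Per-minute latency percentiles plotted as a classic time series view.
latencies = df.set_index("timestamp")["latency_ms"]
fig, ax = plt.subplots(figsize=(14, 6))
for q, label in [(0.50, "p50"), (0.95, "p95"), (0.99, "p99")]:
    series = latencies.resample("1min").quantile(q)
    ax.plot(series.index, series.values, label=label)
ax.set_xlabel("time")
ax.set_ylabel("latency (ms)")
ax.legend()
plt.show()
```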

That said, I have not found consistent statistical approaches that can quickly identify small problems in transactional workloads. In the last decade, I have yet to see a monitoring tool identify micro-outages like these (note the header on my blog – same thing…from over 10 years ago!):

The arrows highlight curious gaps (maybe GC events?) that warrant investigation. I can now zero in on the specific transactions surrounding these gaps to narrow down the timestamps spanning each event of interest.
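Finding candidate gaps like these programmatically is also straightforward once you have every transaction’s completion time: sort, diff, and flag any stretch with no completions at all. The 5-second threshold below is an arbitrary assumption to tune per workload:

```python
import pandas as pd

GAP_THRESHOLD = pd.Timedelta("5s")  # assumption; tune to the workload

# Sort by completion time and flag stretches where nothing finished at all;
# with ~96 concurrent searchers, even a short silent span is suspicious.
ts = df["timestamp"].sort_values().reset_index(drop=True)
gaps = ts.diff()
for i, gap in gaps[gaps > GAP_THRESHOLD].items():
    print(f"{gap.total_seconds():.1f}s with no completions, "
          f"from {ts.iloc[i - 1]} to {ts.iloc[i]}")
```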

Did you see the other weird blip? This one:

There’s a curious shift in our workload, and then it recovers. These patterns jump out when you look at all the data. Now we have another interesting time gap to take a look at.

By looking at “all the data”, we quickly identify patterns that we would never see by scrolling through a log, or aggregating a tpm rate, or dividing my tpm rate by 60 to get a tps rate (because aggregating by the 1,440 minutes per day is easier than counting by the 86,400 seconds per day).

In conclusion, spend time getting to know your data up front by looking deeper than a simple moving average. There are many interesting patterns in your existing metadata that are strong indicators of specific problems you can then target for deeper investigation. Remember my original comment on the scale of this problem: a billion lines of logs, with a million search transaction entries, where we are looking for any actionable performance-tuning indicators. In this example, plotting the whole population gives us massive insight into those billion lines of log entries.