[Image: timeline2015v1.0]

Time for a new twist on an old concept…the Gantt chart. Tried and true, we’re used to seeing it applied to old waterfall planning models. But we can push the concept further. Take the project chart for a large-scale POC of an MDM application housing 250 million records on a new clustered database. The schedule was extremely aggressive, with an original target of three weeks that, due to delays in database deployments and JDBC/ODBC connectivity issues, stretched to five weeks.

Can you build a visualization that conveys the complexity of a project involving a dozen technical people, hundreds of tests, and distinct performance goals?

Hey, I’m a data guy – you know what the answer is going to be…

Here’s a complete version of the graphic that is intended to be printed out on a large sheet of paper for a detailed review:

[Image: hybridTimeline]

You can click on the image and navigate the full-scale graphic. This post will walk you through the details contained in it. But for now, I just like looking at the visualization, because it’s, well, pretty freakin’ cool looking.

Why all the fuss

High-scale tuning and profiling requires a deep understanding of performance data science, stack tuning, and the art of negotiating with multiple groups within an organization. You need to know how to talk to the business owner, the application administrator, the database administrator, and the hardware infrastructure personnel. Typical deployment or corrective-action projects run on aggressive schedules, and a grasp of these core principles is required to ensure no time is lost to miscommunication. Developing a succinct graphic with layers of information that let the presenter and the audience probe as deeply as they want goes a long way toward aligning everyone involved in the project.

x-axis = time, y-axis = tps objective

In traditional project charts the y-axis has minimal significance other than a sequential order of tasks that mirrors the x-axis. If we rethink the presentation of a project, we can assign more significance to the y-axis. In this case, since this POC was focused on achieving specific throughput goals, I chose to make the y-axis a rate so we could assess our progress toward our goal over time.
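To make the idea concrete, here is a minimal sketch in Python/matplotlib of a rate-based y-axis: test runs plotted against time, with a horizontal line for the throughput objective. The run dates, tps values, and the 2,000 tps target are all invented for illustration.

```python
# Minimal sketch: plot test runs on a time vs. throughput canvas
# instead of a classic task-sequence Gantt. All data here is invented.
import matplotlib.pyplot as plt
from datetime import datetime

# (timestamp, observed tps) for a handful of hypothetical runs
runs = [
    (datetime(2015, 3, 2), 410),
    (datetime(2015, 3, 9), 980),
    (datetime(2015, 3, 16), 1450),
    (datetime(2015, 3, 23), 1900),
    (datetime(2015, 3, 30), 2150),
]
target_tps = 2000  # hypothetical throughput objective

times, tps = zip(*runs)
fig, ax = plt.subplots()
ax.plot(times, tps, marker="o")                    # experiments over time
ax.axhline(target_tps, linestyle="--", color="gray",
           label=f"target = {target_tps} tps")     # the goal the y-axis tracks
ax.set_xlabel("date")
ax.set_ylabel("throughput (tps)")
ax.legend()
fig.autofmt_xdate()
plt.show()
```

With the y-axis carrying a real metric, every point on the chart answers both "when did this happen" and "how close were we to the goal."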

setting tuning targets and tracking progress

Though the schedule was aggressive, this particular project set distinct target milestones with a comprehensive test plan.

[Image: timeline2015v1.0-schedule]

This is basic project management, but we also look at how these milestones correlate to test outcomes visually. Linking project schedules to results plots helps business owners understand the complexities of infrastructure setup and testing outcomes.

workload analytics by transactional mix types

Test sequencing was developed as part of the test plan, and that sequencing is captured in the automated profiling runs shown on the right side of the graphic. These lines represent experiments (points) connected in sequences of related experiments that form studies (lines). Various studies by workload type (insert, read, and search) are run sequentially, and metrics are collected across the stack and time-correlated with these runs to develop distinct workload profiles.
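As a rough illustration of that structure, here is a sketch with invented data and hypothetical column names that groups individual experiments into studies by workload type and draws each study as a connected line:

```python
# Sketch: connect individual experiments (points) into studies (lines),
# one line per workload type. Data and column names are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "run":      [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "workload": ["insert", "insert", "insert",
                 "read", "read", "read",
                 "search", "search", "search"],
    "tps":      [850, 1200, 1500, 2100, 2600, 2900, 600, 750, 900],
})

fig, ax = plt.subplots()
for workload, study in df.groupby("workload"):
    ax.plot(study["run"], study["tps"], marker="o", label=workload)
ax.set_xlabel("experiment sequence")
ax.set_ylabel("throughput (tps)")
ax.legend(title="workload type")
plt.show()
```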

server tuning and workload optimization

It is much easier to track progress by monitoring test sequence outcomes alongside target metrics. In this case the range of throughput rates for individual experiments is captured on the right. You can see steady progress in tuning following an extensive debugging phase:

[Image: timeline2015v1.0-tuning]
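One way to render those per-experiment ranges is with error bars, as in this sketch using invented min/max throughput values:

```python
# Sketch: show the throughput *range* observed within each tuning run,
# not just a single number. Values below are invented.
import matplotlib.pyplot as plt

runs    = [1, 2, 3, 4, 5, 6]
tps_min = [200, 350, 700, 1100, 1500, 1800]
tps_max = [450, 800, 1300, 1700, 2000, 2200]
tps_mid = [(lo + hi) / 2 for lo, hi in zip(tps_min, tps_max)]
yerr = [[m - lo for m, lo in zip(tps_mid, tps_min)],   # distance down to min
        [hi - m for m, hi in zip(tps_mid, tps_max)]]   # distance up to max

fig, ax = plt.subplots()
ax.errorbar(runs, tps_mid, yerr=yerr, fmt="o", capsize=4)
ax.set_xlabel("tuning run")
ax.set_ylabel("throughput range (tps)")
plt.show()
```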

workload profiling

Once the system was tuned, the full test plan was executed, moving through the various workloads to determine the range of throughput rates achievable in this environment with the representative data set.

[Image: timeline2015v1.0-profiling]
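The end product of a profiling pass like this is essentially a table of achievable throughput ranges per workload. A small sketch with invented numbers:

```python
# Sketch: reduce a full profiling sweep to the achievable throughput
# range per workload type. Data is invented.
import pandas as pd

df = pd.DataFrame({
    "workload": ["insert"] * 3 + ["read"] * 3 + ["search"] * 3,
    "tps":      [850, 1200, 1500, 2100, 2600, 2900, 600, 750, 900],
})

# min / median / max tps observed for each workload type
profile = df.groupby("workload")["tps"].agg(["min", "median", "max"])
print(profile)
```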

canary / tracer testing

There are two interesting sets of data points highlighted in black. The horizontal sequence is a reference workload that is injected into the test plan over time to monitor the reproducibility of a test result. This is helpful in determining the standard error associated with a test. This concept can be extended to a production system, where a reference workload or transaction can be issued at a regular cadence and used as a “canary” or “tracer” test for evaluating environment performance.

[Image: timeline2015v1.0-canary]
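A minimal sketch of the canary math, with invented throughput numbers: repeated runs of the reference workload give you a mean and standard error, and a later run can be checked against that band.

```python
# Sketch: estimate run-to-run variability from repeated canary runs and
# flag results that fall outside the expected band. Numbers are invented.
import statistics

canary_tps = [1480, 1510, 1495, 1520, 1470, 1505]  # reference workload, repeated

mean = statistics.mean(canary_tps)
stdev = statistics.stdev(canary_tps)
stderr = stdev / len(canary_tps) ** 0.5
print(f"canary mean={mean:.0f} tps, stdev={stdev:.1f}, stderr={stderr:.1f}")

# A later canary run can be checked against, say, a 3-sigma band:
new_run = 1320
if abs(new_run - mean) > 3 * stdev:
    print(f"canary at {new_run} tps is outside the expected band -> investigate")
```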

fault injection

The triangles represent fault injection tests. These tests are critical for understanding how an infrastructure will react to a failure. At a basic level, this testing can be as simple as taking a host within a cluster offline, which is what was done during these test cycles.

[Image: timeline2015v1.0-faultinjection]
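Here is a sketch of the simplest version of that test, taking one cluster host offline and restoring it. The host name, service name, and ssh-based approach are assumptions for illustration, not the project's actual tooling.

```python
# Sketch of the simplest fault-injection step described above: take one
# host in the cluster offline mid-test, then bring it back. Host and
# service names are hypothetical; adapt to your environment.
import subprocess
import time

HOST = "dbnode03.example.com"       # hypothetical cluster member
SERVICE = "clusterdb"               # hypothetical database service name

def run(host: str, cmd: str) -> None:
    # Execute a command on a remote host over ssh; raise on failure.
    subprocess.run(["ssh", host, cmd], check=True)

run(HOST, f"sudo systemctl stop {SERVICE}")   # inject the fault
time.sleep(300)                               # observe cluster behavior for 5 min
run(HOST, f"sudo systemctl start {SERVICE}")  # restore the node
```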

lab efficiency

Finally, the lower portion of the plot tracks lab efficiency. These large-scale environments are typically expensive to stand up and use, so we should have crisp test plans with trackable objectives, as we have seen above. But I also like seeing and sharing efficiency data that looks at how well we are using the environment we have been given access to. This graphic uses a bar chart approach, with each bar broken by a white line once we have hit our metric.

[Image: timeline2015v1.0-efficiency]

You can see that we didn’t do that well during the initial tuning phase, where we were attempting to determine the cause of low tps rates. After the tuning phase was complete, the utilization rates of the platform increased substantially, periodically meeting our objectives.
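As a rough sketch of what such a utilization metric might look like (the 75% target, 24-hour window, and daily hours are all invented):

```python
# Sketch: a simple lab-utilization metric, hours of active testing per day
# divided by available hours, compared against a target. Data is invented.
target = 0.75                              # hypothetical utilization objective
available_hours = 24
active_hours = [4, 6, 5, 14, 19, 21, 18]   # per day: debugging phase, then tuned

for day, hours in enumerate(active_hours, start=1):
    utilization = hours / available_hours
    mark = "hit" if utilization >= target else "miss"
    print(f"day {day}: {utilization:.0%} ({mark})")
```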

I find that putting these metrics in front of the team gets them thinking about hitting their goal. I’ve also used these objectives as team goals when bringing new infrastructures online. The best way to get automation in place is to set a high utilization target.

Give it a try

Give some thought to how you can adopt information design principles in your project graphics.