Gnuplot data analysis: a real world example

2013-04-01

Creating graphs in LibreOffice is a nightmare. They’re ugly, nearly impossible to customize and creating pivot tables with data is bloody tedious work. In this post, I’ll show you how I took the output of a couple of performance test scripts and turned it into reasonably pretty graphs with a few standard command line tools (gnuplot, awk, a bit of (ba)sh and a Makefile).

The Data

I ran a series of query performance tests against data sets of different sizes. The sets contain 10k, 100k, 1M, 10M, 100M and 500M documents. One of the basic constraints is that it has to be easy to add/remove sets. I don’t want to faff about with deleting columns or updating pivot tables. If I add a set to my test data, I want it automagically show up in my graphs.

The output of the test script is a simple tab separated file, and looks like this:

#Set	Iteration	QueryID	Duration
500M	1	101	10.497499465942383
500M	1	102	3.9973576068878174
500M	1	103	9.4201889038085938
500M	1	104	2.8091645240783691
500M	1	105	2.944718599319458
500M	1	106	5.1576917171478271
500M	1	107	5.7224125862121582
500M	1	108	5.7259769439697266
500M	1	109	4.7974696159362793

Each row contains the query duration (in seconds) for a single execution of a single query.

Processing the data

I don’t just want to graph random numbers. Instead, for each query in each set, I want the shortest execution time (MIN), the longest (MAX) and the average across iterations (AVG). So we’ll create a little awk script to output data in this format. In order to make life easier for gnuplot later on, we’ll create a file per dataset.

% head -n 3 output/500M.dat

#SET	QUERY	MIN	MAX	AVG	ITERATIONS
500M	200	0.071	2.699	0.952	3
500M	110	0.082	5.279	1.819	3

Here’s the source of the awk script, transform.awk. The code is quite verbose, to make it a bit easier to understand.

BEGIN {
}
 
{
        if($0 ~ /^[^#]/) {
                key = $1"_"$3
                first = iterations[key] > 0 ? 0 : 1
                sets[$1] = 1
                queries[$3] = 1
                totals[key] += $4
                iterations[key] += 1
 
                if(1 == iterations[key]) {
                        minima[key] = $4
                        maxima[key] = $4
                } else {
                        minima[key] = $4 < minima[key] ? $4 : minima[key]
                        maxima[key] = $4 > maxima[key] ? $4 : maxima[key]
                }
        }
}
 
END {
 
        for(set in sets) {
                outfile = "output/"set".dat"
                print "#SET\tQUERY\tMIN\tMAX\tAVG\tITERATIONS" > outfile
                for(query in queries) {
                        key = set"_"query
                        iterationCount = iterations[key]
                        average = totals[key] / iterationCount
                        printf("%s\t%d\t%.3f\t%.3f\t%.3f\t%d\n", set, query, minima[key], maxima[key], average, iterationCount) >> outfile
 
                }
        }
}

This code will read our input data, calculate MIN, MAX, AVG, number of iterations for each query and dump the contents in a tab-separated dat file with the same name as the set. Again, this is done to make life easier for gnuplot later on.

I want to see the effect of dataset size on query performance, so I want to plot averages for each set. Gnuplot makes this nice and easy, all I have to do is name my sets and tell it where to find the data. But ah … I don’t want to tell gnuplot what my sets are, because they should be determined dynamically from the available data. Enter, a wee shellscript that outputs gnuplot commands.

#!/bin/sh
 
# Output plot commands for all data sets in the output dir
# Usage: ./plotgenerator.sh column-number
# Example for the AVG column: ./plotgenerator.sh 5
 
prefix=""
 
echo -n "plot "
for s in `ls output | sed 's/\.dat//'` ;
do
        echo -n "$prefix \"output/$s.dat\" using 2:$1 title \"$s\""
 
        if [[ "$prefix" == "" ]] ; then
                prefix=", "
        fi
done

This script will generate a gnuplot “plot” command. Each datafile gets its own title (this is why we named our data files after their dataset name) and its own colour in the graph. We want to plot two columns: the QueryID, and the AVG duration. In order to make it easier to plot the MIN or MAX columns, I’m parameterizing the second column: the $1 value is the number of the AVG, MIN or MAX column.

Plotting

Gnuplot will call the plotgenerator.sh script at runtime. All that’s left to do is write a few lines of gnuplot!

Here’s the source of average.gnp

#!/usr/bin/gnuplot
reset
set terminal png enhanced size 1280,768
 
set xlabel "Query"
set ylabel "Duration (seconds)"
set xrange [100:]
 
set title "Average query duration"
set key outside
set grid
 
set style data points
 
eval(system("./plotgenerator.sh 5"))

The result

% ./average.gnp > average.png

Click for full size.

Wrapping it up with a Makefile

I don’t like having to remember which steps to execute in which order, and instead of faffing about with yet another shell script, I’ll throw in another *nix favourite: a Makefile.

It looks like this:

average:
        rm -rf output
        mkdir output
        awk -f transform.awk queries.dat
        ./average.gnp > average.png

Now all you have to do, is run make whenever you’ve updated your data file, and you’ll end up with a nice’n purdy new graph. Yay!

Having a bit of command line proficiency goes a long way. It’s so much easier and faster to analyse, transform and plot data this way than it is using graphical “tools”. Not to mention that you can easily integrate this with your build system…that way, each new build can ship with up-to-date performance graphs. Just sayin'!

Note: I’m aware that a lot of this scripting could be eliminated in gnuplot 4.6, but it doesn’t ship with Fedora yet, and I couldn’t be arsed building it.

— Elric