JAX-RS client filters

Life goes on with JAX-RS/Jersey. I wasted a couple of moments figuring out how to add custom headers to a Jersey-generated JAX-RS client. Might as well write it down in the hope of saving someone else a couple of minutes.

For starters, you’ll need a Client Filter that does the actual heavy(ish) lifting.

import java.io.IOException;

import javax.ws.rs.client.ClientRequestContext;
import javax.ws.rs.client.ClientRequestFilter;
 
/**
 * Add the X-GARBAGE header to all requests.
 */
public class GarbageFilter implements ClientRequestFilter {
	@Override
	public void filter(final ClientRequestContext requestContext) throws IOException {
		requestContext.getHeaders().add("X-GARBAGE", "This is added to all requests");
	}
}

And then you’ll have to register the filter with the Client(Config).

// import javax.ws.rs.client.Client;
// import javax.ws.rs.client.ClientBuilder;
// import org.glassfish.jersey.client.ClientConfig;
 
final ClientConfig clientConfig = new ClientConfig();
clientConfig.register(new GarbageFilter()); // Yes, you could use JDK8 magic :-)
 
final Client client = ClientBuilder.newClient(clientConfig);

And that’s all. Every request you launch using the generated client will now contain your X-GARBAGE header.
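For the sake of illustration, here’s what a manual request through the same Client looks like; the target URL is made up, but the filter kicks in no matter how the request is built.

// Reusing the client built above; the URL is hypothetical.
final String response = client.target("http://example.com/api/foo")
		.request()
		.get(String.class); // this request carries the X-GARBAGE header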


JAX-RS client file upload

Another hiccup in using the wadl2java client generated from a (Jersey) JAX-RS app. This time, it concerns multipart/form-data.

The method:

@POST
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.MULTIPART_FORM_DATA)
public JsonFoo create(@FormDataParam("file") InputStream data,
		@FormDataParam("file") FormDataContentDisposition fileDetail,
		@FormDataParam("file") FormDataBodyPart bodyPart) {
	// Implementation foo bar baz
}

The WADL exposes something that looks like this.

<method id="create" name="POST">
	<request>
		<representation mediaType="multipart/form-data"/>
	</request>
	<!-- Response omitted for the sake of brevity -->
</method>

And the generated client has a method to go along with it. Unfortunately, it gives you no hints whatsoever as to how to actually provide a file/data.

// Long class names shortened
public static Create create(Client client, URI baseURI) {
	return new Create(client, baseURI);
}
 
// The Create object contains this little gem
public <T> T postMultipartFormDataAsJson(Object input, GenericType<T> returnType);

That’s wonderful. Unfortunately, if you pass in a java.io.File, it doesn’t work: the client barfs.

Many DuckDuckGo searches, StackOverflow hunts and head-scratching sessions later, I came up with a working solution:

// import org.glassfish.jersey.media.multipart.FormDataMultiPart;
// import org.glassfish.jersey.media.multipart.file.FileDataBodyPart;
 
File file = new File("/path/to/image.jpg"); // Your file!
FormDataMultiPart form = new FormDataMultiPart();
form.field("filename", file.getName());
form.bodyPart(new FileDataBodyPart("file", file, new MediaType("image", "jpeg")));
 
Api.create(client, uri).postMultipartFormDataAsJson(form, new GenericType<CreateResponse>() {});

But wait! That won’t cut it. You also need to tell your Client that you want to use the Multipart Feature. Makes sense. If you don’t, you’ll end up with this exception.

org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyWriter not found for media type=multipart/form-data, type=class org.glassfish.jersey.media.multipart.FormDataMultiPart, genericType=class org.glassfish.jersey.media.multipart.FormDataMultiPart.

// import org.glassfish.jersey.client.ClientConfig;
// import org.glassfish.jersey.media.multipart.MultiPartFeature;
 
final ClientConfig clientConfig = new ClientConfig();
clientConfig.register(MultiPartFeature.class);

And there you have it. File upload with JAX-RS and a wadl2java generated client.


Endless Locale bugs

This isn’t the first time I’ve ranted about Locale-related issues. It probably won’t be the last, either.

Currently my biggest gripe is with convenience methods in Java which rely on platform defaults, and are thus platform-specific. For instance, the venerable System.out.println() will terminate your line with whatever your platform thinks a line ending should be. Printing dates or numbers will try to use your platform Locale. Writing a string to a file will default to platform encoding.
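To make that concrete, here’s a minimal sketch of the explicit alternatives (assuming Java 7+ for StandardCharsets):

// import java.nio.charset.StandardCharsets;
// import java.util.Locale;
 
// Explicit Locale instead of whatever the platform fancies:
String price = String.format(Locale.US, "%.2f", 9.99); // always "9.99", never "9,99"
 
// Explicit encoding instead of the platform default:
byte[] bytes = "explicit is better".getBytes(StandardCharsets.UTF_8);
 
// Explicit line ending instead of println()'s platform-specific one:
System.out.print("one line\n");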

Some of these defaults can be controlled at run time; others require JVM startup options. This is all horribly wrong. It results in all kinds of unexpected behaviour. It’s error-prone. None of this should be allowed to happen. Any method that assumes platform-specific magic should be deprecated. Which is exactly what I’ll do, as soon as I can figure out how to write a Sonar plugin to detect all this nonsense.


JAX-RS vs Collections vs JSON Arrays

The Problem

Looks like I’ve managed to get myself into a pickle. I have a (Jersey) JAX-RS app, which automagically publishes a WADL. If I then generate a Client (using wadl2java), I end up with generated code that doesn’t work.

@GET
@Produces(MediaType.APPLICATION_JSON)
public List<Foo> get() {
	return myList;
}

Foo is a simple POJO with an @XmlRootElement annotation. The resulting JSON is perfectly fine: I get a JSON Array with a bunch of Foo objects. Perfect!

The generated WADL is a different matter:

<method id="get" name="GET">
	<response>
		<!-- foo is a reference to an XSD Complex Type ... but where's my list? -->
		<ns2:representation element="foo" mediaType="application/json"/>
	</response>
</method>

For some reason, the WADL generator is smart enough to realise that we’re dealing with Foo instances. But it’s too stupid to realise that we’re dealing with more than one.

If you then generate a Client using wadl2java, you’ll end up with something like this:

public Foo getAsFoo() {
	return response.readEntity(Foo.class);
}

Well that’s not going to work, is it? Trying to read a single Foo when you’ve got an array of them. And indeed … you get a wonderful exception: Internal Exception: java.lang.ClassCastException: Foo cannot be cast to java.util.Collection

This seems to be a fundamental XML vs JSON problem. After all there is no such thing as an “XML Array”. Either you have one element, or you have multiple elements nested under a common root.

I could solve this by not relying on a plain JSON Array and encapsulating the result.

// Not this:
[
   {
      "foo":"bar"
   },
   {
      "bar":"baz"
   }
]
 
// But this:
{
   "listOfFoos":[
      {
         "foo":"bar"
      },
      {
         "bar":"baz"
      }
   ]
}
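The wrapper itself would be a trivial JAXB bean along these lines (the names are made up):

// import java.util.List;
// import javax.xml.bind.annotation.XmlRootElement;
 
@XmlRootElement
public class FooList {
	private List<Foo> listOfFoos;
 
	public List<Foo> getListOfFoos() {
		return listOfFoos;
	}
 
	public void setListOfFoos(final List<Foo> listOfFoos) {
		this.listOfFoos = listOfFoos;
	}
}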

But then that’s ugly. And instead of using a native Java collection I’ll have to create a useless intermediary object, resulting in more garbage collection. And it would break compatibility with current API clients.

The Solution

Whelp … I haven’t found one yet. To be continued, I hope. But I did manage to find a workaround, thanks to Adam Bien’s blog.

The incorrectly generated getAsFoo() method doesn’t work. But we can use getAsJson() instead – which doesn’t necessarily have to return a JSON string.

List<Foo> foos = client.getFoo().getAsJson(
	new GenericType<List<Foo>>() {
	}
);

GenericType is a dirty JAX-RS hack in my opinion, but it works. It’s a shame that I have to rely on getAsJson(), though. It would’ve been much cleaner to use the getAsFoo() method directly.


Java Date performance subtleties

A recent profiling session pointed out that some of our processing threads were blocking on java.util.Date construction. This is troubling, because it’s something we do many thousands of times per second, and blocked threads are pretty bad!

A bit of digging led me to TimeZone.getDefault(). This, for some insanely fucked up reason, makes a synchronized call to TimeZone.getDefaultInAppContext(). The call is synchronized because it attempts to load the default time zone from the sun.awt.AppContext. What. The. Fuck. I don’t know what people were smoking when they wrote this, but I hope they enjoyed it …

Unfortunately, Date doesn’t have a constructor which takes a TimeZone argument, so it always calls getDefault() instead.

I decided to run some microbenchmarks. I benchmarked four different ways of creating Dates:

// date-short:
    new Date();

// date-long:
    new Date(year, month, date, hrs, min, sec);

// calendar:
    Calendar cal = Calendar.getInstance(tz); // with an explicit TimeZone
    cal.set(year, month, date, hourOfDay, minute, second);
    cal.getTime();

// cached-cleared-calendar:
//    Same as calendar, but with Calendar.getInstance() outside of the loop, 
//    and a cal.clear() call in the loop.

I tested single-threaded performance first, creating 1M Dates with each method in a single thread. Then multi-threaded, with 4 threads each creating 250k Dates. In other words: both scenarios ended up creating the same number of Dates.

[Performance graph indicating dramatic differences between the four methods. Lower is better.]

With the exception of date-long, all methods speed up by a factor of 2 when multi-threaded (the machine only has 2 physical cores). The date-long method actually slows down when multi-threaded, because of lock contention in the synchronized TimeZone acquisition.

The JavaDoc for Date suggests replacing the date-long constructor with a Calendar call. Performance-wise, this is not a very good suggestion: single-threaded, calendar is twice as slow as date-long unless you reuse the same Calendar instance. Even multi-threaded it’s outperformed by date-long. This is simply not acceptable.

Fortunately, the cached-cleared-calendar option performs very well. You could easily store a ThreadLocal reference to an instance of a Calendar and clear it whenever you need to use it.
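In code, that pattern looks something like this sketch (I’m pinning the zone to UTC here; use whatever zone your data is actually in):

// import java.util.Calendar;
// import java.util.Date;
// import java.util.TimeZone;
 
private static final ThreadLocal<Calendar> CALENDAR = new ThreadLocal<Calendar>() {
	@Override
	protected Calendar initialValue() {
		return Calendar.getInstance(TimeZone.getTimeZone("UTC"));
	}
};
 
public static Date createDate(final int year, final int month, final int day,
		final int hour, final int minute, final int second) {
	final Calendar cal = CALENDAR.get();
	cal.clear(); // wipe state left over from the previous use
	cal.set(year, month, day, hour, minute, second);
	return cal.getTime();
}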

More important than the raw duration of the Date creation is the synchronization overhead. Every time a thread has to wait to enter a synchronized block, it could end up being rescheduled or swapped out. This reduces the predictability of performance. Keeping synchronization down to a minimum (or zero, in this case) increases predictability and liveness of the application in general.

Before anyone mentions it: yes, I’m aware that the long Date constructors are deprecated. Unfortunately, they are what Joda uses when converting to Java Dates. I hope that the (kind?) folks at Oracle will reconsider their shoddy implementation.

I’d be much happier if java.util.Date would stop sucking quite as much. And if SimpleDateFormat were made thread-safe. And … the list goes on.


The I in IM

Instant messaging sucks. The “instant” part in particular. Being interrupted while trying to churn out code sucks and reduces my productivity. When I’ve got my headphones on, it’s a sign not to interrupt me unless the building is on fire. When you’ve got your favourite IM client’s status set to BUSY/FUCK OFF/DND, people seem to ignore it and bombard you with piles of senseless drivel anyway. So I’ve stopped using any kind of instant messaging. The fact that all of these IM clients have megabytes upon megabytes of useless crap in them and generally misbehave on your network is further ammunition against them. Skype is particularly bad. It even binds to port 80 for some entirely fucked up reason.

IRC, on the other hand, is amazing, and has been for over 20 years. It Just Works™. There are many clients for every single platform out there. It’s nicely standardized, so everyone can talk to everyone. It doesn’t make noise. It doesn’t get in your face. You can have group conversations, individual conversations, encrypted conversations without relying on a 100mb large untrusted binary from EvilCorp™.

Running irssi in a screen (or tmux, if you’re so inclined) on a VPS somewhere ensures that you’re always online and can look back and read conversations that you may have missed. And all without wanting to throw anyone off a cliff. Yay. Thank you, IRC.

Now if only more people would start using IRC again, then the world would be perfect.

And while I’m ranting .. what’s wrong with e-mail? Facebook messages are probably about as annoying as IM clients. One of these days I’m going to start slapping people with trouts for not using e-mail.


Gnuplot data analysis: a real world example

Creating graphs in LibreOffice is a nightmare. They’re ugly, nearly impossible to customize, and creating pivot tables with data is bloody tedious work. In this post, I’ll show you how I took the output of a couple of performance test scripts and turned it into reasonably pretty graphs with a few standard command line tools (gnuplot, awk, a bit of (ba)sh and a Makefile).

The Data

I ran a series of query performance tests against data sets of different sizes. The sets contain 10k, 100k, 1M, 10M, 100M and 500M documents. One of the basic constraints is that it has to be easy to add/remove sets. I don’t want to faff about with deleting columns or updating pivot tables. If I add a set to my test data, I want it to automagically show up in my graphs.

The output of the test script is a simple tab separated file, and looks like this:

#Set	Iteration	QueryID	Duration
500M	1	101	10.497499465942383
500M	1	102	3.9973576068878174
500M	1	103	9.4201889038085938
500M	1	104	2.8091645240783691
500M	1	105	2.944718599319458
500M	1	106	5.1576917171478271
500M	1	107	5.7224125862121582
500M	1	108	5.7259769439697266
500M	1	109	4.7974696159362793

Each row contains the query duration (in seconds) for a single execution of a single query.

Processing the data

I don’t just want to graph random numbers. Instead, for each query in each set, I want the shortest execution time (MIN), the longest (MAX) and the average across iterations (AVG). So we’ll create a little awk script to output data in this format. In order to make life easier for gnuplot later on, we’ll create a file per dataset.

% head -n 3 output/500M.dat

#SET	QUERY	MIN	MAX	AVG	ITERATIONS
500M	200	0.071	2.699	0.952	3
500M	110	0.082	5.279	1.819	3

Here’s the source of the awk script, transform.awk. The code is quite verbose, to make it a bit easier to understand.

# Runs once for every input record; skips comment lines.
{
        if($0 ~ /^[^#]/) {
                key = $1"_"$3
                sets[$1] = 1
                queries[$3] = 1
                totals[key] += $4
                iterations[key] += 1
 
                if(1 == iterations[key]) {
                        minima[key] = $4
                        maxima[key] = $4
                } else {
                        minima[key] = $4 < minima[key] ? $4 : minima[key]
                        maxima[key] = $4 > maxima[key] ? $4 : maxima[key]
                }
        }
}
 
END {
 
        for(set in sets) {
                outfile = "output/"set".dat"
                print "#SET\tQUERY\tMIN\tMAX\tAVG\tITERATIONS" > outfile
                for(query in queries) {
                        key = set"_"query
                        iterationCount = iterations[key]
                        average = totals[key] / iterationCount
                        printf("%s\t%d\t%.3f\t%.3f\t%.3f\t%d\n", set, query, minima[key], maxima[key], average, iterationCount) >> outfile
 
                }
        }
}

This code reads our input data, calculates MIN, MAX, AVG and the iteration count for each query, and dumps the results into a tab-separated .dat file named after the set. Again, this is done to make life easier for gnuplot later on.

I want to see the effect of dataset size on query performance, so I want to plot averages for each set. Gnuplot makes this nice and easy: all I have to do is name my sets and tell it where to find the data. But ah … I don’t want to tell gnuplot what my sets are, because they should be determined dynamically from the available data. Enter a wee shell script that outputs gnuplot commands.

#!/bin/bash
 
# Output plot commands for all data sets in the output dir
# Usage: ./plotgenerator.sh column-number
# Example for the AVG column: ./plotgenerator.sh 5
 
prefix=""
 
echo -n "plot "
for s in `ls output | sed 's/\.dat//'` ;
do
        echo -n "$prefix \"output/$s.dat\" using 2:$1 title \"$s\""
 
        if [[ "$prefix" == "" ]] ; then
                prefix=", "
        fi
done

This script will generate a gnuplot “plot” command. Each datafile gets its own title (this is why we named our data files after their dataset name) and its own colour in the graph. We want to plot two columns: the QueryID, and the AVG duration. In order to make it easier to plot the MIN or MAX columns, I’m parameterizing the second column: the $1 value is the number of the AVG, MIN or MAX column.
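For two hypothetical sets named 10k and 500M, plotting the AVG column (column 5), the script spits out something like:

plot "output/10k.dat" using 2:5 title "10k", "output/500M.dat" using 2:5 title "500M"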

Plotting

Gnuplot will call the plotgenerator.sh script at runtime. All that’s left to do is write a few lines of gnuplot!

Here’s the source of average.gnp

#!/usr/bin/gnuplot
reset
set terminal png enhanced size 1280,768
 
set xlabel "Query"
set ylabel "Duration (seconds)"
set xrange [100:]
 
set title "Average query duration"
set key outside
set grid
 
set style data points
 
eval(system("./plotgenerator.sh 5"))

The result

% ./average.gnp > average.png

[Graph: average query duration per query, one series per data set.]

Wrapping it up with a Makefile

I don’t like having to remember which steps to execute in which order, and instead of faffing about with yet another shell script, I’ll throw in another *nix favourite: a Makefile.

It looks like this:

average:
        rm -rf output
        mkdir output
        awk -f transform.awk queries.dat
        ./average.gnp > average.png

Now all you have to do is run make whenever you’ve updated your data file, and you’ll end up with a nice’n purdy new graph. Yay!

Having a bit of command line proficiency goes a long way. It’s so much easier and faster to analyse, transform and plot data this way than it is using graphical “tools”. Not to mention that you can easily integrate this with your build system… that way, each new build can ship with up-to-date performance graphs. Just sayin'!

Note: I’m aware that a lot of this scripting could be eliminated in gnuplot 4.6, but it doesn’t ship with Fedora yet, and I couldn’t be arsed building it.


On Charsets, Timezones, and Hashes

Character sets, time zones and password hashes are pretty much the bane of my life. Whenever something breaks in a particularly spectacular fashion, you can be sure that one of those three is, in some way, responsible. Or DNS, of course. Apparently software developers Just Don’t Get It™. Granted, these are pretty complex topics. I’m not expecting anyone to care about the difference between ISO-8859-15 and ISO-8859-1, know about UTC’s subtleties or be able to implement SHA-1 using a ball of twine.

What I do expect, is for sensible folk to follow these very simple guidelines. They will make your (and everyone else’s) life substantially easier.

Use UTF-8..

Always. No exceptions. Configure your text editors to default to UTF-8. Make sure everyone on your team does the same. And while you’re at it, configure the editor to use UNIX-style line-endings (newline, without useless carriage returns).

..or not

Make sure you document the cases where you can’t use UTF-8. Write down and remember which encoding you are using, and why. Remember that iconv is your friend.
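For instance, converting a stray Latin-1 file to UTF-8 is a one-liner (file names made up, obviously):

iconv -f ISO-8859-1 -t UTF-8 legacy.txt > legacy-utf8.txt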

Store dates with time zone information

Always. A date/time is entirely meaningless unless you know which time zone it’s in. Store the time zone. If you’re using some kind of retarded age-old RDBMS which doesn’t support date/time fields with TZ data, then you can either store your dates as a string, or store the TZ in an extra column. I repeat: a date is meaningless without a time zone.

While I’m on the subject: store dates in a format described by ISO 8601, ending with a Z to designate UTC (Zulu). No fancy pansy nonsense with the first 3 letters of the English name of the month. All you need is ISO 8601.

Bonus tip: always store dates in UTC. Make the conversion to the user time zone only when presenting times to a user.
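In Java terms, that bonus tip boils down to something like this sketch. Note the explicit pattern, Locale and time zone; and remember that SimpleDateFormat isn’t thread-safe.

// import java.text.SimpleDateFormat;
// import java.util.Date;
// import java.util.Locale;
// import java.util.TimeZone;
 
final SimpleDateFormat iso8601 = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
iso8601.setTimeZone(TimeZone.getTimeZone("UTC"));
 
final String stamp = iso8601.format(new Date()); // e.g. 2013-11-05T14:30:00Z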

Don’t rely on platform defaults

You want your code to be cross-platform, right? So don’t rely on platform defaults. Be explicit about which time zone/encoding/language/.. you’re using or expecting.

Use bcrypt

Don’t try to roll your own password hashing mechanism. It’ll suck and it’ll be broken beyond repair. Instead, use bcrypt or PBKDF2. They’re designed to be slow, which will make brute-force attacks less likely to be successful. Implementations are available for most sensible programming environments.
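For what it’s worth, the JDK has shipped a PBKDF2 implementation for ages. A minimal sketch; the iteration count and key length are illustrative, not gospel:

// import java.security.GeneralSecurityException;
// import java.security.SecureRandom;
// import javax.crypto.SecretKeyFactory;
// import javax.crypto.spec.PBEKeySpec;
 
public static byte[] hashPassword(final char[] password, final byte[] salt)
		throws GeneralSecurityException {
	final PBEKeySpec spec = new PBEKeySpec(password, salt, 10000, 256);
	return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
			.generateSecret(spec)
			.getEncoded();
}
 
// Generate a fresh random salt per password, and store it alongside the hash.
final byte[] salt = new byte[16];
new SecureRandom().nextBytes(salt);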

If you have some kind of roll-your-own fetish, then at least use an HMAC.

Problem be gone

Keeping these simple guidelines in mind will prevent entire ranges of bugs from being introduced into your code base. Total cost of implementation: zilch. Benefit: fewer headdesk incidents.


On bug reports

There’s an old joke about a Manager, an Engineer and a Software Developer: they’re in a car and the brakes fail as they go down a mountain road. They miraculously come to a standstill, but now they’re stuck, and Somebody Should Do Something™. The Manager suggests they make a Plan and define Goals & Measurable Objectives in order to solve the Critical Problem. The Engineer has his tools with him, so he suggests he take a look at the problem and fix it. The Software Developer, of course, is not convinced and suggests they push the car up the hill to see if the problem will manifest itself again …

You can probably find a bunch of different versions of this on the web, but the punch line is always the same: the developer wants to be able to reproduce the bug. Not just to verify its existence (although it wouldn’t be the first time a user reported a critical data loss bug after having pressed the delete button..), but also to have a place to start the bug hunt. Software systems are complex. We have enough layers of abstraction built on top of a bunch of transistors to make Shrek jealous. Something as seemingly simple as displaying a bit of text on a screen is so complex that it can no longer be fully understood by a single person.

Being able to reproduce a bug is the only way to resolve ambiguity. When it isn’t possible to describe all the steps that led to the bug, then a thorough description of the problem is the next best thing. Think screenshots, explanations of the expected result, and answers to all of the usual “Which”-questions (Version, OS, Environment, Lunar Phase, ..).

Debugging is hard (and fun). Vague bug reports like “the application is broken” rather make me want to tie a team of horses to your limbs and decapitate your bloody remains … I mean, educate you on the virtues of well-written bug reports.

Now .. let me see if I can fix this “broken” application .. sigh.