An ActiveMQ gotcha

Spent way too much time hunting down a strange ActiveMQ issue recently. Finding the cause was complicated by several factors:

  1. The ActiveMQ servers are managed by an external party
  2. Our application runs on Kubernetes, also managed by an external party
  3. The problem only occurred in an environment that’s in use by a customer, not in any of our test environments or during development
  4. We connect to the ActiveMQ servers in several different ways, from various Spring Boot applications, sometimes using annotation magic/garbage and Spring Data connection strings, other times using connection factories in Java (a rough sketch of both styles follows this list)
  5. Strangely, the problem only occurred in one of our microservices, not in any of the others
  6. Firewall rules make it very hard to talk to the ActiveMQ servers from a local machine
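
For context, here’s a rough sketch of what those two connection styles look like. The class names, queue, and broker URL are made up for illustration, and the javax.jms imports assume an older Spring Boot stack (newer ones use jakarta.jms):

// A rough sketch of both styles; names, the queue, and the URL are made up.
import javax.jms.ConnectionFactory;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
class IncomingMessageListener {

    // Style 1: annotation magic. Spring Boot wires up the ConnectionFactory
    // behind the scenes, based on properties such as spring.artemis.broker-url.
    @JmsListener(destination = "incoming.queue")
    public void onMessage(String body) {
        // handle the message
    }
}

class ExplicitFactoryConfig {

    // Style 2: an explicit connection factory, built from an Artemis URL.
    static ConnectionFactory connectionFactory() {
        return new ActiveMQConnectionFactory("tcp://mq.example.internal:61617");
    }
}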

What’s the problem?

Application logs were littered with this ominous exception, without any further context or details:

java.lang.Exception: Failed to find a store at /path/to/broker.ks
  at org.apache.activemq.artemis.core.remoting.impl.ssl.SSLSupport.validateStoreURL(SSLSupport.java:399)
  at org.apache.activemq.artemis.core.remoting.impl.ssl.SSLSupport.loadKeystore(SSLSupport.java:339)
  at org.apache.activemq.artemis.core.remoting.impl.ssl.SSLSupport.loadKeyManagerFactory(SSLSupport.java:375)
  at org.apache.activemq.artemis.core.remoting.impl.ssl.SSLSupport.loadKeyManagers(SSLSupport.java:355)
  at org.apache.activemq.artemis.core.remoting.impl.ssl.SSLSupport.createContext(SSLSupport.java:222)
  at org.apache.activemq.artemis.core.remoting.impl.ssl.DefaultSSLContextFactory.getSSLContext(DefaultSSLContextFactory.java:50)
  at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector.loadJdkSslEngine(NettyConnector.java:786)
  at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector$1.initChannel(NettyConnector.java:690)
  at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:129)
  at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:112)
  at io.netty.channel.AbstractChannelHandlerContext.callHandlerAdded(AbstractChannelHandlerContext.java:1114)
  at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:609)
  at io.netty.channel.DefaultChannelPipeline.access$100(DefaultChannelPipeline.java:46)
  at io.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1463)
  at io.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1115)
  at io.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:650)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:514)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:429)
  at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:486)
  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:416)
  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

The path to the broker keystore did indeed not exist. In fact, it was a completely different path from the one in our application configuration!
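
For reference, our client-side configuration pointed the stores at files on the application’s own filesystem, roughly like this (host, paths and passwords are made up):

// A sketch of roughly what our broker URL looked like; all values are made up.
// The key point: keyStorePath and trustStorePath refer to files on the
// client's own filesystem.
public final class ConfiguredBrokerUrl {

    static final String URL =
            "tcp://mq.example.internal:61617"
            + "?sslEnabled=true"
            + "&keyStorePath=/etc/secrets/client-keystore.jks"
            + "&keyStorePassword=changeit"
            + "&trustStorePath=/etc/secrets/client-truststore.jks"
            + "&trustStorePassword=changeit";
}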

The hunt begins

My initial thought was that the ActiveMQ client was falling back to a default location because of misconfiguration on our end. I was wrong.

The first step in trying to fix any bug is trying to reproduce it. I couldn’t touch the environment on which it was actually occurring, so I set out to make sure that our other environments closely resembled the affected one. This, much to my surprise, ended up taking several weeks. Not of continuous effort, but mostly of trying to coax $vendor into getting the configuration right. There were many (subtle and not-so-subtle) differences between the environments, and without access to the ActiveMQ servers, it was very difficult to pinpoint them. I had to rely on diffing JMX outputs, which is less than ideal.

Once the environments were as identical as we could make them, reproducing the bug still failed. Strange.

Fuck it

Given that this was beginning to piss me off, I picked a time when I wasn’t likely to cause any issues for our users, and I did something ill-advised: I pointed a dev environment at the problematic MQ server. This could have caused issues if any messages were being sent or consumed during this time, but a bit of fiddling prevented that.

The error popped up almost instantly. Aha. Reproducible at last. Kind of.

I had enabled trace logging during this test, and aha, something interesting came up in the logs: ClusterTopologyChangeMessage_V4 messages containing multiple ActiveMQ servers.

Apparently $vendor had added high availability to our setup, but only to the environment that was now throwing errors… so much for identical environments.

An attempted solution

The ActiveMQ connection string was only configured for one server (which makes sense, given that I was only aware of one server). So I added the second server, with an explicit path to the correct truststore.
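
Roughly speaking, the attempted fix looked like this (hosts and paths are made up, as before):

// Attempted fix: both servers in one URL, with the truststore spelled out
// explicitly. Hosts and paths are made up.
public final class AttemptedFixBrokerUrl {

    static final String URL =
            "(tcp://mq1.example.internal:61617,tcp://mq2.example.internal:61617)"
            + "?sslEnabled=true"
            + "&trustStorePath=/etc/secrets/client-truststore.jks"
            + "&trustStorePassword=changeit";
}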

And so the problem … persisted. Motherfucker.

At this point I had noticed that the path to the incorrect truststore was included in the ClusterTopologyChangeMessage_V4 we received from the ActiveMQ server(s). For some reason, the ActiveMQ client was trying to create a connection to the second server, using an incorrect path it received from the server.

This makes no sense. Why would you ever use a local, fully qualified path that you receive from an external server? The server doesn’t know what’s on my filesystem; how could it? And so I sent an email to the ActiveMQ mailing list, asking whether this might be a bug. The answer surprised me.

And an actual solution

It is, in fact, not a bug. By default, the ActiveMQ client uses the topology it receives from the server to connect to servers in the cluster. That makes sense from a high-availability point of view. It still doesn’t make sense that this should include the path to a truststore (and even its password?!), because those paths aren’t going to be the same on the client.

The solution was to add an extra property to the connection string, which forces the client to use my configuration rather than the server’s asinine suggestion: useTopologyForLoadBalancing=false.
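
In connection-string form, the fix boils down to a single extra parameter. A sketch, with made-up hosts and paths as before:

// The working configuration. The only change from the previous sketch is the
// final parameter: useTopologyForLoadBalancing=false tells the client to stick
// to the connectors (and SSL settings) configured here, rather than the
// topology the server advertises.
public final class WorkingBrokerUrl {

    static final String URL =
            "(tcp://mq1.example.internal:61617,tcp://mq2.example.internal:61617)"
            + "?sslEnabled=true"
            + "&trustStorePath=/etc/secrets/client-truststore.jks"
            + "&trustStorePassword=changeit"
            + "&useTopologyForLoadBalancing=false";
}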

To their credit, this behaviour is documented. But that’s only useful when you know that ActiveMQ is running in HA mode in the first place. I still don’t understand why this behaviour only showed up in one of our microservices and not the others, but the problem is fixed, and the client is happy. The end.

— Elric