Friday, 26 September 2008

Ensuring XA Recovery works with a secure bus

I have been responsible for security for the service integration bus for a number of years now, and most of the problems I have dealt with over that time have come down to configuration. While helping the tenth person to hit a particular problem can be a little frustrating, at least I am not having to explain security flaws. There are a number of very common problems, ranging from typing in the wrong userid and password to not understanding how foreign bus security works. One of the more worrying problems relates to XA Recovery.

The problem goes like this. During normal operation of the environment, connections are made to a secure bus, authentication occurs, and messages are sent and received; all is good in the world. Then disaster strikes and for some reason the application server goes down, leaving uncommitted work in the bus. The node agent restarts the application server, which then connects to the bus and performs recovery. At least that is what should happen. In this case the connection fails with a JMS SecurityException. The original connection was established, but recovery does not work.

So what went wrong here? During normal operation, when a connection is made to a transactional resource and a transaction is in effect, the connection factory is written into something called the partner log. This contains details of how to connect to all the transactional partners that may be needed during recovery. In this case the connection factory does not contain any information on what security credentials should be used, so recovery connects with no credentials and the secure bus rejects it.
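To make the failure concrete, here is a toy model in Python. It is only a sketch: `ConnectionFactory`, `SecureBus` and `recover` are names I have made up for illustration and are not WAS APIs, and the recovery alias is simplified to a plain user name rather than a real authentication alias.

```python
from dataclasses import dataclass
from typing import Optional


class SecurityException(Exception):
    """Stand-in for the JMS SecurityException mentioned in the text."""


@dataclass(frozen=True)
class ConnectionFactory:
    # Hypothetical, simplified view of what gets written to the partner log.
    bus_name: str
    xa_recovery_alias: Optional[str] = None  # None models "no alias configured"


class SecureBus:
    """Toy bus that only admits users holding the bus connector role."""

    def __init__(self, name, connectors):
        self.name = name
        self.connectors = set(connectors)

    def connect(self, user):
        if user not in self.connectors:
            raise SecurityException(f"{user!r} may not connect to {self.name}")
        return f"connected to {self.name} as {user}"


def recover(bus, logged_factory):
    # Recovery only has what the partner log recorded; the credentials the
    # application supplied at runtime are not available at this point.
    return bus.connect(logged_factory.xa_recovery_alias)


bus = SecureBus("bus1", connectors={"recoveryUser"})

try:
    recover(bus, ConnectionFactory("bus1"))  # no alias in the logged factory
except SecurityException as exc:
    print("recovery failed:", exc)

# With an alias recorded in the factory, the same recovery succeeds.
print(recover(bus, ConnectionFactory("bus1", "recoveryUser")))
```

The point of the model is simply that recovery can only use what was logged with the connection factory, so a factory without recovery credentials connects as nobody.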

So if you see this, how do you get the transactions resolved? There are two options, and the first is preferable:
  1. Grant the special Everyone group access to the bus. Assuming dynamic configuration is enabled, this will allow recovery to work quickly.
  2. Turn security off for the bus until recovery is complete. We generally advise restarting the whole bus, but restarting a single application server may work on a temporary basis.

So now you have solved the temporary problem and your business is up and running with no in-doubt work, how do you solve it more generally? Every JMS connection factory can have an XA Recovery Alias specified. This should be configured with a user that has bus connector authority (only bus connector authority is required). Once this has been configured, save, sync and restart, and any new JMS work with the bus will recover with no security problem. At least until the password expires.

In fact the XA Recovery Alias is not JMS-specific; it exists for all JCA resources, so it can help when using WebSphere MQ and DB2 too.

Did I hear someone say "yuck"? You do not like this? Well, to be honest, neither did we. One of the themes of the WAS v7 release was to make it more usable (we use the term "consumable", but who'd want to eat WAS?), so we have tried to simplify here. In WAS v7 the XA recovery alias is no longer required; during recovery the application server will use the WAS server identity to perform recovery. There are a few limitations. The first is that the special Server group needs to have bus connector authority, and the second is that the recovery server must be in the same cell as the bus. Other than that you are good to go. Oh, and do not worry: if you already have an XA recovery alias, we will continue to use it unless a problem occurs.
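The v7 rules can be written down as a small decision function. Again this is a sketch rather than the real implementation: the function name, its parameters and the `SERVER_IDENTITY` placeholder are all invented for illustration, and I am assuming that "unless a problem occurs" means recovery falls back to the server identity when the alias cannot connect.

```python
from typing import Optional

SERVER_IDENTITY = "<server identity>"  # placeholder label, not a real user name


def choose_recovery_identity(
    xa_recovery_alias: Optional[str],
    alias_connects: bool,
    server_group_is_connector: bool,
    same_cell: bool,
) -> str:
    """Return the identity a v7-style recovery would connect with."""
    # An existing XA recovery alias is still preferred while it works.
    if xa_recovery_alias is not None and alias_connects:
        return xa_recovery_alias
    # Otherwise use the server identity, subject to the two limitations:
    # the Server group has bus connector authority, and the recovering
    # server is in the same cell as the bus.
    if server_group_is_connector and same_cell:
        return SERVER_IDENTITY
    raise PermissionError("no usable identity for XA recovery")


# No alias configured: the server identity is used.
print(choose_recovery_identity(None, False, True, True))
# A working alias is kept, whatever the Server group settings are.
print(choose_recovery_identity("recoveryAlias", True, False, False))
# An alias whose password has expired falls back to the server identity.
print(choose_recovery_identity("recoveryAlias", False, True, True))
```

If neither route is available (for example, the recovering server is in a different cell and has no working alias), the sketch raises an error, which corresponds to being back in the situation described at the start of this post.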