When I worked at Kaazing, I wrote the first implementation of their Enterprise Shield™ (a.k.a. reverse connections) feature for the networking gateway. Being the first implementation, I’ll be the first to admit that it was the one with ALL the bugs. Fortunately, Kaazing’s automated testing regime is still the best I’ve seen for this class of product. So I was thinking back to a set of related issues discovered during the early adoption phase of that feature, and thought they might make interesting reading.

What is Enterprise Shield™?

The essential idea of Enterprise Shield™ is to close all incoming ports on a firewall, protecting the internal network from direct access through any port. To do this, we install one gateway server outside the firewall in the DMZ, and one internal gateway server inside the firewall. The DMZ gateway accepts connections from real clients, but simply queues up the connection requests as they occur. The internal gateway is configured to reach out in “reverse” to the DMZ gateway, to “pick off” the pending connections from real clients and form a collection of connections through to the internal network.

(image owned by kaazing.com, used for reference purposes only)
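
To make the shape of this concrete, here is a minimal, hypothetical sketch of the internal gateway’s side of a reverse connection, written with plain Java sockets rather than anything from the actual product. The class name, host names and ports are all invented for illustration; the point is simply that every connection is made outbound, and bytes relayed by the DMZ gateway are bridged across to the shielded backend.

// Hypothetical sketch only: the internal gateway never accepts inbound connections.
// It dials OUT to the DMZ gateway; when the DMZ gateway relays a real client's
// bytes over that connection, they are bridged across to the shielded backend.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class ReverseConnectionSketch {

    public static void main(String[] args) throws IOException, InterruptedException {
        final String dmzGateway = "dmz-gateway.example.com"; // assumed DMZ gateway host
        final int reversePort = 8001;                        // assumed port for reverse connections
        final String backendHost = "app.internal.example.com";
        final int backendPort = 8080;                        // the shielded internal service

        while (true) {
            // Outbound-only: no inbound firewall port is ever opened.
            try (Socket reverse = new Socket(dmzGateway, reversePort);
                 // The real feature completes the backend connection only when a client
                 // actually arrives; we connect eagerly here to keep the sketch short.
                 Socket backend = new Socket(backendHost, backendPort)) {
                Thread clientToBackend = pump(reverse, backend);
                Thread backendToClient = pump(backend, reverse);
                clientToBackend.join();
                backendToClient.join();
            }
            // One bridge at a time for brevity; the real gateway keeps a pool of
            // prepared connections so new clients never wait for an outbound dial.
        }
    }

    private static Thread pump(Socket from, Socket to) throws IOException {
        InputStream in = from.getInputStream();
        OutputStream out = to.getOutputStream();
        Thread t = new Thread(() -> {
            try {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (IOException closed) {
                // Connection torn down; the outer loop will dial out again.
            }
        });
        t.start();
        return t;
    }
}

The real gateway maintains a pool of these prepared connections (negotiated over SOCKS, as described below) so that a newly arriving client never has to wait for a fresh outbound dial.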

Now there were a myriad of practical concerns with this feature. (See the Kaazing documentation for more details as to how these concerns are configurable.)

Having solved all of those issues in the initial implementation, we discovered a larger symptom…

The Presenting Symptom

The stable state of the system was a collection of prepared connections between the internal and DMZ gateways, with an extra SOCKS remote bind request pending, waiting for any additional clients to arrive.

			+----------+                  +--------------+
			|          | <--(socks)------ |              |
			|          |                  |              |
			| dmz      | <--(prepared)--> | internal     |
			|          | <--(prepared)--> |              |
			+----------+                  +--------------+

Having taken care of all of these concerns, we were faced with a new challenge: on very slow virtual machines running Ubuntu, we never reached the stable state. We would thrash, continually building and tearing down both the SOCKS and the prepared WebSocket connections.

Fortunately, recreating the same Ubuntu version on a VirtualBox machine and throttling the CPU reproduced the same errors. Here are the issues we encountered.

Issue 1: Coalesced Buffers with Multiple Protocols

The virtual machine was slowing down TCP writes to the point where they were coalesced into a single buffer. The DMZ machine was sending the final SOCKS response and the HTTP WebSocket handshake in a single buffer, so the internal machine was receiving an IO buffer with content from multiple protocols:

{ <socks-response-bytes><http handshake bytes> }

The problem was that as soon as our decoder finished decoding the final SOCKS message, we removed the codec from the filter chain. That meant we never delivered the HTTP handshake bytes, and so we tore down the connection. We had seen this issue before with WebSocket handshakes, where the same packet might contain the 101 response and the first WebSocket frame, e.g.

{ <http-101-response><web-socket-frame> }

Effectively, when we know we are at the end of the conversation, we need to write any extra content left in the buffer to the decoder “out” output (following Apache MINA’s codec conventions):

    private boolean decodeCommandResponse(IoSession session, IoBuffer in,
                                          ProtocolDecoderOutput out) throws Exception {
        ...
        session.setAttribute(DECODER_STATE_KEY, state);
        // Make sure we flush any data present in the buffer after the
        // SOCKS connection is made, and before this filter is removed from the chain.
        if (state == DecoderState.CONNECTED) {
            decodeExtraContentBuffer(in, out);
        }
        return true;
    }

    static boolean decodeExtraContentBuffer(IoBuffer in, ProtocolDecoderOutput out)
            throws Exception {
        if (in.hasRemaining()) {
            out.write(in.duplicate());
            in.position(in.limit());
        }
        return true;
    }
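
To see the effect in isolation, here is a small, self-contained sketch (not the production test, and assuming the MINA 2.x API) that feeds a left-over HTTP handshake fragment through decodeExtraContentBuffer. The demo class and the collecting ProtocolDecoderOutput are purely illustrative; the point is that the remaining bytes are forwarded downstream rather than silently dropped when the SOCKS codec is removed.

// Standalone illustration (MINA 2.x API assumed): whatever is left in the buffer
// after the final SOCKS message is handed to the decoder output, not discarded.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.mina.core.buffer.IoBuffer;
import org.apache.mina.core.filterchain.IoFilter.NextFilter;
import org.apache.mina.core.session.IoSession;
import org.apache.mina.filter.codec.ProtocolDecoderOutput;

public class ExtraContentDemo {

    public static void main(String[] args) throws Exception {
        // Pretend the SOCKS response has already been consumed; only the start of
        // the HTTP WebSocket handshake remains in the coalesced buffer.
        IoBuffer coalesced =
                IoBuffer.wrap("GET /ws HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII));

        List<Object> forwarded = new ArrayList<>();
        ProtocolDecoderOutput out = new ProtocolDecoderOutput() {
            @Override
            public void write(Object message) {
                forwarded.add(message); // capture what would flow to the next decoder
            }

            @Override
            public void flush(NextFilter nextFilter, IoSession session) {
                // not needed for this demo
            }
        };

        decodeExtraContentBuffer(coalesced, out);

        System.out.println("forwarded messages: " + forwarded.size()); // 1
        System.out.println("bytes left behind:  " + coalesced.remaining()); // 0
    }

    static boolean decodeExtraContentBuffer(IoBuffer in, ProtocolDecoderOutput out)
            throws Exception {
        if (in.hasRemaining()) {
            out.write(in.duplicate());
            in.position(in.limit());
        }
        return true;
    }
}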

Issue 2: Dispatching with Coalesced Buffers

As in Issue 1, the virtual machine was slowing down TCP writes to the point where they were coalesced into a single buffer. The DMZ machine was still sending the final SOCKS response and the HTTP WebSocket handshake in a single buffer, and the internal machine was now (correctly, after the Issue 1 fix) receiving an IO buffer with content from multiple protocols:

{ <0x05 remaining-9-socks-response-bytes> <0x47 ('G' for GET) remaining-http-handshake-bytes> }
#  [buffer.position=10 tells us we should read HTTP bytes only]

We were now passing this buffer to some dispatching logic. Briefly, the dispatch filter’s job is to inspect the first byte of the buffer it sees and work out whether we are talking RTMP, HTTP, or SOCKS, so we know how to process the buffer. When reading that first byte, we were always reading from position “0”, which for coalesced buffers ended up being the 0x05 byte from the SOCKS command.

But we really wanted the 0x47 byte ('G' for GET), so we would know we had shifted to HTTP.

The bad and fixed code:

// Bad code reads from position 0

    @Override
    public void messageReceived(NextFilter nextFilter, IoSession session,
                                Object message) throws Exception {
        IoBuffer buffer = (IoBuffer) message;
        // Absolute read: index 0 is the 0x05 SOCKS byte when buffers are coalesced.
        int discriminator = buffer.get(0) & 0xff;
    }

// Good code is smarter for coalesced buffers.

    @Override
    public void messageReceived(NextFilter nextFilter, IoSession session,
                                Object message) throws Exception {
        IoBuffer buffer = (IoBuffer) message;
        // Relative read: the current position is past any bytes already consumed.
        int discriminator = buffer.get(buffer.position()) & 0xff;
    }
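
A tiny sketch (again assuming the MINA 2.x API, with fabricated SOCKS bytes and an invented class name) makes the difference concrete: once the SOCKS decoder has consumed its ten bytes, the buffer’s position has advanced, so an absolute get(0) still reports 0x05 while get(buffer.position()) reports 0x47.

// Demonstration of absolute vs relative reads on a coalesced buffer.
import java.nio.charset.StandardCharsets;
import org.apache.mina.core.buffer.IoBuffer;

public class DiscriminatorDemo {

    public static void main(String[] args) {
        // Ten (fabricated) SOCKS response bytes followed by the start of the HTTP handshake.
        byte[] socksResponse = { 0x05, 0x00, 0x00, 0x01, 0, 0, 0, 0, 0, 0 };
        byte[] httpHandshake = "GET /ws HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII);

        IoBuffer coalesced = IoBuffer.allocate(socksResponse.length + httpHandshake.length);
        coalesced.put(socksResponse).put(httpHandshake).flip();

        // Simulate the SOCKS codec having consumed its portion of the buffer.
        coalesced.position(socksResponse.length);

        int fromZero = coalesced.get(0) & 0xff;                        // 0x05: looks like SOCKS
        int fromPosition = coalesced.get(coalesced.position()) & 0xff; // 0x47: 'G', so HTTP

        System.out.printf("get(0) = 0x%02x, get(position()) = 0x%02x%n", fromZero, fromPosition);
    }
}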

Issue 3: Being Too Idle

Now our SOCKS connections were stable, but we were still thrashing our prepared connections.

The last issue was a race condition: if we enabled WebSocket PING/PONG as a keep-alive mechanism every 10 seconds or so, we would become unstable, even with the other fixes in place. The problem was that if the PONG came back really fast, just after the PING, we did not “see” it; we then thought the session had become IDLE, and we closed the WebSocket connection. The fix was to avoid the race condition by making it impossible to receive a PONG for the session before we were expecting one. This explained why the prepared connections were thrashing.
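
Here is a hedged sketch of the shape of that fix, not the actual gateway code; the class and method names are assumptions. If the “a PONG is expected” flag is only set after the PING write completes, an instantaneous PONG can slip past unnoticed and the session later looks idle; setting the flag before the PING is sent removes that window.

// Illustrative only: the names and structure are assumptions, not Kaazing's code.
import java.util.concurrent.atomic.AtomicBoolean;

final class KeepAliveState {

    // True while a PING is outstanding and its PONG has not yet been observed.
    private final AtomicBoolean awaitingPong = new AtomicBoolean(false);

    /** Call BEFORE writing the PING frame, so even an instant PONG is never missed. */
    void pingAboutToBeSent() {
        awaitingPong.set(true);
    }

    /** Called from the read path whenever a PONG frame is decoded. */
    void pongReceived() {
        awaitingPong.set(false);
    }

    /** Called from the idle-timeout handler; only tear down if the PONG never arrived. */
    boolean shouldCloseAsIdle() {
        return awaitingPong.get();
    }
}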

Takeaway thoughts

Running on a deliberately slow, CPU-throttled virtual machine exposed timing and buffering behaviour we never saw on faster hardware. The common threads across the three issues: a protocol decoder has to assume that bytes from the next protocol can arrive coalesced in the same buffer as the last message of the current one; dispatch logic should read relative to the buffer’s current position, never from absolute index 0; and keep-alive state must be recorded before the probe is sent, so that a fast reply can never be missed.

For More Information about Kaazing

If you are interested in the Kaazing Gateway and the Enterprise Shield features, head on over to: