BOSH infinite loop followed by an ejabberd crash.

Hello Forum.

We're seeing some funky stuff with our BOSH testing. We haven't been able to fully track down the cause yet, but at some point ejabberd (2.1.5) starts spewing tons and tons of debug logs (~500MB in a few minutes), consisting mostly of this:

=INFO REPORT==== 2010-11-02 09:45:23 ===
D(<0.550.0>:ejabberd_http_bind:918) : OutPacket: [{xmlstreamstart,
                                                  "stream:stream",
                                                  [{"xml:lang","en"},
                                                   {"xmlns","jabber:client"},
                                                   {"xmlns:stream",
                                                    "http://etherx.jabber.org/streams"},
                                                   {"id","4224552436"},
                                                   {"from",
                                                    ""}]}]

This happens with only two or three users connected. Eventually the server crashes with this reason:

Slogan: eheap_alloc: Cannot allocate 912262800 bytes of memory (of type "heap").

I analyzed the crash dump in the Erlang crashdump_viewer and it looks like the error_logger process had its mailbox filled up. This makes sense: the disk on this machine is rather slow, so it lagged in writing out the debug traces and the memory balloon burst.
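
On a live node the same thing can be spotted from an attached Erlang shell before it gets to the point of crashing, with a one-liner like this:

%% check how far behind error_logger is (run in a shell attached to the node)
erlang:process_info(whereis(error_logger), message_queue_len).
%% -> {message_queue_len, N}, where N keeps growing when the disk can't keep up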

On our production cluster of two nodes, where we run at log_level 3, this manifests as an exorbitant amount of CPU consumption (30% for beam.smp with no users connected). Just to confirm, I attached a debugger to the running session and got a TON of these traces:

(<6363.1192.0>) call ejabberd_http_bind:prepare_response({http_bind,"d0abbca731f18294c02f10768914d9a54a76e26c",<6363.1193.0>,
           {"",[]},
           1,0,100,0.0},5641028,[{xmlstreamstart,"stream:stream",
                 [{"xml:lang","en"},
                  {"xmlns","jabber:client"},
                  {"xmlns:stream","http://etherx.jabber.org/streams"},
                  {"id","3496582496"},
                  {"from",""}]}],true)
(<6363.3183.0>) call ejabberd_http_bind:prepare_response({http_bind,"661e696cdfcad0869e27ca8dfc6f427506beb75a",<6363.3186.0>,
           {"",[]},
           1,0,100,0.0},5641443,[{xmlstreamstart,"stream:stream",
                 [{"xml:lang","en"},
                  {"xmlns","jabber:client"},
                  {"xmlns:stream","http://etherx.jabber.org/streams"},
                  {"id","355914092"},
                  {"from",""}]}],true)

What we suspect happens is that a client sends a BOSH stream start request but doesn't wait for the reply and closes the socket. Looking through the code, I see that in ejabberd_http_bind, prepare_response/4 calls prepare_outpacket_response/4, which matches the second clause of the function:

%% Handle a new session along with its output payload
prepare_outpacket_response(#http_bind{id=Sid, wait=Wait, 
				      hold=Hold, to=To}=Sess,
			   Rid, OutPacket, true) ->

Here OutEls ends up being [], since, as the log entries above show, OutPacket contains only the single xmlstreamstart element:

   case OutPacket of
	[{xmlstreamstart, _, OutAttrs} | Els] ->
...
	    OutEls =
		case Els of
		    [] ->
			[];

And eventually prepare_response/4 is called again and the whole cycle repeats:

	    case OutEls of 
		[] ->
		    prepare_response(Sess, Rid, OutPacket, true);

I haven't been able to trace http_get/2 in prepare_response/4 yet, so I am just bringing this up on the list for someone who is more intimate with this code.
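When I get a chance, I plan to try something along these lines to capture http_get/2's return value as well (untested on the running node):

%% trace local calls to http_get/2 and their return values
dbg:tpl(ejabberd_http_bind, http_get, 2, [{'_', [], [{return_trace}]}]).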

Would anyone here have any ideas on how to track this down further?

Thanks!

Just a quick note: the infinite loop can be reproduced by applying this patch to ejabberd 2.1.5 and logging in with any BOSH client:

--- a/src/web/ejabberd_http_bind.erl
+++ b/src/web/ejabberd_http_bind.erl
@@ -903,6 +903,7 @@ update_shaper(ShaperState, PayloadSize) ->
     end.

 prepare_response(Sess, Rid, OutputEls, StreamStart) ->
+    ?INFO_MSG("prepare_response called with OutputEls: ~n~p", [OutputEls]),
     receive after Sess#http_bind.process_delay -> ok end,
     case catch http_get(Sess, Rid) of
        {ok, cancel} ->
@@ -926,6 +927,7 @@ prepare_response(Sess, Rid, OutputEls, StreamStart) ->

 %% Send output payloads on establised sessions
 prepare_outpacket_response(Sess, _Rid, OutPacket, false) ->
+    ?INFO_MSG("prepare_outpacket_response called with OutPacket: ~n~p", [OutPacket]),
     case catch send_outpacket(Sess, OutPacket) of
        {'EXIT', _Reason} ->
            {200, ?HEADER,
@@ -1015,8 +1017,12 @@ prepare_outpacket_response(#http_bind{id=Sid, wait=Wait,


 http_get(#http_bind{pid = FsmRef, wait = Wait, hold = Hold}, Rid) ->
-    gen_fsm:sync_send_all_state_event(
-      FsmRef, {http_get, Rid, Wait, Hold}, 2 * ?MAX_WAIT * 1000).
+    {ok, [{xmlstreamstart, "stream:stream",
+             [{"xml:lang","en"},
+             {"xmlns","jabber:client"},
+             {"xmlns:stream", "http://etherx.jabber.org/streams"},
+             {"id","4224552436"},
+             {"from", ""}]}]}.

 send_outpacket(#http_bind{pid = FsmRef}, OutPacket) ->
     case OutPacket of

In the meantime, do you think it would be too horrible to make a small module that tracks these processes and kills any long-running ones? Is there anything terrible in just doing "exit(Pid, kill)" on them? That is, would it actually destroy the BOSH sessions of valid clients?
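
Something along these lines is what I had in mind (just a sketch; the module name, sweep interval, queue threshold, and the current_function check are all guesses on my part):

%% bosh_watchdog: hypothetical watchdog that periodically looks for processes
%% stuck in ejabberd_http_bind code with a huge mailbox and kills them.
-module(bosh_watchdog).
-export([start/0, loop/0]).

-define(INTERVAL, 60000).    %% sweep every minute (guess)
-define(MAX_QUEUE, 10000).   %% message_queue_len considered "runaway" (guess)

start() ->
    spawn(?MODULE, loop, []).

loop() ->
    [maybe_kill(Pid) || Pid <- erlang:processes()],
    receive after ?INTERVAL -> ok end,
    loop().

maybe_kill(Pid) ->
    case {erlang:process_info(Pid, current_function),
          erlang:process_info(Pid, message_queue_len)} of
        {{current_function, {ejabberd_http_bind, _, _}},
         {message_queue_len, Len}} when Len > ?MAX_QUEUE ->
            error_logger:warning_msg("killing runaway BOSH process ~p (queue ~p)~n",
                                     [Pid, Len]),
            exit(Pid, kill);
        _ ->
            ok
    end.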

max kalika wrote:

Would anyone here have any ideas on how to track this down further?

Thanks to you and Aleksey Shchepin, the bug now has a fix, and it's very short, see https://support.process-one.net/browse/EJAB-1358

I applied the patch on our dev environments and it's looking really great. Thanks!
