mnesia corruption in ejabberd cluster

A few times in the past week my ejabberd cluster has gotten into a bad state. Each time, one of our nodes had been offline for a while. We brought the other node online, saw the chat client work for a few seconds, and then saw the nodes stop working and refuse to start up properly again. The only fix I've found so far is to wipe out the database and start over completely.

The setup is straightforward: two nodes with a load balancer in front of them. Aside from the issue above, the cluster operates correctly. For fault tolerance, both nodes have the exact same mnesia configuration in terms of disc copies, ram copies, and disc only copies. This way, if we lose node0, node1 has all the data, and vice versa.
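For reference, this is roughly how the table layout can be compared between the two nodes. It is only a sketch, meant to be pasted into an Erlang shell attached to each running node. It uses the standard mnesia:system_info/1 and mnesia:table_info/2 calls, and the output will depend on which ejabberd modules are loaded.

```erlang
%% List every table known to this node with its local storage type
%% (ram_copies, disc_copies, disc_only_copies, or unknown if the table
%% has no replica on this node).
[{T, mnesia:table_info(T, storage_type)} || T <- mnesia:system_info(tables)].

%% Show which nodes mnesia currently considers up and running.
mnesia:system_info(running_db_nodes).
```

If the two nodes return different storage types for the same table, the replicas are not configured identically after all.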

Here's an example of the trace I see when we get into this bad state. The trace appears on both nodes:
** Generic server ejabberd_sm terminating
** Last message in was {mnesia_system_event,
                           {mnesia_down,
                               'ejabberd@node1'}}
** When Server state == {state}
** Reason for termination ==
** {function_clause,[{lists,foreach,
                         [#Fun,undefined]},
                     {mnesia_tm,non_transaction,5},
                     {ejabberd_sm,handle_info,2},
                     {gen_server,handle_msg,5},
                     {proc_lib,init_p_do_apply,3}]}

Leading up to the crash I see quite a few of these:
E(<0.4615.5>:ejabberd_s2s:85) : {badarg,
                                 [{lists,member,
                                   ["dev",undefined]},
                                  {ejabberd_s2s,'-is_service/2-fun-0-',2},
                                  {lists,any,2},
                                  {ejabberd_s2s,find_connection,2},
                                  {ejabberd_s2s,do_route,3},
                                  {ejabberd_s2s,route,3},
                                  {ejabberd_router,route,3},
                                  {lists,foreach,2}]}
when processing: {{jid,"953886d3-1903-34d2-8529-ee50bb3af545",
                       "dev",
                       "28668181131359131574557730",
                       "953886d3-1903-34d2-8529-ee50bb3af545",
                       "dev",
                       "28668181131359131574557730"},
                  {jid,"953886d3-1903-34d2-8529-ee50bb3af545",
                       "dev",[],
                       "953886d3-1903-34d2-8529-ee50bb3af545",
                       "dev",[]},
                  {xmlelement,"presence",[{"type","unavailable"}],[]}}

I read something elsewhere suggesting that mnesia can become corrupted if you begin processing transactions on a node before it has finished catching up. That could certainly be the case here, and it would make sense given that things were OK for a few seconds. I find it strange, though, that mnesia would have such a restriction.
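If that theory is right, one thing to try would be to block until the local tables have finished loading before the node takes any traffic. The following is a minimal sketch using mnesia:wait_for_tables/2. The module name mnesia_ready and the 60 second timeout are my own placeholders, and I have not checked whether ejabberd already does something equivalent at startup.

```erlang
-module(mnesia_ready).
-export([wait/0]).

%% Block until every table with a replica on this node has been loaded,
%% or give up after 60 seconds (an arbitrary timeout chosen for illustration).
wait() ->
    Tables = mnesia:system_info(local_tables),
    case mnesia:wait_for_tables(Tables, 60000) of
        ok ->
            ok;
        {timeout, Pending} ->
            %% Pending lists the tables that were still loading.
            {error, {still_loading, Pending}};
        {error, Reason} ->
            {error, Reason}
    end.
```

Calling mnesia_ready:wait() on the rejoining node before putting it back behind the load balancer would at least show whether the crash still happens once all tables report as loaded.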
