RPC failure talking to local node when issuing join_cluster

Hi there. I'm stumped trying to get ejabberd nodes to cluster. It's SO CLOSE, but it won't make the final connection.

I've got one node running fine. The second node starts up and then issues the `join_cluster` command. The script handling that looks like this:

echo "Join cluster..."

response=$(${EJABBERDCTL} ping ${erlang_cluster_node})
while [ "$response" != "pong" ]; do
    echo "Waiting for ${erlang_cluster_node}..."
    sleep 2
    response=$(${EJABBERDCTL} ping ${erlang_cluster_node})
done

echo "Join cluster at ${erlang_cluster_node}... "
NO_WARNINGS=true ${EJABBERDCTL} join_cluster ${erlang_cluster_node}

The output is:

Join cluster...
Join cluster at ejabberd@i-077514ab35d8140b8.node.us-east-1.consul...
[info] Stop accepting TCP connections at 0.0.0.0:5222 for ejabberd_c2s
[info] Stop accepting TCP connections at 0.0.0.0:5269 for ejabberd_s2s_in
[info] Stop accepting TCP connections at 0.0.0.0:5280 for ejabberd_http
[info] Stop accepting TCP connections at 0.0.0.0:5281 for ejabberd_http
Failed RPC connection to the node 'ejabberd@i-0e1acb2584c61d913.node.us-east-1.consul': timeout

The `Join cluster at...` line shows that it found the correct DNS name of the other node and successfully pinged it. After that we go into the ejabberd Erlang code at https://github.com/processone/ejabberd/blob/62806607bf52cf57a123885ee18f...

Here we shut down the ejabberd application (that seems to be what all the `Stop accepting` messages are) and then reconfigure Mnesia.

So here's the super confusing part: the address in the "Failed RPC connection" message is the local node itself. It obviously can't talk to itself, since it just shut itself down, and furthermore I'm not even sure where in the code it would be trying to call the local node.

It can successfully ping the other node. I've even SSH'd in, `docker exec`'d into the container, run an Erlang shell, and verified this myself. I'm not sure why it can't complete the clustering operation from this point.

Please help!

Got this solved. It boiled down to a custom module we wrote not shutting down correctly.
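
A side note on the startup script from the question, unrelated to the root cause: its wait loop retries forever, so a node that never answers leaves the container hanging with no error. Below is a minimal sketch of a bounded variant. It assumes, as the original script does, that `${EJABBERDCTL} ping <node>` prints `pong` on success. The function name `wait_for_node` and its `max_attempts` and `delay` parameters are made up for illustration and are not part of ejabberdctl.

```shell
# Sketch only: bounded version of the wait loop from the question.
# wait_for_node, max_attempts and delay are hypothetical names.
wait_for_node() {
    node="$1"
    max_attempts="${2:-30}"   # assumed default: give up after 30 tries
    delay="${3:-2}"           # seconds between tries, as in the original loop
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Same check as the original script: ping is expected to print "pong".
        response=$(${EJABBERDCTL} ping "$node" 2>/dev/null)
        if [ "$response" = "pong" ]; then
            return 0
        fi
        echo "Waiting for ${node} (attempt ${attempt}/${max_attempts})..."
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "Gave up waiting for ${node}" >&2
    return 1
}
```

It could be called as `wait_for_node "${erlang_cluster_node}" 30 2 || exit 1` before the `join_cluster` line, so the script stops with a clear message instead of looping indefinitely.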
