cluster nodes don't notice if a previously down server becomes available again

Hi.

Thanks to badlop's fine answers (http://www.ejabberd.im/node/4330) I seem to have a quite well-running clustered setup now.
In principle I've followed the guide for this and have two nodes (which are connected via a secure IPsec tunnel) forming the cluster.

I guess I simply set the mode (disk only / disk+ram / ram only) of each table to be the same as on the already existing node; I hope that is right?! I have put the output of ejabberdctl mnesia info at the end of this post.
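In other words, for each table I ended up with the same storage type on both nodes. Expressed as Mnesia calls (just a sketch with my own node names, and the mapping disk+ram = disc_copies, disk only = disc_only_copies, ram only = ram_copies is only how I understood it; I may well have done the same thing via the web admin instead), that would be something like this on the new node, in the shell that "ejabberdctl debug" opens:

%% pull in the schema from the already running node first
mnesia:change_config(extra_db_nodes, ['ejabberd@a.example.org']).
mnesia:change_table_copy_type(schema, node(), disc_copies).

%% then copy each table with the same storage type it has on node a,
%% e.g. one example of each kind:
mnesia:add_table_copy(roster,      node(), disc_copies).      %% disk+ram
mnesia:add_table_copy(offline_msg, node(), disc_only_copies). %% disk only
mnesia:add_table_copy(session,     node(), ram_copies).       %% ram only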

At least one bigger problem remains, however:
If the connection gets lost (and that seems to be independent of the IPsec thingy), ejabberd notices this at some point and gives me:
running db nodes = ['ejabberd@b.example.org']
stopped db nodes = ['ejabberd@a.example.org']
...and vice versa.

However, it seems the connection is never retried, and both nodes think forever that the other one is down.
After restarting one of them, it's recognised again (and as far as I understand the whole idea, all tables are then synchronised).
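Restarting the whole node is the only thing I have found so far. I guess one could also try to re-attach the remote node by hand from the shell that "ejabberdctl debug" opens, something like the following (node names are from my setup, and I am not sure this is the intended way):

%% check whether the Erlang distribution link itself is back
net_adm:ping('ejabberd@a.example.org').   %% pong = reachable again, pang = still down

%% if the ping succeeds but Mnesia still lists the node as stopped,
%% ask Mnesia to reconsider it and check the result:
mnesia:change_config(extra_db_nodes, ['ejabberd@a.example.org']).
mnesia:system_info(running_db_nodes).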

btw: What would happen if a table is modified on each of the hosts while no connection is there (e.g. the network link dies, but clients continue to use both hosts)? Is there some kind of merging and conflict resolution?
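From what I have read in the Mnesia documentation so far (please correct me if I'm wrong), there is no automatic merge in that case: when the nodes see each other again, Mnesia only reports that the database is inconsistent and leaves the decision to the administrator. My rough understanding, as a sketch:

%% a node can subscribe to Mnesia's system events ...
mnesia:subscribe(system).
%% ... and after such a partition it would receive something like:
%% {mnesia_system_event,
%%  {inconsistent_database, running_partitioned_network, 'ejabberd@a.example.org'}}

%% the usual way out then seems to be to declare one node the winner and
%% restart the other one, which then reloads its tables from the winner
%% and discards its own diverging changes:
mnesia:set_master_nodes(['ejabberd@a.example.org']).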

Thanks,
Chris.

ejabberdctl mnesia info:
# ejabberdctl mnesia info
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
mod_register_ip: with 0 records occupying 284 words of mem
local_config : with 18 records occupying 5010 words of mem
caps_features : with 6 records occupying 1276 words of mem
config : with 16 records occupying 781 words of mem
http_bind : with 0 records occupying 284 words of mem
reg_users_counter: with 4 records occupying 500 words of mem
pubsub_subscription: with 0 records occupying 284 words of mem
bytestream : with 0 records occupying 284 words of mem
privacy : with 0 records occupying 284 words of mem
passwd : with 5 records occupying 720 words of mem
roster : with 10 records occupying 3540 words of mem
last_activity : with 5 records occupying 857 words of mem
roster_version : with 0 records occupying 284 words of mem
pubsub_last_item: with 0 records occupying 284 words of mem
offline_msg : with 0 records occupying 8641 bytes on disc
route : with 32 records occupying 2500 words of mem
motd : with 0 records occupying 284 words of mem
acl : with 4 records occupying 511 words of mem
s2s : with 0 records occupying 284 words of mem
vcard : with 3 records occupying 10888 bytes on disc
captcha : with 0 records occupying 284 words of mem
pubsub_index : with 1 records occupying 296 words of mem
session_counter: with 4 records occupying 500 words of mem
vcard_search : with 3 records occupying 2868 words of mem
motd_users : with 0 records occupying 284 words of mem
schema : with 35 records occupying 4613 words of mem
session : with 4 records occupying 980 words of mem
private_storage: with 0 records occupying 5752 bytes on disc
pubsub_item : with 5 records occupying 8856 bytes on disc
muc_room : with 0 records occupying 284 words of mem
pubsub_state : with 15 records occupying 1578 words of mem
iq_response : with 0 records occupying 284 words of mem
muc_registered : with 0 records occupying 284 words of mem
muc_online_room: with 0 records occupying 284 words of mem
pubsub_node : with 15 records occupying 3862 words of mem
===> System info in version "4.4.14", debug level = none <===
opt_disc. Directory "/var/lib/ejabberd" is used.
use fallback at restart = false
running db nodes = ['ejabberd@b.example.org','ejabberd@a.example.org']
stopped db nodes = []
master node tables = []
remote = []
ram_copies = [bytestream,captcha,http_bind,iq_response,
mod_register_ip,muc_online_room,pubsub_last_item,
reg_users_counter,route,s2s,session,session_counter]
disc_copies = [acl,caps_features,config,last_activity,local_config,
motd,motd_users,muc_registered,muc_room,passwd,privacy,
pubsub_index,pubsub_node,pubsub_state,
pubsub_subscription,roster,roster_version,schema,
vcard_search]
disc_only_copies = [offline_msg,private_storage,pubsub_item,vcard]
[{'ejabberd@a.example.org',disc_copies}] = [caps_features,
local_config]
[{'ejabberd@a.example.org',disc_copies},
{'ejabberd@b.example.org',disc_copies}] = [pubsub_node,
muc_registered,
pubsub_state,muc_room,
schema,motd_users,
vcard_search,
pubsub_index,acl,motd,
roster_version,
last_activity,roster,
passwd,privacy,
pubsub_subscription,
config]
[{'ejabberd@a.example.org',disc_only_copies},
{'ejabberd@b.example.org',disc_only_copies}] = [pubsub_item,
private_storage,
vcard,offline_msg]
[{'ejabberd@a.example.org',ram_copies}] = [mod_register_ip]
[{'ejabberd@a.example.org',ram_copies},
{'ejabberd@b.example.org',ram_copies}] = [muc_online_room,
iq_response,session,
session_counter,captcha,
s2s,route,
pubsub_last_item,
bytestream,
reg_users_counter,
http_bind]
1206 transactions committed, 216 aborted, 10 restarted, 294 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []

other case

It seems that this can also happen when the connection is not really lost but just getting rather slow, e.g. because of high load.
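Maybe the Erlang distribution tick time plays a role here: as far as I know, if no tick gets through for net_ticktime seconds (60 by default), the nodes consider each other down even though the link is only slow. A sketch of how one could check and raise it from "ejabberdctl debug" (120 is just an arbitrary example value):

net_kernel:get_net_ticktime().     %% default is 60 (seconds)
net_kernel:set_net_ticktime(120).  %% raise it; takes effect after a transition period

%% alternatively it can be set at start time by passing
%% "-kernel net_ticktime 120" to erl.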

got even worse

The situation has got even worse now: even when restarting (one or both nodes), they do not connect anymore. The firewall and the IPsec tunnel are OK, however.
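Since this looks like a problem below Mnesia, I will go through the low-level checks first (epmd listens on TCP port 4369, and the actual distribution connection uses an additional, dynamically chosen port, which a firewall between the nodes would also have to allow). From "ejabberdctl debug" on each node:

node().                                  %% the local node name
erlang:get_cookie().                     %% must be identical on both nodes
net_adm:ping('ejabberd@a.example.org').  %% pong = distribution works, pang = it does not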
