Erlang core dump in OpenSSL

Hi there,

I was running some load (maybe four to five hundred transactions/second) through my ejabberd installation when I got a core dump from the Erlang virtual machine. I ran it through gdb and got this backtrace:

#0 0xfed1420c in ?? ()
#1 0xfa845410 in CRYPTO_lock ()
from /usr/local/libs/libcrypto.so.0.9.8
#2 0xfa845480 in CRYPTO_add_lock ()
from /usr/local/libs/libcrypto.so.0.9.8
#3 0xfa57ca58 in SSL_CTX_free () from /usr/local/libs/libssl.so.0.9.8
#4 0xfa651d5c in tls_drv_finish () at tls_drv.c:237
#5 0x000511e8 in do_unload_driver_entry (dh=0xfaaeb738, save_name=0xfebfaf68)
at beam/erl_bif_ddll.c:1652
#6 0x000500a0 in ddll_no_more_references (vdh=0xfaaeb738)
at beam/erl_bif_ddll.c:1210
#7 0x000c9450 in exec_misc_ops (rq=0xd1c420) at beam/erl_process.c:6439
#8 0x000c8480 in schedule (p=0x53b81e8, calls=0) at beam/erl_process.c:6081
#9 0x00172140 in process_main () at beam/beam_emu.c:1126
#10 0x000bff00 in sched_thread_func (vesdp=0x117a0b0)
at beam/erl_process.c:3015
#11 0x001f4960 in thr_wrapper (vtwd=0xffbff100) at common/ethread.c:475
#12 0xff0c5800 in _lwp_start () from /usr/lib/libc.so.1
#13 0xff0c5800 in _lwp_start () from /usr/lib/libc.so.1
Previous frame identical to this frame (corrupt stack?)

Looks like it might have broken somewhere in an OpenSSL call. Has anyone else run into an issue like this? I've run fairly high load through it before and have not observed a problem. I'm running ejabberd 2.1.1, Erlang R13B01, and OpenSSL 0.9.8m (pretty sure it's version m) on a Solaris 10 installation. I took a quick look at the change logs for the newest versions of Erlang and ejabberd but didn't see anything particularly referencing this except:

OTP-8261 Fixed emulator crash caused by crypto using an old openssl
version that did not cope with large file descriptors.

in the Erlang logs, but I'm not sure if that applies. I'm willing to upgrade the Erlang and ejabberd versions to the latest if you think it'll help, but it's not a trivial task, so I'm wondering whether it'll be worth it. Any other advice you have would be helpful! Thanks!

EDIT: The OpenSSL version is 0.9.8l, not 0.9.8m; my mistake. Still, nothing in the change log between l and m looks like it would resolve this issue.

Re: Erlang core dump in OpenSSL

You can take a look at the first lines of your erl_crash.dump file; maybe they give some clue. Or do: "erl -s webtool", then open a web browser, enable the dump viewer, load the dump and take a look.
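For example, a session would look something like this (the exact banner and default port can vary by OTP release, so treat this as a sketch):

$ erl -s webtool
Erlang R13B01 (erts-5.7.2) ...
WebTool is available at http://localhost:8888/
Eshell V5.7.2 (abort with ^G)
1>

Then browse to that URL, start the Crashdump Viewer tool, and point it at your erl_crash.dump.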

Hmm

I can't seem to find an Erlang crash dump, which I find odd. Usually they're in the /var/log/ejabberd folder. All I seem to have is the core file.

I did try installing Erlang R13B04 and ejabberd 2.1.3, but now I have a new issue: I can hit the web client fine, but every time I try to sign in, I fail halfway through authentication with:

** Reason for termination =
** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},

This seems identical to the issue mentioned here: http://www.ejabberd.im/node/3223, but there doesn't seem to be a resolution there, nor do any of the solutions listed work. Perhaps I should start a separate thread for that, though.

Re: Erlang core dump in OpenSSL

Since there is a core dump, there will not be any crash dump. In 99% of cases, problems with crypto are caused by an incorrect Erlang installation.

Installing Erlang...

Interesting! I hadn't heard that, and I did have to do some tricks to get Erlang running, which might be causing an issue I'm not aware of.

Basically, since I'm running on Solaris, my only option is to build Erlang from source. Things are made trickier by the fact that I have absolutely no build tools on my runtime machine, so I have to build on a separate Solaris machine. My method of "installing" Erlang is to download the source code, run ./configure with a prefix pointing to a specific directory (e.g. /usr/local/me/erlang), then make install.

I then copy the built /usr/local/me/erlang folder verbatim to the runtime box, preserving the path structure. From there, I do the same with my ejabberd install and make sure it points to the correct Erlang. This had been working fine until now, when I hit either the load crash or the login failure on the newer versions. So is there something my install method is missing that could cause an issue like this?
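To be concrete, the whole procedure looks roughly like this (the hostname and exact paths are just placeholders for my setup):

# on the build box, which has the compilers
$ ./configure --prefix=/usr/local/me/erlang
$ make
$ make install

# ship the installed tree to the runtime box, keeping absolute paths
$ tar cf erlang.tar /usr/local/me/erlang
$ scp erlang.tar runtime-box:/tmp/

# on the runtime box
$ cd / && tar xf /tmp/erlang.tar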

Re: Erlang core dump in OpenSSL

Dunno. Perhaps your target system lacks some libraries.

Well

I did run into some problems like that at first, but I tracked down all the missing libraries (at least the ones I'm aware of or that were causing a problem) and put them in my LD_LIBRARY_PATH. This includes all the crypto/ssl libraries. So I'm not aware of any that could be missing, but I can't guarantee it either.
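That is, something like this (the path matches where the backtrace shows libcrypto/libssl living):

$ LD_LIBRARY_PATH=/usr/local/libs:$LD_LIBRARY_PATH
$ export LD_LIBRARY_PATH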

EDIT:
There is one weird issue with the startup header. When I run erl on my build box I get:

Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)

but on the runtime box I only get:

Eshell V5.7.2 (abort with ^G)
1>

without the "Header" at the beginning. I never was able to figure that out.

SMP 128?

Tronman wrote:

When I run erl on my build box I get:

Erlang R13B01 (erts-5.7.2) [source] [smp:128:128] [rq:128] [async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)

And another weird thing: do you have 128 processors?

In my case it's:

$ erl
Erlang R13B04 (erts-5.7.5) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.5  (abort with ^G)
1>

Maybe Erlang on Solaris has some bug. You can try to disable SMP support in ejabberdctl.cfg.
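For example, something like this in ejabberdctl.cfg:

SMP=disable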

Indeed

badlop wrote:

And another weird thing: do you have 128 processors?

It's a pretty high-end server that advertises up to 128 simultaneous jobs. I'm not exactly sure whether that translates to cores, processors, or some sort of sharing, but I don't think that number is wrong. I share the server with many other parties, so it likely doesn't actually have access to all 128 at once.

badlop wrote:

Maybe Erlang on Solaris has some bug. You can try to disable SMP support in ejabberdctl.cfg.

I am starting to fear this as well. Of course, my build box (the one I mentioned previously) is also Solaris 10, and that's where I got the proper header from. The runtime box is also Solaris 10, but *doesn't* show the header. So possibly some sort of configuration issue? I'm not a Solaris expert, but I don't get a choice in the OS :)

Re: Erlang core dump in OpenSSL

In the backtrace I see that the function tls_drv_finish() is called. This function is only called when ejabberd stops. Try commenting that function out - it is not so critical to free memory when ejabberd stops, unless you reload ejabberd in the emulator every second ;)
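As a minimal sketch of the workaround (the handle name below is only a stand-in for whatever tls_drv.c actually frees around line 237; keep the function defined so the driver entry still links):

/* tls_drv.c: make the driver's finish callback a no-op so that
 * SSL_CTX_free() is never reached during emulator shutdown. */
static void tls_drv_finish(void)
{
#if 0
        /* original cleanup, disabled as a workaround; the OS
         * reclaims this memory at process exit anyway */
        SSL_CTX_free(ssl_ctx);
#endif
}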

Re: Erlang core dump in OpenSSL

Hi zinid,

I'm not sure what you mean by "unless you reload ejabberd in the emulator every second"? Would you mind elaborating?

If tls_drv_finish() is only called when ejabberd stops, would that imply that my crash is occurring elsewhere and the core dump I'm getting is a consequence of ejabberd trying to restart itself to recover from a crash?

Thanks

Re: Erlang core dump in OpenSSL

Tronman wrote:

I'm not sure what you mean by "unless you reload ejabberd in the emulator every second"? Would you mind elaborating?

That was just a joke ;)

Quote:

If tls_drv_finish() is only called when ejabberd stops, would that imply that my crash is occurring elsewhere and the core dump I'm getting is a consequence of ejabberd trying to restart itself to recover from a crash?

Yes, with one exception: ejabberd never tries to restart itself. In your case it just stops, or the Erlang emulator tries to stop it. So your core dump is not a cause but an effect. I think if you comment out the tls_drv_finish() function you will get a crash dump next time.

Re: Erlang core dump in OpenSSL

Oh I see. I'd thought that ejabberd restarted itself, since I could still log in/hit the web client after the core without actually starting it back up again. So maybe whatever caused it to crash (which I'd thought was a segmentation fault) only disconnected all the currently connected users (which is still bad, lol) but didn't actually bring the system down.

The load environment is down, so I haven't yet tested it since upgrading to 2.1.3. I wasn't able to upgrade the Erlang version, since it gave me different errors when I tried to log in (again involving the crypto library). I've also increased ERL_MAX_PORTS, ERL_PROCESSES and ERL_MAX_ETS_TABLES. Hopefully I'll have more luck; if not, I'll be popping back in here. Thanks!

Update

A minor update:

It appears that I can recreate the core pretty easily: all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while it's connected.

It appears that killing the emulator with a connected client causes it to core with the same backtrace I reported. Is this expected behavior or a bug in ejabberd? So it's possible something did indeed cause the emulator to crash/stop while clients were connected, which caused the core, rather than the core causing the clients to disconnect.

ejabberdctl stop is a normal operation, no crash

Tronman wrote:

A minor update:

It appears that I can recreate the core pretty easily: all I have to do is connect a client with Pidgin, then run ./ejabberdctl stop while it's connected.

It appears that killing the emulator with a connected client causes it to core with the same backtrace I reported. Is this expected behavior or a bug in ejabberd? So it's possible something did indeed cause the emulator to crash/stop while clients were connected, which caused the core, rather than the core causing the clients to disconnect.

That isn't normal. With ejabberd 2.1.3 compiled and installed from source on Debian Sid 32-bit, with the Erlang packaged in Debian, and with Pidgin (home) and Tkabber (work) connected to my 'badlop' account, when I execute "ejabberdctl stop" there isn't any crash; this is simply logged in ejabberd.log and ejabberd stops:

=INFO REPORT==== 2010-05-17 12:07:05 ===
I(<0.366.0>:ejabberd_c2s:1409) : ({socket_state,tls,{tlssock,#Port<0.3910>,#Port<0.3957>},<0.365.0>})
  Close session for badlop@localhost/Home

=INFO REPORT==== 2010-05-17 12:07:05 ===
I(<0.369.0>:ejabberd_c2s:1409) : ({socket_state,gen_tcp,#Port<0.3960>,<0.368.0>})
  Close session for badlop@localhost/work

=INFO REPORT==== 2010-05-17 12:07:10 ===
I(<0.41.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Another update

Bah, that's a problem then. It seems to happen for me on a fresh install of ejabberd 2.1.3 on Solaris 10, though I did have to change ejabberdctl from a shell script to a bash script. The shell script was giving me a "bad substitution" error but wouldn't identify what line, so I don't know what's causing it.

I also had to change the id -g/-G calls to use /usr/xpg4/bin/id, since the id in the default path didn't support the -g option.
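For reference, the difference looks something like this on Solaris 10 (exact error text may vary):

$ /usr/bin/id -g
id: illegal option -- g
$ /usr/xpg4/bin/id -g
100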

The logs do show me:

=INFO REPORT==== 2010-05-17 21:12:24 ===
I(<0.470.0>:ejabberd_c2s:1366) : ({socket_state,tls,{tlssock,#Port<0.3434>,#Port<0.3454>},<0.469.0>}) Close session for admin1@localhost/27926996461274119930670529

=INFO REPORT==== 2010-05-17 21:12:29 ===
I(<0.37.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Then the core occurs at 21:12:41, around 10-12 seconds later.

I've also discovered that:

The client doesn't have to be connected when the emulator goes down for it to core. However, at least one client must have been connected during the lifetime of the session. To recap:

1. Start Ejabberd/Stop Ejabberd = No Core
2. Start Ejabberd/Connect Client/Stop Ejabberd = Core
3. Start Ejabberd/Connect Client/Disconnect Client/Stop Ejabberd = Core

gdb always shows:

1. "Program terminated with signal 11, Segmentation fault."
2. The core was generated by: beam.smp -K true -P
3. The backtrace is always the same as I first listed, with CRYPTO_lock being the last frame executed.

Which brings me back to the possibility that something is screwing up in OpenSSL... but what? And why only if someone has connected? Does that cleanup code only get executed in that case?

EDIT: You might also find it interesting that ejabberd takes 10 to 15 seconds to start, that is, from the time I invoke ./ejabberdctl start to the time I can hit the web client. This is quite a bit slower than my initial tests on Ubuntu, and it has always been the case since I've been running it on the Solaris machine.

EDIT2: I took your advice and tried disabling symmetric multiprocessing. It seems to have stopped the cores when logging clients in and out, which is great, but I don't know about under load yet. Speaking of load, any idea how disabling SMP will impact ejabberd's performance? I'm not sure how ejabberd would even make use of it, since all the XMPP messages are asynchronous. Or perhaps any guesses on why SMP was causing such a weird error? And again, I'm sure I never had such strange behavior before, even with SMP set to auto... so something in the environment could still have changed somewhere.

Thanks!

Some ideas to try

Tronman wrote:

The logs do show me:

=INFO REPORT==== 2010-05-17 21:12:24 ===
I(<0.470.0>:ejabberd_c2s:1366) : ({socket_state,tls,{tlssock,#Port<0.3434>,#Port<0.3454>},<0.469.0>}) Close session for admin1@localhost/27926996461274119930670529

=INFO REPORT==== 2010-05-17 21:12:29 ===
I(<0.37.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

Then the core occurs at 21:12:41, around 10-12 seconds later.

So, the Erlang node crashes while it is being stopped. You can try to start and stop it interactively:

$ ejabberdctl live
...
=INFO REPORT==== 20-May-2010::11:54:15 ===
I(<0.41.0>:ejabberd_app:69) : ejabberd 2.1.3 is started in the node ejabberd@localhost

=PROGRESS REPORT==== 20-May-2010::11:54:15 ===
         application: ejabberd                
          started_at: ejabberd@localhost      

... login a client ...

(ejabberd@localhost)1> application:stop(ejabberd).

=INFO REPORT==== 20-May-2010::12:01:23 ===
I(<0.41.0>:ejabberd_app:86) : ejabberd 2.1.3 is stopped in the node ejabberd@localhost

=INFO REPORT==== 20-May-2010::12:01:24 ===
    application: ejabberd                 
    exited: stopped                       
    type: temporary                       
ok                                        
(ejabberd@localhost)2> application:stop(mnesia).  

=INFO REPORT==== 20-May-2010::12:01:29 ===
    application: mnesia                   
    exited: stopped
    type: permanent
ok
(ejabberd@localhost)3> application:stop(crypto).

=INFO REPORT==== 20-May-2010::12:01:35 ===
    application: crypto
    exited: stopped
    type: temporary
ok
(ejabberd@localhost)6> init:stop().
ok

$

Tronman wrote:

Which brings me back to the possibility that something is screwing up in OpenSSL... but what? And why only if someone has connected? Does that cleanup code only get executed in that case?

Maybe OpenSSL (or some part of it, or some library of it) is only loaded/started when the first client connects to the server.

Tronman wrote:

EDIT: You might also find it interesting that ejabberd takes 10 to 15 seconds to start, that is, from the time I invoke ./ejabberdctl start to the time I can hit the web client. This is quite a bit slower than my initial tests on Ubuntu, and it has always been the case since I've been running it on the Solaris machine.

If ejabberd doesn't stop cleanly in Solaris, it may need to repair the Mnesia tables when it's started, showing something like this in the log files:

dets: file "/var/lib/ejabberd/offline_msg.DAT" not properly closed, repairing ...

Tronman wrote:

Or perhaps any guesses on why SMP was causing such a weird error?

Maybe Erlang support for SMP in Solaris wasn't completely stable in Erlang R13B01.

Tronman wrote:

And again, I'm sure I never had such strange behavior before, even with SMP set to auto... so something in the environment could still have changed somewhere.

'before' what? Updating Erlang R12 to R13B01? Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)? ...

You can try to update Erlang to the most recent R13B04.

RE: Some ideas to try

badlop wrote:

'before' what? Updating Erlang R12 to R13B01? Updating ejabberd 2.0.0 to 2.1.3 (which has SMP enabled by default)? ...

You can try to update Erlang to the most recent R13B04.

That's the strange thing: I haven't changed anything... it just started crashing this way out of nowhere, as far as I can tell. I used to be able to turn ejabberd on and off with SMP set to auto, but now I can't. No hardware changes, no software changes, no configuration changes. There *might* have been changes in the Solaris environment that I'm unaware of, but I don't think so. I didn't see any error messages such as "not properly closed" in the ejabberd log (the level is set at INFO).

I did indeed try upgrading to R13B04 and ejabberd 2.1.3, but I was no longer able to log any clients in, receiving errors such as:

** Reason for termination =
** {badarg,[{erlang,port_control,[crypto_drv02,1,"admin1:localhost:admin1"]},

which I spent some time trying to solve but eventually gave up on, so I went back to the original R13B01 and 2.1.2 setup (I think I said 2.1.1 in my original message, but I meant 2.1.2; sorry about that).

Anyway, setting aside why I suddenly had to disable SMP: since disabling it, the crashes/core dumps have stopped, but under load (~30,000 simultaneous users) many of my simulated clients are getting timeout errors when trying to sign in. I have the following values set in ejabberdctl:

POLL=true
SMP=disable
ERL_MAX_PORTS=320000
ERL_PROCESSES=2500000
ERL_MAX_ETS_TABLES=100000
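To sanity-check that the node actually picked these up, something like this from the Erlang shell should work (process_limit is a standard system_info item; max_fds is reported in the check_io info list on most platforms and effectively bounds the number of ports - both assume an R13-era runtime):

1> erlang:system_info(process_limit).
...
2> proplists:get_value(max_fds, erlang:system_info(check_io)).
...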

It looks like I'm hitting a wall just over 9,000 users. Specifically, 9,227 is the number shown logged in on this last test before I start getting hit with timeouts.

prstat shows memory usage around 900 MB and CPU usage from 0.5% to 1.0%. The memory/CPU usage of the other processes is negligible. The machine is a T5120 with 16 GB of RAM and a 64-core processor. So I think I should be able to handle a lot more users, and this ~9,000-user limit seems to be artificially imposed by something; I just don't know what!
