Character sets with mod_irc

Hello,

I am currently testing my own ejabberd server. Chatting in local rooms works, and I can also connect to IRC channels (= rooms) via my own server.

In "local chat" I have Gajim on one PC and irssi-xmpp-plugin on another. I've also tried mcabber instead of irssi-xmpp-plugin.

The character set between Gajim<->irssi-xmpp-plugin and Gajim<->mcabber looks correct. I suppose it is UTF-8 because the XMPP default is UTF-8. The problem comes when I try to use the irssi-xmpp-plugin or mcabber to connect via my server to an external IRC server channel. People using UTF-8 on those channels show garbled characters on my screen and people using iso8859-1 show correct characters on the screen.

Because this is happening with irssi-xmpp-plugin, Gajim, various terminals, screen, etc., I think it is something ejabberd-specific. I have enabled Unicode (UTF-8) in the different terminals, in "screen", etc. This problem never occurred when I used irssi without the plugin or Jabber.

The system I'm running is Ubuntu Lucid Lynx. I have this line in ejabberd.cfg:
{mod_irc, [{access, all}, {default_encoding, "utf-8"}]},

Changing the default_encoding seems to have no effect. What should I change next?

Thanks,

olliex wrote:

The system I'm running is Ubuntu Lucid Lynx. I have this line in ejabberd.cfg:
{mod_irc, [{access, all}, {default_encoding, "utf-8"}]},

Changing the default_encoding seems to have no effect. What should I change next?

That option is documented in the Guide, but I didn't see it implemented in the code. It can be implemented with this patch. Can you try it and report whether it works now?

--- a/src/mod_irc/mod_irc.erl
+++ b/src/mod_irc/mod_irc.erl
@@ -330,7 +330,7 @@ do_route1(Host, ServerHost, From, To, Packet) ->
                        [] ->
                            ?DEBUG("open new connection~n", []),
                            {Username, Encoding, Port, Password} = get_connection_params(
-                                                    Host, From, Server),
+                                                    Host, ServerHost, From, Server),
                            ConnectionUsername =
                                case Packet of
                                    %% If the user tries to join a
@@ -662,7 +662,14 @@ set_form(_Host, _, _, _Lang, _XData) ->
     {error, ?ERR_SERVICE_UNAVAILABLE}.


+%% Host = "irc.example.com"
+%% ServerHost = "example.com"
 get_connection_params(Host, From, IRCServer) ->
+    [_ | HostTail] = string:tokens(Host, "."),
+    ServerHost = string:join(HostTail, "."),
+    get_connection_params(Host, ServerHost, From, IRCServer).
+
+get_connection_params(Host, ServerHost, From, IRCServer) ->
     #jid{user = User, server = _Server,
         luser = LUser, lserver = LServer} = From,
     US = {LUser, LServer},
@@ -682,7 +689,10 @@ get_connection_params(Host, From, IRCServer) ->
                    {value, {_, Encoding}} ->
                        {Username, Encoding, ?DEFAULT_IRC_PORT, ""};
                    _ ->
-                       {Username, ?DEFAULT_IRC_ENCODING, ?DEFAULT_IRC_PORT, ""}
+                       Encoding = gen_mod:get_module_opt(
+                                    ServerHost, ?MODULE, default_encoding,
+                                    ?DEFAULT_IRC_ENCODING),
+                       {Username, Encoding, ?DEFAULT_IRC_PORT, ""}
                end,
            {NewUsername,
             NewEncoding,

Each user can also define what encoding he wants when connecting to specific IRC servers. This is possible by "registering" with the IRC transport.

Thanks for the changes. For which release is it? I tried to apply it for 2.1.5 and got some errors (only 1/3 hunks succeeded with fuzzy logic).

Change manual

olliex wrote:

Thanks for the changes. For which release is it? I tried to apply it for 2.1.5 and got some errors (only 1/3 hunks succeeded with fuzzy logic).

It's for ejabberd 2.1.5. Maybe the forum converted spaces to tabs, or vice versa. It's just 9 lines, so you can change them manually.

Changed

I have now downloaded the 2.1.5 sources, applied the changes from the patch to mod_irc.erl by hand, compiled, installed, and set "utf-8" in ejabberd.cfg. The same problem still exists: people writing utf-8 show wrong characters, but people writing 8859-1 show correct characters.

Second attempt

olliex wrote:

I have now downloaded the 2.1.5 sources, applied the changes from the patch to mod_irc.erl by hand, compiled, installed, and set "utf-8" in ejabberd.cfg. The same problem still exists: people writing utf-8 show wrong characters, but people writing 8859-1 show correct characters.

It seems the option was not yet read by mod_irc.erl. Note that I only tested that the code compiles; I didn't test the functionality myself.

I've rewritten the patch; get the new version here:
http://tkabber.jabber.ru/files/badlop/4270-215-ircencoding.patch
You need to revert the previous patch, or get the original file.

Let's hope this time the patch applies cleanly to your 2.1.5.

Every time the option is requested (either read from the config table, or using the default value), a line is written to ejabberd.log: "The default_encoding configured for host ... is ...". This allows you to check whether the option is read or not. If all works well, you can remove that line from your mod_irc.erl.

I applied the patch and it went OK without error messages. Now this looks much more promising. The people writing utf-8 show correct characters. However, for people writing ISO-8859-1, any character outside the 7-bit ASCII range is now not shown at all.

It seems that some code somewhere (in ejabberd?) is checking "utf-8 validity" and stripping out those "incorrect" characters.

Normally an IRC client (at least irssi) does the character conversion, because it is possible to convert from ISO-8859-1 to utf-8. So I think some type of "pass through", "no conversion" or "no check" option is needed in ejabberd, so that it can hand the character set check responsibility over to the IRC client.

Would it be technically possible to bring this support to ejabberd?

Patch to disable conversion

olliex wrote:

It seems that some code somewhere (in ejabberd?) is checking "utf-8 validity" and stripping out those "incorrect" characters.

Normally an IRC client (at least irssi) does the character conversion, because it is possible to convert from ISO-8859-1 to utf-8. So I think some type of "pass through", "no conversion" or "no check" option is needed in ejabberd, so that it can hand the character set check responsibility over to the IRC client.

This patch avoids making a stupid conversion (for example from utf-8 to utf-8):

--- a/src/mod_irc/iconv.erl
+++ b/src/mod_irc/iconv.erl
@@ -84,6 +84,8 @@ terminate(_Reason, Port) ->



+convert(From, To, String) when From == To ->
+    String;
 convert(From, To, String) ->
     [{port, Port} | _] = ets:lookup(iconv_table, port),
     Bin = term_to_binary({From, To, String}),

This patch disables conversion entirely, because in all cases the original string is returned without any change:

--- a/src/mod_irc/iconv.erl
+++ b/src/mod_irc/iconv.erl
@@ -85,6 +85,8 @@ terminate(_Reason, Port) ->


 convert(From, To, String) ->
+    String;
+convert(From, To, String) ->
     [{port, Port} | _] = ets:lookup(iconv_table, port),
     Bin = term_to_binary({From, To, String}),
     BRes = port_control(Port, 1, Bin),

With both of these iconv.erl patches, a graphical question mark was displayed. After that, the IRC client (irssi) showed a growing lag number, and I was not able to write anything after the first characters were shown. Those characters were some auto-replies to joining the channel.

I was using the irssi-xmpp-plugin and irssi. It might be that a bug in this plugin caused the hanging. But I checked their page and they claim to support utf-8 only. I also checked the mcabber page. They also claim to support utf-8 only as the spec mandates.

Then I checked the XMPP protocol (RFC 3920). It states that "Implementations MUST NOT attempt to use any other encoding" than UTF-8. So this statement seems to mean that XMPP can't be used for IRC at all. This is quite sad, really. The XMPP IRC solutions can't be used for flexible, backwards compatible "real world" use cases. XMPP people can create their own IRC channels but they can't contact other IRC users and discuss with them. IRC users will continue to use IRC servers. What's the point?

In any case, I think the first patch enables the correct behavior: ejabberd starts using utf-8. I think it should even be made static and unchangeable, as the spec says. I think the "default encoding" option is useless and can be removed in favor of utf-8-only support.

I think the problem is the bugs, not the protocol

olliex wrote:

1. RFC 3920: XMPP "Implementations MUST NOT attempt to use any other encoding" than UTF-8.
2. Olliex: So this statement seems to mean that XMPP can't be used for IRC at all.

I don't see 2. as a consequence of 1.

Let's imagine the worst scenario regarding encodings:
A) IRC client with iso8859 encoding
B) IRC server with iso8859 encoding
C) IRC-XMPP transport
D) XMPP server with utf-8 encoding
E) XMPP client with utf-8 encoding

That scenario satisfies 1., and will work perfectly as long as C speaks in iso8859 encoding with B, speaks in utf-8 encoding with D, and converts the content between encodings correctly.

In your tests, C is mod_irc and D is ejabberd. You get encoding problems, and I think that means there are bugs in one or several programs (maybe in mod_irc), but I can't yet conclude that it couldn't work once the bugs are fixed.
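As a minimal sketch of the conversion C would have to perform in that scenario: the module and function names below are made up for illustration, and this is not mod_irc's actual code (mod_irc converts through the iconv port in iconv.erl). It assumes the IRC side is ISO-8859-1 and uses the standard `unicode` module on binaries.

```erlang
%% Hypothetical illustration only; not part of ejabberd.
-module(irc_charset_sketch).
-export([irc_to_xmpp/1, xmpp_to_irc/1]).

%% Bytes received from an ISO-8859-1 IRC server (B) are re-encoded
%% as UTF-8 before being handed to the XMPP server (D).
%% Every Latin-1 byte maps to a code point, so this cannot fail.
irc_to_xmpp(Latin1Bytes) when is_binary(Latin1Bytes) ->
    unicode:characters_to_binary(Latin1Bytes, latin1, utf8).

%% UTF-8 bytes coming from the XMPP side are re-encoded as
%% ISO-8859-1 for the IRC server. Input that is not valid UTF-8,
%% or that contains characters above U+00FF, has no Latin-1 form,
%% so an error tuple is returned instead of silently dropping data.
xmpp_to_irc(Utf8Bytes) when is_binary(Utf8Bytes) ->
    case unicode:characters_to_binary(Utf8Bytes, utf8, latin1) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _Converted, _Rest} -> {error, not_representable};
        {incomplete, _Converted, _Rest} -> {error, not_representable}
    end.
```

For example, the single Latin-1 byte 233 ("é") becomes the two UTF-8 bytes 195,169 on the way to XMPP, and the reverse conversion restores it. What the transport should do with a character that has no Latin-1 form (drop it, replace it, or reject the message) is a policy choice this sketch leaves open.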

I have reordered the letters a little to make it clearer:

A) IRC client with iso8859 encoding
B) XMPP client with utf-8 encoding
C) XMPP server with utf-8 encoding
D) IRC-XMPP transport
E) IRC server with iso8859 encoding

So D speaks iso8859 OR utf-8 with E (in the IRC server -> client direction), and D speaks utf-8 with C.
This leaves two solutions:
1. If D checks the character set (to convert?), it has to know that it is iso8859.
2. If XMPP had the support, characters might be passed as-is from D to A.

Now if the user uses a utf-8 IRC client (A), as is the current practice, the character set has to be converted from X to utf-8, with no conversion when it's already utf-8. The client must at least check whether the input is utf-8 or not.

When it's not utf-8 the client uses some guessing. It might not use guessing if the user has defined that this-and-this channel (or "room") uses iso8859 or this-and-this user uses this-and-this character set.

The utf-8-only limitation (?) in XMPP means that only option 1 can be used, with guessing. The guessing would be a good feature if it works. I think the guessing would also need some extra parameters, such as language, to work better. That, again, needs some channel-specific information (a user might talk two different languages on different channels).
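One simple form of such guessing can be sketched as follows, under stated assumptions: incoming bytes are kept as-is if they are valid utf-8, and otherwise they are treated as a fallback 8-bit character set, assumed here to be ISO-8859-1. The module and function names are hypothetical, and this is not something ejabberd does. The sketch only separates "valid utf-8" from "everything else"; it cannot tell ISO-8859-1 apart from other 8-bit character sets, so the per-channel or per-user setting mentioned above would still be needed to choose the fallback.

```erlang
%% Hypothetical illustration only; not part of ejabberd.
-module(irc_guess_sketch).
-export([to_utf8/1]).

%% Validate the bytes as UTF-8 by converting UTF-8 to UTF-8.
%% On valid input the call returns a binary, which is passed through.
%% On invalid or truncated input it returns an {error, ...} or
%% {incomplete, ...} tuple, and the bytes are re-read as Latin-1
%% (the assumed fallback), a conversion that always succeeds.
to_utf8(Bytes) when is_binary(Bytes) ->
    case unicode:characters_to_binary(Bytes, utf8, utf8) of
        Valid when is_binary(Valid) ->
            Valid;
        _NotUtf8 ->
            unicode:characters_to_binary(Bytes, latin1, utf8)
    end.
```

With this, the utf-8 bytes 195,169 ("é") pass through unchanged, while the lone Latin-1 byte 233 is converted to 195,169. A Latin-1 message whose bytes happen to form valid utf-8 would be misread, which is one reason the guess can only be a heuristic.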

I can live with the utf-8-only support for D. The utf-8 is anyway recommended for IRC. I still think we can't change the users to use utf-8-only OR the xmpp clients with IRC extensions to support other than utf-8 because there is no need to do it.

Do you think the character set conversion (or guessing) should/can be done in D? If not, do you know whether we will get your mod_irc utf-8 patch to some official ejabberd release?

Let me add some more information to make this case clear:

Without XMPP and user has utf-8 terminal settings (IRC client and server only):
Client sends: iso8859, utf-8 or other character set. Client is recommended to convert from utf-8 to iso8859 if explicitly set by the client to do so. User should be able to force the settings "this channel/user has character set X".
Client receives: iso8859, utf-8 or other character set. Client converts other than utf-8 to utf-8 and doesn't touch the utf-8 content. Client user can specify that "this channel/user has character set X". If client user did not specify, some guessing is used.

With XMPP and user has utf-8 terminal settings:
Client sends: utf-8 only. This is OK, as it forces others to update to utf-8 if they see wrong characters.
Client receives: utf-8 only. This is not OK. XMPP should allow "write utf-8, read any" behavior. The XMPP user should not be forced to switch to an IRC client. The IRC client user should be forced to utf-8. Because of this limitation, all of this user/channel-specific character set conversion must be done by the XMPP IRC transport.

Which patches?

Sorry, more urgent and relevant tasks (for me, I mean) have arrived and I can't invest more time investigating this problem right now. Of course, if you get your hands on erlang and find later a better solution, please open a ticket and propose your patch.

olliex wrote:

Do you think the character set conversion (or guessing) should/can be done in D?

I don't know whether conversion should or can be done there. And I suspect guessing isn't possible.

olliex wrote:

If not, do you know whether we will get your mod_irc utf-8 patch to some official ejabberd release?

Yes, let's be pragmatic and commit into ejabberd mainline whatever you consider better for the general case (and whatever breaks the current mod_irc behaviour the least). In this sense, this thread contains four patches. Exactly which of them do you consider suitable for inclusion in ejabberd?

Q) http://www.ejabberd.im/node/4270#comment-56609
W) http://tkabber.jabber.ru/files/badlop/4270-215-ircencoding.patch
E) first of http://www.ejabberd.im/node/4270#comment-56725
R) second of http://www.ejabberd.im/node/4270#comment-56725

The best patch is the patch in the file (W). I applied only this one to 2.1.5; with it, unknown characters (outside the 7-bit ASCII range) were dropped when they were not utf-8. The utf-8 content showed correctly. I never tested character sets other than utf-8, however.

Having this patch and the default changed to utf-8 would be good.

Patch applied.

olliex wrote:

The best patch is the patch in the file (W). I applied only this one to 2.1.5; with it, unknown characters (outside the 7-bit ASCII range) were dropped when they were not utf-8. The utf-8 content showed correctly. I never tested character sets other than utf-8, however.

Having this patch and the default changed to utf-8 would be good.

Ok, I've applied patch W to ejabberd 2.1.x.

Changing the default value of an option now, in ejabberd 2.1.6, could be confusing to server admins. Interested admins can use the option to set utf-8 as the default, now that the option really works.
