Archive for the ‘Erlang’ Category

String encodings in Erlang

Monday, October 31st, 2011

Erlang is famous for the way it deals with strings: they are “just a list of integers”. Sounds easy, doesn’t it? I’ve been having some issues with Erlang and UTF-8 strings lately and I thought I would write down some of my findings here.

I’m playing with the Erlang shell here so first I need to figure out which encoding my shell is using. This is how I can figure that out.

1> lists:keyfind(encoding,1,io:getopts()).
{encoding,unicode}

According to the Erlang manual, the shell should be able to read and write UTF-8 strings if your environment has been configured properly. Now we can see that our Erlang shell supports Unicode, so we’re happy. Let’s start playing around with some strings.

Let’s start with the string “äiti”, which is the Finnish word for mom (I find myself missing my mom often when dealing with strings in Erlang). By default, Erlang strings use Latin-1 encoding. The list of integers representing this string looks like this.

1> String = [228, 105, 116, 105].
"äiti"

As you can see, Erlang is “clever” enough to show the string represented by this list of integers. These integers are the code points of the characters, and since the default encoding in Erlang is Latin-1, the code point for the character “ä” is 228. For Latin-1, the code points are integers from 0 to 255, so you can represent 256 different characters with Latin-1. If you’re using ASCII encoding, your code points are between 0 and 127 (ASCII uses seven bits). Since ASCII is a subset of Latin-1, the characters “i” and “t” have the same code points in ASCII and Latin-1.

Ok, pretty easy so far. But what happens when 256 characters just are not enough? Say hello to Unicode. With Unicode, one can represent far more characters (its code space runs from 0 to 1,114,111). The code points are not limited to just one byte anymore. Let’s stick with the word “äiti” for a while. We know that ASCII is a subset of Latin-1, and Latin-1 is in turn a subset of Unicode. Makes sense, huh? So in terms of code points, the Latin-1 string “äiti” looks exactly the same as the Unicode string “äiti”.

1> Latin1String = [228, 105, 116, 105].
"äiti"
2> UnicodeString = [228, 105, 116, 105].
"äiti"

Now this is just convenient: the character “ä” has the same code point in Latin-1 and Unicode, and it fits in one byte. Ok, this looks pretty easy. Nothing can go wrong here since the Latin-1 and Unicode strings look exactly the same. Well, not quite. Since Unicode can represent way more characters than Latin-1, we need to agree on how Unicode strings are represented on the byte level. You cannot represent the Unicode character “snow man” (☃) with one byte since its Unicode code point is 9731.

Here we need the UTF-8 encoding. It is very commonly used and it is the encoding you need to use nowadays. For example, the popular data serialization format JSON assumes that a JSON string is encoded in UTF-8. Ok, let’s look at what the string “äiti” looks like in UTF-8. The integers here no longer represent Unicode code points, but the bytes of the UTF-8 string. As you can see, we are using Erlang binaries here to represent the string.

1> Utf8String = <<195, 164, 105, 116, 105>>.
<<"Ã¤iti">>

Now Erlang shows the string a bit messed up, since it tries to print the binary using one byte per character, while UTF-8 uses two bytes to represent the character “ä”. How can we deal with this in Erlang? We need to use the unicode module.

2> unicode:characters_to_list(Utf8String, utf8).
"äiti"
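To see where the bytes 195 and 164 come from, the bit syntax can encode a single code point as UTF-8. This is only a small sketch; the module and function names (utf8_bytes, encode/1) are made up for this illustration.

```erlang
-module(utf8_bytes).
-export([encode/1]).

%% Encode one Unicode code point as a UTF-8 binary.
%% The utf8 type specifier in the bit syntax does the encoding for us.
encode(CodePoint) when is_integer(CodePoint) ->
    <<CodePoint/utf8>>.
```

With this, utf8_bytes:encode(105) gives the single byte <<105>> for “i”, utf8_bytes:encode(228) gives the two bytes <<195,164>> for “ä”, and utf8_bytes:encode(9731) gives the three bytes <<226,152,131>> for the snow man.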

A very common way to mess things up here is to use erlang:binary_to_list/1. The same thing happens as with the shell printout: binary_to_list/1 converts the binary to a list byte by byte, and now your string has five characters instead of four. This can easily lead to “exploding” strings if you write this string to a database, read it back and decode it again with binary_to_list/1.
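The difference, and the “explosion”, can be sketched roughly like this. The module name utf8_pitfall is made up, and the re-encoding step stands in for whatever a database layer might do when it treats every list element as a character.

```erlang
-module(utf8_pitfall).
-export([demo/0]).

demo() ->
    Utf8 = <<195, 164, 105, 116, 105>>,              %% "äiti" as UTF-8 bytes
    Bytes = binary_to_list(Utf8),                    %% byte by byte: 5 elements
    Chars = unicode:characters_to_list(Utf8, utf8),  %% decoded: 4 code points
    5 = length(Bytes),
    4 = length(Chars),
    [228, 105, 116, 105] = Chars,
    %% Treating the raw bytes as characters and encoding them again
    %% turns the two bytes of "ä" into four bytes: the string has grown.
    Exploded = unicode:characters_to_binary(Bytes, latin1, utf8),
    <<195, 131, 194, 164, 105, 116, 105>> = Exploded,
    ok.
```

Every pattern match in demo/0 is an assertion, so the function returns ok only if the lengths and bytes are as described above.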

As I already mentioned, UTF-8 is the “de facto” encoding nowadays. So here are a few pointers on how to deal with UTF-8 strings in Erlang.

Creating strings

If you create strings the usual way (String = "äiti".) remember that it is Latin-1 encoded.

If you are reading strings from somewhere as a byte stream, you should use unicode:characters_to_list/2 and give the function the proper encoding e.g.,

1> unicode:characters_to_list(<<226,152,131>>, utf8).
[9731]

Here we have the UTF-8 encoded binary representation of the “snow man” (☃) character. The output is a list of Unicode code point integers. You can use this Unicode string with, e.g., functions from the string module.
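When the bytes come from outside, they may not be valid UTF-8 at all. unicode:characters_to_list/2 then returns an error or incomplete tuple instead of a list, so it can be worth wrapping the call. A minimal sketch follows; the module name and the error atoms (invalid_utf8, truncated_utf8) are my own choices, not anything from OTP.

```erlang
-module(utf8_decode).
-export([decode/1]).

%% Decode a UTF-8 binary into a list of Unicode code points,
%% turning the tuples returned on bad input into tagged errors.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        {error, _Decoded, _Rest} ->
            {error, invalid_utf8};      %% a byte sequence that is not UTF-8
        {incomplete, _Decoded, _Rest} ->
            {error, truncated_utf8};    %% input ends in the middle of a character
        Chars when is_list(Chars) ->
            {ok, Chars}
    end.
```

For example, utf8_decode:decode(<<226,152,131>>) gives {ok, [9731]}, while dropping the last byte of the snow man gives {error, truncated_utf8}.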

Convert Unicode strings to binaries

Some functions, like crypto:sha_mac/2, require the input to be an iolist(), which is a list that does not care about encodings but must consist of plain bytes (or binaries). Such a function cannot take a Unicode string as a parameter, since the string might contain integers that need more than one byte. So if you’re handling strings as Unicode, you will need to convert them to UTF-8 binaries before passing them to functions like crypto:sha_mac/2. Like this:

1> UnicodeString = [9731].
[9731]
2> Utf8Bin = unicode:characters_to_binary(UnicodeString, unicode, utf8).
<<226,152,131>>
3> crypto:sha_mac(<<"key">>, Utf8Bin).
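To my knowledge crypto:sha_mac/2 is no longer present in recent Erlang/OTP releases. The same idea can be sketched with crypto:mac/4, assuming OTP 22.1 or newer. The module and function names here (hmac_utf8, hmac_sha1/2) are made up for the example.

```erlang
-module(hmac_utf8).
-export([hmac_sha1/2]).

%% Compute an HMAC-SHA1 over a Unicode string (a list of code points).
%% Assumes Erlang/OTP 22.1 or newer, where crypto:mac/4 is available.
hmac_sha1(Key, UnicodeString) when is_binary(Key), is_list(UnicodeString) ->
    %% The MAC works on bytes, so encode the code points as UTF-8 first.
    Utf8Bin = unicode:characters_to_binary(UnicodeString, unicode, utf8),
    %% Crash early if the conversion returned an error or incomplete tuple.
    true = is_binary(Utf8Bin),
    crypto:mac(hmac, sha, Key, Utf8Bin).
```

Calling hmac_utf8:hmac_sha1(<<"key">>, [9731]) returns a 20-byte binary, the same value you get by passing the UTF-8 bytes <<226,152,131>> to crypto:mac/4 directly.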

In conclusion, everything should go well if you stick with Unicode strings and remember to encode them to UTF-8 on the way out and decode them from UTF-8 on the way in.

Playing Around With Openstack’s Object Storage

Friday, October 22nd, 2010

A couple of weeks ago I found out about the Openstack project and immediately found it very interesting. What I’ve been playing around with the most is the object storage part of Openstack called Swift. I’ll show here how you can use Swift with a couple of different libraries. The nice thing about Swift is that it is basically the Rackspace Cloudfiles storage, so the same libraries that work with Cloudfiles should work with Swift as well. Well, they require some small modifications. But I’ll show you here two libraries that I know are working already. Of course, you will need a Swift instance running somewhere; for instructions on how to set one up, read the “Swift All In One” document, which shows how you can run Swift on a single server.

The first library I’ll show here is python-cloudfiles. I recommend using the latest version from Github, since the one you can get from, for example, the Ubuntu repositories does not support Swift, and the one from the Python Package Index had a bug that made it not work with Swift.

Here I’ll show you how you can connect to your local Swift instance using the authurl parameter and how you can create containers and objects using python-cloudfiles.

from cloudfiles.connection import Connection
 
conn = Connection("test:test", "test", authurl="http://127.0.0.1:11000/v1.0")
 
container = conn.create_container("test")
 
obj = container.create_object("test.txt")
obj.content_type = "text/plain"
obj.write("test")

Pretty straightforward… right?

Next, I’ll show you another library that works with Swift, called cferl. It’s an Erlang library for Cloudfiles, and I made some simple patches to it to make it work with Swift.

Here’s how you can do the same things as in previous example using cferl.

ibrowse:start().
{ok, Connection} = cferl:connect("test:test", "test", "http://127.0.0.1:11000/v1.0").
 
{ok, Container} = Connection:create_container(<<"test">>).
 
{ok, Object} = Container:create_object(<<"test.txt">>).
ok = Object:write_data(<<"test">>, <<"text/plain">>).

Ok, that’s it. Now you can start playing with Swift and storing petabytes of data in it.

Generating random strings in Erlang

Saturday, November 7th, 2009

I could not find any decent examples on the web of how to generate a random string of a given length from a given set of characters in Erlang. The basic idea is to take a string of allowed characters and loop N times, where N is the length of the resulting string. On each iteration we pick a random character from the string of allowed characters. Sounds relatively simple, right? Next we have to write this in Erlang. This is what I came up with…

get_random_string(Length, AllowedChars) ->
    lists:foldl(fun(_, Acc) ->
                        [lists:nth(random:uniform(length(AllowedChars)),
                                   AllowedChars)]
                            ++ Acc
                end, [], lists:seq(1, Length)).

Ok, Erlang is not the most readable language in the world and a simple thing such as generating a random string can look pretty tedious. No worries. I’ll go through the method line by line.

I’m using the lists:foldl method here. It goes through a list (from left to right) and calls a function whose parameters are a value from that list and the result from the previous iteration. The result of the method is the result of the last call to the function. The list I give as a parameter to lists:foldl is a sequence of numbers from one to the length of the resulting random string. For that I use the lists:seq method. This is how we define how many times we loop.
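If lists:foldl is new to you, two tiny folds may help before looking at the real one. This is just an illustration; the module and function names (foldl_demo, sum/1, reverse/1) are made up.

```erlang
-module(foldl_demo).
-export([sum/1, reverse/1]).

%% Add up a list of numbers; the accumulator starts at 0.
sum(Numbers) ->
    lists:foldl(fun(N, Acc) -> N + Acc end, 0, Numbers).

%% Build a list by putting each element in front of the accumulator,
%% which starts as the empty list. Going left to right, this reverses the input.
reverse(List) ->
    lists:foldl(fun(Elem, Acc) -> [Elem | Acc] end, [], List).
```

foldl_demo:sum([1, 2, 3]) is 6 and foldl_demo:reverse([1, 2, 3]) is [3, 2, 1]. The random string function has the same shape as reverse/1, except that it ignores the list element and puts a random character in front of the accumulator instead.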

I’ll explain the fun() that is the first parameter of lists:foldl. Here is what it looks like separate from the whole code.

fun(_, Acc) ->
     [lists:nth(random:uniform(length(AllowedChars)), AllowedChars)]
          ++ Acc
end

The first parameter of the function is the value from the given list ([1, 2, 3, 4,..., N]) and we don’t use it (hence the underscore). The second parameter, Acc, is called the accumulator, and it holds the result from the previous iteration. To achieve our goal of producing random strings we use the lists:nth and random:uniform method calls to pick a random character from the AllowedChars string. Note that lists:nth returns the integer value of that character, which is why the call is wrapped in square brackets, making the result a string (in Erlang strings are lists of integers). We then append Acc (the result of the previous iteration) to that one-character string, and this way build our random string.

There is also a third parameter for the lists:foldl method that you have probably guessed already. Naturally, you also have to give the value of the accumulator for the first iteration, which in this case is the empty list [], or the empty string, since strings in Erlang are actually lists.

Here is an example of the result that the method produces.

test:get_random_string(32, "qwertyQWERTY1234567890").
"8qttW01wQET1qRTt1r4tr2T392QY94Re"
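On newer Erlang/OTP releases (18 and later) the rand module is the usual choice instead of random. Here is a sketch of the same idea with a list comprehension, assuming rand is available. The module name random_string is made up, and this is a variation on the function above, not a drop-in copy of it.

```erlang
-module(random_string).
-export([get_random_string/2]).

%% Build a string of Length characters picked at random from AllowedChars.
%% AllowedChars must be a non-empty string.
get_random_string(Length, AllowedChars)
  when is_integer(Length), Length >= 0, AllowedChars =/= [] ->
    %% A tuple gives constant-time element/2 lookups, whereas lists:nth/2
    %% has to walk the list on every pick.
    Chars = list_to_tuple(AllowedChars),
    N = tuple_size(Chars),
    %% rand:uniform(N) returns an integer between 1 and N.
    [element(rand:uniform(N), Chars) || _ <- lists:seq(1, Length)].
```

It is called the same way, e.g. random_string:get_random_string(32, "qwertyQWERTY1234567890"), and a length of 0 gives the empty string.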

Automatic code reloading in Erlang

Tuesday, November 3rd, 2009

I’ve recently got back to coding Erlang and noticed a neat module that I didn’t know existed that is probably worth writing a blog entry about. I’ve started developing a PubSubHubbub hub in Erlang called Hubbabubba and I’m using the great Mochiweb HTTP library as the HTTP server implementation. I discovered the reloader.erl module that comes with Mochiweb. It automatically reloads the code when you have the application running and you modify the code (remember to compile as well). This is something that I’ve found very useful when developing with Django or AppEngine and I’m really satisfied that there is a similar solution for Erlang as well.

