Pages

Wednesday, February 12, 2014

Building Distributed Cache with GlusterFS - not a good idea

Cache is a popular concept. You can't survive in the cruel word of programming without using cache. Cache is inside you cpu. There is one inside you browser. Your favorite web site is serving pages from cache and you probably have one in your brain as well :)



Anyway, my mission was to build a distributed cache which can be used to speedup application X. This cache needed to be scalable and to allow extending it by adding more compute nodes. The compute nodes should be cheap amazon instances and the cache should provide very low latency (time until first byte of data arrives should be around 5ms).

Distributed file systems are a very cool peace of technology, so naturally it seems as a good idea to use it in building this cache. After reading about Ceph vs Gluster wars I've decided to settle down on Gluster. I liked very much the idea of translators which add functionality layer by layer. Each translator is a simple unit which handles one task of complicated file system.

The DHT translator in Gluster allows to distribute files between the different nodes. Later when the client wants to read file, it can locate the correct node using O(1) hash function computation and to read it directly from the node where it is located.

Using amazon, I quickly created 5 m1.large instances and installed glusterfs. The installation is pretty easy. I used my favorite tool for repeating tasks on multiple compute nodes (fabric) and after a while installation of x nodes became as simple as running

>> fab -H hostname1,hostname2,hostnamexx -P -f gluster.py install_gluster create_volume create_mount

Next I created distributed glusterfs volume consisting of 5 nodes

>> gluster volume create cache hostname1:/brick1/cache hostname2:/brick1/cache hostname3:/brick1/cache hostname4:/brick1/cache hostname5:/brick1/cache

I modified application X code to try and use the remote cache and my algorithm became something similar to:

fd = open("/local_cache/file1/block1.bin", O_RDONLY);
if (fd == -1) {
    fd = open("/gluster/file1/block1.bin", O_RDONLY);   
...
pread(fd, ....)

And it worked well when there was small amount of application X instances (around 10). However when the amount of application X instances was increased (8 servers, 15 application X instances on each, total of 120 applications which continuously access glusterfs through fuse interface), I've noticed that access to mounted glusterfs file system became very slow.

Looking at latency of some of the glusterfs operation on client side (thanks good gluster has translator, called debug/io-stats, which can aggregate statistics), I've noticed that some of the lookup operation would take 10-20 seconds ! Basically lookup should be very lite operation, using the DHT translator the client should determine the exact server where the file should be located and then it can query it directly which will lead to distribution of stress between the 5 nodes. Each time I open the file a lookup command will be sent by gluster to determine where the file is located and later I can read it or write to it without a problem.


Fop     Call Count    Avg-Latency  Min-Latency    Max-Latency
---     ----------    -----------  -----------    -----------
STAT            58     2133.43 us      1.00 us    62789.00 us
MKDIR          112  1009804.77 us     18.00 us 14452294.00 us
OPEN           149    22108.78 us      2.00 us  1778858.00 us
READ           151   160008.16 us      7.00 us  7246801.00 us
FLUSH          149    13178.94 us      1.00 us  1594615.00 us
SETXATTR      2633   232459.26 us    719.00 us 10749344.00 us
LOOKUP        4725   641604.43 us      2.00 us 17630043.00 us
FORGET           1           0 us         0 us           0 us
RELEASE        149           0 us         0 us           0 us
------ ----- ----- ----- ----- ----- --- -----  ----- ----- ---

So why are lookups take so much time ? I realized I need to look at the network level. Luckily glusterfs has plugin for wireshark which can dissect traffic and show glusterfs operations.

Looking at the captured packets I saw that when we access file /file1/block1.bin, gluster will invoke 3 lookup operations. First for directory /, second for sub-directory file1 and then to file block1.bin.

Now here are the bad news: when performing lookup for directory, gluster will broadcast lookup request to all nodes and then wait until all nodes will respond. In my case one of the nodes (not the one where the actual file sits) would be super busy and would respond only after 5 seconds. And so the lookup request will wait for 5 seconds without trying to continue and actually communicate with the node where the file is located. This bad situation repeats for each path directory component and leads to awful performance with lookups.



Tweaking glusterfs options such as setting dht.lookup-unhashed=off, didn't help since this only relevant the file component and we still need to traverse at least two directory components until we reach the file component in the path.

So in my opinion such behavior makes gluster very unsuitable for functioning as cache where checking if data is in cache (lookup) is a frequent method. I guess hacking gluster and modifying the DHT translator can solve the issue (and introduce many other interesting issues :)

But this would be another interesting post ...