One of the difficulties faced by fault-tolerant web applications is distributed file storage. Indeed, this is one of the many challenges we face here at 2buntu. We allow our editors to upload images to accompany their articles and these images must be available on each server used for serving the website. Keeping files synchronized across multiple servers can be tricky.
One of the solutions to this problem is GlusterFS, which comes to us from the good folks at Red Hat. GlusterFS provides both server and client applications that allow us to tackle the problem described above. We will use replication to ensure that files added to a specific directory are immediately propagated to the other servers running GlusterFS.
A lot of what follows will be based on the GlusterFS Quick Start Guide. However, that page contains very little explanation for each of the commands, so this article will fill in some of the gaps. The example below assumes that we are setting up three nodes, although GlusterFS scales to much larger numbers.
Installing GlusterFS is as simple as installing a single package on each node:
sudo apt-get install glusterfs-server
This package provides the gluster command, which we will be using to create our storage volume. Some earlier Ubuntu releases ship older GlusterFS versions. To rectify this, the GlusterFS team maintains a PPA of the latest release (3.6 at the time of writing):
ppa:gluster/glusterfs-3.6
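If you need the PPA, adding it and installing the package looks something like this (standard Ubuntu commands; run on each node):

sudo add-apt-repository ppa:gluster/glusterfs-3.6
sudo apt-get update
sudo apt-get install glusterfs-server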
GlusterFS recommends a separate partition for data storage, although this is not a requirement. In the example below, we will assume that /dev/sdb1 is a blank partition on each node formatted with a suitable filesystem (EXT4 and XFS are good choices). Because we are using replication, each of the partitions must contain enough space for all of the files.
The example further assumes that /dev/sdb1 is mounted at /data. Adding the mount to /etc/fstab ensures that it is accessible across reboots. If the partition was formatted with XFS, the entry in /etc/fstab would look like this:
/dev/sdb1 /data xfs defaults 0 0
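For reference, preparing and mounting the partition on each node might look something like this (assuming /dev/sdb1 and XFS as above; the -i size=512 inode size is the one suggested in the GlusterFS Quick Start Guide):

sudo mkfs.xfs -i size=512 /dev/sdb1
sudo mkdir -p /data
sudo mount -a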
Each node must be accessible to the other nodes either through a public or private network interface. DNS provides an easy way to refer to each node by name instead of IP address. In order for this to work, each node needs to be able to resolve the name of every other node.
One of the easiest ways to set this up is by editing /etc/hosts on each node. Assuming that the IP addresses of the three nodes are 1.1.1.1, 1.1.1.2, and 1.1.1.3, add the following lines to /etc/hosts on each node:
1.1.1.1 gluster1
1.1.1.2 gluster2
1.1.1.3 gluster3
The downside of this approach is that the IP address of each new node must be manually added to the other nodes.
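Before proceeding, it is worth verifying that each node can actually resolve the others. One quick way to check is:

getent hosts gluster1 gluster2 gluster3
ping -c 1 gluster2

If getent prints an address for each name, resolution is working.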
Now that we have GlusterFS installed on each node, it is time for them to discover each other. This is done through "probing". Open a terminal on gluster2 and run the following commands:
gluster peer probe gluster1
gluster peer probe gluster3
Each command must be run as the root user. Each node will list its peers if you run the gluster peer status command:
root@gluster2:~# gluster peer status
Number of Peers: 2
Hostname: gluster1
Port: 24007
Uuid: xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
State: Peer in Cluster (Connected)
Hostname: gluster3
Port: 24007
Uuid: xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
State: Peer in Cluster (Connected)
To create a volume, run the command below:
gluster volume create testvol replica 3 transport tcp gluster1:/data gluster2:/data gluster3:/data

This command creates a new replicated storage volume. replica 3 indicates that we want all data replicated across three nodes (all of them in our case, since there are only three). We also specify the directory on each node where the data will be stored on disk.
Once the volume is created, we can start it with:
gluster volume start testvol
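To confirm that the volume was created correctly and is now started, you can inspect it (run as root on any node):

gluster volume info testvol
gluster volume status testvol

The first command should report the volume type, replica count, and bricks; the second shows which bricks are online.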
In order to access the new volume like a regular filesystem, we must mount it. We will mount the volume at /mnt on the first node with the following command:
mount -t glusterfs gluster1:/testvol /mnt
That's it! Now do the same thing on the second and third nodes:
mount -t glusterfs gluster2:/testvol /mnt
mount -t glusterfs gluster3:/testvol /mnt
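As with the data partition, these mounts can be made persistent across reboots with an /etc/fstab entry on each node; the _netdev option tells the system to wait for networking before attempting the mount. On the first node, the entry might look like this:

gluster1:/testvol /mnt glusterfs defaults,_netdev 0 0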
Try downloading a file to the first node:
root@gluster1:/mnt# wget http://2buntu.com/static/img/apple-touch-icon.png
Now open a terminal on the second or third node and view the directory contents:
root@gluster2:/mnt# ls
apple-touch-icon.png
There it is! The file was automatically copied to the second and third nodes. Try doing the same thing on the second node.
We're not quite done yet. Try powering off the second node or disconnecting it from the network. It will no longer remain in synchronization with the other nodes, though they will continue to function on their own. Files copied to the first node will still appear on the third.
Once the second node is brought back online, it will immediately "catch up" to the other nodes, receiving files that were copied in its absence.
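You can watch this recovery happen: the following command (run as root on any node) lists files that still need to be synchronized to a brick:

gluster volume heal testvol info

Once the output shows no pending entries for any brick, the cluster is fully back in sync.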
GlusterFS is an extremely powerful option for distributed file storage. Although we don't use it here at 2buntu yet, we are planning to add it to our infrastructure in the coming months.
Stay tuned for more news in the future as we begin expanding our services.