by on August 21, 2014

User Story: Chitika Boosts Big Data with GlusterFS

Chitika Inc., an online advertising network based in Westborough, MA, sought to provide its data scientists with faster and simpler access to its massive store of ad impression data. The company managed to boost availability and broaden access to its data by swapping out HDFS for GlusterFS as the filesystem backend for its Hadoop deployment. […]

Read More

by on August 15, 2014

Running CDH5 on GlusterFS

I recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS using distributed-replicated volumes (replica 2). This is possible because Apache Hadoop has a pluggable filesystem architecture, which lets the computational components in the CDH 5 distribution be configured to use filesystems other than HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on Gluster. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.

Running a Single CDH 5 Deployment on One or More GlusterFS Volumes
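To make the pluggable-filesystem idea a bit more concrete, here is a minimal Python sketch that renders the kind of `core-site.xml` fragment Hadoop reads to pick a filesystem implementation. `fs.defaultFS` and the `fs.<scheme>.impl` naming pattern are standard Hadoop conventions; the specific GlusterFS class names and the `glusterfs:///` URI are assumptions recalled from the glusterfs-hadoop project and should be checked against its wiki before use. The helper `render_core_site` is hypothetical and exists only for illustration.

```python
from xml.etree import ElementTree as ET

# ASSUMPTION: these property values are recalled from the glusterfs-hadoop
# project, not taken from this post. Verify them against the project wiki.
GLUSTER_PROPERTIES = {
    "fs.defaultFS": "glusterfs:///",
    "fs.glusterfs.impl": "org.apache.hadoop.fs.glusterfs.GlusterFileSystem",
    "fs.AbstractFileSystem.glusterfs.impl": "org.apache.hadoop.fs.local.GlusterFs",
}


def render_core_site(properties):
    """Return a Hadoop-style <configuration> document as a string.

    Hypothetical helper: it only builds XML and does not touch a cluster.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_core_site(GLUSTER_PROPERTIES))
```

The point of the sketch is only that swapping the filesystem is a configuration change: the compute components keep calling the same Hadoop FileSystem interface, and the scheme in `fs.defaultFS` decides which implementation answers.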

Given that the CDH 5 distribution comprises other components besides YARN and MapReduce, I used the Apache Bigtop System Testing Framework to explicitly validate that Apache Sqoop, Apache Flume, Apache Pig, Apache Hive, Apache Oozie, Apache Mahout, Apache ZooKeeper, Apache Solr and Apache HBase also ran successfully. Work is still in progress to enable the use of Impala.

If you would like to participate in accelerating the work on Impala, please reach out to us on the Gluster mailing list.

Implementation details for this solution and the specific setup required for all the components are available on the glusterfs-hadoop project wiki. If you have additional questions, feel free to reach out to me on FreeNode (IRC handle jayunit100), @jayunit100 on Twitter, or via the Gluster mailing list.

Read More

by on August 14, 2014

Vagrant: More than just VMs

PART 1: Vagrant in the Container

If you use Vagrant to maintain your dev recipes, then your natural predilection might be to now move to supporting Docker. Using Vagrant to wrap Docker means you can run Docker apps from anywhere, and maintain the…

Read More

by on August 13, 2014

Docker notes

You need a repository before you can push to Docker Hub. You can’t just push to an empty repository. Here is the difference between a Vagrantfile and the corresponding Dockerfile. The Dockerfile defines the image, the Vagrantfile simply define…

Read More

by on August 8, 2014

Vagrant on da cloud

Vagrant has made our lives wonderful for development. But you don’t need VirtualBox, KVM, VMware, etc. to use it. There are other ways to deploy Vagrant boxes:

- Docker
- OpenStack
- EC2
- libvirt
- and so on

So what changes when you move off a local hypervis…

Read More

by on August 7, 2014

Glusterfs and Tachyon

Tachyon, an in-memory distributed filesystem, is among the most dynamic projects in the big data analytics stack. It provides a java.io-like API, supports Apache Spark, and can substantially improve Spark’s performance on large data sets. As illustrated in this paradigm, Tachyon retrieves data from underlying filesystems (HDFS, S3, GlusterFS, and POSIX-compliant filesystems), caches data in […]
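The retrieve-then-cache pattern described in this excerpt can be sketched in a few lines of Python. This is a generic read-through cache, not Tachyon's API: the class `ReadThroughCache`, its `fetch` callback, and the in-memory `dict` standing in for an underlying filesystem are all hypothetical names chosen for illustration.

```python
class ReadThroughCache:
    """Serve reads from memory, falling back to an underlying store on a miss."""

    def __init__(self, fetch):
        # fetch: callable taking a path and returning its bytes from the
        # underlying filesystem (hypothetical stand-in for HDFS, S3, etc.).
        self._fetch = fetch
        self._cache = {}
        self.misses = 0

    def read(self, path):
        if path not in self._cache:
            # Cache miss: go to the underlying store once, then keep the data.
            self.misses += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


if __name__ == "__main__":
    # A plain dict plays the role of the underlying filesystem here.
    backing_store = {"/data/impressions.csv": b"ad_id,count\n1,42\n"}
    cache = ReadThroughCache(backing_store.__getitem__)
    cache.read("/data/impressions.csv")  # fetched from the backing store
    cache.read("/data/impressions.csv")  # served from memory
    print("underlying reads:", cache.misses)
```

Repeated reads of the same path touch the underlying store only once, which is the property that lets an in-memory layer speed up jobs that rescan the same data.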

Read More

by on August 4, 2014

GlusterFS: Disaster Recovery

Scenario: You are operating a busy GlusterFS cluster and for whatever reason the volume data gets corrupted. Luckily, you have been backing up the underlying bricks so you are able to restore the bricks to a usable state, but now

Read More