Posted on September 5, 2013

Enabling Apache Hadoop on GlusterFS: glusterfs-hadoop 2.1 released

The Gluster community is pleased to announce a major update to the glusterfs-hadoop project with the release of version 2.1. The glusterfs-hadoop project provides an Apache-licensed Hadoop FileSystem plugin which enables Apache Hadoop 1.x and 2.x to run directly on top of GlusterFS. This release includes a re-architected plugin that now extends Hadoop's existing functionality for running on local and POSIX file systems.

Overview

Apache Hadoop has a pluggable FileSystem architecture. This means that if you have a filesystem or object store that you would like to use with Hadoop, you can create a Hadoop FileSystem plugin for it, which acts as a mediator between the generic Hadoop FileSystem interface and your filesystem of choice. A popular example is Amazon S3: over a million Hadoop clusters are spun up on Amazon every year, and many of them use S3 as the Hadoop FileSystem.
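To make the mediator idea concrete, here is a minimal sketch of a pluggable filesystem layer in Python. It is not Hadoop's actual API: the `FileSystem` interface, the `InMemoryFileSystem` backend, the `register` and `get_filesystem` helpers, and the `mem` scheme are all hypothetical names chosen for illustration. The only point is that callers work against one generic interface, and the URI scheme selects which plugin handles the request.

```python
from abc import ABC, abstractmethod
from typing import Dict
from urllib.parse import urlparse


class FileSystem(ABC):
    """Generic interface the framework codes against (hypothetical, simplified)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        ...


class InMemoryFileSystem(FileSystem):
    """Stand-in backend; a real plugin would call into its storage system here."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


# Maps a URI scheme to the plugin that serves it (illustrative registry).
REGISTRY: Dict[str, FileSystem] = {}


def register(scheme: str, fs: FileSystem) -> None:
    """Make a plugin available under the given URI scheme."""
    REGISTRY[scheme] = fs


def get_filesystem(uri: str) -> FileSystem:
    """Pick the plugin for a URI based on its scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in REGISTRY:
        raise ValueError(f"no filesystem plugin registered for scheme {scheme!r}")
    return REGISTRY[scheme]
```

With this in place, `register("mem", InMemoryFileSystem())` followed by `get_filesystem("mem:///data/input.txt")` returns the registered backend, and the calling code never needs to know which storage system sits behind the interface.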

The plugin requires a specific deployment configuration. First, the Hadoop JobTracker and TaskTrackers (or their Hadoop 2.x equivalents) must be installed on servers within the gluster trusted storage pool for a given gluster volume. The JobTracker uses the plugin to query the extended attributes of job input files in gluster to determine file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to access the data required for their tasks through a local FUSE mount of the gluster volume. When the JobTracker receives a Hadoop job, it uses the locality information obtained via the plugin to send the job's tasks to TaskTrackers on servers that hold the required data within their local bricks. This ensures data is read from local disk and not over the network. The diagram below provides an overview of the entire solution for a Hadoop 1.x deployment.

Figure 1 – Solution Architecture
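The locality-based placement described above can be illustrated with a small sketch. This is not the JobTracker's real scheduler: `assign_tasks` is a hypothetical helper, and the replica host lists are made-up inputs that, in a real deployment, would come from the plugin's extended-attribute queries. The sketch simply prefers a tracker that holds a replica of the input file and breaks ties by current load.

```python
from typing import Dict, List


def assign_tasks(file_hosts: Dict[str, List[str]], trackers: List[str]) -> Dict[str, str]:
    """Map each input file to a tracker, preferring one that holds a replica.

    file_hosts: input path -> hosts whose local bricks hold a replica (assumed input).
    trackers:   hosts that run a TaskTracker.
    """
    if not trackers:
        raise ValueError("at least one tracker is required")

    load = {t: 0 for t in trackers}
    assignment: Dict[str, str] = {}

    for path in sorted(file_hosts):
        # Trackers that can read this file from a local brick.
        local = [h for h in file_hosts[path] if h in load]
        # Fall back to any tracker (a remote read) if no replica host runs one.
        candidates = local or trackers
        # Least-loaded candidate wins; the host name breaks ties deterministically.
        chosen = min(candidates, key=lambda t: (load[t], t))
        assignment[path] = chosen
        load[chosen] += 1

    return assignment
```

For example, with replicas of `/in/a` and `/in/b` on hosts `s1` and `s2`, replicas of `/in/c` on `s3` and `s4`, and trackers on `s1`, `s2`, and `s3`, each file is assigned to a different tracker that already holds its data, so no task needs to read over the network.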

glusterfs-hadoop

The community project, along with the documentation and available releases, is hosted within the Gluster Forge. The glusterfs-hadoop project will also be available within the Fedora 20 release later this year, alongside fellow Fedora newcomer Apache Hadoop and the already available gluster project. The glusterfs-hadoop project team welcomes contributions and participation from the broader community.

Stay tuned for upcoming posts around GlusterFS integration into the Apache Ambari and Fedora projects.