When it comes to big data processing, Hadoop has become the go-to platform. It allows vast amounts of data, especially unstructured or highly diverse data, to be processed quickly. As the de facto open source parallel file system for HPC environments, Lustre provides compute clusters with efficient storage and fast access to large data sets. Together, these technologies help solve big data problems. However, they also come with some disadvantages, including the need for HTTP calls, added overhead, reduced efficiency, slower speeds, and a requirement for fairly large local storage on each Hadoop node.

There is, however, a way to overcome those obstacles. As a Hadoop software adaptor, Intel Enterprise Edition for Lustre (IEEL) provides direct access to Lustre during MapReduce computations, improving performance.

A presentation by J. Mario Gallegos at the recent LUG 15 conference highlighted some of the advantages gained, and some of the best practices to follow, when adding IEEL.

Among the advantages observed:

  • Using Lustre is more efficient for accessing data - HDFS file transfers rely on the HTTP protocol, which results in higher overhead and slower access.
  • Centralized access from Lustre makes data available to all compute nodes - By eliminating transfers during the MapReduce "shuffle" phase, users gain better performance, such as higher job throughput.
  • Lustre allows convergence of HPC infrastructure with big data applications - An existing HPC cluster typically has limited storage on each compute node, so Hadoop jobs can rely on the shared Lustre storage rather than on large local disks.
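
To make the shuffle point above more concrete, here is a toy Python sketch of a word count in which map tasks write their partitioned output to a shared directory and reduce tasks read it back directly. It is only an illustration under stated assumptions: it is not Hadoop code and not the Intel adapter's API, a temporary directory stands in for a Lustre mount that every node can see, and the names run_map, run_reduce, and word_count are hypothetical.

```python
import os
import tempfile
from collections import defaultdict


def run_map(task_id, lines, shared_dir, num_reducers):
    """Map task: count words and write one partition file per reducer
    into the shared directory (a stand-in for a Lustre mount)."""
    partitions = defaultdict(lambda: defaultdict(int))
    for line in lines:
        for word in line.split():
            # Simple deterministic partitioner (assumption for this sketch).
            part = sum(word.encode()) % num_reducers
            partitions[part][word] += 1
    for part in range(num_reducers):
        path = os.path.join(shared_dir, f"map{task_id}_part{part}.txt")
        with open(path, "w") as f:
            for word, count in partitions[part].items():
                f.write(f"{word}\t{count}\n")


def run_reduce(part, num_maps, shared_dir):
    """Reduce task: read every map task's output for this partition
    straight from the shared directory; nothing is copied between nodes."""
    totals = defaultdict(int)
    for task_id in range(num_maps):
        path = os.path.join(shared_dir, f"map{task_id}_part{part}.txt")
        with open(path) as f:
            for row in f:
                word, count = row.rstrip("\n").split("\t")
                totals[word] += int(count)
    return dict(totals)


def word_count(splits, num_reducers=2):
    """Run all map tasks, then all reduce tasks, over one shared directory."""
    with tempfile.TemporaryDirectory() as shared_dir:
        for task_id, lines in enumerate(splits):
            run_map(task_id, lines, shared_dir, num_reducers)
        result = {}
        for part in range(num_reducers):
            result.update(run_reduce(part, len(splits), shared_dir))
        return result
```

Calling word_count([["a b a"], ["b c"]]) runs two map tasks and two reduce tasks over the same directory. In a real cluster the equivalent step with node-local storage would involve each reducer fetching map output from other nodes, which is the transfer the second bullet refers to.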

You can read about Mario's other findings and see his LUG presentation here.