Originally published on HPCatDell on March 13, 2014.
by Joey Jablonski and Armando Acosta
While the HPC market is more mature than the Big Data space, the excitement around Big Data is bringing many innovations and new products to market very quickly. In this last installment, we'll examine some of the technologies and paradigms that are being adopted back into the HPC space, not necessarily to fill gaps in the HPC stack, but to complement the mature technology.
Systems Management – As ever-larger environments are built from commodity hardware, operating and monitoring that hardware becomes critical. Many web-scale companies have built software stacks from the ground up to manage many distributed systems running as part of a larger coordinated effort. For example, Netflix has created technologies like Genie and Lipstick and released them to the community. Genie is a Hadoop Platform as a Service, whereas Lipstick is a visualization framework for Pig; both make it easier to adopt new Big Data tools.
Distributed File Systems – The HPC space commonly uses parallel file systems to ensure high-speed performance and sustained throughput. These meet the needs of many workloads, but can create bottlenecks because data must be moved for each job. Hadoop and associated technologies have introduced new distributed file system capabilities that both protect data in a distributed way and ensure high-performance access to data that is spread across many nodes in the compute environment.
MapReduce – MapReduce became common and well known after Google published their first paper on the topic in 2004. This paradigm allows a workload to be broken up and run across many nodes at the same time. While MapReduce does not work for all types of data sets and algorithms, it provides an effective way to index, search and analyze large amounts of semi-structured and unstructured data. Many HPC environments are beginning to offer both MPI for communication-intensive workloads and MapReduce for less chatty types of workloads.
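To make the paradigm concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and do not come from Hadoop or any other framework, and the "nodes" are simulated by a plain loop. In a real cluster, each input split would be mapped on a different node and the framework would handle the grouping step between phases.

```python
from collections import defaultdict


def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values that share a key into one result."""
    return key, sum(values)


def word_count(splits):
    """Run map, shuffle and reduce in sequence over a list of input splits."""
    mapped = []
    for split in splits:  # each iteration could run on a separate node
        mapped.extend(map_phase(split))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two small input splits, purely illustrative data.
    print(word_count(["big data meets HPC", "HPC meets big clusters"]))
```

Because each map call looks only at its own split and each reduce call looks only at one key, neither step needs to talk to its peers while it runs, which is what makes the model a fit for the "less chatty" workloads described above.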
Lower Latency Ethernet Connectivity – Many HPC sites have historically used InfiniBand for communication because of its low latency. The Big Data market has continued to push Ethernet switch vendors to create more efficient switches, further lowering latency on standard connection fabrics. This, combined with the growing volumes and falling prices of higher-speed Ethernet options, has enabled many HPC sites to consider Ethernet as an alternative to InfiniBand that is competitive in price and speed for many workloads.
Self-service Data Access – HPC has traditionally been dominated by tools that require software development experience to use. Big Data has brought a large number of tools to the market that give a broad audience, often without software development experience, self-service access to complex data sets. Tools like Kitenga and Toad BI, which enable a much broader audience to access Big Data platforms, are beginning to be used by traditional HPC sites to expand access to data.
This sharing of technology between HPC and Big Data will continue. The sheer excitement around Big Data is driving impressive innovation, drawing significant new talent into the distributed computing space, and attracting investment from many new companies in this growing market. This exchange of technology is something to welcome, as it enhances capabilities for both domains.