- by Adnan Khaleel

Unless you’ve been living under a rock of late, you’ve probably heard these terms being tossed around with nary an explanation: the Cloud, Big Data, and the more legitimate sibling of the lot – HPC, or High Performance Computing. And before you throw your hands up and admit that you never quite understood the distinctions, let me tell you that I’ve got some good news. Those distinctions are already beginning to blur, so that for you, the end user, things get a whole lot simpler and easier!

There is plenty of foundational information on the Cloud, Big Data and HPC on our website, so I’m not going to cover that material here. What I do want to discuss is how HPC and Big Data matured in very different segments with different requirements, and as a result don’t really play nice with each other. More importantly, I want to look at the technologies now being developed that enable these disparate disciplines to come together, and at the pressures from both sides – researchers and industry – that are driving the need for greater interoperability and hence a convergence of these disciplines.

So let’s start off with HPC. As you might have guessed, HPC grew up in academia, and although industries are amongst the primary beneficiaries of this technology, a lot of the tools and workflows are primarily developed by researchers looking to push computational boundaries. Traditionally, these workloads are simulations or massive number crunchers, if you will – think modelling the airflow over a wing or simulating the universe. So from a system architecture perspective, the compute side was logically separated from the storage. True, these computations generated a lot of data (and I mean a lot!), but researchers ended up developing their own data analytics tools since nothing out there met their needs – that is, until the advent of big data tools. In addition, researchers are now beginning to look at more data-driven science that incorporates empirically collected data as part of complex simulation workflows in order to improve accuracy, so now there is also an organic need for data-centric programming tools.

On the other hand, in enterprise and industry, although HPC has played a role – albeit a minor one, relatively speaking – it’s really reporting and transaction processing that have driven the need for data-centric system architectures that move large quantities of data. In the past, the majority of data was structured, but recently businesses have placed a greater emphasis on seeking and deriving value from unstructured data, which ultimately gave rise to technologies like Hadoop, MapReduce and Spark. To avoid the penalties of moving lots of data around, architectures that instead “moved” the code closer to the data won out. However, as industry started looking for better ways to derive economic value from data, the programming models started veering towards techniques like machine learning, deep neural networks, graph analytics and discovery algorithms – techniques that were more the mainstay of researchers and academia. Couple this with the emergence of IoT (Internet of Things) stream processing, and now there was a legitimate requirement for HPC-style architectures and expertise.
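
To make that “move the code to the data” idea concrete, here is a toy sketch in plain Python of the MapReduce pattern: the map function runs locally against each partition where the data already lives, and only the small intermediate key/value pairs travel across the network for the reduce step. The partitions and words here are made up purely for illustration.

```python
from collections import defaultdict

# Stand-ins for data blocks that would live on different storage nodes.
partitions = [
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the cloud meets hpc", "big data meets the cloud"],
]

def map_phase(lines):
    # In a real system this function is shipped to the node holding the block,
    # so the raw data never has to cross the network.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Only these compact (word, 1) pairs are shuffled between nodes for the reduce.
intermediate = [kv for part in partitions for kv in map_phase(part)]
print(reduce_phase(intermediate))
```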

Not surprisingly, since these tools have grown up in different environments, they rarely play nice with one another, because they never had to. Well, until now.

Both sides, academia and industry, are veering towards each other: HPC vendors are beginning to use commercial big data tools and see opportunities open up as enterprise computational problems grow and intersect with their own areas of expertise, and enterprises increasingly need HPC-style technologies. And so there you have it: many of the important elements for this convergence are in place, and, as the adage goes, where there’s cheese, there’s mice.

But jokes aside, this convergence makes sense for practical reasons too. The industrialization of analytics has instilled the model of end-to-end workflows. By this I mean looking at the entire process, from data ingest to actionable insights, as the two ends of a pipeline separated by stages that feed results into subsequent stages. Each stage may be specialized in what it does – e.g. ETL in stage 1, computation in stage 2, graph analytics in stage 3, in-memory analytics in stage 4, and so on. Now, an efficient pipeline would dictate that you can execute all of these stages on the same platform using the tools best suited to each task. And since nobody is interested in reinventing the wheel, that means running the tools people already know and trust for a particular type of problem on that same platform. Staying on one platform also means minimizing data movement from one specialized system to another, which can be a huge overhead when you’re dealing with terabytes of data.
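
As a rough illustration of what such a pipeline can look like when every stage runs on the same platform, here is a minimal sketch using Spark’s DataFrame API. The file path, column names and aggregations are hypothetical, and a real workflow would of course have more (and more specialized) stages.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("end-to-end-pipeline").getOrCreate()

# Stage 1: ETL -- ingest raw CSV (hypothetical path/columns) and drop malformed rows.
raw = spark.read.csv("/data/raw/sensor_readings.csv", header=True, inferSchema=True)
clean = raw.dropna(subset=["sensor_id", "reading"])

# Stage 2: computation -- aggregate per sensor.
per_sensor = clean.groupBy("sensor_id").agg(F.avg("reading").alias("avg_reading"))

# Stage 3: in-memory analytics -- cache the result and query it interactively,
# without ever shipping the intermediate data to another system.
per_sensor.cache()
per_sensor.createOrReplaceTempView("per_sensor")
spark.sql("SELECT * FROM per_sensor ORDER BY avg_reading DESC LIMIT 10").show()
```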

Having this demand is great, but demand alone isn’t the enabler – the technology also has to exist to make this possible, and increasingly it does. From a hardware perspective, interconnect technologies like InfiniBand have evolved to the point where both sides see benefit in them, and technologies like Intel’s Omni-Path are only going to drive this further. Even storage technologies and high-performance parallel filesystems like Lustre, which were almost exclusively used by HPC, are now being extended to work with Big Data tools. Intel’s Enterprise Edition for Lustre is a great example, as it’s designed to interoperate with MapReduce.
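
To give a feel for what that interoperability looks like in practice, here is a hedged sketch of a Spark job reading and writing directly on a shared parallel-filesystem mount instead of HDFS. The mount point and file names are hypothetical, and the details of any particular Lustre/Hadoop adapter will differ.

```python
from pyspark import SparkContext

# Hypothetical Lustre mount point; with a shared parallel filesystem the same
# paths are visible to every compute node, so no separate HDFS ingest step is needed.
LUSTRE = "file:///mnt/lustre/project"

sc = SparkContext(appName="wordcount-on-lustre")

lines = sc.textFile(LUSTRE + "/logs/*.txt")            # read straight off the Lustre mount
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile(LUSTRE + "/wordcounts")           # write results back to Lustre
```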

Even on the software side, technologies that were developed for easier deployments in the Cloud, like OpenStack and Docker, are invaluable for building completely containerized approaches to complex workflows. Another huge enabler is the new class of flexible and extensible resource managers, like Apache Mesos, that allow you to run MPI, MapReduce and Spark codes on the same platform.
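
As a small example of the flexibility a shared resource manager buys you, here is a sketch of a Spark application pointed at a Mesos master rather than a dedicated Spark cluster. The host name, port and memory setting are placeholders; the same Mesos cluster could be scheduling MPI or MapReduce frameworks alongside this job.

```python
from pyspark import SparkConf, SparkContext

# Placeholder Mesos master URL; in a real deployment this might instead be a
# zk:// URL pointing at the ZooKeeper ensemble that tracks the Mesos masters.
conf = (SparkConf()
        .setAppName("spark-on-shared-mesos-cluster")
        .setMaster("mesos://mesos-master.example.com:5050")
        .set("spark.executor.memory", "4g"))

sc = SparkContext(conf=conf)

# A trivial job, just to show that nothing else about the application changes
# when the underlying cluster is shared with other frameworks.
print(sc.parallelize(range(1000)).sum())
```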

With SC15 happening in Austin right now, you can be sure this convergence is going to be a hot topic, especially for us at Dell. So please make sure you stop by and learn more about how we’re laying the groundwork for the future of HPC.

If you have any comments or suggestions, feel free to get in touch with me at adnan_khaleel@dell.com.