Building End-to-End Hadoop Solutions

By Mike King

Description:  Here are some key considerations for creating a full-featured Hadoop environment—from data acquisition to analysis.

The data lake concept in Hadoop has a great deal of appeal.  It’s a relatively low-cost, scale-out landing zone for many diverse types of data.  Actually, I’ve yet to see a type of data one couldn’t put in Hadoop.  Although most accessible data is highly structured, data can also be semi-structured or multi-structured.  (To my way of thinking, technically there is no “unstructured” data, but that is a subject for another post.) 

Data may be found internally or externally, and some of the best data is actually purchased from third-party providers that create data products.  Don’t ignore this “for-fee” data, as it may allow you to connect the pieces in ways you couldn’t otherwise.  In many cases the upfront cost pales in comparison to the opportunity cost.  (Hey, that’s “Yet Another Idea for a Cool Blog Post” [YAIFACBP].)

Perhaps one of the richest parts of the Hadoop ecosystem is the ingest layer.  In short, it’s how you get data from a source into a sink, the sink being the destination where the data lands in Hadoop or in your data lake.  Options for moving data include Sqoop, Flume, Kafka, Boomi, SSIS, Java, Spark Streaming, Storm, Syncsort, SharePlex, Talend and dozens of others.
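To make the source-to-sink idea concrete, here is a minimal sketch of one such pipeline using Spark Structured Streaming: read records from a Kafka topic and land them in HDFS as Parquet.  The broker address, topic name and paths are illustrative placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Minimal sketch: stream raw records from a Kafka topic into an HDFS landing zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
       .option("subscribe", "events")                      # hypothetical topic
       .load())

# Kafka keys/values arrive as bytes; keep them as strings for the landing zone.
landed = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

query = (landed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/events")               # landing directory
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()
```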

While ingestion is important, if all you do is fill your data lake, you have failed.  There are several aspects you should strongly consider for your data lake: data quality, data governance, metadata, findability, organization, information lifecycle management (ILM), knowledge management and analytics.  Which specific tools you use to address each of these matters far less than addressing them at all; there are many ways to skin a cat, and my way may not be best for you.  Let’s examine each of them in order.

Data quality can be thought of as cleaning data.  The old adage of “garbage in, garbage out” (GIGO) aptly applies.  Data quality dimensions include accuracy, completeness, conformance, consistency, duplication, integrity, timeliness and value.  When cleaning data, start with a few simple measures based on the dimensions that matter most to you.  Keys (primary, alternate, natural or surrogate) and identifiers (IDs) tend to be the most important attributes when considering data quality, because they are also how we access our data.  Think about the impact to your business when keys or IDs are incorrect.  Checking items against one or more metrics, standards, rules or validations will let you avoid many problems and remediate those that do occur via a closed-loop process.
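As a starting point, a sketch of that kind of closed-loop key check might look like the following in PySpark: count null and duplicate keys, then quarantine the offending rows for remediation.  The dataset, key column and quarantine path are hypothetical.

```python
# Simple key-focused data quality check with a quarantine step.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-key-checks").getOrCreate()
orders = spark.read.parquet("hdfs:///data/raw/orders")   # hypothetical dataset

key = "order_id"                                          # hypothetical key column

null_keys = orders.filter(F.col(key).isNull())
dupe_keys = (orders.groupBy(key).count()
             .filter(F.col("count") > 1)
             .join(orders, key, "inner")
             .select(*orders.columns))

print(f"null {key}s: {null_keys.count()}, rows with duplicated {key}s: {dupe_keys.count()}")

# Quarantine the offending rows so they can be fixed and re-ingested.
null_keys.union(dupe_keys).write.mode("overwrite") \
    .parquet("hdfs:///data/quarantine/orders")
```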

Data governance involves the control, access and management of your data assets.  Each business must outline and define its own process.  As Albert Einstein once said, “Make things as simple as possible, but not simpler.”  I’d advise “data lakers” to start small and simple, even if it’s only a spreadsheet of pre-approved users, sources, sinks and access controls maintained by your Hadoop administrator.  Data governance is an imperative for every data solution implementation.
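Even that spreadsheet can be put to work.  As a rough sketch, an ingest job could consult a small registry file before moving anything; the file name, column names and values below are assumptions made purely for illustration.

```python
# Bare-bones governance check: consult a pre-approved registry before ingesting.
import csv

def load_registry(path="governance_registry.csv"):
    """Rows look like: user,source,sink,allowed (yes/no)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def is_approved(registry, user, source, sink):
    return any(r["user"] == user and r["source"] == source
               and r["sink"] == sink and r["allowed"].lower() == "yes"
               for r in registry)

registry = load_registry()
if not is_approved(registry, "analyst1", "crm_db", "hdfs:///data/raw/crm"):
    raise PermissionError("Ingest not pre-approved; see the data governance registry.")
```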

Metadata is “data about data” and is often misunderstood.  As a quick definition, if one considers a standard table in Oracle, the column names and table names are the metadata; the data values in the rows are the data.  The same holds for Hive.  There are times when metadata is embedded in the payload, as with XML or JSON (a payload is simply all the data contained in a given transaction).  A good practice when implementing a big data solution is to collect the disparate metadata in one place to enhance or enable management, governance, findability and more.  The most common way to collect this disparate metadata is with a set of tables in your RDBMS.
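A minimal sketch of such a catalog table follows, with SQLite standing in for whatever RDBMS you actually use; the table and column names are illustrative.

```python
# One relational table recording, for each dataset landed in the lake,
# where it came from, where it lives in HDFS and when it was loaded.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("metadata_catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dataset_metadata (
        dataset_name  TEXT PRIMARY KEY,
        source_system TEXT,
        hdfs_path     TEXT,
        format        TEXT,
        loaded_at     TEXT
    )
""")

conn.execute(
    "INSERT OR REPLACE INTO dataset_metadata VALUES (?, ?, ?, ?, ?)",
    ("orders", "crm_db", "hdfs:///data/raw/orders", "parquet",
     datetime.now(timezone.utc).isoformat()),
)
conn.commit()
```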

Findability is generally implemented with search.  In Hadoop this typically means either Solr or Elasticsearch.  Elasticsearch is one of the newer additions to the Hadoop ecosystem and is far easier to learn and configure, although either will work.  Note that the search function is an imperative in any big data solution.
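As one hedged example of what “findable” means in practice, the catalog entry above could be pushed into Elasticsearch over its REST API so users can search for datasets by name, description or tag.  The host, index name and document fields are assumptions.

```python
# Index a small dataset-description document into Elasticsearch via its REST API.
import requests

ES_URL = "http://localhost:9200"          # hypothetical Elasticsearch host
doc = {
    "dataset_name": "orders",
    "description": "Raw CRM orders landed daily",
    "hdfs_path": "hdfs:///data/raw/orders",
    "tags": ["crm", "orders", "raw"],
}

resp = requests.put(f"{ES_URL}/datasets/_doc/orders", json=doc)
resp.raise_for_status()
print(resp.json().get("result"))          # "created" or "updated"
```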

The next key is organization, and although this may sound a bit trite, it is a necessity.  Developing a simple taxonomy and the rules for how you create and name your directories in HDFS is a great example of organization.  Create and publish your rules for all to see.  Those who skip this step end up with unnecessary duplication, unneeded sprawl, a lack of reuse and a myriad of other problems.
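One lightweight way to enforce a published taxonomy is to give every ingest job a single helper that builds landing paths.  The zone/source/date layout below is an illustrative convention, not a standard.

```python
# Helper that builds HDFS landing paths from a published naming convention.
from datetime import date

ZONES = {"raw", "cleansed", "curated"}    # illustrative zone names

def landing_path(zone: str, source: str, ds: date) -> str:
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; allowed: {sorted(ZONES)}")
    return f"hdfs:///data/{zone}/{source}/{ds:%Y/%m/%d}"

print(landing_path("raw", "crm_db", date(2016, 5, 1)))
# hdfs:///data/raw/crm_db/2016/05/01
```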

Information Lifecycle Management (ILM) is a continuum of what one does with data as it changes state, most typically over time.  Think of data as something that is created, cleansed, enhanced, matched, analyzed, consumed and eventually deleted.  As data ages, its value and usage decline.  With ILM one might keep data in memory until it is cleansed and cataloged, then in a NoSQL database such as Cassandra for 90 days, and finally compressed and stored in HDFS for two or more years before it is deleted.
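A toy policy function makes the tiering concrete.  The thresholds and tier names below simply restate the example above and are not a recommendation.

```python
# Age-based ILM policy: decide where a dataset should live at a given age.
def storage_tier(age_in_days: int, cleansed: bool) -> str:
    if not cleansed:
        return "memory"             # still being cleansed and cataloged
    if age_in_days <= 90:
        return "cassandra"          # hot, query-heavy window
    if age_in_days <= 365 * 2:
        return "hdfs-compressed"    # cheap, colder storage
    return "delete"                 # past retention

for age in (10, 200, 1000):
    print(age, storage_tier(age, cleansed=True))
```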

Knowledge management is simply how one manages the knowledge garnered from data.  All too often one might ask, “If we only knew what we know…”  Learning has great value to the individual; in a company or organization we should leverage knowledge so that its value multiplies.  Sharing knowledge makes others smarter and safer, and therefore more productive.  In Hadoop, how do your users learn about components like Hive, Sqoop and Pig?  How do they share their tips with others?  There are many more questions to ask, and a wiki enables secure knowledge sharing and management.

The final aspect of a big data solution is arriving at the stage where we can begin to analyze the data.  At this step we build the insights that let users see the fruits of their labors.  The consumers of our data lake, such as analysts and data scientists, should now have matured to using the data to build the business, protect the business and understand their customers in a more comprehensive manner.  Ultimately, data-driven insights allow users to be more productive, make better decisions and get better results.

Mike King is a Dell EMC Enterprise Technologist specializing in Big Data and Analytics.