Dell Big Data Blog
Thank you for visiting our Big Data community. Our active contributors have their fingers on the pulse of Big Data and the big ideas it generates.
By Brett Roberts with Debra Slapak
The amount of machine-generated data being created each day is massive and--as we all know--can be extremely valuable. Insights extracted from this data have the potential to help you improve operational efficiency, customer experience, security and much more. But getting started can present real challenges and really big questions, such as "How do we consolidate all of this complex data and analyze it to deliver actionable insights?" Dell EMC works with Splunk to address these challenges and simplify those first steps.
Splunk’s proven platform for real-time operational intelligence helps reduce the complexity of harnessing machine-generated data by providing users with an end-to-end platform to collect, search, analyze and visualize this data. For the Splunk platform to be used to its full potential, organizations need infrastructure that meets or exceeds Splunk’s reference architecture specifications. Dell EMC has partnered with Splunk to create highly-optimized and powerful solutions that help solve machine-generated data challenges. Read more in a recently posted blog about how Splunk and Dell EMC can help you on your journey to valuable insights with machine-generated data.
Building End-to-End Hadoop Solutions
By Mike King
Description: Here are some key considerations for creating a full-featured Hadoop environment—from data acquisition to analysis.
The data lake concept in Hadoop has a great deal of appeal. It’s a relatively low-cost, scale-out landing zone for many diverse types of data. Actually, I’ve yet to see a type of data one couldn’t put in Hadoop. Although most accessible data is highly structured, data can also be semi-structured or multi-structured. (To my way of thinking, technically there is no “unstructured” data, but that is a subject for another post.)
Data may be found internally or externally, and some of the best data is actually purchased from third-party providers that create data products. Don’t ignore this “for-fee” data as it may allow you to connect the pieces in ways you couldn’t do otherwise. In many cases the upfront cost pales in comparison to the opportunity cost. (Hey, that’s “Yet another Idea for a Cool Blog Post” [YAIFACBP].)
Perhaps one of the richest parts of the Hadoop ecosystem is the ingest layer. In short, it’s how you get data from a source to a sink, where the sink is the point at which your data lands in Hadoop or in a data lake. Options for moving data include Sqoop, Flume, Kafka, Boomi, SSIS, Java, Spark Streaming, Storm, Syncsort, SharePlex, Talend and dozens of others.
While the ingestion is important, if all you do is fill your data lake you have failed. There are several different aspects you should strongly consider for your data lake. These include data quality, data governance, metadata, findability, organization, ILM, knowledge management and analytics. How one accounts for each of these points regarding data tooling is of little importance, as there are many ways to skin the cat and my way may not be best for you. Let’s examine each of them in order.
Data quality can be thought of as cleaning data. The old adage “garbage in – garbage out”—GIGO—aptly applies. Data dimensions must include accuracy, completeness, conformance, consistency, duplication, integrity, timeliness and value. When cleaning data, a suggestion would be to start with a few simple measures based on the dimensions for the data that matter most to you. Keys (primary, alternate, natural or surrogate) and identifiers (IDs) tend to be the most important attributes when considering data quality. These keys and IDs are also how we access our data. Think about the impact to your business when the keys or IDs are incorrect. Checking items against one or more metrics, standards, rules or validations will allow you to avoid the problems and remediate those that do occur via a closed-loop process.
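To make the "start with a few simple measures" advice concrete, here is a minimal sketch of two key-based checks, completeness (is the key present?) and duplication (is the key repeated?). The record layout, the `customer_id` field and the `check_keys` helper are assumptions made for illustration; they are not part of any Hadoop tool.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int           # number of records inspected
    missing_keys: int    # records with no usable key (completeness)
    duplicate_keys: int  # records repeating an earlier key (duplication)


def check_keys(records, key_field):
    """Count records whose key is missing or duplicates an earlier record."""
    seen = set()
    missing = 0
    duplicates = 0
    for record in records:
        key = record.get(key_field)
        if key is None or key == "":
            missing += 1
        elif key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return QualityReport(len(records), missing, duplicates)


# Hypothetical sample data: one repeated key and one empty key.
sample = [
    {"customer_id": "C001", "name": "Ada"},
    {"customer_id": "C002", "name": "Grace"},
    {"customer_id": "C001", "name": "Ada L."},
    {"customer_id": "", "name": "Unknown"},
]
report = check_keys(sample, "customer_id")
print(report)
```

Counts like these can feed the closed-loop process: track them per load, set a threshold that matters to your business, and remediate when a load exceeds it.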
Data governance involves the control, access and management of your data assets. Each business must outline and define its own process. As Albert Einstein once said, “Make things as simple as possible, but not simpler.” I’d advise “data lakers” to start small and simple even if it’s only a spreadsheet that includes pre-approved users, sources, sinks and access controls maintained by your Hadoop administrator. Data governance is an imperative for every data solution implementation.
Metadata is “data about data” and is often misunderstood. As a quick definition, if one considers a standard table in Oracle, then the column names and table names are the metadata. The data values in the rows are data. The same holds for Hive. There are times when metadata is embedded in the payload, as with XML or JSON. A payload is simply all the data contained in a given transaction. A good practice when implementing a big data solution is to collect the disparate metadata in one place to enhance or enable management, governance, findability and more. The most common way to collect the disparate metadata is with a set of tables in your RDBMS.
Findability is generally implemented with search. In Hadoop this is typically either Solr or Elasticsearch. Elasticsearch is one of the newer additions to the Hadoop ecosystem and is far easier to learn and configure, although either will work. Note that search is an imperative in any big data solution.
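As a rough illustration of collecting metadata in a set of relational tables, the sketch below uses Python's built-in `sqlite3` module as a stand-in for your RDBMS. The table names (`datasets`, `dataset_columns`), their columns and the helper functions are all assumptions chosen for the example, not a standard schema.

```python
import sqlite3

# In-memory database as a stand-in for the RDBMS that holds the catalog.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datasets (
    dataset_id    INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,
    source_system TEXT NOT NULL,
    hdfs_path     TEXT NOT NULL
);
CREATE TABLE dataset_columns (
    dataset_id  INTEGER NOT NULL REFERENCES datasets(dataset_id),
    column_name TEXT NOT NULL,
    data_type   TEXT NOT NULL,
    PRIMARY KEY (dataset_id, column_name)
);
""")


def register_dataset(conn, name, source_system, hdfs_path, columns):
    """Record a dataset and its column metadata; returns the new dataset id."""
    cur = conn.execute(
        "INSERT INTO datasets (name, source_system, hdfs_path) VALUES (?, ?, ?)",
        (name, source_system, hdfs_path),
    )
    dataset_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO dataset_columns (dataset_id, column_name, data_type) "
        "VALUES (?, ?, ?)",
        [(dataset_id, col, dtype) for col, dtype in columns],
    )
    conn.commit()
    return dataset_id


def find_datasets_with_column(conn, column_name):
    """Return the names of all datasets that expose the given column."""
    rows = conn.execute(
        "SELECT d.name FROM datasets d "
        "JOIN dataset_columns c ON c.dataset_id = d.dataset_id "
        "WHERE c.column_name = ? ORDER BY d.name",
        (column_name,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical datasets for illustration only.
register_dataset(conn, "orders", "erp", "/data/raw/erp/orders",
                 [("order_id", "bigint"), ("customer_id", "string")])
register_dataset(conn, "web_clicks", "weblogs", "/data/raw/weblogs/web_clicks",
                 [("session_id", "string"), ("customer_id", "string")])
print(find_datasets_with_column(conn, "customer_id"))
```

Even a catalog this small answers a common findability question: which datasets carry a given key?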
The next key is organization, and although this may sound a bit trite, it is a necessity. Developing a simple taxonomy and the rules on how you create and name your directories in HDFS is a great example of organization. Create and publish your rules for all to see. Note that those who skip this step will have unnecessary duplication, unneeded sprawl, lack of reuse and a myriad of other problems.
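One way to publish and enforce directory rules is to put them in a small helper that every ingest job calls. The sketch below assumes a hypothetical convention of `/data/<zone>/<source>/<dataset>/load_date=<YYYY-MM-DD>` with lowercase snake_case names; the zones, the root directory and the naming rule are illustrative choices, and your own taxonomy will differ.

```python
import re
from datetime import date

# Hypothetical taxonomy: the zones and the naming rule are assumptions.
ZONES = ("raw", "cleansed", "curated")
NAME_RULE = re.compile(r"^[a-z][a-z0-9_]*$")  # lowercase snake_case


def lake_path(zone, source, dataset, load_date):
    """Build an HDFS directory path that follows the published rules."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone {zone!r}; expected one of {ZONES}")
    for part in (source, dataset):
        if not NAME_RULE.match(part):
            raise ValueError(f"{part!r} does not follow the naming rule")
    return f"/data/{zone}/{source}/{dataset}/load_date={load_date.isoformat()}"


print(lake_path("raw", "crm", "customer_orders", date(2016, 9, 27)))
```

Rejecting a nonconforming name at write time is cheaper than cleaning up sprawl and duplicates later.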
Information Lifecycle Management (ILM) is a continuum of what one does with data as it changes state, most typically over time. Think of data as something that is created, cleansed, enhanced, matched, analyzed, consumed and eventually deleted. As data ages its value and usage declines. With ILM one might store data in memory till it is cleansed and cataloged, then in a NoSQL database like Cassandra for 90 days, and finally compressed and stored in HDFS for two or more years before it is deleted.
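The example policy above can be written down as a simple rule keyed on data age. This is only a sketch: the tier labels, the `storage_tier` function and the 730-day archive window (standing in for "two or more years") are assumptions, and a real policy would be driven by your own retention requirements.

```python
from datetime import date


def storage_tier(created, today, cleansed_and_cataloged,
                 hot_days=90, archive_days=730):
    """Pick a storage tier for a data set based on its state and age.

    hot_days and archive_days are assumed values mirroring the example
    policy: 90 days in a NoSQL store, then about two years in HDFS.
    """
    if not cleansed_and_cataloged:
        return "memory"            # still being cleansed and cataloged
    age_days = (today - created).days
    if age_days <= hot_days:
        return "nosql"             # e.g. Cassandra, for recent data
    if age_days <= hot_days + archive_days:
        return "hdfs_compressed"   # compressed long-term storage in HDFS
    return "delete"                # past retention, eligible for deletion


created = date(2016, 1, 1)
for today in (date(2016, 2, 1), date(2016, 6, 1), date(2019, 1, 1)):
    print(today, storage_tier(created, today, cleansed_and_cataloged=True))
```

Encoding the policy in one place makes it easy to review with the business and to change the thresholds as the value of the data shifts over time.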
Knowledge management is simply how one manages the knowledge garnered from data. All too often one might ask, “If we only knew what we know...” Learning is something that has great value to the individual. In a company or organization we should leverage knowledge so that the value of the knowledge multiplies. Sharing knowledge makes others smarter and safer, and therefore more productive. In Hadoop, how do your users learn about components like Hive, Sqoop and Pig? How do they share their tips with others? There are many more questions to ask, and using a “wiki” allows secure knowledge sharing and management.
The next aspect in a big data solution is arriving at the stage where we can begin to analyze the data. When we arrive at this step we begin to build the insights into the data that allow users to see the fruits of their labors. The consumers of our data lake, like analysts and data scientists, should now have matured to using the data to begin to build the business, protect the business and understand their customers in a more comprehensive manner. Ultimately, data-driven insights allow users to be more productive, make better decisions and get better results.
Mike King is a Dell EMC Enterprise Technologist specializing in Big Data and Analytics
We are headed out to the Big Show for Big Data, the Strata+Hadoop World event being held September 27-29, in New York City. We look forward to meeting with partners and customers as we take a closer look at the customer journey and the possibilities that exist in driving Big Data Hadoop adoption. Dell EMC has integrated all the key components for modern digital transformation, taking you on a Big Data journey that focuses on analytics, integration, and infrastructure. We have a number of exciting discussions planned and invite you to attend the events or connect with our team directly at booth #501. We will have some great giveaways that you won’t want to miss out on. You can also join us throughout the conference for All Day - Facebook LIVE videos on the Dell EMC Big Data Facebook page.
By Armando Acosta
The Strata + Hadoop World conference gets under way today, Tuesday, September 26, at the Jacob Javits Center in New York City. As always, the event will be a showcase for leading-edge technologies related to big data, analytics, machine learning and the like, but this year’s event brings some added attractions.
For starters, the conference will be the first major event to put the spotlight on the broad portfolio of Dell EMC solutions for unlocking the value of data and enabling the data analytics journey. As individual companies, both Dell and EMC had impressive product families in this space. And now that the two companies have become one newly formed company, the combined portfolio is arguably one of the best in the industry. In many ways, we’re talking about a “1 + 1 = 3” equation.
The Dell EMC portfolio for big data and modern analytics includes integrated, end-to-end solutions based on validated architectures incorporating Cloudera distributions for Hadoop, Intel technologies, and analytic software, along with Dell EMC servers, storage, and networking. The portfolio spans from starter bundles and reference architectures to integrated appliances, validated systems and engineered solutions. Our portfolio makes it easier for customers by simplifying the architecture, design, configuration/testing, deployment and management. By utilizing the Dell EMC portfolio, customers can minimize the time, effort, and resources to validate an architecture. Dell EMC has optimized the infrastructure to help free customers’ time to focus on their use cases.
For customers, the Dell EMC portfolio equates to a tremendous amount of choice and flexibility in deployment model, allowing customers to buy, deploy and operate solutions for big data and modern analytics no matter where they are in their journey. From industry-leading integration capabilities to direct-attached and shared storage, from real-time analytics to virtualized environments and hybrid clouds, choice spans the portfolio. The Dell EMC portfolio is configured and tuned to provide leading performance to run analytics workloads, enabling faster decision making.
Recent advances in the portfolio will be in the spotlight at the Dell EMC booth #501 at Strata + Hadoop World and will include use case-based solutions and validated systems for Cloudera Hadoop deployments. Our first iteration of the Hadoop reference architecture was published in 2011, when we partnered with Cloudera and Intel to develop a groundbreaking architecture for Apache Hadoop, which was then a young platform. Since then, hundreds of organizations have deployed big data environments based on our validated systems.
The widespread adoption of simplified and cost-effective validated systems points to a broader theme that will permeate the Strata + Hadoop World conference. That is one of Hadoop as a maturing platform that is heading into the mainstream of enterprise IT and delivering proven business value.
About that business value? Dell EMC and Intel commissioned Forrester Consulting to conduct a Total Economic Impact™ (TEI) study to examine the potential ROI enterprises may realize by deploying the Dell EMC | Cloudera Apache Hadoop Solution, accelerated by Intel. Based on interviews with organizations using these solutions, the TEI study identified these three-year risk-adjusted results:
Clearly, there are many reasons to be excited about how far we’ve come with Hadoop, and the potential to take the platform to all new levels with the Dell EMC portfolio. If you’re heading to Strata + Hadoop World, you will have many opportunities to learn more about the work Dell EMC is doing to help organizations unlock the value of their most precious commodity—their data.
In the meantime, you can learn more at Dell.com/Hadoop.
Armando Acosta is the Hadoop planning and product manager and Subject Matter Expert at Dell EMC.
Dell continues to make major investments in helping enable our customers’ success. Over the past 5 years, Dell has been building a globally connected network of labs that serve as a venue to host customers’ technical conversations as well as enable them to execute proofs-of-concept to “kick the tires” on solutions that solve their business challenges. With 16 locations globally, customers can visit any of these locations for a meeting with Dell’s subject-matter experts on topics ranging from the Dell PowerEdge server portfolio all the way to the latest Cloud and Big Data solutions Dell offers.
Talk to your account team today and join Dell’s other customers in answering questions like:
“Should I make the move to replace all my spinning drives with solid-state?”
“Can I gain agility and efficiency by moving to a hyper-converged platform?”
“What level of consolidation can I achieve with containerization on the latest Intel processors?”
For additional information on Dell’s Customer Solution Centers, please visit: http://Dell.com/SolutionCenters and contact your Dell account team to set up an engagement.
We invite you to join us May 16 – 19 at OSCON 2016 in Austin, where the tagline reads: “Making open work.” That’s been the spirit of OSCON from the beginning of the conference nine years ago, and it remains so today.
Organizers and attendees alike share the understanding that, in the words of the OSCON organizers, “Once considered a radical upstart, open source has moved from disruption to default, transforming the practice of software development. Collaborative and transparent, open source has become modus operandi, powering the current state of innovation in technology from the cloud and data to AI, connected devices, and beyond.” At OSCON you’ll experience all of that and more.
While most, if not all, organizations today recognize the importance of open source technologies, quite a few are still waiting for the many talented open source community members to work their magic and deliver easy-to-install and easy-to-use technologies. At OSCON, attendees will be a part of delivering the latest information in the open and collaborative fashion that is OSCON.
As one of the co-sponsors of Open Container Day at OSCON, Dell is a part of the growing focus area of container-based solutions for infrastructure, cloud-native computing, continuous delivery, DevOps, microservices, and a space where industry experts agree “this industry segment is going in 2016 and beyond.” Open Container Day runs from 9 a.m. to 5 p.m. on Tuesday, and is open to all OSCON attendees with a badge, including Expo Plus pass holders.
During Open Container Day, we encourage you to join Dell Software Development Senior Engineer Jose De La Rosa for a 2:30 – 3 p.m. presentation, in which he will discuss responses to the often-asked query around how to containerize legacy applications. De La Rosa outlines his discussion to say: “Docker containers are targeted at microservices: lean, modularized, single-process applications that are easy to quickly deploy. However, older, bloated multiprocess (aka legacy) applications can also take advantage of containers.” Join De La Rosa as he shares his hands-on experiences with containerizing legacy applications.
The Expo Hall at OSCON is not to be missed. It brings interest, information, fun and focus to all that is open source, and Dell representatives will be on hand to explain how we deliver “future-ready” solutions for cloud, big data, the Internet of Things and developer systems. Author Guy Harrison will sign his new book, Next Generation Databases, at the Dell booth on Wednesday and Thursday. The book is for enterprise architects, database administrators and developers who need to understand the latest developments in database technologies. It will help readers to choose the correct database technology at a time when concepts such as Big Data, NoSQL and NewSQL are making what used to be an easy choice into a complex decision with significant implications. We are also planning a little Star Wars fun, and look forward to seeing you at Booth 407.
The chance to network is one of the key advantages to attending OSCON. With that thought in mind, Dell invites attendees to register to join us for a happy hour on Wednesday night from 7 to 9 p.m. at the Cedar Door, one of the interesting and “Austin weird” local pubs. Find more information at Booth 407.
Finally, Dell’s Barton George will deliver a presentation that speaks to the power of open source, the power of innovation, and the collaboration, intelligence and cooperation that define the open source community. George will share the Project Sputnik story, a compelling example of crowd-sourcing a developer laptop. Hear more Thursday from 11:05 a.m. to 11:45 a.m., in Meeting Room 16B.
It’s exciting to be a part of another OSCON conference and we look forward to seeing you there!
Everyone is talking about Docker containers and microservices. Run a lean, highly modularized application inside a container, deploy rapidly here and there, scale out and easily remove it when done. But what about older, bloated multi-process legacy applications? Should they be redesigned and rewritten so they can be deployed via containers?
It’s certainly an option if you have the time and resources to do it. However, if time and resources are not available, legacy applications can be easily containerized without modifying a single line of code.
Dell’s OpenManage Server Administrator (OMSA) is an in-band systems management solution that offers a web GUI and a CLI to fully monitor the health of PowerEdge servers. It contains a sophisticated alerting mechanism that notifies system administrators if a hardware degradation or failure is detected.
We wanted to explore if we could take advantage of containers’ unique features such as environment isolation and ease of deployment and apply them to OMSA. The result is a simple and straight-forward way to deploy OMSA in your environment. In addition to making deployment a breeze, OMSA can now seamlessly run on unsupported Linux distributions like Ubuntu Server, Debian and container-only operating systems like Atomic Host.
Please join me for OSCON Container Day May 17th at 2:30 pm in Austin TX for a closer look at how containers can be used with legacy applications that don’t exactly fit the “microservice” mold. It may not be an elegant and pretty solution, but it works, and it will help your organization solve real-world problems.
The Dell XPS 13 laptop, developer edition, began life as “Project Sputnik,” a scrappy skunkworks project to pilot a developer-focused system. It was made possible by an internal incubation fund designed to bring wacky ideas from around the company to life in order to tap innovation that might be locked up in people’s heads.
From the start, the idea was to conduct Project Sputnik out in the open, soliciting and leveraging direct input from developers via blogs and forums. And why developers, you may ask? Developers are one of IT’s most influential constituents and not a group Dell had focused on previously.
So how did the project unfold? Read on, dear reader...
The Dell Booth is a popular place to be at this year's Strata + Hadoop, thanks to Guy Harrison's book signing events. The author, who is also an executive director for Dell R&D Information Management, is known for his books, such as Oracle Performance Survival Guide and MySQL Stored Procedure Programming (with Steven Feuerstein). His latest book, Next Generation Databases, is garnering a lot of attention in the industry.
Focused on the latest developments in database technologies, this is a book for enterprise architects, database administrators, and developers who need to understand what is new, what works, and what’s just hype. The aim is to help you choose the correct database technology at a time when concepts such as Big Data, NoSQL and NewSQL are making what used to be an easy choice into a complex decision with significant implications.
Harrison has divided the book into two sections. The first examines the market and technology drivers that led to the end of complete “one size fits all” relational dominance and takes a closer look at each of the major new database technologies. The second covers the nitty-gritty details of those technologies, examining how databases like MongoDB, Cassandra, HBase, Riak and others implement clustering and replication, locking and consistency management, and logical and physical storage models, along with the languages and APIs they provide. Harrison also muses upon the future of database technology and predicts that the explosion of new database technologies over the last few years will be followed by a much-needed consolidation phase. He believes there are some potentially disruptive technologies on the horizon, such as universal memory, blockchain and even quantum computing.
Harrison has spent most of his career in the relational database world. We are very lucky to have him with us now at Dell. He’s always interested in hearing what other people think and their questions and concerns. He will be with us for another book signing and giveaway today, so stop by booth #931 at 1:30 PM.
Strata + Hadoop World 2016 returns to San Jose on Tuesday, March 29, 2016. The event is known as one of the foremost gatherings for the world’s leading data scientists, analysts, and executives in big data and analytics. Attendees represent innovative companies – from startups to well-established organizations – who come together for networking, sharing case studies, identifying proven best practices, and learning effective new analytic approaches and core skills.
We invite you to join us at Dell Booth #931, to share case studies and use cases, demos of new products and solutions we’re bringing to market, and for many, a special book signing event.
Dell is very honored to host Guy Harrison, author of “Next Generation Databases,” who will be on hand for in-booth book signings throughout the conference. Mr. Harrison’s book is for enterprise architects, database administrators, and developers who need to understand the latest developments in database technologies. It’s the book to help you choose the right database technology at a time when concepts such as big data, NoSQL and NewSQL are making what used to be an easy choice into a complex decision with significant implications. Mr. Harrison will be at the booth on Tuesday, March 29 at 5:15 PM, on Wednesday, March 30 at 1:30 PM, and Thursday, March 31 at 1:30 PM.
On Wednesday, March 30, join us for a Dell interactive panel presentation, at 11:50 AM, in room LL20B. The team will help attendees to outline their big data journeys as they address the question, “Where are you on your data journey?” Here they will take a close look at how Hadoop enables data-driven insights across organizations no matter where they are on their big data journey. Dell’s own Anthony Dina will host the interactive panel, with panelists Adnan Khaleel, Jeff Weidner, and Armando Acosta. Together they will explore how business units have taken advantage of Hadoop’s strengths to quickly identify and implement two use cases: an early use case for ETL offload that then led to a detailed and robust advanced analytics solution that enabled Dell to use marketing analytics to transform the business and strengthen customer relationships.
In-booth theater presentations from Dell, Cloudera and Intel, a Robot Run for prizes, in-booth demos and chances to win BB-8 robots are just a few of the fun plans we have at the Dell booth at Strata+Hadoop World. Plan to stop by!
Together with Intel and Cloudera, we are also hosting a networking dinner, with wine and beer tastings, on Tuesday, March 29, at The Farmers Union in San Jose. If you are interested in attending, please stop by Dell booth #931 during the Opening Reception.
We look forward to seeing you in San Jose!