Poor data quality has been a thorn in the side of IT for years. The problem is simple: knowing that "key" data is correct, reliable, and trustworthy. As applications and databases continue to grow and sprawl, the problem becomes exacerbated.
In the world of data quality there are three basic states:
If your data quality is good - consider yourself in the elite minority. By definition, if you know which data items matter and which don't, your data is good. For the data that matters, you measure, monitor and have a process to improve it.
If your data quality is bad and you acknowledge it, you recognize that you have a problem. You must be measuring it to some extent to actually know it's bad. There is hope.
If the state of your data quality is unknown, then you're in the dominant majority. You probably have many problems, but they are hidden from view. Most organizations in this situation falsely believe that because they sell great products, have loyal customers and happy IT users, hear no complaints, or consider themselves world class, they must have excellent data quality. NOT.
Let’s take a look at data sources and data quality. Considering the data source is important. Data can come from internal sources within your company or from external sources. External sources may be public, like data.gov or AWS public datasets, or cloud-based and private, such as Salesforce.com. Data may also be purchased from a third party, which makes it semi-private. In each of these cases the quality of the data will vary. More importantly, your ability to fix a discovered problem certainly differs. If you found an error in the published census data at data.gov, the U.S. government is probably not going to change it. If you purchased data from Acxiom and find an error, they might change it in the next cut or reissue it just for you. If it’s in an internal system, the owner might change it if it’s impactful to them, but it will take time and money to remediate. More than likely, though, you will not find the errors, so one should always be skeptical when considering the sources of data.
With the advent of big data, where organizations reach far and wide to collect data from numerous disparate sources in a wide array of formats, the importance of data quality is heightened. If one measured the quality of three ingested sources, treated them equally, and knew each was 80 percent, then the overall data quality might be 0.8 × 0.8 × 0.8 = 51.2 percent. In reality more factors and weightings would likely be employed, but for the purpose of this discussion, I think you get the idea.
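The multiplication above can be sketched in a few lines. This is a minimal illustration, assuming equally weighted, independent sources whose combined quality is modeled as the product of per-source scores:

```python
# Hypothetical sketch: combined quality of equally weighted, independent
# sources, modeled as the product of per-source quality scores (0.0-1.0).
def combined_quality(scores):
    """Multiply per-source quality scores together."""
    result = 1.0
    for s in scores:
        result *= s
    return result

# Three sources, each measured at 80 percent quality.
print(round(combined_quality([0.8, 0.8, 0.8]), 3))  # 0.512
```

With each additional imperfect source, the naive combined figure only drops, which is exactly why integration strategy matters.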
Now that you have many data sources, how does one put them together? Many sources overlap. Some items conflict. Context and scope can vary. One must integrate data from many different sources to provide a single view of the truth for consumption by analysts and data scientists using software like Mahout or Statistica. This is an important part of the big data puzzle that’s best looked at as an opportunity. If one considers those same three sources at 80 percent each, then by picking and choosing the best pieces we might get an integrated, normalized data source at approximately 95 percent by some measure. That’s a win for your analytics environment.
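One way to "pick and choose the best pieces" is to select, field by field, the value from whichever source has the highest measured quality for that field. The sketch below assumes three hypothetical sources (the names, scores, and field values are all illustrative, not from the article):

```python
# Hypothetical sketch: integrate overlapping sources by taking each field
# from the source with the highest measured quality score for that field.
# Source names, scores, and record values are illustrative assumptions.
field_quality = {
    "crm":     {"name": 0.95, "email": 0.60},
    "billing": {"name": 0.70, "email": 0.90},
    "weblogs": {"name": 0.50, "email": 0.85},
}

records = {
    "crm":     {"name": "Acme Corp.", "email": "info@acme.example"},
    "billing": {"name": "ACME",       "email": "billing@acme.example"},
    "weblogs": {"name": "acme",       "email": "web@acme.example"},
}

def best_value(field):
    # Pick the source whose quality score for this field is highest.
    source = max(field_quality, key=lambda s: field_quality[s][field])
    return records[source][field]

merged = {f: best_value(f) for f in ("name", "email")}
print(merged)  # {'name': 'Acme Corp.', 'email': 'billing@acme.example'}
```

Each merged field inherits the best per-field score, which is how an integrated record can measure higher than any single source.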
So what's a big data architect to do?
1) Survey your data by ranking data items across all sources in terms of value.
2) Select the top 1-10 percent that matter most.
a) > 500 total items? Use 1 percent
b) 301-500 items? Use 5 percent
c) <= 300? Use 10 percent
3) Determine a metric for each item.
4) Measure each data item using the metric outlined. Sampling is ok, but beware of bias.
5) Create a process to improve the quality.
6) Set an acceptable target goal for each item.
7) Quantify the cost as compared to the goal.
8) Start working on the items that impact the bottom line most.
9) Fix items at the source, if possible, otherwise do it during ingestion.
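Steps 1 and 2 above can be sketched as code. This is a minimal illustration, assuming you already have a value score per data item; it applies the percentage thresholds from step 2 to pick the top slice:

```python
# Hypothetical sketch of steps 1-2: rank data items by value, then keep
# the top slice using the percentage thresholds from step 2.
def select_top_items(items_by_value):
    """items_by_value: dict mapping item name -> value score (higher = better)."""
    n = len(items_by_value)
    if n > 500:
        pct = 0.01       # > 500 total items: use 1 percent
    elif n > 300:
        pct = 0.05       # 301-500 items: use 5 percent
    else:
        pct = 0.10       # <= 300 items: use 10 percent
    k = max(1, int(n * pct))
    ranked = sorted(items_by_value, key=items_by_value.get, reverse=True)
    return ranked[:k]

# 10 items -> top 10 percent -> the single most valuable item.
items = {f"item_{i}": i for i in range(10)}
print(select_top_items(items))  # ['item_9']
```

The output of this selection becomes the working set for steps 3 through 9: define a metric, measure, set targets, and fix the highest-impact items first.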