Originally posted to LinkedIn on Aug 25, 2014
You Don’t Want to Need Big Data Janitors
When Big Data hit the scene, technologists told us to throw away our databases, our spreadsheets, and our models. This was a brave new world where simply having tons of data would answer everything. A 2008 Wired article prognosticated “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.”
Six years later, why are we still wading through spreadsheets, models, and databases? A recent New York Times article reported that data scientists estimate that they spend 50% to 80% of their time doing “Janitor Work” and manually building insights.
Big Data is the new vanity technology. Most of the discussions are not about how to apply it to doing business, but instead about how big your data needs to be before you get Big Data bragging rights.
Asking how much data you need to be Big is like asking how much money you need to be rich. As Clair Huxtable explained to Vanessa, “Rich is when your money works for you, not when you work for your money.” Data is Big when it works for you and you don’t work for your data.
By that measure, most of us are still data poor. Data is not an end in itself. Like money, having a giant hoard of it doesn’t do you any good unless you know how to use it. Data is a raw material for building good decisions. The better your data, the better your decision making can be. Notice that I said decision making and not just decisions, because even great data can’t save you from using it incorrectly!
You Want A Data Supply Chain
Building decisions is like building software. Start with your end product in mind and work backwards. People who start with a clear goal and end point in mind are usually more successful. What are the questions your business has? Where do you have to rely on hunches? What are the things that keep catching you by surprise?
Instead of Big Data, companies should be talking about Data Supply Chains. Just like Toyota uses parts to build cars, data-driven companies use data to build decisions. Car manufacturers with good supply chains are more successful. They have better turnaround, better ROI, better visibility into issues, more resilience to change and disruption, higher quality, and fewer failures.
We need Data Supply Chains to make Big Data a winning proposition for business. To go from data to decisions you need to be able to manage the quality of your data. You need to be able to source your data from many places. You need to be able to combine lots of data and have coherent definitions. You will need to be able to process lots of data through complex models. You will need to be able to do this quickly, reliably, and cost-effectively. Finally, because business is always changing, you will need to be able to update your systems as fast as your business changes.
In my current company, we took a Data Supply Chain approach. We leveraged the standard open-source, Hadoop-based tools as the basis of our assembly lines. While most people in our space focused on making their Big Data technologically cutting edge for its own sake, we focused on the business decisions and how the business would use the data. A Supply Chain is driven by the processes it supports.
We created a Supply Chain for data management that gives us the ability to:
- rapidly ingest a wide range of data from a wide range of sources
- track the lineage, coverage, and quality of all data
- automate a data dictionary for all “defined” data
- query all “defined” data through HiveQL
- run automated, scheduled, managed workflows
- manage dependencies between data aggregations and workflows
- secure and manage multi-tenant storage and processing
- apply record- and column-level security and encryption
- offer self-service reporting and visualization tools
This up-front investment in automation and business processes means that we have radically less janitor work than the New York Times’s estimated 50% to 80%. The clean and reliable data parts also mean that everyone in our company can be a data scientist.
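Two of the capabilities listed above, dependency management and lineage tracking, can be illustrated with a small sketch. This is not our production system, which is built on Hadoop-based tools. It is a minimal Python example, and the workflow names (`raw_orders`, `orders_usd`, and so on) are hypothetical. It assumes each workflow declares the inputs it depends on and uses the standard library's `graphlib` (Python 3.9+) to work out a safe run order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflows, each mapped to the inputs it depends on.
dependencies = {
    "raw_orders": set(),
    "raw_fx_rates": set(),
    "orders_usd": {"raw_orders", "raw_fx_rates"},
    "daily_revenue": {"orders_usd"},
    "revenue_report": {"daily_revenue"},
}


def run_order(deps):
    """Return an order in which every workflow runs after its inputs."""
    return list(TopologicalSorter(deps).static_order())


def lineage(deps, target):
    """Return every upstream workflow or source that feeds into `target`."""
    upstream = set()
    stack = list(deps[target])
    while stack:
        node = stack.pop()
        if node not in upstream:
            upstream.add(node)
            stack.extend(deps[node])
    return upstream


order = run_order(dependencies)
print(order)
print(lineage(dependencies, "daily_revenue"))
```

Declaring dependencies as data, rather than burying them in scripts, is what lets a scheduler decide what to run next and lets anyone ask where a number in a report came from.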
Data Works for Everyone
Before our Data Supply Chain, anyone looking at the raw data had to know a whole list of caveats and rules for combining and cleaning data. Simple mistakes, such as botched time zone and currency conversions, were common and easy to make. More insidious problems, like poor vendor data quality and unclear data definitions, were almost impossible to avoid. Only a select few could be trusted to interpret the data.
Our Data Supply Chain is built to prevent these mistakes. If Toyota received a batch of defective headlights, they would be spotted and replaced before final assembly. In the same way, our employees can walk into our parts bin, pull out good data, and start to assemble decisions with confidence. Everyone can use data in their decisions. And just like Toyota, most of our grunt work is handled by “robots” these days.
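To make the "defective headlights" idea concrete, here is a minimal sketch of an inspection step that normalizes time zones and currencies and sets aside records it cannot trust. The record fields, the `RATES_TO_USD` table, and its exchange rate are all assumptions made up for illustration, and a real pipeline would look up dated exchange rates rather than a fixed table.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Hypothetical fixed conversion rates; illustrative values only.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.30")}


def normalize(record):
    """Convert one record to UTC and USD, or raise ValueError if it is defective."""
    ts = record["timestamp"]
    if ts.utcoffset() is None:
        raise ValueError("timestamp has no time zone")
    currency = record["currency"]
    if currency not in RATES_TO_USD:
        raise ValueError(f"unknown currency: {currency}")
    return {
        "timestamp_utc": ts.astimezone(timezone.utc),
        "amount_usd": record["amount"] * RATES_TO_USD[currency],
    }


def inspect(records):
    """Split records into clean parts and rejects, keeping the rejection reason."""
    good, rejected = [], []
    for record in records:
        try:
            good.append(normalize(record))
        except ValueError as err:
            rejected.append((record, str(err)))
    return good, rejected


cest = timezone(timedelta(hours=2))
records = [
    {"timestamp": datetime(2014, 8, 25, 12, 0, tzinfo=cest),
     "currency": "EUR", "amount": Decimal("10.00")},
    {"timestamp": datetime(2014, 8, 25, 12, 0),  # defective: no time zone
     "currency": "USD", "amount": Decimal("5.00")},
]
good, rejected = inspect(records)
```

Rejected records are kept alongside the reason they failed, so the defect can be traced back to its source instead of silently disappearing.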
Companies that are able to make the Data Supply Chain transition will be the ones that win with Big Data. Until then, you will have to keep your spreadsheets and databases for all that janitor work.