With the fast-growing interest in data lakes — a storage solution that allows structured and semi-structured data to live in the same place — attention is turning toward metadata as a way to organize large amounts of diverse enterprise data.
Metadata is an ambiguous, generic term, but it most commonly refers to attribute names, data types, relationships, basic data quality metrics, usage statistics and access controls. Because metadata is data about data rather than the data itself, it is often left unwritten, stored only in the heads of those in the know.
Capturing and harnessing this metadata in a robust, easily accessible catalog can open dramatic opportunities for an organization. Specifically, a catalog can improve the availability of enterprise data. Data scientists can quickly and confidently gather the necessary data for analysis. Data stewards can better understand how data interacts and connects across sources and silos. Through user interfaces that make it easy for data experts, owners and users to access the catalog, you can help create collaboration and shared understanding of your data assets across the enterprise.
Why is this so important? Imagine you’re a data scientist tasked with analyzing payment terms across all of your organization’s suppliers, with data from hundreds of ERP systems flowing into a single data lake. The fact that this data is in a single location might give you the illusion that it’s easily accessible for analysis, but it’s not. You know the data you need is in there, but you don’t know where to find it when you need it.
A well-maintained metadata catalog would make it far easier for you to identify the sources and attributes required for your payment analysis across organizational silos. The result? Significantly less time spent collecting data and more accurate outcomes.
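As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the catalog is simply a list of attribute records that experts have tagged by business concept. The `AttributeMeta` class, the `find_attributes` helper, and the source, table and attribute names are all hypothetical and are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMeta:
    """One catalogued attribute (hypothetical record layout)."""
    source: str   # system the data came from
    table: str    # table within that system
    name: str     # attribute (column) name
    dtype: str    # declared data type
    tags: set = field(default_factory=set)  # business concepts assigned by experts


def find_attributes(catalog, tag):
    """Return every catalogued attribute carrying the given tag."""
    return [attr for attr in catalog if tag in attr.tags]


# Illustrative entries only; real catalogs would hold thousands of these.
catalog = [
    AttributeMeta("erp_emea", "vendors", "pay_terms", "string", {"supplier", "payment_terms"}),
    AttributeMeta("erp_na", "suppliers", "net_days", "integer", {"supplier", "payment_terms"}),
    AttributeMeta("erp_na", "suppliers", "phone", "string", {"supplier", "contact"}),
]

matches = find_attributes(catalog, "payment_terms")
locations = [f"{a.source}.{a.table}.{a.name}" for a in matches]
print(locations)
```

The point of the sketch is that the search runs over the metadata, not the data: the two ERP systems name the attribute differently, yet both surface under the same business tag.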
Following are four best practices for starting to manage your own metadata for analytic applications.
Start with Questions (The Hard Ones)
Before you begin thinking about metadata, start by thinking of the most impactful business-level questions that your organization would want to answer, and the data required to answer them. For example, the data needed for an analysis of cross-selling opportunities among your lines of business may be different from the data needed for inventory projections for your stores. Thinking through requirements ahead of time and ensuring they are baked into your metadata catalog can be an immense time-saver when it comes time to perform analysis. You may not have the data immediately ready, but you’ll know what’s available, its quality and reliability, and its location.
Identify Core Attributes and Sources
As you develop key business questions, you’ll no doubt get a better idea about the underlying entities that would be required for analysis. In the payment analysis example above, the key entities would be suppliers and payment terms. For a pharmaceutical analysis, they could be patient, drug and experiment data. Understanding these entities, the attributes that describe them and the relationships among them is critical for downstream analytics.
Identify Key Data Experts
The most valuable metadata often isn’t stored in a database or data lake. It’s stored in the brains of people: the data owners and experts who are often spread throughout the enterprise.
Understanding table relationships, completeness or emptiness indicators, and table structure is far too big a job for one person. The knowledge is split up among the various domain experts who use the data regularly and the IT analysts who create and maintain the data structure. Everything from quality metrics (e.g., ‘99999’ means null in this attribute) to data origins (e.g., average county income was used for the ‘income’ attribute) to much more nuanced information (e.g., inflation in the US vs. inflation in Mexico) is stored somewhere in the minds of your owners and experts. Once you have identified the business goals and what kind of data you’ll require, make sure you verify with these experts that you have everything you need, and that you have what you think you have.
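To make the ‘99999’ example concrete, the following Python sketch shows how an expert's annotation, once written down, can be applied mechanically during analysis. The `NULL_SENTINELS` mapping and the `clean_value` function are hypothetical names, and the sketch assumes the catalog exposes such annotations as a simple dictionary.

```python
from typing import Any, Optional

# Hypothetical expert annotations: attribute name -> values that mean "missing".
# Both the string and integer forms are listed because sources may differ.
NULL_SENTINELS = {"income": {"99999", 99999}}


def clean_value(attribute: str, value: Any) -> Optional[Any]:
    """Return None when the value is a registered null sentinel for the attribute."""
    if value in NULL_SENTINELS.get(attribute, set()):
        return None
    return value


rows = [{"income": 52000}, {"income": 99999}, {"income": "99999"}]
cleaned = [{key: clean_value(key, val) for key, val in row.items()} for row in rows]
print(cleaned)
```

Without the annotation, an average over the ‘income’ attribute would silently treat 99999 as a real value; with it, the placeholder is dropped before any statistic is computed.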
Beyond verifying that you have what you need, these people can also help you capture the metadata by registering it in the catalog in the first place, as well as by collaboratively annotating it.
Data changes constantly. New business initiatives and needs pop up every day. Responding to all these changes ad hoc is not going to lead to long-term data stability. Instead, create a more deliberate process for reviewing metadata changes and monitoring data streams for change. Metadata is a critical part of a healthy data ecosystem, but it only takes one oversight or mistake to render it ineffective.
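One simple form of such monitoring is to compare the schema registered in the catalog with the schema observed in an incoming feed. The Python sketch below assumes both are available as plain dictionaries mapping attribute names to type names. The `diff_schema` function and the sample attributes are hypothetical, and a real process would also need to track changes in meaning, not just in structure.

```python
def diff_schema(registered, observed):
    """Compare two {attribute: type} mappings and report what changed."""
    registered_names = set(registered)
    observed_names = set(observed)
    return {
        # Attributes present in the feed but missing from the catalog.
        "added": sorted(observed_names - registered_names),
        # Attributes the catalog expects but the feed no longer carries.
        "removed": sorted(registered_names - observed_names),
        # Attributes present in both whose declared type differs.
        "retyped": sorted(
            name
            for name in registered_names & observed_names
            if registered[name] != observed[name]
        ),
    }


registered = {"supplier_id": "integer", "pay_terms": "string", "income": "integer"}
observed = {"supplier_id": "string", "pay_terms": "string", "currency": "string"}

changes = diff_schema(registered, observed)
print(changes)
```

Any non-empty entry in the result is a prompt for review by a data owner before the catalog is updated, rather than something to patch ad hoc.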
Clearly, this last step is the most difficult to implement. Part of this implementation is deciding what tools to use for tracking and maintaining data deltas.
Master Data Management (MDM) software, which uses user-defined rules for matching entities and mapping attributes, was developed for exactly this reason. And many MDM tools have been incredibly effective in mapping segments of a data ecosystem. But there is a fundamental problem with using this top-down, rigid approach in today’s world of data lakes and Big Data: rules can only be written for data we already understand, and we don’t know what we don’t know.
Some of the newer enterprise data unification products are trying to overcome this problem by using human-guided machine learning. Algorithms automatically connect the vast majority of data sources and resolve duplications, errors and inconsistencies among entities and attributes. When the system can’t resolve connections automatically, it calls on people in the organization familiar with the data to weigh in on the mapping and improve its quality and integrity. As a result, the data gets better and better over time.
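The division of labor described here, automatic resolution where the system is confident and expert review where it is not, can be sketched in a few lines of Python. This is not how any specific product works: the string similarity from the standard library's `difflib.SequenceMatcher` stands in for a trained model, and the `triage_mappings` function, the 0.8 threshold and the attribute names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage_mappings(source_attrs, target_attrs, threshold=0.8):
    """Map each source attribute automatically when confident, else queue it for review."""
    auto, review = {}, []
    for src in source_attrs:
        # Best-scoring candidate in the target schema.
        best = max(target_attrs, key=lambda tgt: similarity(src, tgt))
        if similarity(src, best) >= threshold:
            auto[src] = best             # confident: accept without human input
        else:
            review.append((src, best))   # uncertain: ask a data expert
    return auto, review


auto, review = triage_mappings(
    ["Supplier_Name", "net_days"],
    ["supplier_name", "payment_terms"],
)
print("auto-mapped:", auto)
print("needs expert review:", review)
```

In a human-guided system, the experts' answers on the review queue would be fed back as training examples, which is what lets the mapping improve over time. That feedback loop is omitted here.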
So when you do start with the hard questions, as these best practices encourage you to do, you’ll have some quick help identifying the sources, attributes and key experts that are central to good metadata management.
Maggie Soderholm is a field engineer at Tamr, Inc., where she helps customers deploy metadata catalogs and other enterprise data unification software. Before joining Tamr, she was a data analyst on the Business Intelligence Team at Evernote, where she helped create the infrastructure for storing and pulling data as well as accessing reports and dashboards. She holds a degree in statistics from Carnegie Mellon University.