When you mention “data warehouse” to someone not acquainted with the term, the first image that springs to their minds is usually something like the one above. When in fact, it actually looks a little more like this:
A data warehouse is, on the highest level, a central repository of data. Just as big data has Doug Laney and his 3 V’s, data warehousing has Bill Inmon, and his much-less catchy but oft-cited definition:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process.
This might not be all that illuminating at first glance, but if you unpack it, there’s a lot of information to be found here.
- “Subject-Oriented”: Contrary to what many people believe when first hearing the term, a data warehouse does not have to be one monolithic repository of all your data. A data warehouse can be used to store and analyse data on a specific subject.
- “Integrated”- One of the key features of data warehouses is that they store and leverage data from multiple different sources.
- “Time variant”- Data warehouses are the home of historical data. You can pull up data from a specific time period, be that 3 months or 3 years ago. Data warehouses often also contain detailed temporal data, showing changes to a dataset. For example, if you’re using your data warehouse to store key customer metrics, you can not only see a customer’s current address or recent purchase history, but all data previously associated with this customer & how it’s changed over time.
- “Non-volatile”–The data contained within a data warehouse is not subject to change.
Another key defintion often cited when defining the term “data warehouse” comes courtesy of Ralph Kimball. He states:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
The Data Warehousing Information Center add two caveats to Kimball’s defintion, which I think are worth highlighting:
“1) Sometimes non-transaction data are stored in a data warehouse – though probably 95-99% of the data usually are transaction data.
2) I say “querying and reporting” rather than “query and analysis” because the main output from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted “formal” reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis.”
They also raise the salient point that what Kimball purposefully doesn’t say is also worth taking into consideration. No mention is made about the form of the stored data, as a data’s form never qualifies or disqualifies it from being considered a data warehouse. A data warehouse can be a relational database, a multidimensional database, a hierarchical database, an object database- the list goes on. There is no one form, or set of forms, that a data warehouse is expected to make.
Some leaders in the field of data warehousing solutions include:
- Exasol- Exasol offer a ready-to-run data warehouse appliance, tailored to your data volume and analysis requirements. They offer a free single-node copy of their software here, if you’d like to try before you buy.
- Teradata- Teradata offer a broad spectrum of data warehousing products, from their fully inclusive Teradata Active Enterprise Data Warehouse, to the Teradata Database Express, which is free to download straight away. Features within the portfolio include support for JSON documents and a range of language plugins (including Ruby, Perl and Python)
- Data Virtuality- Data Virtuality’s USP is that it claims to unite “all data, all systems and all sources” on one platform, automatically, in less than one week. It’s quite a claim, but reviewing their extensive list of connectors should convince you of their credibility.
Other important vendors in the data warehousing space include EMC, Greenplume, Splunk, Amazon Warehouse Services and Google Mesa. Stay tuned in the coming weeks for in-depth guides focused around these technologies.
(Image sources: Wikipedia, Sébastien Barré)