1.3 BI Stack
Before exploring these concepts in more detail, let's begin by outlining the entire BI stack, which is the set of technologies (hardware and software) that either support or directly offer data description and analytics capabilities.
The image above gives an overview of each conceptual component of the BI stack while the image below (from http://hortonworks.com/open-enterprise-hadoop/) summarizes several of the major product vendors for each component. While you don't need to memorize these brands, it would be useful to familiarize yourself with them (particularly the applications on top) because you are the primary user of those tools.
However, before going too far down that road, let's explain each component in more detail by working our way from the bottom of the stack to the top.
Data Sources
Data sources are "where data comes from." Clearly, much of our data will come from our operational databases, which were the focus of the prior module. Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (a.k.a. "normalized") relational database. However, there are many other sources of good data that can be used to create business intelligence to aid decision making.
Another important source of data is the research we hire consulting firms to perform or that we perform ourselves. For example, we may administer a survey on our company website, or we may hire a firm to call our customers and ask them about their satisfaction with our products. Similarly, we likely have supply chain partners who are incentivized to see us succeed. For example, our suppliers want us to keep buying parts and products from them, so they will often offer us access to some of their data, which may be useful. Next, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian), which offer consumer credit information for sale. Likewise, the United States Postal Service sells access to address information. A company can use these data sources to "fill in the gaps" in its own consumer data collection.
ETL
It is not a good idea to perform complex analyses on the operational database. Operational DBs need to be fast so that business can run quickly, and they need to be accurate. As a result, you can't have people querying the operational database for ad-hoc or "non-core business process" reasons, slowing it down and possibly messing up the data. Therefore, we copy the data from the operational databases. This is called the extract/transform/load (ETL) process. Extract means that the data is copied (exported) from the operational database(s) and other data sources. Transform means that we do not load the data "as-is" into another database; we don't need the same level of detail for our analyses, so we typically summarize it in some way. For example, rather than store each individual sale, we will store the daily sales by product line, salesperson, store location, etc. Lastly, load simply means that the transformed data is written into a new database.
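To make those three steps concrete, here is a minimal sketch in Python using the pandas library. The connection strings and the table and column names (sales, product_line, salesperson, store, amount) are hypothetical; the point is simply that individual transactions are extracted, summarized to daily totals, and loaded into a separate analytical database.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connections; substitute your own operational and analytical DBs.
operational = create_engine("sqlite:///operational.db")
warehouse = create_engine("sqlite:///warehouse.db")

# EXTRACT: copy the individual sales out of the operational database.
sales = pd.read_sql(
    "SELECT sale_date, product_line, salesperson, store, amount FROM sales",
    operational,
)

# TRANSFORM: summarize to daily totals rather than keeping every transaction.
daily_sales = (
    sales.groupby(["sale_date", "product_line", "salesperson", "store"],
                  as_index=False)
         .agg(total_amount=("amount", "sum"), num_sales=("amount", "count"))
)

# LOAD: write the summarized rows into the analytical database.
daily_sales.to_sql("daily_sales", warehouse, if_exists="append", index=False)
```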
The "transform" step in ETL also includes significant data cleaning. Data cleaning is the process of improving the quality of the data. For example, does the data need to be numeric versus text? Can it be summarized to improve speed? Does it need to be transformed into a "normal" probability? Another important concern with ETL is how often it should occur. Should you ETL every single sale as it happens in real-time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends. Is the value of having the most recent month greater than the cost of running an ETL batch once per month? If so, then you should probably run it more often. In other words, that decision affects the timeliness of your data. So, let's say you decide on ETL every morning at 4 a.m. which is when your website traffic is at its lowest. In the past 24 hours, let's say you've had 2500 new customers register. However, few of them filled out all of their profile information. Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added into your data warehouse, not your operational customer DB which they see)? Yes, you can. That would improve the missing values problem and possibly any accuracy problems with their contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean our data.
Data Warehouse
The data warehouse (a.k.a. analytical database) is where we load the recently transformed data. Data warehouses are also relational databases of a sort. However, the same rules of normalization (non-redundancy of data) no longer apply. Data warehouses have their own rules and schemas (e.g., the "star" schema) that are optimized for ad-hoc data analyses. However, we will not cover those schemas here because they are a little beyond the scope of this book.
In the image above depicting common vendors, the middle layer is labeled "data systems" because many of today's most popular data warehousing technologies include a data integration and staging layer between the ETL layer and the data warehouse itself. Hadoop is currently the most popular technology for integrating data from a wide range of sources into a data warehouse.
Many industry experts believe the days of the data warehouse are numbered. Hardware has progressed to the point where an entire data warehouse can exist in memory (think massive amounts of RAM). This means there is no separate physical copy of the data warehouse stored on disk; the ETL process simply loads the analytical data into the memory of servers that are "always on." Essentially, the data is pushed from slow secondary storage (hard drives) to fast primary storage (RAM). On one hand, this is more efficient because it doesn't require a second database with its own copy of the data. On the other hand, it's riskier because servers can crash and need rebooting (erasing the volatile contents of RAM and requiring that the entire analytical database be re-created). However, that may be an acceptable risk, since analytical data is not necessary for running core business processes.
Data Marts
Data marts give limited, role-based access to certain portions of the data warehouse. Data marts exist for two reasons. First, they limit information security vulnerability (i.e., not everyone sees, or has access to, all the data). Second, they reduce the complexity of the data for end users, who don't see anything they don't need.
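One common way to implement a data mart is as a set of restricted views over the warehouse that are granted only to the roles that need them. The sketch below uses Python's built-in sqlite3 module for illustration; the table and column names are hypothetical, and because SQLite has no user accounts, the role-based GRANT is shown as a comment the way it would appear in a server database such as PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# A data mart for the marketing department: only the rows and columns
# marketing needs, exposed as a view over the warehouse tables.
# Sensitive columns (e.g., salesperson compensation) are simply omitted.
conn.execute("""
    CREATE VIEW IF NOT EXISTS marketing_mart AS
    SELECT sale_date, product_line, store, total_amount
    FROM daily_sales
""")
conn.commit()

# In a server database (e.g., PostgreSQL), access would then be restricted
# by role:
#   GRANT SELECT ON marketing_mart TO marketing_role;
```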
Although data marts do serve those two purposes, they may be more an artifact of legacy systems than an intentional architecture. What does that mean? Well, data warehouses initially formed because individual organizational silos (marketing, accounting, HR, etc.) began developing their own mini analytical databases to aid their decision making. As organizations recognized the value of these mini analytical databases, they realized that the data storage should be centralized. As a result, in many cases, data warehouses were a "backtracked" technology designed to share the analytical data already being stored by each organizational department.
Presentation to End User
Unless you are in the IT department, you will likely never interact directly with any of those technologies beyond having read access (via SQL SELECT statements; see the supplementary chapter on structured query language) to the databases. Rather, you will use a suite of tools designed to help you analyze the data. These tools can range from broad applications like Microsoft Excel to more specific applications for data visualization (e.g., Power BI and Tableau) and data prediction (e.g., Azure Machine Learning Studio and AWS Machine Learning). Data mining refers to this process of analyzing large amounts of data (a.k.a. "big data") to create new information that is useful in making unstructured business decisions and solving unstructured business problems.
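As a simple illustration of that read-only access, here is how an analyst might pull summarized data out of the (hypothetical) marketing data mart from the previous sketch into pandas for further analysis or visualization:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")

# End users issue read-only SELECT statements against the data mart;
# they never modify the operational tables.
monthly = pd.read_sql(
    """
    SELECT strftime('%Y-%m', sale_date) AS month,
           product_line,
           SUM(total_amount) AS revenue
    FROM marketing_mart
    GROUP BY month, product_line
    ORDER BY month
    """,
    conn,
)
print(monthly.head())
```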
Data mining analyses can be broadly categorized into the two types mentioned above that distinguish business analytics from business intelligence. That is, data analysis can take a "descriptive" or "predictive" approach. Because these applications and techniques will be the focus of this BI module, we'll describe them in more detail below.