Data Warehouse
- ali@fuzzywireless.com
- Mar 3, 2022
- 5 min read
Connolly & Begg (2015) defined business intelligence (BI) as an umbrella term encompassing the collection and analysis of data to extract useful trends and information that facilitate corporate decision making. Some of the key enablers of BI are data warehouses, online analytical processing (OLAP), and data mining. A data warehouse is defined as a subject-oriented, time-variant, integrated and non-volatile collection of data to support decision making. It is subject-oriented because it is organized around the major subjects of an organization, such as customers, products and sales. It is integrated because it brings together varying sources of data. It is time-variant because it contains data from several sources with varying timestamps. Finally, it is non-volatile because it is refreshed by adding new data rather than replacing existing data.
Suknovic, Cupic, Martic, & Krulj (2005) outlined the important phases of a data warehouse implementation as follows:
1. Current situation analysis to fully understand the problem;
2. Selecting interesting data for analysis out of existing databases by filtering out or fixing incorrect, incomplete, missing, or wrongly entered data;
3. Filtering, selecting and formatting the data for the desired duration;
4. Extract, transform and load the data into a database in the desired schema, such as single table, star, snowflake, or galaxy;
5. Select and design an optimal OLAP system (relational OLAP, multidimensional OLAP, or hybrid OLAP).
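The extract-transform-load phases above can be sketched in a few lines. The sketch below is illustrative only: the source rows, field names and star-schema tables are assumptions, not part of any cited implementation.

```python
# Minimal ETL sketch: extract rows, filter/fix bad records, load into a
# star schema (one fact table plus dimension tables). All names are illustrative.

raw_sales = [  # "extracted" from a hypothetical source system
    {"date": "2022-01-05", "product": "Widget", "customer": "Acme", "amount": "120.50"},
    {"date": "2022-01-06", "product": "Widget", "customer": "",     "amount": "80.00"},
    {"date": "2022-01-06", "product": "Gadget", "customer": "Beta", "amount": "bad"},
]

def transform(rows):
    """Filter out incomplete or incorrect rows and cast types."""
    clean = []
    for r in rows:
        if not r["customer"]:            # missing data: drop the row
            continue
        try:
            amount = float(r["amount"])  # incorrect data: drop the row
        except ValueError:
            continue
        clean.append({**r, "amount": amount})
    return clean

def load(rows):
    """Load into a star schema: dimension tables keyed by surrogate ids."""
    products, customers, facts = {}, {}, []
    for r in rows:
        pid = products.setdefault(r["product"], len(products))
        cid = customers.setdefault(r["customer"], len(customers))
        facts.append({"date": r["date"], "product_id": pid,
                      "customer_id": cid, "amount": r["amount"]})
    return products, customers, facts

products, customers, facts = load(transform(raw_sales))
print(len(facts))  # only the one fully valid row survives
```

A real pipeline would persist the tables in a database and append (never overwrite) on refresh, matching the non-volatile property described earlier.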
Online Analytical Processing (OLAP)
Online analytical processing (OLAP) makes use of aggregated multidimensional data to provide quick access to information for advanced analysis (Connolly & Begg, 2015). OLAP analysis ranges from basic navigation, slicing, and dicing to complex time series analysis and modeling. Examples of OLAP tools are Microsoft SharePoint, SAP Business Intelligence, and Tableau. The performance of OLAP tools is benchmarked using several measures, such as just-in-time (JIT) information and analytical queries per minute (AQM).
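The slicing and roll-up operations mentioned above can be illustrated on a toy in-memory cube. The dimensions (region, quarter, product) and sales figures below are invented for illustration; a real OLAP engine operates on far larger pre-aggregated cubes.

```python
from collections import defaultdict

# Toy multidimensional data set: (region, quarter, product) -> sales.
# A "slice" fixes one dimension; a "roll-up" aggregates along dimensions.
cube = {
    ("East", "Q1", "Widget"): 100,
    ("East", "Q2", "Widget"): 150,
    ("West", "Q1", "Widget"): 200,
    ("West", "Q1", "Gadget"): 120,
}

def slice_cube(cube, region):
    """Slice: keep only the cells for one fixed region."""
    return {k: v for k, v in cube.items() if k[0] == region}

def rollup_by_region(cube):
    """Roll-up: aggregate sales over quarter and product."""
    totals = defaultdict(int)
    for (region, _quarter, _product), value in cube.items():
        totals[region] += value
    return dict(totals)

print(slice_cube(cube, "East"))   # the two East cells
print(rollup_by_region(cube))     # {'East': 250, 'West': 320}
```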
Data Mining
With the ever-increasing amount of data in data warehouses, it is difficult, if not impossible, to identify trends and relationships using simple queries and reporting tools (Connolly & Begg, 2015). Data mining is the process of extracting valid, comprehensible, actionable and previously unknown information from large databases to make critical business decisions. The key operations behind data mining techniques are predictive modeling (a supervised learning approach), database segmentation (an unsupervised learning approach that identifies segments or clusters), link analysis (relationships, pattern recognition, etc.), and deviation detection (identification of outliers using statistical and visualization techniques).
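Of the four operations, deviation detection is the simplest to sketch: flag values that lie far from the mean. The data and the two-standard-deviation threshold below are illustrative assumptions, not a prescribed method.

```python
import statistics

# Deviation detection sketch: flag values more than 2 standard
# deviations from the mean. Data and threshold are illustrative.
amounts = [102, 98, 105, 99, 101, 500, 97, 103]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print(outliers)  # the 500 stands out from the cluster around 100
```

In practice the same idea appears in fraud detection, where a transaction far outside a customer's usual pattern is flagged for review.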
Kashner & Zaima (2018) explained the difference between OLAP and data mining by defining OLAP as descriptive or factual analysis using queries whereas data mining is a discovery-driven analysis where statistical and machine learning techniques are used to make predictions and estimations about trends or outcomes within certain error percentage. For instance, Facebook is an example of data warehousing where they store the relevant and important information into central aggregate table to offer relevant ads and friends (Joseph, 2013). On the contrary, fraud detection of credit card usage is an example of data mining where companies utilize the history and knowledge of customer’s likely location to flag fraudulent transactions happened in other city.
Joseph (2013) highlighted three important steps of data mining:
1. Exploration – which involves data preparation, cleaning and transformation;
2. Model building – selection of best prediction model based on performance;
3. Deployment – where the best model is applied to new data sets to generate predictions of expected outcomes.
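The model-building and deployment steps can be sketched as choosing the candidate with the lowest validation error and then applying it to unseen data. The candidate models, data points and error metric below are all illustrative assumptions.

```python
# Model-building sketch: fit/score candidate models on a validation set,
# pick the best, then apply it to new data (deployment). All values illustrative.
valid = [(5, 10.1), (6, 11.8)]          # held-out (x, y) pairs

candidates = {
    "double": lambda x: 2 * x,          # hypothesis: y = 2x
    "triple": lambda x: 3 * x,          # hypothesis: y = 3x
}

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Model building: select the candidate with the lowest validation error.
best_name = min(candidates, key=lambda name: mse(candidates[name], valid))
best = candidates[best_name]

# Deployment: predict the outcome for an unseen input.
print(best_name, best(7))
```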
Various techniques and methods are used for data mining, some of which are:
1. Classification – grouping data into pre-determined classes;
2. Clustering – grouping data into clusters of similar groups;
3. Regression – for continuous quantitative data, a mathematical model is developed to fit the numerical data set;
4. Association – identify relationship between data;
5. Pattern matching – predicting trends and patterns;
6. Artificial neural network;
7. Decision tree;
8. Genetic algorithm etc. (Joseph, 2013).
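Regression (technique 3) is compact enough to show in full: fit a line y = ax + b to numerical data with ordinary least squares. The data points below are invented for illustration.

```python
# Regression sketch: ordinary least squares fit of y = a*x + b.
# Data values are illustrative.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 6.0, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept from the means
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(round(a, 2), round(b, 2))  # slope near 2, intercept near 0
```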
Data warehouse and data mining technologies help turn data into actionable knowledge (Joseph, 2013). These technologies allow statistical multidimensional analysis of data to analyze associations, dependencies, correlations and trends in business.
Relationship between Data Mining and Data Warehouse
A data warehouse is important for ensuring a foundation of quality data that is clean, consistent and accurate (Kashner & Zaima, 2018). Otherwise, mining inconsistent or dirty data for patterns and insights is meaningless, as the results would not represent the intended business model. Although data mining does not strictly require a data warehouse, without one the process of gathering, cleansing, and transforming data from varying sources becomes inefficient, error-prone, slow and time consuming. A data warehouse ensures that data quality is maintained, allowing end users to focus on analysis using mining techniques.
Example – Bad Data Quality in Health Care Industry
From the perspective of the health care industry, bad data quality can result in the following issues:
1. Degraded quality of care – leads to medical errors, which can damage patient health or even claim lives;
2. Possibility of privacy and civil liberty concerns;
3. Increased cost and inefficiencies – coding errors can result in waste of time and resources;
4. Liability risk;
5. Undermined benefits of IT investments intended to deliver billing, accounting, and streamlined service delivery (The Connecting for Health Common Framework, 2006).
The Connecting for Health Common Framework (2006) suggested standardizing fields and processes for entering data to reduce errors. Real-time quality checks, validation and feedback loops can also help improve data quality in the health care industry. Enforcing healthcare documentation guidelines while recording the care provided to patients helps in tracing and fixing issues. Billing issues can be fixed, or at least reduced, with automation. Last but not least, investing in training, awareness building and organizational change improves the overall quality of health care data.
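One way to apply real-time validation at data entry is to check each field against a standardized format before it is stored. The field names, id pattern and plausibility range below are hypothetical, chosen only to illustrate the idea.

```python
import re
from datetime import datetime

# Sketch of real-time field validation at data entry.
# Field names and rules are illustrative, not a clinical standard.
def validate_record(record):
    """Return a list of data-quality errors for a patient billing record."""
    errors = []
    if not re.fullmatch(r"[A-Z]\d{5}", record.get("patient_id", "")):
        errors.append("patient_id must match pattern A12345")
    try:
        datetime.strptime(record.get("visit_date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("visit_date must be YYYY-MM-DD")
    if not (0 < record.get("billed_amount", 0) < 1_000_000):
        errors.append("billed_amount out of plausible range")
    return errors

print(validate_record({"patient_id": "A12345",
                       "visit_date": "2022-03-01",
                       "billed_amount": 250.0}))   # [] -> clean record
```

Returning the full error list, rather than rejecting on the first failure, supports the feedback loop: the person entering the data can fix every problem in one pass.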
Data Mining Challenges
Even with the successful setup of data warehouse, organizations continue to struggle in getting quality insights due to:
1. Absence of a multi-domain skill set – expertise in business, IT and analytics is needed to perform useful data mining that results in actionable information;
2. Ineffective project management – incorrect business problem definition, weak data preparation, or model development without proper testing and validation;
3. Choosing the right architecture – centralized warehouse or distributed data marts;
4. Matching technology – data input and output tools, visualization and processing tools, etc. (Kashner & Zaima, 2018).
References
Connolly, T. & Begg, C. (2015). Database Systems: a practical approach to design, implementation, and management (6th ed.). Upper Saddle River, NJ: Pearson.
UC Berkeley (1998). Readings in database systems – Data warehousing, decision support & OLAP. Retrieved from http://redbook.cs.berkeley.edu/redbook3/lec28.html
Suknovic, M., Cupic, M., Martic, M., & Krulj, D. (2005). Data warehousing and data mining – a case study. Yugoslav Journal of Operations Research, 15(1), 125-145.
Joseph, M. (2013). Significance of data warehousing and data mining in business applications. International Journal of Soft Computing and Engineering, 3(1), 329-333.
Kashner, J. & Zaima, A. (2018). A data mining primer for the data warehouse professional. Retrieved from http://www.bi-bestpractices.com/view-articles/5827
The Connecting for Health Common Framework (2006). Background issues on data quality. Retrieved from http://bok.ahima.org/PdfView?oid=63654