Preprocessing of data
- ali@fuzzywireless.com
- Mar 4, 2022
- 3 min read
Background
The quality of business intelligence extracted from a data set through data mining depends not only on the method used but also on the quality of the source data (Garcia et al., 2016). Source data can suffer from several shortcomings, such as:
1. Missing data
2. Noise
3. Inconsistent values
4. Huge size
5. Superfluous data
The factors above degrade the quality of the data, which ultimately degrades the quality of the knowledge extracted by data mining.
For this reason, preprocessing the data before mining is an important step to improve the quality of the extracted knowledge (Garcia et al., 2016). Traditional preprocessing techniques are not suitable for big data because of its volume, velocity, and variety.
Introduction
Depending upon the type of imperfection in the data, multiple preprocessing techniques can be applied to improve the quality of data, for instance (Garcia et al., 2016):
1. Data cleaning
2. Data transformation
3. Data integration
4. Data normalization
5. Missing value imputation
6. Noise identification
Similarly, the complexity of the data can be reduced by applying techniques such as:
1. Feature selection
2. Instance selection
3. Discretization
Preprocessing Techniques
1. Missing value imputation
Often, values are missing for a given variable or attribute, which can cause errors or reduce the accuracy of the extracted knowledge (Gelman & Hill, 2007). One approach is to discard instances with missing values, which works in some cases but can introduce bias in others. Another approach is to estimate the missing values with probabilistic and statistical models, for example using maximum likelihood procedures.
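A minimal sketch of the simplest statistical approach, mean imputation, on hypothetical data (the column and values below are made up for illustration):

```python
def impute_mean(column):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))  # → [25, 31.0, 31, 40, 31.0, 28]
```

More sophisticated methods (maximum likelihood, multiple imputation) model the joint distribution of the attributes instead of filling in a single column-wise statistic.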
2. Noise Treatment
Noise can affect not only the input attributes but, in some cases, the output as well. One way to reduce noise is data polishing, which corrects instances whose labels have been corrupted. Another is to use noise filters, which identify and eliminate noisy instances from the training data without modifying the data mining technique itself.
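One classic noise filter is Edited Nearest Neighbours: an instance is dropped when its label disagrees with the majority label of its nearest neighbours. A sketch for one-dimensional features, with hypothetical data, assuming `(value, label)` pairs:

```python
def enn_filter(data, k=3):
    """Edited Nearest Neighbours: drop instances whose label disagrees
    with the majority label of their k nearest neighbours."""
    kept = []
    for i, (x, y) in enumerate(data):
        # Distances to every other instance, keeping the k closest.
        neighbours = sorted(
            (abs(x - x2), y2) for j, (x2, y2) in enumerate(data) if j != i
        )[:k]
        labels = [lab for _, lab in neighbours]
        if y == max(set(labels), key=labels.count):
            kept.append((x, y))
    return kept

data = [(1.0, "A"), (1.1, "A"), (1.15, "B"), (1.2, "A"), (5.0, "B"), (5.1, "B")]
print(enn_filter(data))  # the mislabeled (1.15, "B") is filtered out
```

The filtered set can then be fed to any learner unchanged, which is the appeal of filters over polishing.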
3. Feature selection
Feature selection is the removal of irrelevant and redundant features (Li & Liu, 2017). It saves processing time and resources, and lowers the cost of sampling and sensing.
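A simple filter-style criterion is to drop features with (near-)zero variance, since a constant feature carries no information. A sketch on a hypothetical row-major data set:

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.0):
    """Return the indices of features whose variance exceeds the threshold."""
    cols = list(zip(*rows))  # transpose rows into feature columns
    return [i for i, col in enumerate(cols) if variance(col) > threshold]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.0],
]
print(select_features(rows))  # → [0]  (the two constant features are dropped)
```

Real selectors also score relevance against the target (e.g. correlation or information gain), but the pipeline shape is the same: score each feature, keep those above a threshold.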
4. Instance selection
Instance selection reduces a big data set to a relevant subset of smaller size without compromising the quality of the extracted knowledge, so the goal of data mining can still be met (Garcia et al., 2016). Other benefits of instance selection are the removal of noise and redundancy.
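The simplest form is random sampling; stratifying by class keeps the subset representative of the original label distribution. A sketch, assuming `(features, label)` pairs:

```python
import random

def stratified_sample(data, fraction, seed=0):
    """Randomly keep a fraction of instances per class, preserving
    the class distribution of the original data."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

More elaborate selectors (condensation, editing) pick instances by how much they contribute to the decision boundary rather than at random.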
5. Instance generation
In contrast to instance selection, instance generation replaces the original data with artificial instances, filling regions that were not represented in the original data set (Garcia et al., 2016).
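A common generation scheme interpolates between randomly chosen pairs of real instances, so each synthetic point lies on the segment joining two originals. A sketch on hypothetical numeric feature vectors:

```python
import random

def generate_instances(instances, n, seed=0):
    """Create n synthetic instances by linear interpolation between
    randomly chosen pairs of real instances."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        a, b = rng.sample(instances, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Each generated vector stays inside the bounding box of its two parents, so the method fills gaps without extrapolating beyond the observed data.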
6. Discretization
Discretization converts quantitative data into qualitative data by dividing numerical features into a small number of non-overlapping intervals, making them discrete. Other benefits include data reduction and simplification, with the goal of minimal loss of information (Garcia et al., 2016).
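The simplest variant is equal-width binning: split the value range into k intervals of equal width and map each value to its interval index. A sketch on hypothetical temperature readings:

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k equal-width interval indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

temps = [10.0, 12.5, 18.0, 21.0, 29.9, 30.0]
print(equal_width_bins(temps, 4))  # → [0, 0, 1, 2, 3, 3]
```

Equal-frequency binning and supervised methods (e.g. entropy-based splitting) choose cut points from the data instead of the range, usually losing less information.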
7. Under-sampling and Over-sampling
Under-sampling creates a subset by eliminating instances of the majority class, while over-sampling creates a superset by replicating minority-class instances or by creating new ones through interpolation or extrapolation (Garcia et al., 2016).
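Both directions can be sketched in a few lines on hypothetical `(features, label)` pairs (random elimination for under-sampling, replication for over-sampling):

```python
import random

def undersample(data, majority_label, n, seed=0):
    """Randomly keep only n instances of the majority class."""
    rng = random.Random(seed)
    majority = [d for d in data if d[1] == majority_label]
    others = [d for d in data if d[1] != majority_label]
    return others + rng.sample(majority, n)

def oversample(data, minority_label, n, seed=0):
    """Add n replicated instances of the minority class."""
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == minority_label]
    return data + [rng.choice(minority) for _ in range(n)]
```

Interpolation-based over-sampling would generate new minority instances between existing ones instead of replicating them, at the cost of slightly more bookkeeping.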
Spark's machine learning libraries, MLlib and ML, implement most of the preprocessing techniques listed above (Brownlee, 2017). Similarly, the Mahout machine learning library performs most of this preprocessing on Hadoop.
References
Garcia, S., Gallego, S., Luengo, J., Benitez, J. & Herrera, F. (2016) Big data preprocessing: methods and prospects. Retrieved from https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0014-0
Gelman, A. & Hill, J. (2007) Data analysis using regression and multilevel/hierarchical models – missing data imputation. Retrieved from http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Li, J. & Liu, H. (2017) Challenges of feature selection. IEEE Intelligent Systems, 32(2), 9-15
Brownlee, J. (2017) 7 ways to handle large data files for machine learning. Retrieved from