Preprocessing of data
- ali@fuzzywireless.com
- Mar 4, 2022
- 3 min read
Background
The quality of business intelligence extracted from a data set through data mining depends not only on the method used but also on the quality of the source data (Garcia et al., 2016). Source data can suffer from several shortcomings, such as:
1. Missing data
2. Noise
3. Inconsistent values
4. Huge size
5. Superfluous data
The factors above degrade the quality of the data, which ultimately degrades the quality of the knowledge extracted by data mining.
For this reason, preprocessing the data before mining is an important step to improve the quality of the extracted knowledge (Garcia et al., 2016). Traditional preprocessing techniques are not suitable for big data because of its volume, velocity, and variety.
Introduction
Depending upon the type of imperfection in the data, multiple preprocessing techniques can be applied to improve the quality of data, for instance (Garcia et al., 2016):
1. Data cleaning
2. Data transformation
3. Data integration
4. Data normalization
5. Missing value imputation
6. Noise identification
Similarly, the complexity of the data can be reduced by applying techniques such as:
1. Feature selection
2. Instance selection
3. Discretization
Preprocessing Techniques
1. Missing value imputation
Often, values are missing for a given variable or attribute, which can cause errors or reduce the accuracy of the extracted knowledge (Gelman & Hill, 2007). One approach is to discard instances with missing values, which works in some cases but can introduce bias in others. Another approach is to estimate the missing values with probabilistic and statistical models, for example using maximum likelihood procedures.
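A minimal sketch of the simplest statistical approach, mean imputation, on hypothetical data (the column and values below are made up for illustration):

```python
def impute_mean(column):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 40, None, 28]
print(impute_mean(ages))  # → [25, 31.0, 31, 40, 31.0, 28]
```

More sophisticated methods (maximum likelihood, multiple imputation) model the joint distribution of the attributes instead of filling in a single column-wise statistic.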
2. Noise Treatment
Noise can affect not only the input attributes but, in some cases, the output as well. One way to reduce noise is data polishing, which corrects instances whose labels have been corrupted. Another is to use noise filters, which identify and eliminate noisy instances from the training data without modifying the data mining technique itself.
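One classic noise filter is Edited Nearest Neighbours: an instance is dropped when its label disagrees with the majority label of its nearest neighbours. A sketch for one-dimensional features, with hypothetical data, assuming `(value, label)` pairs:

```python
def enn_filter(data, k=3):
    """Edited Nearest Neighbours: drop instances whose label disagrees
    with the majority label of their k nearest neighbours."""
    kept = []
    for i, (x, y) in enumerate(data):
        # Distances to every other instance, keeping the k closest.
        neighbours = sorted(
            (abs(x - x2), y2) for j, (x2, y2) in enumerate(data) if j != i
        )[:k]
        labels = [lab for _, lab in neighbours]
        if y == max(set(labels), key=labels.count):
            kept.append((x, y))
    return kept

data = [(1.0, "A"), (1.1, "A"), (1.15, "B"), (1.2, "A"), (5.0, "B"), (5.1, "B")]
print(enn_filter(data))  # the mislabeled (1.15, "B") is filtered out
```

The filtered set can then be fed to any learner unchanged, which is the appeal of filters over polishing.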
3. Feature selection
Feature selection is the removal of irrelevant and redundant features (Li & Liu, 2017). It saves processing time and resources, and lowers the cost of sampling and sensing.
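A simple filter-style criterion is to drop features with (near-)zero variance, since a constant feature carries no information. A sketch on a hypothetical row-major data set:

```python
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def select_features(rows, threshold=0.0):
    """Return the indices of features whose variance exceeds the threshold."""
    cols = list(zip(*rows))  # transpose rows into feature columns
    return [i for i, col in enumerate(cols) if variance(col) > threshold]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.0],
]
print(select_features(rows))  # → [0]  (the two constant features are dropped)
```

Real selectors also score relevance against the target (e.g. correlation or information gain), but the pipeline shape is the same: score each feature, keep those above a threshold.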
4. Instance selection
Instance selection reduces a big data set to a relevant subset of smaller size without compromising the quality of the extracted knowledge, so the goal of data mining can still be met (Garcia et al., 2016). Other benefits of instance selection are the removal of noise and redundancy.
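The simplest form is random sampling; stratifying by class keeps the subset representative of the original label distribution. A sketch, assuming `(features, label)` pairs:

```python
import random

def stratified_sample(data, fraction, seed=0):
    """Randomly keep a fraction of instances per class, preserving
    the class distribution of the original data."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

More elaborate selectors (condensation, editing) pick instances by how much they contribute to the decision boundary rather than at random.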
5. Instance generation
In contrast to instance selection, instance generation replaces the original data with artificial instances, filling regions that were not represented in the original data set (Garcia et al., 2016).
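A common generation scheme interpolates between randomly chosen pairs of real instances, so each synthetic point lies on the segment joining two originals. A sketch on hypothetical numeric feature vectors:

```python
import random

def generate_instances(instances, n, seed=0):
    """Create n synthetic instances by linear interpolation between
    randomly chosen pairs of real instances."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        a, b = rng.sample(instances, 2)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Each generated vector stays inside the bounding box of its two parents, so the method fills gaps without extrapolating beyond the observed data.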
6. Discretization
Discretization converts quantitative data into qualitative data by dividing numerical features into a small number of non-overlapping intervals, making them discrete. Other benefits include data reduction and simplification, with the goal of minimal loss of information (Garcia et al., 2016).
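The simplest variant is equal-width binning: split the value range into k intervals of equal width and map each value to its interval index. A sketch on hypothetical temperature readings:

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k equal-width interval indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

temps = [10.0, 12.5, 18.0, 21.0, 29.9, 30.0]
print(equal_width_bins(temps, 4))  # → [0, 0, 1, 2, 3, 3]
```

Equal-frequency binning and supervised methods (e.g. entropy-based splitting) choose cut points from the data instead of the range, usually losing less information.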
7. Under-sampling and Over-sampling
Under-sampling creates a subset by eliminating instances of the majority class, while over-sampling creates a superset by replicating minority-class instances or by creating new ones through interpolation or extrapolation (Garcia et al., 2016).
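Both directions can be sketched in a few lines on hypothetical `(features, label)` pairs (random elimination for under-sampling, replication for over-sampling):

```python
import random

def undersample(data, majority_label, n, seed=0):
    """Randomly keep only n instances of the majority class."""
    rng = random.Random(seed)
    majority = [d for d in data if d[1] == majority_label]
    others = [d for d in data if d[1] != majority_label]
    return others + rng.sample(majority, n)

def oversample(data, minority_label, n, seed=0):
    """Add n replicated instances of the minority class."""
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == minority_label]
    return data + [rng.choice(minority) for _ in range(n)]
```

Interpolation-based over-sampling would generate new minority instances between existing ones instead of replicating them, at the cost of slightly more bookkeeping.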
Spark's machine learning libraries, MLlib and ML, implement most of the preprocessing techniques listed above (Brownlee, 2017). Similarly, the Mahout machine learning library performs most of this preprocessing on Hadoop.
References
Garcia, S., Gallego, S., Luengo, J., Benitez, J. & Herrera, F. (2016) Big data preprocessing: methods and prospects. Retrieved from https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0014-0
Gelman, A. & Hill, J. (2007) Data analysis using regression and multilevel/hierarchical models – missing data imputation. Retrieved from http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Li, J. & Liu, H. (2017) Challenges of feature selection. IEEE Intelligent Systems, 32(2), 9-15
Brownlee, J. (2017) 7 ways to handle large data files for machine learning. Retrieved from