Data Preprocessing
Data preprocessing (or dataset preprocessing) is an activity performed to improve the quality of data and to modify the data so that it is a better fit for a specific data mining technique.
Major Tasks in Data Preprocessing
Below are the four major tasks performed during data preprocessing.
- Data cleaning
- Data integration
- Data reduction
- Data transformation and data discretization
Data Cleaning
Data in the real world is dirty: it contains lots of potentially incorrect values caused by, e.g., faulty instruments, human or computer error, or transmission errors. Such data may be:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g. Occupation = “ ” (missing data)
- noisy: containing noise, errors, or outliers e.g., Salary = “−10” (an error)
- inconsistent: containing discrepancies in codes or names, e.g.
  - Age = “42”, Birthday = “03/07/2010”
  - Was rating “1, 2, 3”, now rating “A, B, C”
  - discrepancy between duplicate records
- intentional: e.g., disguised missing data, such as Jan. 1 recorded as everyone’s birthday
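The categories above can be turned into simple validation rules. The following is a minimal sketch, assuming hypothetical records and field names (`occupation`, `salary`, `age`, `birthday`) and a fixed reference date; real cleaning rules depend on the data set.

```python
from datetime import date

# Hypothetical records; names and values are made up for illustration.
records = [
    {"name": "Ann", "occupation": "", "salary": 52000, "age": 42, "birthday": date(2010, 7, 3)},
    {"name": "Bob", "occupation": "clerk", "salary": -10, "age": 34, "birthday": date(1990, 1, 15)},
    {"name": "Cy", "occupation": "nurse", "salary": 48000, "age": 29, "birthday": date(1995, 6, 1)},
]

def find_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    # Incomplete: a required attribute is empty or missing.
    if not record.get("occupation"):
        issues.append("incomplete: occupation missing")
    # Noisy: a value outside its plausible range.
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    b = record["birthday"]
    implied_age = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
    if abs(implied_age - record["age"]) > 1:
        issues.append("inconsistent: age does not match birthday")
    return issues

today = date(2025, 1, 1)  # fixed reference date (an assumption) so results are reproducible
report = {r["name"]: find_issues(r, today) for r in records}
for name, issues in report.items():
    print(name, issues or "clean")
```

Rules like these only flag suspicious records; deciding whether to correct, impute, or drop a flagged value is a separate step.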
Data Integration
Data integration: Combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#
- Entity identification problem: Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
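The three integration steps can be sketched on a toy example. This is only an illustration under stated assumptions: the two sources, the `height_cm`/`height_in` attributes, the nickname table, and the 1 cm tolerance are all hypothetical.

```python
# Two hypothetical sources describing the same customers with different schemas.
source_a = [
    {"cust-id": 1, "name": "William Clinton", "height_cm": 188.0},
    {"cust-id": 2, "name": "Ann Lee", "height_cm": 165.0},
]
source_b = [
    {"cust-#": 1, "name": "Bill Clinton", "height_in": 74.0},
    {"cust-#": 2, "name": "Ann Lee", "height_in": 70.0},
]

# Schema integration: map source B's attribute names onto source A's schema.
SCHEMA_MAP_B = {"cust-#": "cust-id", "height_in": "height_cm"}

# Entity identification: a tiny hand-made alias table (an assumption).
NICKNAMES = {"bill": "william"}

def canonical_name(name):
    """Lower-case a name and expand a known nickname in the first position."""
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])

def normalize_b(row):
    """Rename B's attributes and convert inches to centimetres."""
    out = {}
    for key, value in row.items():
        if key == "height_in":
            value = round(value * 2.54, 1)
        out[SCHEMA_MAP_B.get(key, key)] = value
    return out

def find_conflicts(rows_a, rows_b, tolerance_cm=1.0):
    """Return (cust-id, attribute) pairs whose values disagree across sources."""
    b_by_id = {r["cust-id"]: r for r in map(normalize_b, rows_b)}
    conflicts = []
    for a in rows_a:
        b = b_by_id.get(a["cust-id"])
        if b is None:
            continue  # no counterpart in source B
        if canonical_name(a["name"]) != canonical_name(b["name"]):
            conflicts.append((a["cust-id"], "name"))
        if abs(a["height_cm"] - b["height_cm"]) > tolerance_cm:
            conflicts.append((a["cust-id"], "height_cm"))
    return conflicts

conflicts = find_conflicts(source_a, source_b)
print(conflicts)
```

After renaming and unit conversion, customer 1 matches in both sources, while customer 2 still shows a genuine height conflict that has to be resolved by some policy (trusting one source, averaging, or manual review).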
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Why data reduction? A database or data warehouse may store terabytes of data, so complex data analysis may take a very long time to run on the complete data set.
- Data reduction strategies
  - Dimensionality reduction, e.g., remove unimportant attributes
    - Wavelet transforms
    - Principal Components Analysis (PCA)
    - Feature subset selection, feature creation
  - Numerosity reduction (some simply call it data reduction)
    - Regression and log-linear models
    - Histograms, clustering, sampling
    - Data cube aggregation
  - Data compression
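Two of the numerosity reduction techniques listed above, sampling and histograms, are easy to sketch with the standard library. The data, the sample size, and the number of bins below are assumptions chosen for illustration; the other strategies (PCA, wavelets, regression) are not covered here.

```python
import random
from collections import Counter

def simple_random_sample(values, n, seed=0):
    """Draw n values without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(values, n)

def equal_width_histogram(values, bins):
    """Summarize values as (interval, count) pairs over equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [((lo, hi), len(values))]
    width = (hi - lo) / bins
    counts = Counter()
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i]) for i in range(bins)]

prices = list(range(1, 101))              # hypothetical data set: 1..100
sample = simple_random_sample(prices, 10)  # 100 values reduced to 10
hist = equal_width_histogram(prices, 4)    # 100 values reduced to 4 counts

print(sample)
for (start, end), count in hist:
    print(f"{start:.2f} to {end:.2f}: {count}")
```

The sample keeps individual values but fewer of them, whereas the histogram keeps only per-interval counts; which reduction is acceptable depends on the analysis that follows.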
Discretization
Three types of attributes
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Numeric: quantitative values, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
- Interval labels can then be used to replace actual data values
- Discretization reduces data size
- It can be supervised (uses class information) or unsupervised
- It can work by splitting (top-down) or merging (bottom-up)
- Discretization can be performed recursively on an attribute
- It prepares the data for further analysis, e.g., classification
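As an illustration of unsupervised discretization, the sketch below shows two common binning schemes, equal-width and equal-frequency, and replaces the original values with interval labels. The function names and the sample values are assumptions made for this example, not part of any particular library.

```python
def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to discretize")
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def assign_equal_width(values, edges):
    """Map each value to the index of the interval it falls in."""
    lo, k = edges[0], len(edges) - 1
    width = edges[1] - edges[0]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def assign_equal_frequency(values, k):
    """Give each bin roughly the same number of values, based on rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

def interval_labels(edges):
    """Build labels such as '[4, 14)'; the last interval is closed on the right."""
    pairs = list(zip(edges, edges[1:]))
    last = len(pairs) - 1
    return [f"[{a:g}, {b:g}" + ("]" if i == last else ")") for i, (a, b) in enumerate(pairs)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical numeric attribute
edges = equal_width_edges(prices, 3)
width_bins = assign_equal_width(prices, edges)
labels = interval_labels(edges)
discretized = [labels[b] for b in width_bins]  # interval labels replace the values
freq_bins = assign_equal_frequency(prices, 3)

print(discretized)
print(freq_bins)
```

Equal-width binning is sensitive to outliers because a single extreme value stretches every interval, while the rank-based equal-frequency scheme balances bin sizes but may place equal values in different bins when a tie straddles a boundary.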