Data Preprocessing

Data preprocessing (or dataset preprocessing) is an activity performed to improve the quality of data and to modify it so that it better fits a specific data mining technique.


Major Tasks in Data Preprocessing

Below are the four major tasks performed during data preprocessing.

  • Data cleaning
  • Data integration
  • Data reduction
  • Data transformation and data discretization

Data Cleaning

Data in the real world is dirty: much of it is potentially incorrect, e.g., due to faulty instruments, human or computer error, or transmission errors. Dirty data can take several forms (a short pandas sketch of common fixes follows the list).

  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., Occupation = “ ” (missing data)
  • noisy: containing noise, errors, or outliers, e.g., Salary = “−10” (an error)
  • inconsistent: containing discrepancies in codes or names, e.g.,
    • Age = “42”, Birthday = “03/07/2010”
    • was rating “1, 2, 3”, now rating “A, B, C”
    • discrepancies between duplicate records
  • intentional (e.g., disguised missing data), such as Jan. 1 recorded as everyone’s birthday
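
For instance, here is a minimal pandas sketch of one common fix for each kind of dirty data; the column names, values, and cleaning rules are illustrative assumptions, not a fixed recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical records showing the three kinds of dirty data above
df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher"],  # incomplete: "" hides a missing value
    "salary":     [52000, -10, 61000],          # noisy: -10 is an obvious error
    "rating":     ["1", "A", "2"],              # inconsistent: old vs. new coding scheme
})

# Incomplete: expose empty strings as missing, then fill with a placeholder
df["occupation"] = df["occupation"].replace("", np.nan).fillna("unknown")

# Noisy: mark impossible values as missing and impute with the median
df.loc[df["salary"] < 0, "salary"] = np.nan
df["salary"] = df["salary"].fillna(df["salary"].median())

# Inconsistent: map the old letter ratings onto the current numeric scheme
df["rating"] = df["rating"].replace({"A": "1", "B": "2", "C": "3"})

print(df)
```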

Data Integration

Data integration combines data from multiple sources into a coherent store. A small pandas sketch of these steps follows the list below.

  • Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#
  • Entity identification problem: Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
  • Detecting and resolving data value conflicts
    • For the same real world entity, attribute values from different sources are different
    • Possible reasons: different representations, different scales, e.g., metric vs. British units.
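
As an illustration, here is a small pandas sketch of schema integration, unit reconciliation, and merging; the table contents, column names, and inch-to-centimetre conversion are assumptions made up for the example:

```python
import pandas as pd

# Two sources describing the same customers under different schemas
a = pd.DataFrame({"cust_id": [1, 2],
                  "name": ["Bill Clinton", "Ada Lovelace"],
                  "height_cm": [188.0, 165.0]})
b = pd.DataFrame({"cust_num": [1, 2],
                  "name": ["William Clinton", "Ada Lovelace"],
                  "height_in": [74.0, 65.0]})

# Schema integration: reconcile A.cust_id with B.cust_num
b = b.rename(columns={"cust_num": "cust_id"})

# Data value conflicts: convert British units to metric before comparing
b["height_cm"] = b["height_in"] * 2.54
b = b.drop(columns="height_in")

# Merge into one coherent store; suffixes expose remaining conflicts
# (e.g., "Bill Clinton" vs. "William Clinton" for the same real-world entity)
merged = a.merge(b, on="cust_id", suffixes=("_a", "_b"))
print(merged)
```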

Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

  • Why data reduction? A database or data warehouse may store terabytes of data, and complex data analysis may take a very long time to run on the complete data set.
  • The main data reduction strategies are listed below.

Dimensionality reduction, e.g., remove unimportant attributes (a PCA sketch follows this list)

  • Wavelet transforms
  • Principal Components Analysis (PCA)
  • Feature subset selection, feature creation
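
For example, a brief scikit-learn sketch of PCA; the synthetic data and the 95% variance threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 features that are really driven by 2 factors
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X += 0.01 * rng.normal(size=X.shape)  # small measurement noise

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # e.g., (100, 5) -> (100, 2)
```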

Numerosity reduction (some simply call it data reduction); a sampling and histogram sketch follows this list

  • Regression and Log-Linear Models
  • Histograms, clustering, sampling
  • Data cube aggregation
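
Here is a quick sketch of two numerosity-reduction ideas, sampling and histograms, using made-up salary data; the data size, distribution, and bin count are arbitrary assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
salaries = pd.Series(rng.lognormal(mean=10.8, sigma=0.4, size=100_000))

# Sampling: a 1% simple random sample often preserves summary statistics
sample = salaries.sample(frac=0.01, random_state=0)
print(f"full mean={salaries.mean():.0f}, sample mean={sample.mean():.0f}")

# Histogram: replace 100,000 raw values with 20 bin counts
counts, edges = np.histogram(salaries, bins=20)
print(f"stored values: {len(salaries)} -> {len(counts)} bin counts")
```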

Data compression (often treated as a third strategy alongside dimensionality and numerosity reduction)

Discretization

Three types of attributes

  • Nominal—values from an unordered set, e.g., color, profession
  • Ordinal—values from an ordered set, e.g., military or academic rank
  • Numeric—quantitative values, e.g., integers or real numbers

Discretization: divide the range of a continuous attribute into intervals. A short pandas example follows the list below.

  • Interval labels can then be used to replace actual data values
  • Reduce data size by discretization
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an attribute
  • Prepare for further analysis, e.g., classification
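
For instance, a short pandas sketch contrasting two unsupervised, top-down splits; the ages and interval labels are invented for the example:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 58, 71, 89])

# Equal-width bins: the age range is split into 3 intervals of equal width
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency bins: each interval holds roughly the same number of values
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Interval labels replace the raw ages, reducing the data's granularity
print(pd.DataFrame({"age": ages, "width": equal_width, "freq": equal_freq}))
```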

