Classification & Prediction in Data Mining
Classification

Classification involves dividing up objects so that each is assigned to one of a number of mutually exclusive and exhaustive categories known as classes.
Many practical decision-making tasks can be formulated as classification problems.

  • customers who are likely to buy or not buy a particular product in a supermarket
  • people who are at high, medium or low risk of acquiring a certain illness
  • student projects worthy of a distinction, merit, pass or fail grade
  • objects on a radar display which correspond to vehicles, people, buildings or trees
  • people who closely resemble, slightly resemble or do not resemble someone seen committing a crime
  • houses that are likely to rise in value, fall in value or have an unchanged value in 12 months’ time
  • people who are at high, medium or low risk of a car accident in the next 12 months
  • people who are likely to vote for each of a number of political parties (or none)
  • the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very unlikely).



Classification vs. Prediction

Classification

  • predicts categorical class labels (discrete or nominal).
  • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data.

Prediction

  • models continuous-valued functions, i.e., predicts unknown or missing values.
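The distinction can be sketched in a few lines of Python. Both functions below are hypothetical toy models over invented house data: the first returns a discrete class label, the second a continuous value.

```python
# Classification: predict a categorical class label ("cheap"/"expensive").
def classify_house(area_sqm):
    """Assign a class label based on a hypothetical 100 sqm threshold."""
    return "expensive" if area_sqm >= 100 else "cheap"

# Prediction: model a continuous-valued function (a hypothetical linear fit).
def predict_price(area_sqm):
    """Predict a numeric price in dollars."""
    return 50_000 + 2_000 * area_sqm

print(classify_house(120))  # a discrete class label: "expensive"
print(predict_price(120))   # a continuous value: 290000
```

The inputs are the same; only the type of output differs, which is what separates the two tasks.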

Supervised vs. Unsupervised Learning

Supervised learning (classification)

  • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set

Unsupervised learning (clustering)

  • The class labels of the training data are unknown.
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
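The difference in the data itself can be illustrated with a hypothetical toy example: supervised data carries a class label with each observation, while unsupervised data carries only the measurements, and any grouping (the crude threshold below is arbitrary) must be discovered.

```python
# Supervised: each observation is accompanied by a class label.
labeled = [([1.0, 2.0], "A"), ([8.0, 9.0], "B")]

# Unsupervised: measurements only; no labels are given.
unlabeled = [[1.1, 2.1], [7.9, 9.2], [0.9, 1.8]]

# A crude grouping by the first feature (the threshold 5 is arbitrary),
# standing in for a real clustering algorithm.
clusters = {"low": [], "high": []}
for point in unlabeled:
    clusters["low" if point[0] < 5 else "high"].append(point)
print(clusters)
```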

 
Classification—A Two-Step Process

  • Model construction: describing a set of predetermined classes
    • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    • The set of tuples used for model construction is the training set
    • The model is represented as classification rules, decision trees, or mathematical formulae
  • Model usage: for classifying future or unknown objects
    • Estimate accuracy of the model
      • The known label of a test sample is compared with the classified result from the model
      • Accuracy rate is the percentage of test set samples that are correctly classified by the model
      • The test set is independent of the training set; otherwise the accuracy estimate will be over-optimistic (over-fitting)
  • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
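The two steps can be sketched end to end. The dataset below is hypothetical (income, label) tuples, and the threshold-rule model is a deliberately crude stand-in for a real classifier such as a decision tree.

```python
# Hypothetical loan data: (income in $1000s, class label).
train = [(20, "no"), (30, "no"), (60, "yes"), (80, "yes")]
test  = [(25, "no"), (70, "yes"), (40, "yes")]

# Step 1 - model construction: learn a threshold rule from the training set.
def build_model(training_set):
    yes_incomes = [x for x, label in training_set if label == "yes"]
    no_incomes  = [x for x, label in training_set if label == "no"]
    threshold = (min(yes_incomes) + max(no_incomes)) / 2  # crude split point
    return lambda x: "yes" if x >= threshold else "no"

model = build_model(train)

# Step 2 - model usage: estimate accuracy on an independent test set.
correct = sum(1 for x, label in test if model(x) == label)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.0%}")
```

Here the model misclassifies one of the three test tuples, so the estimated accuracy is 67%; only if that rate were acceptable would the model be used on tuples with unknown labels.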




 
Classification Techniques

  • Decision Tree based Methods
  • Rule-based Methods
  • Neural Networks
    • computational networks that simulate the decision process in neurons (networks of nerve cells)
  • Naive Bayes and Bayesian Belief Networks
    • uses probability theory to find the most likely of the possible classifications
  • Support Vector Machines
    • fits a boundary around a region of similar points and uses that boundary to classify new points
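As one concrete example from the list, a categorical Naive Bayes classifier can be written in a few lines. The weather data below is hypothetical; the classifier picks the class that maximises P(class) x P(feature | class), which is the probability-theory idea referred to above.

```python
from collections import Counter, defaultdict

# Hypothetical weather data: (outlook, class label "play"/"stay").
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "play"), ("sunny", "stay"), ("rainy", "stay")]

priors = Counter(label for _, label in data)   # class counts
cond = defaultdict(Counter)                    # cond[label][feature] counts
for feature, label in data:
    cond[label][feature] += 1

def most_likely(feature):
    """Return the class maximising P(class) * P(feature | class)."""
    scores = {
        label: (priors[label] / len(data)) * (cond[label][feature] / priors[label])
        for label in priors
    }
    return max(scores, key=scores.get)

print(most_likely("sunny"))  # "play": sunny days were mostly play days
```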




Lazy vs. Eager Learning

  • Lazy vs. eager learning
    • Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple
    • Eager learning (the methods discussed above): Given a training set, constructs a classification model before receiving new (e.g., test) data to classify
  • Lazy: less time in training but more time in predicting
  • Accuracy
    • Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
    • Eager: must commit to a single hypothesis that covers the entire instance space

Lazy Learner: Instance-Based Methods

  • Instance-based learning:
    • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
  • Typical approaches
    • k-nearest neighbor approach
      • Instances represented as points in a Euclidean space.
    • Locally weighted regression
      • Constructs local approximation
    • Case-based reasoning
      • Uses symbolic representations and knowledge-based inference
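The k-nearest neighbor approach can be sketched directly: there is no training step, only stored examples, and a query point is classified by a majority vote among its k closest neighbors in Euclidean space. The 2-D points and labels below are hypothetical.

```python
import math
from collections import Counter

# Stored training examples: ((x, y) point, class label). Lazy learning
# keeps these as-is; no model is built in advance.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((8.0, 8.0), "B"), ((7.5, 9.0), "B")]

def knn_classify(query, k=3):
    """Vote among the k training points closest to the query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 1.5)))  # "A": closest points are the A examples
```

All the distance computation happens at classification time, which is why lazy methods spend less time training but more time predicting.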