Data Mining Classification & Prediction
Classification
Classification involves dividing up objects so that each is assigned to one of a number of mutually exclusive and exhaustive categories known as classes.
Many practical decision-making tasks can be formulated as classification problems.
- customers who are likely to buy or not buy a particular product in a supermarket
- people who are at high, medium or low risk of acquiring a certain illness
- student projects worthy of a distinction, merit, pass or fail grade
- objects on a radar display which correspond to vehicles, people, buildings or trees
- people who closely resemble, slightly resemble or do not resemble someone seen committing a crime
- houses that are likely to rise in value, fall in value or have an unchanged value in 12 months’ time
- people who are at high, medium or low risk of a car accident in the next 12 months
- people who are likely to vote for each of a number of political parties (or none)
- the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very unlikely).
Classification vs. Prediction
Classification
- predicts categorical class labels (discrete or nominal).
- constructs a model from the training set and the values (class labels) of a class-label attribute, and uses that model to classify new data.
Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values.
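The distinction can be made concrete with a toy sketch. All names, rules, and numbers below are hypothetical, chosen only to contrast a discrete class label with a continuous predicted value:

```python
# Hypothetical illustration: the same inputs can feed a classifier or a predictor.
# Classification returns a discrete label; prediction returns a continuous value.

def classify_risk(age, speeding_tickets):
    """Toy classifier: assigns a categorical class label."""
    if speeding_tickets >= 3 or age < 21:
        return "high"
    if speeding_tickets >= 1:
        return "medium"
    return "low"

def predict_premium(age, speeding_tickets):
    """Toy predictor: models a continuous-valued function."""
    base = 500.0
    return base + 120.0 * speeding_tickets + (200.0 if age < 21 else 0.0)

print(classify_risk(19, 0))    # a discrete class label ("high")
print(predict_premium(19, 0))  # a continuous value (700.0)
```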
Supervised vs. Unsupervised Learning
Supervised learning (classification)
- Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
Unsupervised learning (clustering)
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
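The difference shows up in the shape of the data itself. In this hypothetical sketch, supervised data carries a label with every observation, while unsupervised data is just measurements and the groups must be discovered (here by a trivial nearest-of-two-centers assignment):

```python
# Supervised: each training observation is accompanied by a class label.
labeled = [((1.0, 2.0), "spam"), ((3.5, 0.5), "ham")]

# Unsupervised: only measurements are given; the algorithm must discover
# groups. A minimal 1-D "clustering" assigns each point to the nearer
# of two candidate cluster centers c0 and c1.
def assign_clusters(points, c0, c1):
    return [0 if abs(p - c0) <= abs(p - c1) else 1 for p in points]

points = [0.9, 1.1, 7.8, 8.2]          # no labels anywhere
print(assign_clusters(points, c0=1.0, c1=8.0))  # [0, 0, 1, 1]
```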
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
- Estimate accuracy of the model
- The known label of each test sample is compared with the class assigned by the model
- Accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set must be independent of the training set; otherwise the accuracy estimate will be optimistic (over-fitting)
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
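The two steps can be sketched end to end. This is a deliberately trivial model (predict the majority class seen for each feature value), chosen only to show construction on a training set followed by accuracy estimation on an independent test set; the data and names are invented:

```python
# Step 1 (model construction) on a training set, then Step 2 (model usage)
# with accuracy estimated on an independent, held-out test set.
from collections import Counter

def build_model(training_set):
    """Trivial model: for each feature value, predict its majority class."""
    by_value = {}
    for x, label in training_set:
        by_value.setdefault(x, []).append(label)
    return {x: Counter(labels).most_common(1)[0][0]
            for x, labels in by_value.items()}

def accuracy(model, test_set, default="unknown"):
    """Fraction of test samples whose known label matches the model's output."""
    correct = sum(1 for x, label in test_set if model.get(x, default) == label)
    return correct / len(test_set)

train = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("rainy", "stay")]
test  = [("sunny", "play"), ("rainy", "play")]   # independent of training set
model = build_model(train)
print(accuracy(model, test))  # 0.5
```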
Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Neural Networks
- computational networks that simulate the decision process in neurons (networks of nerve cells)
- Naive Bayes and Bayesian Belief Networks
- uses probability theory to find the most likely of the possible classifications
- Support Vector Machines
- fits a boundary (a maximum-margin hyperplane) that separates the regions of points belonging to different classes; uses the boundary to classify a new point
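The Naive Bayes idea from the list above can be sketched in a few lines: pick the class maximizing P(class) multiplied by the product of P(feature | class), with probabilities estimated by counting. The data and the Laplace-smoothing choice here are illustrative assumptions, not part of the original notes:

```python
# Minimal categorical Naive Bayes: counts give P(class) and P(feature|class).
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (features_tuple, label) training examples."""
    class_counts = Counter(label for _, label in rows)
    feat_counts = defaultdict(Counter)   # (feature_index, label) -> value counts
    for feats, label in rows:
        for i, v in enumerate(feats):
            feat_counts[(i, label)][v] += 1
    return class_counts, feat_counts, len(rows)

def classify_nb(model, feats):
    """Return the class with the highest P(class) * prod P(feature_i | class)."""
    class_counts, feat_counts, n = model
    best, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / n                                   # prior P(class)
        for i, v in enumerate(feats):
            counter = feat_counts[(i, label)]
            # Laplace (add-one) smoothing avoids zero probabilities
            score *= (counter[v] + 1) / (c + len(counter) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

rows = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
model = train_nb(rows)
print(classify_nb(model, ("rainy", "mild")))  # "yes"
```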
Lazy vs. Eager Learning
- Lazy vs. eager learning
- Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple
- Eager learning (the methods discussed above): Given a training set, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
- A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
- Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods
- Instance-based learning:
- Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- Typical approaches
- k-nearest neighbor approach
- Instances represented as points in a Euclidean space.
- Locally weighted regression
- Constructs local approximation
- Case-based reasoning
- Uses symbolic representations and knowledge-based inference
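The k-nearest neighbor approach above can be sketched directly: training is just storing the examples, and all work is deferred until a query point arrives, when the k nearest stored points (by Euclidean distance) vote on the label. The data and the choice of k here are illustrative:

```python
# k-NN sketch: instances are points in Euclidean space; classification is
# deferred until a query arrives (lazy evaluation) and decided by majority
# vote among the k nearest stored training examples.
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """training: list of (point_tuple, label). Returns the majority label
    among the k training points nearest to query."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
            ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")]
print(knn_classify(training, (4.9, 5.0), k=3))  # "b"
```

Note that "training" here costs nothing, while each query requires a pass over all stored examples, matching the lazy trade-off described above.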