Data Objects
Data sets are made up of data objects. A data object represents an entity.
Examples:
- sales database: customers, store items, sales
- medical database: patients, treatments
- university database: students, professors, courses
Data objects are also called samples, examples, instances, data points, objects, or tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
E.g., customer_ID, name, address
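The row/column mapping above can be sketched in a few lines of Python. This is only an illustration: the records, the values, and the helper names `attribute_names` and `column` are hypothetical, not part of any particular database API.

```python
# Hypothetical customer records: each dict is one data object (a row),
# and each key is an attribute (a column).
customers = [
    {"customer_ID": 1, "name": "Ada", "address": "12 Oak St"},
    {"customer_ID": 2, "name": "Ben", "address": "7 Elm Ave"},
]


def attribute_names(objects):
    """Return the attributes present in every data object, in first-seen order."""
    if not objects:
        return []
    return [name for name in objects[0] if all(name in obj for obj in objects)]


def column(objects, attribute):
    """Collect the values of one attribute across all data objects."""
    return [obj[attribute] for obj in objects]


print(attribute_names(customers))
print(column(customers, "name"))
```

Reading one dict gives a single data object; reading one key across all dicts gives a single attribute.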
Attribute Types:
- Nominal
- Binary
- Ordinal
- Numeric (quantitative): interval-scaled or ratio-scaled
Nominal: categories, states, or “names of things”
- Hair_color = {auburn, black, blond, brown, grey, red, white}
- marital status, occupation, ID numbers, zip codes
Binary
- Nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important, e.g., gender
- Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)
- Convention: assign 1 to the most important outcome (e.g., HIV positive)
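As a minimal sketch of the coding convention for asymmetric binary attributes, the snippet below maps the outcome of interest to 1 and everything else to 0. The function name `encode_asymmetric` and the sample test results are made up for illustration.

```python
def encode_asymmetric(values, important):
    """Encode a two-state attribute: 1 for the important outcome, 0 otherwise."""
    return [1 if value == important else 0 for value in values]


# Hypothetical medical test results for four patients.
test_results = ["negative", "positive", "negative", "negative"]
encoded = encode_asymmetric(test_results, important="positive")

print(encoded)
```

With this encoding, summing the column counts how many objects show the important outcome.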
Ordinal
- Values have a meaningful order (ranking), but the magnitude of the difference between successive values is not known.
- E.g., Size = {small, medium, large}, grades, army rankings
Numeric
- Quantity (integer or real-valued)
Interval
- Measured on a scale of equal-sized units
- Values have order
- E.g., temperature in °C or °F, calendar dates
- No true zero-point
Ratio
- Inherent zero-point
- We can speak of one value as being a multiple of another (10 K is twice as high as 5 K).
- e.g., temperature in Kelvin, length, counts, monetary quantities
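The practical difference between interval and ratio scales can be checked with a short calculation. The sketch below assumes the standard conversion K = °C + 273.15; the variable names are illustrative only.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (K = degrees C + 273.15)."""
    return celsius + 273.15


a_c, b_c = 10.0, 5.0

# Differences are meaningful on an interval scale: the gap is the same
# whether it is measured in Celsius or in Kelvin.
diff_c = a_c - b_c
diff_k = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# Ratios are only meaningful on a ratio scale. The Celsius ratio is 2.0,
# but that number depends on where the arbitrary zero was placed.
ratio_c = a_c / b_c
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(diff_c, diff_k)
print(ratio_c, ratio_k)
```

Here 10 °C is not "twice as hot" as 5 °C: on the Kelvin scale, which has a true zero, the ratio of the same two temperatures is only about 1.018.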
Discrete vs. Continuous Attributes
Discrete Attribute
- Has only a finite or countably infinite set of values
- E.g., zip codes, profession, or the set of words in a collection of documents
- Sometimes represented as integer variables
- Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
- Has real numbers as attribute values
- E.g., temperature, height, or weight
- Practically, real values can only be measured and represented using a finite number of digits
- Continuous attributes are typically represented as floating-point variables
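The points about representation can be illustrated briefly in Python. The values and the choice of a string for the zip code are assumptions made for this sketch.

```python
import math
import sys

# Discrete attributes: a finite or countably infinite set of values.
zip_code = "90210"   # nominal code, stored here as a string (an assumption)
num_children = 3     # count, stored as an integer

# Continuous attribute: stored as a floating-point number.
height_m = 1.75

# Floats carry only a finite number of reliable decimal digits
# (typically 15 for an IEEE 754 double).
reliable_digits = sys.float_info.dig

# Because of that finite precision, exact equality tests can fail
# where a tolerance-based comparison succeeds.
exact_equal = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)

print(reliable_digits, exact_equal, close_enough)
```

This is why comparisons on continuous attributes are usually made with a tolerance rather than with strict equality.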
Types of Data Sets
Record
- Relational records
- Data matrix, e.g., numerical matrix, crosstabs
- Document data: text documents represented as term-frequency vectors
- Transaction data
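To make the term-frequency representation of document data concrete, here is a minimal sketch using only the standard library. Splitting on whitespace is a deliberate simplification, and the function name and sample documents are hypothetical.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return a sorted vocabulary and one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = [
    "data mining finds patterns in data",
    "graph mining finds patterns in graphs",
]
vocabulary, vectors = term_frequency_vectors(docs)

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, so document data fits the record model.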
Graph and network
- World Wide Web
- Social or information networks
- Molecular structures
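Graph and network data are commonly stored as an adjacency structure. The following sketch assumes an undirected graph; the helper `build_adjacency` and the friendship edges are invented for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


# Hypothetical social network: each pair is one friendship link.
friendships = [("ann", "bob"), ("bob", "cy"), ("ann", "cy"), ("cy", "dee")]
network = build_adjacency(friendships)

# Degree of a node = number of neighbours.
degree = {person: len(neighbours) for person, neighbours in network.items()}

print(degree)
```

Unlike record data, the information here lies in the links between objects rather than in a fixed set of attributes per object.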
Ordered
- Video data: sequence of images
- Temporal data: time-series
- Sequential data: transaction sequences
- Genetic sequence data
Spatial, image and multimedia
- Spatial data: maps
- Image data
- Video data