Core Data Mining Functionalities

1. Characterization (Data Summarization)

Definition

Data characterization is the process of summarizing the general features and traits of a specific group of data (called the target class). It looks at a single group and describes “what it looks like” using charts, tables, or summary statistics.

Detailed Explanation

Instead of looking at millions of individual rows, characterization compresses the data to give you a big-picture overview of a specific audience or behavior.

E-Commerce Example

  • Target Class: “High-Value Customers” (users who spend more than $1,000 per year).
  • Mined Characteristics: The data mining system analyzes this group and summarizes that $75\%$ of them are aged 25–40, $80\%$ shop primarily via mobile apps, and their peak shopping hours are between 8:00 PM and 11:00 PM.
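
As a rough illustration of the summary above, the following sketch filters a target class and computes a few descriptive percentages. The records, the field layout, and the `characterize` helper are hypothetical and exist only for this example; a real system would read from a database or data warehouse.

```python
from collections import Counter

# Hypothetical customer records: (annual_spend_usd, age, main_channel)
customers = [
    (1500, 28, "mobile"),
    (2200, 35, "mobile"),
    (300, 52, "web"),
    (1100, 44, "web"),
    (1800, 31, "mobile"),
    (950, 23, "mobile"),
]

def characterize(records, min_spend=1000):
    """Summarize the target class: customers who spend more than min_spend."""
    target = [r for r in records if r[0] > min_spend]
    if not target:
        raise ValueError("target class is empty")
    n = len(target)
    aged_25_40 = sum(1 for _, age, _ in target if 25 <= age <= 40)
    channels = Counter(channel for _, _, channel in target)
    return {
        "size": n,
        "pct_aged_25_40": 100 * aged_25_40 / n,
        "pct_mobile": 100 * channels["mobile"] / n,
    }

summary = characterize(customers)
print(summary)
```

Only one group is examined here; nothing is compared against customers outside the target class.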

2. Discrimination (Data Comparison)

Definition

Data discrimination is the process of comparing the features of your target class against one or more contrasting groups. It focuses on the differences that separate one group from another.

Detailed Explanation

While characterization looks at just one group, discrimination always looks at two or more groups to find out exactly what makes them different.

E-Commerce Example

  • Target Class: Regular buyers of organic groceries.
  • Contrasting Class: Regular buyers of conventional (non-organic) groceries.
  • Mined Discrimination: The system compares both and discovers that organic buyers have a $40\%$ higher average household income and consistently read product reviews, whereas conventional buyers are $65\%$ more likely to purchase items only when a discount coupon is applied.
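
A minimal sketch of the comparison, assuming a small hypothetical dataset: the same profile is computed for the target class and the contrasting class, and the result of interest is the gap between them. The `shoppers` records and the `profile` helper are invented for illustration.

```python
from statistics import mean

# Hypothetical shoppers: (group, household_income_usd, used_coupon)
shoppers = [
    ("organic", 84000, False),
    ("organic", 70000, False),
    ("organic", 98000, True),
    ("conventional", 60000, True),
    ("conventional", 55000, True),
    ("conventional", 65000, False),
]

def profile(records, group):
    """Compute the same summary statistics for one named group."""
    rows = [r for r in records if r[0] == group]
    return {
        "avg_income": mean(r[1] for r in rows),
        "coupon_rate": sum(r[2] for r in rows) / len(rows),
    }

target = profile(shoppers, "organic")
contrast = profile(shoppers, "conventional")

# Discrimination reports the difference between classes, not one class alone.
income_lift_pct = 100 * (target["avg_income"] / contrast["avg_income"] - 1)
print(round(income_lift_pct, 1), target["coupon_rate"], contrast["coupon_rate"])
```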

3. Association and Correlation Analysis

Definition

This functionality looks for hidden links or relationships between items in a database. It identifies frequent patterns, that is, items or conditions that consistently occur together.

Detailed Explanation

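A common application is Market Basket Analysis, which estimates how likely a customer is to buy item $B$ given that item $A$ is already in the basket. A rule of this kind describes co-occurrence; it does not prove that $A$ causes the purchase of $B$.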

E-Commerce Example

  • The Pattern: By analyzing millions of shopping carts, the algorithm discovers an association rule:

$$\text{Smartphone} \rightarrow \text{Screen Protector } [\text{Support} = 5\%, \text{Confidence} = 80\%]$$

  • Meaning: In $5\%$ of all platform transactions, these items are bought together (Support). Furthermore, $80\%$ of the people who put a smartphone in their cart also add a screen protector (Confidence). The platform uses this data to build its “Frequently Bought Together” recommendation box.
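
Support and confidence can be computed directly from their definitions. The sketch below uses ten hypothetical baskets, so the numbers differ from the rule above; the `support` and `confidence` helpers are illustrative names, not a library API.

```python
# Hypothetical transactions, each represented as a set of item names
transactions = [
    {"smartphone", "screen protector"},
    {"smartphone", "screen protector", "case"},
    {"smartphone", "case"},
    {"laptop"},
    {"screen protector"},
    {"smartphone", "screen protector"},
    {"headphones"},
    {"laptop", "mouse"},
    {"smartphone", "screen protector"},
    {"charger"},
]

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"smartphone", "screen protector"}, transactions)
rule_confidence = confidence({"smartphone"}, {"screen protector"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data, 4 of 10 baskets contain both items and 4 of the 5 smartphone baskets also contain a screen protector. Production systems typically use an algorithm such as Apriori or FP-Growth to avoid checking every possible itemset.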

4. Classification

Definition

Classification is a two-step process that builds a model to predict a categorical label (a discrete group or “tag”) for new data. First, it learns from a labeled training dataset; then, it applies those rules to unclassified data.

Detailed Explanation

Classification is a form of supervised learning. The categories are predefined (e.g., “Yes/No”, “Safe/Risky”, “Spam/Not Spam”). The model acts like a sorting machine, placing items into labeled boxes.

E-Commerce Example

  • The Task: Automatically sorting incoming customer service emails.
  • The Process: The system is trained on thousands of old emails that humans already tagged. It learns that words like “broken”, “shattered”, or “faulty” belong to the category "Defective Product", while words like “where is” or “tracking” belong to "Shipping Issue". When a new email arrives, the model automatically tags it and routes it to the right department.
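
The two steps (learn from labeled examples, then label new data) can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is chosen here only as one simple option; the text does not say which algorithm the platform uses, and the training emails and the `train` and `classify` helpers are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical emails that humans have already tagged
training = [
    ("my screen arrived broken", "Defective Product"),
    ("the blender is faulty", "Defective Product"),
    ("glass was shattered in the box", "Defective Product"),
    ("where is my order", "Shipping Issue"),
    ("tracking number not working", "Shipping Issue"),
    ("where is the tracking link", "Shipping Issue"),
]

def train(examples):
    """Step 1: count words per label in the labeled training set."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Step 2: pick the label with the highest log-probability for new text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace (add-one) smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify("item arrived broken and faulty", model))
print(classify("where is my tracking", model))
```

The label set is fixed by the training data: the model can only return one of the categories it has already seen.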

5. Regression

Definition

Regression predicts a continuous numeric value, including values that are missing or lie in the future. While classification predicts a label (like a word), regression predicts a number (like a price, temperature, or total amount).

Detailed Explanation

Regression maps out the mathematical relationship between independent variables (like age, income, history) and a dependent variable (the numeric outcome you want to predict).

E-Commerce Example

  • The Task: Predicting a customer’s Customer Lifetime Value (CLV), here taken as the estimated dollar amount they will spend on the platform over the next 12 months.
  • The Process: The regression model analyzes a user’s account age, average monthly clicks, and past purchase amounts to output a specific numeric prediction, such as: “This user is predicted to spend $450.50 next year.”
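
A minimal sketch of the idea, simplified to a single independent variable (past 12-month spend) fitted by ordinary least squares. The data points are hypothetical and deliberately noise-free, and `fit_line` and `predict` are illustrative helpers; a real CLV model would use several features and noisy data.

```python
# Hypothetical history: (spend_in_past_12_months, spend_in_following_12_months)
history = [(100, 170), (200, 320), (300, 470), (400, 620)]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Return a numeric estimate for a new value of the independent variable."""
    return slope * x + intercept

slope, intercept = fit_line(history)
print(slope, intercept, predict(300, slope, intercept))
```

The output is a number on a continuous scale rather than a category, which is the distinction drawn in the definition above.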

6. Clustering

Definition

Clustering groups data objects together based on how similar they are to one another. Unlike classification, there are no predefined labels or classes. The algorithm finds natural patterns on its own.

Detailed Explanation

Clustering is unsupervised learning. The data mining tool evaluates the data points, measures the “distance” between them, and groups similar points close together while keeping different points far apart.

E-Commerce Example

  • The Task: Discovering natural student shopper personas.
  • The Process: You feed the algorithm data on student purchasing habits without giving it any categories. The algorithm automatically clusters them into three distinct, unlabelled groups based on behavior:
    • Cluster 1: Low spending, high purchases of textbooks and stationery (The Academic).
    • Cluster 2: Medium spending, high purchases of instant ramen, snacks, and coffee (The Late-Night Studier).
    • Cluster 3: High spending on video games and tech gadgets (The Gamer Student).
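
The distance-based grouping can be sketched with a bare-bones k-means loop. Two assumptions are made for the sake of a short, repeatable example: the number of clusters is fixed at three, and the starting centroids are hand-picked rather than chosen at random. The purchase counts and the `kmeans` helper are hypothetical.

```python
import math

# Hypothetical monthly purchase counts per student: (textbooks, snacks, games)
students = [
    (9, 1, 0), (8, 2, 1), (10, 1, 1),
    (1, 9, 1), (2, 10, 0), (1, 8, 2),
    (0, 2, 9), (1, 1, 10), (2, 1, 8),
]

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = per-dimension mean of its cluster (kept unchanged if the cluster is empty)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Assumption: start from one hand-picked student per expected group
clusters, centroids = kmeans(students, [students[0], students[3], students[6]])
for c in clusters:
    print(c)
```

The algorithm returns groups without names. Labels such as “The Academic” are assigned afterwards by a person who inspects each cluster.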

7. Outlier Analysis

Definition

Outlier analysis (also called anomaly detection) identifies data points that behave completely differently from the rest of the dataset. These points do not comply with the general model or rules of the data.

Detailed Explanation

In many applications, data deviations are treated as noise and thrown away. However, in outlier analysis, these rare exceptions are exactly what the researcher is looking for because they represent unique events, errors, or fraud.

E-Commerce Example

  • The Task: Credit card fraud detection.
  • The Process: A user living in London typically logs in from a UK IP address and spends an average of £30 per transaction on clothing. Suddenly, their account registers a transaction for a £3,000 diamond ring from an IP address located in a different country, just 5 minutes after their last London login. The system identifies this data point as an extreme outlier, flags it as potential fraud, and freezes the transaction immediately.
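
One simple way to flag such a point is a z-score against the account's own history. The sketch below looks at the transaction amount only; the location and timing signals from the example are left out, and the threshold of 3 standard deviations is a common rule of thumb rather than a fixed standard. The `history` values and helper names are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical past transaction amounts in GBP for one account
history = [28.0, 31.5, 25.0, 35.0, 30.0, 29.5, 33.0, 27.0]

def z_score(amount, past):
    """Number of standard deviations between amount and the historical mean."""
    return (amount - mean(past)) / stdev(past)

def is_outlier(amount, past, threshold=3.0):
    """Flag amounts that lie more than threshold standard deviations from the mean."""
    return abs(z_score(amount, past)) > threshold

print(is_outlier(3000.0, history))  # the diamond ring purchase
print(is_outlier(32.0, history))    # a typical clothing purchase
```

The new transaction is scored against past behaviour instead of being included in the mean and standard deviation, so a single extreme value cannot mask itself.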

Quick Summary

| Functionality | Core Objective | Type of Output / Result |
|---|---|---|
| Characterization | Summarize one specific group | Descriptions, charts, or summaries of a target class. |
| Discrimination | Compare two or more groups | Distinctive differences between classes. |
| Association | Find items that occur together | Rules showing item relationships (e.g., $A \rightarrow B$). |
| Classification | Predict a discrete group or category | Predefined tags or labels (e.g., "Fraud" vs "Legitimate"). |
| Regression | Predict a continuous number | Specific numerical values (e.g., Revenue, Price, Age). |
| Clustering | Find hidden, natural groups | Groupings of data based purely on similarity. |
| Outlier Analysis | Catch anomalies and rare exceptions | Identification of data points that break normal patterns. |