Definition
"Machine Learning is the study of algorithms and statistical models that computer systems use to perform tasks by learning patterns from data, without using rule-based programming."
Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
How an ML Model Is Trained and Tested
- Data Collection: Gather relevant data (structured/unstructured)
- Data Preprocessing: Clean data, handle missing values, normalize
- Feature Selection: Choose important input variables
- Model Selection: Choose algorithm (Linear Regression, KNN, SVM, etc.)
- Training: Feed training data → model learns patterns by adjusting parameters
- Validation: Evaluate on validation set, tune hyperparameters
- Testing: Test on unseen test data to evaluate final performance
- Deployment: Use model for real-world predictions
- Training: Model learns weights/parameters
- Testing: Model predicts on new, unseen data
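- Overfitting: Model fits the training data too closely (including noise), so it performs well on training data but poorly on test data
- Underfitting: Model is too simple to capture the pattern, so it performs poorly on both training and test data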
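The split into training, validation and test data described above can be sketched in plain Python. This is only an illustration: the function name `train_val_test_split`, the 70/15/15 ratios and the fixed seed are assumptions made for the example, not something the notes prescribe.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle rows, then cut them into train / validation / test lists."""
    rows = list(rows)                       # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)       # seeded shuffle for reproducibility
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]                    # held back for the final evaluation
    val = rows[n_test:n_test + n_val]       # used for hyperparameter tuning
    train = rows[n_test + n_val:]           # used to fit the model parameters
    return train, val, test

data = list(range(100))                     # hypothetical dataset of 100 samples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

The test list is set aside first so that it is never touched while tuning on the validation list.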
1. Supervised Learning
Definition: Model is trained on labeled data (input + correct output given).
Goal: Learn a mapping function f(X) → Y
Examples: Email spam detection, house price prediction, medical diagnosis
Algorithms: Linear Regression, Logistic Regression, SVM, KNN, Decision Trees
2. Unsupervised Learning
Definition: Model is trained on unlabeled data (only input, no output labels).
Goal: Find hidden patterns or groupings in data
Examples: Customer segmentation, anomaly detection, topic modeling
Algorithms: K-Means, DBSCAN, PCA, Apriori
3. Reinforcement Learning
Definition: An agent learns by interacting with an environment and receiving rewards or penalties for its actions.
- Agent — learner/decision maker
- Environment — what agent interacts with
- State (S) — current situation
- Action (A) — what agent does
- Reward (R) — feedback signal
- Policy (π) — strategy of agent
Comparison Table [MUST DRAW IN EXAM]
| Feature | Supervised | Unsupervised | Semi-Supervised | Reinforcement |
|---|---|---|---|---|
| Labels | Required (all) | Not required | Partial | Reward signal |
| Goal | Predict output | Find patterns | Improve with few labels | Maximize reward |
| Output | Class/Value | Clusters/Patterns | Class/Value | Policy |
| Example | Spam detection | Clustering | Image tagging | Game AI |
| Algorithms | SVM, KNN, LR | K-Means, PCA | Self-training | Q-Learning |
NumPy — Numerical Python
Purpose: Provides support for large multi-dimensional arrays and matrices, along with mathematical functions. Faster than Python lists (implemented in C).
Key Features: N-dimensional array object (ndarray), Broadcasting, Linear algebra, Fourier transform, Random number generation
TensorFlow
Developer: Google Brain Team (2015) | Purpose: Open-source library for numerical computation and large-scale Machine Learning using data flow graphs.
Key Features: Tensors = multi-dimensional arrays, Automatic differentiation (tf.GradientTape) for backpropagation, GPU/TPU acceleration, Keras API (high-level), Eager execution mode
NumPy vs TensorFlow
| Feature | NumPy | TensorFlow |
|---|---|---|
| Purpose | Array computation | Deep Learning |
| GPU Support | No | Yes |
| Auto-differentiation | No | Yes |
| Level | Low-level | High-level (with Keras) |
Definition
Linear Regression is a supervised learning algorithm used to predict a continuous output variable (Y) based on one or more input features (X) by fitting a straight line to the data.
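Formula
Y = a₀ + a₁X, where a₀ = intercept and a₁ = slope
a₁ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
a₀ = (ΣY − a₁ΣX) / n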
Solved Numerical — PYQ 2025
X: 2.0 | 3.0 | 4.0 | 5.0 | 6.0
Y: 3.00 | 4.00 | 3.40 | 6.00 | 5.00
Step 1: Calculation Table
| X | Y | XY | X² |
|---|---|---|---|
| 2 | 3.00 | 6.00 | 4 |
| 3 | 4.00 | 12.00 | 9 |
| 4 | 3.40 | 13.60 | 16 |
| 5 | 6.00 | 30.00 | 25 |
| 6 | 5.00 | 30.00 | 36 |
| ΣX=20 | ΣY=21.40 | ΣXY=91.60 | ΣX²=90 |
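Step 2: Calculate a₁ (slope), n = 5
a₁ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²) = (5×91.60 − 20×21.40) / (5×90 − 20²) = (458 − 428) / (450 − 400) = 30/50 = 0.60
Step 3: Calculate a₀ (intercept)
a₀ = (ΣY − a₁ΣX) / n = (21.40 − 0.60×20) / 5 = 9.40/5 = 1.88
Step 4: Regression Equation & Prediction
Y = 1.88 + 0.60X
When X = 7.0, predicted Y = 1.88 + 0.60×7.0 = 6.08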
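The numerical above can be checked with a short script. It is a minimal sketch that uses the standard least-squares formulas for slope and intercept on the X and Y values given in the question; the function name `fit_line` is an assumption made for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a0 + a1*X; returns (a0, a1)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope

xs = [2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 4.0, 3.4, 6.0, 5.0]
a0, a1 = fit_line(xs, ys)
y_at_7 = a0 + a1 * 7.0                      # prediction for X = 7.0
print(round(a0, 2), round(a1, 2), round(y_at_7, 2))
```

The printed values should agree with the hand calculation (prediction 6.08 at X = 7.0).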
Sick: 72 test +ve, 28 test –ve. Healthy: 28 test +ve, 872 test –ve.
Construct confusion matrix. Calculate Accuracy, Precision, Recall, F1-Score.
Step 1: Identify Values
- TP (Sick predicted Sick) = 72
- FN (Sick predicted Healthy) = 28
- FP (Healthy predicted Sick) = 28
- TN (Healthy predicted Healthy) = 872
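Step 2: Confusion Matrix
| | Predicted Sick (+ve) | Predicted Healthy (–ve) | Total |
|---|---|---|---|
| Actual Sick | TP = 72 | FN = 28 | 100 |
| Actual Healthy | FP = 28 | TN = 872 | 900 |
| Total | 100 | 900 | 1000 |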
Step 3: Calculate Metrics
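Accuracy = (TP+TN)/Total = (72+872)/1000 = 0.944 (94.4%)
Precision = TP/(TP+FP) = 72/(72+28) = 0.72
Recall = TP/(TP+FN) = 72/(72+28) = 0.72
F1-Score = 2×Precision×Recall/(Precision+Recall) = 2×0.72×0.72/1.44 = 0.72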
- TP & TN = friends (both correct)
- FP = False Alarm (predicted sick but healthy)
- FN = Missed case (predicted healthy but sick)
- Precision = "When I say sick, how often right?"
- Recall = "Of all sick people, how many did I catch?"
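The four metrics can be computed directly from the TP, FN, FP and TN counts identified in Step 1. The sketch below uses the standard definitions; the function name `classification_metrics` is an assumption made for the example.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total            # all correct predictions
    precision = tp / (tp + fp)              # of predicted sick, how many are sick
    recall = tp / (tp + fn)                 # of actually sick, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=72, fn=28, fp=28, tn=872)
print(metrics)
```

Accuracy is high here mainly because healthy cases dominate the data, which is why precision and recall are reported alongside it.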
Definition
K-Nearest Neighbour (K-NN) is a non-parametric, instance-based supervised learning algorithm used for classification and regression. It classifies a new point based on the majority class of its K nearest neighbors.
Euclidean Distance Formula
d(p, q) = √[(x₂ − x₁)² + (y₂ − y₁)²] for points p = (x₁, y₁) and q = (x₂, y₂)
Algorithm Steps
- Choose value of K
- Calculate Euclidean distance from test point to all training points
- Sort distances in ascending order
- Select K nearest neighbors
- For classification: take majority vote of K neighbors
- Assign that class to test point
Solved Numerical [PYQ 2025 Mid-Sem Q4b]
Training data: D1:(2,1)=Y, D2:(4,2)=N, D3:(3,3)=Y, D4:(3,5)=N, D5:(4,3)=N, D6:(5,4)=Y. Classify the test point (4,6) using K = 3.
Euclidean Distance Calculations
| Point | Coord | Class | Distance from (4,6) |
|---|---|---|---|
| D1 | (2,1) | Y | √[(4-2)²+(6-1)²] = √[4+25] = √29 ≈ 5.39 |
| D2 | (4,2) | N | √[(4-4)²+(6-2)²] = √[0+16] = 4.00 |
| D3 | (3,3) | Y | √[(4-3)²+(6-3)²] = √[1+9] = √10 ≈ 3.16 |
| D4 | (3,5) | N | √[(4-3)²+(6-5)²] = √[1+1] = √2 ≈ 1.41 |
| D5 | (4,3) | N | √[(4-4)²+(6-3)²] = √[0+9] = 3.00 |
| D6 | (5,4) | Y | √[(4-5)²+(6-4)²] = √[1+4] = √5 ≈ 2.24 |
K=3 Nearest Neighbors (sorted)
| Rank | Point | Distance | Class |
|---|---|---|---|
| 1 | D4 | 1.41 | N |
| 2 | D6 | 2.24 | Y |
| 3 | D5 | 3.00 | N |
Majority vote among the 3 neighbors: N = 2, Y = 1
∴ Class of (4,6) = N
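The same classification can be reproduced in a few lines of Python. This is a minimal sketch of K-NN with Euclidean distance and majority vote on the six points above; the function name `knn_classify` is an assumption, and ties in the vote are not handled in any special way.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of ((x, y), label). Returns the majority label of the k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest = by_distance[:k]               # k closest training points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]       # label with the most votes

train = [((2, 1), "Y"), ((4, 2), "N"), ((3, 3), "Y"),
         ((3, 5), "N"), ((4, 3), "N"), ((5, 4), "Y")]
predicted = knn_classify(train, (4, 6), k=3)
print(predicted)
```

An odd K is usually chosen for two-class problems so that the vote cannot tie.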
Sigmoid Function
σ(z) = 1/(1+e^(−z)), where z = a₀ + a₁x. The output lies in (0,1) and is read as a probability.
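Solved Numerical [PYQ 2026 Mid-Sem Q4a]
Given: a₀ = −64, a₁ = 2, hours studied x = 33
Step 1: Calculate z
z = a₀ + a₁x = −64 + 2×33 = 2
Step 2: Apply Sigmoid
σ(2) = 1/(1+e^(−2)) ≈ 1/(1+0.1353) ≈ 0.8808
Since σ(2) ≈ 0.88 > 0.5 → Student PASSES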
Verify with Data Table
| Hours (x) | Pass/Fail | z = -64+2x | σ(z) | Predicted |
|---|---|---|---|---|
| 24 | 0 (Fail) | -16 | ≈0.0000 | Fail ✓ |
| 15 | 0 (Fail) | -34 | ≈0.0000 | Fail ✓ |
| 28 | 1 (Pass) | -8 | ≈0.0003 | Fail ✗ |
| 33 | 1 (Pass) | 2 | ≈0.8808 | Pass ✓ |
| 39 | 1 (Pass) | 14 | ≈0.9999 | Pass ✓ |
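The table rows can be recomputed with the sketch below. It assumes the model z = −64 + 2x shown in the table header and a 0.5 decision threshold; the function names are chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def pass_probability(hours, a0=-64.0, a1=2.0):
    """Probability of passing for a given number of study hours."""
    return sigmoid(a0 + a1 * hours)

p = pass_probability(33)
label = "Pass" if p > 0.5 else "Fail"
print(round(p, 4), label)

for hours in (24, 15, 28, 33, 39):          # the rows of the table above
    print(hours, round(pass_probability(hours), 4))
```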
Definition
SVM is a supervised learning algorithm that finds the optimal hyperplane which maximizes the margin between two classes.
Kernel Functions [List any 4]
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | K(x,y) = xᵀy | Linearly separable data |
| Polynomial | K(x,y) = (xᵀy + c)^d | Non-linear boundaries |
| RBF/Gaussian | K(x,y) = exp(-γ||x-y||²) | Most common, non-linear |
| Sigmoid | K(x,y) = tanh(αxᵀy + c) | Neural network-like |
Kernel Trick
The Kernel Trick computes the dot product in high-dimensional space without explicitly transforming data, making SVM computationally efficient for non-linearly separable data.
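A small numeric check makes the kernel trick concrete. The sketch below assumes a degree-2 polynomial kernel with c = 0 on 2-D points and compares it with an explicit 3-D feature map; the function names and the sample vectors are assumptions made for the example.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel with c = 0: K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, y)                        # stays in the 2-D input space
via_mapping = dot(explicit_map(x), explicit_map(y))   # transforms to 3-D first
print(via_kernel, via_mapping)
```

Both values agree (up to floating-point rounding), but the kernel never builds the higher-dimensional vectors. For the RBF kernel the explicit space is infinite-dimensional, so only the kernel form is usable.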
Cross Validation
Definition: Cross Validation is a technique to evaluate ML models by training and testing on different subsets of the data, which helps detect overfitting and gives a more reliable performance estimate than a single train/test split.
Methods
- K-Fold CV: Divide data into K equal folds. Train on K-1 folds, test on 1 fold. Repeat K times, average results.
- Leave-One-Out (LOOCV): K-fold with K = n, so each data point is the test set exactly once. Uses almost all data for training each time, but is slow on large datasets.
- Stratified K-Fold: Like K-fold but preserves class proportions in each fold.
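How K-Fold partitions the data can be sketched with index lists. This is a minimal illustration without shuffling or stratification; the function name `k_fold_indices` is an assumption made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                  # current fold
        train_idx = indices[:start] + indices[start + size:]    # remaining K-1 folds
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)
```

Setting k equal to the number of samples gives Leave-One-Out. In practice the data is usually shuffled before the folds are cut.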
Bias-Variance Tradeoff
| Model | Bias | Variance | Problem |
|---|---|---|---|
| Simple (Linear) | High | Low | Underfitting |
| Complex (Deep Tree) | Low | High | Overfitting |
| Optimal Model | Low | Low | Best! |
Goal: Balance both! Not too simple, not too complex.
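Entropy Formula
Entropy(S) = −Σ pᵢ log₂pᵢ, where pᵢ is the proportion of class i in S
Information Gain
IG(S, A) = Entropy(S) − Σᵥ (|Sᵥ|/|S|) × Entropy(Sᵥ), summed over the values v of attribute A
Gini Index
Gini(S) = 1 − Σ pᵢ²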
| Measure | Formula | Range | Best Split |
|---|---|---|---|
| Entropy | −Σ pᵢ log₂pᵢ | 0 to 1 (two classes); 0 to log₂k for k classes | Highest IG |
| Gini Index | 1 − Σ pᵢ² | 0 to 0.5 (two classes); 0 to 1 − 1/k for k classes | Lowest Gini |
- Calculate Entropy of whole dataset
- For each attribute, calculate Information Gain
- Select attribute with highest IG as root node
- Split dataset, repeat recursively for each branch
- Stop when: all data in leaf is same class OR no attributes left
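The split measures used in these steps can be computed as follows. This is a minimal sketch using the standard definitions of entropy, Gini index and information gain; the toy rows, the attribute name `windy` and the labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: the attribute separates the classes perfectly
rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["stay", "stay", "play", "play"]
ig = information_gain(rows, labels, "windy")
print(entropy(labels), gini(labels), ig)
```

A perfect split gives the maximum possible gain, equal to the entropy of the parent set.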
Definition
K-Means is an unsupervised learning algorithm that partitions n data points into K clusters by minimizing the sum of squared distances from each point to its cluster centroid.
Algorithm Steps
- Initialize: Randomly choose K centroids from data
- Assignment: Assign each data point to nearest centroid (using Euclidean distance)
- Update: Recalculate centroids as mean of all points in each cluster
- Repeat: Go to step 2 until centroids do not change (convergence)
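The assignment and update steps above can be sketched in plain Python. To keep the run reproducible, the initial centroids are passed in explicitly instead of being chosen at random; the function name `k_means` and the sample points are assumptions made for the example.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain K-Means on coordinate tuples; `centroids` are the initial centres."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: put every point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster (an empty cluster keeps its centroid)
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:      # convergence: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
```

Different starting centroids can lead to different final clusters, which is why K-Means is often run several times.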
Advantages
- Simple and fast
- Scales to large data
- Easy to implement
Disadvantages
- Need to specify K
- Sensitive to outliers
- Assumes spherical clusters
Definition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points based on density. It can find clusters of arbitrary shapes and identifies outliers as noise.
Key Parameters
- ε (epsilon): Radius of neighborhood around a point
- MinPts: Minimum number of points required to form a dense region
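Point Types
- Core point: has at least MinPts points within radius ε
- Border point: has fewer than MinPts neighbors but lies within ε of a core point
- Noise point: neither a core point nor a border point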
Algorithm Steps
- Pick an unvisited point
- Find all points within ε radius (neighbors)
- If neighbors ≥ MinPts → Core point → start new cluster
- Expand cluster by recursively adding density-connected points
- If neighbors < MinPts → mark as noise (may change to border later)
- Repeat until all points visited
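The steps above can be sketched as a small function. This is an illustrative version, not a reference implementation: it scans all points for every neighborhood query, counts a point as its own neighbor (an assumption), and the names `dbscan`, `eps` and `min_pts` are chosen for the example.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (the point itself included)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                # may become a border point later
            continue
        cluster_id += 1                      # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id       # noise reachable from a core point is a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point (50, 50) has no neighbors other than itself, so it is labelled as noise instead of being forced into a cluster.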
DBSCAN vs K-Means
| Feature | K-Means | DBSCAN |
|---|---|---|
| Need to specify K? | Yes | No |
| Cluster Shape | Spherical only | Any shape |
| Outlier handling | Assigns to cluster | Labels as noise |
| Sensitive to outliers | Yes | No |
Definition
PCA is an unsupervised dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving maximum variance.
Steps of PCA
- Standardize: Normalize data (mean=0, variance=1)
- Covariance Matrix: Compute covariance matrix of features
- Eigenvalues & Eigenvectors: Compute from covariance matrix
- Sort: Sort eigenvalues in descending order
- Select: Choose top K eigenvectors (principal components)
- Project: Transform data onto new K-dimensional space
Applications
- Face recognition, Image compression
- Remove noise from data
- Visualization of high-dimensional data
- Speed up machine learning algorithms
Definition
Apriori is an algorithm for frequent itemset mining and association rule learning. Used to discover interesting relationships (rules) between variables in large databases.
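Key Measures
- Support(A) = (number of transactions containing A) / (total number of transactions)
- Confidence(A → B) = Support(A ∪ B) / Support(A)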
Apriori Principle
If an itemset is frequent, then all of its subsets are also frequent.
Contrapositive: If any subset is infrequent → the superset is also infrequent (prune it!).
Algorithm Steps
- Find all frequent 1-itemsets (≥ min_support)
- Generate candidate 2-itemsets from frequent 1-itemsets
- Prune candidates with infrequent subsets
- Find frequent 2-itemsets
- Repeat until no more frequent itemsets found
- Generate association rules from frequent itemsets
- Keep rules with Confidence ≥ min_confidence
Application: Market Basket Analysis — "Customers who buy bread also buy butter"
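The algorithm steps can be sketched on a tiny basket dataset. The transactions and thresholds below are hypothetical, and the function names `support`, `apriori` and `confidence` are assumptions made for the example. Support is taken as the fraction of transactions containing an itemset, and confidence as Support(A ∪ B) / Support(A).

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]           # candidate 1-itemsets
    frequent, k = {}, 1
    while current:
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [{"bread", "butter"}, {"bread", "butter", "milk"},
                {"bread", "milk"}, {"butter"}, {"bread", "butter", "jam"}]
frequent = apriori(transactions, min_support=0.4)
conf = confidence(frozenset({"bread"}), frozenset({"butter"}), transactions)
print(len(frequent), round(conf, 2))
```

The 3-itemset {bread, butter, milk} is never counted, because its subset {butter, milk} is already infrequent. That is the pruning effect of the Apriori principle.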
Artificial Neuron Model
Activation Functions
| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | 1/(1+e^(-z)) | (0,1) | Binary classification output |
| Tanh | (eᶻ-e^(-z))/(eᶻ+e^(-z)) | (-1,1) | Hidden layers (zero-centered) |
| ReLU | max(0,z) | [0,∞) | Most common in deep networks |
| Leaky ReLU | max(0.01z, z) | (-∞,∞) | Fix dying ReLU problem |
| Softmax | e^zᵢ / Σe^zⱼ | (0,1), sums to 1 | Multi-class classification output |
| Linear | z | (-∞,∞) | Regression output |
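The functions in the table can be written out directly. The sketch below also includes a single neuron computed as z = Σ wᵢxᵢ + b followed by an activation, which is the usual neuron model; the weights, inputs and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    return max(slope * z, z)

def softmax(zs):
    m = max(zs)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

out = neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)   # z = 0.5 - 0.5 = 0
probs = softmax([1.0, 2.0, 3.0])
print(out, relu(-3.0), leaky_relu(-2.0), math.tanh(0.0), probs)
```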
Solved Numerical [PYQ 2025 End-Term Q5a]
Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively updating parameters in the direction of steepest descent.
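Solved Numerical [PYQ ML-2 Q2a]
Minimize f(x) = x² − 2x with η = 0.1, starting from x = 0. Gradient: f'(x) = 2x − 2. Update rule: x_new = x − η × f'(x)
Iteration 1 (x = 0)
f'(0) = −2 → x = 0 − 0.1×(−2) = 0.2
Iteration 2 (x = 0.2)
f'(0.2) = −1.6 → x = 0.2 − 0.1×(−1.6) = 0.36
Iteration 3 (x = 0.36)
f'(0.36) = −1.28 → x = 0.36 − 0.1×(−1.28) = 0.488
True minimum: x = 1 (where f'(x) = 0 → 2x−2=0 → x=1)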
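The iterations can be reproduced with a short loop. The sketch assumes f(x) = x² − 2x (so f'(x) = 2x − 2), a learning rate of 0.1 and a start at x = 0, as in the question; the function name `gradient_descent` is chosen for the example.

```python
def gradient_descent(grad, x0, lr, steps):
    """Apply x = x - lr * grad(x) repeatedly; returns every value of x visited."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - lr * grad(x)                 # step against the gradient
        history.append(x)
    return history

history = gradient_descent(lambda x: 2 * x - 2, x0=0.0, lr=0.1, steps=3)
print([round(x, 3) for x in history])

# With more iterations x keeps moving toward the true minimum at x = 1
long_run = gradient_descent(lambda x: 2 * x - 2, x0=0.0, lr=0.1, steps=100)
```

Each step shrinks the distance to x = 1 by a constant factor of 0.8, so convergence is steady but not immediate.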
Backpropagation Steps
- Forward Pass: Compute output layer by layer
- Compute Loss: Compare predicted output with actual output using a loss function, e.g. MSE = (1/n) Σ (actual − predicted)²
- Backward Pass: Use chain rule to compute gradients layer by layer
- Update Weights: w = w − α × ∂L/∂w
- Repeat until convergence
CNN Architecture Layers
Layers Explained
- Conv Layer: Applies filters/kernels to extract features (edges, textures, patterns). Output = Feature Map. Formula: Output = ⌊(Input − Kernel + 2×Padding) / Stride⌋ + 1
- ReLU Activation: Applies max(0,z) — removes negative values, adds non-linearity
- Pooling Layer: Reduces spatial dimensions. Max Pooling = take maximum in each region. Reduces computation, provides translation invariance
- Flatten: Converts 2D feature maps to 1D vector
- Fully Connected (FC) Layer: Regular neural network layers for classification
- Softmax Output: Converts to class probabilities
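The output-size formula for the Conv and Pooling layers can be traced through a small stack. The layer sizes below (32×32 input, 5×5 kernel, 2×2 pooling with stride 2, 6 filters) are hypothetical values chosen only to illustrate the arithmetic.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer (integer division = floor)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

size = 32                                        # hypothetical 32x32 input image
size = conv_output_size(size, 5)                 # 5x5 conv, no padding, stride 1
after_conv = size
size = conv_output_size(size, 2, stride=2)       # 2x2 max pooling, stride 2
after_pool = size
num_filters = 6                                  # hypothetical number of feature maps
flattened = after_pool * after_pool * num_filters   # length of the vector fed to the FC layer
print(after_conv, after_pool, flattened)
```

The same function covers pooling because a pooling window slides over the input exactly like a kernel does.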
RNN — Recurrent Neural Network
RNN is a neural network designed for sequential/time-series data. It has a feedback loop: the hidden state from the previous time step is fed in alongside the current input.
LSTM — Long Short-Term Memory
LSTM is an improved RNN with 3 gates that control information flow, which largely mitigates the vanishing gradient problem.
Three Gates of LSTM
- Forget Gate (fₜ): Decides what information to THROW AWAY from cell state.
  fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) → Output: 0 = forget all, 1 = keep all
- Input Gate (iₜ): Decides what NEW information to ADD to cell state.
  iₜ = σ(Wᵢ·[hₜ₋₁, xₜ] + bᵢ), C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)
- Output Gate (oₜ): Decides what to OUTPUT as hidden state.
  oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo), hₜ = oₜ × tanh(Cₜ)
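A single LSTM time step can be sketched with scalars in place of vectors and matrices. The gate equations follow the ones above. The cell-state update Cₜ = fₜ × Cₜ₋₁ + iₜ × C̃ₜ is the standard LSTM update and is added here because hₜ depends on Cₜ. All weights and the name `lstm_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state and cell state.

    params maps a gate name ("f", "i", "c", "o") to (w_h, w_x, b).
    """
    def pre(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))                  # forget gate: how much of c_prev to keep
    i_t = sigmoid(pre("i"))                  # input gate: how much new information to add
    c_tilde = math.tanh(pre("c"))            # candidate cell state
    o_t = sigmoid(pre("o"))                  # output gate: how much of the cell to expose
    c_t = f_t * c_prev + i_t * c_tilde       # standard cell-state update
    h_t = o_t * math.tanh(c_t)               # new hidden state
    return h_t, c_t

params = {gate: (0.5, 0.5, 0.0) for gate in "fico"}   # hypothetical weights
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, params=params)
print(round(h_t, 4), round(c_t, 4))
```

Because the cell state is updated by addition instead of repeated matrix multiplication, gradients can flow across many time steps more easily than in a plain RNN.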
| Feature | RNN | LSTM |
|---|---|---|
| Memory | Short-term only | Long + Short term |
| Vanishing Gradient | Yes (big problem) | Solved by gates |
| Long dependencies | Fails | Handles well |
| Complexity | Simple | More complex |
Ensembling
Ensembling combines multiple models to produce a better prediction than any single model. Two main methods: Bagging and Boosting.
Bagging (Bootstrap Aggregating)
- Train models in parallel on different bootstrap samples (random subsets drawn with replacement)
- Combine by majority vote (classification) or average (regression)
- Reduces Variance
- Example: Random Forest
Boosting
- Train models sequentially — each fixes errors of previous
- Combine by weighted sum
- Reduces Bias
- Examples: AdaBoost, XGBoost, Gradient Boosting
Random Forest
Random Forest = Bagging + Decision Trees. It builds multiple decision trees on random subsets of data and features, then combines their predictions.
Missing Value Handling
- Mean Imputation: Replace with column mean (for normal distribution)
- Median Imputation: Replace with median (for skewed data)
- Mode Imputation: Replace with most frequent value (for categorical)
- KNN Imputation: Replace using K nearest neighbors' values
- Deletion: Remove rows/columns with too many missing values
Feature Scaling [PYQ: "What is feature scaling? Why required?"]
Feature scaling normalizes the range of features so that no feature dominates due to its scale. Required for algorithms that use distance (KNN, SVM, K-Means) or gradient descent (Neural Networks).
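Two common scaling methods are Min-Max Normalization, x' = (x − min)/(max − min), and Z-Score Standardization, x' = (x − mean)/std. The sketch below applies both to one hypothetical feature column. The function names are assumptions, and the code does not guard against a constant column, where the denominator would be zero.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)          # population standard deviation
    return [(v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]                  # hypothetical feature column
scaled = min_max_scale(ages)
standardized = z_score_scale(ages)
print(scaled)
print([round(z, 3) for z in standardized])
```

Min-Max keeps the values inside a fixed range, while Z-Score is less affected by a single extreme value.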
Categorical Encoding
- Label Encoding: Assign an integer to each category (Low=0, Medium=1, High=2). Use for ordinal data, where the order is meaningful
- One-Hot Encoding: Create binary column for each category — use for nominal data
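Both encodings can be sketched without any library. The example applies label encoding to an ordered feature and one-hot encoding to an unordered one; the function names and the sample categories are hypothetical.

```python
def label_encode(values, order):
    """Map each category to its position in `order` (lowest first)."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Return the sorted category list and one binary row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

sizes = ["Low", "High", "Medium"]                        # ordinal feature
encoded_sizes = label_encode(sizes, order=["Low", "Medium", "High"])

colors = ["Red", "Blue", "Green", "Red"]                 # nominal feature
categories, one_hot = one_hot_encode(colors)
print(encoded_sizes)
print(categories, one_hot)
```

One-hot encoding avoids implying an order such as Red < Blue < Green, which integer labels on a nominal feature would suggest to a distance-based model.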
- Bias and Fairness: ML models can inherit biases from training data, leading to unfair decisions (e.g., gender bias in hiring algorithms)
- Privacy: Training on personal data without consent violates privacy (facial recognition, health data)
- Transparency (Explainability): "Black box" models are hard to interpret — doctors/judges need explanations for AI decisions
- Accountability: Who is responsible when an AI system causes harm? (Self-driving car accident)
- Job Displacement: Automation through ML may cause unemployment
- Deepfakes & Misinformation: ML can generate realistic fake content, spreading false information
- Security: Adversarial attacks can fool ML models with small, imperceptible changes
Unit 1 — Foundations
Unit 2 — Supervised Learning
Unit 3 — Unsupervised Learning
Unit 4 — Neural Networks & Deep Learning
Unit 5 — Preprocessing & Ethics
⚡ Memory Tricks
MCA II SEMESTER — MACHINE LEARNING (TMC-211)
MOST PROBABLE END-TERM EXAM PAPER
a) Define supervised and unsupervised learning. Illustrate each with two examples. Give four applications of Machine Learning.
a) Use KNN classifier (K=5) to classify the test point (Brightness=20, Saturation=35). Use Euclidean distance: [Data points: 7 rows with Brightness, Saturation, Class (Red/Blue)]
a) What is feature scaling? Why is it required? Explain Min-Max Normalization and Z-Score Standardization with examples.
a) What is the sigmoid function? With a₀=−64, a₁=2, find pass/fail probability for a student who studies 33 hours.
a) Explain Backpropagation algorithm. Use Gradient Descent to minimize f(x)=x²−2x with η=0.1, starting from x=0 for 3 iterations.