Jatin's Notes — MCA Machine Learning

Unit 1

Foundations of Machine Learning

★★★

Define Machine Learning. How is it trained and tested?

10 Marks PYQ 2025 Mid-Sem Most Repeated

Definition

Standard Exam Definition:

"Machine Learning is the study of algorithms and statistical models that computer systems use to perform tasks by learning patterns from data, without being explicitly programmed."

Tom Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

How ML Model is Trained and Tested

Data Collection: Gather relevant data (structured/unstructured)
Data Preprocessing: Clean data, handle missing values, normalize
Feature Selection: Choose important input variables
Model Selection: Choose algorithm (Linear Regression, KNN, SVM, etc.)
Training: Feed training data → model learns patterns by adjusting parameters
Validation: Evaluate on validation set, tune hyperparameters
Testing: Test on unseen test data to evaluate final performance
Deployment: Use model for real-world predictions

Data → Split → Training Set (70%) + Testing Set (30%)
↓
Train the Model
↓
Evaluate on Test Data
↓
Calculate Accuracy/Metrics
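The 70/30 split in the flow above can be sketched with NumPy. This is only an illustration: the function name `train_test_split_70_30`, the seed and the toy arrays are assumptions, not part of the syllabus.

```python
import numpy as np

def train_test_split_70_30(X, y, train_fraction=0.7, seed=0):
    """Shuffle the rows, then cut them into a training part and a testing part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # random order of row indices
    cut = int(round(train_fraction * len(X)))    # number of training rows
    train_idx, test_idx = indices[:cut], indices[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (made up): 10 samples, 2 features, binary labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
X_train, X_test, y_train, y_test = train_test_split_70_30(X, y)
print(X_train.shape, X_test.shape)
```

Shuffling before the cut matters: if the rows are sorted by class, an unshuffled split can leave a whole class out of the training set.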

Key Points to Remember

Training: Model learns weights/parameters
Testing: Model predicts on new, unseen data
Overfitting: Performs very well on training data but poorly on test data
Underfitting: Performs poorly on both training and test data

★★★

Define & Compare: Supervised, Unsupervised, Semi-Supervised, Reinforcement Learning

10 Marks PYQ 2026 Mid-Sem & 2025 End-Term Most Repeated

1. Supervised Learning

Definition: Model is trained on labeled data (input + correct output given).
Goal: Learn a mapping function f(X) → Y
Examples: Email spam detection, house price prediction, medical diagnosis
Algorithms: Linear Regression, Logistic Regression, SVM, KNN, Decision Trees

💡

Memory Trick: "Teacher is present" → labels = teacher

2. Unsupervised Learning

Definition: Model is trained on unlabeled data (only input, no output labels).
Goal: Find hidden patterns or groupings in data
Examples: Customer segmentation, anomaly detection, topic modeling
Algorithms: K-Means, DBSCAN, PCA, Apriori

💡

Memory Trick: "No teacher" → model discovers structure itself

3. Reinforcement Learning

Definition: An agent learns by interacting with environment, receiving rewards or penalties.

Agent — learner/decision maker
Environment — what agent interacts with
State (S) — current situation
Action (A) — what agent does
Reward (R) — feedback signal
Policy (π) — strategy of agent

Q(s,a) ← Q(s,a) + α[R + γ·max Q(s',a') − Q(s,a)]
Where: α = learning rate, γ = discount factor, s' = next state, a' = any action available in s'
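One application of the Q-learning update can be sketched as below. The table size, the values placed in it, and α = 0.1, γ = 0.9 are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to the table Q in place and return it."""
    td_target = reward + gamma * np.max(Q[s_next])   # R + γ·max Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])         # move Q(s, a) towards the target
    return Q

Q = np.zeros((3, 2))        # hypothetical problem: 3 states, 2 actions
Q[1] = [0.5, 1.0]           # pretend state 1 already has some learned values
q_update(Q, s=0, a=1, reward=1.0, s_next=1)
print(Q[0, 1])              # 0 + 0.1 × (1.0 + 0.9 × 1.0 − 0) = 0.19
```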

💡

Memory Trick: "Carrot and stick" → reward/penalty drives learning

Comparison Table [MUST DRAW IN EXAM]

Feature	Supervised	Unsupervised	Semi-Supervised	Reinforcement
Labels	Required (all)	Not required	Partial	Reward signal
Goal	Predict output	Find patterns	Improve with few labels	Maximize reward
Output	Class/Value	Clusters/Patterns	Class/Value	Policy
Example	Spam detection	Clustering	Image tagging	Game AI
Algorithms	SVM, KNN, LR	K-Means, PCA	Self-training	Q-Learning

★★★

Write a Short Note on NumPy and TensorFlow

10 Marks PYQ 2026 Mid-Sem Q2a

NumPy — Numerical Python

Purpose: Provides support for large multi-dimensional arrays and matrices, along with mathematical functions. Array operations are much faster than equivalent loops over Python lists because the core is implemented in C.

Key Features: N-dimensional array object (ndarray), Broadcasting, Linear algebra, Fourier transform, Random number generation

import numpy as np

a = np.array([1, 2, 3])        # Create a 1-D array
b = np.zeros((3, 3))           # 3×3 zero matrix
c = np.dot(a, a)               # Dot product (1 + 4 + 9 = 14)
d = np.mean(a), np.std(a)      # Mean and standard deviation, stored as a tuple
e = a.reshape(1, 3)            # Reshape to 1 row × 3 columns

TensorFlow

Developer: Google Brain Team (2015) | Purpose: Open-source library for numerical computation and large-scale Machine Learning using data flow graphs.

Key Features: Tensors = multi-dimensional arrays, Automatic differentiation (tf.GradientTape) for backpropagation, GPU/TPU acceleration, Keras API (high-level), Eager execution mode

import tensorflow as tf

# X_train and y_train are assumed to be prepared beforehand
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)

NumPy vs TensorFlow

Feature	NumPy	TensorFlow
Purpose	Array computation	Deep Learning
GPU Support	No	Yes
Auto-differentiation	No	Yes
Level	Low-level	High-level (with Keras)

Unit 2

Supervised Learning

★★★

Linear Regression — Theory + Solved Numerical [EXAM FAVOURITE]

10 Marks PYQ 2025 Mid-Sem Q5a & End-Term Q3b Numerical

Definition

Linear Regression is a supervised learning algorithm used to predict a continuous output variable (Y) based on one or more input features (X) by fitting a straight line to the data.

Formula

ŷ = a₀ + a₁x (Simple Linear Regression)
Where:
a₁ = [n·Σxy − Σx·Σy] / [n·Σx² − (Σx)²] ← slope
a₀ = ȳ − a₁·x̄ ← intercept

Solved Numerical — PYQ 2025

Problem: Find the Linear Regression equation for the following data and estimate Y when X = 7.0
X: 2.0 | 3.0 | 4.0 | 5.0 | 6.0
Y: 3.00 | 4.00 | 3.40 | 6.00 | 5.00

Step 1: Calculation Table

X	Y	XY	X²
2	3.00	6.00	4
3	4.00	12.00	9
4	3.40	13.60	16
5	6.00	30.00	25
6	5.00	30.00	36
ΣX=20	ΣY=21.40	ΣXY=91.60	ΣX²=90

Step 2: Calculate a₁ (slope) — n = 5

a₁ = [n·ΣXY − ΣX·ΣY] / [n·ΣX² − (ΣX)²]
= [5×91.60 − 20×21.40] / [5×90 − (20)²]
= [458 − 428] / [450 − 400]
= 30 / 50 = 0.6

Step 3: Calculate a₀ (intercept)

x̄ = 20/5 = 4, ȳ = 21.40/5 = 4.28
a₀ = ȳ − a₁·x̄ = 4.28 − 0.6×4 = 4.28 − 2.4 = 1.88

Step 4: Regression Equation & Prediction

ŷ = 1.88 + 0.6x
When X = 7: ŷ = 1.88 + 0.6×7 = 1.88 + 4.2 = 6.08

Answer: Regression equation is ŷ = 1.88 + 0.6x
When X = 7.0, predicted Y = 6.08
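The hand calculation can be cross-checked with a few lines of NumPy. This sketch applies the same slope and intercept formulas to the same five data points; the variable names are only illustrative.

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 4.0, 3.4, 6.0, 5.0])
n = len(x)

# Slope: a1 = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
a1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
# Intercept: a0 = ȳ − a1·x̄
a0 = np.mean(y) - a1 * np.mean(x)
# Prediction at X = 7
y_at_7 = a0 + a1 * 7.0

print(round(float(a1), 2), round(float(a0), 2), round(float(y_at_7), 2))
```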

★★★

Confusion Matrix — Diabetes Problem (1000 Patients) [PYQ EXACT]

10 Marks PYQ 2026 Mid-Sem Q3a Must Practice

Problem: 1000 patients tested. 900 healthy, 100 sick.
Sick: 72 test +ve, 28 test –ve. Healthy: 28 test +ve, 872 test –ve.
Construct confusion matrix. Calculate Accuracy, Precision, Recall, F1-Score.

Step 1: Identify Values

TP (Sick predicted Sick) = 72
FN (Sick predicted Healthy) = 28
FP (Healthy predicted Sick) = 28
TN (Healthy predicted Healthy) = 872

Step 2: Confusion Matrix

Actual \ Predicted	Pred: Sick (+)	Pred: Healthy (–)
Actual: Sick	TP = 72	FN = 28
Actual: Healthy	FP = 28	TN = 872

Step 3: Calculate Metrics

Metric	Formula	Calculation	Value
Accuracy	(TP+TN)/(TP+TN+FP+FN)	(72+872)/1000	94.4%
Precision	TP/(TP+FP)	72/(72+28)	72%
Recall	TP/(TP+FN)	72/(72+28)	72%
F1-Score	2×P×R/(P+R)	2×0.72×0.72/1.44	72%

Memory Trick

TP & TN = friends (both correct)
FP = False Alarm (predicted sick but healthy)
FN = Missed case (predicted healthy but sick)
Precision = "When I say sick, how often right?"
Recall = "Of all sick people, how many did I catch?"
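The four metrics can be verified with a small helper. The function name `classification_metrics` is an assumption made for this sketch; the counts are the ones from the diabetes problem.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = classification_metrics(tp=72, fn=28, fp=28, tn=872)
print(accuracy, precision, recall, f1)
```

Accuracy is high (94.4%) mainly because 90% of the patients are healthy. Precision and recall (72% each) give a more honest picture of how well the sick class is handled.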

★★★

K-NN Classifier — Theory + Numerical (Euclidean Distance) [PYQ EXACT]

10 Marks PYQ 2026 Mid-Sem Q3b, 2025 End-Term Q1a, 2025 Mid Q4b Must Practice

Definition

K-Nearest Neighbour (K-NN) is a non-parametric, instance-based supervised learning algorithm used for classification and regression. It classifies a new point based on the majority class of its K nearest neighbors.

Euclidean Distance Formula

d(p,q) = √[(x₁-x₂)² + (y₁-y₂)²] (2D)
General: d = √[Σ(pᵢ - qᵢ)²]

Algorithm Steps

Choose value of K
Calculate Euclidean distance from test point to all training points
Sort distances in ascending order
Select K nearest neighbors
For classification: take majority vote of K neighbors
Assign that class to test point

Solved Numerical [PYQ 2025 Mid-Sem Q4b]

Problem: Classify point (4,6) using K=3.
D1:(2,1)=Y, D2:(4,2)=N, D3:(3,3)=Y, D4:(3,5)=N, D5:(4,3)=N, D6:(5,4)=Y

Euclidean Distance Calculations

Point	Coord	Class	Distance from (4,6)
D1	(2,1)	Y	√[(4-2)²+(6-1)²] = √[4+25] = √29 ≈ 5.39
D2	(4,2)	N	√[(4-4)²+(6-2)²] = √[0+16] = 4.00
D3	(3,3)	Y	√[(4-3)²+(6-3)²] = √[1+9] = √10 ≈ 3.16
D4	(3,5)	N	√[(4-3)²+(6-5)²] = √[1+1] = √2 ≈ 1.41
D5	(4,3)	N	√[(4-4)²+(6-3)²] = √[0+9] = 3.00
D6	(5,4)	Y	√[(4-5)²+(6-4)²] = √[1+4] = √5 ≈ 2.24

K=3 Nearest Neighbors (sorted)

Rank	Point	Distance	Class
1	D4	1.41	N
2	D6	2.24	Y
3	D5	3.00	N

K=3 Neighbors: N, Y, N → Majority = N (2 votes)

∴ Class of (4,6) = N
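The same K-NN steps (distance, sort, pick K, vote) can be sketched in Python on the six points from the problem. The helper name `knn_classify` is an assumption; ties in the vote are not handled specially here.

```python
import numpy as np
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = np.sqrt(np.sum((train_points - query) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                               # indices of the k smallest
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0], distances

points = np.array([[2, 1], [4, 2], [3, 3], [3, 5], [4, 3], [5, 4]], dtype=float)  # D1..D6
labels = ["Y", "N", "Y", "N", "N", "Y"]
predicted, distances = knn_classify(points, labels, np.array([4.0, 6.0]), k=3)
print(predicted, np.round(distances, 2))
```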

💡

Memory: "Find K Friends and vote!" → nearest K decide the class by majority

★★★

Sigmoid Function + Logistic Regression Numerical [PYQ EXACT]

10 Marks PYQ 2026 Mid-Sem Q4a Numerical

Sigmoid Function

σ(z) = 1 / (1 + e^(-z))
Where z = a₀ + a₁x (linear combination)
Output range: (0, 1), used as a probability
Decision boundary: σ(z) ≥ 0.5 → class 1, else class 0
Equivalent to: z ≥ 0 → class 1, else class 0

Solved Numerical [PYQ 2026 Mid-Sem Q4a]

Problem: With a₀ = -64, a₁ = 2, find pass % for student who studies 33 hours.

Step 1: Calculate z

z = a₀ + a₁·x = -64 + 2×33 = -64 + 66 = 2

Step 2: Apply Sigmoid

σ(z) = 1/(1 + e^(-2)) = 1/(1 + 0.1353) = 1/1.1353 = 0.8808

Pass Probability = 88.08%

Since σ(2) ≈ 0.88 > 0.5 → Student PASSES

Verify with Data Table

Hours (x)	Pass/Fail	z = -64+2x	σ(z)	Predicted
24	0 (Fail)	-16	≈0.0000	Fail ✓
15	0 (Fail)	-34	≈0.0000	Fail ✓
28	1 (Pass)	-8	≈0.0003	Fail ✗
33	1 (Pass)	2	≈0.8808	Pass ✓
39	1 (Pass)	14	≈0.9999	Pass ✓
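The whole verification table can be reproduced with a short script. This is a sketch using the given a₀ = −64 and a₁ = 2; the 0.5 threshold is the decision rule stated earlier in this topic.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

a0, a1 = -64.0, 2.0
hours = [24, 15, 28, 33, 39]

probabilities = [sigmoid(a0 + a1 * h) for h in hours]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]   # 1 = Pass, 0 = Fail

for h, p, c in zip(hours, probabilities, predictions):
    print(h, round(p, 4), "Pass" if c == 1 else "Fail")
```

The student with 28 hours actually passed but is predicted to fail, because the model's boundary sits at z = 0, i.e. x = 32 hours.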

★★★

SVM — Support Vector Machine + Kernel Functions [PYQ EXACT]

10 Marks PYQ 2026 Mid-Sem Q2b, 2025 End-Term Q3c Theory + Kernel

Definition

SVM is a supervised learning algorithm that finds the optimal hyperplane which maximizes the margin between two classes.

Decision boundary: w·x + b = 0
Margin = 2 / ||w||
Maximize margin = Minimize ||w||²/2
Support Vectors = data points closest to hyperplane

💡

"Draw the FATTEST possible line between two classes" — the support vectors are the data points that sit right on the edge of that fat line.

Kernel Functions [List any 4]

Kernel	Formula	Use Case
Linear	K(x,y) = xᵀy	Linearly separable data
Polynomial	K(x,y) = (xᵀy + c)^d	Non-linear boundaries
RBF/Gaussian	K(x,y) = exp(-γ||x-y||²)	Most common, non-linear
Sigmoid	K(x,y) = tanh(αxᵀy + c)	Neural network-like

Kernel Trick

The Kernel Trick computes the dot product in high-dimensional space without explicitly transforming data, making SVM computationally efficient for non-linearly separable data.
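A small numeric check makes the trick concrete. The sketch below uses the degree-2 polynomial kernel with c = 0 in 2D, for which an explicit feature map is known: φ(x) = (x₁², √2·x₁x₂, x₂²). The two test vectors are made up.

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel with c = 0: K(x, y) = (x·y)²."""
    return np.dot(x, y) ** 2

def explicit_map(x):
    """Feature map φ whose ordinary dot product equals poly_kernel in 2D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_direct = poly_kernel(x, y)                           # computed in the original 2D space
k_mapped = np.dot(explicit_map(x), explicit_map(y))    # computed in the 3D feature space
print(k_direct, k_mapped)
```

Both values agree, but `poly_kernel` never builds the 3D vectors. For the RBF kernel the feature space is infinite-dimensional, so the kernel is the only practical route.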

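Example from PYQ 2026: Points like (0,2), (0,-2) belong to class O and points like (1,1), (-1,-1), (2,0) belong to class X. These are not linearly separable in 2D. A non-linear feature map fixes this: φ(x) = x₁² sends the O points to 0 and the X points to 1, 1 and 4, so a single threshold separates them in 1D. Note that φ(x) = x₁² + x₂² is an explicit feature map (squared distance from the origin), not the RBF kernel, and for these particular points it maps (2,0) and both O points to the same value 4, so it does not separate them. An RBF kernel obtains a non-linear boundary implicitly, without computing any φ.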

★★

Cross Validation + Bias-Variance Tradeoff

10 Marks PYQ 2025 Mid-Sem Q5b, 2025 End-Term

Cross Validation

Definition: Cross Validation is a technique to evaluate ML models by training and testing on different subsets of data, so that overfitting is detected and the performance estimate is more reliable than one from a single split.

Methods

K-Fold CV: Divide data into K equal folds. Train on K-1 folds, test on 1 fold. Repeat K times, average results.
Leave-One-Out (LOOCV): Each data point is the test set exactly once (K = n). Uses almost all data for training each time, but needs n model fits, so it is slow on large datasets.
Stratified K-Fold: Like K-fold but preserves class proportions in each fold.

K-Fold: Final Score = (Score₁ + Score₂ + ... + ScoreK) / K
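K-Fold can be sketched with plain NumPy as below. The helper `k_fold_indices`, the toy data and the choice of a straight-line fit scored by mean squared error are all assumptions made for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Toy data (made up): y = 2x + 1 exactly, so a line fit should have ~0 error
X = np.arange(10, dtype=float)
y = 2.0 * X + 1.0

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(X), k=5):
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)   # train on K-1 folds
    predictions = slope * X[test_idx] + intercept                      # test on the held-out fold
    fold_scores.append(float(np.mean((y[test_idx] - predictions) ** 2)))

final_score = sum(fold_scores) / len(fold_scores)   # (Score₁ + ... + ScoreK) / K
print(final_score)
```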

Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Noise
Bias: Error from wrong assumptions (underfitting)
Variance: Error from sensitivity to training data (overfitting)

Model	Bias	Variance	Problem
Simple (Linear)	High	Low	Underfitting
Complex (Deep Tree)	Low	High	Overfitting
Optimal Model	Low	Low	Best!

💡

Bias = "wrong assumption" | Variance = "too sensitive to training data"
Goal: Balance both! Not too simple, not too complex.

★★★

Decision Tree — Information Gain, Entropy, Gini Index

10 Marks PYQ 2025 End-Term Theory + Numerical

Entropy Formula

Entropy(S) = -Σ pᵢ · log₂(pᵢ)
Where pᵢ = proportion of class i in set S
Pure node (all same class) → Entropy = 0
Equally mixed (2 classes) → Entropy = 1 (maximum disorder); with k classes the maximum is log₂(k)

Information Gain

IG(S, A) = Entropy(S) − Σ [|Sᵥ|/|S| × Entropy(Sᵥ)]
Where Sᵥ = subset of S where attribute A = v
Choose attribute with HIGHEST Information Gain as root

Gini Index

Gini(S) = 1 − Σ pᵢ²
Pure node → Gini = 0
Equally mixed (2 classes) → Gini = 0.5 (maximum)

Measure	Formula	Range	Best Split
Entropy	−Σ pᵢ log₂pᵢ	0 to 1 (2 classes)	Highest IG
Gini Index	1 − Σ pᵢ²	0 to 0.5 (2 classes)	Lowest weighted Gini

ID3 Algorithm Steps

Calculate Entropy of whole dataset
For each attribute, calculate Information Gain
Select attribute with highest IG as root node
Split dataset, repeat recursively for each branch
Stop when: all data in leaf is same class OR no attributes left
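Entropy, Gini and Information Gain can be computed with the standard library alone. The sketch below uses a made-up set of four labels and two hypothetical attributes, one that splits the classes perfectly and one that does not help at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = −Σ pᵢ·log₂(pᵢ) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(S) = 1 − Σ pᵢ²."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = Entropy(S) − Σ |Sᵥ|/|S| × Entropy(Sᵥ)."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)       # group labels by attribute value
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

labels = ["Yes", "Yes", "No", "No"]
perfect_attr = ["a", "a", "b", "b"]    # separates Yes from No completely
useless_attr = ["a", "b", "a", "b"]    # each branch is still 50/50

print(entropy(labels), gini(labels))
print(information_gain(labels, perfect_attr), information_gain(labels, useless_attr))
```

ID3 would pick `perfect_attr` as the root because its Information Gain is the highest.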

Unit 3

Unsupervised Learning

★★★

K-Means Clustering Algorithm

10 Marks PYQ 2025 End-Term Q1c

Definition

K-Means is an unsupervised learning algorithm that partitions n data points into K clusters by minimizing the sum of squared distances from each point to its cluster centroid.

Algorithm Steps

Initialize: Randomly choose K centroids from data
Assignment: Assign each data point to nearest centroid (using Euclidean distance)
Update: Recalculate centroids as mean of all points in each cluster
Repeat: Go to step 2 until centroids do not change (convergence)

Centroid update: μₖ = (1/|Cₖ|) × Σ xᵢ for xᵢ ∈ Cₖ
Objective: Minimize J = Σₖ Σ_{xᵢ∈Cₖ} ||xᵢ − μₖ||²
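The four steps (initialize, assign, update, repeat) can be sketched in NumPy as below. The function name `k_means`, the iteration cap, the seed, the empty-cluster rule and the toy points are assumptions for this illustration, not a production implementation.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Plain K-Means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Initialize from the data
    for _ in range(n_iters):
        # Assignment: nearest centroid by Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update: mean of each cluster (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # Convergence check
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (made up): two well-separated groups
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
centroids, labels = k_means(X, k=2)
print(centroids, labels)
```

Which group gets label 0 and which gets label 1 depends on the random initialization; only the grouping itself is meaningful.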

Advantages

Simple and fast
Scales to large data
Easy to implement

Disadvantages

Need to specify K
Sensitive to outliers
Assumes spherical clusters

💡

K-Means = "Pick K centers, assign, move centers, repeat until stable"

★★★

DBSCAN Algorithm — Density Based Clustering [PYQ REPEATED]

10 Marks PYQ 2025 End-Term Q1c, 2025 End-Term Q2a

Definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points based on density. It can find clusters of arbitrary shapes and identifies outliers as noise.

Key Parameters

ε (epsilon): Radius of neighborhood around a point
MinPts: Minimum number of points required to form a dense region

Point Types

Core Point: Has ≥ MinPts within ε radius
Border Point: Has < MinPts within ε, but within ε of a Core point
Noise Point: Neither Core nor Border, treated as outlier

Algorithm Steps

Pick an unvisited point
Find all points within ε radius (neighbors)
If neighbors ≥ MinPts → Core point → start new cluster
Expand cluster by recursively adding density-connected points
If neighbors < MinPts → mark as noise (may change to border later)
Repeat until all points visited
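The point-type rules can be sketched as below. This is not the full clustering procedure, only the labelling of each point as core, border or noise. It assumes the common convention that a point counts itself among its ε-neighbours; the data, ε = 0.8 and MinPts = 4 are made up.

```python
import numpy as np

def dbscan_point_types(X, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = distances <= eps                  # boolean matrix, includes the point itself
    is_core = neighbours.sum(axis=1) >= min_pts
    types = []
    for i in range(len(X)):
        if is_core[i]:
            types.append("core")
        elif np.any(neighbours[i] & is_core):      # within eps of at least one core point
            types.append("border")
        else:
            types.append("noise")
    return types

X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5],   # dense square
              [1.2, 0.5],                                        # on the edge of the square
              [5.0, 5.0]])                                       # far away
types = dbscan_point_types(X, eps=0.8, min_pts=4)
print(types)
```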

DBSCAN vs K-Means

Feature	K-Means	DBSCAN
Need to specify K?	Yes	No (needs ε and MinPts instead)
Cluster Shape	Spherical only	Any shape
Outlier handling	Assigns to cluster	Labels as noise
Sensitive to outliers	Yes	No

★★

PCA — Principal Component Analysis

10 Marks PYQ 2025 End-Term

Definition

PCA is an unsupervised dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving maximum variance.

Steps of PCA

Standardize: Normalize data (mean=0, variance=1)
Covariance Matrix: Compute covariance matrix of features
Eigenvalues & Eigenvectors: Compute from covariance matrix
Sort: Sort eigenvalues in descending order
Select: Choose top K eigenvectors (principal components)
Project: Transform data onto new K-dimensional space

Covariance: Cov(X,Y) = Σ(xᵢ-x̄)(yᵢ-ȳ) / (n-1)
Variance explained = λᵢ / Σλ (where λ = eigenvalue)
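The six PCA steps can be sketched with NumPy's eigen-decomposition. The function name `pca` and the toy data are assumptions; `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

def pca(X, n_components):
    """Return (projected data, explained-variance ratio of the kept components)."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # Standardize
    cov = np.cov(X_std, rowvar=False)                          # Covariance matrix (n-1 divisor)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)            # Eigenvalues & eigenvectors
    order = np.argsort(eigenvalues)[::-1]                      # Sort in descending order
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    components = eigenvectors[:, :n_components]                # Select top K eigenvectors
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_std @ components, explained_ratio                 # Project

# Toy data (made up): two strongly correlated features
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.9]])
Z, ratio = pca(X, n_components=1)
print(Z.shape, ratio)
```

Because the two features move almost in lockstep, a single principal component keeps nearly all the variance.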

💡

"Compress by keeping IMPORTANT directions" — PCA finds the directions of maximum variance (spread) in data.

Applications

Face recognition, Image compression
Remove noise from data
Visualization of high-dimensional data
Speed up machine learning algorithms

★★

Apriori Algorithm — Association Rule Mining

10 Marks PYQ 2025 End-Term Q3

Definition

Apriori is an algorithm for frequent itemset mining and association rule learning. Used to discover interesting relationships (rules) between variables in large databases.

Key Measures

Support(A) = Transactions containing A / Total transactions
Confidence(A→B) = Support(A∪B) / Support(A)
Lift(A→B) = Confidence(A→B) / Support(B)
Lift > 1 = Positive association (useful rule)

Apriori Principle

Apriori Property: If an itemset is frequent, ALL its subsets must also be frequent.

        Contrapositive: If any subset is infrequent → the superset is also infrequent (prune it!).

Algorithm Steps

Find all frequent 1-itemsets (≥ min_support)
Generate candidate 2-itemsets from frequent 1-itemsets
Prune candidates with infrequent subsets
Find frequent 2-itemsets
Repeat until no more frequent itemsets found
Generate association rules from frequent itemsets
Keep rules with Confidence ≥ min_confidence

Application: Market Basket Analysis — "Customers who buy bread also buy butter"
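Support, Confidence and Lift for one rule can be computed directly from a transaction list. The five baskets below are made up for illustration, and the helper names are assumptions; this is not the full Apriori candidate-generation loop.

```python
# Hypothetical market-basket data: each set is one transaction
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
    {"milk", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence(A→B) = Support(A∪B) / Support(A)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Lift(A→B) = Confidence(A→B) / Support(B)."""
    return confidence(antecedent, consequent) / support(consequent)

s = support({"bread", "butter"})        # 2 of 5 baskets
c = confidence({"bread"}, {"butter"})   # 0.4 / 0.6
l = lift({"bread"}, {"butter"})         # confidence / 0.4
print(s, c, l)
```

Here Lift is above 1, so in this toy data buying bread makes butter more likely than its baseline frequency.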

Unit 4

Neural Networks & Deep Learning

★★★

Artificial Neuron + Activation Functions [PYQ REPEATED]

10 Marks PYQ 2025 ML-2 Mid-Sem Q3a, 2025 End-Term

Artificial Neuron Model

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = Σwᵢxᵢ + b
Output: y = f(z) where f is the activation function
Inputs (x) → Weights (w) → Sum (z) → Activation f(z) → Output (y)

x₁ ─── w₁ ──┐
x₂ ─── w₂ ──┤
x₃ ─── w₃ ──┼──→ [ Σwᵢxᵢ + b ] ──→ f(z) ──→ Output
...         │
xₙ ─── wₙ ──┘

Activation Functions

Function	Formula	Range	Use Case
Sigmoid	1/(1+e^(-z))	(0,1)	Binary classification output
Tanh	(eᶻ-e^(-z))/(eᶻ+e^(-z))	(-1,1)	Hidden layers (zero-centered)
ReLU	max(0,z)	[0,∞)	Most common in deep networks
Leaky ReLU	max(0.01z, z)	(-∞,∞)	Fix dying ReLU problem
Softmax	e^zᵢ / Σe^zⱼ	(0,1), sums to 1	Multi-class classification output
Linear	z	(-∞,∞)	Regression output

Solved Numerical [PYQ 2025 End-Term Q5a]

Problem: 3-input neuron, inputs x=(0.8, 0.6, 0.4), weights w=(0.2, 0.1, -0.3), bias b=0.35. Use sigmoid. Find output y.

z = w₁x₁ + w₂x₂ + w₃x₃ + b
= 0.2×0.8 + 0.1×0.6 + (-0.3)×0.4 + 0.35
= 0.16 + 0.06 − 0.12 + 0.35 = 0.45
y = sigmoid(0.45) = 1/(1+e^(-0.45)) = 1/(1+0.6376) = 1/1.6376 ≈ 0.611

Output y ≈ 0.611
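The neuron calculation can be checked in a few lines. The helper name `neuron_output` is an assumption; the inputs, weights and bias are the ones from the problem.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum followed by a sigmoid activation; returns (z, y)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    y = 1.0 / (1.0 + math.exp(-z))
    return z, y

z, y = neuron_output(inputs=[0.8, 0.6, 0.4], weights=[0.2, 0.1, -0.3], bias=0.35)
print(round(z, 2), round(y, 3))
```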

★★★

Backpropagation + Gradient Descent [PYQ EXACT]

10 Marks PYQ 2025 ML-2 Mid-Sem Q2a, 2025 End-Term Numerical

Gradient Descent

Gradient Descent is an optimization algorithm used to minimize the loss function by iteratively updating parameters in the direction of steepest descent.

Weight update rule: w = w − α × ∂L/∂w
where:
α = learning rate
∂L/∂w = gradient of loss with respect to weight

Solved Numerical [PYQ ML-2 Q2a]

Problem: Use Gradient Descent to minimize f(x) = x² − 2x, learning rate η = 0.1, start x = 0, 3 iterations.

f(x) = x² − 2x
f'(x) = 2x − 2 ← derivative (gradient)
Update rule: xₙₑₓₜ = x − η × f'(x)

Iteration 1 (x = 0)

f'(0) = 2(0) − 2 = −2
x₁ = 0 − 0.1×(−2) = 0 + 0.2 = 0.2

Iteration 2 (x = 0.2)

f'(0.2) = 2(0.2) − 2 = 0.4 − 2 = −1.6
x₂ = 0.2 − 0.1×(−1.6) = 0.2 + 0.16 = 0.36

Iteration 3 (x = 0.36)

f'(0.36) = 2(0.36) − 2 = 0.72 − 2 = −1.28
x₃ = 0.36 − 0.1×(−1.28) = 0.36 + 0.128 = 0.488

After 3 iterations: x ≈ 0.488
True minimum: x = 1 (where f'(x) = 0 → 2x−2=0 → x=1)
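The three iterations can be reproduced with a short loop. The function name `gradient_descent` is an assumption; the gradient 2x − 2, η = 0.1 and the start x = 0 come from the problem.

```python
def gradient_descent(grad, x0, learning_rate, n_iters):
    """Repeat x ← x − η·f'(x) and record every value of x."""
    x = x0
    history = [x]
    for _ in range(n_iters):
        x = x - learning_rate * grad(x)
        history.append(x)
    return history

history = gradient_descent(lambda x: 2 * x - 2, x0=0.0, learning_rate=0.1, n_iters=3)
print([round(v, 3) for v in history])
```

Running more iterations moves x closer and closer to the true minimum at x = 1.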

Backpropagation Steps

Forward Pass: Compute output layer by layer
Compute Loss: Compare predicted output with actual target using a loss function, e.g. squared error L = ½(actual − predicted)²
Backward Pass: Use chain rule to compute gradients layer by layer
Update Weights: w = w − α × ∂L/∂w
Repeat until convergence
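The steps above can be sketched for the smallest possible network: one sigmoid neuron with one input. The squared-error loss, the starting weights, α = 0.5 and the single training example are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(w, b, x, target, alpha):
    """One forward pass, backward pass and weight update for a single sigmoid neuron."""
    # Forward pass
    z = w * x + b
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2
    # Backward pass (chain rule): dL/dw = dL/dy × dy/dz × dz/dw
    dL_dy = y - target
    dy_dz = y * (1.0 - y)          # derivative of the sigmoid
    dL_dw = dL_dy * dy_dz * x
    dL_db = dL_dy * dy_dz
    # Update weights: w = w − α × ∂L/∂w
    return w - alpha * dL_dw, b - alpha * dL_db, loss

w, b = 0.5, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=1.0, alpha=0.5)
    losses.append(loss)
print(losses[0], losses[-1])
```

In a multi-layer network the same chain rule is applied layer by layer, reusing the gradient already computed for the layer above.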

★★★

CNN — Convolutional Neural Network [PYQ ML-2]

10 Marks PYQ 2025 ML-2 Mid-Sem Q3b

CNN Architecture Layers

Input → Conv Layer → ReLU → Pooling → Conv Layer → ReLU → Pooling → Flatten → FC Layer → Output (Softmax)

Layers Explained

Conv Layer: Applies filters/kernels to extract features (edges, textures, patterns). Output = Feature Map. Formula: Output size = ⌊(Input − Kernel + 2×Padding) / Stride⌋ + 1
ReLU Activation: Applies max(0,z) — removes negative values, adds non-linearity
Pooling Layer: Reduces spatial dimensions. Max Pooling = take maximum in each region. Reduces computation, provides translation invariance
Flatten: Converts 2D feature maps to 1D vector
Fully Connected (FC) Layer: Regular neural network layers for classification
Softmax Output: Converts to class probabilities
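The output-size formula from the Conv Layer entry is easy to check in code. The helper name and the example sizes below are assumptions; integer division plays the role of the floor when the stride does not divide evenly.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output = floor((Input − Kernel + 2×Padding) / Stride) + 1, for one spatial dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

sizes = [
    conv_output_size(32, 5),               # 5×5 kernel, no padding: the map shrinks
    conv_output_size(32, 3, padding=1),    # 3×3 kernel, padding 1: size is preserved
    conv_output_size(28, 2, stride=2),     # 2×2 window with stride 2 (typical pooling): halves
]
print(sizes)
```

The same formula applies to pooling layers, with the pooling window in place of the kernel.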

Applications: Image Recognition, Object Detection, Medical Image Analysis, Self-driving cars, Face Recognition

💡

CNN = "Look → Shrink → Repeat → Guess!" (Conv→Pool→Conv→Pool→FC)

★★★

RNN + LSTM — Architecture and Gates [PYQ ML-2 REPEATED]

10 Marks PYQ 2025 ML-2 Mid-Sem Q3b, Q4a

RNN — Recurrent Neural Network

RNN is a neural network designed for sequential/time-series data. It has a feedback loop — the output of previous step is used as input to current step.

hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + b)
yₜ = Wᵧ·hₜ + bᵧ
Problem: Vanishing Gradient, so it can't remember long sequences!

LSTM — Long Short-Term Memory

LSTM is an improved RNN with 3 gates that control information flow, which largely mitigates the vanishing gradient problem.

Cell State (Cₜ): Long-term memory highway
Hidden State (hₜ): Short-term/working memory

Three Gates of LSTM

Forget Gate (fₜ): Decides what information to THROW AWAY from cell state.
fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) → Output: 0 = forget all, 1 = keep all
Input Gate (iₜ): Decides what NEW information to ADD to cell state.
iₜ = σ(Wᵢ·[hₜ₋₁, xₜ] + bᵢ), C̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc)
Cell state update: Cₜ = fₜ × Cₜ₋₁ + iₜ × C̃ₜ
Output Gate (oₜ): Decides what to OUTPUT as hidden state.
oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo), hₜ = oₜ × tanh(Cₜ)
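One LSTM time step can be sketched in NumPy using the gate equations above plus the standard cell-state update Cₜ = fₜ × Cₜ₋₁ + iₜ × C̃ₜ. The sizes (3 hidden units, 2 inputs), the random weights and the parameter names in `params` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """Compute (h_t, c_t) for a single time step."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["Wf"] @ concat + params["bf"])        # forget gate
    i_t = sigmoid(params["Wi"] @ concat + params["bi"])        # input gate
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])    # candidate cell state
    o_t = sigmoid(params["Wo"] @ concat + params["bo"])        # output gate
    c_t = f_t * c_prev + i_t * c_tilde                         # new cell state
    h_t = o_t * np.tanh(c_t)                                   # new hidden state
    return h_t, c_t

hidden, inputs = 3, 2
rng = np.random.default_rng(0)
params = {}
for gate in "fico":                                            # forget, input, candidate, output
    params["W" + gate] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    params["b" + gate] = np.zeros(hidden)

h_t, c_t = lstm_step(np.array([1.0, -1.0]), np.zeros(hidden), np.zeros(hidden), params)
print(h_t, c_t)
```

Processing a sequence means calling `lstm_step` once per time step and feeding the returned `h_t` and `c_t` back in as `h_prev` and `c_prev`.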

💡

"3-gated memory controller" — LSTM: Forget useless, Input new, Output relevant!

Feature	RNN	LSTM
Memory	Short-term only	Long + Short term
Vanishing Gradient	Yes (big problem)	Largely mitigated by gates
Long dependencies	Fails	Handles well
Complexity	Simple	More complex

★★

Ensembling — Bagging, Boosting, Random Forest [PYQ ML-2]

10 Marks PYQ 2025 ML-2 Mid-Sem Q1a,Q1b

Ensembling

Ensembling combines multiple models to produce a better prediction than any single model. Two main methods: Bagging and Boosting.

Bagging (Bootstrap Aggregating)

Train models in parallel on different bootstrap samples (random subsets drawn with replacement)
Combine by majority vote (classification) or average (regression)
Reduces Variance
Example: Random Forest

Boosting

Train models sequentially — each fixes errors of previous
Combine by weighted sum
Reduces Bias
Examples: AdaBoost, XGBoost, Gradient Boosting

Random Forest

Random Forest = Bagging + Decision Trees. It builds multiple decision trees on random subsets of data and features, then combines their predictions.

Random Forest = Many Decision Trees + Voting = Better accuracy, less overfitting

Unit 5

Implementation, Preprocessing & Ethics

★★

Data Preprocessing — Missing Values, Feature Scaling, Encoding

10 Marks PYQ 2025 End-Term Q2a, 2025 End-Term Q5b

Missing Value Handling

Mean Imputation: Replace with column mean (for normal distribution)
Median Imputation: Replace with median (for skewed data)
Mode Imputation: Replace with most frequent value (for categorical)
KNN Imputation: Replace using K nearest neighbors' values
Deletion: Remove rows/columns with too many missing values

Feature Scaling [PYQ: "What is feature scaling? Why required?"]

Feature scaling normalizes the range of features so that no feature dominates due to its scale. Required for algorithms that use distance (KNN, SVM, K-Means) or gradient descent (Neural Networks).

Min-Max Normalization: x' = (x − min) / (max − min) → Range: [0, 1]
Z-Score Standardization: x' = (x − μ) / σ → Mean=0, Std=1
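Both scaling formulas can be tried on a small made-up feature column. The function names are assumptions; `z_score` uses the population standard deviation (NumPy's default), so the result has standard deviation exactly 1.

```python
import numpy as np

def min_max_scale(x):
    """x' = (x − min) / (max − min), giving values in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """x' = (x − μ) / σ, giving mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # hypothetical feature values
scaled = min_max_scale(x)
standardized = z_score(x)
print(scaled)
print(standardized)
```

In practice the min, max, μ and σ are computed on the training set only and then reused to transform the test set.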

Categorical Encoding

Label Encoding: Assign an integer to each category in order (Low=0, Medium=1, High=2): use for ordinal data
One-Hot Encoding: Create a binary column for each category (Red, Blue, Green → three 0/1 columns): use for nominal data
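Both encodings can be sketched without any library beyond NumPy. The helper names and the example categories (Low/Medium/High for an ordered feature, colours for an unordered one) are assumptions for illustration.

```python
import numpy as np

def label_encode(values, order):
    """Map each category to its position in `order` (suitable for ordinal data)."""
    mapping = {category: index for index, category in enumerate(order)}
    return np.array([mapping[v] for v in values])

def one_hot_encode(values):
    """Return (sorted category names, 0/1 matrix with one column per category)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded

sizes = label_encode(["Low", "High", "Medium"], order=["Low", "Medium", "High"])
categories, colours = one_hot_encode(["Red", "Blue", "Green", "Red"])
print(sizes)
print(categories)
print(colours)
```

Label-encoding the colours instead would wrongly suggest an order among them, which distance-based models like KNN would then treat as meaningful.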

💡

Min-Max → squishes into [0,1] | Z-Score → centers at 0 with unit spread

★

Ethical Issues in Machine Learning

5–10 Marks Short Note

Bias and Fairness: ML models can inherit biases from training data, leading to unfair decisions (e.g., gender bias in hiring algorithms)
Privacy: Training on personal data without consent violates privacy (facial recognition, health data)
Transparency (Explainability): "Black box" models are hard to interpret — doctors/judges need explanations for AI decisions
Accountability: Who is responsible when an AI system causes harm? (Self-driving car accident)
Job Displacement: Automation through ML may cause unemployment
Deepfakes & Misinformation: ML can generate realistic fake content, spreading false information
Security: Adversarial attacks can fool ML models with small, imperceptible changes

Quick Reference

Important Formulas Sheet

📐 All Formulas at a Glance

Linear Regression — Slope

a₁ = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)

Linear Regression — Intercept

a₀ = ȳ − a₁·x̄

Sigmoid Function

σ(z) = 1 / (1 + e^(-z))

Euclidean Distance

d = √[Σ(xᵢ − yᵢ)²]

Accuracy

(TP + TN) / (TP + TN + FP + FN)

Precision

TP / (TP + FP)

Recall (Sensitivity)

TP / (TP + FN)

F1-Score

2 × Precision × Recall / (Precision + Recall)

Entropy

−Σ pᵢ × log₂(pᵢ)

Gini Index

1 − Σ pᵢ²

Information Gain

IG = Entropy(S) − Σ(|Sv|/|S|) × Entropy(Sv)

Gradient Descent Update

w = w − α × ∂L/∂w

Min-Max Scaling

x' = (x − min) / (max − min)

Z-Score Standardization

x' = (x − μ) / σ

Apriori: Support

Support(A) = freq(A) / N

Apriori: Confidence

Conf(A→B) = Support(A∪B) / Support(A)

Neuron Output

y = f(Σwᵢxᵢ + b)

ReLU

f(z) = max(0, z)

Softmax

f(zᵢ) = e^zᵢ / Σe^zⱼ

Q-Learning (RL)

Q(s,a) ← Q(s,a) + α[R + γ·maxQ(s',a') − Q(s,a)]

Viva Prep

Viva-Style Short Questions

Q: What is a tensor?

A generalization of scalars (0D), vectors (1D), matrices (2D) to N-dimensions. TensorFlow's basic data structure.

Q: What is the curse of dimensionality?

As the number of features increases, data becomes sparse and algorithms perform poorly.

Q: What is regularization?

Technique to prevent overfitting by adding a penalty term to the loss function (L1=Lasso, L2=Ridge).

Q: What is the vanishing gradient problem?

Gradients become extremely small during backpropagation in deep networks, making training very slow or stuck. LSTM largely mitigates this for recurrent networks.

Q: Difference between Precision and Recall?

Precision = how many predicted positives are actually positive. Recall = how many actual positives are correctly predicted.

Q: Why is K-fold better than simple train-test split?

K-fold uses all data for both training and testing across K experiments, giving a more reliable performance estimate.

Q: What does 'kernel trick' mean in SVM?

Computing dot product in high-dimensional space without explicitly transforming data, making it computationally efficient.

Q: What is Gini Impurity?

Measure of how often a randomly chosen element would be incorrectly labeled if it were labeled at random according to the class distribution in the node. Gini=0 means pure node.

Q: What is one-hot encoding?

Converting categorical variables into binary columns (each category = one column with 0 or 1).

Q: What is dropout in neural networks?

Regularization technique that randomly sets neurons to zero during training to prevent overfitting.

Q: What is epoch in deep learning?

One complete pass through the entire training dataset.

Q: What is transfer learning?

Using a pre-trained model as starting point for a new task. Reduces training time and data requirements.

Q: What is similarity score?

A measure of how similar two data points are. Cosine similarity = cos(θ) = A·B / (|A|×|B|). Range: [-1, 1].

Q: What is hyperparameter tuning?

Process of finding the best values for hyperparameters (like K in KNN, learning rate). Methods: Grid Search, Random Search, Bayesian Optimization.

Emergency Prep

Last Night Revision Notes 🔥

🔥 Must Remember for Tomorrow

Unit 1 — Foundations

✅ ML Definition = "Learning from data without explicit programming" (Arthur Samuel) | Tom Mitchell = performance P on task T improves with experience E

✅ 4 Types: Supervised (labeled), Unsupervised (unlabeled), Semi-supervised (partial), Reinforcement (reward)

✅ NumPy = array computation | TensorFlow = deep learning framework by Google

Unit 2 — Supervised Learning

✅ Linear Reg slope: a₁ = (nΣxy−ΣxΣy)/(nΣx²−(Σx)²)

✅ Sigmoid: σ(z) = 1/(1+e^(-z)), z = a₀ + a₁x, decision: z≥0 → class 1

✅ KNN: Euclidean distance → sort → K nearest → majority vote

✅ Confusion Matrix: TP,TN,FP,FN → Accuracy=(TP+TN)/N, Precision=TP/(TP+FP), Recall=TP/(TP+FN)

✅ Cross Validation: K-Fold, LOOCV, Stratified K-Fold

✅ Bias-Variance: Simple=High Bias(underfitting), Complex=High Variance(overfitting)

✅ SVM: Maximize margin, Kernel trick for non-linear data

Unit 3 — Unsupervised Learning

✅ K-Means: Random centroids → Assign → Update → Repeat until convergence

✅ DBSCAN: ε and MinPts → Core/Border/Noise points, no K needed, handles outliers

✅ PCA: Standardize→Covariance→Eigenvalues→Project, keeps max variance directions

✅ Apriori: Support, Confidence, Lift — "Customers who buy X also buy Y"

Unit 4 — Neural Networks & Deep Learning

✅ Neuron: z = Σwᵢxᵢ + b, output = f(z)

✅ Activations: Sigmoid(0,1), Tanh(-1,1), ReLU(max(0,z)), Softmax(multiclass)

✅ Backprop: Forward pass → Loss → Backward (chain rule) → Update weights

✅ GD: w = w − α×∂L/∂w (move against gradient)

✅ CNN: Conv→ReLU→Pool→Flatten→FC→Output (for images)

✅ LSTM: 3 gates (Forget, Input, Output), mitigates the vanishing gradient problem in RNN

✅ Random Forest = Bagging + Decision Trees

Unit 5 — Preprocessing & Ethics

✅ Missing values: Mean/Median/Mode imputation or KNN imputation

✅ Scaling: Min-Max x'=(x-min)/(max-min), Z-score x'=(x-μ)/σ

✅ Encoding: Label encoding (integers), One-hot (binary columns)

✅ Ethics: Bias, Privacy, Transparency, Accountability, Deepfakes

⚡ Memory Tricks

📌 KNN: "Find K Friends and vote"

📌 SVM: "Draw fattest line between classes"

📌 PCA: "Compress by keeping important directions"

📌 LSTM: "3-gated memory controller: Forget, Input, Output"

📌 Confusion Matrix: "TP and TN are correct friends, FP and FN are mistakes"

Most Probable

Mock Exam Paper (PYQ Pattern Based)

MCA II SEMESTER — MACHINE LEARNING (TMC-211)

MOST PROBABLE END-TERM EXAM PAPER

Time: 3 Hours
Max Marks: 100
Note: Answer any TWO parts from each question. Each part = 10 marks.

Q1. (10×2=20)
a) Define supervised and unsupervised learning. Illustrate each with two examples. Give four applications of Machine Learning.

— OR —

b) What is Reinforcement Learning? Explain its components (Agent, Environment, State, Action, Reward, Policy) with a suitable example.

— OR —

c) Write a short note on: (i) NumPy (ii) TensorFlow (iii) Pandas

Q2. (10×2=20)
a) Use KNN classifier (K=5) to classify the test point (Brightness=20, Saturation=35). Use Euclidean distance: [Data points: 7 rows with Brightness, Saturation, Class (Red/Blue)]

— OR —

b) Explain: (i) Recall (ii) Information Gain (iii) Gini Index (iv) K-Means Clustering (v) F1 Score

— OR —

c) Explain DBSCAN algorithm for density based clustering. List its advantages compared to K-Means.

Q3. (10×2=20)
a) What is feature scaling? Why is it required? Explain Min-Max Normalization and Z-Score Standardization with examples.

— OR —

b) Explain Bayesian Classifier for multiclass classification with a suitable example.

— OR —

c) What are Kernel Functions in SVM? Explain any four kernel functions with examples.

Q4. (10×2=20)
a) What is sigmoidal function? With a₀=−64, a₁=2, find pass/fail probability for a student who studies 33 hours.

— OR —

b) Obtain the Linear Regression equation for: X: 2.0 3.0 4.0 5.0 6.0 / Y: 3.00 4.00 3.40 6.00 5.00. Find Y when X=7.0.

— OR —

c) 1000 patients tested for Diabetes; 900 healthy, 100 sick. Sick: 72 +ve, 28 −ve. Healthy: 28 +ve, 872 −ve. Construct confusion matrix. Calculate Accuracy, Precision, Recall, F1-Score.

Q5. (10×2=20)
a) Explain Backpropagation algorithm. Use Gradient Descent to minimize f(x)=x²−2x with η=0.1, starting from x=0 for 3 iterations.

— OR —

b) Draw clear diagram of CNN. Give brief introduction of all layers. Explain Sigmoid, tanh, ReLU activation functions.

— OR —

c) Give architecture of LSTM. What problem does it solve in RNN? Explain its three gates in detail.