ROC-AUC: Area Under the Receiver Operating Characteristic Curve
Plots True Positive Rate vs False Positive Rate at various thresholds
AUC = 0.5: Random classifier
AUC = 1.0: Perfect classifier
Threshold-independent metric
What are some common metrics for regression tasks?
Mean Absolute Error (MAE):
Average of absolute differences: (1/n) × Σ|yᵢ - ŷᵢ|
Same units as target variable
Less sensitive to outliers
Mean Squared Error (MSE):
Average of squared differences: (1/n) × Σ(yᵢ - ŷᵢ)²
Penalizes larger errors more heavily
Not in same units as target
Root Mean Squared Error (RMSE):
Square root of MSE: √MSE
Same units as target variable
Most commonly used
R² (Coefficient of Determination):
Proportion of variance explained by model
Range: at most 1; typically 0 to 1, but can be negative when the model fits worse than simply predicting the mean
R² = 1 means perfect fit
Mean Absolute Percentage Error (MAPE):
Percentage-based metric
Easy to interpret
Problems when actual values are near zero
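A minimal NumPy sketch of the metrics above (the function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics above from two arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                  # (1/n) * sum |y - yhat|
    mse = np.mean(err ** 2)                     # (1/n) * sum (y - yhat)^2
    rmse = np.sqrt(mse)                         # back in target units
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                    # proportion of variance explained
    mape = np.mean(np.abs(err / y_true)) * 100  # breaks if y_true is near zero
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'MAPE': mape}
```

Note how MAPE divides by the actual values, which is exactly why it misbehaves near zero.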
Explain the curse of dimensionality.
The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces that don't occur in low-dimensional settings.
Key Issues:
Data Sparsity: As dimensions increase, data points become increasingly sparse. The volume of space grows exponentially, requiring exponentially more data to maintain density
Distance Metrics Become Meaningless: In high dimensions, distances between points become similar, making nearest neighbor algorithms less effective
Computational Cost: More features = more computation time and memory
Overfitting Risk: With many features, models can easily overfit to noise
Solutions:
Feature selection (remove irrelevant features)
Dimensionality reduction (PCA, t-SNE)
Regularization
Collect more data
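The "distances become similar" point can be demonstrated directly. A small illustrative sketch (function name and constants are arbitrary) that measures the contrast between the nearest and farthest point from a random query:

```python
import numpy as np

def distance_contrast(n_points=500, dim=2, seed=0):
    """Ratio (max - min) / min over distances from one query point to a
    random cloud. A small ratio means distances have concentrated --
    the curse of dimensionality in action."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))      # uniform points in the unit cube
    q = rng.random(dim)                  # one query point
    d = np.linalg.norm(X - q, axis=1)    # Euclidean distances
    return (d.max() - d.min()) / d.min()
```

In 2 dimensions the nearest point is much closer than the farthest; in 1000 dimensions all distances cluster tightly, so nearest-neighbor distinctions carry little signal.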
Difference between parametric and non-parametric models.
Parametric Models:
Make strong assumptions about data distribution
Fixed number of parameters (doesn't grow with data)
Examples: Linear Regression, Logistic Regression, Naive Bayes
Pros: Faster, require less data, easier to interpret
Cons: Strong assumptions may not hold, less flexible
Non-Parametric Models:
Make few assumptions about data distribution
Number of parameters can grow with training data
Examples: KNN, Decision Trees, Random Forest, SVM
Pros: More flexible, can model complex patterns
Cons: Require more data, slower, risk of overfitting
2. Programming & Data Handling
How do you handle missing or corrupted data in a dataset?
1. Understanding the Missing Data:
MCAR (Missing Completely At Random)
MAR (Missing At Random)
MNAR (Missing Not At Random)
2. Techniques:
Deletion:
Drop rows with missing values (if small percentage)
Drop columns that are mostly missing
Imputation:
Numerical: fill with mean or median
Categorical: fill with mode or a dedicated "Missing" category
Model-based: KNN or iterative imputation
For corrupted data: validate types and ranges, then correct or treat as missing
How do you encode categorical variables?
1. One-Hot Encoding:
One binary column per category
Safe for nominal features, but explodes dimensionality for high cardinality
2. Label Encoding:
Map each category to an integer (implies an ordering, so best for ordinal features)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['category_encoded'] = encoder.fit_transform(df['category'])
3. Target/Mean Encoding:
Replace category with mean of target variable
Reduces dimensionality
Risk of data leakage - use cross-validation
4. Frequency Encoding:
Replace with frequency of each category
5. Binary Encoding:
Convert to integer then to binary
Reduces dimensionality vs one-hot
6. Embeddings (for deep learning):
Learn dense vector representations
Good for high-cardinality features
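A small pandas sketch of frequency and mean (target) encoding (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY', 'LA'],
                   'target': [1, 0, 1, 0, 0, 1]})

# Frequency encoding: replace each category with how often it appears
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

# Mean (target) encoding: replace each category with the target mean
# (in practice compute these inside CV folds to avoid leakage)
means = df.groupby('city')['target'].mean()
df['city_target'] = df['city'].map(means)
```

Computing the target means on the full dataset, as done here for brevity, is exactly the leakage risk noted above.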
Write a Python function to normalize or standardize a dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
def normalize_data(X, method='standard'):
    """
    Normalize or standardize dataset

    Parameters:
        X: array-like, features to normalize
        method: 'standard' or 'minmax'

    Returns:
        Normalized data and fitted scaler
    """
    if method == 'standard':
        # Standardization: mean=0, std=1
        # X_scaled = (X - mean) / std
        scaler = StandardScaler()
    elif method == 'minmax':
        # Min-Max normalization: scale to [0, 1]
        # X_scaled = (X - min) / (max - min)
        scaler = MinMaxScaler()
    else:
        raise ValueError("Method must be 'standard' or 'minmax'")
    X_scaled = scaler.fit_transform(X)
    return X_scaled, scaler

# Manual implementation
def standardize_manual(X):
    """Standardize without sklearn"""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std

def normalize_manual(X):
    """Min-max normalize without sklearn"""
    min_val = np.min(X, axis=0)
    max_val = np.max(X, axis=0)
    return (X - min_val) / (max_val - min_val)
How do you split data into training, validation, and test sets?
from sklearn.model_selection import train_test_split
def split_data(X, y, train_size=0.7, val_size=0.15,
               test_size=0.15, random_state=42):
    """
    Split data into train, validation, and test sets

    Common splits:
    - 70/15/15 or 80/10/10 for large datasets
    - 60/20/20 for smaller datasets
    """
    # First split: separate test set
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        stratify=y  # Maintain class distribution
    )
    # Second split: separate train and validation
    val_ratio = val_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp,
        test_size=val_ratio,
        random_state=random_state,
        stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# For time series data - no shuffling!
def split_time_series(X, y, train_size=0.7, val_size=0.15):
    """Split time series maintaining temporal order"""
    n = len(X)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    X_train, y_train = X[:train_end], y[:train_end]
    X_val, y_val = X[train_end:val_end], y[train_end:val_end]
    X_test, y_test = X[val_end:], y[val_end:]
    return X_train, X_val, X_test, y_train, y_val, y_test
Given a dataset in CSV, how would you load it and prepare it for a machine learning pipeline?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
def prepare_ml_pipeline(filepath):
    """
    Complete ML data preparation pipeline
    """
    # 1. Load data
    df = pd.read_csv(filepath)

    # 2. Basic exploration
    print(f"Shape: {df.shape}")
    print(f"Missing values:\n{df.isnull().sum()}")
    print(f"Data types:\n{df.dtypes}")

    # 3. Handle missing values
    # Numerical: fill with median
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    df[numerical_cols] = df[numerical_cols].fillna(
        df[numerical_cols].median()
    )
    # Categorical: fill with mode (assign back; inplace on a column
    # slice triggers pandas chained-assignment warnings)
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    # 4. Encode categorical variables
    label_encoders = {}
    for col in categorical_cols:
        if col != 'target':  # Don't encode target yet
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            label_encoders[col] = le

    # 5. Separate features and target
    X = df.drop('target', axis=1)
    y = df['target']

    # 6. Encode target if categorical
    if y.dtype == 'object':
        le_target = LabelEncoder()
        y = le_target.fit_transform(y)

    # 7. Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 8. Scale features (fit on train only to avoid leakage)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return (X_train_scaled, X_test_scaled,
            y_train, y_test,
            scaler, label_encoders)
# Usage
X_train, X_test, y_train, y_test, scaler, encoders = \
prepare_ml_pipeline('data.csv')
What are Python libraries you have used for ML? Compare NumPy, Pandas, Scikit-learn, and TensorFlow/PyTorch.
NumPy:
Fundamental package for numerical computing
Efficient multi-dimensional arrays
Mathematical functions
Foundation for other libraries
Use for: Matrix operations, mathematical computations
Pandas:
Data manipulation and analysis
DataFrame structure for tabular data
Missing-data handling, merging, grouping, time series
Use for: Data loading, cleaning, exploration
Scikit-learn:
Traditional ML algorithms (regression, classification, clustering)
Consistent fit/predict/transform API, preprocessing, model selection
Use for: Classical ML, pipelines, evaluation
TensorFlow/PyTorch:
Deep learning frameworks with GPU acceleration and automatic differentiation
TensorFlow: strong production/deployment ecosystem; PyTorch: dynamic graphs, research-friendly
Use for: Neural networks, large-scale training
Interview Tip: Mention that you'd use Scikit-learn for traditional ML, and TensorFlow/PyTorch for deep learning. Pandas for data prep, NumPy for computations.
3. Algorithms & Math
Explain gradient descent. How does learning rate affect convergence?
Gradient Descent: Optimization algorithm that iteratively adjusts model parameters to minimize a cost function.
Algorithm:
Initialize parameters θ randomly
Compute cost function J(θ)
Calculate gradient ∇J(θ)
Update: θ = θ - α × ∇J(θ)
Repeat until convergence
Learning Rate (α) Effects:
Too High: May overshoot minimum, diverge, oscillate
Too Low: Converges very slowly, may get stuck in local minima
Optimal: Converges efficiently to minimum
Solutions:
Learning rate scheduling (decay over time)
Adaptive learning rates (Adam, RMSprop)
Grid search or random search for α
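The update rule above can be sketched for linear regression with an MSE cost (a minimal illustration, not a production optimizer):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize J(theta) = (1/2n) * ||X theta - y||^2 by the rule
    theta = theta - alpha * grad J(theta)."""
    n, d = X.shape
    theta = np.zeros(d)                     # initialize parameters
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n    # gradient of the MSE cost
        theta -= alpha * grad               # update step
    return theta
```

Try rerunning with alpha=2.0 on the same data to see divergence (the overshooting case described above).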
What is the difference between batch, mini-batch, and stochastic gradient descent?
Batch Gradient Descent:
Uses entire dataset to compute gradient
θ = θ - α × ∇J(θ) computed on all samples
Pros: Stable convergence, accurate gradient
Cons: Slow for large datasets, high memory usage
Stochastic Gradient Descent (SGD):
Uses one random sample at a time
Updates after each sample
Pros: Fast, can escape local minima
Cons: Noisy updates, doesn't converge smoothly
Mini-Batch Gradient Descent:
Uses small batches (typically 32, 64, 128, 256)
Best of both worlds
Pros: Balance between speed and stability, vectorization benefits
Cons: Need to tune batch size
Most Common: Mini-batch is the default in modern ML frameworks.
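All three variants fit in one loop by varying the batch size. A sketch (names and hyperparameters are illustrative): batch_size=len(X) gives batch GD, batch_size=1 gives SGD, anything in between is mini-batch:

```python
import numpy as np

def sgd_linear(X, y, batch_size, alpha=0.1, epochs=300, seed=0):
    """Mini-batch gradient descent on the MSE objective for linear
    regression. One epoch = one shuffled pass over the data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]    # current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad
    return theta
```

Smaller batches mean noisier gradients but more updates per pass, which is the speed/stability trade-off described above.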
Explain the difference between correlation and covariance.
Covariance:
Measures how two variables change together
Formula: Cov(X,Y) = E[(X - μₓ)(Y - μᵧ)]
Positive: variables move in same direction
Negative: variables move in opposite directions
Zero: no linear relationship
Issue: Scale-dependent, hard to interpret magnitude
Correlation (Pearson's r):
Normalized version of covariance
Formula: r = Cov(X,Y) / (σₓ × σᵧ)
Range: -1 to +1
-1: perfect negative correlation
0: no linear correlation
+1: perfect positive correlation
Advantage: Scale-independent, easy to interpret
Important: Both only measure LINEAR relationships. Can miss non-linear dependencies.
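NumPy computes both directly; a quick sketch showing the scale-dependence difference:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 * x + 3                  # perfect linear relationship, larger scale

cov = np.cov(x, y)[0, 1]        # scale-dependent: grows with the units of y
r = np.corrcoef(x, y)[0, 1]     # scale-free: exactly +1 here
```

Rescaling y (say, to 1000 * x + 3) multiplies the covariance but leaves r at +1, which is why correlation is the easier quantity to interpret.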
What is the role of eigenvectors and eigenvalues in PCA?
Principal Component Analysis (PCA): Dimensionality reduction technique that finds orthogonal axes that capture maximum variance.
Eigenvectors:
Define the directions of principal components
Each eigenvector represents a direction in feature space
Orthogonal to each other (uncorrelated components)
The first eigenvector points in direction of maximum variance
Eigenvalues:
Represent amount of variance captured by each eigenvector
Larger eigenvalue = more important component
Used to determine how many components to keep
Sum of all eigenvalues = total variance in data
PCA Process:
Standardize the data
Compute covariance matrix
Calculate eigenvectors and eigenvalues of covariance matrix
Sort eigenvalues in descending order
Select top k eigenvectors (principal components)
Transform data to new k-dimensional space
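The steps above can be sketched with NumPy's eigendecomposition (a minimal illustration; real code would use sklearn's PCA):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                # 1. center the data
    C = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    vals, vecs = np.linalg.eigh(C)         # 3. eigenvalues/eigenvectors
    order = np.argsort(vals)[::-1]         # 4. sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    ratio = vals / vals.sum()              # variance explained per component
    return Xc @ vecs[:, :k], ratio         # 5-6. project onto top k
```

For data lying on a line, the first eigenvalue captures essentially all the variance, matching the interpretation above.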
Describe how a decision tree splits data. What is information gain?
Decision Tree Splitting:
Decision trees recursively partition data by choosing the best feature and threshold at each node to maximize homogeneity in child nodes.
Splitting Criteria:
1. Information Gain (based on Entropy):
Measures reduction in entropy (uncertainty) after split
Entropy: H(S) = -Σ pᵢ log₂(pᵢ)
Information Gain = H(parent) - Weighted_Average(H(children))
Choose split with highest information gain
Used in ID3 and C4.5 algorithms
2. Gini Impurity:
Measures probability of incorrect classification
Gini = 1 - Σ pᵢ²
Choose split with lowest Gini impurity
Used in CART algorithm (Scikit-learn default)
Faster to compute than entropy
3. Variance Reduction (for regression):
Minimize variance in child nodes
Process:
For each feature, try all possible split points
Calculate information gain or Gini for each split
Choose the split with best score
Recursively repeat for child nodes
Stop when: max depth reached, min samples met, or no improvement
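The entropy and information gain formulas above in code (a small sketch; function names are illustrative):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """H(parent) - weighted average of the children's entropies."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted
```

A perfectly separating split of a balanced binary node yields the maximum gain of 1 bit; a split that leaves both children balanced yields a gain of 0.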
What is a confusion matrix?
A confusion matrix is a table showing the performance of a classification model by comparing predicted vs actual labels.
For Binary Classification:
                     Predicted
                  Positive   Negative
Actual  Positive     TP         FN
        Negative     FP         TN
True Positive (TP): Correctly predicted positive
True Negative (TN): Correctly predicted negative
False Positive (FP): Incorrectly predicted positive (Type I error)
False Negative (FN): Incorrectly predicted negative (Type II error)
Derived Metrics:
Accuracy: (TP + TN) / Total
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
Specificity: TN / (TN + FP)
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
4. Machine Learning Algorithms
Compare Linear Regression vs Logistic Regression.
Linear Regression:
Task: Regression (predict continuous values)
Output: Real number (-∞ to +∞)
Function: y = β₀ + β₁x₁ + β₂x₂ + ...
Loss: Mean Squared Error
Example: Predicting house prices, temperature
Logistic Regression:
Task: Classification (predict categories)
Output: Probability (0 to 1)
Function: P(y=1) = 1 / (1 + e^(-z)) where z = β₀ + β₁x₁ + ...
Loss: Log Loss (Cross-Entropy)
Example: Spam detection, disease diagnosis
Key Differences:
Linear: No activation function | Logistic: Sigmoid activation
Linear: Direct prediction | Logistic: Probability then threshold
Linear: Assumes linear relationship | Logistic: Models log-odds
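The sigmoid is what turns the linear score z into a probability. A tiny sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# A linear model outputs any real number; logistic regression squashes
# that score into a probability, then thresholds (commonly at 0.5)
z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)   # monotonically increasing, with sigmoid(0) = 0.5
```

This is the "probability then threshold" step in the comparison above.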
When would you use K-Nearest Neighbors? What are its drawbacks?
When to Use KNN:
Small to medium-sized datasets
When decision boundary is very irregular
For both classification and regression
When interpretability is important (decisions based on similar examples)
As a baseline model
When you don't want to make assumptions about data distribution
How KNN Works:
Choose k (number of neighbors)
Calculate distance to all training points
Find k nearest neighbors
Classification: Vote by majority class
Regression: Average of neighbor values
Drawbacks:
Computational Cost: Must compute distance to all points at prediction time (slow for large datasets)
Storage: Must store entire training dataset
Curse of Dimensionality: Performance degrades in high dimensions
Sensitive to Feature Scales: Must normalize features
Sensitive to Irrelevant Features: All features equally weighted
Imbalanced Data: Biased toward majority class
Choosing k: Difficult to select optimal k
Best Practices: Always normalize features, use cross-validation to select k, consider using KD-trees or Ball-trees for faster search.
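Those best practices combine naturally in a sklearn pipeline (sketched here on the iris dataset for concreteness): scaling happens inside each CV fold, and cross-validation picks k.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler inside the pipeline: refit on each fold's training portion
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Cross-validation to select k
grid = GridSearchCV(pipe, {'knn__n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
best_k = grid.best_params_['knn__n_neighbors']
```

KNeighborsClassifier also accepts algorithm='kd_tree' or 'ball_tree' for the faster-search option mentioned above.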
Explain SVM (Support Vector Machine) and kernel trick.
Support Vector Machine:
SVM finds the optimal hyperplane that maximizes the margin between different classes.
Key Concepts:
Hyperplane: Decision boundary separating classes
Support Vectors: Data points closest to hyperplane that define the margin
Margin: Distance between hyperplane and nearest points from each class
Goal: Maximize margin (robust to noise and generalization)
Kernel Trick:
Allows SVM to handle non-linear decision boundaries by mapping data to higher-dimensional space without explicitly computing the transformation.
Linear Kernel: K(x,y) = x^T y (for linearly separable data)
RBF (Gaussian) Kernel: K(x,y) = exp(-γ‖x - y‖²) (most common non-linear kernel)
Polynomial Kernel: K(x,y) = (x^T y + c)^d
How do you handle imbalanced datasets?
1. Resampling:
Oversample minority class (e.g., SMOTE) or undersample majority class
2. Algorithm-Level Approaches:
Class weights: Penalize misclassification of minority class more
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
Ensemble methods: Balanced Random Forest, EasyEnsemble
3. Evaluation Metrics:
Don't use accuracy! (99% accuracy by always predicting majority)
Use: Precision, Recall, F1-Score, ROC-AUC, PR-AUC
Focus on minority class performance
4. Threshold Adjustment:
Adjust decision threshold based on cost/benefit
Lower threshold to catch more positive cases
5. Collect More Data:
Especially for minority class
6. Anomaly Detection:
For extreme imbalance (fraud, rare disease)
Treat minority as outliers
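The threshold-adjustment idea from point 4 in code (a trivial sketch with made-up probabilities):

```python
import numpy as np

def predict_with_threshold(probs, threshold=0.5):
    """Lowering the threshold flags more positives (higher recall,
    lower precision) -- useful when false negatives are costly."""
    return (np.asarray(probs) >= threshold).astype(int)

probs = [0.2, 0.4, 0.6, 0.9]
default = predict_with_threshold(probs)        # threshold 0.5
lowered = predict_with_threshold(probs, 0.3)   # catches one more positive
```

In sklearn the probabilities would come from predict_proba, and the threshold would be chosen from a precision-recall curve against the application's costs.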
How would you prevent data leakage?
Data Leakage: When information from outside the training dataset influences the model, leading to overly optimistic performance that doesn't generalize.
Types of Data Leakage:
1. Target Leakage:
Using features that won't be available at prediction time
Example: Using "purchase amount" to predict "will purchase" (amount only exists after purchase)
2. Train-Test Contamination:
Test data influencing training process
Example: Fitting scaler on entire dataset before split
How to Prevent:
1. Proper Data Splitting:
Split data FIRST, before any preprocessing
Never touch test data during training
For time series: respect temporal order
2. Fit on Training Only:
# WRONG
scaler.fit(X) # Fits on all data
X_train, X_test = train_test_split(X)
# CORRECT
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # Fit only on training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
3. Use Pipelines:
sklearn Pipeline ensures proper order
Prevents accidentally leaking information
4. Time-Based Validation:
For time series, use TimeSeriesSplit
Train on past, predict future
Never shuffle time-series data
5. Feature Engineering Awareness:
Only use information available at prediction time
Avoid features derived from target
Be careful with aggregated features
6. Cross-Validation Done Right:
Preprocessing inside CV folds, not before
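Points 3 and 6 combined in a short sketch (synthetic data for illustration): the scaler lives inside the pipeline, so each CV fold fits it on that fold's training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# No train-test contamination: preprocessing is refit per fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
```

Calling scaler.fit(X) before cross_val_score, by contrast, would leak test-fold statistics into every fold's training.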
Red Flags: Performance too good to be true, huge gap between CV and test, one feature with abnormally high importance.