ROC-AUC: Area Under the Receiver Operating Characteristic Curve
Plots True Positive Rate vs False Positive Rate at various thresholds
AUC = 0.5: Random classifier
AUC = 1.0: Perfect classifier
Threshold-independent metric
What are some common metrics for regression tasks?
Mean Absolute Error (MAE):
Average of absolute differences: (1/n) × Σ|yᵢ - ŷᵢ|
Same units as target variable
Less sensitive to outliers
Mean Squared Error (MSE):
Average of squared differences: (1/n) × Σ(yᵢ - ŷᵢ)²
Penalizes larger errors more heavily
Not in same units as target
Root Mean Squared Error (RMSE):
Square root of MSE: √MSE
Same units as target variable
Most commonly used
R² (Coefficient of Determination):
Proportion of variance explained by model
Range: at most 1; typically 0 to 1, but can be negative when the model fits worse than simply predicting the mean
R² = 1 means perfect fit
Mean Absolute Percentage Error (MAPE):
Percentage-based metric
Easy to interpret
Problems when actual values are near zero
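A minimal NumPy sketch of the metrics above (the function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics above from two arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                  # (1/n) * sum |y - yhat|
    mse = np.mean(err ** 2)                     # (1/n) * sum (y - yhat)^2
    rmse = np.sqrt(mse)                         # back in target units
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                    # proportion of variance explained
    mape = np.mean(np.abs(err / y_true)) * 100  # breaks if y_true is near zero
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2, 'MAPE': mape}
```

Note how MAPE divides by the actual values, which is exactly why it misbehaves near zero.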
Explain the curse of dimensionality.
The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces that don't occur in low-dimensional settings.
Key Issues:
Data Sparsity: As dimensions increase, data points become increasingly sparse. The volume of space grows exponentially, requiring exponentially more data to maintain density
Distance Metrics Become Meaningless: In high dimensions, distances between points become similar, making nearest neighbor algorithms less effective
Computational Cost: More features = more computation time and memory
Overfitting Risk: With many features, models can easily overfit to noise
Solutions:
Feature selection (remove irrelevant features)
Dimensionality reduction (PCA, t-SNE)
Regularization
Collect more data
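The "distances become similar" point can be demonstrated directly. A small illustrative sketch (function name and constants are arbitrary) that measures the contrast between the nearest and farthest point from a random query:

```python
import numpy as np

def distance_contrast(n_points=500, dim=2, seed=0):
    """Ratio (max - min) / min over distances from one query point to a
    random cloud. A small ratio means distances have concentrated --
    the curse of dimensionality in action."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))      # uniform points in the unit cube
    q = rng.random(dim)                  # one query point
    d = np.linalg.norm(X - q, axis=1)    # Euclidean distances
    return (d.max() - d.min()) / d.min()
```

In 2 dimensions the nearest point is much closer than the farthest; in 1000 dimensions all distances cluster tightly, so nearest-neighbor distinctions carry little signal.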
Difference between parametric and non-parametric models.
Parametric Models:
Make strong assumptions about data distribution
Fixed number of parameters (doesn't grow with data)
Examples: Linear Regression, Logistic Regression, Naive Bayes
Pros: Faster, require less data, easier to interpret
Cons: Strong assumptions may not hold, less flexible
Non-Parametric Models:
Make few assumptions about data distribution
Number of parameters can grow with training data
Examples: KNN, Decision Trees, Random Forest, SVM
Pros: More flexible, can model complex patterns
Cons: Require more data, slower, risk of overfitting
2. Programming & Data Handling
How do you handle missing or corrupted data in a dataset?
1. Understanding the Missing Data:
MCAR (Missing Completely At Random)
MAR (Missing At Random)
MNAR (Missing Not At Random)
2. Techniques:
Deletion:
Drop rows with missing values (if small percentage)
Drop columns that are mostly missing
Imputation:
Numerical: fill with mean or median
Categorical: fill with mode or a dedicated "Missing" category
Model-based: KNN or iterative imputation
For corrupted data: validate types and ranges, then correct or treat as missing
How do you encode categorical variables?
1. One-Hot Encoding:
One binary column per category
Safe for nominal features, but explodes dimensionality for high cardinality
2. Label Encoding:
Map each category to an integer (implies an ordering, so best for ordinal features)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['category_encoded'] = encoder.fit_transform(df['category'])
3. Target/Mean Encoding:
Replace category with mean of target variable
Reduces dimensionality
Risk of data leakage - use cross-validation
4. Frequency Encoding:
Replace with frequency of each category
5. Binary Encoding:
Convert to integer then to binary
Reduces dimensionality vs one-hot
6. Embeddings (for deep learning):
Learn dense vector representations
Good for high-cardinality features
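A small pandas sketch of frequency and mean (target) encoding (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY', 'LA'],
                   'target': [1, 0, 1, 0, 0, 1]})

# Frequency encoding: replace each category with how often it appears
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)

# Mean (target) encoding: replace each category with the target mean
# (in practice compute these inside CV folds to avoid leakage)
means = df.groupby('city')['target'].mean()
df['city_target'] = df['city'].map(means)
```

Computing the target means on the full dataset, as done here for brevity, is exactly the leakage risk noted above.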
Write a Python function to normalize or standardize a dataset.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
def normalize_data(X, method='standard'):
    """
    Normalize or standardize dataset

    Parameters:
        X: array-like, features to normalize
        method: 'standard' or 'minmax'

    Returns:
        Normalized data and fitted scaler
    """
    if method == 'standard':
        # Standardization: mean=0, std=1
        # X_scaled = (X - mean) / std
        scaler = StandardScaler()
    elif method == 'minmax':
        # Min-Max normalization: scale to [0, 1]
        # X_scaled = (X - min) / (max - min)
        scaler = MinMaxScaler()
    else:
        raise ValueError("Method must be 'standard' or 'minmax'")
    X_scaled = scaler.fit_transform(X)
    return X_scaled, scaler

# Manual implementation
def standardize_manual(X):
    """Standardize without sklearn"""
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    return (X - mean) / std

def normalize_manual(X):
    """Min-max normalize without sklearn"""
    min_val = np.min(X, axis=0)
    max_val = np.max(X, axis=0)
    return (X - min_val) / (max_val - min_val)
How do you split data into training, validation, and test sets?
from sklearn.model_selection import train_test_split
def split_data(X, y, train_size=0.7, val_size=0.15,
               test_size=0.15, random_state=42):
    """
    Split data into train, validation, and test sets

    Common splits:
    - 70/15/15 or 80/10/10 for large datasets
    - 60/20/20 for smaller datasets
    """
    # First split: separate test set
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
        stratify=y  # Maintain class distribution
    )
    # Second split: separate train and validation
    val_ratio = val_size / (train_size + val_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp,
        test_size=val_ratio,
        random_state=random_state,
        stratify=y_temp
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# For time series data - no shuffling!
def split_time_series(X, y, train_size=0.7, val_size=0.15):
    """Split time series maintaining temporal order"""
    n = len(X)
    train_end = int(n * train_size)
    val_end = int(n * (train_size + val_size))
    X_train, y_train = X[:train_end], y[:train_end]
    X_val, y_val = X[train_end:val_end], y[train_end:val_end]
    X_test, y_test = X[val_end:], y[val_end:]
    return X_train, X_val, X_test, y_train, y_val, y_test
Given a dataset in CSV, how would you load it and prepare it for a machine learning pipeline?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
def prepare_ml_pipeline(filepath):
    """
    Complete ML data preparation pipeline
    """
    # 1. Load data
    df = pd.read_csv(filepath)

    # 2. Basic exploration
    print(f"Shape: {df.shape}")
    print(f"Missing values:\n{df.isnull().sum()}")
    print(f"Data types:\n{df.dtypes}")

    # 3. Handle missing values
    # Numerical: fill with median
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    df[numerical_cols] = df[numerical_cols].fillna(
        df[numerical_cols].median()
    )
    # Categorical: fill with mode (assign back; inplace on a column
    # slice triggers pandas chained-assignment warnings)
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    # 4. Encode categorical variables
    label_encoders = {}
    for col in categorical_cols:
        if col != 'target':  # Don't encode target yet
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            label_encoders[col] = le

    # 5. Separate features and target
    X = df.drop('target', axis=1)
    y = df['target']

    # 6. Encode target if categorical
    if y.dtype == 'object':
        le_target = LabelEncoder()
        y = le_target.fit_transform(y)

    # 7. Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 8. Scale features (fit on train only to avoid leakage)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return (X_train_scaled, X_test_scaled,
            y_train, y_test,
            scaler, label_encoders)
# Usage
X_train, X_test, y_train, y_test, scaler, encoders = \
prepare_ml_pipeline('data.csv')
What are Python libraries you have used for ML? Compare NumPy, Pandas, Scikit-learn, and TensorFlow/PyTorch.
NumPy:
Fundamental package for numerical computing
Efficient multi-dimensional arrays
Mathematical functions
Foundation for other libraries
Use for: Matrix operations, mathematical computations
Pandas:
Data manipulation and analysis
DataFrame structure for tabular data
Missing-data handling, merging, grouping, time series
Use for: Data loading, cleaning, exploration
Scikit-learn:
Traditional ML algorithms (regression, classification, clustering)
Consistent fit/predict/transform API, preprocessing, model selection
Use for: Classical ML, pipelines, evaluation
TensorFlow/PyTorch:
Deep learning frameworks with GPU acceleration and automatic differentiation
TensorFlow: strong production/deployment ecosystem; PyTorch: dynamic graphs, research-friendly
Use for: Neural networks, large-scale training
Interview Tip: Mention that you'd use Scikit-learn for traditional ML, and TensorFlow/PyTorch for deep learning. Pandas for data prep, NumPy for computations.
3. Algorithms & Math
Explain gradient descent. How does learning rate affect convergence?
Gradient Descent: Optimization algorithm that iteratively adjusts model parameters to minimize a cost function.
Algorithm:
Initialize parameters θ randomly
Compute cost function J(θ)
Calculate gradient ∇J(θ)
Update: θ = θ - α × ∇J(θ)
Repeat until convergence
Learning Rate (α) Effects:
Too High: May overshoot minimum, diverge, oscillate
Too Low: Converges very slowly, may get stuck in local minima
Optimal: Converges efficiently to minimum
Solutions:
Learning rate scheduling (decay over time)
Adaptive learning rates (Adam, RMSprop)
Grid search or random search for α
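The update rule above can be sketched for linear regression with an MSE cost (a minimal illustration, not a production optimizer):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimize J(theta) = (1/2n) * ||X theta - y||^2 by the rule
    theta = theta - alpha * grad J(theta)."""
    n, d = X.shape
    theta = np.zeros(d)                     # initialize parameters
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n    # gradient of the MSE cost
        theta -= alpha * grad               # update step
    return theta
```

Try rerunning with alpha=2.0 on the same data to see divergence (the overshooting case described above).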
What is the difference between batch, mini-batch, and stochastic gradient descent?
Batch Gradient Descent:
Uses entire dataset to compute gradient
θ = θ - α × ∇J(θ) computed on all samples
Pros: Stable convergence, accurate gradient
Cons: Slow for large datasets, high memory usage
Stochastic Gradient Descent (SGD):
Uses one random sample at a time
Updates after each sample
Pros: Fast, can escape local minima
Cons: Noisy updates, doesn't converge smoothly
Mini-Batch Gradient Descent:
Uses small batches (typically 32, 64, 128, 256)
Best of both worlds
Pros: Balance between speed and stability, vectorization benefits
Cons: Need to tune batch size
Most Common: Mini-batch is the default in modern ML frameworks.
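All three variants fit in one loop by varying the batch size. A sketch (names and hyperparameters are illustrative): batch_size=len(X) gives batch GD, batch_size=1 gives SGD, anything in between is mini-batch:

```python
import numpy as np

def sgd_linear(X, y, batch_size, alpha=0.1, epochs=300, seed=0):
    """Mini-batch gradient descent on the MSE objective for linear
    regression. One epoch = one shuffled pass over the data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]    # current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad
    return theta
```

Smaller batches mean noisier gradients but more updates per pass, which is the speed/stability trade-off described above.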
Explain the difference between correlation and covariance.
Covariance:
Measures how two variables change together
Formula: Cov(X,Y) = E[(X - μₓ)(Y - μᵧ)]
Positive: variables move in same direction
Negative: variables move in opposite directions
Zero: no linear relationship
Issue: Scale-dependent, hard to interpret magnitude
Correlation (Pearson's r):
Normalized version of covariance
Formula: r = Cov(X,Y) / (σₓ × σᵧ)
Range: -1 to +1
-1: perfect negative correlation
0: no linear correlation
+1: perfect positive correlation
Advantage: Scale-independent, easy to interpret
Important: Both only measure LINEAR relationships. Can miss non-linear dependencies.
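NumPy computes both directly; a quick sketch showing the scale-dependence difference:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 10 * x + 3                  # perfect linear relationship, larger scale

cov = np.cov(x, y)[0, 1]        # scale-dependent: grows with the units of y
r = np.corrcoef(x, y)[0, 1]     # scale-free: exactly +1 here
```

Rescaling y (say, to 1000 * x + 3) multiplies the covariance but leaves r at +1, which is why correlation is the easier quantity to interpret.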
What is the role of eigenvectors and eigenvalues in PCA?
Principal Component Analysis (PCA): Dimensionality reduction technique that finds orthogonal axes that capture maximum variance.
Eigenvectors:
Define the directions of principal components
Each eigenvector represents a direction in feature space
Orthogonal to each other (uncorrelated components)
The first eigenvector points in direction of maximum variance
Eigenvalues:
Represent amount of variance captured by each eigenvector
Larger eigenvalue = more important component
Used to determine how many components to keep
Sum of all eigenvalues = total variance in data
PCA Process:
Standardize the data
Compute covariance matrix
Calculate eigenvectors and eigenvalues of covariance matrix
Sort eigenvalues in descending order
Select top k eigenvectors (principal components)
Transform data to new k-dimensional space
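The steps above can be sketched with NumPy's eigendecomposition (a minimal illustration; real code would use sklearn's PCA):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                # 1. center the data
    C = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    vals, vecs = np.linalg.eigh(C)         # 3. eigenvalues/eigenvectors
    order = np.argsort(vals)[::-1]         # 4. sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    ratio = vals / vals.sum()              # variance explained per component
    return Xc @ vecs[:, :k], ratio         # 5-6. project onto top k
```

For data lying on a line, the first eigenvalue captures essentially all the variance, matching the interpretation above.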
Describe how a decision tree splits data. What is information gain?
Decision Tree Splitting:
Decision trees recursively partition data by choosing the best feature and threshold at each node to maximize homogeneity in child nodes.
Splitting Criteria:
1. Information Gain (based on Entropy):
Measures reduction in entropy (uncertainty) after split
Entropy: H(S) = -Σ pᵢ log₂(pᵢ)
Information Gain = H(parent) - Weighted_Average(H(children))
Choose split with highest information gain
Used in ID3 and C4.5 algorithms
2. Gini Impurity:
Measures probability of incorrect classification
Gini = 1 - Σ pᵢ²
Choose split with lowest Gini impurity
Used in CART algorithm (Scikit-learn default)
Faster to compute than entropy
3. Variance Reduction (for regression):
Minimize variance in child nodes
Process:
For each feature, try all possible split points
Calculate information gain or Gini for each split
Choose the split with best score
Recursively repeat for child nodes
Stop when: max depth reached, min samples met, or no improvement
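The entropy and information gain formulas above in code (a small sketch; function names are illustrative):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """H(parent) - weighted average of the children's entropies."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted
```

A perfectly separating split of a balanced binary node yields the maximum gain of 1 bit; a split that leaves both children balanced yields a gain of 0.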
What is a confusion matrix?
A confusion matrix is a table showing the performance of a classification model by comparing predicted vs actual labels.
For Binary Classification:
                     Predicted
                  Positive   Negative
Actual  Positive     TP         FN
        Negative     FP         TN
True Positive (TP): Correctly predicted positive
True Negative (TN): Correctly predicted negative
False Positive (FP): Incorrectly predicted positive (Type I error)
False Negative (FN): Incorrectly predicted negative (Type II error)
Derived Metrics:
Accuracy: (TP + TN) / Total
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
Specificity: TN / (TN + FP)
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
4. Machine Learning Algorithms
Compare Linear Regression vs Logistic Regression.
Linear Regression:
Task: Regression (predict continuous values)
Output: Real number (-∞ to +∞)
Function: y = β₀ + β₁x₁ + β₂x₂ + ...
Loss: Mean Squared Error
Example: Predicting house prices, temperature
Logistic Regression:
Task: Classification (predict categories)
Output: Probability (0 to 1)
Function: P(y=1) = 1 / (1 + e^(-z)) where z = β₀ + β₁x₁ + ...
Loss: Log Loss (Cross-Entropy)
Example: Spam detection, disease diagnosis
Key Differences:
Linear: No activation function | Logistic: Sigmoid activation
Linear: Direct prediction | Logistic: Probability then threshold
Linear: Assumes linear relationship | Logistic: Models log-odds
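The sigmoid is what turns the linear score z into a probability. A tiny sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# A linear model outputs any real number; logistic regression squashes
# that score into a probability, then thresholds (commonly at 0.5)
z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)   # monotonically increasing, with sigmoid(0) = 0.5
```

This is the "probability then threshold" step in the comparison above.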
When would you use K-Nearest Neighbors? What are its drawbacks?
When to Use KNN:
Small to medium-sized datasets
When decision boundary is very irregular
For both classification and regression
When interpretability is important (decisions based on similar examples)
As a baseline model
When you don't want to make assumptions about data distribution
How KNN Works:
Choose k (number of neighbors)
Calculate distance to all training points
Find k nearest neighbors
Classification: Vote by majority class
Regression: Average of neighbor values
Drawbacks:
Computational Cost: Must compute distance to all points at prediction time (slow for large datasets)
Storage: Must store entire training dataset
Curse of Dimensionality: Performance degrades in high dimensions
Sensitive to Feature Scales: Must normalize features
Sensitive to Irrelevant Features: All features equally weighted
Imbalanced Data: Biased toward majority class
Choosing k: Difficult to select optimal k
Best Practices: Always normalize features, use cross-validation to select k, consider using KD-trees or Ball-trees for faster search.
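Those best practices combine naturally in a sklearn pipeline (sketched here on the iris dataset for concreteness): scaling happens inside each CV fold, and cross-validation picks k.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler inside the pipeline: refit on each fold's training portion
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Cross-validation to select k
grid = GridSearchCV(pipe, {'knn__n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
best_k = grid.best_params_['knn__n_neighbors']
```

KNeighborsClassifier also accepts algorithm='kd_tree' or 'ball_tree' for the faster-search option mentioned above.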
Explain SVM (Support Vector Machine) and kernel trick.
Support Vector Machine:
SVM finds the optimal hyperplane that maximizes the margin between different classes.
Key Concepts:
Hyperplane: Decision boundary separating classes
Support Vectors: Data points closest to hyperplane that define the margin
Margin: Distance between hyperplane and nearest points from each class
Goal: Maximize margin (robust to noise and generalization)
Kernel Trick:
Allows SVM to handle non-linear decision boundaries by mapping data to higher-dimensional space without explicitly computing the transformation.
Linear Kernel: K(x,y) = x^T y (for linearly separable data)
RBF (Gaussian) Kernel: K(x,y) = exp(-γ‖x - y‖²) (most common non-linear kernel)
Polynomial Kernel: K(x,y) = (x^T y + c)^d
How do you handle imbalanced datasets?
1. Resampling:
Oversample minority class (e.g., SMOTE) or undersample majority class
2. Algorithm-Level Approaches:
Class weights: Penalize misclassification of minority class more
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
Ensemble methods: Balanced Random Forest, EasyEnsemble
3. Evaluation Metrics:
Don't use accuracy! (99% accuracy by always predicting majority)
Use: Precision, Recall, F1-Score, ROC-AUC, PR-AUC
Focus on minority class performance
4. Threshold Adjustment:
Adjust decision threshold based on cost/benefit
Lower threshold to catch more positive cases
5. Collect More Data:
Especially for minority class
6. Anomaly Detection:
For extreme imbalance (fraud, rare disease)
Treat minority as outliers
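The threshold-adjustment idea from point 4 in code (a trivial sketch with made-up probabilities):

```python
import numpy as np

def predict_with_threshold(probs, threshold=0.5):
    """Lowering the threshold flags more positives (higher recall,
    lower precision) -- useful when false negatives are costly."""
    return (np.asarray(probs) >= threshold).astype(int)

probs = [0.2, 0.4, 0.6, 0.9]
default = predict_with_threshold(probs)        # threshold 0.5
lowered = predict_with_threshold(probs, 0.3)   # catches one more positive
```

In sklearn the probabilities would come from predict_proba, and the threshold would be chosen from a precision-recall curve against the application's costs.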
How would you prevent data leakage?
Data Leakage: When information from outside the training dataset influences the model, leading to overly optimistic performance that doesn't generalize.
Types of Data Leakage:
1. Target Leakage:
Using features that won't be available at prediction time
Example: Using "purchase amount" to predict "will purchase" (amount only exists after purchase)
2. Train-Test Contamination:
Test data influencing training process
Example: Fitting scaler on entire dataset before split
How to Prevent:
1. Proper Data Splitting:
Split data FIRST, before any preprocessing
Never touch test data during training
For time series: respect temporal order
2. Fit on Training Only:
# WRONG
scaler.fit(X) # Fits on all data
X_train, X_test = train_test_split(X)
# CORRECT
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # Fit only on training
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
3. Use Pipelines:
sklearn Pipeline ensures proper order
Prevents accidentally leaking information
4. Time-Based Validation:
For time series, use TimeSeriesSplit
Train on past, predict future
Never shuffle time-series data
5. Feature Engineering Awareness:
Only use information available at prediction time
Avoid features derived from target
Be careful with aggregated features
6. Cross-Validation Done Right:
Preprocessing inside CV folds, not before
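Points 3 and 6 combined in a short sketch (synthetic data for illustration): the scaler lives inside the pipeline, so each CV fold fits it on that fold's training portion only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# No train-test contamination: preprocessing is refit per fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X, y, cv=5)
```

Calling scaler.fit(X) before cross_val_score, by contrast, would leak test-fold statistics into every fold's training.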
Red Flags: Performance too good to be true, huge gap between CV and test, one feature with abnormally high importance.