Top 50 Machine Learning Interview Questions in India 2026

Top 50 Machine Learning Interview Questions 2026 — With Detailed Answers
By Cambridge Infotech | Updated June 2026 | 20 min read
Verified by industry hiring professionals
Quick Answer
The most-asked ML interview questions in India cover: bias-variance tradeoff, overfitting/underfitting, cross-validation, precision vs recall, gradient descent, regularisation, feature engineering, handling imbalanced data, and model evaluation metrics. Fresher interviews focus on theory + Python (Pandas, Scikit-learn). Senior interviews add system design, MLOps, and business case studies.
Whether you are preparing for your first data science job at TCS or Infosys, or targeting a senior ML role at Flipkart or Google India, this guide covers the machine learning interview questions that actually get asked in Indian companies in 2026 — not generic US-centric lists.
Each question includes the answer hiring managers expect, the level it applies to, and common mistakes candidates make. Use this guide alongside the Machine Learning course at Cambridge Infotech to build both theory and practical skills.
Typical ML Interview Structure in Indian Companies (2026)
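| Round | Focus | Duration |
|---|---|---|
| Online screening | MCQs: Python, ML, SQL | 20–40 mins |
| Technical round 1 | ML Theory & Stats | 45–60 mins |
| Technical round 2 | Python + problem solving | 60–90 mins |
| HR/managerial | Salary, culture, career | 30–45 mins |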
Q1–10: Machine Learning Fundamentals
Q1. What is the bias-variance tradeoff?
Expected answer:
Bias is the error from wrong assumptions in the learning algorithm — a high-bias model is too simple and underfits the data (e.g., linear regression on non-linear data). Variance is the sensitivity to small fluctuations in the training set — a high-variance model overfits and performs poorly on new data.
The tradeoff: reducing bias typically increases variance and vice versa. The goal is to find the sweet spot with low bias and low variance — achieved through techniques like cross-validation, regularisation, ensemble methods, and appropriate model complexity.
Q2. What is overfitting? How do you prevent it?
Overfitting occurs when a model learns the training data too well — including noise — and fails to generalise to new, unseen data. Signs: training accuracy is very high but test/validation accuracy is significantly lower.
Prevention methods:
- Regularisation: L1 (Lasso) can shrink some coefficients to exactly zero; L2 (Ridge) shrinks all coefficients toward zero
- Cross-validation — k-fold CV gives a more reliable performance estimate
- Early stopping — stop training when validation loss starts increasing
- Dropout — randomly deactivates neurons during neural network training
- More training data — the most reliable fix when available
- Reduce model complexity — fewer layers, shallower trees, fewer features
Q3. What is the difference between supervised, unsupervised, and reinforcement learning?
Fresher
| Type | Training data | Goal | Examples |
|---|---|---|---|
| Supervised | Labelled (input + output) | Predict output for new inputs | Linear regression, SVM, decision trees |
| Unsupervised | Unlabelled (input only) | Find hidden patterns/structure | K-means, PCA, DBSCAN |
| Reinforcement | Reward/penalty signals | Maximise cumulative reward | AlphaGo, game AI, robotics |
Q4. Explain cross-validation and why it is used.
Cross-validation is a resampling technique used to evaluate how well an ML model generalises to independent data. The most common method is k-fold cross-validation: the dataset is split into k equal subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process repeats k times, each time using a different fold as the test set. The final performance metric is the average across all k iterations.
Why it matters: A single train-test split can give misleadingly good or bad results depending on which samples end up in the test set. Averaging over k folds gives a more stable (lower-variance) estimate of model performance, which is especially important with small datasets.
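The k-fold procedure above can be sketched in plain Python. The helper name `k_fold_indices` and the toy "model" (predict the mean of the training targets) are assumptions made for illustration; in real work Scikit-learn's `KFold` and `cross_val_score` do this for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal, contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


# Toy targets; the "model" just predicts the mean of the training targets.
y = [3.0, 5.0, 4.0, 6.0, 7.0, 5.0]

fold_errors = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    train_mean = sum(y[i] for i in train_idx) / len(train_idx)
    # Mean absolute error on the held-out fold
    mae = sum(abs(y[i] - train_mean) for i in test_idx) / len(test_idx)
    fold_errors.append(mae)

cv_score = sum(fold_errors) / len(fold_errors)  # average across the k folds
print(fold_errors, cv_score)
```

Each sample lands in the test fold exactly once, which is what makes the averaged score less dependent on one lucky or unlucky split.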
Q5. What is gradient descent? Explain its variants.
Gradient descent is an optimisation algorithm used to minimise the loss function of an ML model by iteratively moving in the direction of steepest descent (negative gradient). At each step: θ = θ − α × ∇J(θ) where θ are model parameters, α is the learning rate, and ∇J(θ) is the gradient of the cost function.
| Variant | Batch size | Speed | Use when |
|---|---|---|---|
| Batch GD | All data | Slow | Small datasets |
| Stochastic GD (SGD) | 1 sample | Noisy but fast | Online learning |
| Mini-batch GD | 32–512 samples | Balanced | Most DL training (standard) |
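A minimal sketch of the update rule θ = θ − α × ∇J(θ), assuming a one-parameter cost J(θ) = (θ − 3)² chosen purely for illustration. The function name `gradient_descent` is hypothetical. The batch, stochastic, and mini-batch variants in the table differ only in how many samples are used to compute the gradient at each step.

```python
def gradient_descent(grad, theta0, alpha=0.1, n_steps=100):
    """Repeat the update theta = theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_J(theta):
    """Gradient of the illustrative cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta_hat = gradient_descent(grad_J, theta0=0.0, alpha=0.1, n_steps=100)
print(theta_hat)  # approaches the minimiser theta = 3
```

With α = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a learning rate above 1.0 would make this particular cost diverge, which is a useful point to raise when asked about choosing α.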
Q6–10: Quick-reference fundamentals
Q6. What is regularisation? Difference between L1 and L2?
L1 (Lasso): Adds absolute value of coefficients to loss. Drives some weights to exactly zero → automatic feature selection. Good for sparse models. L2 (Ridge): Adds squared value of coefficients. Shrinks weights toward zero but rarely to exactly zero → keeps all features but reduces their impact. Use L2 when all features matter; L1 when you suspect many are irrelevant.
Q7. What is the difference between classification and regression?
Classification: Predicts discrete category labels (spam/not spam, cat/dog). Metrics: accuracy, precision, recall, F1, AUC-ROC. Regression: Predicts continuous numerical values (house price, temperature). Metrics: MAE, MSE, RMSE, R². Key distinction: output type determines which algorithms and metrics apply.
Q8. What is a confusion matrix? How do you read it?
A confusion matrix shows actual vs predicted classes. For binary classification: TP (correctly predicted positive), TN (correctly predicted negative), FP (incorrectly predicted positive — Type I error), FN (incorrectly predicted negative — Type II error). From it you derive: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2×(P×R)/(P+R).
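The formulas above can be checked by hand on a small example. This sketch assumes binary labels with 1 as the positive class; `classification_metrics` is an illustrative helper, not a library function.

```python
def classification_metrics(y_true, y_pred):
    """Count TP/TN/FP/FN for binary labels (1 = positive) and derive metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 3, FN = 1, FP = 1, and TN = 5, so precision, recall, and F1 all come out at 0.75 even though accuracy is 0.8.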
Q9. When would you use precision vs recall as your primary metric?
Precision when false positives are costly — e.g., spam detection (marking a genuine email as spam is bad). Recall when false negatives are costly — e.g., cancer screening (missing a cancer case is far worse than a false alarm). In most real-world cases, use F1-score (harmonic mean) or AUC-ROC for a balanced view.
Q10. What is the curse of dimensionality?
As the number of features increases, the volume of the feature space grows exponentially, so the data become increasingly sparse and distance-based algorithms (KNN, SVM with RBF kernel) perform poorly. Fix: feature selection (remove irrelevant features), dimensionality reduction (PCA; t-SNE is mainly suited to visualisation), or regularisation. In Indian interviews, this is often followed by “how would you apply PCA to this problem?”
Q11–20: ML Algorithms
Q11. How does a decision tree work? What are its pros and cons?
Fresher
A decision tree splits data recursively based on feature values to maximise information gain (or minimise Gini impurity). At each node it asks a yes/no question about a feature; at leaf nodes it assigns a class label or value.
✓ Pros
- Easy to interpret and visualise
- No feature scaling needed
- Handles both numerical and categorical data
✗ Cons
- Prone to overfitting (deep trees)
- Unstable — small data changes = different tree
- Biased toward features with more levels
Q12. What is Random Forest? Why is it better than a single decision tree?
Random Forest is an ensemble method that builds multiple decision trees on different random subsets of the data (bagging) and random subsets of features at each split. Predictions are made by majority vote (classification) or averaging (regression).
Why it is better: Individual trees overfit; averaging their predictions reduces variance without increasing bias. The randomness in feature selection also de-correlates the trees, making the ensemble more robust. In practice, Random Forest consistently outperforms single decision trees on most tabular datasets.
Q13–20: Algorithm quick answers
Q13. What is gradient boosting? How does XGBoost differ?
Gradient boosting builds trees sequentially, with each new tree fitted to correct the errors of the ensemble so far. Unlike Random Forest, whose trees are independent and can be built in parallel, boosting is inherently sequential and slower to train, but often more accurate. XGBoost adds second-order derivatives for better optimisation, built-in regularisation (L1 + L2), column and row subsampling, and parallelised tree construction, which typically makes it considerably faster than a plain gradient boosting implementation.
Q14. Explain SVM (Support Vector Machine) in simple terms.
SVM finds the hyperplane that best separates classes by maximising the margin — the distance between the hyperplane and the nearest data points from each class (support vectors). For non-linearly separable data, the kernel trick (RBF, polynomial) projects data into higher dimensions where separation is possible. SVM works well on high-dimensional data and small datasets but is slow on very large datasets.
Q15. What is logistic regression? Why is it called “regression” if it is a classifier?
Logistic regression predicts the probability that an input belongs to a class using the sigmoid function: σ(z) = 1/(1+e⁻ᶻ). It is called “regression” because it models a linear relationship between features and the log-odds (logit) of the outcome — the regression happens in log-odds space. A threshold (typically 0.5) converts probability output to a binary class prediction.
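To make the "regression in log-odds space" point concrete, here is a small sketch. The weights and bias are made-up numbers, not fitted values, and `predict_proba` is an illustrative helper.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """The linear score z is the log-odds; the sigmoid turns it into a probability."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


weights, bias = [0.8, -0.5], 0.1       # hypothetical coefficients
x = [2.0, 1.0]                         # z = 0.1 + 1.6 - 0.5 = 1.2

p = predict_proba(x, weights, bias)
log_odds = math.log(p / (1.0 - p))     # recovers the linear score z
label = int(p >= 0.5)                  # default 0.5 threshold
print(p, log_odds, label)
```

Taking the logit of the predicted probability gives back the linear score, which is exactly the sense in which the model is linear.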
Q16. What is K-Nearest Neighbours (KNN)? What are its limitations?
KNN classifies a new data point based on the majority class among its k nearest neighbours (by Euclidean or other distance). Limitations: computationally expensive at prediction time (computes distances to all training points), sensitive to irrelevant features and feature scale (requires normalisation), and suffers severely from the curse of dimensionality with high-dimensional data.
Q17. Explain k-means clustering. How do you choose k?
K-means assigns n data points to k clusters by minimising within-cluster variance. Steps: (1) initialise k centroids randomly, (2) assign each point to the nearest centroid, (3) recompute centroids, (4) repeat until centroids stabilise. Choosing k: use the Elbow Method — plot inertia (within-cluster sum of squares) vs k and look for the “elbow” where improvement slows. Alternatively, use the Silhouette Score to measure cluster cohesion.
Q18. What is PCA (Principal Component Analysis)? When do you use it?
PCA reduces dimensionality by projecting data onto the directions (principal components) of maximum variance, ordered by explained variance. It decorrelates features and retains the most important structure. Use when: features are highly correlated, you need to visualise high-dimensional data (reduce to 2D/3D), or you want to speed up training by reducing feature count. Important: PCA makes the model less interpretable — don’t use it if explainability matters.
Q19. What is a neural network? Explain forward and backward propagation.
A neural network is a stack of layers of connected units (neurons), each computing a weighted sum of its inputs followed by a non-linear activation. Forward propagation: Input data passes through weighted connections and activation functions layer by layer until the output layer produces a prediction. Loss is calculated by comparing the prediction to the actual label. Backward propagation: The gradient of the loss is computed with respect to each weight using the chain rule, flowing backward through the network. Weights are updated using gradient descent to reduce the loss.
Q20. What is the vanishing gradient problem?
In deep networks, gradients become exponentially smaller as they propagate backward through layers with sigmoid or tanh activations — early layers learn very slowly or not at all. Solutions: Use ReLU activation (gradients do not saturate for positive values), use batch normalisation, use residual connections (ResNets), or initialise weights with Xavier/He initialisation.
Q21–28: Model Evaluation Metrics
Q21. What is AUC-ROC? How do you interpret it?
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate at all classification thresholds. AUC (Area Under the Curve) = probability that the model ranks a random positive example higher than a random negative example. AUC = 1.0 is perfect; AUC = 0.5 is random guessing. AUC is threshold-independent, but on heavily imbalanced datasets it can look optimistic, so check the precision-recall (PR) curve alongside it.
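The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This brute-force sketch (the helper name `auc_by_ranking` is an assumption) is quadratic in the number of samples, so treat it as an explanation aid rather than a production routine.

```python
def auc_by_ranking(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_by_ranking(y_true, scores)
print(auc)
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9; no classification threshold was needed to compute it.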
Q22. What is the difference between MAE, MSE, and RMSE?
MAE (Mean Absolute Error): Average of absolute differences. Robust to outliers. Easy to interpret (same units as target). MSE (Mean Squared Error): Average of squared differences. Penalises large errors heavily. Not in same units as target. RMSE (Root MSE): Square root of MSE — same units as target, still penalises large errors. Use RMSE when large errors are especially bad; MAE when all errors should be treated equally.
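A short sketch of why RMSE reacts to large errors while MAE does not. The two prediction sets are invented so that both have the same MAE; `regression_errors` is an illustrative helper.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


y_true = [10.0, 12.0, 14.0, 16.0]
evenly_off = [11.0, 11.0, 15.0, 15.0]   # every prediction off by 1
one_big_miss = [10.0, 12.0, 14.0, 20.0]  # three exact, one off by 4

print(regression_errors(y_true, evenly_off))
print(regression_errors(y_true, one_big_miss))
```

Both prediction sets have an MAE of 1, but the single large miss doubles the RMSE from 1 to 2.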
Q23. How do you handle imbalanced datasets? (Very common in Indian interviews)
Resampling techniques: Oversample the minority class (SMOTE, Synthetic Minority Oversampling Technique) or undersample the majority class. Algorithm-level: Use class weights in the algorithm (class_weight='balanced' in Scikit-learn). Metric choice: Use F1, AUC-ROC, or the PR curve instead of accuracy. Accuracy is misleading for imbalanced data: with a 95:5 class split, a model that always predicts the majority class scores 95% accuracy yet is useless. Real answer expected: Mention all three approaches and state when you would use each.
Q24. What is R² (R-squared) in regression? What are its limitations?
R² measures the proportion of variance in the target explained by the model (1 = perfect, 0 = no better than predicting the mean; it can be negative on held-out data when the model is worse than the mean). Limitations: On the training data, R² never decreases when you add more features, even irrelevant ones, so use Adjusted R² to compare models with different numbers of features. R² alone also does not tell you whether predictions are accurate in absolute terms: a model with R² = 0.9 could still be far off in real units.
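A small sketch of R² and Adjusted R² using the standard formulas R² = 1 − SS_res/SS_tot and Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The data and the function names are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R^2 for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_1_feature = adjusted_r_squared(r2, n_samples=5, n_features=1)
adj_3_features = adjusted_r_squared(r2, n_samples=5, n_features=3)
print(r2, adj_1_feature, adj_3_features)
```

The same R² of about 0.99 drops to roughly 0.987 with one feature and 0.96 with three, which is the penalty that makes Adjusted R² suitable for comparing models of different sizes.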
Q25. Explain the difference between Type I and Type II errors.
Type I error (False Positive): Model predicts positive when it is actually negative. Example: flagging a genuine email as spam. Cost = unnecessary action. Type II error (False Negative): Model predicts negative when it is actually positive. Example: missing a fraud transaction. Cost = missed detection. Which matters more depends on business context — the interviewer often asks you to give an example from a real domain like healthcare or finance.
Q26. What is Silhouette Score? When do you use it?
Silhouette Score measures how similar a data point is to its own cluster compared to other clusters — ranges from -1 (misclassified) to +1 (well-clustered). Used to evaluate the quality of unsupervised clustering (K-means, DBSCAN) when there are no ground truth labels. Score > 0.5 = reasonable clustering; > 0.7 = strong clustering.
Q27. What is data leakage? How do you detect and prevent it?
Data leakage occurs when information from outside the training set is used to create the model — giving unrealistically high performance during training/validation but poor performance in production. Common causes: using future data in time-series models, applying scaler/imputer fit on the full dataset before splitting, or including the target variable’s proxy in features. Prevention: always fit preprocessing only on training data, use pipelines in Scikit-learn, use time-aware train-test splits for temporal data.
Q28. What is the log loss (cross-entropy loss)? Why is it used in classification?
Log loss penalises confident wrong predictions heavily. Formula: -[y·log(p) + (1-y)·log(1-p)]. If the model is 90% confident about the wrong class, it suffers much more than if it were 60% confident. This makes log loss better at training probabilistic classifiers than accuracy (which only cares if the prediction is right or wrong, not how confident it was).
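The 90% versus 60% comparison above can be verified with the formula directly. In this sketch the true label is 1 and `p` is the predicted probability of class 1, so being "90% confident in the wrong class" means p = 0.1.

```python
import math


def log_loss_single(y, p):
    """Binary cross-entropy for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


confident_wrong = log_loss_single(1, 0.1)  # 90% confident in the wrong class
unsure_wrong = log_loss_single(1, 0.4)     # 60% confident in the wrong class
confident_right = log_loss_single(1, 0.9)  # 90% confident in the right class

print(confident_wrong, unsure_wrong, confident_right)
```

Accuracy would score the first two predictions identically (both wrong), while log loss charges the confident mistake about 2.5 times as much.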
Q29–35: Feature Engineering
Q29. How do you handle missing values in a dataset?
Deletion: Drop rows (if <5% missing) or columns (if >50% missing and not critical). Imputation: Mean/median for numerical (median for skewed data), mode for categorical. Advanced: KNN imputation (uses similar rows), iterative imputation (predicts missing values from other features), or keeping a “missing” indicator column (sometimes missingness itself is informative). Always split first, then fit the imputer on the training set only and apply it to the test set, to prevent leakage.
Q30. What is feature scaling? When is it necessary?
Normalisation (MinMaxScaler): Scales features to [0,1]. Use when you know the distribution is not Gaussian. Standardisation (StandardScaler): Scales to zero mean, unit variance. Use when distribution is approximately Gaussian. When necessary: Gradient descent-based models (linear regression, neural networks), distance-based models (KNN, SVM, KMeans). Not necessary: Tree-based models (decision trees, Random Forest, XGBoost) — they are scale-invariant.
Q31. How do you handle categorical variables in ML?
Label encoding: Assigns integer to each category. Only use for ordinal features (low/medium/high) or tree models — linear models will assume false ordering. One-hot encoding: Creates binary columns for each category. Use for nominal features with linear models. Can cause high dimensionality with many categories. Target encoding: Replace category with mean of target — powerful but risks overfitting, use with cross-validation. Frequency encoding: Replace with count/frequency of category occurrence.
Q32. What is feature importance? How do you compute it in Random Forest vs XGBoost?
Random Forest: Mean decrease in impurity (Gini), i.e. the average reduction in node impurity when a feature is used for splitting, across all trees. XGBoost: gain (contribution to reducing loss), cover (number of samples affected by the feature's splits), and weight, also called frequency (how often the feature appears in splits). In both Scikit-learn-style APIs, .feature_importances_ returns a single importance type; XGBoost's other types are available through the booster's get_score(importance_type=...). Also use SHAP values for more reliable, model-agnostic feature importance.
Q33. How do you handle outliers in your data?
Detection: IQR method (values below Q1-1.5×IQR or above Q3+1.5×IQR), Z-score (>3 std devs), visual (box plots, scatter plots). Handling: Remove (if data entry error), cap (winsorisation — replace with 95th/5th percentile), transform (log/sqrt transformation to reduce effect), or use robust algorithms (Random Forest, tree models are less sensitive to outliers than linear models).
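The IQR rule above, sketched with the standard library. Quartile conventions differ between tools; this example assumes the "inclusive" method of `statistics.quantiles`, and the sensor-style readings are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


readings = [10, 12, 11, 13, 12, 11, 14, 13, 12, 95]
outliers, (lower, upper) = iqr_outliers(readings)
print(outliers, lower, upper)
```

Whether the flagged value is then removed, capped, or kept depends on whether it is a data entry error or a genuine extreme, as described above.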
Q34. What is the difference between feature selection and feature extraction?
Feature selection: Selects a subset of the original features to keep. Preserves interpretability. Methods: filter (correlation, chi-square), wrapper (RFE, forward selection), embedded (L1 regularisation). Feature extraction: Creates new features from original ones. Reduces dimensionality but loses interpretability. Methods: PCA, t-SNE, autoencoders, LDA.
Q35. What is SMOTE and when do you use it?
SMOTE (Synthetic Minority Oversampling Technique) generates synthetic minority class samples by interpolating between existing minority samples and their k-nearest minority neighbours. Use it when: class imbalance is severe (>10:1 ratio), simple oversampling leads to overfitting, and undersampling would lose too much data. Always apply SMOTE only on the training set — never before train-test split.
Q36–42: Practical Python & Coding Questions
Q36. Write Python code to train a Random Forest classifier on a dataset and print its accuracy.
This is a common screening question. The expected code demonstrates a clean scikit-learn pipeline:
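A minimal sketch of such an answer, assuming the built-in Iris dataset as a stand-in for whatever data the interviewer provides. The hyperparameters shown are ordinary defaults, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative dataset; replace with the dataset you are given.
X, y = load_iris(return_X_y=True)

# Hold out 20% for testing; stratify keeps class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training split only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on data the model has not seen.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

Interviewers usually look for three things here: a held-out test set, a fixed `random_state` for reproducibility, and accuracy computed on the test split rather than the training split.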
Q37. How do you detect and remove duplicate rows and missing values using Pandas?
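One way to answer, sketched on a small made-up DataFrame (the column names and values are hypothetical). It counts duplicates and missing values first, then shows both imputation and row deletion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera", "John"],
    "age": [25.0, 30.0, 30.0, np.nan, 41.0],
    "city": ["Pune", "Delhi", "Delhi", "Chennai", None],
})

# Detect: count fully duplicated rows and missing values per column.
n_duplicates = int(df.duplicated().sum())
missing_per_column = df.isnull().sum()

# Remove duplicates, keeping the first occurrence.
deduped = df.drop_duplicates()

# Option 1: impute (median for numeric, a placeholder for categorical).
filled = deduped.assign(
    age=deduped["age"].fillna(deduped["age"].median()),
    city=deduped["city"].fillna("Unknown"),
)

# Option 2: drop any row that still has a missing value.
dropped = deduped.dropna()

print(n_duplicates, missing_per_column.to_dict(), len(filled), len(dropped))
```

Mention that `drop_duplicates(subset=[...])` restricts the check to key columns, and that in a modelling workflow the imputation statistics should come from the training split only.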
Q38–42: Short practical answers
Q38. How do you check if features are correlated using Python?
Use df.corr(numeric_only=True) to get the correlation matrix and sns.heatmap(df.corr(numeric_only=True), annot=True) to visualise it (sns is seaborn). Feature pairs with |correlation| > 0.85 are highly correlated; consider dropping one of each pair to reduce multicollinearity in linear models.
Q39. How do you apply feature scaling in a Scikit-learn pipeline?
Use Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())]). This ensures the scaler is fit only on training data during cross-validation — preventing data leakage. Always use pipelines in production code.
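A runnable version of the pipeline above, assuming the built-in breast cancer dataset as an example. Passing the whole pipeline to `cross_val_score` means the scaler is re-fitted on the training folds of every split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset with features on very different scales.
X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),                  # fitted on training folds only
    ("model", LogisticRegression(max_iter=1000)),
])

# The pipeline is cloned and fitted from scratch inside each of the 5 folds.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Scaling the full dataset before calling `cross_val_score` would leak test-fold statistics into training, which is the leakage pattern the pipeline avoids.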
Q40. What is hyperparameter tuning? How do you do it in Python?
GridSearchCV: Exhaustive search over specified parameter values. Reliable but slow. RandomizedSearchCV: Samples parameter combinations randomly. Faster for large search spaces. Optuna / Bayesian Optimisation: Uses past results to focus search on promising regions. Best for expensive-to-train models.
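A compact sketch of the first two approaches with Scikit-learn, using the Iris dataset and a deliberately tiny search space chosen for illustration. Optuna is left out to keep the example dependency-free beyond Scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Illustrative search space: 2 x 3 = 6 combinations.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4, None]}

# Exhaustive search: evaluates all 6 combinations with 3-fold CV.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
grid.fit(X, y)

# Randomised search: evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

The randomised search does half the work here; the gap widens quickly as the grid grows, which is why it is preferred for large search spaces.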
Q41. SQL: Write a query to find the top 5 data scientists by salary in each city.
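One possible answer uses a window function. The sketch below runs the query against an in-memory SQLite database from Python so it can be tested end to end; the `employees` table, its columns, and the sample rows are all hypothetical, and SQLite 3.25 or newer is assumed for window-function support.

```python
import sqlite3

QUERY = """
SELECT name, city, salary
FROM (
    SELECT name, city, salary,
           DENSE_RANK() OVER (PARTITION BY city ORDER BY salary DESC) AS salary_rank
    FROM employees
    WHERE role = 'Data Scientist'
) AS ranked
WHERE salary_rank <= 5
ORDER BY city, salary DESC;
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, city TEXT, role TEXT, salary INTEGER)")

# Made-up rows: 7 data scientists in Bangalore, 3 in Pune, 1 analyst (filtered out).
rows = [(f"blr_ds_{i}", "Bangalore", "Data Scientist", 10 + i) for i in range(7)]
rows += [(f"pune_ds_{i}", "Pune", "Data Scientist", 8 + i) for i in range(3)]
rows.append(("blr_analyst", "Bangalore", "Data Analyst", 99))
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)

result = conn.execute(QUERY).fetchall()
conn.close()

for name, city, salary in result:
    print(city, name, salary)
```

`DENSE_RANK()` keeps everyone tied on salary, so a city can return more than five rows; use `ROW_NUMBER()` instead if exactly five rows per city are required. Stating that choice aloud is usually worth as much as the query itself.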
Q42. How do you save and load a trained ML model in Python?
Use joblib.dump(model, 'model.pkl') to save and joblib.load('model.pkl') to load. Prefer joblib over pickle for large NumPy arrays (more efficient). For deployment, use MLflow model registry or save as ONNX format for cross-platform serving.
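A short round-trip sketch of the joblib calls above. It writes to a temporary directory so nothing is left on disk; the model and dataset are placeholders.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)        # save the fitted model
    restored = joblib.load(path)    # load it back into memory

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Only load model files from sources you trust, and record the Scikit-learn version used for training, since pickled models are not guaranteed to load across versions.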
Q43–50: Advanced & Case Study Questions
Q43. A model has 99% accuracy on training but 60% on test. What is happening and what do you do?
Diagnosis: Severe overfitting. The model has memorised training data. Steps: (1) Check if data leakage is causing artificially high training accuracy, (2) Reduce model complexity (fewer layers, shallower trees, lower max_depth), (3) Apply regularisation (L1/L2 for linear models, max_depth/min_samples_split for trees, dropout for neural nets), (4) Get more training data if possible, (5) Use cross-validation to get a more honest performance estimate, (6) Apply feature selection to reduce noise features.
Q44. A bank asks you to build a fraud detection model. What is your approach? (Case study)
Expected answer structure:
- (1) Understand the problem: What is the fraud rate? What are the consequences of FP vs FN? Are predictions real-time or batch?
- (2) Data exploration: transaction features, customer history, temporal patterns, class imbalance (fraud is typically ~0.1–2% of transactions)
- (3) Handle imbalance: SMOTE + class_weight; use the PR curve, not just AUC-ROC
- (4) Feature engineering: velocity features (transactions in the last hour/day), location anomalies, deviation from spending patterns
- (5) Model: XGBoost + SHAP for interpretability, which regulators such as RBI increasingly expect
- (6) Threshold tuning: tune the decision threshold based on the business cost of FP vs FN
- (7) Monitoring: concept drift detection as fraud patterns evolve
Q45. What is SHAP? Why is it important for explainability in India’s regulated industries?
SHAP (SHapley Additive exPlanations) computes the contribution of each feature to a single prediction by averaging over all possible feature orderings — grounded in game theory. Unlike feature importance (global average), SHAP gives per-prediction explanations. In India’s BFSI sector, RBI and SEBI increasingly require models to explain individual loan decisions, credit scores, and trading flags — SHAP values provide the audit trail needed for regulatory compliance.
Q46. What is MLOps? Why does it matter in production?
MLOps is the practice of automating and monitoring the ML lifecycle in production: model versioning (MLflow), CI/CD pipelines for model deployment, data and model drift detection, A/B testing, and rollback mechanisms. Many ML projects never reach production, and MLOps aims to close that gap. Key tools: MLflow (experiment tracking), DVC (data versioning), Airflow (pipeline orchestration), Docker (containerisation), Evidently AI (drift monitoring).
Q47. What is concept drift? How do you detect and handle it?
Concept drift occurs when the statistical relationship between input features and the target variable changes over time — a fraud model trained in 2024 may underperform in 2026 because fraud patterns evolved. Detection: Monitor prediction accuracy/distribution over time, use statistical tests (PSI — Population Stability Index, KS test for distribution shift). Handling: Periodic retraining on recent data, online learning (updating model incrementally), or ensemble of models trained at different time windows.
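A minimal sketch of the PSI calculation mentioned above, PSI = Σ (actual% − expected%) × ln(actual% / expected%) over bins. The bin edges, the score samples, and the small `eps` used to avoid log(0) are assumptions for illustration; the 0.25 alert level in the comment is a widely used rule of thumb, not a fixed standard.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample."""
    def proportions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                in_bin = bin_edges[i] <= v < bin_edges[i + 1]
                if in_bin or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training_scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recent_scores = [0.5, 0.6, 0.6, 0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 1.0]

psi_same = psi(training_scores, training_scores, edges)
psi_shifted = psi(training_scores, recent_scores, edges)
print(psi_same, psi_shifted)  # values above ~0.25 are commonly treated as major shift
```

PSI on input features or model scores detects distribution shift without needing fresh labels, which makes it a practical first alarm before accuracy can be measured.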
Q48. What is a recommendation system? Explain collaborative filtering vs content-based.
Collaborative filtering: Recommends items based on what similar users liked (“users like you also liked X”). Works without item content knowledge. Problem: cold start (new users/items have no history). Content-based: Recommends items similar to what a user has liked before, based on item features. It can recommend brand-new items, but is limited to the known item space and needs at least a little user history. Hybrid: Netflix, Amazon, and Swiggy use both, relying on content-based signals for new users and collaborative filtering for established users.
Q49. How would you A/B test an ML model change?
Setup: Define the metric you are optimising (CTR, revenue, conversion rate). Randomly split traffic (50/50 control vs treatment). Ensure sample size is sufficient for statistical significance (use power analysis). Run: Serve model A to control group and model B to treatment group simultaneously. Analyse: Use hypothesis testing (t-test for continuous metrics, chi-square for proportions). Check p-value < 0.05 before concluding. Also check for novelty effects and segment-level impacts before full rollout.
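The analysis step can be sketched for a conversion-rate metric. This example uses a two-proportion z-test, which for a two-group comparison is equivalent to the chi-square test mentioned above (z² equals the chi-square statistic). The traffic and conversion numbers are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Control (model A): 500 conversions out of 10,000 users.
# Treatment (model B): 580 conversions out of 10,000 users.
z, p_value = two_proportion_z_test(500, 10_000, 580, 10_000)
print(round(z, 3), round(p_value, 4))
significant = p_value < 0.05
```

A significant result here only says the difference is unlikely to be noise; the novelty-effect and segment checks described above still apply before rollout.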
Q50. Tell me about an ML project you built. Walk me through your approach. (Most important question)
Structure your answer with STAR: Situation — what problem did it solve? Task — what was your specific contribution? Action — describe data collection, EDA, feature engineering, model selection, evaluation (mention specific metrics and what you chose). Result — quantify the outcome (accuracy improved from X to Y, or reduced false positives by Z%). Most common mistake: Describing what the model does theoretically instead of what YOU specifically did. Interviewers want to see your decision-making process, not just the outcome.
Interview Tips for ML Roles in India (2026)
✓ What Indian interviewers value
- Practical experience over theory — deploy one project before interviewing
- Clean Python code — practise on HackerRank India, StrataScratch
- Business context — connect every ML answer to a business metric
- SQL proficiency — tested in almost every Indian DS interview
- Be honest about limitations: no interviewer believes in a perfect project
✗ Common interview mistakes
- Memorising definitions without understanding — interviewers test understanding, not memory
- No GitHub profile or deployed projects to show
- Saying “I would use neural networks for everything”
- Not knowing which metric to use for which problem type
- Not practising coding on paper/whiteboard beforehand
Crack Your ML Interview with Real Project Experience
Cambridge Infotech’s Machine Learning course in Bangalore includes mock interviews, real project work, and placement drives. Students practise these exact questions with trainers who conduct industry hiring interviews.
FAQ
What are the most common ML interview questions in India?
Bias-variance tradeoff, overfitting prevention, cross-validation, precision vs recall, gradient descent, regularisation (L1/L2), feature engineering, handling imbalanced datasets, and model evaluation metrics. SQL window functions and Python coding (Pandas + Scikit-learn) are tested in almost every interview regardless of company.
How many rounds are in a data science interview at Indian companies?
Typically 3–4 rounds: online screening (Python/SQL MCQs) → technical round 1 (ML theory and statistics) → technical round 2 (coding + case study) → HR/managerial. Product companies like Flipkart and Razorpay often add a system design or data architecture round for senior roles.
Where can I practise ML interview questions online?
StrataScratch (real interview questions from Flipkart, Amazon, and other companies), LeetCode (SQL and Python), Kaggle Learn (practical notebooks), and GeeksforGeeks ML section (theory questions).





