# Job Interview Experiences

# Post : Data Scientist / Data Analyst

1. How would you check if the model is suffering from multicollinearity?

2. What is transfer learning? Steps you would take to perform transfer learning.

3. Why is CNN architecture suitable for image classification? Not an RNN?

4. What are the approaches for solving class imbalance problem?

5. When sampling, what types of bias can be introduced? How would you control them?

6. Explain concepts of epoch, batch, iteration in machine learning.

7. What type of performance metrics would you choose to evaluate the different classification models and why?

8. What are some of the types of activation functions and specifically when to use them?

9. What are the conditions that should be satisfied for a time series to be stationary?

10. What is the difference between Batch and Stochastic Gradient Descent?

11. What is the difference between K-NN and K-Means clustering?
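For question 6, the relationship between epoch, batch, and iteration comes down to simple arithmetic. A minimal sketch, assuming a hypothetical dataset of 1,000 samples, a batch size of 32, and 5 epochs (none of these numbers come from the interview):

```python
import math

def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches (parameter updates) needed to see every sample once."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_samples / batch_size)

n_samples = 1000   # hypothetical dataset size
batch_size = 32    # hypothetical number of samples per update
epochs = 5         # hypothetical number of full passes over the data

per_epoch = iterations_per_epoch(n_samples, batch_size)
total_iterations = per_epoch * epochs
print(per_epoch, total_iterations)  # → 32 160
```

One epoch is a full pass over the data, a batch is the slice used for one update, and an iteration is one such update.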

Date: **04/07/21**

1. What happens when neural nets are too small? What happens when they are large enough?

2. Why do we need a pooling layer in a CNN? What are the common pooling methods?

3. Are ensemble models better than individual models? Why or why not?

4. Use case: Suppose you are working for a pen manufacturing company. How would you help the sales team with leads using data analysis?

5. Assume you are given access to a website's Google Analytics data.

6. In order to increase conversions, how would you perform A/B testing to identify the best page design?

7. How is random forest different from gradient boosting, given that both are tree-based algorithms?

8. Describe the steps involved in creating a neural network.

9. In brief, how would you perform the task of sentiment analysis?
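For the A/B testing question, one common way to compare two page designs is a two-proportion z-test on their conversion rates. The sketch below is only one possible approach; the visitor and conversion counts are hypothetical, and the function name is made up for illustration:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical data: design A converts 200 of 2000 visitors, design B 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone. In practice the sample size and significance level should be fixed before the experiment starts.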

Date: **03/07/21**

1. Why do we select validation data other than test data?

2. Difference between linear and logistic regression?

3. Why do we take such a complex cost function for logistic regression?

4. Difference between random forest and decision tree?

5. How would you decide when to stop splitting the tree?

6. Measures of central tendency

7. What is the requirement of the k-means algorithm?

8. Which clustering technique works by combining clusters?

9. Which is the oldest probability distribution?

10. What values can a random variable take?

11. Types of random variables

12. Normality of residuals

Date: **04/07/21**

1. Telecom Customer Churn Prediction. Explain the project end to end?

2. Data Pre-Processing Steps used.

3. Sales forecasting: how is it done using statistical vs. DL models, and how do they compare in efficiency?

4. Logistic regression: what percentage of customers have churned, and what percentage have not?

5. What are the Evaluation Metric parameters for testing Logistic Regression?

6. What packages in Python can be used for ML? Why do we prefer one over another?

7. Numpy vs Pandas basic difference.

8. Feature on which this Imputation was done, and which method did we use there?

9. Tuple vs Dictionary. Where do we use them?

10. What is NER – Named Entity Recognition?

Date: **02/07/21**

1. Central limit theorem

2. Hypothesis testing

3. P value

4. T-test

5. Assumptions of linear regression.

6. Correlation and covariance.

7. How to identify & treat outliers and missing values.

8. Explain Box and whisker plot.

9. Explain any unsupervised learning algorithm.

10. Explain Random forest.

12. Business and technical questions related to your project.

13. Explain any scope of improvement in your project.

14. Questions based on case studies.

16. Write SQL query to find employee with highest salary in each department.

17. Write SQL query to find unique email domain name & their respective count

18. Solve question (17) using Python.
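Questions 16 to 18 can be sketched together. The example below runs the SQL through Python's built-in `sqlite3` module. The `employees` table, its columns, and the sample rows are all hypothetical. `SUBSTR` and `INSTR` are the SQLite spellings, and other databases may need different string functions.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER, email TEXT);
INSERT INTO employees VALUES
  ('Asha',  'Sales', 50000, 'asha@gmail.com'),
  ('Ben',   'Sales', 65000, 'ben@yahoo.com'),
  ('Chen',  'Tech',  90000, 'chen@gmail.com'),
  ('Divya', 'Tech',  80000, 'divya@outlook.com');
""")

# Q16: highest-paid employee in each department (ties would return several rows).
top_paid = conn.execute("""
    SELECT e.department, e.name, e.salary
    FROM employees AS e
    WHERE e.salary = (SELECT MAX(salary)
                      FROM employees
                      WHERE department = e.department)
    ORDER BY e.department
""").fetchall()

# Q17: unique email domains and their counts, in SQL.
domain_counts_sql = conn.execute("""
    SELECT SUBSTR(email, INSTR(email, '@') + 1) AS domain, COUNT(*) AS n
    FROM employees
    GROUP BY domain
    ORDER BY n DESC, domain
""").fetchall()

# Q18: the same count in Python, splitting each address at the first '@'.
emails = [row[0] for row in conn.execute("SELECT email FROM employees")]
domain_counts_py = Counter(email.split("@", 1)[1] for email in emails)

print(top_paid)
print(domain_counts_sql)
print(domain_counts_py)
```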

Rounds:

1. Technical Test (Python, SQL, Statistics) (Coding+MCQ) (90 min).

2. Telephonic interview (10 min).

3. Technical interview (45 min).

4. Fitment interview (25 min).

5. HR interview (30 min).

Date: **03/07/21**

1) Measures of central tendency

2) What is the requirement of the k-means algorithm?

3) Which clustering technique works by combining clusters?

4) Which is the oldest probability distribution?

5) What values can a random variable take?

6) Types of random variables

7) Normality of residuals

8) Probability questions

9) Sensitivity and specificity etc.

10) Explain bias – variance trade off. How does this affect the model?

11) What is multicollinearity? How do you identify and remove it?

Date: **30/06/21**

1. What is a Python package, and have you created your own Python package?

2. Explain the time series models you have used.

3. SQL questions: top 2 salaries per group of employees, using ROW_NUMBER and PARTITION BY.

4. Pandas: find the numeric and categorical columns. For the numeric columns in a DataFrame, find the mean of each column and add that mean value to every row of that column.

5. What is gradient descent? What is the learning rate, and why do we need to reduce or increase it? Why is the global minimum reached, and why doesn't the model improve when the LR is increased after that point?

6. Two Logistic Regression Models – Which one will you choose – One is trained on 70% and other on 80% data. Accuracy is almost same.

8. What is Log-Loss and ROC-AUC?

9. Do you know how to use Amazon SageMaker for MLOps?

10. Explain your Projects end to end (15-20mins).
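The SQL question in this list (top 2 salaries per group using a row number and a partition) could be answered along the following lines. This is a sketch run through Python's built-in `sqlite3` module. The table, column names, and rows are hypothetical, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  ('Asha', 'Sales', 50000), ('Ben',   'Sales', 65000), ('Cara', 'Sales', 58000),
  ('Chen', 'Tech',  90000), ('Divya', 'Tech',  80000), ('Eli',  'Tech',  85000);
""")

# Number the rows inside each department from highest to lowest salary,
# then keep only the first two rows of every department.
top_two = conn.execute("""
    SELECT department, name, salary
    FROM (
        SELECT department, name, salary,
               ROW_NUMBER() OVER (PARTITION BY department
                                  ORDER BY salary DESC) AS rn
        FROM employees
    ) AS ranked
    WHERE rn <= 2
    ORDER BY department, salary DESC
""").fetchall()

for row in top_two:
    print(row)
```

`ROW_NUMBER` breaks ties arbitrarily. If tied salaries should share a rank, `DENSE_RANK` is the usual alternative.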

Date: **01/07/21**

1. What makes you feel that you would be suitable for this role, since you come from a different background?

2. What is an imbalanced data set?

3. What are the factors you will consider in order to predict the population of a city in the future?

4. Basic statistics questions?

5. What are the approaches for treating the missing values?

6. Evaluation metrics for Classification?

7. Bagging vs Boosting with examples

8. Handling of imbalanced datasets

9. What are your career aspirations?

10. What's the graph of y = |x| - 2?

11. Estimate the number of petrol cars in Delhi.

12. Case study on opening a retail store.

13. Order of execution of SQL.

Date: **28/06/21**

1. What projects have you done?

2. Suppose there is a client who wants to know if giving discounts is beneficial or not. How would you approach this problem?

3. The same client wants to know how much discount he should give in the next month for maximum profit.

4. Can you propose a modeling approach to identify the mistakes the client made in giving discounts last year, that is, whether a different discount would have increased sales?

5. What feature engineering techniques did you use in past projects?

6. What models did you use, and how did you select the final model?

Date: **29/06/21**

1) What is the curse of dimensionality? How would you handle it?

2) How do you find multicollinearity in a data set?

3) Explain the different ways to treat multicollinearity.

4) How do you decide which feature to keep and which to eliminate after performing a multicollinearity test?

5) Explain logistic regression.

6) The sigmoid function already gives us a probability between 0 and 1, so why do we need log loss in logistic regression?

7) What is a p-value, and what is its significance in statistical testing?

8) How do you split time series data, and which evaluation metrics do you use for it?

9) How did you deploy your model in production? How often do you retrain it?

Date: **29/06/21**

11. How to handle missing data? What imputation techniques can be used?

Date: **27/06/21**

**ROUND 1 :**

**Introduction**

– Started with classification, particularly class imbalance and oversampling.

Which class should I oversample, etc.

Telecom churn case study questions, such as the evaluation metric for imbalanced data.

What threshold to choose for dividing the classes (0.5 for balanced data; otherwise based on sensitivity/specificity, etc.).

What if I don't use SMOTE for handling imbalance: how should I select the threshold then? (I messed this up and talked about ROC, AUC, etc.) Answer: the precision-recall curve.

– NLP questions

Sentiment analysis, preprocessing (TF-IDF, BoW), embeddings, stemming, lemmatization.

Libraries I know: nltk, spaCy.

– Regression preprocessing

I covered outliers, missing value imputation, distributions, dummy variables, multicollinearity, etc.

You have two highly correlated columns; which one will you drop? I answered: “Based on the business problem, I will decide accordingly.”

– Naive Bayes explanation and its drawback (I couldn't answer the drawback; the interviewer said it assumes all features are independent).

– Hand gesture recognition techniques (end to end).

– Resource timesheet forecasting. (What is it? What do you do on this? I explained with a story based on what I do at TCS.)

– Do you know any boosting algorithms? Yes.

Where have you used them? In Telecom Churn and Healthcare Analytics by AV.

– Gradient descent (how it works).

– KNN: how do we choose the value of K?

– Statistical computing:

Type 1 and Type 2 errors.

Alternate name for a Type 1 error (I couldn't answer; the interviewer said “false positive”).

What is a p-value? (I explained it with the example of linear regression from statsmodels.)

– Do you have exposure to time series analysis? No (he didn't ask anything further and seemed fine with that).

– Experience in DS/analytics at TCS, etc.

Initially they asked me to explain a project I had done. I explained the customer prediction case. Then I was asked Python questions while sharing my screen.

1. How do you handle correlated variables without removing them?

2. Explain the SMOTE and ADASYN techniques.

3. What is the stratified sampling technique?

4. Explain the working of random forest and XGBoost.

5. How do you optimise the recall of your output?

6. What are the chi-square and ANOVA tests?

7. In Python, they asked about loc and iloc, how to remove duplicates, and how to get the unique values in a column.

8. In SQL, they asked for a query to produce the matches between different teams.

Date:**25/05/21**

Introduction

Current NLP architecture used in my project

How will you identify Data Drift? Once identified how would you automate the handling of Data Drift

Data Pipeline used

Fasttext word embedding vs word2vec

When should we use TF-IDF, and when would prediction-based word embeddings be advantageous over TF-IDF?

Metrics used to validate our model

In MongoDB write a query to find employee names from a collection

In Python, write a program to separate the 0s and 1s in an array: (0,1,0,1,1,0,1,0)
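The last task can be sketched as follows, assuming that "separate" means moving all the 0s to the front and all the 1s to the back (the original question does not spell this out). The function name is hypothetical.

```python
def separate_zeros_ones(bits):
    """Return a new list with every 0 before every 1."""
    bits = list(bits)
    zeros = bits.count(0)
    # The array holds only 0s and 1s, so counting the zeros is enough.
    return [0] * zeros + [1] * (len(bits) - zeros)

arr = (0, 1, 0, 1, 1, 0, 1, 0)
print(separate_zeros_ones(arr))  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

Counting takes one pass and avoids a general sort. `sorted(arr)` gives the same result for binary input.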

Date:**17/05/21**

SQL Round

3 tables given as below:

TRIPS: trip_id, vehicle_id, start_time, stop_time

VEHICLE_MAKE: vehicle_id, make_id

MAKES: make_id, make_name

There is a table which contains vehicle trips. Trips are not necessarily in order.

There is a table which contains vehicle makes. Makes are not necessarily known.

PROBLEM: Write SQL code that provides the number of trips that started on September 1st, 2020 for each vehicle with a KNOWN make.

Order the results by the trip count.

Expected output:

| vehicle_id | trip_count |
|---|---|
| 4 | 2 |
| 1 | 1 |
| 2 | 1 |
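One possible answer to the trips problem is sketched below, run through Python's built-in `sqlite3` module. The sample rows are hypothetical and were chosen only to reproduce the expected output. The sketch assumes timestamps are stored as ISO-8601 text, each vehicle has at most one row in VEHICLE_MAKE, and a "known" make means the vehicle joins through to a row in MAKES.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (trip_id INTEGER, vehicle_id INTEGER,
                    start_time TEXT, stop_time TEXT);
CREATE TABLE vehicle_make (vehicle_id INTEGER, make_id INTEGER);
CREATE TABLE makes (make_id INTEGER, make_name TEXT);

INSERT INTO makes VALUES (10, 'MakeA'), (20, 'MakeB');
-- Vehicle 3 has no entry here, so its make is unknown.
INSERT INTO vehicle_make VALUES (1, 10), (2, 20), (4, 10);
INSERT INTO trips VALUES
  (101, 4, '2020-09-01 08:00:00', '2020-09-01 08:30:00'),
  (102, 1, '2020-09-01 09:00:00', '2020-09-01 09:20:00'),
  (103, 3, '2020-09-01 10:00:00', '2020-09-01 10:45:00'),
  (104, 4, '2020-09-01 23:50:00', '2020-09-02 00:10:00'),
  (105, 2, '2020-09-01 12:00:00', '2020-09-01 12:15:00'),
  (106, 1, '2020-08-31 23:55:00', '2020-09-01 00:05:00');
""")

# Inner joins drop vehicles without a known make; the half-open date range
# keeps only trips that STARTED on 2020-09-01.
rows = conn.execute("""
    SELECT t.vehicle_id, COUNT(*) AS trip_count
    FROM trips AS t
    JOIN vehicle_make AS vm ON vm.vehicle_id = t.vehicle_id
    JOIN makes AS m ON m.make_id = vm.make_id
    WHERE t.start_time >= '2020-09-01' AND t.start_time < '2020-09-02'
    GROUP BY t.vehicle_id
    ORDER BY trip_count DESC, t.vehicle_id
""").fetchall()

print(rows)  # → [(4, 2), (1, 1), (2, 1)]
```

The descending order and the tie-break on `vehicle_id` are inferred from the sample output, since the problem statement only says to order by trip count.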

Date: 20/05/21

Company: Cerence

Role: NLU Developer

Question1 :

Write a function that takes two strings as inputs and returns true if they are anagrams of each other and false otherwise.

e.g.

(hello, hlleo) –> true

(hello, helo) –> false
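A minimal sketch for Question 1, assuming the comparison is case-sensitive and counts every character, spaces included (the question does not say either way):

```python
from collections import Counter

def are_anagrams(a: str, b: str) -> bool:
    """Two strings are anagrams when they contain the same characters
    with the same multiplicities."""
    return Counter(a) == Counter(b)

print(are_anagrams("hello", "hlleo"))  # → True
print(are_anagrams("hello", "helo"))   # → False
```

Counting characters runs in linear time. Comparing `sorted(a) == sorted(b)` is an equally valid answer at O(n log n).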

Question 2 :

Write a function that takes an array of strings “A” and an integer “n”,

and returns the list of all strings of length “n” from the array “A” that can be constructed

as the concatenation of two strings from the same array “A”.

e.g.

A = [dog, tail, sky, or, hotdog, tailor, hot] and n=6

output should be “hotdog” and “tailor”
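One way to approach Question 2 is to try every split point of each candidate word and look both halves up in a set. The sketch assumes the two halves must be non-empty and that the same array element may be used for both halves, since the question does not rule either out.

```python
def concatenations_of_length(words, n):
    """Return the words of length n that equal prefix + suffix,
    where both parts are themselves in `words`."""
    word_set = set(words)  # O(1) membership checks
    result = []
    for word in words:
        if len(word) != n:
            continue
        # Try every split into two non-empty parts.
        if any(word[:i] in word_set and word[i:] in word_set
               for i in range(1, n)):
            result.append(word)
    return result

A = ["dog", "tail", "sky", "or", "hotdog", "tailor", "hot"]
print(concatenations_of_length(A, 6))  # → ['hotdog', 'tailor']
```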

Question 3 :

Given an array “arr” of numbers and a starting number “x”,

Find the smallest “x” such that the running sums of “x” and the elements of the array “arr” are never lower than 1.

e.g.

arr = [-2, 3, 1, -5].

The running sums will be x-2, x-2+3, x-2+3+1 and x-2+3+1-5.

So, the output should be 4.
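Question 3 reduces to finding the lowest prefix sum of the array: if that minimum is m, then x must satisfy x + m ≥ 1. The sketch below assumes the smallest valid x is wanted and that x itself must also be at least 1 (an assumption, which only matters when no prefix sum is negative).

```python
from itertools import accumulate

def min_start_value(arr):
    """Smallest x such that x plus every prefix sum of arr is at least 1."""
    lowest = min(accumulate(arr), default=0)  # lowest running sum without x
    # Clamp at 0 so that x itself never drops below 1.
    return 1 - min(lowest, 0)

print(min_start_value([-2, 3, 1, -5]))  # → 4
```

For the example, the prefix sums are -2, 1, 2, and -3, so the lowest is -3 and x = 1 - (-3) = 4.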

Date: **03/08/21**

1. If training on all the features in the dataset gives an accuracy of 100%, but the accuracy on the validation set is 75%, what should you look out for?

2. For a given dataset, you decide to use SVM as the main classifier and select RBF as your kernel. What would be the optimum gamma value that would allow you to capture the features of the dataset really well?

3. How is skewness different from kurtosis?

4. How to calculate the accuracy of a binary classification algorithm using its confusion matrix?

5. How will you measure the Euclidean distance between the two arrays in numpy?

6. Given two lists [1,2,3,4,5] and [6,7,8], you have to merge the list into a single dimension. How will you achieve this?

7. In a survey conducted, the average height was 164cm with a standard deviation of 15cm. If Alex had a z-score of 1.30, what will be his height?
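Questions 4 to 7 are short calculations, sketched here in plain Python. The confusion-matrix counts and the two sample points are hypothetical. With NumPy arrays `a` and `b`, the Euclidean distance is usually written `np.linalg.norm(a - b)`; the version below uses only the standard library.

```python
import math

# Q4: accuracy = correct predictions / all predictions.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Q5: Euclidean distance = square root of the summed squared differences.
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Q6: concatenating two lists gives a single flat list.
merged = [1, 2, 3, 4, 5] + [6, 7, 8]

# Q7: z = (x - mean) / std, so x = mean + z * std.
mean_height, std_height, z_score = 164, 15, 1.30
height = mean_height + z_score * std_height

print(accuracy(tp=40, tn=45, fp=5, fn=10))  # → 0.85
print(euclidean([0, 0], [3, 4]))            # → 5.0
print(merged)                               # → [1, 2, 3, 4, 5, 6, 7, 8]
print(round(height, 1))                     # → 183.5
```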


