How to ace the data science coding challenge


Image Source: Pexels.

The Take-Home Challenge Problem (Coding Exercise)


So, you’ve successfully gone through the initial screening phase of the interview process. It is now time for the most important step in the interview process, namely, the take-home coding challenge. This is generally a data science problem, e.g., a machine learning model, linear regression, a classification problem, time series analysis, etc.

Data science coding projects vary in scope and complexity. Sometimes, the project could be as simple as producing summary statistics, charts, and visualizations. It could also involve building a regression model, a classification model, or a forecast using a time-dependent dataset. The project could also be very complex and difficult. In this case, no clear guidance is provided as to the specific type of model to use, and you’ll have to come up with your own model that is best suited for addressing the project goals and objectives.

Generally, the interview team will provide you with project directions and a dataset. If you’re lucky, they may provide a small dataset that is clean and stored in a comma-separated value (CSV) file format. That way, you don’t have to worry about mining the data and transforming it into a form suitable for analysis. For the couple of interviews I had, I worked with 2 types of datasets: one had 160 observations (rows), while the other had 50,000 observations with several missing values. The take-home coding exercise clearly differs from company to company, as further described below.

In this article, I’ll share some useful tips from my personal experience that could help you excel in the coding challenge project. Before delving into the tips, let’s first examine some sample coding exercises.


Sample 1 Coding Exercise: Model for recommending cruise ship crew size



This coding exercise should be performed in Python (which is the programming language used by the team). You are free to use the internet and any other libraries. Please save your work in a Jupyter notebook and email it to us for review.

Data file: cruise_ship_info.csv (this file will be emailed to you)

Objective: Build a regressor that recommends the “crew” size for potential ship buyers. Please do the following steps (hint: use numpy, scipy, pandas, sklearn and matplotlib)

  1. Read the file and display columns.
  2. Calculate basic statistics of the data (count, mean, std, etc.), examine the data, and state your observations.
  3. Select columns that will probably be important to predict “crew” size.
  4. If you removed columns, explain why you removed those.
  5. Use one-hot encoding for categorical features.
  6. Create training and testing sets (use 60% of the data for training and the remainder for testing).
  7. Build a machine learning model to predict the “crew” size.
  8. Calculate the Pearson correlation coefficient for the training and testing datasets.
  9. Describe hyper-parameters in your model and how you would change them to improve the performance of the model.
  10. What is regularization? What is the regularization parameter in your model?

Plot regularization parameter value vs Pearson correlation for the test and training sets, and see whether your model has a bias problem or a variance problem.

This is an example of a very straightforward problem. The dataset is clean and small (160 rows and 9 columns), and the instructions are very clear. So, all that is needed is to follow the instructions and generate your code. Notice also that the instructions clearly specify that Python must be used as the programming language for model building. The time allowed for completing this coding assignment was three days. Only the final Jupyter notebook has to be submitted, and no formal project report is required.
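To make steps 5, 6, 8, and 10 of the exercise concrete, here is a minimal sketch rather than a reference solution. It assumes a Ridge regressor (whose regularization parameter is `alpha`) and uses a small synthetic stand-in for the data, since the real cruise_ship_info.csv is not reproduced here; the column names `Cruise_line`, `Tonnage`, and `cabins` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (NOT the real cruise_ship_info.csv).
rng = np.random.default_rng(0)
n = 160
tonnage = rng.uniform(5, 200, n)
cabins = tonnage / 10 + rng.normal(0, 1, n)
df = pd.DataFrame({
    "Cruise_line": rng.choice(["A", "B", "C"], n),  # hypothetical categories
    "Tonnage": tonnage,
    "cabins": cabins,
    "crew": 0.04 * tonnage + 0.3 * cabins + rng.normal(0, 0.5, n),
})

# Step 5: one-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="crew"), columns=["Cruise_line"], dtype=float)
y = df["crew"]

# Step 6: 60% of the rows for training, the remainder for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Fit the scaler on the training set only, to avoid leaking test information.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Steps 8 and 10: Pearson correlation on both sets for a sweep of alpha values.
alphas = np.logspace(-3, 4, 8)
results = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train_s, y_train)
    r_train, _ = pearsonr(y_train, model.predict(X_train_s))
    r_test, _ = pearsonr(y_test, model.predict(X_test_s))
    results.append((alpha, r_train, r_test))
    print(f"alpha={alpha:10.3f}  r_train={r_train:.3f}  r_test={r_test:.3f}")
```

Plotting the two correlation columns against `alpha` (for example with matplotlib on a log axis) gives the requested diagnostic: a large gap between training and test correlation suggests a variance problem, while low values on both suggest a bias problem.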


Tips for Acing Sample 1 Coding Exercise


Since the project involves building a machine learning model, the first step is to ensure we understand the machine learning process:

Figure 1. Illustrating the Machine Learning Process. Image by Benjamin O. Tayo.

1. Problem Framing

Define your project goals. What do you want to find out? Do you have the data to analyze?

Objective: The goal of this project is to build a regressor model that recommends the “crew” size for potential cruise ship buyers using the cruise ship dataset cruise_ship_info.csv.

2. Data Analysis

Import and clean the dataset, and analyze features to select the relevant features that correlate with the target variable.

2.1 Import dataset and display features and the target variable

import pandas as pd
df = pd.read_csv("cruise_ship_info.csv")



Table 1: Shows first 5 rows of dataset.

In this example, the dataset is clean and pristine, with no missing values. So, no cleaning is required.
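A quick way to confirm that a dataset has no missing values is to count them per column. The sketch below uses a tiny stand-in frame with made-up values so that it runs on its own; with the real file, `df` would come from `pd.read_csv` instead, and the column names here are assumptions.

```python
import pandas as pd

# Tiny stand-in frame with made-up values (NOT rows from the real dataset).
df = pd.DataFrame({
    "Ship_name": ["A", "B", "C", "D"],
    "Tonnage": [30.3, 47.3, 110.0, 101.4],
    "passengers": [6.9, 14.9, 29.7, 26.4],
    "crew": [3.6, 6.7, 19.1, 12.0],
})

print(df.columns.tolist())  # display the columns
print(df.describe())        # count, mean, std, min, quartiles, max

# Count missing values per column; all zeros means no cleaning is needed.
missing_per_column = df.isnull().sum()
print(missing_per_column)
needs_cleaning = bool(missing_per_column.any())
```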

Remarks on Data Quality: One of the major flaws with the dataset is that it doesn’t provide the units for the features. For example, the passengers column doesn’t tell if this column is in hundreds or thousands. The units for cabin length, passenger density, and crew are not provided either. The passenger_density feature seems to have been derived from other features, but there is no explanation of how it was derived. These sorts of issues can be addressed by contacting the interview team to ask more about the dataset. It is important to understand the intricacies of your data before using it for building real-world models. Keep in mind that a bad dataset leads to bad predictive models.

2.2 Calculate and visualize the covariance matrix

The covariance matrix plot can be used for feature selection and for quantifying the correlation between features (multi-collinearity). We observe from Figure 2 that there are strong correlations between features.

Figure 2. Covariance matrix plot.
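As a minimal sketch of how such a matrix can be computed, the snippet below standardizes the features and takes their covariance, which then coincides with the Pearson correlation matrix. The data is synthetic and the column names are assumptions; a heatmap of `cov_df` would give a plot in the spirit of the figure.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data with deliberately correlated columns.
rng = np.random.default_rng(1)
n = 160
tonnage = rng.uniform(5, 200, n)
df = pd.DataFrame({
    "Tonnage": tonnage,
    "passengers": tonnage / 4 + rng.normal(0, 3, n),
    "cabins": tonnage / 10 + rng.normal(0, 1, n),
    "crew": tonnage / 12 + rng.normal(0, 1, n),
})

cols = list(df.columns)
# Standardize so every feature has zero mean and unit variance.
Z = StandardScaler().fit_transform(df[cols])
# bias=True normalizes by n, matching StandardScaler's variance convention.
cov = np.cov(Z, rowvar=False, bias=True)
cov_df = pd.DataFrame(cov, index=cols, columns=cols)
print(cov_df.round(2))
```

Off-diagonal entries close to 1 or -1 flag pairs of features that carry largely the same information.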


2.3 Perform feature engineering by transforming features into the principal component analysis (PCA) space

Since the covariance matrix shows multi-collinearity, it is important to transform features into PCA space before training your model. This is important because multi-collinearity between features can lead to a model that is complex and difficult to interpret. PCA can also be used for variable selection and dimensionality reduction. In this case, only components that contribute significantly to the total explained variance can be retained and used for model building.
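The following sketch shows one way to do this with scikit-learn. It is illustrative only: the data is synthetic, and the 95% cumulative explained variance cut-off is an assumption, not a threshold taken from the exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features driven by one hidden "size" factor,
# plus one independent feature.
rng = np.random.default_rng(2)
n = 160
size = rng.normal(0, 1, n)
X = np.column_stack([
    size + 0.1 * rng.normal(size=n),
    size + 0.1 * rng.normal(size=n),
    size + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# PCA is scale-sensitive, so standardize first.
X_std = StandardScaler().fit_transform(X)

# Fit with all components to inspect the explained variance.
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(cum_var.round(3))

# Keep the smallest number of components reaching the assumed 95% threshold.
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
X_pca = PCA(n_components=n_keep).fit_transform(X_std)
print(X_pca.shape)
```

The retained components are uncorrelated with each other by construction, which removes the multi-collinearity at the cost of features that are harder to name.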

3. Model Building

Pick the machine learning tool that matches your data and desired outcome. Train the model with available data.

3.1 Model building and evaluation

Since our goal is to use regression, one could implement different regression algorithms such as Linear Regression (LR), KNeighbors Regression (KNR), and Support Vector Regression (SVR). The dataset has to be divided into training, validation, and test sets. Hyperparameter tuning has to be used to fine-tune the model in order to prevent overfitting. Cross-validation is essential to ensure the model performs well on the validation set. After fine-tuning model parameters, the model has to be applied to the test dataset. The model’s performance on the test dataset is approximately equal to what would be expected when the model is used for making predictions on unseen data.
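One possible way to organize this workflow is sketched below, assuming scikit-learn pipelines and a grid search with 5-fold cross-validation on the training set. The data is synthetic and the hyperparameter grids are illustrative choices, not the values used in the original solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the prepared features.
rng = np.random.default_rng(3)
X = rng.normal(size=(160, 4))
y = X @ np.array([2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.3, 160)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Candidate models with small, illustrative hyperparameter grids.
candidates = {
    "LR": (LinearRegression(), {}),
    "KNR": (KNeighborsRegressor(), {"model__n_neighbors": [3, 5, 7]}),
    "SVR": (SVR(kernel="linear"), {"model__C": [0.1, 1.0, 10.0]}),
}

test_scores = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so each CV fold is scaled separately.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
    search.fit(X_train, y_train)  # validation happens inside the CV folds
    test_scores[name] = search.score(X_test, y_test)  # final check on unseen data
    print(name, search.best_params_,
          round(search.best_score_, 3), round(test_scores[name], 3))
```

Here the cross-validation folds play the role of the validation set, so only a single train/test split is needed.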

3.2 Uncertainty Quantification

This can be done by training a model using different random partitions of the training dataset, then averaging the cross-validation score for each random state parameter.

Figure 3. Mean cross-validation scores for different regression models.
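A minimal sketch of this idea, assuming shuffled 5-fold cross-validation repeated over ten random states (both numbers are arbitrary choices for illustration) and synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the training set.
rng = np.random.default_rng(4)
X = rng.normal(size=(160, 4))
y = X @ np.array([2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.3, 160)

mean_scores = []
for random_state in range(10):
    # A different random partition of the data for every random state.
    cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
    mean_scores.append(scores.mean())

mean_scores = np.array(mean_scores)
# The spread across random states indicates how stable the score is.
print(f"R^2 = {mean_scores.mean():.3f} +/- {mean_scores.std():.3f}")
```

Repeating the loop for each candidate regressor gives the per-model means and spreads needed for a comparison plot.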

4. Application

Score your final model to generate predictions. Make your model available for production. Retrain your model as needed.

In this stage, the final machine learning model is selected and put into production. The model is evaluated in a production setting in order to assess its performance. Any errors encountered when transitioning from an experimental model to its actual performance on the production line have to be analyzed. This can then be used in fine-tuning the original model.

Based on the mean cross-validation score from Figure 3, we observe that Linear Regression and Support Vector Regression perform almost at the same level and better than KNeighbors Regression. So, the final model selected could either be Linear Regression or Support Vector Regression.

For a complete solution of sample 1 coding exercise, please see the following links:

Sample 1 recommended solution

Machine Learning Process Tutorial

Remarks on Sample 1 Coding Exercise

Sometimes the coding exercise would ask you to submit a Jupyter notebook only, or it may ask for a full project report. Make sure your Jupyter notebook is well organized to reflect every stage of the machine learning process. A sample Jupyter notebook can be found here: ML_Model_for_Predicting_Ships_Crew_Size.


Sample 2 Coding Exercise: Model for forecasting loan status



In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:

  • First, the borrower receives the funds. This event is called origination.
  • The borrower then makes regular repayments until one of the following happens:

(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.

(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.

In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:

  • The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
  • For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.

Objective: We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.

The dataset here is complex (it has 50,000 rows and 2 columns, and lots of missing values), and the problem is not very straightforward. You have to examine the dataset critically and then decide what model to use. This problem was to be solved in a week. It also specifies that a formal project report and an R script or Jupyter notebook file be submitted.


Tips for Acing Sample 2 Coding Exercise


As in Sample 1 coding exercise, you need to follow the machine learning steps when tackling this problem. This particular problem doesn’t have a unique solution. I attempted a solution using probabilistic modeling based on Monte-Carlo simulation.

For a complete solution of sample 2 coding exercise, please see the following links:
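To give a flavor of what a Monte-Carlo approach can look like, here is a deliberately simplified sketch. It is not the solution referenced in this article. It assumes a constant daily charge-off hazard (exponential lifetimes), estimates that hazard from the observed loans, and then simulates the remaining term of every still-active loan. The data is synthetic, and all names (`TERM_DAYS`, `simulate_final_chargeoff_fraction`) are hypothetical.

```python
import numpy as np

TERM_DAYS = 3 * 365  # 3-year loan term


def simulate_final_chargeoff_fraction(age, days_to_chargeoff, n_sims=500, seed=0):
    """Monte-Carlo estimate of the fraction of loans charged off by end of term.

    age: days since origination for every loan.
    days_to_chargeoff: days from origination to charge-off, NaN if still active.
    Assumption: a constant daily charge-off hazard for all loans.
    """
    rng = np.random.default_rng(seed)
    charged = ~np.isnan(days_to_chargeoff)

    # Exposure: days each loan was observed while at risk of charging off.
    exposure = np.where(charged, days_to_chargeoff, age)
    # Maximum-likelihood hazard under the constant-hazard assumption.
    hazard = charged.sum() / exposure.sum()

    # Probability that an active loan charges off during its remaining term.
    remaining = np.clip(TERM_DAYS - age[~charged], 0, None)
    p_future = 1.0 - np.exp(-hazard * remaining)

    # Each simulation draws one possible future for every active loan.
    draws = rng.random((n_sims, p_future.size)) < p_future
    fractions = (charged.sum() + draws.sum(axis=1)) / age.size
    return fractions.mean(), np.percentile(fractions, [2.5, 97.5])


# Synthetic portfolio (NOT the exercise data): known hazard, ages up to 2 years.
rng = np.random.default_rng(5)
n_loans = 5000
true_hazard = 3e-4
age = rng.integers(1, 731, n_loans).astype(float)
lifetime = rng.exponential(1.0 / true_hazard, n_loans)
days_to_chargeoff = np.where(lifetime < age, lifetime, np.nan)

mean_fraction, interval = simulate_final_chargeoff_fraction(age, days_to_chargeoff)
true_fraction = 1.0 - np.exp(-true_hazard * TERM_DAYS)
print(f"estimated: {mean_fraction:.3f}, 95% band: {interval.round(3)}")
print(f"true value for the synthetic data: {true_fraction:.3f}")
```

The band reported here only reflects randomness in the simulated futures, not uncertainty in the estimated hazard. With real data, the constant-hazard assumption should be checked, for example by comparing charge-off rates across loan ages, and stated explicitly in the report.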

Sample 2 recommended solution

R Script for Data Science Coding Exercise

Project Report for Data Science Coding Exercise

Remarks on Sample 2 Coding Exercise

The solutions presented above are recommended solutions only. Keep in mind that the solution to a data science or machine learning project is not unique. I challenge you to solve these problems before reviewing the sample solutions.




In summary, we’ve discussed some useful tips that could be helpful for any data science aspirant currently applying for data science openings. The coding exercise varies in scope and complexity, depending on the company you are applying to. The take-home coding exercise provides an excellent opportunity for you to showcase your ability to work on a data science project. You need to use this opportunity to demonstrate exceptional abilities in your understanding of data science and machine learning concepts. Don’t let this excellent opportunity slip away. If there are certain aspects of the project that you don’t understand, feel free to reach out to the data science interview team. They may provide some hints or clues.


