Visualising Data with Seaborn – Who Pays More For Health Insurance? | by Jason Chong | Nov, 2020


Seaborn is an immensely powerful data visualisation library for Python, built on top of Matplotlib.

Jason Chong

I recently caught up with a data scientist working in the consulting industry, and he told me about a grossly underrated part of his job that he wished he had known about before starting: communication. Contrary to his initial belief that technical skills are all it takes to be a successful data scientist, he quickly realised that the ability to communicate is equally, if not more, important.

In the real world, clients rarely have the time to go through the code that has been written for a project line by line, regardless of how well it was written. It is simply not feasible. In this situation, the ability to summarise and convey the key findings succinctly becomes essential.

This is where data visualisation comes in.

A picture is worth a thousand words.

Done properly, data visualisation summarises the important trends and patterns in a dataset and tells the story behind them.

In this article, I will demonstrate some useful visualisation techniques that you can use when doing a piece of analysis in Python. More specifically, I will be using Seaborn, a library built for the purpose of visualising data.

I will mainly focus on visualising feature distributions as well as relationships between different variables in a dataset. To make this tutorial more concrete, I have applied these techniques to the Kaggle medical cost dataset.

If you’re interested, you can find the complete notebook on my GitHub here.

This dataset contains information about 1,338 insurance beneficiaries living in the United States and the individual amount they pay for their health insurance premium.

I have chosen this dataset for the following reasons:

  • Good mix of numerical and categorical variables
  • Not too many features to analyse
  • Intuitive and straightforward relationship between the predictor variables and the response variable

In addition to the original features in the dataset, I have also created 3 new features, bringing the total number of features (columns) to 10: 6 categorical variables and 4 numerical variables.

The first five rows of the dataset.

Categorical variables

  • Sex: Insurance contractor gender
  • Smoker: Smoking status
  • Region: The beneficiary’s residential area in the United States
  • Age category: Youth (18–35), adults (36–54) or seniors (55 and above)
  • Weight condition: Underweight, normal weight, overweight or obese
  • Dependent: Yes or no

Numerical variables

  • Age: Age of beneficiary
  • BMI: Body mass index
  • Children: Number of children (dependents) covered by health insurance
  • Charges (target variable): Individual medical costs billed by health insurance
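The three derived features (age category, weight condition and dependent) can be built from the original columns with pandas. The following is a minimal sketch rather than the notebook's actual code: the rows are made up, the column names `age`, `bmi` and `children` are assumed to match the Kaggle file, the new column names are hypothetical, and the BMI cut-offs of 18.5, 25 and 30 are the commonly used thresholds, which the article does not state.

```python
import pandas as pd

# Made-up rows; column names are assumed to follow the Kaggle file.
df = pd.DataFrame({
    "age": [19, 36, 54, 55, 64],
    "bmi": [17.9, 24.9, 27.5, 30.0, 41.2],
    "children": [0, 2, 0, 1, 3],
})

# Age bands from the list above: 18-35, 36-54, 55 and above.
# With the default right=True, the bins are (17, 35], (35, 54], (54, 120].
df["age_cat"] = pd.cut(df["age"], bins=[17, 35, 54, 120],
                       labels=["Youth", "Adult", "Senior"])

# Assumed BMI thresholds; right=False makes each bin include its lower edge.
df["weight_condition"] = pd.cut(
    df["bmi"], bins=[0, 18.5, 25, 30, 100], right=False,
    labels=["Underweight", "Normal Weight", "Overweight", "Obese"])

# A beneficiary counts as having dependents if at least one child is covered.
df["dependent"] = df["children"].gt(0).map({True: "Yes", False: "No"})
```

Using `pd.cut` keeps the bands as ordered categories, so later plots show them in a sensible order rather than alphabetically.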

Before visualising the features in our dataset, it is always good practice to first categorise them as either categorical or numerical. Values of a categorical variable fall into a fixed set of groups, whereas values of a numerical variable are measured quantities such as counts or continuous amounts. Hence, they each require different plots in order to visualise them in a meaningful way.

In this section, we will look at the various plots that are available in Seaborn to visualise categorical variables as well as numerical variables.

For categorical variables, we have:

  • sns.countplot
  • sns.catplot (formerly sns.factorplot)

For numerical variables, we have:

  • sns.boxplot
  • sns.distplot
  • sns.kdeplot

sns.countplot

Countplot simply shows the count of each unique value in a column.

Here, I have applied countplot to all the categorical variables in the dataset.
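As a rough illustration, a countplot for a single column might look like the sketch below. The six rows are invented and the column name `smoker` is assumed from the Kaggle file.

```python
import pandas as pd
import seaborn as sns

# Invented stand-in for the insurance data.
df = pd.DataFrame({"smoker": ["no", "no", "yes", "no", "yes", "no"]})

# One bar per unique value; bar height is the number of rows with that value.
ax = sns.countplot(x="smoker", data=df)
ax.set_title("Beneficiaries by smoking status")

# The same numbers as a table, handy for checking the plot.
counts = df["smoker"].value_counts()
# In a script (outside a notebook), call matplotlib.pyplot.show() to display it.
```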

There is almost an equal distribution between male and female.
There are more non-smokers than there are smokers.

We can also add percentages to the countplot. Kindly refer to my notebook for the code.
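One possible way to do this, which may differ from the notebook, is to loop over the bars that countplot draws and annotate each with its share of the total. The data and the `region` column name below are assumptions for illustration.

```python
import pandas as pd
import seaborn as sns

# Invented rows for illustration.
df = pd.DataFrame({"region": ["southeast", "southeast", "northwest", "southwest"]})

ax = sns.countplot(x="region", data=df)

# Write each bar's percentage of all rows just above the bar.
total = len(df)
for bar in ax.patches:
    height = bar.get_height()
    if height > 0:  # skip any empty placeholder patches
        ax.annotate(f"{100 * height / total:.1f}%",
                    (bar.get_x() + bar.get_width() / 2, height),
                    ha="center", va="bottom")

# The same percentages computed directly with pandas.
shares = df["region"].value_counts(normalize=True).mul(100).round(1)
```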

There is almost an equal distribution across the 4 regions with southeast having slightly more people.
There is almost an equal distribution between youth and adults. Seniors have the least number of people.
Oh no! More than half the population are considered obese.
The majority of beneficiaries have children.

sns.catplot (formerly sns.factorplot)

Catplot allows us to further break down a categorical variable using another categorical variable.

Here, I have divided the population by their gender, smoking status and age category.
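A sketch of such a breakdown is shown below. Which variable goes on the axis, the hue and the columns is my assumption, as are the invented rows and the column names `sex`, `smoker` and `age_cat`.

```python
import pandas as pd
import seaborn as sns

# Invented rows for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "age_cat": ["Youth", "Youth", "Adult", "Adult", "Senior", "Senior"],
})

# kind="count" draws a countplot in every panel:
# bars by sex, coloured by smoking status, one panel per age category.
grid = sns.catplot(x="sex", hue="smoker", col="age_cat", data=df, kind="count")

# A cross-tabulation gives the numbers behind the bars (summed over panels).
table = pd.crosstab(df["sex"], df["smoker"])
```

Unlike countplot, catplot returns a grid of axes rather than a single one, which is what makes the per-category panels possible.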

There are more male smokers than there are female smokers. Most smokers belong to the youth and adult age categories.

Here, I have divided the population by region and smoking status.

Southeast region has the highest number of smokers.

sns.boxplot

Boxplot is one of the most common plots in statistics. It gives an overview of the distribution of a continuous variable.
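For instance, charges can be compared across smoking status as in this sketch. The ten rows are invented and the column names are assumed.

```python
import pandas as pd
import seaborn as sns

# Invented rows: five non-smokers followed by five smokers.
df = pd.DataFrame({
    "smoker": ["no"] * 5 + ["yes"] * 5,
    "charges": [2000, 4000, 6000, 8000, 12000,
                18000, 22000, 30000, 38000, 45000],
})

# Each box spans the first to third quartile, with a line at the median.
ax = sns.boxplot(x="smoker", y="charges", data=df)

# The medians that the centre lines of the boxes represent.
medians = df.groupby("smoker")["charges"].median()
```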

As expected, smokers pay a higher premium than non-smokers.

Smokers are at a much higher risk of various health complications and therefore are riskier and more expensive to insure from the perspective of the insurer. To compensate for this, smokers are charged at a higher rate.

Health insurance premium increases with age.

As we get older, we get sick more easily. The higher premium reflects the higher medical cost that is needed to insure an individual who is older.

sns.distplot

Distplot combines a histogram with a kernel density estimate to illustrate the distribution of a continuous variable.

Average BMI for this population is 30.66.

BMI appears to follow a Gaussian distribution, but the average BMI falls in the obese range. This comes as no surprise, as the United States has one of the highest levels of obesity in the world.

sns.kdeplot

In statistics, kernel density estimation is a non-parametric way to estimate the probability density function of a continuous random variable.

Males pay higher premiums than females. This could be due to the fact that there are more male smokers than there are female smokers.

Premium level increases with age, as we have seen earlier.

In this section, we will go over several plots that will allow us to visualise the relationship between different variables in our dataset.

These include:

  • sns.heatmap
  • sns.barplot
  • sns.jointplot
  • sns.scatterplot
  • sns.regplot
  • sns.lmplot
  • sns.swarmplot
  • sns.violinplot
  • sns.pointplot
  • sns.pairplot

sns.heatmap

Heatmap is one of the easiest ways to analyse the correlation between numerical variables. A positive correlation implies that two variables move in the same direction. Conversely, a negative correlation implies that two variables move in opposite directions.

Age is most correlated with charges and children is the least correlated.

The diagonal of a correlation heatmap is always one because it represents the correlation between a variable and itself. In other words, a variable is perfectly correlated with itself.

The diagonal can also be seen as a mirror between the bottom triangle and the top triangle. If you look closely, the two triangles contain the same set of information.

There are two ways to interpret a heatmap: by reading across the columns or by reading down the rows.
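The points above about the diagonal and the mirrored triangles can be checked directly on the correlation matrix that feeds the heatmap. This is a sketch on randomly generated data; the coefficients used to build `charges` are arbitrary.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated numerical features (illustration only).
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 6, n),
})
df["charges"] = 250 * df["age"] + 300 * df["bmi"] + rng.normal(0, 4000, n)

# Pairwise Pearson correlations between all numerical columns.
corr = df.corr()

# annot=True prints each coefficient inside its cell.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)

# Diagonal entries are all one, and the matrix equals its own transpose.
diagonal_is_one = np.allclose(np.diag(corr), 1.0)
is_symmetric = np.allclose(corr, corr.T)
```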

sns.barplot

Barplot represents an estimate of central tendency for a numerical variable with the height of each rectangle. In addition, it also indicates the uncertainty around that estimate using error bars.
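By default the estimate is the mean and the error bars show a bootstrapped confidence interval. A small sketch with invented numbers:

```python
import pandas as pd
import seaborn as sns

# Invented rows for two regions.
df = pd.DataFrame({
    "region": ["southeast"] * 4 + ["northwest"] * 4,
    "charges": [9000, 15000, 21000, 11000, 7000, 12000, 10000, 11000],
})

# Bar height = mean charges per region; error bars = bootstrap confidence interval.
ax = sns.barplot(x="region", y="charges", data=df)

# The group means that the bar heights represent.
means = df.groupby("region")["charges"].mean()
```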

Southeast region pays the highest premium. This could be due to the fact that the southeast region has the highest number of smokers as we have seen earlier.

Policyholders with children pay a higher premium than those without children. This can be attributed to the higher medical cost that is needed to insure the children of the main beneficiary of the insurance product.

sns.jointplot

Jointplot shows how data points are distributed across two numerical variables at once, with the distribution of each individual variable drawn along the margins.

The darker region represents the majority of the population.

sns.scatterplot

Scatterplot also shows the location of data points between two separate numerical variables. It is also a great way to visualise and detect outliers in the dataset.

Here, I have visualised premium charges against BMI. The data points have been further divided by smoking status.
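A plot along these lines could be produced as follows. The data are randomly generated, with an arbitrary formula that makes charges rise faster with BMI for smokers.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated data (illustration only).
rng = np.random.default_rng(4)
n = 200
smoker = rng.choice(["yes", "no"], size=n, p=[0.2, 0.8])
bmi = rng.normal(30, 6, n)
is_smoker = smoker == "yes"
charges = 5000 + 100 * bmi + is_smoker * (1000 * bmi - 10000) + rng.normal(0, 2000, n)
df = pd.DataFrame({"bmi": bmi, "smoker": smoker, "charges": charges})

# hue colours each point by smoking status and adds a legend.
ax = sns.scatterplot(x="bmi", y="charges", hue="smoker", data=df)
```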

Charges increase with BMI. This increase is more evident in smokers.

I also plotted charges against the age of obese individuals in the population. Again, the data points are separated by smoking status.

Charges increase with age. There is a clear distinction between the two classes.

We can conclude that smoking plays a critical role in determining the amount that is paid for health insurance.

sns.regplot

Regplot plots the data and adds a linear regression model fit.
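As a sketch on randomly generated data, the fitted line can be drawn with regplot and its slope checked separately with NumPy. The slope of 400 used to generate the data is arbitrary.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated data with an upward trend (illustration only).
rng = np.random.default_rng(5)
n = 200
bmi = rng.normal(30, 6, n)
df = pd.DataFrame({"bmi": bmi, "charges": 400 * bmi + rng.normal(0, 3000, n)})

# Scatter of the data plus a least-squares line with a shaded confidence band.
ax = sns.regplot(x="bmi", y="charges", data=df, scatter_kws={"alpha": 0.4})

# regplot does not return the coefficients, so fit the same line with NumPy.
slope, intercept = np.polyfit(df["bmi"], df["charges"], deg=1)
```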

Charges increase with BMI, as we have seen earlier.

The fitted line has a positive slope, which suggests that there is a positive relationship between BMI and premium levels.

This is expected because, as with smokers, people who are obese are at a much higher risk of health-related issues such as diabetes, high blood pressure and stroke.

sns.lmplot

Lmplot is similar to regplot but offers the ability to categorise the data points into different groups.

Here, I have used an lmplot to visualise charges against age and further categorised the data points by age category.

Charges increase with age. This increase is the steepest for senior citizens.

Here is another example where I have plotted charges against BMI and further grouped the data points by smoking status.
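A minimal sketch of this second example, on randomly generated data in which smokers are given a steeper slope by construction:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated data: 60 smokers and 240 non-smokers (illustration only).
rng = np.random.default_rng(6)
smoker = np.array(["yes"] * 60 + ["no"] * 240)
bmi = rng.normal(30, 6, smoker.size)
is_smoker = smoker == "yes"
charges = 100 * bmi + is_smoker * (1200 * bmi - 15000) + rng.normal(0, 2000, smoker.size)
df = pd.DataFrame({"bmi": bmi, "smoker": smoker, "charges": charges})

# hue fits and draws a separate regression line for each group.
grid = sns.lmplot(x="bmi", y="charges", hue="smoker", data=df)

# Slope of the least-squares line within each group, computed with NumPy.
slopes = {
    name: np.polyfit(group["bmi"], group["charges"], deg=1)[0]
    for name, group in df.groupby("smoker")
}
```

The main practical difference from regplot is that lmplot returns a grid and accepts `hue`, `col` and `row`, so one call can fit a line per group or per panel.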

Charges increase with BMI. This increase is more significant in smokers.

sns.swarmplot

Swarmplot draws a categorical scatterplot with non-overlapping points.
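The following sketch shows the idea on twenty invented rows. The choice of smoking status against charges is assumed from the observations that follow.

```python
import pandas as pd
import seaborn as sns

# Invented rows: ten non-smokers followed by ten smokers.
df = pd.DataFrame({
    "smoker": ["no"] * 10 + ["yes"] * 10,
    "charges": [1500, 2200, 3100, 4000, 4800, 5600, 6900, 8200, 9500, 12000,
                16000, 19000, 21000, 24000, 28000, 33000, 36000, 39000, 42000, 47000],
})

# Every row is drawn as its own point; points are nudged sideways to avoid overlap.
ax = sns.swarmplot(x="smoker", y="charges", data=df)

# Range of charges per group, for comparison with the plot.
ranges = df.groupby("smoker")["charges"].agg(["min", "max"])
```

Because every observation is drawn, swarmplot works best on small to medium datasets; with thousands of rows the points can no longer all be placed without overlap.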

There are two observations that can be made via this plot:

  • Smokers pay a higher premium than non-smokers
  • The vast majority of non-smokers can be seen at the lower end of the premium levels

sns.violinplot

Violinplot draws a combination of boxplot and kernel density estimate.
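As a rough sketch on randomly generated data (the article does not say which variables were plotted, so smoking status against charges is an assumption):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated charges for two groups (illustration only).
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "smoker": ["no"] * 100 + ["yes"] * 100,
    "charges": np.concatenate([rng.normal(8000, 3000, 100),
                               rng.normal(32000, 8000, 100)]),
})

# Each violin is a mirrored density curve with a small boxplot drawn inside it.
ax = sns.violinplot(x="smoker", y="charges", data=df)
```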

sns.pointplot

Pointplot represents an estimate of central tendency for a numerical variable by the position of scatter plot points. It also provides some indication of the uncertainty around that estimate using error bars.

Here, I have plotted charges against weight condition, categorised by smoking status.
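A sketch of that plot with eight invented rows, two per combination of weight condition and smoking status; the column name `weight_condition` is hypothetical.

```python
import pandas as pd
import seaborn as sns

# Invented rows, two per group.
df = pd.DataFrame({
    "weight_condition": ["Normal Weight"] * 4 + ["Obese"] * 4,
    "smoker": ["no", "no", "yes", "yes"] * 2,
    "charges": [4000, 6000, 18000, 22000, 7000, 9000, 40000, 44000],
})

# One point per group at the mean of charges, joined by a line for each hue level.
ax = sns.pointplot(x="weight_condition", y="charges", hue="smoker", data=df)

# The group means that the points represent.
means = df.groupby(["weight_condition", "smoker"])["charges"].mean()
```

The connecting lines make it easy to compare how steeply each group changes from one category to the next.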

Obese smokers pay a significantly higher premium than the rest of the population.

sns.pairplot

Pairplot plots pairwise relationships between numerical variables in the dataset.

Here is a pairplot categorised by smoking status.

To summarise, Seaborn is a robust data visualisation library for Python, built on top of Matplotlib.

In this article, we have considered various plots that can be used to visualise the distribution as well as the relationship between different variables in a dataset.

When visualising data, it is important to distinguish between categorical and numerical variables, as they each require different plots in order to produce meaningful visualisations.

I hope that you have learned more about the Seaborn library as well as how different features like age, gender, smoking status and weight condition correlate with premium levels.

Thank you so much for reading. Happy learning!
