Visualising Data with Seaborn – Who Pays More For Health Insurance? | by Jason Chong | Nov, 2020
Seaborn is an immensely powerful data visualisation library that is built upon the Python programming language.
I recently caught up with a data scientist working in the consulting industry, and he told me about a grossly underrated part of his job that he wished he had known about before starting his role: communication. Contrary to his initial belief that technical skills are all that is required to be a successful data scientist, he quickly realised that the ability to communicate is equally important, if not more so.
In the real world, clients rarely have the time to go through the code that has been written for a project line by line, regardless of how well it was written. It is simply not feasible. In this situation, the ability to summarise and convey the key findings succinctly becomes essential.
This is where data visualisation comes in.
A picture is worth a thousand words.
When done properly, data visualisation aims to summarise the important trends and patterns in a dataset. It is used to tell a story behind certain occurrences.
In this article, I will demonstrate some useful visualisation techniques that you can use if you are doing a piece of analysis in Python. More specifically, I will be using Seaborn, a library built for visualising data.
I will mainly focus on visualising feature distributions as well as relationships between different variables in a dataset. Furthermore, in an effort to make this tutorial more concrete, I have applied these techniques to the Kaggle medical cost dataset.
If you’re interested, you can find the complete notebook on my GitHub here.
This dataset contains information about 1,338 insurance beneficiaries living in the United States and the individual amount they pay for their health insurance premium.
I have chosen this dataset for the following reasons:
- Good mix of numerical and categorical variables
- Not too many features to analyse
- Intuitive and straightforward relationship between the predictor variables and the response variable
In addition to the original features in the dataset, I have also created 3 new features (age category, weight condition and dependent), bringing the total number of features (columns) to 10. These consist of 6 categorical variables, listed first below, and 4 numerical variables.
- Sex: Insurance contractor gender
- Smoker: Smoking status
- Region: The beneficiary’s residential area in the United States
- Age category: Youth (18–35), adults (36–54) or seniors (55 and above)
- Weight condition: Underweight, normal weight, overweight or obese
- Dependent: Yes or no
- Age: Age of beneficiary
- BMI: Body mass index
- Children: Number of children (dependents) covered by health insurance
- Charges (target variable): Individual medical costs billed by health insurance
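The three derived features could be built along the following lines. This is only a sketch: the column names `age_cat`, `weight_condition` and `dependent` are hypothetical, the rows are made up, the BMI cut-offs of 18.5, 25 and 30 are the commonly used ones rather than values confirmed by the notebook, and treating "dependent" as "has at least one child covered" is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Kaggle file; only the column names match it.
df = pd.DataFrame({
    "age": [19, 35, 36, 54, 55, 64],
    "bmi": [17.0, 22.5, 27.9, 31.2, 24.9, 40.1],
    "children": [0, 1, 3, 0, 2, 0],
})

# Age category: youth (18-35), adults (36-54), seniors (55 and above).
# Bins are right-inclusive, so (17, 35], (35, 54], (54, inf).
df["age_cat"] = pd.cut(
    df["age"], bins=[17, 35, 54, np.inf], labels=["youth", "adult", "senior"]
)

# Weight condition, ASSUMING cut-offs of 18.5, 25 and 30 (left-inclusive bins).
df["weight_condition"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,
    labels=["underweight", "normal", "overweight", "obese"],
)

# Dependent: ASSUMED to mean "at least one child covered by the policy".
df["dependent"] = np.where(df["children"] > 0, "yes", "no")

print(df)
```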
Before visualising the features in our dataset, it is always good practice to first categorise them as either categorical or numerical. Values of a categorical variable are labels drawn from a limited set of groups, whereas values of a numerical variable are quantities measured on a scale. Hence, they each require different plots in order to visualise them in a meaningful way.
In this section, we will look at the various plots that are available in Seaborn to visualise categorical variables as well as numerical variables.
For categorical variables, we have:
- sns.countplot
- sns.catplot (formerly sns.factorplot)
For numerical variables, we have:
- sns.boxplot
- sns.distplot
- sns.kdeplot
Countplot simply shows the count of each unique value in a column.
Here, I have applied countplot to all the categorical variables in the dataset.
We can also add percentages to the countplot. Kindly refer to my notebook for the code.
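As a rough illustration of a countplot with percentage labels (not the notebook's code), the sketch below counts a made-up `smoker` column and writes each category's share above its bar. It relies on categorical bars being placed at x = 0, 1, ... in the order passed to `order`.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample: 8 non-smokers and 2 smokers.
df = pd.DataFrame({"smoker": ["no"] * 8 + ["yes"] * 2})

order = ["no", "yes"]
counts = df["smoker"].value_counts()

fig, ax = plt.subplots()
sns.countplot(x="smoker", data=df, order=order, ax=ax)

# Write each category's share of the total just above its bar.
for i, level in enumerate(order):
    share = 100 * counts[level] / len(df)
    ax.text(i, counts[level], f"{share:.1f}%", ha="center", va="bottom")
```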
sns.catplot (formerly sns.factorplot)
Catplot allows us to further break down a categorical variable using another categorical variable.
Here, I have divided the population by their gender, smoking status and age category.
Here, I have divided the population by region and smoking status.
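A breakdown like the first one could be sketched with `sns.catplot` and `kind="count"`, using `hue` and `col` for the extra categorical variables. The data here is randomly generated and the `age_cat` column name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data, 30 rows per age category.
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
    "age_cat": np.repeat(["youth", "adult", "senior"], n // 3),
})

# One panel per age category, bars per gender, coloured by smoking status.
g = sns.catplot(x="sex", hue="smoker", col="age_cat", kind="count", data=df)
```

Swapping `x` for a `region` column and dropping `col` would give a single-panel breakdown by region and smoking status.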
Boxplot is one of the most common plots in statistics. It gives an overview of the distribution of a continuous variable.
Smokers are at a much higher risk of various health complications and therefore are riskier and more expensive to insure from the perspective of the insurer. To compensate for this, smokers are charged at a higher rate.
As we get older, we get sick more easily. The higher premium reflects the higher medical cost that is needed to insure an individual who is older.
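A minimal boxplot sketch, assuming a numerical `charges` column split by a categorical `smoker` column. The data is simulated with a deliberately large gap between the two groups, so the numbers carry no meaning beyond illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data: smokers are given higher charges on purpose.
rng = np.random.default_rng(0)
n = 100
smoker = rng.choice(["yes", "no"], size=n, p=[0.2, 0.8])
charges = 8000 + 24000 * (smoker == "yes") + rng.normal(0, 3000, size=n)
df = pd.DataFrame({"smoker": smoker, "charges": charges})

fig, ax = plt.subplots()
sns.boxplot(x="smoker", y="charges", data=df, ax=ax)

# The line inside each box is the group median.
medians = df.groupby("smoker")["charges"].median()
print(medians)
```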
Distplot combines a histogram with a kernel density estimate to illustrate the distribution of a continuous variable.
BMI appears to follow a Gaussian distribution, but the average BMI falls in the obese range. This comes as no surprise, as the United States has one of the highest levels of obesity in the world.
In statistics, kernel density estimation is a non-parametric way to estimate the probability density function of a continuous random variable.
Males pay higher premiums than females. This could be due to the fact that there are more male smokers than female smokers.
In this section, we will go over several plots that will allow us to visualise the relationship between different variables in our dataset.
Heatmap is one of the easiest ways to analyse the correlation between numerical variables. A positive correlation implies that two variables move in the same direction. Conversely, a negative correlation implies that two variables move in opposite directions.
The diagonal of a correlation heatmap contains only ones because each entry represents the correlation between a variable and itself. In other words, a variable is perfectly correlated with itself.
The diagonal can also be seen as a mirror between the bottom triangle and the top triangle. If you look closely, the two triangles contain the same set of information.
There are two ways to interpret a heatmap: by reading across the columns or by reading down the rows.
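To make the diagonal and the mirror-image property concrete, here is a small sketch that computes a correlation matrix with pandas and passes it to `sns.heatmap`. The four numerical columns are simulated; only their names follow the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated numerical features; charges is made to depend on age and BMI.
rng = np.random.default_rng(0)
n = 150
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
children = rng.integers(0, 4, size=n)
charges = 250 * age + 300 * bmi + rng.normal(0, 3000, size=n)
df = pd.DataFrame({"age": age, "bmi": bmi, "children": children, "charges": charges})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# The diagonal is all ones and the matrix equals its own transpose.
print(np.allclose(np.diag(corr.to_numpy()), 1.0))
print(np.allclose(corr.to_numpy(), corr.to_numpy().T))
```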
Barplot represents an estimate of central tendency for a numerical variable with the height of each rectangle. In addition, it indicates the uncertainty around that estimate using error bars.
The southeast region pays the highest premium. This could be due to the fact that the southeast region has the highest number of smokers, as we have seen earlier.
Policyholders with children pay a higher premium than those without children. This can be attributed to the higher medical cost that is needed to insure the children of the main beneficiary of the insurance product.
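By default, `sns.barplot` uses the mean as its estimate of central tendency, so the bar heights match a plain group-by mean. A short sketch with made-up figures for two regions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up charges for two regions; the values are illustrative only.
df = pd.DataFrame({
    "region": ["southeast"] * 4 + ["northwest"] * 4,
    "charges": [12000, 15000, 18000, 15000, 9000, 11000, 13000, 11000],
})

fig, ax = plt.subplots()
# Bar height is the mean of `charges` per region; error bars come from bootstrapping.
sns.barplot(x="region", y="charges", data=df, ax=ax)

# The same means, computed directly.
means = df.groupby("region")["charges"].mean()
print(means)
```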
Jointplot draws a scatterplot of two numerical variables together with the distribution of each variable along the margins.
Scatterplot also plots each observation as a point positioned by two numerical variables. It is a great way to visualise and detect outliers in the dataset.
Here, I have visualised premium charges against BMI. The data points have been further divided by smoking status.
I also plotted charges against the age of obese individuals in the population. Again, the data points are separated by smoking status.
We can conclude that smoking plays a critical role in determining the amount that is paid for health insurance.
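The two scatterplots described above could be sketched as follows, with `hue` separating smokers from non-smokers. The data is simulated, and the obese subset is taken as BMI of 30 or more, which is an assumption about how the filter was defined.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data: smokers are given much higher charges on purpose.
rng = np.random.default_rng(0)
n = 150
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.choice(["yes", "no"], size=n, p=[0.25, 0.75])
charges = 250 * age + 20000 * (smoker == "yes") + rng.normal(0, 2500, size=n)
df = pd.DataFrame({"age": age, "bmi": bmi, "smoker": smoker, "charges": charges})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Charges against BMI, coloured by smoking status.
sns.scatterplot(x="bmi", y="charges", hue="smoker", data=df, ax=ax1)

# Charges against age for the obese subset only (ASSUMED to mean BMI >= 30).
obese = df[df["bmi"] >= 30]
sns.scatterplot(x="age", y="charges", hue="smoker", data=obese, ax=ax2)
```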
Regplot plots the data and adds a linear regression model fit.
The fitted line has a positive slope, which suggests that there is a positive relationship between BMI and premium levels.
This is expected because similar to smoking, people who are obese are at a much higher risk of health-related issues such as diabetes, high blood pressure and stroke.
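`sns.regplot` draws the fitted line but does not return its coefficients. One way to check the sign of the slope is to fit the same straight line with `np.polyfit`, as in this sketch on simulated data with a positive trend built in.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data: charges rise with BMI by construction.
rng = np.random.default_rng(0)
bmi = np.linspace(18, 45, 50)
charges = 2000 + 400 * bmi + rng.normal(0, 1500, size=bmi.size)
df = pd.DataFrame({"bmi": bmi, "charges": charges})

fig, ax = plt.subplots()
# Scatter of the data plus a least-squares line and its confidence band.
sns.regplot(x="bmi", y="charges", data=df, ax=ax)

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(df["bmi"], df["charges"], 1)
print(f"slope: {slope:.1f}")
```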
Lmplot is similar to regplot but offers the ability to categorise the data points into different groups.
Here, I have used an lmplot to visualise charges against age and further categorised the data points by age category.
Here is another example where I have plotted charges against BMI and further grouped the data points by smoking status.
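A sketch of the second lmplot, assuming `bmi`, `charges` and `smoker` columns. Passing `hue` fits a separate regression line for each group. The data is simulated so that the slope differs between smokers and non-smokers.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data: a steeper BMI effect for smokers, by construction.
rng = np.random.default_rng(0)
n = 150
bmi = rng.normal(30, 6, size=n)
smoker = rng.choice(["yes", "no"], size=n, p=[0.3, 0.7])
is_smoker = smoker == "yes"
charges = 5000 + 100 * bmi + 1200 * bmi * is_smoker + rng.normal(0, 3000, size=n)
df = pd.DataFrame({"bmi": bmi, "smoker": smoker, "charges": charges})

# One scatter and one fitted line per smoking status.
g = sns.lmplot(x="bmi", y="charges", hue="smoker", data=df)
```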
Swarmplot draws a categorical scatterplot with non-overlapping points.
There are two observations that can be made via this plot:
- Smokers pay a higher premium than non-smokers
- The vast majority of non-smokers can be seen at the lower end of the premium levels
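A swarmplot of this kind could be produced roughly as follows. The sample is simulated and kept small, since swarmplots become crowded when many points share a category.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small simulated sample: smokers are given higher charges on purpose.
rng = np.random.default_rng(0)
n = 60
smoker = rng.choice(["yes", "no"], size=n, p=[0.3, 0.7])
charges = 8000 + 22000 * (smoker == "yes") + rng.normal(0, 4000, size=n)
df = pd.DataFrame({"smoker": smoker, "charges": charges})

fig, ax = plt.subplots()
# Every observation is drawn; points are nudged sideways so they do not overlap.
sns.swarmplot(x="smoker", y="charges", data=df, ax=ax)
```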
Violinplot draws a combination of boxplot and kernel density estimate.
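As a brief sketch, `sns.violinplot` with `inner="box"` draws the density outline with a miniature boxplot inside it. The choice of `smoker` and `charges` as the variables is an assumption, and the data is simulated.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data; the variable choice is illustrative only.
rng = np.random.default_rng(0)
n = 120
smoker = rng.choice(["yes", "no"], size=n, p=[0.3, 0.7])
charges = 9000 + 20000 * (smoker == "yes") + rng.normal(0, 4000, size=n)
df = pd.DataFrame({"smoker": smoker, "charges": charges})

fig, ax = plt.subplots()
# Outer shape: kernel density estimate. Inner marks: a small boxplot.
sns.violinplot(x="smoker", y="charges", inner="box", data=df, ax=ax)
```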
Pointplot represents an estimate of central tendency for a numerical variable by the position of scatter plot points. It also provides some indication of the uncertainty around that estimate using error bars.
Here, I have plotted charges against weight condition, categorised by smoking status.
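A pointplot along these lines might be written as below. The `weight_condition` column name is hypothetical, `order` fixes the left-to-right ordering of the categories, and the data is simulated.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data with hypothetical column names.
rng = np.random.default_rng(0)
n = 200
levels = ["underweight", "normal", "overweight", "obese"]
weight = rng.choice(levels, size=n)
smoker = rng.choice(["yes", "no"], size=n, p=[0.3, 0.7])
charges = (
    8000
    + 20000 * (smoker == "yes")
    + 4000 * (weight == "obese")
    + rng.normal(0, 3000, size=n)
)
df = pd.DataFrame({"weight_condition": weight, "smoker": smoker, "charges": charges})

fig, ax = plt.subplots()
# Each point is a group mean; the vertical bars show the uncertainty around it.
sns.pointplot(
    x="weight_condition", y="charges", hue="smoker", order=levels, data=df, ax=ax
)
```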
Pairplot plots pairwise relationships between numerical variables in the dataset.
Here is a pairplot categorised by smoking status.
To summarise, Seaborn is a robust data visualisation library that is built on top of the Python programming language.
In this article, we have considered various plots that can be used to visualise the distribution as well as the relationship between different variables in a dataset.
When visualising data, it is important to distinguish between variables that are categorical and numerical as they each require different plots in order to produce visualisations that are meaningful.
I hope that you have learned more about the Seaborn library as well as how different features like age, gender, smoking status and weight condition correlate with premium levels.
Thank you so much for reading. Happy learning!