How to be a Pretty Pythonista? 👸

Jela Nalica
5 min read · Oct 7, 2020

by making pretty data plots in a Jupyter Notebook from Anaconda.

Photo by Tai's Captures on Unsplash

First, launch your Jupyter Notebook then create a new notebook with Python 3. Next, import the necessary libraries so you can start doing the magic.

I actually imported several libraries, but today I’ll mostly be using Pandas and Seaborn.

The Pandas library turns datasets into data frames (table form), and Seaborn plots the data in a pretty manner 🙂.

But how does Seaborn make data plotting “pretty”?

Simple: it offers many kinds of data plots and a broad spectrum of color palettes to choose from, which you can customize too. It’s like adding make-up to your graphs, but with code.

You can take a look at the available color palettes from a story I read here.

Remember, though, that pretty plots are not enough on their own. The real point is to tell a comprehensible story with your data.

Note: Make sure you have Seaborn version 0.11 installed.

Import Library and Upload File

From here on, I’ll be working with an insurance dataset, so we need to load it into the notebook, as seen above.
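In case the screenshot is unavailable, here is a minimal sketch of what such a setup cell might look like. Only the `sex` column name is confirmed by this article; the other column names, the file name `insurance.csv`, and the three inline rows are assumptions made so the snippet runs without the file.

```python
import io

import pandas as pd
import seaborn as sns

# Assumed layout of the insurance file. Only 'sex' is named in the article;
# the other column names are guesses based on the text.
# The rows below are made-up placeholders, not real records.
toy_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,southeast,3000.0\n"
    "40,male,30.1,2,yes,northwest,21000.0\n"
    "33,female,27.8,1,no,southwest,5200.0\n"
)

# With the real file you would pass its path instead, for example
# pd.read_csv("insurance.csv")  (hypothetical file name).
df = pd.read_csv(toy_csv)

# Optional: pick a clean default look for every plot that follows.
sns.set_theme(style="whitegrid")

print(df.head())
```

Reading from a path works the same way; the in-memory text only replaces the file for illustration.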

Women vs Men

My analysis of the insurance dataset compares females and males (the ‘sex’ category) in terms of their age, BMI, number of children, whether they smoke, the region they come from, and finally the amount of their insurance charges. For each category, I will use different visuals and palettes so we can explore the beauty of the Seaborn library.

REGION

Dataframe_Region

Here, I wanted to see how the insurance dataset is distributed across region and sex.

From this data frame, I created two plots: the countplot and catplot.

Countplot vs Catplot

The countplot shows the distribution of males and females per region on a single set of axes. The catplot is not much different from the countplot (as used in this case), except that the grouping per region is more apparent because the plot is split into several subplots.
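The difference between the two is easier to see in code. This is a sketch on synthetic stand-in data, not the article's notebook: the column names, the region labels other than southeast, and the `Set2` palette are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; column names and region labels are assumptions.
rng = np.random.default_rng(0)
n = 80
region_df = pd.DataFrame({
    "region": rng.choice(
        ["northeast", "northwest", "southeast", "southwest"], size=n
    ),
    "sex": rng.choice(["female", "male"], size=n),
})

# One set of axes: bars for each sex sit side by side within every region.
ax = sns.countplot(data=region_df, x="region", hue="sex", palette="Set2")

# One subplot per region: catplot with kind="count" draws the same counts,
# but col="region" gives every region its own panel.
g = sns.catplot(
    data=region_df, x="sex", hue="sex", col="region", col_wrap=2,
    kind="count", palette="Set2", dodge=False,
)
```

`countplot` returns a single Matplotlib axes, while `catplot` returns a grid object holding one axes per region, which is why the grouping stands out more.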

Either way, from these plots we can say that most of the participants in the dataset are from the southeast region.

AGE

Age_Dataframe

As with the region, I made another data frame containing only the age and sex columns.

I also used the `value_counts` method to count the total number of males and females in our dataset. This gives us a sense of the size of the dataset we are dealing with.
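As a small illustration of `value_counts`, here is a sketch on five made-up rows (the values are placeholders, not taken from the insurance data):

```python
import pandas as pd

# Made-up rows for illustration only.
age_df = pd.DataFrame({
    "age": [19, 23, 31, 45, 52],
    "sex": ["female", "male", "male", "female", "male"],
})

# Count how many rows fall under each value of 'sex'.
counts = age_df["sex"].value_counts()
print(counts)

# The counts add up to the number of rows in the data frame.
print(counts.sum(), len(age_df))
```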

Boxplot

For the age, I used a boxplot in a horizontal orientation. The boxplot shows the quartiles of our dataset. With this orientation, I can easily see that the median age of the female participants is slightly higher than that of the males.
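A horizontal boxplot can be sketched as follows. The data is synthetic and the column names are assumed, so the medians printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; column names are assumptions.
rng = np.random.default_rng(1)
n = 80
age_df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
})

# Putting the numeric variable on x and the category on y
# makes Seaborn draw the boxes horizontally.
ax = sns.boxplot(data=age_df, x="age", y="sex")

# The line inside each box is the median; pandas gives the same numbers.
medians = age_df.groupby("sex")["age"].median()
print(medians)
```

Checking the medians numerically is a handy way to confirm what the eye reads off the plot.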

NUMBER OF CHILDREN

To summarize the number of children per sex category, I used the `groupby` method. The result shows how many males and females have no children, one child, and so on.
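One way to build such a summary with `groupby` is sketched below on made-up rows. The exact call in the original notebook may differ, and `children_by_sex` is just an illustrative name.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "children": [0, 0, 2, 0, 1, 1, 3],
})

# Count rows for every (sex, children) pair, then spread the number of
# children across the columns; missing pairs are filled with 0.
children_by_sex = (
    df.groupby(["sex", "children"]).size().unstack(fill_value=0)
)
print(children_by_sex)
```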

Catplot: Boxen vs Violin

In the data plots above, I used two kinds of catplot: boxen and violin. They basically serve the same purpose, which is to show the distribution per category.

BMI

For BMI, I also considered the category “smoker or non-smoker”. First, I took a look at how males and females are distributed under this category. As you can see below, most are non-smokers, but females have more smokers than males.

Histplot — Smoker vs Non-Smoker

Then I added the BMI into our next data plot, a barplot.

A barplot is like a bar chart that compares the central tendency (here, the mean) of each category. In Seaborn, you’ll see black lines on top of each bar. These are called error bars (though “lines” would be more fitting). They show the uncertainty or spread of the data in a category: by default Seaborn draws a confidence interval around the mean, and it can be set to show the standard deviation instead.
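Here is a sketch of such a barplot on synthetic data (column names and the `pastel` palette are assumptions). It relies on the default error bars, because the keyword for changing them depends on the Seaborn version.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; column names are assumptions.
rng = np.random.default_rng(4)
n = 100
df = pd.DataFrame({
    "smoker": rng.choice(["no", "yes"], size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.normal(30, 6, size=n),
})

# Bar height is the mean BMI per group; the black line on each bar is
# the default error bar (a bootstrapped 95% confidence interval).
# To show the standard deviation instead, Seaborn 0.11 uses ci="sd",
# while 0.12 and later use errorbar="sd".
ax = sns.barplot(data=df, x="smoker", y="bmi", hue="sex", palette="pastel")
```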

INSURANCE CHARGES

Lastly, let’s see who pays more insurance charges.

Barplot

Above is a very simple yet clear visual showing that, on average, males pay higher insurance charges than females. However, the black lines are much longer in this graph compared to the previous BMI barplot, which indicates widely spread data in each category.
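The same comparison can be checked numerically by computing the mean and the standard deviation per group. The four values below are made up purely to show the mechanics; `summary` is an illustrative name.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "charges": [1000.0, 3000.0, 2000.0, 6000.0],
})

# Mean = bar height; standard deviation = one way to measure the spread.
summary = df.groupby("sex")["charges"].agg(["mean", "std"])
print(summary)
```

A large standard deviation relative to the mean is the numeric counterpart of a long error bar.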

The question is, what could be the possible factor that highly affects the amount of insurance charges?

The data plot above is called a heatmap. I used it to show the correlation among the numerical variables, so we can inspect which ones are related to each other. From here, we can see that age has the highest correlation with charges.
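A correlation heatmap can be sketched as follows. The data is synthetic: the toy `charges` column is deliberately built to rise with age so there is a pattern to see, and all column names are assumptions. Selecting only the numeric columns first avoids errors from text columns such as `sex` in recent pandas versions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; column names are assumptions.
rng = np.random.default_rng(5)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "children": rng.integers(0, 6, size=n),
    "sex": rng.choice(["female", "male"], size=n),
})
# Toy charges that grow with age plus random noise (illustration only).
df["charges"] = 250 * df["age"] + rng.normal(0, 3000, size=n)

# Keep the numeric columns only, then compute pairwise correlations.
corr = df.select_dtypes(include="number").corr()

# annot=True writes each coefficient inside its cell.
ax = sns.heatmap(
    corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1
)
```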

And now for the last data plot: the pairplot.

Pairplot — Male vs Female
Pairplot — Smoker vs Non-smoker

Pairplots plot every pair of numeric variables in our dataset against each other. We can use them to get a quick view of which variables are related to one another.

Thanks to Python and the continuous development of its libraries, visualizing big data is now within our reach, often in just a few seconds. We also have so many options to choose from.

The real challenge now is finding which of them fits exactly what you want people to see in your data story.

You can find the code for everything you see above here.
