seaborn.violinplot


Violin plots summarize numeric data over a set of categories. They are essentially a box plot with a kernel density estimate (KDE) drawn along the range of the data and mirrored about the category axis. They convey more than a box plot because they also show the shape of the distribution, not just its quartiles.

Dataset: IMDB 5000 Movie Dataset

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
df = pd.read_csv('../../../datasets/movie_metadata.csv')
df.head()
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the violin plot, let's look at the distribution of movie durations in each genre category, allowing each movie to be counted in more than one category.

# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set([s for genre_list in df.genres.unique() for s in genre_list.split("|")])

# one-hot encode each movie's classification
for cat in categories:
    # test against the split list so that e.g. "Music" does not also match "Musical"
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()

director_name genres duration Fantasy War Family Mystery Sport History Comedy ... Sci-Fi Action Thriller Horror Crime Biography News Music Western Game-Show
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 1 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 1 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 0 0
3 Christopher Nolan Action|Thriller 164.0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 0 0
4 Doug Walker Documentary NaN 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns
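
To make the encoding step concrete, here is a minimal pure-Python sketch on a few made-up genre strings (not rows from the dataset). The helper name one_hot is hypothetical. It checks membership against the split list of genres, since a raw substring test would also find "Music" inside "Musical".

```python
# Made-up genre strings for illustration; not rows from the dataset.
genres = ["Action|Adventure", "Musical|Romance", "Drama|Music"]

# Collect every distinct genre name across all strings.
categories = sorted({g for s in genres for g in s.split("|")})

def one_hot(genre_string, categories):
    """Return a dict mapping each category to 1 if the movie has it, else 0."""
    present = set(genre_string.split("|"))
    return {cat: int(cat in present) for cat in categories}

encoded = [one_hot(s, categories) for s in genres]

# A raw substring test flags "Music" for the "Musical" movie; the split test does not.
substring_hit = int("Music" in genres[1])
exact_hit = encoded[1]["Music"]
print(substring_hit, exact_hit)
```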

# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany = 10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration": "Duration"})

df.head()
    Duration Category  Count
0      178.0  Fantasy      1
1      169.0  Fantasy      1
7      100.0  Fantasy      1
9      153.0  Fantasy      1
13     151.0  Fantasy      1
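
As a rough illustration of what the wide-to-long reshape does, here is a pure-Python sketch on two made-up movies. The helper melt_rows is hypothetical and is not part of pandas; it only mimics the shape of the result, one record per (movie, genre column) pair, followed by the same Count > 0 filter.

```python
def melt_rows(rows, id_var, value_vars, var_name, value_name):
    """Hypothetical helper: emit one long-format record per (row, value column) pair."""
    long_rows = []
    for var in value_vars:  # stack the value columns one after another
        for row in rows:
            long_rows.append({id_var: row[id_var], var_name: var, value_name: row[var]})
    return long_rows

# Two made-up movies in wide format: one indicator column per genre.
wide = [
    {"duration": 120.0, "Action": 1, "Drama": 0},
    {"duration": 95.0, "Action": 1, "Drama": 1},
]

long = melt_rows(wide, "duration", ["Action", "Drama"], "Category", "Count")

# Keep only the pairs where the movie actually belongs to the genre.
long = [r for r in long if r["Count"] > 0]
for r in long:
    print(r)
```

A movie with two genres ends up as two records, which is why each movie can be counted more than once.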

Basic plot

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')


The outliers compress the violins, so I’ll drop movies that run 250 minutes or longer; the goal here is only to demonstrate the visualization tool.

df = df.loc[df.Duration < 250]
p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration')


Change the order of categories

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()))
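
The order argument takes a list of category names, so the ordering does not have to be alphabetical. As a sketch on made-up durations (not the movie data), the categories could instead be ranked by median duration; the name median_order is just an illustration.

```python
from statistics import median

# Made-up (category, duration) pairs standing in for the long-format frame.
records = [
    ("Drama", 130.0), ("Drama", 110.0), ("Drama", 125.0),
    ("Comedy", 95.0), ("Comedy", 100.0),
    ("Action", 118.0), ("Action", 105.0), ("Action", 140.0),
]

# Group durations by category.
by_category = {}
for category, duration in records:
    by_category.setdefault(category, []).append(duration)

# Rank categories from longest to shortest median duration.
median_order = sorted(by_category, key=lambda c: median(by_category[c]), reverse=True)
print(median_order)
```

The resulting list could then be passed as order=median_order in place of the sorted names.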


Change the order that the colors are chosen

Change orientation to horizontal

p = sns.violinplot(data=df,
                   y = 'Category',
                   x = 'Duration',
                   order = sorted(df.Category.unique()),
                   orient="h")


Desaturate

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   saturation=.25)


Adjust width of violins

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   width=.25)


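Change the size of outlier markers. Note that fliersize is a box plot parameter: violin plots draw no outlier markers, so it has no visible effect here, and newer seaborn versions may reject it.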

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   fliersize=20)


Adjust the bandwidth of the KDE. Smaller values use a narrower kernel, which resolves finer features of the distribution but can also pick up noise; larger values smooth more heavily and can hide real structure. Here are examples of low and high settings to demonstrate the difference.

p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=.05)


p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
                   order = sorted(df.Category.unique()),
                   bw=5)
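
To see the effect of bandwidth outside of seaborn, here is a minimal pure-Python Gaussian KDE on synthetic data. It assumes, as I understand the scalar bw setting to work, that the value acts as a scale factor on the sample standard deviation. The function names and the sample are made up for illustration.

```python
import math
import random
from statistics import stdev

def gaussian_kde(samples, factor, grid):
    """Evaluate a Gaussian KDE on grid; kernel width = factor * sample std (assumed)."""
    h = factor * stdev(samples)
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
            for x in grid]

def count_peaks(density):
    """Count strict local maxima as a rough measure of how bumpy the curve is."""
    return sum(1 for a, b, c in zip(density, density[1:], density[2:])
               if b > a and b > c)

# Synthetic "durations": a made-up sample, not the movie data.
random.seed(0)
samples = [random.gauss(105, 15) for _ in range(200)]
grid = [40 + 0.5 * i for i in range(261)]  # 40 to 170 in half-minute steps

rough = gaussian_kde(samples, 0.05, grid)  # narrow kernel
smooth = gaussian_kde(samples, 5, grid)    # very wide kernel
print(count_peaks(rough), count_peaks(smooth))
```

The narrow kernel follows individual samples and produces many small bumps, while the wide kernel collapses everything into a single broad hump.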


Finalize

sns.set(rc={"axes.facecolor":"#e6e6e6",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})


p = sns.violinplot(data=df,
                   x = 'Category',
                   y = 'Duration',
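                   palette = 'nipy_spectral',  # matplotlib's current name for the old 'spectral' colormap
                   order = sorted(df.Category.unique()))  # notch is a boxplot option, so it is dropped here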
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(4.85,200, "Violin Plot", fontsize = 95, color="black", fontstyle='italic')


p.get_figure().savefig('../../figures/violinplot.png')