
Visualization

Plotting in Python

There are several popular plotting libraries:

Plus, pandas has a plot method.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.

Figure and axis level plots

summary of plot types

add the image to your notebook with the following:

![summary of plot types](https://seaborn.pydata.org/_images/function_overview_8_0.png)

Anatomy of a figure

annotated graph

*this was drawn with code

add the image to your notebook with the following:

![annotated graph](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)

figure vs axes
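To make the figure versus axes distinction concrete, here is a minimal matplotlib sketch with made-up numbers: one figure that contains two axes, each with its own plot and title.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# one Figure (the whole canvas) holding two Axes (the individual plotting areas)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot([0, 1, 2], [0, 1, 4])
axes[0].set_title("left axes")

axes[1].scatter([0, 1, 2], [2, 1, 0])
axes[1].set_title("right axes")

# a title that belongs to the figure, not to either axes
fig.suptitle("one figure, two axes")

n_axes = len(fig.axes)
print(n_axes)
```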

we will load pandas and seaborn

import pandas as pd
import seaborn as sns

and we will use a new dataset for today with more rows:

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

then load it in as normal

coffee_df = pd.read_csv(arabica_data_url,index_col=0)
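The effect of `index_col=0` can be seen on a tiny in-memory CSV. The text below is a made-up stand-in that is only assumed to resemble the real file (a first column of row labels with an empty header); it is not the actual data.

```python
import io
import pandas as pd

# hypothetical two-row CSV whose first column holds row labels
csv_text = ",Species,Flavor\n1,Arabica,8.83\n2,Arabica,8.67\n"

# with index_col=0 the first column becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# without it, the first column is kept as ordinary data and a new 0-based index is made
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.tolist())
print(with_index.shape, without_index.shape)
```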

since it’s new, we will take a quick look with head

coffee_df.head()

and then see the shape to know how big it is before we start working

coffee_df.shape
(1311, 43)
coffee_df['Flavor'].describe()
count    1311.000000
mean        7.518070
std         0.399979
min         0.000000
25%         7.330000
50%         7.580000
75%         7.750000
max         8.830000
Name: Flavor, dtype: float64

Distribution Plots

sns.displot(data=coffee_df,x='Flavor')
<seaborn.axisgrid.FacetGrid at 0x7f4bc28d1730>
<Figure size 500x500 with 1 Axes>
sns.displot(data=coffee_df,x='Flavor',kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7f4bc2863a10>
<Figure size 500x500 with 1 Axes>
sns.displot(data=coffee_df,x='Flavor',kind='kde',hue='Color')
<seaborn.axisgrid.FacetGrid at 0x7f4bc26ed010>
<Figure size 639.375x500 with 1 Axes>
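The two kinds of distribution plot above summarize the same column in different ways. The sketch below, on made-up scores, shows roughly what each one computes: a histogram counts values per bin, and a KDE averages a smooth bump centered on every value. The fixed `bandwidth` is an assumption for illustration; seaborn chooses its own bandwidth, and this is not its implementation.

```python
import numpy as np

# hypothetical scores, only for illustration
scores = np.array([7.0, 7.2, 7.3, 7.5, 7.5, 7.6, 7.8, 8.1])

# histogram: count how many values fall in each of 4 equal-width bins
counts, edges = np.histogram(scores, bins=4)

# KDE sketch: average of Gaussian bumps, one per observation
bandwidth = 0.2  # assumed value, chosen by hand
grid = np.linspace(5.0, 10.0, 501)
bumps = np.exp(-0.5 * ((grid[:, None] - scores[None, :]) / bandwidth) ** 2)
density = bumps.mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

# a density should integrate to about 1 (simple Riemann sum on the grid)
area = (density * (grid[1] - grid[0])).sum()
print(counts.sum(), round(area, 3))
```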

Relation Plots

sns.relplot(data=coffee_df, x='Flavor',y='Balance')
<seaborn.axisgrid.FacetGrid at 0x7f4bbdf3d310>
<Figure size 500x500 with 1 Axes>
sns.relplot(data=coffee_df, x='Flavor',y='Balance',hue='Color')
<seaborn.axisgrid.FacetGrid at 0x7f4bbdfccd70>
<Figure size 639.375x500 with 1 Axes>
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number', 'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer', 'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year', 'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects', 'Quakers', 'Color', 'Category.Two.Defects', 'Expiration', 'Certification.Body', 'Certification.Address', 'Certification.Contact', 'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'], dtype='object')
sns.relplot(data=coffee_df, x='Flavor',y='Balance',hue='Color',col='Country.of.Origin',col_wrap=4)
/home/runner/.local/lib/python3.12/site-packages/seaborn/axisgrid.py:854: UserWarning: Ignoring `palette` because no `hue` variable has been assigned.
  func(*plot_args, **plot_kwargs)
<seaborn.axisgrid.FacetGrid at 0x7f4bc22d83b0>
<Figure size 2139.38x4500 with 36 Axes>
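The figure above has 36 axes because `col` makes one panel per country, and `col_wrap=4` lays them out four per row, so 9 rows. The same arithmetic can be checked on a small made-up DataFrame (the `group` column and its values are hypothetical).

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# hypothetical toy data with 5 distinct groups
toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "group": ["a", "b", "c", "d", "e", "a"],
})

col_wrap = 2
grid = sns.relplot(data=toy_df, x="x", y="y", col="group", col_wrap=col_wrap)

# one panel per unique value, wrapped into rows of at most col_wrap panels
n_facets = toy_df["group"].nunique()
n_rows = math.ceil(n_facets / col_wrap)
print(grid.axes.size, n_rows)
```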

Categorical Plots

By default a catplot is a stripplot

sns.catplot(coffee_df, y='Number.of.Bags', x='Country.of.Origin')
<seaborn.axisgrid.FacetGrid at 0x7f4bbb2db620>
<Figure size 511.111x500 with 1 Axes>

For this data, the strip plot is really hard to read, so let’s change to a bar version.

sns.catplot(coffee_df, y='Number.of.Bags', x='Country.of.Origin',kind='bar')
<seaborn.axisgrid.FacetGrid at 0x7f4bb8b50470>
<Figure size 511.111x500 with 1 Axes>

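Here, the bar height is the mean for each country. The line on each bar is an error bar, which by default in recent seaborn versions is a 95% bootstrap confidence interval for the mean rather than the standard deviation; passing errorbar='sd' shows the standard deviation instead.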

It uses logic similar to:

coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].describe()
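To show what a bar and its error bar summarize, the sketch below computes a per-group mean and a percentile bootstrap interval on made-up data. Recent seaborn versions document the default bar plot error bar as a 95% bootstrap confidence interval for the mean; `bootstrap_ci` is a hypothetical helper written for illustration, not seaborn's own code.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "bags": [10, 20, 30, 40, 100, 150, 250, 300],
})

# bar heights: the mean of each group
group_means = toy_df.groupby("country")["bags"].mean()

rng = np.random.default_rng(0)

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    values = np.asarray(values)
    # resample with replacement, one row per bootstrap replicate
    resampled = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = resampled.mean(axis=1)
    tail = (100 - level) / 2
    return np.percentile(boot_means, [tail, 100 - tail])

# error bars: one interval per group
intervals = {name: bootstrap_ci(group)
             for name, group in toy_df.groupby("country")["bags"]}

print(group_means)
print(intervals)
```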

Filtering with isin

First, we find the top countries. We noticed that value_counts returns its results sorted from most to least frequent, so we take the first 10.

top_countries = coffee_df['Country.of.Origin'].value_counts()[:10].index
top_countries
Index(['Mexico', 'Colombia', 'Guatemala', 'Brazil', 'Taiwan', 'United States (Hawaii)', 'Honduras', 'Costa Rica', 'Ethiopia', 'Tanzania, United Republic Of'], dtype='object', name='Country.of.Origin')

We can use that to filter the original DataFrame. To do this, we use isin to check whether each element in the 'Country.of.Origin' column is in that list.

coffee_df['Country.of.Origin'].isin(top_countries)
1        True
2        True
3        True
4        True
5        True
        ...
1307     True
1308    False
1309    False
1310     True
1312     True
Name: Country.of.Origin, Length: 1311, dtype: bool

This is roughly equivalent to:

[country in top_countries for country in coffee_df['Country.of.Origin'] ]
[True, True, True, True, True, True, False, True, True, True, True, False, False, True, True, False, False, True, False, True, False, True, True, False, True, True, True, False, True, True, False, True, True, True, True, False, True, True, True, False, False, True, True, True, False, True, False, True, False, False, True, True, True, False, True, False, True, True, True, True, False, False, False, False, True, False, False, True, False, True, True, False, True, True, False, False, True, True, True, False, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, False, False, True, False, False, False, False, False, True, True, True, True, True, True, True, True, True, False, True, True, True, False, False, True, False, True, True, False, True, True, True, True, False, True, True, True, True, True, True, False, False, True, True, False, True, False, True, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, False, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, False, True, False, True, True, True, True, False, True, False, False, True, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, 
True, False, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, False, False, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, False, False, False, True, True, False, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, False, True, False, True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, False, True, False, True, True, True, True, False, False, True, True, True, False, False, True, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, False, False, False, True, True, True, False, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, True, True, False, True, True, True, False, False, True, False, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, True, True, True, True, True, True, False, True, False, True, False, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, False, True, False, 
True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, False, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, False, True, False, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, True, True, True, False, False, True, True, False, True, True, True, False, True, False, True, True, True, False, True, True, True, True, True, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, True, True, True, False, True, True, True, True, 
True, True, False, True, True, True, True, False, False, True, False, True, True, True, True, True, True, True, True, False, False, True, True, False, True, True, True, True, True, True, False, ...]

except that the comprehension builds a plain Python list, while the pandas way returns a pd.Series object. The Python in operator is really helpful to know, and pandas offers the isin method to apply that same pattern to a whole column.
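The comparison between the list comprehension and isin can be checked on a small example. The values below are made up for illustration; only the column name is borrowed from the dataset.

```python
import pandas as pd

# hypothetical toy column and "top" list
countries = pd.Series(["Mexico", "Kenya", "Brazil", "Peru"], name="Country.of.Origin")
top = pd.Index(["Mexico", "Brazil"])

# pandas way: returns a boolean Series aligned with the original index
mask_series = countries.isin(top)

# plain Python way: returns a list of booleans
mask_list = [country in top for country in countries]

print(type(mask_series).__name__, type(mask_list).__name__)
print(mask_series.tolist() == mask_list)
```

Keeping the result as a Series matters for the next step, because a Series carries the row index that pandas uses when masking a DataFrame.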

Written out in a more basic programming style, this process takes two nested loops:

c_in = []
# iterate over the country of each rating
for country in coffee_df['Country.of.Origin']:
    # make a false temp value
    cur_search = False
    # iterate over top countries
    for tc in top_countries:
        # flip the value if the current top country & this coffee's country match
        if tc==country:
            cur_search = True
    # save the result of the search
    c_in.append(cur_search)
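One way to convince yourself that the nested loops and isin agree is to wrap the loops in a function and compare on a small example. `membership_by_loops` is a hypothetical helper name, and the data is made up for illustration.

```python
import pandas as pd

def membership_by_loops(values, targets):
    """Return a list of booleans: is each value equal to any target?"""
    found = []
    for value in values:
        hit = False
        for target in targets:
            if target == value:
                hit = True
                break  # no need to keep searching once we have a match
        found.append(hit)
    return found

# hypothetical toy data
countries = pd.Series(["Mexico", "Kenya", "Brazil", "Mexico"])
top = ["Mexico", "Brazil"]

loop_result = membership_by_loops(countries, top)
isin_result = countries.isin(top).tolist()
print(loop_result == isin_result)
```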

With that Series of booleans, we can then mask the original DataFrame. This keeps only the rows where the mask is True.

top_coffee_df = coffee_df[coffee_df['Country.of.Origin'].isin(top_countries)]
top_coffee_df.head(1)
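The whole pattern (count, take the top few, build a mask, filter) can be traced on a toy DataFrame small enough to check by eye. The rows below are made up; only the column names match the dataset.

```python
import pandas as pd

# hypothetical toy data: 3 rows for Mexico, 2 for Brazil, 1 for Kenya
toy_df = pd.DataFrame({
    "Country.of.Origin": ["Mexico", "Mexico", "Mexico", "Brazil", "Brazil", "Kenya"],
    "Number.of.Bags": [10, 20, 30, 40, 50, 60],
})

# value_counts sorts from most to least frequent, so head(2) gives the top two
counts = toy_df["Country.of.Origin"].value_counts()
top_two = counts.head(2).index

# boolean mask, then keep only the rows where the mask is True
mask = toy_df["Country.of.Origin"].isin(top_two)
top_df = toy_df[mask]

print(list(top_two))
print(len(toy_df), len(top_df))
```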

And now we can plot from that new dataframe

sns.catplot(data =top_coffee_df,x='Country.of.Origin',y='Number.of.Bags', aspect=3)
<seaborn.axisgrid.FacetGrid at 0x7f4bc01f22d0>
<Figure size 1511.11x500 with 1 Axes>

Now with the stripplot we can see that it puts a point for each coffee (row in the DataFrame). Because the x values are countries, which are discrete, it adds random jitter along x so that the points do not all overlap.
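The idea of jitter can be sketched with numpy: give each category an integer position, then add a small random offset to each point. This is only an illustration of the concept with an assumed `jitter_width`; it is not seaborn's actual implementation, and the data is made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# hypothetical category for each point
countries = pd.Series(["Mexico", "Brazil", "Mexico", "Brazil", "Mexico"])

# each category gets an integer position: Mexico -> 0, Brazil -> 1
codes, labels = pd.factorize(countries)

# add a small uniform random offset so points in the same category spread out
jitter_width = 0.1  # assumed value for illustration
x_positions = codes + rng.uniform(-jitter_width, jitter_width, size=len(codes))

print(list(labels))
print(x_positions)
```

Because the offsets are drawn independently, two points can land close together or visibly overlap, but they stay within their own category's band.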

Install some extra tools

pip install jupytext mystmd

Variable types and data types

Related but not the same.

Data types are literal, related to the representation in the computer.

For integers, there can be int16, int32, int64.
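The integer data types differ in how many bits they use, which sets the range of values they can hold. A short sketch with numpy's `iinfo` shows the limits, and `astype` converts a made-up Series between them.

```python
import numpy as np
import pandas as pd

# the representable range grows with the number of bits
for dtype in (np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# hypothetical values; pandas stores Python ints as int64 by default
bags = pd.Series([10, 250, 300])
small = bags.astype("int16")  # same values, smaller representation

print(bags.dtype, small.dtype)
```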

We can also have mathematical types of numbers

Variable types are about the meaning in a conceptual sense.

we’ll focus on the first two most of the time. Some values that are technically only integers range high enough that we treat them more like continuous most of the time.
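As a small illustration of data type versus variable type, the sketch below uses two made-up columns: one stored as integers that we would usually treat as a quantity, and one stored as text that is conceptually categorical. The values are hypothetical, and converting to pandas' `category` dtype is just one way to record the conceptual type.

```python
import pandas as pd

# hypothetical toy data; column names borrowed from the dataset
toy_df = pd.DataFrame({
    "Number.of.Bags": [10, 250, 300, 1, 75],  # integers with a wide range
    "Color": ["Green", "Green", "Bluish-Green", "Green", "Blue-Green"],
})

# data types: how the values are represented in memory
print(toy_df.dtypes)

# variable type: Color is categorical in meaning, so we can tell pandas that
toy_df["Color"] = toy_df["Color"].astype("category")
print(toy_df["Color"].cat.categories.tolist())
```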

Grading Review

We discussed the grading

Assignment 1 Q&A

For this assignment, if you attended Friday office hours you can get an automatic extension, but going forward the policy will apply.

Questions After Class

Why did you use the . in between spaces for country of origin?

Because that is how the column is actually named.

We are picking out the column name, not just describing it.

coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number', 'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer', 'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year', 'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma', 'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity', 'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects', 'Quakers', 'Color', 'Category.Two.Defects', 'Expiration', 'Certification.Body', 'Certification.Address', 'Certification.Contact', 'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'], dtype='object')

If we pick any of those values we get that column:

from numpy.random import choice
random_col = choice(coffee_df.columns)
coffee_df[random_col].head()
1    METAD Agricultural Development plc
2    METAD Agricultural Development plc
3          Specialty Coffee Association
4    METAD Agricultural Development plc
5    METAD Agricultural Development plc
Name: Certification.Body, dtype: object

or any other one

random_col = choice(coffee_df.columns)
coffee_df[random_col].head()
1                   metad plc
2                   metad plc
3    grounds for health admin
4         yidnekachew dabessa
5                   metad plc
Name: Owner, dtype: object

If we use a string that is not an exact column name, we get an error:

typo_col = 'country of origin'
coffee_df[typo_col].head()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.local/lib/python3.12/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'country of origin'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[26], line 2
      1 typo_col = 'country of origin'
----> 2 coffee_df[typo_col].head()

File ~/.local/lib/python3.12/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4111 if self.columns.nlevels > 1:
   4112     return self._getitem_multilevel(key)
-> 4113 indexer = self.columns.get_loc(key)
   4114 if is_integer(indexer):
   4115     indexer = [indexer]

File ~/.local/lib/python3.12/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'country of origin'

This is because 'country of origin' is not a column name in the dataset; the actual name is 'Country.of.Origin', and the match has to be exact.

typo_col in coffee_df.columns
False
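If you only remember a column name loosely, one option is to look it up before indexing. The sketch below uses a hypothetical helper, `find_column`, that ignores case and treats `.` as a space; it is an illustration on made-up data, not something pandas provides.

```python
import pandas as pd

# hypothetical toy data with the dataset's naming style
toy_df = pd.DataFrame({"Country.of.Origin": ["Mexico"], "Number.of.Bags": [10]})

def find_column(df, loose_name):
    """Map a loosely typed name to the exact column name (illustrative helper)."""
    # normalize every real column name: lowercase, dots become spaces
    normalized = {col.lower().replace(".", " "): col for col in df.columns}
    key = loose_name.lower().replace(".", " ")
    if key not in normalized:
        raise KeyError(f"{loose_name!r} is not a column; available: {list(df.columns)}")
    return normalized[key]

exact_name = find_column(toy_df, "country of origin")
print(exact_name)
print(toy_df[exact_name].head())
```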

Is the random jitter for the same number inclusive or exclusive (can the dots overlap completely if unlucky)?

It is random, so they can overlap, but it would be exceedingly unlucky for more than two to completely overlap.