6. Visualization#

If your plots do not show, include this in any cell. The % signals that this is an IPython magic; this one controls how matplotlib output is displayed. Jupyter uses the IPython kernel to run Python.

%matplotlib inline

Today’s imports

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

6.1. Summarizing Review#

We will start with the same dataset we have been working with.

robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
robusta_df = pd.read_csv(robusta_data_url)

Is the robusta coffee’s Mouthfeel or Aftertaste more consistently scored in this dataset?

Why?

robusta_df[['Mouthfeel','Aftertaste']].describe()
Mouthfeel Aftertaste
count 28.000000 28.000000
mean 7.506786 7.559643
std 0.725152 0.342469
min 5.080000 6.500000
25% 7.500000 7.397500
50% 7.670000 7.670000
75% 7.830000 7.770000
max 8.250000 7.920000

From the lower std we can see that Aftertaste is more consistently rated.

We can also save this subset into a smaller dataframe to work with it more and plot it.

rob_ma_df = robusta_df[['Mouthfeel','Aftertaste']]
rob_ma_df.head(1)
Mouthfeel Aftertaste
0 8.25 7.75

We will use sns.displot to look at how the data is distributed.

Important

For seaborn, the online documentation is immensely valuable. Every function’s page has basic documentation and lots of examples, so you can see how they use different parameters to modify plots visually. I strongly recommend reading it often. I recommend reading their tutorial too.

sns.displot(rob_ma_df)
<seaborn.axisgrid.FacetGrid at 0x7f9cd42ddb50>
../_images/3a84afa5b015c8af0e5f0fd1e6d05d7f973c8240df4c80bfc224b5659e2ccf4f.png

We can change the kind, for example to a Kernel Density Estimate. This approximates the distribution of the data; you can think of it roughly like a smoothed-out histogram.

sns.displot(rob_ma_df,kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7f9c9c2bde80>
../_images/4eb312d4568d0e45a2d3842cdd840493f4f190949a34e01850c687716603b857.png

This version makes it more visually clear that Aftertaste is more consistently scored, but it also helps us see that this might not be the whole story. Both have a second, smaller bump, so the overall std might not be the best measure.
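One way to see why std alone can mislead is to compare it with the interquartile range (IQR), which ignores the tails. In the describe output above, the gap between the 75% and 25% rows is 0.33 for Mouthfeel and about 0.37 for Aftertaste, so by that measure the ranking flips. The sketch below uses made-up scores (not the coffee data) to show how a single low value can inflate the std.

```python
import pandas as pd

# Hypothetical scores, not taken from the coffee data:
# column 'a' is tightly clustered except for one low outlier
scores = pd.DataFrame({
    'a': [7.5, 7.6, 7.6, 7.7, 7.7, 7.8, 7.8, 5.0],
    'b': [7.2, 7.3, 7.4, 7.6, 7.7, 7.8, 7.9, 8.0],
})

std = scores.std()
# interquartile range: distance between the 75th and 25th percentiles
iqr = scores.quantile(0.75) - scores.quantile(0.25)

# side by side: 'a' has the larger std but the smaller IQR
print(pd.DataFrame({'std': std, 'iqr': iqr}))
```

Which measure is more appropriate depends on whether the unusual values matter for the question being asked.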

Question from class

Why do we need two sets of brackets?

Without the second set, pandas tries to use the two names to index in multiple ways instead, as the error below shows.

robusta_df['Aftertaste','Mouthfeel']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3652, in Index.get_loc(self, key)
   3651 try:
-> 3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('Aftertaste', 'Mouthfeel')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[9], line 1
----> 1 robusta_df['Aftertaste','Mouthfeel']

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:3760, in DataFrame.__getitem__(self, key)
   3758 if self.columns.nlevels > 1:
   3759     return self._getitem_multilevel(key)
-> 3760 indexer = self.columns.get_loc(key)
   3761 if is_integer(indexer):
   3762     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:3654, in Index.get_loc(self, key)
   3652     return self._engine.get_loc(casted_key)
   3653 except KeyError as err:
-> 3654     raise KeyError(key) from err
   3655 except TypeError:
   3656     # If we have a listlike key, _check_indexing_error will raise
   3657     #  InvalidIndexError. Otherwise we fall through and re-raise
   3658     #  the TypeError.
   3659     self._check_indexing_error(key)

KeyError: ('Aftertaste', 'Mouthfeel')

With a single set of brackets, pandas receives the two names as one tuple and tries to look it up as a MultiIndex key, but we do not have a MultiIndex, so it fails. The second set of square brackets makes it a list of names, and pandas looks for each of them in turn.
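The difference can be reproduced on a small DataFrame. This is a minimal sketch with made-up values; only the column names are borrowed from the coffee data.

```python
import pandas as pd

# A tiny stand-in DataFrame (made-up values)
df = pd.DataFrame({
    'Mouthfeel': [8.25, 7.50],
    'Aftertaste': [7.75, 7.40],
    'Flavor': [8.00, 7.60],
})

one_col = df['Aftertaste']                  # a single label -> Series
two_cols = df[['Aftertaste', 'Mouthfeel']]  # a list of labels -> DataFrame

# Without the inner brackets, Python packs the two names into one tuple,
# and pandas looks for a single column labeled with that tuple.
try:
    df['Aftertaste', 'Mouthfeel']
except KeyError as err:
    print('KeyError:', err)
```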

We will use a larger dataset for more interesting plots.

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url)

6.2. Plotting in Python#

  • matplotlib: low level plotting tools

  • seaborn: high level plotting with opinionated defaults

  • ggplot: plotting based on the ggplot library in R.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.

There are lots of types of plots. We saw the basic patterns of how to use them and we’ve used a few types, but we cannot (and do not need to) go through every single one. There are general patterns that will help you think about what type of plot you might want and help you understand plots well enough to customize them.

[Seaborn’s main goal is opinionated defaults and flexible customization](https://seaborn.pydata.org/tutorial/introduction.html#opinionated-defaults-and-flexible-customization)

6.2.1. Anatomy of a figure#

First is the matplotlib structure of a figure. Both pandas and seaborn, as well as other plotting libraries, use matplotlib. Matplotlib was used in visualizing the first image of a black hole.

annotated graph

This is a lot of information, but these are good things to know. The most important are the figure and the axes.

Try it Yourself

Make sure you can explain what is a figure and what are axes in your own words and why that distinction matters. Discuss in office hours if you are unsure.

That image was drawn with code, and that page explains more.
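As a small illustration of the distinction, the sketch below creates one figure that holds two axes. The data and titles are made up, and the `matplotlib.use('Agg')` line is only there so the code also runs as a plain script; in a notebook you can leave it out.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# one Figure (the whole canvas) holding two Axes (individual plotting areas)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# each Axes gets its own data, title, and labels
axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title('left axes')
axes[1].bar(['a', 'b', 'c'], [3, 1, 2])
axes[1].set_title('right axes')

# figure-level settings apply to the whole canvas
fig.suptitle('one figure, two axes')
fig.tight_layout()
```

Things like titles, tick labels, and the plotted data belong to an individual axes, while size and overall layout belong to the figure.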

6.2.2. Plotting Function types in Seaborn#

Seaborn has two levels, or groups, of plotting functions: figure-level and axes-level. Figure-level functions can plot with subplots.

summary of plot types

This is from the overview section of the official seaborn tutorial. It also includes a comparison of figure vs axes plotting.

The official introduction is also a good read.

6.2.3. More#

The seaborn gallery and matplotlib gallery are nice to look at too.

6.2.4. Styling in Seaborn#

Seaborn also lets us set a theme for visual styling. By default, this styles the plots to be more visually appealing.

sns.set_theme(palette='colorblind')

The colorblind palette is more distinguishable under a variety of colorblindness types. Colorblind is a good default, but you can choose others that you like more too.

more on colors

6.3. Bags by country#

The catplot function lets us plot against categorical variables.

sns.catplot(data=coffee_df, y='Number.of.Bags',x='Country.of.Origin')
<seaborn.axisgrid.FacetGrid at 0x7f9c96af5670>
../_images/c176fcfc6fc95d7be4f240698f0d7326df63b32146866948a38f5c4d32498f9e.png

This is hard to read. We could try stretching it out to make it better.

sns.catplot(data=coffee_df, y='Number.of.Bags',x='Country.of.Origin',aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7f9c968c2fa0>
../_images/e12857f628953a4f4347c9548786524e40d795a7ba4ce9f26f1ba11e2fae7dd3.png

A better way might be to filter only the top countries. We’ll find those by grouping by country then summing each smaller dataframe that groupby creates.

tot_per_country = coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()
tot_per_country.head()
Country.of.Origin
Brazil        30534
Burundi         520
China            55
Colombia      41204
Costa Rica    10354
Name: Number.of.Bags, dtype: int64
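To make the "group, then sum each smaller DataFrame" idea concrete, here is a sketch on a toy table. The numbers are made up; only the column names match the coffee data.

```python
import pandas as pd

# toy data with made-up bag counts
toy_df = pd.DataFrame({
    'Country.of.Origin': ['Brazil', 'Kenya', 'Brazil', 'Kenya', 'China'],
    'Number.of.Bags': [10, 5, 20, 1, 3],
})

# the one-line version used above
toy_totals = toy_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()

# the same totals, computed by hand one group (smaller DataFrame) at a time
by_hand = {country: int(group['Number.of.Bags'].sum())
           for country, group in toy_df.groupby('Country.of.Origin')}

print(toy_totals)
print(by_hand)
```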

We can now plot this directly.

tot_per_country.plot(kind='bar')
<Axes: xlabel='Country.of.Origin'>
../_images/db88cba0b055af0552e77537d4061776080f4241f9553357c8fd2c2ede3d7ad1.png

What if we take out only the top 10 countries? First we have to sort it. The default is to sort ascending, so we use ascending=False to switch. pandas doesn’t have a plain sort method; we have to say whether we want to sort by the values or by the index. In this Series, the total number of bags for each country are the values and the country names are the index.

tot_per_country.sort_values(ascending=False)[:10]
Country.of.Origin
Colombia       41204
Guatemala      36868
Brazil         30534
Mexico         24140
Honduras       13167
Ethiopia       11761
Costa Rica     10354
Nicaragua       6406
El Salvador     4449
Kenya           3971
Name: Number.of.Bags, dtype: int64
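The two ways of sorting a Series can be compared on a small example. This sketch uses made-up totals; `nlargest` is shown only as an alternative that gives the same rows here, not as what the notes above use.

```python
import pandas as pd

# made-up totals: values are bag counts, the index holds country names
toy_totals = pd.Series({'Brazil': 30, 'China': 3, 'Kenya': 6, 'Mexico': 12},
                       name='Number.of.Bags')

by_value = toy_totals.sort_values(ascending=False)  # largest totals first
by_name = toy_totals.sort_index()                   # alphabetical by country

top2 = by_value[:2]                # slicing keeps the first two rows
top2_alt = toy_totals.nlargest(2)  # sorts and slices in one step

print(top2)
```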

We can also plot this.

tot_per_country.sort_values(ascending=False)[:10].plot(kind='bar')
<Axes: xlabel='Country.of.Origin'>
../_images/ab673e592ae7169cc8c61e164bf2330b6324702d838c6df28fef627dd1e47b20.png

6.4. Filtering a DataFrame#

Now, we’ll take just the country names out

top_countries = tot_per_country.sort_values(ascending=False)[:10].index
top_countries
Index(['Colombia', 'Guatemala', 'Brazil', 'Mexico', 'Honduras', 'Ethiopia',
       'Costa Rica', 'Nicaragua', 'El Salvador', 'Kenya'],
      dtype='object', name='Country.of.Origin')

and we can use that to filter the original DataFrame. To do this, we use isin to check whether each element in the 'Country.of.Origin' column is in that list.

coffee_df['Country.of.Origin'].isin(top_countries)
0        True
1        True
2        True
3        True
4        True
        ...  
1306     True
1307    False
1308     True
1309     True
1310     True
Name: Country.of.Origin, Length: 1311, dtype: bool

This is roughly equivalent to:

[country in top_countries for country in coffee_df['Country.of.Origin'] ]
[True, True, True, True, True, True, False, True, True, True, ...]

except that this builds a list, while the pandas way makes a pd.Series object. The Python in operator is really helpful to know, and pandas offers us an isin method to get that type of pattern.

In a more basic programming format, this process would be two nested loops’ worth of work.

c_in = []
# iterate over the country of each rating
for country in coffee_df['Country.of.Origin']:
    # make a false temp value
    cur_search = False
    # iterate over top countries
    for tc in top_countries:
        # flip the value if the current top country matches this coffee's country
        if tc==country:
            cur_search = True
    # save the result of the search
    c_in.append(cur_search)

Try it yourself

Run these versions and confirm for yourself that they are the same.
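If you want a quick way to check, the sketch below runs all three versions on a tiny stand-in for the data and compares the results. The names `toy_countries` and `toy_top` are made up for this example.

```python
import pandas as pd

# tiny stand-ins for coffee_df['Country.of.Origin'] and top_countries
toy_countries = pd.Series(['Brazil', 'Kenya', 'China', 'Brazil'],
                          name='Country.of.Origin')
toy_top = pd.Index(['Brazil', 'Kenya'])

# 1. pandas isin: returns a boolean Series
with_isin = toy_countries.isin(toy_top)

# 2. list comprehension with Python's `in` operator: returns a list
with_in = [country in toy_top for country in toy_countries]

# 3. explicit nested loops: also a list
with_loops = []
for country in toy_countries:
    found = False
    for tc in toy_top:
        if tc == country:
            found = True
    with_loops.append(found)

# convert the Series to a list so all three can be compared directly
all_same = with_isin.tolist() == with_in == with_loops
print(all_same)
```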

With that list of booleans, we can then mask the original DataFrame. This keeps only the rows where the inner quantity is True.

top_coffee_df = coffee_df[coffee_df['Country.of.Origin'].isin(top_countries)]
top_coffee_df.head(1)
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 44 columns
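Masking works the same way on any DataFrame. Here is a minimal sketch on made-up data; the `~` operator at the end is an extra that the notes above do not use, and it inverts the mask.

```python
import pandas as pd

# toy data with made-up scores
toy_df = pd.DataFrame({
    'Country.of.Origin': ['Brazil', 'Kenya', 'China', 'Brazil'],
    'Aftertaste': [7.5, 8.0, 7.1, 7.7],
})

# one True/False per row
mask = toy_df['Country.of.Origin'].isin(['Brazil', 'Kenya'])

kept = toy_df[mask]      # rows where the mask is True
dropped = toy_df[~mask]  # rows where the mask is False

print(kept)
```

Note that the kept rows retain their original index labels, which is why the index can have gaps after filtering.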

top_coffee_df.shape, coffee_df.shape
((952, 44), (1311, 44))

sns.displot(data=top_coffee_df,x='Aftertaste', col='Country.of.Origin',col_wrap=5)
<seaborn.axisgrid.FacetGrid at 0x7f9c94264fa0>
../_images/9e9b896b957123f7d038b8dee310771f94c86f91b6dbd8fe2dbe68890bf48846.png

6.5. Variable types and data types#

Related but not the same.


Data types are literal, related to the representation in the computer.

There can be int16, int32, int64


We can also have mathematical types of numbers

  • Integers can be positive, 0, or negative.

  • Reals are continuous, infinite possibilities.


Variable types are about the meaning in a conceptual sense.

  • categorical (can take a discrete number of values, could be used to group data, could be a string or integer; unordered)

  • continuous (can take on any possible value, always a number)

  • binary (like data type boolean, but could be represented as yes/no, true/false, or 1/0, could be categorical also, but often makes sense to calculate rates)

  • ordinal (ordered, but appropriately categorical)

We’ll focus on the first two most of the time. Some values that are technically only integers range high enough that we treat them more like continuous most of the time.
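The gap between data type and variable type shows up in pandas as the difference between what dtypes reports and what we tell pandas a column means. This is a sketch with made-up columns and values; the `grade` column in particular is hypothetical.

```python
import pandas as pd

# made-up columns, one per variable type discussed above
toy_df = pd.DataFrame({
    'color': ['Green', 'Blue-Green', 'Green'],  # categorical, stored as strings
    'bags': [300, 12, 50],                      # integer counts
    'aftertaste': [7.75, 7.5, 8.1],             # continuous
    'is_top': [True, False, True],              # binary
    'grade': ['low', 'high', 'medium'],         # ordinal, stored as strings
})

# data types: how each column is represented in the computer
print(toy_df.dtypes)

# variable types: telling pandas about unordered vs ordered categories
toy_df['color'] = toy_df['color'].astype('category')
toy_df['grade'] = pd.Categorical(toy_df['grade'],
                                 categories=['low', 'medium', 'high'],
                                 ordered=True)

# rates make sense for binary variables: the mean of booleans is a proportion
top_rate = toy_df['is_top'].mean()

# ordering only makes sense for the ordinal column
highest_grade = toy_df['grade'].max()
```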

6.6. Questions After Class#

6.6.1. Do we earn level 3’s the same way level 1 and 2 are or are there more steps required?#

You earn level 3s from your portfolio. The portfolio makes more sense after you have completed assignment 3, so we will follow up on it next week after you all get assignment 3 feedback.

6.6.2. How can I check what parameters can go into a method?#

You can use the documentation online, or in jupyter, you can get help from the docstring. I usually use shift+tab to read the docstring but you can also use the help() function or the ? in jupyter.
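Outside of the shift+tab popup, the same information can be printed from code. The sketch below uses Python's standard inspect module, which is my choice for illustration rather than something used in class; `help(pd.Series.sort_values)` or `pd.Series.sort_values?` in Jupyter show the full docstring.

```python
import inspect
import pandas as pd

# the signature lists the parameter names and their default values
sig = inspect.signature(pd.Series.sort_values)
print(sig)

# the first few lines of the docstring
doc_lines = inspect.getdoc(pd.Series.sort_values).splitlines()
print('\n'.join(doc_lines[:5]))
```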

6.6.3. How do you know you can put kind = “bar” into the method?#

I happen to remember this now, but to find out what values you can use, read the docstring as above.

6.6.4. Do companies use things like “sns” for more in depth/graphical plots?#

It depends on your role within the company. If you are a data scientist in a more research-focused role you might use seaborn more, but if you build customer-facing visualizations, you might use something else.

For more interactive visualization, you could use plotly or bokeh, which generate more JavaScript for you. Plotly as a company now also has a product called Dash for building data dashboard apps.

6.6.5. Does “component disciplines” mean statistics, computer science and domain expertise, and does “phases” mean collect, clean, explore, model and deploy?#

Yes.

Important

I updated the assignment text to clarify it in response to some questions.