7. Visualization

import pandas as pd
import seaborn as sns
sns.set_theme(palette= "colorblind")

7.1. Jupyter FAQ

Question from class

Why doesn’t my Jupyter notebook print things out that are on the last line?

When we create a variable and then put it on the last line of a cell, Jupyter displays it.

name = 'sarah'
name
'sarah'

How it displays it depends on the type:

type(name)
str

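For a string, it shows the string's repr, which includes the quotes. Calling print shows the same text without them:

print(name)
sarah

so this and the one above look almost the same; only the quotes differ. For objects that have a _repr_html_ method, Jupyter uses that, and uses HTML to render the object in a more visually appealing way.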
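As a rough sketch of the two display paths described above, the following uses a small made-up class (Swatch is hypothetical, not part of pandas or Jupyter) that defines both a plain repr and a _repr_html_ method. Outside a notebook nothing is rendered, so the HTML is simply printed as a string.

```python
class Swatch:
    """Hypothetical class used only to illustrate display hooks."""

    def __init__(self, color):
        self.color = color

    def __repr__(self):
        # Plain-text representation, used by repr() and text-only front ends
        return f"Swatch({self.color!r})"

    def _repr_html_(self):
        # Jupyter looks for this method to produce rich HTML output
        return f'<span style="color: {self.color}">{self.color}</span>'


name = 'sarah'
print(repr(name))   # what a bare last line shows for a string: 'sarah'
print(name)         # what print shows: sarah

swatch = Swatch('teal')
print(repr(swatch))           # Swatch('teal')
print(swatch._repr_html_())   # the HTML string a notebook would render
```

In a notebook, putting swatch alone on the last line of a cell would show the rendered HTML rather than the text form.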

7.2. Review of describe

We’re going to work with the arabica data today, because it’s a little bigger and more interesting for plotting.

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

coffee_df = pd.read_csv(arabica_data_url)

We can describe it again to see that it has mostly the same variables we saw before, plus a few different ones.

coffee_df.describe()
Unnamed: 0 Number.of.Bags Aroma Flavor Aftertaste Acidity Body Balance Uniformity Clean.Cup Sweetness Cupper.Points Total.Cup.Points Moisture Category.One.Defects Quakers Category.Two.Defects altitude_low_meters altitude_high_meters altitude_mean_meters
count 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1311.00000 1311.000000 1311.000000 1311.000000 1311.000000 1311.000000 1310.000000 1311.000000 1084.000000 1084.000000 1084.000000
mean 656.000763 153.887872 7.563806 7.518070 7.397696 7.533112 7.517727 7.517506 9.833394 9.83312 9.903272 7.497864 82.115927 0.088863 0.426392 0.177099 3.591915 1759.548954 1808.843803 1784.196379
std 378.598733 129.733734 0.378666 0.399979 0.405119 0.381599 0.359213 0.406316 0.559343 0.77135 0.530832 0.474610 3.515761 0.047957 1.832415 0.840583 5.350371 8767.847252 8767.187498 8767.016913
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000
25% 328.500000 14.500000 7.420000 7.330000 7.250000 7.330000 7.330000 7.330000 10.000000 10.00000 10.000000 7.250000 81.170000 0.090000 0.000000 0.000000 0.000000 1100.000000 1100.000000 1100.000000
50% 656.000000 175.000000 7.580000 7.580000 7.420000 7.500000 7.500000 7.500000 10.000000 10.00000 10.000000 7.500000 82.500000 0.110000 0.000000 0.000000 2.000000 1310.640000 1350.000000 1310.640000
75% 983.500000 275.000000 7.750000 7.750000 7.580000 7.750000 7.670000 7.750000 10.000000 10.00000 10.000000 7.750000 83.670000 0.120000 0.000000 0.000000 4.000000 1600.000000 1650.000000 1600.000000
max 1312.000000 1062.000000 8.750000 8.830000 8.670000 8.750000 8.580000 8.750000 10.000000 10.00000 10.000000 10.000000 90.580000 0.280000 31.000000 11.000000 55.000000 190164.000000 190164.000000 190164.000000

Question from class

Why do we need the () on describe but not on just the data?

As is often the case, this comes back to the type.

type(coffee_df)
pandas.core.frame.DataFrame

It is a DataFrame, which has the _repr_html_ method:

coffee_df
Unnamed: 0 Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
0 1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
1 2 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 1 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
2 3 Arabica grounds for health admin Guatemala san marcos barrancas "san cristobal cuch NaN NaN NaN NaN 1600 - 1800 m ... NaN 0 May 31st, 2011 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 1600.00 1800.00 1700.00
3 4 Arabica yidnekachew dabessa Ethiopia yidnekachew dabessa coffee plantation NaN wolensu NaN yidnekachew debessa coffee plantation 1800-2200 ... Green 2 March 25th, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1800.00 2200.00 2000.00
4 5 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 ... Green 2 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.00 2200.00 2075.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1306 1307 Arabica juan carlos garcia lopez Mexico el centenario NaN la esperanza, municipio juchique de ferrer, ve... 1104328663 terra mia 900 ... None 20 September 17th, 2013 AMECAFE 59e396ad6e22a1c22b248f958e1da2bd8af85272 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 m 900.00 900.00 900.00
1307 1308 Arabica myriam kaplan-pasternak Haiti 200 farms NaN coeb koperativ ekselsyo basen (350 members) NaN haiti coffee ~350m ... Blue-Green 16 May 24th, 2013 Specialty Coffee Association 36d0d00a3724338ba7937c52a378d085f2172daa 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m 350.00 350.00 350.00
1308 1309 Arabica exportadora atlantic, s.a. Nicaragua finca las marías 017-053-0211/ 017-053-0212 beneficio atlantic condega 017-053-0211/ 017-053-0212 exportadora atlantic s.a 1100 ... Green 5 June 6th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1100.00 1100.00 1100.00
1309 1310 Arabica juan luis alvarado romero Guatemala finca el limon NaN beneficio serben 11/853/165 unicafe 4650 ... Green 4 May 24th, 2013 Asociacion Nacional Del Café b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 ft 1417.32 1417.32 1417.32
1310 1312 Arabica bismarck castro Honduras los hicaques 103 cigrah s.a de c.v. 13-111-053 cigrah s.a de c.v 1400 ... Green 2 April 28th, 2018 Instituto Hondureño del Café b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 7f521ca403540f81ec99daec7da19c2788393880 m 1400.00 1400.00 1400.00

1311 rows × 44 columns

so it displays nicely, as did coffee_df.describe().

If we leave the () off we don’t get nice formatting

coffee_df.describe
<bound method NDFrame.describe of       Unnamed: 0  Species                       Owner Country.of.Origin  \
0              1  Arabica                   metad plc          Ethiopia   
1              2  Arabica                   metad plc          Ethiopia   
2              3  Arabica    grounds for health admin         Guatemala   
3              4  Arabica         yidnekachew dabessa          Ethiopia   
4              5  Arabica                   metad plc          Ethiopia   
...          ...      ...                         ...               ...   
1306        1307  Arabica    juan carlos garcia lopez            Mexico   
1307        1308  Arabica     myriam kaplan-pasternak             Haiti   
1308        1309  Arabica  exportadora atlantic, s.a.         Nicaragua   
1309        1310  Arabica   juan luis alvarado romero         Guatemala   
1310        1312  Arabica             bismarck castro          Honduras   

                                     Farm.Name                  Lot.Number  \
0                                    metad plc                         NaN   
1                                    metad plc                         NaN   
2     san marcos barrancas "san cristobal cuch                         NaN   
3        yidnekachew dabessa coffee plantation                         NaN   
4                                    metad plc                         NaN   
...                                        ...                         ...   
1306                             el centenario                         NaN   
1307                                 200 farms                         NaN   
1308                          finca las marías  017-053-0211/ 017-053-0212   
1309                            finca el limon                         NaN   
1310                              los hicaques                         103   

                                                   Mill  \
0                                             metad plc   
1                                             metad plc   
2                                                   NaN   
3                                               wolensu   
4                                             metad plc   
...                                                 ...   
1306  la esperanza, municipio juchique de ferrer, ve...   
1307        coeb koperativ ekselsyo basen (350 members)   
1308                         beneficio atlantic condega   
1309                                   beneficio serben   
1310                                 cigrah s.a de c.v.   

                      ICO.Number                                Company  \
0                      2014/2015      metad agricultural developmet plc   
1                      2014/2015      metad agricultural developmet plc   
2                            NaN                                    NaN   
3                            NaN  yidnekachew debessa coffee plantation   
4                      2014/2015      metad agricultural developmet plc   
...                          ...                                    ...   
1306                  1104328663                              terra mia   
1307                         NaN                           haiti coffee   
1308  017-053-0211/ 017-053-0212               exportadora atlantic s.a   
1309                  11/853/165                                unicafe   
1310                  13-111-053                      cigrah s.a de c.v   

           Altitude  ...       Color Category.Two.Defects  \
0         1950-2200  ...       Green                    0   
1         1950-2200  ...       Green                    1   
2     1600 - 1800 m  ...         NaN                    0   
3         1800-2200  ...       Green                    2   
4         1950-2200  ...       Green                    2   
...             ...  ...         ...                  ...   
1306            900  ...        None                   20   
1307          ~350m  ...  Blue-Green                   16   
1308           1100  ...       Green                    5   
1309           4650  ...       Green                    4   
1310           1400  ...       Green                    2   

                Expiration                  Certification.Body  \
0          April 3rd, 2016  METAD Agricultural Development plc   
1          April 3rd, 2016  METAD Agricultural Development plc   
2           May 31st, 2011        Specialty Coffee Association   
3         March 25th, 2016  METAD Agricultural Development plc   
4          April 3rd, 2016  METAD Agricultural Development plc   
...                    ...                                 ...   
1306  September 17th, 2013                             AMECAFE   
1307        May 24th, 2013        Specialty Coffee Association   
1308        June 6th, 2018        Instituto Hondureño del Café   
1309        May 24th, 2013        Asociacion Nacional Del Café   
1310      April 28th, 2018        Instituto Hondureño del Café   

                         Certification.Address  \
0     309fcf77415a3661ae83e027f7e5f05dad786e44   
1     309fcf77415a3661ae83e027f7e5f05dad786e44   
2     36d0d00a3724338ba7937c52a378d085f2172daa   
3     309fcf77415a3661ae83e027f7e5f05dad786e44   
4     309fcf77415a3661ae83e027f7e5f05dad786e44   
...                                        ...   
1306  59e396ad6e22a1c22b248f958e1da2bd8af85272   
1307  36d0d00a3724338ba7937c52a378d085f2172daa   
1308  b4660a57e9f8cc613ae5b8f02bfce8634c763ab4   
1309  b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53   
1310  b4660a57e9f8cc613ae5b8f02bfce8634c763ab4   

                         Certification.Contact unit_of_measurement  \
0     19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
1     19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
2     0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660                   m   
3     19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
4     19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
...                                        ...                 ...   
1306  0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7                   m   
1307  0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660                   m   
1308  7f521ca403540f81ec99daec7da19c2788393880                   m   
1309  724f04ad10ed31dbb9d260f0dfd221ba48be8a95                  ft   
1310  7f521ca403540f81ec99daec7da19c2788393880                   m   

     altitude_low_meters altitude_high_meters altitude_mean_meters  
0                1950.00              2200.00              2075.00  
1                1950.00              2200.00              2075.00  
2                1600.00              1800.00              1700.00  
3                1800.00              2200.00              2000.00  
4                1950.00              2200.00              2075.00  
...                  ...                  ...                  ...  
1306              900.00               900.00               900.00  
1307              350.00               350.00               350.00  
1308             1100.00              1100.00              1100.00  
1309             1417.32              1417.32              1417.32  
1310             1400.00              1400.00              1400.00  

[1311 rows x 44 columns]>

So let’s check the type of that.

type(coffee_df.describe)
method

It’s a bound method, that is, a function that will be applied to the DataFrame, but we didn’t actually run it. To see that it hasn’t run, we can use an IPython[1] cell magic, %%timeit:

%%timeit
coffee_df.describe
74.8 ns ± 0.0775 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%%timeit
coffee_df.describe()
28.3 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that without the () it runs much, much faster (nanoseconds instead of milliseconds). That signals that it did less work: looking up the method takes far less calculation than computing statistics on the data.
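The same distinction can be sketched outside a notebook with the standard-library timeit module. The Scores class below is a hypothetical stand-in for a DataFrame, and the values are made up; the timings it prints will vary by machine.

```python
import timeit


class Scores:
    """Hypothetical stand-in for a DataFrame with one costly method."""

    def __init__(self, values):
        self.values = values

    def describe(self):
        # Does real work: several passes over all of the values
        n = len(self.values)
        return {'count': n, 'mean': sum(self.values) / n,
                'min': min(self.values), 'max': max(self.values)}


scores = Scores([7.5, 8.0, 8.25, 6.75] * 1000)

method = scores.describe      # no (): only looks the method up
result = scores.describe()    # with (): actually runs it

print(type(method).__name__)  # method
print(type(result).__name__)  # dict

# Time 1000 lookups against 1000 calls
lookup_time = timeit.timeit(lambda: scores.describe, number=1000)
call_time = timeit.timeit(lambda: scores.describe(), number=1000)
print(f"lookup: {lookup_time:.6f}s, call: {call_time:.6f}s")
```

The lookup only returns the method object, while the call has to visit every value, which is why the second timing is expected to be much larger.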

7.3. Basic plots in pandas

Pandas gives us basic plots.

coffee_df['Flavor'].plot()
<AxesSubplot:>
../_images/2021-09-22_24_1.png

Since we chose a series, it plotted that data as a line against the index.

coffee_df.index
RangeIndex(start=0, stop=1311, step=1)

We can change the kind, for example to a Kernel Density Estimate. This approximates the distribution of the data; you can think of it roughly as a smoothed-out histogram.

coffee_df['Flavor'].plot(kind='kde')
<AxesSubplot:ylabel='Density'>
../_images/2021-09-22_28_1.png
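To make the "smoothed-out histogram" idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written by hand. It is not how pandas computes its plot; the function name gaussian_kde_sketch, the fixed bandwidth, and the scores are all assumptions for illustration, whereas plotting libraries typically choose a bandwidth automatically.

```python
import math


def gaussian_kde_sketch(samples, bandwidth):
    """Return a function that estimates the density of `samples`.

    Each sample contributes one Gaussian bump of width `bandwidth`;
    the estimate at x is the average height of those bumps at x.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Made-up scores, not taken from the coffee data
scores = [7.2, 7.4, 7.5, 7.5, 7.6, 7.8, 8.1]
density = gaussian_kde_sketch(scores, bandwidth=0.2)

# Evaluate on a grid from 6.0 to 9.0 and approximate the area under the curve
step = 0.01
grid = [6.0 + i * step for i in range(301)]
area = sum(density(x) for x in grid) * step
print(f"density at 7.5: {density(7.5):.3f}, area under curve: {area:.3f}")
```

Because it is a density, the curve should enclose an area of about 1, and it is highest where the samples cluster. A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual points more closely.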

We can also plot two variables as a scatter plot, by specifying the x, y and kind

coffee_df.plot(x='Flavor',y='Balance', kind='scatter')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
../_images/2021-09-22_30_2.png

Let’s make a histogram of the Balance variable.

coffee_df['Balance'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
../_images/2021-09-22_32_1.png

Question from class

Can we plot two histograms with coffee_df['Balance']['Flavor'].plot(kind='hist')?

coffee_df['Balance']['Flavor'].plot(kind='hist')
KeyError: 'Flavor'

Let’s break down why that raises an error. When we chain operations one after another, Python evaluates them from left to right, passing the output of each step to the next one.
So coffee_df['Balance'].plot(kind='hist') first made a series, then plotted it. In the line above, we again got the series first, which works:

coffee_df['Balance'].head(2)
0    8.42
1    8.42
Name: Balance, dtype: float64

But then we tried to index that series with ‘Flavor’, and the series doesn’t have that column any more:

coffee_df['Balance']['Flavor']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 coffee_df['Balance']['Flavor']

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:958, in Series.__getitem__(self, key)
    955     return self._values[key]
    957 elif key_is_scalar:
--> 958     return self._get_value(key)
    960 if is_hashable(key):
    961     # Otherwise index.get_value will raise InvalidIndexError
    962     try:
    963         # For labels that don't resolve as scalars like tuples and frozensets

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:1069, in Series._get_value(self, label, takeable)
   1066     return self._values[label]
   1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
   1070 return self.index._get_values_for_loc(self, loc, label)

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/indexes/range.py:389, in RangeIndex.get_loc(self, key, method, tolerance)
    387             raise KeyError(key) from err
    388     self._check_indexing_error(key)
--> 389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'Flavor'

So we get a key error and we know this is the part of the line we have to change.

We need to index into the DataFrame and pick two columns at once. When we index, we can use the name of a variable as a string, or a list of names. We can build this list on the fly, because Python executes from the inside out.
The inner [ ] make a list and the outer [ ] index with it:

coffee_df[['Balance','Flavor']].head(2)
Balance Flavor
0 8.42 8.83
1 8.42 8.67
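The difference between indexing with a string and with a list can be checked on a small frame. The values below are made up (only the column names mirror the coffee data), and toy_df is a hypothetical name used for illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the coffee data
toy_df = pd.DataFrame({'Balance': [8.42, 8.42, 8.33],
                       'Flavor': [8.83, 8.67, 8.50],
                       'Aroma': [8.67, 8.75, 8.42]})

one_col = toy_df['Balance']                # string key gives a Series
two_cols = toy_df[['Balance', 'Flavor']]   # list key gives a DataFrame

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(list(two_cols.columns))    # ['Balance', 'Flavor']

# A Series is indexed by row labels (0, 1, 2 here), so a column name fails
try:
    one_col['Flavor']
except KeyError as err:
    print('KeyError:', err)
```

The Series keeps only the row index, which is why a second column name cannot be looked up in it, while the list form keeps both columns in a DataFrame.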

We could also build the list first, then index, for readability:

hist_vars = ['Balance','Flavor']
coffee_df[hist_vars].head(2)
Balance Flavor
0 8.42 8.83
1 8.42 8.67

This gives us a data frame, which we can plot.

coffee_df[['Balance','Flavor']].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
../_images/2021-09-22_44_1.png

We’ll see ways to improve this on Friday.

7.4. Plotting in Python

  • matplotlib: low level plotting tools

  • seaborn: high level plotting with opinionated defaults

  • ggplot: plotting based on the ggplot library in R.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.

We can get that basic plot back.

sns.scatterplot(data=coffee_df,x='Flavor',y='Balance')
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
../_images/2021-09-22_46_1.png

But now we have more power to investigate more relationships in the data.

sns.scatterplot(data=coffee_df,x='Flavor',y='Balance',hue='Color')
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
../_images/2021-09-22_48_1.png

From this we can see that the color doesn’t appear to be related to the flavor or balance scores, but that flavor and balance are related to each other.

We can also break this apart. lmplot is a higher-level plotting function, so it allows us to create grids of plots, and by default it also includes a regression line. We’ll turn that off for now with fit_reg=False.

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               col='Color',fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x7ff7d31776a0>
../_images/2021-09-22_50_1.png

col stands for column. We can also use row:

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               row='Color')
<seaborn.axisgrid.FacetGrid at 0x7ff7d2d51fa0>
../_images/2021-09-22_52_1.png

We can also use both together:

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               row='Color',col='Variety')
<seaborn.axisgrid.FacetGrid at 0x7ff7d27eefa0>
../_images/2021-09-22_54_1.png

How could we choose which countries to include, so that the plot doesn’t show the ones with very few points? We can start by counting the rows for each country:

coffee_df['Country.of.Origin'].value_counts()
Mexico                          236
Colombia                        183
Guatemala                       181
Brazil                          132
Taiwan                           75
United States (Hawaii)           73
Honduras                         53
Costa Rica                       51
Ethiopia                         44
Tanzania, United Republic Of     40
Thailand                         32
Uganda                           26
Nicaragua                        26
Kenya                            25
El Salvador                      21
Indonesia                        20
China                            16
Malawi                           11
Peru                             10
United States                     8
Myanmar                           8
Vietnam                           7
Haiti                             6
Philippines                       5
Panama                            4
United States (Puerto Rico)       4
Laos                              3
Burundi                           2
Ecuador                           1
Rwanda                            1
Japan                             1
Zambia                            1
Papua New Guinea                  1
Mauritius                         1
Cote d?Ivoire                     1
India                             1
Name: Country.of.Origin, dtype: int64
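One possible way to use these counts is to keep only the countries that appear at least some minimum number of times. This is a sketch on a toy frame with made-up values; min_count, common, and subset_df are hypothetical names, and the threshold is an arbitrary choice.

```python
import pandas as pd

# Toy frame with made-up values; the column name mirrors the coffee data
toy_df = pd.DataFrame({
    'Country.of.Origin': ['Mexico'] * 4 + ['Colombia'] * 3 + ['Japan', 'Zambia'],
    'Flavor': [7.5, 7.6, 7.4, 7.7, 7.8, 7.9, 7.6, 7.3, 7.2],
})

min_count = 3   # hypothetical threshold; pick one that suits the plot

# Count rows per country, then keep the labels that meet the threshold
counts = toy_df['Country.of.Origin'].value_counts()
common = counts[counts >= min_count].index

# Keep only the rows whose country is in that set
subset_df = toy_df[toy_df['Country.of.Origin'].isin(common)]

print(sorted(subset_df['Country.of.Origin'].unique()))   # ['Colombia', 'Mexico']
print(len(subset_df))                                    # 7
```

The filtered frame could then be passed as the data argument of a plotting call in place of the full DataFrame.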

Or we can keep all of the countries, but wrap the columns onto multiple rows.

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               col='Country.of.Origin',col_wrap=5)
<seaborn.axisgrid.FacetGrid at 0x7ff7cc6b22e0>
../_images/2021-09-22_58_1.png

7.5. Questions after class

Ram Token Opportunity

Add a question with a pull request; earn 1-2 ram tokens for submitting a question with the answer (with sources).

7.6. More practice

  1. Plot the kde for the Aftertaste.

  2. How does Total.Cup.Points vary by Certification.Body?

  3. Are moisture and sweetness related? Does that relationship vary by Color?


[1] IPython is the kernel of Python we’re using.