7. Visualization
import pandas as pd
import seaborn as sns
sns.set_theme(palette= "colorblind")
7.1. Jupyter FAQ
Question from class
Why doesn’t my Jupyter print out things that are on the last line?
When we create a variable and then put it on the last line of a cell, Jupyter displays it.
name = 'sarah'
name
'sarah'
How it displays it depends on the type.
type(name)
str
For a string, Jupyter shows the repr, which is why the output above has quotes. print shows the same text without them:
print(name)
sarah
So the two look nearly the same. For objects that have a _repr_html_
method, Jupyter uses that method and renders the object as HTML, in a more visually
appealing way.
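To make the display rule concrete, here is a minimal sketch using a made-up class (`Bean` is hypothetical and not part of pandas or Jupyter). When an instance is the last line of a cell, Jupyter would look for `_repr_html_`, while `print` and a plain console fall back to `__repr__`.

```python
class Bean:
    """Hypothetical class used only to illustrate the two display hooks."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Plain-text form, used by print() and by consoles without HTML support
        return f"Bean(name={self.name!r})"

    def _repr_html_(self):
        # Rich form; Jupyter looks for this method when displaying an object
        return f"<b>Bean</b>: <i>{self.name}</i>"


bean = Bean("arabica")
text_form = repr(bean)          # what print(bean) would show
html_form = bean._repr_html_()  # what Jupyter would render as HTML
print(text_form)
print(html_form)
```

A DataFrame works the same way: its `_repr_html_` produces the table that the notebook renders.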
7.2. Review of describe
We’re going to work with the arabica data today, because it’s a little bigger and more interesting for plotting.
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url)
We can describe it again, to see it has mostly the same variables we saw before, but some different as well.
coffee_df.describe()
| | Unnamed: 0 | Number.of.Bags | Aroma | Flavor | Aftertaste | Acidity | Body | Balance | Uniformity | Clean.Cup | Sweetness | Cupper.Points | Total.Cup.Points | Moisture | Category.One.Defects | Quakers | Category.Two.Defects | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.00000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1311.000000 | 1310.000000 | 1311.000000 | 1084.000000 | 1084.000000 | 1084.000000 |
mean | 656.000763 | 153.887872 | 7.563806 | 7.518070 | 7.397696 | 7.533112 | 7.517727 | 7.517506 | 9.833394 | 9.83312 | 9.903272 | 7.497864 | 82.115927 | 0.088863 | 0.426392 | 0.177099 | 3.591915 | 1759.548954 | 1808.843803 | 1784.196379 |
std | 378.598733 | 129.733734 | 0.378666 | 0.399979 | 0.405119 | 0.381599 | 0.359213 | 0.406316 | 0.559343 | 0.77135 | 0.530832 | 0.474610 | 3.515761 | 0.047957 | 1.832415 | 0.840583 | 5.350371 | 8767.847252 | 8767.187498 | 8767.016913 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 328.500000 | 14.500000 | 7.420000 | 7.330000 | 7.250000 | 7.330000 | 7.330000 | 7.330000 | 10.000000 | 10.00000 | 10.000000 | 7.250000 | 81.170000 | 0.090000 | 0.000000 | 0.000000 | 0.000000 | 1100.000000 | 1100.000000 | 1100.000000 |
50% | 656.000000 | 175.000000 | 7.580000 | 7.580000 | 7.420000 | 7.500000 | 7.500000 | 7.500000 | 10.000000 | 10.00000 | 10.000000 | 7.500000 | 82.500000 | 0.110000 | 0.000000 | 0.000000 | 2.000000 | 1310.640000 | 1350.000000 | 1310.640000 |
75% | 983.500000 | 275.000000 | 7.750000 | 7.750000 | 7.580000 | 7.750000 | 7.670000 | 7.750000 | 10.000000 | 10.00000 | 10.000000 | 7.750000 | 83.670000 | 0.120000 | 0.000000 | 0.000000 | 4.000000 | 1600.000000 | 1650.000000 | 1600.000000 |
max | 1312.000000 | 1062.000000 | 8.750000 | 8.830000 | 8.670000 | 8.750000 | 8.580000 | 8.750000 | 10.000000 | 10.00000 | 10.000000 | 10.000000 | 90.580000 | 0.280000 | 31.000000 | 11.000000 | 55.000000 | 190164.000000 | 190164.000000 | 190164.000000 |
Question from class
Why do we need the () on describe but not on just the data?
As is often the case, this comes back to the type.
type(coffee_df)
pandas.core.frame.DataFrame
It is a DataFrame, which has the _repr_html_ method,
coffee_df
| | Unnamed: 0 | Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
1 | 2 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
2 | 3 | Arabica | grounds for health admin | Guatemala | san marcos barrancas "san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.00 | 1800.00 | 1700.00 |
3 | 4 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.00 | 2200.00 | 2000.00 |
4 | 5 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.00 | 2200.00 | 2075.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1306 | 1307 | Arabica | juan carlos garcia lopez | Mexico | el centenario | NaN | la esperanza, municipio juchique de ferrer, ve... | 1104328663 | terra mia | 900 | ... | None | 20 | September 17th, 2013 | AMECAFE | 59e396ad6e22a1c22b248f958e1da2bd8af85272 | 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 | m | 900.00 | 900.00 | 900.00 |
1307 | 1308 | Arabica | myriam kaplan-pasternak | Haiti | 200 farms | NaN | coeb koperativ ekselsyo basen (350 members) | NaN | haiti coffee | ~350m | ... | Blue-Green | 16 | May 24th, 2013 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 350.00 | 350.00 | 350.00 |
1308 | 1309 | Arabica | exportadora atlantic, s.a. | Nicaragua | finca las marías | 017-053-0211/ 017-053-0212 | beneficio atlantic condega | 017-053-0211/ 017-053-0212 | exportadora atlantic s.a | 1100 | ... | Green | 5 | June 6th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1100.00 | 1100.00 | 1100.00 |
1309 | 1310 | Arabica | juan luis alvarado romero | Guatemala | finca el limon | NaN | beneficio serben | 11/853/165 | unicafe | 4650 | ... | Green | 4 | May 24th, 2013 | Asociacion Nacional Del Café | b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53 | 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 | ft | 1417.32 | 1417.32 | 1417.32 |
1310 | 1312 | Arabica | bismarck castro | Honduras | los hicaques | 103 | cigrah s.a de c.v. | 13-111-053 | cigrah s.a de c.v | 1400 | ... | Green | 2 | April 28th, 2018 | Instituto Hondureño del Café | b4660a57e9f8cc613ae5b8f02bfce8634c763ab4 | 7f521ca403540f81ec99daec7da19c2788393880 | m | 1400.00 | 1400.00 | 1400.00 |
1311 rows × 44 columns
so it prints nicely, as did coffee_df.describe().
If we leave the () off, we don’t get nice formatting:
coffee_df.describe
<bound method NDFrame.describe of Unnamed: 0 Species Owner Country.of.Origin \
0 1 Arabica metad plc Ethiopia
1 2 Arabica metad plc Ethiopia
2 3 Arabica grounds for health admin Guatemala
3 4 Arabica yidnekachew dabessa Ethiopia
4 5 Arabica metad plc Ethiopia
... ... ... ... ...
1306 1307 Arabica juan carlos garcia lopez Mexico
1307 1308 Arabica myriam kaplan-pasternak Haiti
1308 1309 Arabica exportadora atlantic, s.a. Nicaragua
1309 1310 Arabica juan luis alvarado romero Guatemala
1310 1312 Arabica bismarck castro Honduras
Farm.Name Lot.Number \
0 metad plc NaN
1 metad plc NaN
2 san marcos barrancas "san cristobal cuch NaN
3 yidnekachew dabessa coffee plantation NaN
4 metad plc NaN
... ... ...
1306 el centenario NaN
1307 200 farms NaN
1308 finca las marías 017-053-0211/ 017-053-0212
1309 finca el limon NaN
1310 los hicaques 103
Mill \
0 metad plc
1 metad plc
2 NaN
3 wolensu
4 metad plc
... ...
1306 la esperanza, municipio juchique de ferrer, ve...
1307 coeb koperativ ekselsyo basen (350 members)
1308 beneficio atlantic condega
1309 beneficio serben
1310 cigrah s.a de c.v.
ICO.Number Company \
0 2014/2015 metad agricultural developmet plc
1 2014/2015 metad agricultural developmet plc
2 NaN NaN
3 NaN yidnekachew debessa coffee plantation
4 2014/2015 metad agricultural developmet plc
... ... ...
1306 1104328663 terra mia
1307 NaN haiti coffee
1308 017-053-0211/ 017-053-0212 exportadora atlantic s.a
1309 11/853/165 unicafe
1310 13-111-053 cigrah s.a de c.v
Altitude ... Color Category.Two.Defects \
0 1950-2200 ... Green 0
1 1950-2200 ... Green 1
2 1600 - 1800 m ... NaN 0
3 1800-2200 ... Green 2
4 1950-2200 ... Green 2
... ... ... ... ...
1306 900 ... None 20
1307 ~350m ... Blue-Green 16
1308 1100 ... Green 5
1309 4650 ... Green 4
1310 1400 ... Green 2
Expiration Certification.Body \
0 April 3rd, 2016 METAD Agricultural Development plc
1 April 3rd, 2016 METAD Agricultural Development plc
2 May 31st, 2011 Specialty Coffee Association
3 March 25th, 2016 METAD Agricultural Development plc
4 April 3rd, 2016 METAD Agricultural Development plc
... ... ...
1306 September 17th, 2013 AMECAFE
1307 May 24th, 2013 Specialty Coffee Association
1308 June 6th, 2018 Instituto Hondureño del Café
1309 May 24th, 2013 Asociacion Nacional Del Café
1310 April 28th, 2018 Instituto Hondureño del Café
Certification.Address \
0 309fcf77415a3661ae83e027f7e5f05dad786e44
1 309fcf77415a3661ae83e027f7e5f05dad786e44
2 36d0d00a3724338ba7937c52a378d085f2172daa
3 309fcf77415a3661ae83e027f7e5f05dad786e44
4 309fcf77415a3661ae83e027f7e5f05dad786e44
... ...
1306 59e396ad6e22a1c22b248f958e1da2bd8af85272
1307 36d0d00a3724338ba7937c52a378d085f2172daa
1308 b4660a57e9f8cc613ae5b8f02bfce8634c763ab4
1309 b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53
1310 b4660a57e9f8cc613ae5b8f02bfce8634c763ab4
Certification.Contact unit_of_measurement \
0 19fef5a731de2db57d16da10287413f5f99bc2dd m
1 19fef5a731de2db57d16da10287413f5f99bc2dd m
2 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m
3 19fef5a731de2db57d16da10287413f5f99bc2dd m
4 19fef5a731de2db57d16da10287413f5f99bc2dd m
... ... ...
1306 0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7 m
1307 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 m
1308 7f521ca403540f81ec99daec7da19c2788393880 m
1309 724f04ad10ed31dbb9d260f0dfd221ba48be8a95 ft
1310 7f521ca403540f81ec99daec7da19c2788393880 m
altitude_low_meters altitude_high_meters altitude_mean_meters
0 1950.00 2200.00 2075.00
1 1950.00 2200.00 2075.00
2 1600.00 1800.00 1700.00
3 1800.00 2200.00 2000.00
4 1950.00 2200.00 2075.00
... ... ... ...
1306 900.00 900.00 900.00
1307 350.00 350.00 350.00
1308 1100.00 1100.00 1100.00
1309 1417.32 1417.32 1417.32
1310 1400.00 1400.00 1400.00
[1311 rows x 44 columns]>
So let’s check the type of that.
type(coffee_df.describe)
method
It’s a bound method, that is, a function that will be applied to the DataFrame, but
we didn’t actually run it. To see that it hasn’t run, we can use an
IPython [1] magic, %%timeit.
%%timeit
coffee_df.describe
74.8 ns ± 0.0775 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
%%timeit
coffee_df.describe()
28.3 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that without the () it runs much, much faster, signaling that it did less work:
looking up the method is far less computation than calculating statistics on the data.
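The difference between referring to a method and calling it can also be checked directly. This is a small sketch on a made-up three-row frame (`small_df` and its values are illustrative, not the coffee data): without the parentheses we only get a method object tied to the DataFrame, and with them we get the computed summary.

```python
import types

import pandas as pd

# Tiny made-up frame so the example runs without downloading anything
small_df = pd.DataFrame({"Flavor": [7.5, 8.0, 8.5], "Balance": [7.0, 7.5, 8.0]})

method_obj = small_df.describe   # no (): just a reference to the method
summary = small_df.describe()    # with (): the statistics are computed

is_method = isinstance(method_obj, types.MethodType)
is_bound_to_df = method_obj.__self__ is small_df  # the method remembers its DataFrame
mean_flavor = summary.loc["mean", "Flavor"]
print(is_method, is_bound_to_df, type(summary).__name__, mean_flavor)
```

Because `method_obj` is an ordinary object, it can be called later with `method_obj()`, which gives the same result as `small_df.describe()`.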
7.3. Basic plots in pandas
Pandas gives us basic plots.
coffee_df['Flavor'].plot()
<AxesSubplot:>
Since we chose a Series, it plotted that data as a line against the index.
coffee_df.index
RangeIndex(start=0, stop=1311, step=1)
We can change the kind, for example to a Kernel Density Estimate (KDE). This approximates the distribution of the data; you can think of it roughly as a smoothed-out histogram.
coffee_df['Flavor'].plot(kind='kde')
<AxesSubplot:ylabel='Density'>
We can also plot two variables as a scatter plot, by specifying x, y, and kind.
coffee_df.plot(x='Flavor',y='Balance', kind='scatter')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*. Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
Let’s make a histogram of the Balance variable.
coffee_df['Balance'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
Question from class
Can we plot two histograms with coffee_df['Balance']['Flavor'].plot(kind='hist')?
coffee_df['Balance']['Flavor'].plot(kind='hist')
Not as written: this raises KeyError: 'Flavor'.
Let’s break down why that errors. When we chain operations by appending them on the right, Python
evaluates them left to right, passing the output of one step as the input to the next one.
So coffee_df['Balance'].plot(kind='hist') first made a Series, then plotted it.
In the failing line, we again got the Series first, which works:
coffee_df['Balance'].head(2)
0 8.42
1 8.42
Name: Balance, dtype: float64
But then we tried to index that Series with 'Flavor', and a single column no longer has a 'Flavor' key:
coffee_df['Balance']['Flavor']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 coffee_df['Balance']['Flavor']
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:958, in Series.__getitem__(self, key)
955 return self._values[key]
957 elif key_is_scalar:
--> 958 return self._get_value(key)
960 if is_hashable(key):
961 # Otherwise index.get_value will raise InvalidIndexError
962 try:
963 # For labels that don't resolve as scalars like tuples and frozensets
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:1069, in Series._get_value(self, label, takeable)
1066 return self._values[label]
1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
1070 return self.index._get_values_for_loc(self, loc, label)
File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/indexes/range.py:389, in RangeIndex.get_loc(self, key, method, tolerance)
387 raise KeyError(key) from err
388 self._check_indexing_error(key)
--> 389 raise KeyError(key)
390 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'Flavor'
So we get a KeyError, and we know this is the part of the line we have to change.
We need to index into the DataFrame and pick two columns at once. When we index,
we can use the name of a variable as a string, or a list of names. We can build this list
on the fly, and Python executes from the inside out.
The inner [ ] make a list and the outer [ ] index.
coffee_df[['Balance','Flavor']].head(2)
| | Balance | Flavor |
---|---|---|
0 | 8.42 | 8.83 |
1 | 8.42 | 8.67 |
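The three indexing forms discussed here can be compared side by side. This is a sketch on a small made-up frame (`small_df` and the Aroma values are illustrative): a string key gives a Series, a list key gives a DataFrame, and chaining two string keys fails because the Series only has row labels.

```python
import pandas as pd

# Made-up frame standing in for coffee_df
small_df = pd.DataFrame(
    {"Balance": [8.42, 8.42], "Flavor": [8.83, 8.67], "Aroma": [8.67, 8.75]}
)

one_col = small_df["Balance"]               # string key: a Series
two_cols = small_df[["Balance", "Flavor"]]  # list key: a DataFrame

try:
    # The Series is indexed by row labels 0 and 1, so 'Flavor' is not a valid key
    small_df["Balance"]["Flavor"]
    raised = False
except KeyError:
    raised = True

print(type(one_col).__name__, type(two_cols).__name__, raised)
```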
We could also build the list first, then index, for readability:
hist_vars = ['Balance','Flavor']
coffee_df[hist_vars].head(2)
| | Balance | Flavor |
---|---|---|
0 | 8.42 | 8.83 |
1 | 8.42 | 8.67 |
This gives us a data frame, which we can plot.
coffee_df[['Balance','Flavor']].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
We’ll see ways to improve this on Friday.
7.4. Plotting in Python
matplotlib: low level plotting tools
seaborn: high level plotting with opinionated defaults
ggplot: plotting based on the ggplot library in R.
Pandas and seaborn use matplotlib under the hood.
Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.
We can get that basic plot back.
sns.scatterplot(data=coffee_df,x='Flavor',y='Balance')
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
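Because seaborn draws with matplotlib, the object it returns is a matplotlib Axes, which is what the output line above reports. The sketch below, on made-up data (`toy_df` is illustrative), shows one way to keep that object and adjust the plot afterwards. The `matplotlib.use("Agg")` line is only there so the code runs outside a notebook; inside Jupyter it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Made-up scores standing in for the coffee data
toy_df = pd.DataFrame({"Flavor": [7.5, 8.0, 8.5], "Balance": [7.2, 7.9, 8.4]})

ax = sns.scatterplot(data=toy_df, x="Flavor", y="Balance")

# The returned Axes accepts the usual matplotlib calls
ax.set_title("Balance vs. Flavor")
ax.set_xlabel("Flavor score")

print(isinstance(ax, Axes), ax.get_title(), ax.get_ylabel())
```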
But now we have more power to investigate more relationships in the data.
sns.scatterplot(data=coffee_df,x='Flavor',y='Balance',hue='Color')
<AxesSubplot:xlabel='Flavor', ylabel='Balance'>
From this we can see that the color doesn’t appear to be related to the flavor or balance scores, but that flavor and balance are related to each other.
We can also break this apart. lmplot is a higher-level plotting function, so
it allows us to create grids of plots and by default also includes a regression
line. We’ll turn that off for now with fit_reg=False.
sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
col='Color',fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x7ff7d31776a0>
col stands for column. We can also use row:
sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
row='Color')
<seaborn.axisgrid.FacetGrid at 0x7ff7d2d51fa0>
We can also use both together:
sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
row='Color',col='Variety')
<seaborn.axisgrid.FacetGrid at 0x7ff7d27eefa0>
How could we choose which countries to select to make this not show the ones with very few points?
coffee_df['Country.of.Origin'].value_counts()
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
Honduras 53
Costa Rica 51
Ethiopia 44
Tanzania, United Republic Of 40
Thailand 32
Uganda 26
Nicaragua 26
Kenya 25
El Salvador 21
Indonesia 20
China 16
Malawi 11
Peru 10
United States 8
Myanmar 8
Vietnam 7
Haiti 6
Philippines 5
Panama 4
United States (Puerto Rico) 4
Laos 3
Burundi 2
Ecuador 1
Rwanda 1
Japan 1
Zambia 1
Papua New Guinea 1
Mauritius 1
Cote d?Ivoire 1
India 1
Name: Country.of.Origin, dtype: int64
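One possible answer to the question above is to use these counts to keep only the countries with enough rows before plotting. This is a sketch on a made-up frame, and the cutoff `min_count` is an arbitrary choice for illustration, not a value from the class.

```python
import pandas as pd

# Made-up data: two well-represented countries and one with a single row
toy_df = pd.DataFrame(
    {
        "Country.of.Origin": ["Mexico"] * 4 + ["Colombia"] * 3 + ["Japan"],
        "Flavor": [7.5, 7.6, 7.4, 7.8, 7.9, 7.7, 7.6, 7.2],
    }
)

min_count = 3  # hypothetical cutoff; pick one that suits the real data

counts = toy_df["Country.of.Origin"].value_counts()
common = counts[counts >= min_count].index  # country names that meet the cutoff

# Keep only the rows whose country is in the list of common countries
filtered_df = toy_df[toy_df["Country.of.Origin"].isin(common)]
print(sorted(common), len(filtered_df))
```

The filtered frame can then be passed as `data=` to the same plotting calls.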
Or we can focus on the countries, but wrap them.
sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
col='Country.of.Origin',col_wrap=5)
<seaborn.axisgrid.FacetGrid at 0x7ff7cc6b22e0>
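To see what col_wrap does without loading the full dataset, here is a sketch on made-up data with three countries (`toy_df` and the small `height` are illustrative choices). With `col_wrap=2`, the three panels are laid out two per row instead of in one long row. As before, the `Agg` backend line is only for running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import pandas as pd
import seaborn as sns

# Made-up scores for three countries
toy_df = pd.DataFrame(
    {
        "Flavor": [7.5, 8.0, 8.5, 7.2, 7.9, 8.4],
        "Balance": [7.4, 7.8, 8.3, 7.0, 7.7, 8.6],
        "Country.of.Origin": ["Mexico", "Mexico", "Colombia", "Colombia", "Kenya", "Kenya"],
    }
)

# One panel per country, wrapping to a new row after every 2 panels
grid = sns.lmplot(
    data=toy_df,
    x="Flavor",
    y="Balance",
    col="Country.of.Origin",
    col_wrap=2,
    fit_reg=False,
    height=2,
)

n_panels = len(grid.axes.flat)  # one Axes per country
print(n_panels)
```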
7.5. Questions after class
Ram Token Opportunity
add a question with a pull request; earn 1-2 ram tokens for submitting a question with the answer (with sources)
7.6. More practice
Plot the KDE for Aftertaste.
How does Total.Cup.Points vary by Certification.Body?
Are moisture and sweetness related? Does that relationship vary by Color?
[1] IPython: the Python kernel we’re using.