7. Visualization

import pandas as pd
import seaborn as sns
sns.set_theme(palette="colorblind")

7.1. Jupyter FAQ

Question from class

Why doesn’t my jupyter print things out that are on the last line?

When we create a variable and then put it on the last line of a cell, Jupyter displays it.

name = 'sarah'
name

'sarah'

How it displays it depends on the type.

type(name)

str

For a string, Jupyter shows the repr, which includes the quotes. Calling print shows the same text without them:

print(name)

sarah

so this and the one above look almost the same. For objects that have a _repr_html_ method, Jupyter uses that instead and renders the object as HTML, which is more visually appealing.
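To make the display rule concrete, here is a minimal sketch. The Badge class is hypothetical and exists only for illustration; the one real convention it relies on is that Jupyter looks for a _repr_html_ method when it displays the last line of a cell, and falls back to the plain repr otherwise.

```python
class Badge:
    """Hypothetical class, used only to illustrate rich display."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        # plain-text form, used by repr() and in plain consoles
        return f"Badge({self.label!r})"

    def _repr_html_(self):
        # Jupyter uses this, when present, to render the object as HTML
        return f"<b>{self.label}</b>"


name = 'sarah'
print(repr(name))   # the repr of a string keeps the quotes
print(name)         # print drops them

badge = Badge(name)
plain = repr(badge)
rich = badge._repr_html_()
print(plain)
print(rich)
```

In a notebook, putting badge on the last line of a cell would show the label in bold, while print(badge) would still show the plain text form.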

7.2. Review of describe

We’re going to work with the arabica data today, because it’s a little bigger and more interesting for plotting.

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

coffee_df = pd.read_csv(arabica_data_url)

We can describe it again to see that it has mostly the same variables we saw before, plus some different ones.

coffee_df.describe()

	Unnamed: 0	Number.of.Bags	Aroma	Flavor	Aftertaste	Acidity	Body	Balance	Uniformity	Clean.Cup	Sweetness	Cupper.Points	Total.Cup.Points	Moisture	Category.One.Defects	Quakers	Category.Two.Defects	altitude_low_meters	altitude_high_meters	altitude_mean_meters
count	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1311.00000	1311.000000	1311.000000	1311.000000	1311.000000	1311.000000	1310.000000	1311.000000	1084.000000	1084.000000	1084.000000
mean	656.000763	153.887872	7.563806	7.518070	7.397696	7.533112	7.517727	7.517506	9.833394	9.83312	9.903272	7.497864	82.115927	0.088863	0.426392	0.177099	3.591915	1759.548954	1808.843803	1784.196379
std	378.598733	129.733734	0.378666	0.399979	0.405119	0.381599	0.359213	0.406316	0.559343	0.77135	0.530832	0.474610	3.515761	0.047957	1.832415	0.840583	5.350371	8767.847252	8767.187498	8767.016913
min	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.000000
25%	328.500000	14.500000	7.420000	7.330000	7.250000	7.330000	7.330000	7.330000	10.000000	10.00000	10.000000	7.250000	81.170000	0.090000	0.000000	0.000000	0.000000	1100.000000	1100.000000	1100.000000
50%	656.000000	175.000000	7.580000	7.580000	7.420000	7.500000	7.500000	7.500000	10.000000	10.00000	10.000000	7.500000	82.500000	0.110000	0.000000	0.000000	2.000000	1310.640000	1350.000000	1310.640000
75%	983.500000	275.000000	7.750000	7.750000	7.580000	7.750000	7.670000	7.750000	10.000000	10.00000	10.000000	7.750000	83.670000	0.120000	0.000000	0.000000	4.000000	1600.000000	1650.000000	1600.000000
max	1312.000000	1062.000000	8.750000	8.830000	8.670000	8.750000	8.580000	8.750000	10.000000	10.00000	10.000000	10.000000	90.580000	0.280000	31.000000	11.000000	55.000000	190164.000000	190164.000000	190164.000000

Question from class

Why do we need the () on describe, but not on just the data?

As is often the case, again this comes back to the type.

type(coffee_df)

pandas.core.frame.DataFrame

coffee_df is a DataFrame, which has the _repr_html_ method

coffee_df

	Unnamed: 0	Species	Owner	Country.of.Origin	Farm.Name	Lot.Number	Mill	ICO.Number	Company	Altitude	...	Color	Category.Two.Defects	Expiration	Certification.Body	Certification.Address	Certification.Contact	unit_of_measurement	altitude_low_meters	altitude_high_meters	altitude_mean_meters
0	1	Arabica	metad plc	Ethiopia	metad plc	NaN	metad plc	2014/2015	metad agricultural developmet plc	1950-2200	...	Green	0	April 3rd, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1950.00	2200.00	2075.00
1	2	Arabica	metad plc	Ethiopia	metad plc	NaN	metad plc	2014/2015	metad agricultural developmet plc	1950-2200	...	Green	1	April 3rd, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1950.00	2200.00	2075.00
2	3	Arabica	grounds for health admin	Guatemala	san marcos barrancas "san cristobal cuch	NaN	NaN	NaN	NaN	1600 - 1800 m	...	NaN	0	May 31st, 2011	Specialty Coffee Association	36d0d00a3724338ba7937c52a378d085f2172daa	0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660	m	1600.00	1800.00	1700.00
3	4	Arabica	yidnekachew dabessa	Ethiopia	yidnekachew dabessa coffee plantation	NaN	wolensu	NaN	yidnekachew debessa coffee plantation	1800-2200	...	Green	2	March 25th, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1800.00	2200.00	2000.00
4	5	Arabica	metad plc	Ethiopia	metad plc	NaN	metad plc	2014/2015	metad agricultural developmet plc	1950-2200	...	Green	2	April 3rd, 2016	METAD Agricultural Development plc	309fcf77415a3661ae83e027f7e5f05dad786e44	19fef5a731de2db57d16da10287413f5f99bc2dd	m	1950.00	2200.00	2075.00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1306	1307	Arabica	juan carlos garcia lopez	Mexico	el centenario	NaN	la esperanza, municipio juchique de ferrer, ve...	1104328663	terra mia	900	...	None	20	September 17th, 2013	AMECAFE	59e396ad6e22a1c22b248f958e1da2bd8af85272	0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7	m	900.00	900.00	900.00
1307	1308	Arabica	myriam kaplan-pasternak	Haiti	200 farms	NaN	coeb koperativ ekselsyo basen (350 members)	NaN	haiti coffee	~350m	...	Blue-Green	16	May 24th, 2013	Specialty Coffee Association	36d0d00a3724338ba7937c52a378d085f2172daa	0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660	m	350.00	350.00	350.00
1308	1309	Arabica	exportadora atlantic, s.a.	Nicaragua	finca las marías	017-053-0211/ 017-053-0212	beneficio atlantic condega	017-053-0211/ 017-053-0212	exportadora atlantic s.a	1100	...	Green	5	June 6th, 2018	Instituto Hondureño del Café	b4660a57e9f8cc613ae5b8f02bfce8634c763ab4	7f521ca403540f81ec99daec7da19c2788393880	m	1100.00	1100.00	1100.00
1309	1310	Arabica	juan luis alvarado romero	Guatemala	finca el limon	NaN	beneficio serben	11/853/165	unicafe	4650	...	Green	4	May 24th, 2013	Asociacion Nacional Del Café	b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53	724f04ad10ed31dbb9d260f0dfd221ba48be8a95	ft	1417.32	1417.32	1417.32
1310	1312	Arabica	bismarck castro	Honduras	los hicaques	103	cigrah s.a de c.v.	13-111-053	cigrah s.a de c.v	1400	...	Green	2	April 28th, 2018	Instituto Hondureño del Café	b4660a57e9f8cc613ae5b8f02bfce8634c763ab4	7f521ca403540f81ec99daec7da19c2788393880	m	1400.00	1400.00	1400.00

1311 rows × 44 columns

so it prints nicely, as did coffee_df.describe().

If we leave the () off, we don’t get nice formatting:

coffee_df.describe

<bound method NDFrame.describe of       Unnamed: 0  Species                       Owner Country.of.Origin  \
            1  Arabica                   metad plc          Ethiopia   
            2  Arabica                   metad plc          Ethiopia   
            3  Arabica    grounds for health admin         Guatemala   
            4  Arabica         yidnekachew dabessa          Ethiopia   
            5  Arabica                   metad plc          Ethiopia   
...          ...      ...                         ...               ...   
      1307  Arabica    juan carlos garcia lopez            Mexico   
      1308  Arabica     myriam kaplan-pasternak             Haiti   
      1309  Arabica  exportadora atlantic, s.a.         Nicaragua   
      1310  Arabica   juan luis alvarado romero         Guatemala   
      1312  Arabica             bismarck castro          Honduras   

                                     Farm.Name                  Lot.Number  \
                                  metad plc                         NaN   
                                  metad plc                         NaN   
   san marcos barrancas "san cristobal cuch                         NaN   
      yidnekachew dabessa coffee plantation                         NaN   
                                  metad plc                         NaN   
...                                        ...                         ...   
                           el centenario                         NaN   
                               200 farms                         NaN   
                        finca las marías  017-053-0211/ 017-053-0212   
                          finca el limon                         NaN   
                            los hicaques                         103   

                                                   Mill  \
                                           metad plc   
                                           metad plc   
                                                 NaN   
                                             wolensu   
                                           metad plc   
...                                                 ...   
la esperanza, municipio juchique de ferrer, ve...   
      coeb koperativ ekselsyo basen (350 members)   
                       beneficio atlantic condega   
                                 beneficio serben   
                               cigrah s.a de c.v.   

                      ICO.Number                                Company  \
                    2014/2015      metad agricultural developmet plc   
                    2014/2015      metad agricultural developmet plc   
                          NaN                                    NaN   
                          NaN  yidnekachew debessa coffee plantation   
                    2014/2015      metad agricultural developmet plc   
...                          ...                                    ...   
                1104328663                              terra mia   
                       NaN                           haiti coffee   
017-053-0211/ 017-053-0212               exportadora atlantic s.a   
                11/853/165                                unicafe   
                13-111-053                      cigrah s.a de c.v   

           Altitude  ...       Color Category.Two.Defects  \
       1950-2200  ...       Green                    0   
       1950-2200  ...       Green                    1   
   1600 - 1800 m  ...         NaN                    0   
       1800-2200  ...       Green                    2   
       1950-2200  ...       Green                    2   
...             ...  ...         ...                  ...   
          900  ...        None                   20   
        ~350m  ...  Blue-Green                   16   
         1100  ...       Green                    5   
         4650  ...       Green                    4   
         1400  ...       Green                    2   

                Expiration                  Certification.Body  \
        April 3rd, 2016  METAD Agricultural Development plc   
        April 3rd, 2016  METAD Agricultural Development plc   
         May 31st, 2011        Specialty Coffee Association   
       March 25th, 2016  METAD Agricultural Development plc   
        April 3rd, 2016  METAD Agricultural Development plc   
...                    ...                                 ...   
September 17th, 2013                             AMECAFE   
      May 24th, 2013        Specialty Coffee Association   
      June 6th, 2018        Instituto Hondureño del Café   
      May 24th, 2013        Asociacion Nacional Del Café   
    April 28th, 2018        Instituto Hondureño del Café   

                         Certification.Address  \
   309fcf77415a3661ae83e027f7e5f05dad786e44   
   309fcf77415a3661ae83e027f7e5f05dad786e44   
   36d0d00a3724338ba7937c52a378d085f2172daa   
   309fcf77415a3661ae83e027f7e5f05dad786e44   
   309fcf77415a3661ae83e027f7e5f05dad786e44   
...                                        ...   
59e396ad6e22a1c22b248f958e1da2bd8af85272   
36d0d00a3724338ba7937c52a378d085f2172daa   
b4660a57e9f8cc613ae5b8f02bfce8634c763ab4   
b1f20fe3a819fd6b2ee0eb8fdc3da256604f1e53   
b4660a57e9f8cc613ae5b8f02bfce8634c763ab4   

                         Certification.Contact unit_of_measurement  \
   19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
   19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
   0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660                   m   
   19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
   19fef5a731de2db57d16da10287413f5f99bc2dd                   m   
...                                        ...                 ...   
0eb4ee5b3f47b20b049548a2fd1e7d4a2b70d0a7                   m   
0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660                   m   
7f521ca403540f81ec99daec7da19c2788393880                   m   
724f04ad10ed31dbb9d260f0dfd221ba48be8a95                  ft   
7f521ca403540f81ec99daec7da19c2788393880                   m   

     altitude_low_meters altitude_high_meters altitude_mean_meters  
              1950.00              2200.00              2075.00  
              1950.00              2200.00              2075.00  
              1600.00              1800.00              1700.00  
              1800.00              2200.00              2000.00  
              1950.00              2200.00              2075.00  
...                  ...                  ...                  ...  
            900.00               900.00               900.00  
            350.00               350.00               350.00  
           1100.00              1100.00              1100.00  
           1417.32              1417.32              1417.32  
           1400.00              1400.00              1400.00  

[1311 rows x 44 columns]>

So let’s check the type of that.

type(coffee_df.describe)

method

It’s a bound method: a function attached to the DataFrame that is applied to it when called. Here we only looked the method up; we didn’t actually run it. To see that it hasn’t run, we can use an IPython [1] magic, %%timeit.

%%timeit
coffee_df.describe

74.8 ns ± 0.0775 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

%%timeit
coffee_df.describe()

28.3 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that without the () it runs much, much faster. This signals that it did less work: looking up the method takes far less calculation than computing statistics on the data.
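The same comparison can be sketched outside a notebook with the standard-library timeit module in place of the %%timeit magic. The small demo_df below is a made-up stand-in for coffee_df, so the absolute times will not match the ones above; only the gap between looking up and calling is the point.

```python
import timeit

import pandas as pd

# made-up stand-in for coffee_df, so this runs without downloading data
demo_df = pd.DataFrame({"Flavor": [7.5, 7.8, 8.1], "Balance": [7.4, 7.6, 8.0]})

method_obj = demo_df.describe    # no parentheses: only looks the method up
summary = demo_df.describe()     # parentheses: actually computes the statistics

print(callable(method_obj))      # True, it is something we could still call
print(type(summary).__name__)    # DataFrame

# time 100 lookups against 100 real calls
lookup_time = timeit.timeit(lambda: demo_df.describe, number=100)
call_time = timeit.timeit(lambda: demo_df.describe(), number=100)
print(f"lookup: {lookup_time:.6f} s, call: {call_time:.6f} s")
```

Calling method_obj() later gives the same table as demo_df.describe(), which is another way to see that the first line only stored the method.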

7.3. Basic plots in pandas

Pandas gives us basic plots.

coffee_df['Flavor'].plot()

<AxesSubplot:>

Since we chose a Series, it plotted that data as a line against the index.

coffee_df.index

RangeIndex(start=0, stop=1311, step=1)

We can change the kind, for example to a Kernel Density Estimate (KDE). This approximates the distribution of the data; you can think of it roughly as a smoothed-out histogram.

coffee_df['Flavor'].plot(kind='kde')

<AxesSubplot:ylabel='Density'>

We can also plot two variables as a scatter plot, by specifying the x, y and kind

coffee_df.plot(x='Flavor',y='Balance', kind='scatter')

*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.

<AxesSubplot:xlabel='Flavor', ylabel='Balance'>

Let’s make a histogram of the Balance variable.

coffee_df['Balance'].plot(kind='hist')

<AxesSubplot:ylabel='Frequency'>
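The text printed under each plot is the matplotlib Axes object that pandas returns, and we can keep it in a variable and inspect it. The sketch below uses a small made-up demo_df in place of coffee_df and selects the non-interactive Agg backend (an assumption, so that it also runs outside a notebook).

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# made-up stand-in for coffee_df
demo_df = pd.DataFrame({
    "Flavor": [7.5, 7.8, 8.1, 7.6],
    "Balance": [7.4, 7.6, 8.0, 7.5],
})

# a Series plot: one variable, so we only choose the kind
hist_ax = demo_df["Balance"].plot(kind="hist")
print(hist_ax.get_ylabel())

# a DataFrame plot: two variables, so we name x and y as well
scatter_ax = demo_df.plot(x="Flavor", y="Balance", kind="scatter")
print(scatter_ax.get_xlabel(), scatter_ax.get_ylabel())
```

The labels match what appears in the printed output above: the histogram labels its y axis Frequency, and the scatter plot labels its axes with the column names.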

Question from class

Can we plot two histograms with coffee_df['Balance']['Flavor'].plot(kind='hist')?

coffee_df['Balance']['Flavor'].plot(kind='hist')

KeyError: 'Flavor'

Let’s break down why that raises an error. When we chain operations, Python evaluates them from left to right, passing the output of each step as the input to the next.
So coffee_df['Balance'].plot(kind='hist') first selected a Series, then plotted it. In the line above, the first step again selects the Series, which works:

coffee_df['Balance'].head(2)

0    8.42
1    8.42
Name: Balance, dtype: float64

But then we tried to index that Series with ‘Flavor’, and a single column has no ‘Flavor’ label in its index:

coffee_df['Balance']['Flavor']

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 coffee_df['Balance']['Flavor']

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:958, in Series.__getitem__(self, key)
    955     return self._values[key]
    957 elif key_is_scalar:
--> 958     return self._get_value(key)
    960 if is_hashable(key):
    961     # Otherwise index.get_value will raise InvalidIndexError
    962     try:
    963         # For labels that don't resolve as scalars like tuples and frozensets

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/series.py:1069, in Series._get_value(self, label, takeable)
   1066     return self._values[label]
   1068 # Similar to Index.get_value, but we do not fall back to positional
-> 1069 loc = self.index.get_loc(label)
   1070 return self.index._get_values_for_loc(self, loc, label)

File /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/pandas/core/indexes/range.py:389, in RangeIndex.get_loc(self, key, method, tolerance)
    387             raise KeyError(key) from err
    388     self._check_indexing_error(key)
--> 389     raise KeyError(key)
    390 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'Flavor'

So we get a key error and we know this is the part of the line we have to change.

We need to index into the DataFrame and pick two columns at once. When we index, we can use the name of a variable as a string, or a list of names. We can build this list on the fly, because Python executes from the inside out.
The outer [ ] index and the inner [ ] make a list:

coffee_df[['Balance','Flavor']].head(2)

	Balance	Flavor
0	8.42	8.83
1	8.42	8.67
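The difference between the two kinds of brackets can be checked directly. This is a minimal sketch on a two-row made-up demo_df that reuses the values shown above; it is a stand-in for coffee_df, not the full data.

```python
import pandas as pd

# two-row stand-in for coffee_df
demo_df = pd.DataFrame({"Balance": [8.42, 8.42], "Flavor": [8.83, 8.67]})

one_col = demo_df["Balance"]                # a string key gives a Series
two_cols = demo_df[["Balance", "Flavor"]]   # a list key gives a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)

# chaining a second string key looks for a row label named 'Flavor'
missing_label = None
try:
    one_col["Flavor"]
except KeyError as err:
    missing_label = err.args[0]
    print("KeyError:", missing_label)
```

Once a single column has been selected, the only labels left are the row labels 0 and 1, so the second lookup has nothing named Flavor to find.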

We could also build the list first, then index, for readability:

hist_vars = ['Balance','Flavor']
coffee_df[hist_vars].head(2)

	Balance	Flavor
0	8.42	8.83
1	8.42	8.67

This gives us a data frame, which we can plot.

coffee_df[['Balance','Flavor']].plot(kind='hist')

<AxesSubplot:ylabel='Frequency'>

We’ll see ways to improve this on Friday.

7.4. Plotting in Python

matplotlib: low level plotting tools
seaborn: high level plotting with opinionated defaults
ggplot: plotting based on the ggplot library in R.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.

We can get that basic plot back.

sns.scatterplot(data=coffee_df,x='Flavor',y='Balance')

<AxesSubplot:xlabel='Flavor', ylabel='Balance'>

But now we have more power to investigate more relationships in the data.

sns.scatterplot(data=coffee_df,x='Flavor',y='Balance',hue='Color')

<AxesSubplot:xlabel='Flavor', ylabel='Balance'>

From this we can see that the color doesn’t appear to be related to the flavor or balance scores, but that flavor and balance are related to each other.

We can also break this apart. lmplot is a higher-level plotting function, so it allows us to create grids of plots, and by default it also includes a regression line. We’ll turn that off for now, with fit_reg=False.

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               col='Color',fit_reg=False)

<seaborn.axisgrid.FacetGrid at 0x7ff7d31776a0>

col stands for column. We can also use row

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               row='Color')

<seaborn.axisgrid.FacetGrid at 0x7ff7d2d51fa0>

We can also use both together:

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               row='Color',col='Variety')

<seaborn.axisgrid.FacetGrid at 0x7ff7d27eefa0>
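The FacetGrid that lmplot returns holds one matplotlib Axes per panel, arranged as rows by columns. The sketch below uses six made-up rows rather than coffee_df, selects the Agg backend so it can run outside a notebook, and turns the regression line off; all of those choices are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd
import seaborn as sns

# six made-up rows: 2 colors x 3 varieties
demo_df = pd.DataFrame({
    "Flavor": [7.5, 7.8, 8.1, 7.6, 7.9, 8.0],
    "Balance": [7.4, 7.6, 8.0, 7.5, 7.7, 7.9],
    "Color": ["Green", "Green", "Green",
              "Blue-Green", "Blue-Green", "Blue-Green"],
    "Variety": ["Bourbon", "Caturra", "Typica",
                "Bourbon", "Caturra", "Typica"],
})

grid = sns.lmplot(data=demo_df, x="Flavor", y="Balance",
                  row="Color", col="Variety", fit_reg=False)

# one Axes per (row value, col value) pair
print(grid.axes.shape)
```

With the real data, the grid gets one row per Color value and one column per Variety value, which is why it becomes large quickly.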

How could we choose which countries to select, so that we don’t show the ones with very few points?

coffee_df['Country.of.Origin'].value_counts()

Mexico                          236
Colombia                        183
Guatemala                       181
Brazil                          132
Taiwan                           75
United States (Hawaii)           73
Honduras                         53
Costa Rica                       51
Ethiopia                         44
Tanzania, United Republic Of     40
Thailand                         32
Uganda                           26
Nicaragua                        26
Kenya                            25
El Salvador                      21
Indonesia                        20
China                            16
Malawi                           11
Peru                             10
United States                     8
Myanmar                           8
Vietnam                           7
Haiti                             6
Philippines                       5
Panama                            4
United States (Puerto Rico)       4
Laos                              3
Burundi                           2
Ecuador                           1
Rwanda                            1
Japan                             1
Zambia                            1
Papua New Guinea                  1
Mauritius                         1
Cote d?Ivoire                     1
India                             1
Name: Country.of.Origin, dtype: int64
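One possible way to use these counts is to keep only the countries with at least some minimum number of rows, then plot the filtered data. This is a sketch on a small made-up demo_df; the threshold name min_rows and its value are hypothetical and would need to be chosen for the real data.

```python
import pandas as pd

# made-up stand-in for coffee_df
demo_df = pd.DataFrame({
    "Country.of.Origin": ["Mexico", "Mexico", "Mexico",
                          "Colombia", "Colombia", "Japan"],
    "Flavor": [7.5, 7.8, 8.1, 7.6, 7.9, 8.0],
})

min_rows = 2  # hypothetical threshold

# count the rows per country, then keep the labels that meet the threshold
counts = demo_df["Country.of.Origin"].value_counts()
common = counts[counts >= min_rows].index

# isin builds a True/False mask that we use to select rows
filtered_df = demo_df[demo_df["Country.of.Origin"].isin(common)]
print(sorted(filtered_df["Country.of.Origin"].unique()))
```

The filtered frame can then be passed as data to any of the plotting calls above in place of the full one.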

Or we can keep all of the countries, but wrap them onto multiple rows.

sns.lmplot(data=coffee_df,x='Flavor',y='Balance',hue='Color',
               col='Country.of.Origin',col_wrap=5)

<seaborn.axisgrid.FacetGrid at 0x7ff7cc6b22e0>

7.5. Questions after class

Ram Token Opportunity

add a question with a pull request; earn 1-2 ram tokens for submitting a question with the answer (with sources)

7.6. More practice

Plot the kde for the Aftertaste
How does Total.Cup.Points vary by Certification.Body?
Are moisture and sweetness related? Does that relationship vary by Color?

1: IPython, the Python kernel we’re using

Programming for Data Science at URI Fall 2021
