5. Visualization#

Warning

If your plots do not show, include this in any cell. The % signals that this is an ipython magic. This one controls matplotlib. Jupyter uses the IPython python kernel.

import pandas as pd
import seaborn as sns

5.1. Summarizing Review#

We will start with the same dataset we hvae been working with

robusta_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/robusta_data_cleaned.csv'
robusta_df = pd.read_csv(robusta_data_url,index_col=0)
robusta_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Robusta ankole coffee producers coop Uganda kyangundu cooperative society NaN ankole coffee producers 0 ankole coffee producers coop 1488 sheema south western ... Green 2 June 26th, 2015 Uganda Coffee Development Authority e36d0270932c3b657e96b7b0278dfd85dc0fe743 03077a1c6bac60e6f514691634a7f6eb5c85aae8 m 1488.0 1488.0 1488.0

1 rows × 43 columns

Is the robust coffee’s Mouthfeel or the Aftertaste more consistently scored in this dataset?

Why?

cols_to_compare = ['Mouthfeel', 'Aftertaste']
robusta_df[cols_to_compare].std()
Mouthfeel     0.725152
Aftertaste    0.342469
dtype: float64

from the lower std we can see that Aftertaste is more consistently rated.

We will use a larger dataset for more interesting plots.

arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url,index_col=0)
coffee_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 43 columns

Recall, we can us built in plots in pandas.

coffee_df['Aftertaste'].plot(kind='hist')
<Axes: ylabel='Frequency'>
../_images/92fdc2b9f727d0682edc131c3fb09e35c1380501fe642c2eda6365d623d55f5a.png

5.2. Plotting in Python#

  • matplotlib: low level plotting tools

  • seaborn: high level plotting with opinionated defaults

  • ggplot: plotting based on the ggplot library in R.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.

There are lots of type of plots, we saw the basic patterns of how to use them and we’ve used a few types, but we cannot (and do not need to) go through every single type. There are general patterns that you can use that will help you think about what type of plot you might want and help you understand them to be able to customize plots.

[Seaborn’s main goal is opinionated defaults and flexible customization](https://seaborn.pydata.org/tutorial/introduction.html#opinionated-defaults-and-flexible-customization

5.2.1. Anatomy of a figure#

First is the matplotlib structure of a figure. BOth pandas and seaborn and other plotting libraries use matplotlib. Matplotlib was used in visualizing the first Black hole.

annotated graph

This is a lot of information, but these are good to know things. THe most important is the figure and the axes.

Try it Yourself

Make sure you can explain what is a figure and what are axes in your own words and why that distinction matters. Discuss in office hours if you are unsure.

that image was drawn with code and that page explains more.

5.2.2. Plotting Function types in Seaborn#

Seaborn has two levels or groups of plotting functions. Figure and axes. Figure level fucntions can plot with subplots.

summary of plot types

This is from thie overivew section of the official seaborn tutorial. It also includes a comparison of figure vs axes plotting.

The official introduction is also a good read.

5.2.3. More#

The seaborn gallery and matplotlib gallery are nice to look at too.

5.2.4. Styling in Seaborn#

Important

This was not covered in class, but can be helpful

Seaborn also lets us set a theme for visual styling This by default styles the plots to be more visually appealing

sns.set_theme(palette='colorblind')

the colorblind palette is more distinguishable under a variety fo colorblindness types. for more. Colorblind is a good default, but you can choose others that you like more too.

more on colors

robusta_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method',
       'Fragrance...Aroma', 'Flavor', 'Aftertaste', 'Salt...Acid',
       'Bitter...Sweet', 'Mouthfeel', 'Uniform.Cup', 'Clean.Cup', 'Balance',
       'Cupper.Points', 'Total.Cup.Points', 'Moisture', 'Category.One.Defects',
       'Quakers', 'Color', 'Category.Two.Defects', 'Expiration',
       'Certification.Body', 'Certification.Address', 'Certification.Contact',
       'unit_of_measurement', 'altitude_low_meters', 'altitude_high_meters',
       'altitude_mean_meters'],
      dtype='object')
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
       'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
       'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
       'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
       'Category.Two.Defects', 'Expiration', 'Certification.Body',
       'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')

Important

For seaborn the online documentation is immensely valuable. Every function’s page has basic documentation and lots of examples, so you can see how they use different paramters to modify plots visually. I strongly recommend reading it often. I recommend reading their tutorial too

sns.displot(data = coffee_df, x='Flavor')
<seaborn.axisgrid.FacetGrid at 0x7f3cd2879160>
../_images/286c1ff95741ccb4f715e2102c769767c8c48d82c737ef37294186f401fc9590.png

Note explain the layout warning

sns.displot(data = coffee_df, x='Aftertaste',)
<seaborn.axisgrid.FacetGrid at 0x7f3c87f8c160>
../_images/ea372c48d95db48a80a0309930530ca49a2737b6e27c6ac04d776ad69e6f9ac2.png
sns.scatterplot(data=coffee_df, x='Flavor', y='Balance')
<Axes: xlabel='Flavor', ylabel='Balance'>
../_images/ec5ac274e6a02d66f66cea66f682efb5a469a9ee4440c4cdd97123e9b36af81c.png
sns.relplot(data=coffee_df, x='Flavor', y='Balance',)
<seaborn.axisgrid.FacetGrid at 0x7f3c87f902e0>
../_images/c1646efd32fd4ef42611f1eefe5cbb53dd98f294e9be240b7ce57ae19e15ce5f.png

Plotting functions return an object that you can use to customize the plots further.

g = sns.relplot(data=coffee_df, x='Flavor', y='Balance',
           hue='Country.of.Origin',)
../_images/dcead8b89971cf158ff57f871ca6afee52bdcf37562d4246e2d0fb963c4d6d27.png
type(g)
seaborn.axisgrid.FacetGrid
g. # tab can show the options for attributes andmethods on this object
  Cell In[16], line 1
    g. # tab can show the options for attributes andmethods on this object
       ^
SyntaxError: invalid syntax

5.3. Bags per country#

How many bags of coffee are produced per country?

sns.catplot(data=coffee_df, x='Country.of.Origin',y='Number.of.Bags',
           kind='count');
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 1
----> 1 sns.catplot(data=coffee_df, x='Country.of.Origin',y='Number.of.Bags',
      2            kind='count');

File /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/seaborn/categorical.py:2764, in catplot(data, x, y, hue, row, col, kind, estimator, errorbar, n_boot, units, seed, order, hue_order, row_order, col_order, col_wrap, height, aspect, log_scale, native_scale, formatter, orient, color, palette, hue_norm, legend, legend_out, sharex, sharey, margin_titles, facet_kws, ci, **kwargs)
   2762         y = 1
   2763     elif x is not None and y is not None:
-> 2764         raise ValueError("Cannot pass values for both `x` and `y`.")
   2766 p = Plotter(
   2767     data=data,
   2768     variables=dict(x=x, y=y, hue=hue, row=row, col=col, units=units),
   (...)
   2774     legend=legend,
   2775 )
   2777 for var in ["row", "col"]:
   2778     # Handle faceting variables that lack name information

ValueError: Cannot pass values for both `x` and `y`.
coffee_df.shape
(1311, 43)
coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()
Country.of.Origin
Brazil                          30534
Burundi                           520
China                              55
Colombia                        41204
Costa Rica                      10354
Cote d?Ivoire                       2
Ecuador                             1
El Salvador                      4449
Ethiopia                        11761
Guatemala                       36868
Haiti                             390
Honduras                        13167
India                              20
Indonesia                        1658
Japan                              20
Kenya                            3971
Laos                               81
Malawi                            557
Mauritius                           1
Mexico                          24140
Myanmar                            10
Nicaragua                        6406
Panama                            537
Papua New Guinea                    7
Peru                             2336
Philippines                       259
Rwanda                            150
Taiwan                           1914
Tanzania, United Republic Of     3760
Thailand                         1310
Uganda                           3868
United States                     361
United States (Hawaii)            833
United States (Puerto Rico)        71
Vietnam                            10
Zambia                             13
Name: Number.of.Bags, dtype: int64

split-apply-combine

country_grouped = coffee_df.groupby('Country.of.Origin')
country_grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3c879e2d30>
bag_total_dict = {}

for country,df in country_grouped:
    tot_bags =  df['Number.of.Bags'].sum()
    bag_total_dict[country] = tot_bags

pd.DataFrame.from_dict(bag_total_dict, orient='index',
                           columns = ['Number.of.Bags.Sum'])
Number.of.Bags.Sum
Brazil 30534
Burundi 520
China 55
Colombia 41204
Costa Rica 10354
Cote d?Ivoire 2
Ecuador 1
El Salvador 4449
Ethiopia 11761
Guatemala 36868
Haiti 390
Honduras 13167
India 20
Indonesia 1658
Japan 20
Kenya 3971
Laos 81
Malawi 557
Mauritius 1
Mexico 24140
Myanmar 10
Nicaragua 6406
Panama 537
Papua New Guinea 7
Peru 2336
Philippines 259
Rwanda 150
Taiwan 1914
Tanzania, United Republic Of 3760
Thailand 1310
Uganda 3868
United States 361
United States (Hawaii) 833
United States (Puerto Rico) 71
Vietnam 10
Zambia 13
'a b'.split(' ')
['a', 'b']
a,b = 'a b'.split(' ')
a
'a'
b
'b'
bags_per_country_df = coffee_df.groupby('Country.of.Origin')['Number.of.Bags'].sum()
bags_per_country_df.plot(kind='bar')
<Axes: xlabel='Country.of.Origin'>
../_images/9a0ffa60bb90d6ce94c572610d41de115901af4c730622db3ecb54923e0ca86e.png
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
       'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
       'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
       'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
       'Category.Two.Defects', 'Expiration', 'Certification.Body',
       'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')
flavor_by_color = coffee_df.groupby('Color')['Flavor']
flavor_by_color.mean()
Color
Blue-Green      7.577317
Bluish-Green    7.581518
Green           7.491482
Name: Flavor, dtype: float64
flavor_by_color.std()
Color
Blue-Green      0.276513
Bluish-Green    0.301241
Green           0.413324
Name: Flavor, dtype: float64
sns.displot(data=coffee_df, x='Flavor',col='Color' )
<seaborn.axisgrid.FacetGrid at 0x7f3c7fec7f10>
../_images/f8038e56dcd31ee459668541b1f631138d844b03d473d8a305af578e355f3dcf.png
sns.displot(data=coffee_df, x='Flavor',hue='Color' ,kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7f3c7ff34370>
../_images/c77907c626abc4df830ba41a417d055a967e2aad573a85a7721aaf62b73ad111.png

5.4. Filtering a DataFrame#

Now, we’ll take just the country names out

How does color vary by country of origin for the top 10 countries with the most ratings?

df_country = coffee_df['Country.of.Origin'].value_counts()
top_countries = df_country[:10].index
top_countries
Index(['Mexico', 'Colombia', 'Guatemala', 'Brazil', 'Taiwan',
       'United States (Hawaii)', 'Honduras', 'Costa Rica', 'Ethiopia',
       'Tanzania, United Republic Of'],
      dtype='object', name='Country.of.Origin')

5.4.1. Filtering with a lambda and the apply method?#

check_top_country = lambda r: r['Country.of.Origin'] in top_countries
'Mexico' in top_countries
True
'USA' in top_countries
False
top_country_df = coffee_df[coffee_df.apply(check_top_country,axis=1)]

5.4.2. Filtering with isin#

and we can use that to filter the original DataFrame. To do this, we use isin to check each element in the 'Country.of.Origin' column is in that list.

coffee_df['Country.of.Origin'].isin(top_countries)
1        True
2        True
3        True
4        True
5        True
        ...  
1307     True
1308    False
1309    False
1310     True
1312     True
Name: Country.of.Origin, Length: 1311, dtype: bool

This is roughly equivalent to:

[country in top_countries for country in coffee_df['Country.of.Origin'] ]
[True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 ...]

except this builds a list and the pandas way makes a pd.Series object. The Python in operator is really helpful to know and pandas offers us an isin method to get that type of pattern.

In a more basic programming format this process would be two separate loops worth of work.

c_in = []
# iterate over the country of each rating
for country in coffee_df['Country.of.Origin']:
    # make a false temp value
    cur_search = False
    # iterate over top countries
    for tc in top_countries:
        # flip the value if the current top & rating cofee match
        if tc==country:
            cur_search = True
    # save the result of the search
    c_in.append(cur_search)

Try it yourself

Run these versions and confirm for yourself that they are the same.

With that list of booleans, we can then mask the original DataFrame. This keeps only the value where the inner quantity is True

top_coffee_df = coffee_df[coffee_df['Country.of.Origin'].isin(top_countries)]
top_coffee_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 43 columns

top_coffee_df.shape, coffee_df.shape
((1068, 43), (1311, 43))
sns.displot(data=top_coffee_df,x='Aftertaste', col='Country.of.Origin',col_wrap=5)
<seaborn.axisgrid.FacetGrid at 0x7f3cd2474cd0>
../_images/8f1d7f2519d939b899f3b3ee27051523104deee55b30e413480039bc20f277ae.png

5.5. Variable types and data types#

Related but not the same.

Data types are literal, related to the representation in the computer.

For exmaple, int16, int32, int64

We can also have mathematical types of numbers

  • Integers can be positive, 0, or negative.

  • Reals are continuous, infinite possibilities.

Variable types are about the meaning in a conceptual sense.

  • categorical (can take a discrete number of values, could be used to group data, could be a string or integer; unordered)

  • continuous (can take on any possible value, always a number)

  • binary (like data type boolean, but could be represented as yes/no, true/false, or 1/0, could be categorical also, but often makes sense to calculate rates)

  • ordinal (ordered, but appropriately categorical)

we’ll focus on the first two most of the time. Some values that are technically only integers range high enough that we treat them more like continuous most of the time.

5.6. Questions After Class#

5.6.1. how can I use data to make different graphs?#

You can use the help function on any of the plot functions we have seen. You can also use the documentation.

The seaborn gallery is a good place to get ideas

5.6.2. Will we learn about different types of graphs?#

We will use a few more types in class, but also you will probably use the documentation to learn more types. The seaborn gallery is a good place to start.

5.6.3. What are the advantages with work with the lambda function?#

The lambda function is convenient because it is defined on one line.

5.6.4. Is a lambda function similar to an arrow function in Javascript?#

It does look similar, but I am not an expert in javascipt.

5.6.5. How do you decide on the best graphs to use for each piece of data?#

Different types of plots make different conclusions easy to see.

5.6.6. why do you use axis = 1 for the check_top_countries#

5.6.7. could you do group by instead of value counts for the last question?#

yes

5.6.8. more about different ways to plot data and how to get specific rows and compare them#

5.6.9. is there a document that shows us all the diffrent commands that allow us to do things to files#