8. Repairing values#
So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.
Today, we’ll deal with how to fix the values within the data.
8.1. Cleaning Data review#
Instead of more practice with these manipulations, below are more
examples of cleaning data to see how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general
idea of what good data looks like and good habits for cleaning data and keeping
it reproducible.
The All Shades article is another example of cleaning data in practice. Also, here are some tips on general data management and organization.
This article is a comprehensive discussion of data cleaning.
8.1.1. A Cleaning Data Recipe#
This is not everything possible, but it is good enough for this course (a minimal sketch follows the list):
Can you use parameters to read the data in better?
Fix the index and column headers (making these easier to use makes the rest easier)
Is the data structured well?
Are there missing values?
Do the datatypes match what you expect by looking at the head or a sample?
Are categorical variables represented in a usable way?
Does your analysis require filtering or augmenting the data?
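As a minimal sketch of this recipe (the file name data.csv and the column some_category are hypothetical placeholders, just to show the workflow):

import pandas as pd
# read with parameters that help (index_col fixes the index on read)
df = pd.read_csv('data.csv', index_col=0)
# fix the column headers so the rest is easier
df = df.rename(columns=lambda c: c.lower().replace('.', '_'))
# check structure, missing values, and datatypes
df.info()
df.sample(5)
# check that categorical variables are represented in a usable way
df['some_category'].value_counts()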
import pandas as pd
import seaborn as sns
import numpy as np

# toy DataFrames with one missing value, represented two different ways
na_toy_df_np = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan]])
na_toy_df_pd = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, pd.NA]])

# make plots look nicer and increase font size
sns.set_theme(font_scale=2)
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url,index_col=0)
rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
8.2. What is clean enough?#
This is a great question, without an easy answer.
It depends on what you want to do. This is why it’s important to have potential questions in mind if you are cleaning data for others, and why we often have to do a little more preparation even after a dataset has been “cleaned”.
8.3. Fixing Column names#
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
'Category.Two.Defects', 'Expiration', 'Certification.Body',
'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
dtype='object')
col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns}
coffee_df.rename(columns=col_name_mapper).head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | color | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df_fixedcols = coffee_df.rename(columns=col_name_mapper)
coffee_df_fixedcols.head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | color | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df_fixedcols['unit_of_measurement'].value_counts()
unit_of_measurement
m 1129
ft 182
Name: count, dtype: int64
coffee_df_fixedcols['unit_of_measurement'].replace({'m':'meters','ft':'feet'})
1 meters
2 meters
3 meters
4 meters
5 meters
...
1307 meters
1308 meters
1309 meters
1310 feet
1312 meters
Name: unit_of_measurement, Length: 1311, dtype: object
coffee_df_fixedcols['unit_of_measurement_long'] = coffee_df_fixedcols['unit_of_measurement'].replace(
{'m':'meters','ft':'feet'})
coffee_df_fixedcols.head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | unit_of_measurement_long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 | meters |
1 rows × 44 columns
8.4. Missing Values#
Dealing with missing data is a whole research area. There isn’t one solution.
In 2020 there was a whole workshop on missing values.
One organizer is the main developer of scikit-learn, the machine learning package we will use soon. In a 2020 invited talk he listed more automatic handling as an active area of research and a development goal for sklearn.
There are also many classic approaches both when training and when applying models.
An example application is in breast cancer detection.
Even in pandas, how to represent missing values symbolically is still under experimentation.
Missing values can even cause the datatypes of columns to change.
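We can see this with the toy DataFrames from the setup cell above (a quick sketch; note that pandas pads the shorter second row with a missing value):

# np.nan is a float, so columns containing it become float64
na_toy_df_np.dtypes
# pd.NA has no numpy equivalent, so its column falls back to object dtype
na_toy_df_pd.dtypes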
That said, pandas gives a few basic tools:
dropna
fillna
Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.
Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
Filling can be good if you know how to fill reasonably, but don’t have data to spare by dropping. For example (sketched below):
you can approximate with another column
you can approximate with that column from other rows
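As a sketch of both strategies on the coffee data (illustrative only: in this dataset the three altitude columns happen to be missing together, so the column-based fill would not actually recover anything here):

# approximate with another column: fill missing mean altitudes from the
# midpoint of the low and high columns (aligned by index)
alt_mid = (coffee_df_fixedcols['altitude_low_meters'] + coffee_df_fixedcols['altitude_high_meters'])/2
coffee_df_fixedcols['altitude_mean_meters'].fillna(alt_mid)
# approximate with that column from other rows: fill with the median
coffee_df_fixedcols['altitude_mean_meters'].fillna(coffee_df_fixedcols['altitude_mean_meters'].median())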
Special case: what if we’re filling a summary table? Filling with a symbol for printing can be a good choice, but not for analysis (see the sketch below).
Whatever you do, document it.
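For example, a minimal sketch of filling for display only (the summary table and the '-' placeholder are hypothetical):

# a toy summary table with a gap; a string placeholder reads well when
# printed, but would break any numeric analysis downstream
summary = pd.DataFrame({'mean_score': [82.5, np.nan], 'n_reviews': [10, 0]}, index=['groupA', 'groupB'])
summary.fillna('-')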
coffee_df_fixedcols.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1311 entries, 1 to 1312
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 1311 non-null object
1 owner 1304 non-null object
2 country_of_origin 1310 non-null object
3 farm_name 955 non-null object
4 lot_number 270 non-null object
5 mill 1001 non-null object
6 ico_number 1163 non-null object
7 company 1102 non-null object
8 altitude 1088 non-null object
9 region 1254 non-null object
10 producer 1081 non-null object
11 number_of_bags 1311 non-null int64
12 bag_weight 1311 non-null object
13 in_country_partner 1311 non-null object
14 harvest_year 1264 non-null object
15 grading_date 1311 non-null object
16 owner_1 1304 non-null object
17 variety 1110 non-null object
18 processing_method 1159 non-null object
19 aroma 1311 non-null float64
20 flavor 1311 non-null float64
21 aftertaste 1311 non-null float64
22 acidity 1311 non-null float64
23 body 1311 non-null float64
24 balance 1311 non-null float64
25 uniformity 1311 non-null float64
26 clean_cup 1311 non-null float64
27 sweetness 1311 non-null float64
28 cupper_points 1311 non-null float64
29 total_cup_points 1311 non-null float64
30 moisture 1311 non-null float64
31 category_one_defects 1311 non-null int64
32 quakers 1310 non-null float64
33 color 1044 non-null object
34 category_two_defects 1311 non-null int64
35 expiration 1311 non-null object
36 certification_body 1311 non-null object
37 certification_address 1311 non-null object
38 certification_contact 1311 non-null object
39 unit_of_measurement 1311 non-null object
40 altitude_low_meters 1084 non-null float64
41 altitude_high_meters 1084 non-null float64
42 altitude_mean_meters 1084 non-null float64
43 unit_of_measurement_long 1311 non-null object
dtypes: float64(16), int64(3), object(25)
memory usage: 460.9+ KB
8.4.1. Filling missing values#
The lot_number column has a lot of NaN values; how can we explore it?
We can look at the type:
coffee_df_fixedcols['lot_number'].dtype
dtype('O')
And we can look at the value counts.
coffee_df_fixedcols['lot_number'].value_counts()
lot_number
1 18
020/17 6
019/17 5
2 3
102 3
..
11/23/0696 1
3-59-2318 1
8885 1
5055 1
017-053-0211/ 017-053-0212 1
Name: count, Length: 221, dtype: int64
We see that a lot are ‘1’. Maybe we know that when the data was collected, if the farm only had one lot, some people recorded ‘1’ and others left it as missing. So we could fill in with ‘1’:
coffee_df_fixedcols['lot_number'].fillna('1')
1 1
2 1
3 1
4 1
5 1
...
1307 1
1308 1
1309 017-053-0211/ 017-053-0212
1310 1
1312 103
Name: lot_number, Length: 1311, dtype: object
Note that even after we called fillna, when we display the column again the original data is unchanged.
To save the filled-in column we have a few choices:
use the inplace parameter. This doesn’t offer performance advantages: it still copies the object and then reassigns the pointer, and it is under discussion to deprecate it.
write to a new DataFrame
add a column
We’ll use adding a column:
coffee_df_fixedcols['lot_number_clean'] = coffee_df_fixedcols['lot_number'].fillna('1')
coffee_df_fixedcols['lot_number_clean'].value_counts()
lot_number_clean
1 1059
020/17 6
019/17 5
102 3
103 3
...
3-59-2318 1
8885 1
5055 1
MCCFWXA15/16 1
017-053-0211/ 017-053-0212 1
Name: count, Length: 221, dtype: int64
8.4.2. Dropping missing values#
To illustrate how dropna works, we’ll use the shape attribute:
coffee_df_fixedcols.shape
(1311, 45)
coffee_df_fixedcols.dropna().shape
(130, 45)
By default, it drops any row with one or more NaN values.
We could instead tell it to only drop rows with NaN in a subset of the columns.
coffee_df_fixedcols.dropna(subset=['altitude_low_meters']).shape
(1084, 45)
coffee_alt_df = coffee_df_fixedcols.dropna(subset=['altitude_low_meters'])
In the Open Policing Project Data Summary we saw that they made a summary showing which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the thresh parameter and axis=1 to drop along columns.
n_rows, n_cols = coffee_df_fixedcols.shape
coffee_df_fixedcols.dropna(thresh = .7*n_rows, axis=1).shape
(1311, 44)
This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns.
coffee_df_fixedcols.dropna(thresh = .85*n_rows, axis=1).shape
(1311, 34)
8.5. Inconsistent values#
This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts and sort the result alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts() Series, the values from the coffee_df become the index, so we use sort_index.
Let’s look at the in_country_partner column:
coffee_df_fixedcols['in_country_partner'].value_counts().sort_index()
in_country_partner
AMECAFE 205
Africa Fine Coffee Association 49
Almacafé 178
Asociacion Nacional Del Café 155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Asociación de Cafés Especiales de Nicaragua 8
Blossom Valley International 58
Blossom Valley International\n 1
Brazil Specialty Coffee Association 67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Centro Agroecológico del Café A.C. 8
Coffee Quality Institute 7
Ethiopia Commodity Exchange 18
Instituto Hondureño del Café 60
Kenya Coffee Traders Association 22
METAD Agricultural Development plc 15
NUCOFFEE 36
Salvadoran Coffee Council 11
Specialty Coffee Ass 1
Specialty Coffee Association 295
Specialty Coffee Association of Costa Rica 42
Specialty Coffee Association of Indonesia 10
Specialty Coffee Institute of Asia 16
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Uganda Coffee Development Authority 22
Yunnan Coffee Exchange 12
Name: count, dtype: int64
We can see there’s only one Blossom Valley International\n but 58 Blossom Valley International; the former is likely a typo, especially since \n is a special character for a newline. Similarly with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.
partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df_clean = coffee_df_fixedcols.replace(partner_corrections)
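We can spot-check that the corrections took effect by re-running the sorted value counts (output omitted here; the two typo variants should now be folded into the canonical names):

# the '\n' variant and 'Specialty Coffee Ass' should no longer appear
coffee_df_clean['in_country_partner'].value_counts().sort_index()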
8.6. Example: Unpacking Jsons#
rhodyprog4ds_gh_events_url
'https://api.github.com/orgs/rhodyprog4ds/events'
gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
gh_df.head()
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'ref': 'c22', 'ref_type': 'tag', 'master_bran... | True | 2023-04-21 00:47:50+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
1 | 28565028952 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133757... | True | 2023-04-21 00:47:24+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
2 | 28508780091 | PushEvent | {'id': 41898282, 'login': 'github-actions[bot]... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:08:30+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
3 | 28508755066 | PushEvent | {'id': 41898282, 'login': 'github-actions[bot]... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:06:30+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
4 | 28508702433 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:02:22+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
Some datasets have a nested structure.
We want to transform each of those dictionary-like values into a row in a data frame.
We can see each row is a Series type.
type(gh_df.loc[0])
pandas.core.series.Series
a= '1'
type(a)
str
Recall that base Python types can be used as functions to cast an object from one type to another.
type(int(a))
int
This works with pandas Series too:
pd.Series(gh_df.loc[0]['actor'])
id 10656079
login brownsarahm
display_login brownsarahm
gravatar_id
url https://api.github.com/users/brownsarahm
avatar_url https://avatars.githubusercontent.com/u/10656079?
dtype: object
We can use pandas apply to do the same thing to every item in a dataset (over rows or columns as the items). For example:
gh_df['actor'].apply(pd.Series).head()
id | login | display_login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? |
compared to the original:
gh_df.head(1)
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'ref': 'c22', 'ref_type': 'tag', 'master_bran... | True | 2023-04-21 00:47:50+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
We want to handle several columns this way, so we’ll make a list of the names.
js_cols = ['actor','repo','payload','org']
pd.concat takes a list of DataFrames and puts them together in one DataFrame.
pd.concat([gh_df[col].apply(pd.Series) for col in js_cols],axis=1).head()
id | login | display_login | gravatar_id | url | avatar_url | id | name | url | ref | ... | before | commits | action | release | issue | id | login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | c22 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/main | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/gh-pages | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/gh-pages | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/main | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
5 rows × 29 columns
This is close, but a lot of columns have the same name. To fix this we will rename the new columns so that they have the original column name appended to the new name.
pandas has a rename method for this, and this is another job for lambdas. One caution: if we pass the lambda positionally, rename applies it to the index, whose labels here are integers, which causes the TypeError below; we need the columns keyword.
pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[35], line 1
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
Cell In[35], line 1, in <listcomp>(.0)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:5440, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
5321 def rename(
5322 self,
5323 mapper: Renamer | None = None,
(...)
5331 errors: IgnoreRaise = "ignore",
5332 ) -> DataFrame | None:
5333 """
5334 Rename columns or index labels.
5335
(...)
5438 4 3 6
5439 """
-> 5440 return super()._rename(
5441 mapper=mapper,
5442 index=index,
5443 columns=columns,
5444 axis=axis,
5445 copy=copy,
5446 inplace=inplace,
5447 level=level,
5448 errors=errors,
5449 )
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:1034, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
1027 missing_labels = [
1028 label
1029 for index, label in enumerate(replacements)
1030 if indexer[index] == -1
1031 ]
1032 raise KeyError(f"{missing_labels} not found in axis")
-> 1034 new_index = ax._transform_index(f, level=level)
1035 result._set_axis_nocheck(new_index, axis=axis_no, inplace=True, copy=False)
1036 result._clear_item_cache()
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in Index._transform_index(self, func, level)
6202 return type(self).from_arrays(values)
6203 else:
-> 6204 items = [func(x) for x in self]
6205 return Index(items, name=self.name, tupleize_cols=False)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in <listcomp>(.0)
6202 return type(self).from_arrays(values)
6203 else:
-> 6204 items = [func(x) for x in self]
6205 return Index(items, name=self.name, tupleize_cols=False)
Cell In[35], line 1, in <lambda>(c)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
TypeError: sequence item 0: expected str instance, int found
gh_df['actor'].apply(pd.Series).rename(columns=lambda c: '_'.join([c,'actor']))
id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | |
---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
5 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
6 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
7 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
8 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
9 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
10 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
11 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
12 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
13 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
14 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
15 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
16 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
17 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
18 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
19 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
20 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
21 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
22 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
23 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
24 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
25 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
26 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
27 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
28 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
29 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? |
json_cols_df = pd.concat([gh_df[col].apply(pd.Series).rename(columns=lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
gh_df.columns
Index(['id', 'type', 'actor', 'repo', 'payload', 'public', 'created_at',
'org'],
dtype='object')
json_cols_df.columns
Index(['id_actor', 'login_actor', 'display_login_actor', 'gravatar_id_actor',
'url_actor', 'avatar_url_actor', 'id_repo', 'name_repo', 'url_repo',
'ref_payload', 'ref_type_payload', 'master_branch_payload',
'description_payload', 'pusher_type_payload', 'repository_id_payload',
'push_id_payload', 'size_payload', 'distinct_size_payload',
'head_payload', 'before_payload', 'commits_payload', 'action_payload',
'release_payload', 'issue_payload', 'id_org', 'login_org',
'gravatar_id_org', 'url_org', 'avatar_url_org'],
dtype='object')
Then we can put the two parts of the data together. Note what happens below when we forget axis=1: concat stacks the frames vertically instead of aligning them side by side.
pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],)
id | type | public | created_at | id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | ... | before_payload | commits_payload | action_payload | release_payload | issue_payload | id_org | login_org | gravatar_id_org | url_org | avatar_url_org | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.856503e+10 | CreateEvent | True | 2023-04-21 00:47:50+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2.856503e+10 | PushEvent | True | 2023-04-21 00:47:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2.850878e+10 | PushEvent | True | 2023-04-19 01:08:30+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2.850876e+10 | PushEvent | True | 2023-04-19 01:06:30+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2.850870e+10 | PushEvent | True | 2023-04-19 01:02:22+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 2.850869e+10 | ReleaseEvent | True | 2023-04-19 01:01:45+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 2.850868e+10 | CreateEvent | True | 2023-04-19 01:00:40+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 2.850868e+10 | PushEvent | True | 2023-04-19 01:00:20+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 2.838014e+10 | PushEvent | True | 2023-04-13 01:55:52+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 2.838009e+10 | ReleaseEvent | True | 2023-04-13 01:51:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | 2.838007e+10 | CreateEvent | True | 2023-04-13 01:49:36+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | 2.838006e+10 | PushEvent | True | 2023-04-13 01:49:22+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | 2.835099e+10 | PushEvent | True | 2023-04-12 02:28:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | 2.835091e+10 | ReleaseEvent | True | 2023-04-12 02:22:35+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | 2.835088e+10 | CreateEvent | True | 2023-04-12 02:21:10+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | 2.835088e+10 | PushEvent | True | 2023-04-12 02:21:12+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | 2.834636e+10 | PushEvent | True | 2023-04-11 21:18:20+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 | 2.834622e+10 | PushEvent | True | 2023-04-11 21:10:21+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18 | 2.833450e+10 | PushEvent | True | 2023-04-11 13:02:29+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | 2.833432e+10 | PushEvent | True | 2023-04-11 12:56:32+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | 2.826148e+10 | PushEvent | True | 2023-04-06 23:43:27+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | 2.826142e+10 | PushEvent | True | 2023-04-06 23:36:57+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | 2.820908e+10 | PushEvent | True | 2023-04-05 01:28:11+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
23 | 2.820905e+10 | ReleaseEvent | True | 2023-04-05 01:26:09+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
24 | 2.820902e+10 | CreateEvent | True | 2023-04-05 01:22:48+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25 | 2.820901e+10 | PushEvent | True | 2023-04-05 01:22:25+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26 | 2.820560e+10 | IssuesEvent | True | 2023-04-04 21:04:21+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
27 | 2.817440e+10 | PushEvent | True | 2023-04-03 19:24:10+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
28 | 2.817424e+10 | PushEvent | True | 2023-04-03 19:16:08+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
29 | 2.810662e+10 | ReleaseEvent | True | 2023-03-31 01:37:29+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
0 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | NaN | NaN | NaN | NaT | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | NaN | NaN | NaN | NaT | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
35 rows × 33 columns
With axis=1 the columns line up as intended, and finally we can save this:
gh_df_clean = pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],axis=1)
gh_df_clean.head()
id | type | public | created_at | id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | ... | before_payload | commits_payload | action_payload | release_payload | issue_payload | id_org | login_org | gravatar_id_org | url_org | avatar_url_org | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | True | 2023-04-21 00:47:50+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 28565028952 | PushEvent | True | 2023-04-21 00:47:24+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 28508780091 | PushEvent | True | 2023-04-19 01:08:30+00:00 | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 28508755066 | PushEvent | True | 2023-04-19 01:06:30+00:00 | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 28508702433 | PushEvent | True | 2023-04-19 01:02:22+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
5 rows × 33 columns
If we want to analyze this data, this is a good place to save it to disk and start the analysis in a separate notebook.
gh_df_clean.to_csv('gh_events_unpacked.csv')
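In that separate notebook, reading it back might look like this (a sketch; index_col=0 recovers the saved index):

import pandas as pd
# reload the cleaned events; nested values (e.g. commits_payload)
# come back as plain strings from a CSV
events_df = pd.read_csv('gh_events_unpacked.csv', index_col=0)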
8.7. Questions After Class#
8.7.1. How the apply function works/use cases?#
A4 will give you some examples, especially the airline dataset. We will also keep seeing it come up as we manipulate data more.
the apply docs have tiny examples that help illustrate what it does and some of how it works. The pandas faq has a section on apply and similar methods that gives some more use cases.
8.7.2. Is there a better way to see how many missing values?#
There are lots of ways, and all are fine. We used info in class because I was trying to use a method we had already seen.
info focuses on how many values are present instead of how many are missing because that makes more sense in most cases; the more common question is: are there enough values to make decisions with?
If you wanted to get counts of the missing values, you can use the pandas isna function. It is a pandas function (the docs describe it as pandas.isna), not a DataFrame method (which would be described like pandas.DataFrame.methodname).
This means we use it like:
value_to_test = 4
pd.isna(value_to_test)
False
Try it Yourself
Pass different values to this function, like: False, np.nan (requires import numpy as np), pd.NA, and 'hello'.
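A quick sketch of that exercise:

import numpy as np
import pandas as pd
# only the actual missing markers register as missing;
# False and 'hello' are ordinary values
for val in [False, np.nan, pd.NA, 'hello']:
    print(repr(val), pd.isna(val))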
help(pd.isna)
Help on function isna in module pandas.core.dtypes.missing:
isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates
whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
in object arrays, ``NaT`` in datetimelike).
Parameters
----------
obj : scalar or array-like
Object to check for null or missing values.
Returns
-------
bool or array-like of bool
For scalar input, returns a scalar boolean.
For array input, returns an array of boolean indicating whether each
corresponding element is missing.
See Also
--------
notna : Boolean inverse of pandas.isna.
Series.isna : Detect missing values in a Series.
DataFrame.isna : Detect missing values in a DataFrame.
Index.isna : Detect missing values in an Index.
Examples
--------
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog')
False
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)
True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan, 3.],
[ 4., 5., nan]])
>>> pd.isna(array)
array([[False, True, False],
[False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
... "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
dtype='datetime64[ns]', freq=None)
>>> pd.isna(index)
array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
0 1 2
0 ant bee cat
1 dog None fly
>>> pd.isna(df)
0 1 2
0 False False False
1 False True False
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
The docstring says that it returns “bool or array-like of bool”, but if we go to the website docs, which have more examples, we can see that it will return a DataFrame if we pass it a DataFrame. Then we can use the pandas.DataFrame.sum method:
pd.isna(coffee_df_clean).sum()
species 0
owner 7
country_of_origin 1
farm_name 356
lot_number 1041
mill 310
ico_number 148
company 209
altitude 223
region 57
producer 230
number_of_bags 0
bag_weight 0
in_country_partner 0
harvest_year 47
grading_date 0
owner_1 7
variety 201
processing_method 152
aroma 0
flavor 0
aftertaste 0
acidity 0
body 0
balance 0
uniformity 0
clean_cup 0
sweetness 0
cupper_points 0
total_cup_points 0
moisture 0
category_one_defects 0
quakers 1
color 267
category_two_defects 0
expiration 0
certification_body 0
certification_address 0
certification_contact 0
unit_of_measurement 0
altitude_low_meters 227
altitude_high_meters 227
altitude_mean_meters 227
unit_of_measurement_long 0
lot_number_clean 0
dtype: int64
8.7.3. In col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns}, what is the {} for?#
This is called a dictionary comprehension. It is very similar to a list comprehension, and it is one of the defined ways to build a dict type object.
We also saw one when we looked at different types in a previous class.
{char:i for i,char in enumerate('abcde')}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
enumerate is a built-in function that iterates over the items in an iterable type (list-like) and yields each value paired with its index within the structure.
This way we get each character and its position. We could use this as follows:
num_chars = {char:i for i,char in enumerate('abcde')}
alpha_data = ['a','d','e','c','b']
# look up each character's position using the dictionary
[num_chars[char] for char in alpha_data]
[0, 3, 4, 2, 1]