8. Repairing values#
So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.
Today, we’ll deal with how to fix the values within the data.
8.1. Cleaning Data review#
Instead of more practice with these manipulations, below are more
examples of cleaning data to see how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general
idea of what good data looks like and good habits for cleaning data and keeping
it reproducible.
The All Shades article is another example of cleaning data in practice. Also, here are some tips on general data management and organization.
This article is a comprehensive discussion of data cleaning.
8.1.1. A Cleaning Data Recipe#
This is not everything possible, but it is good enough for this course (a minimal sketch follows the list):
Can you use parameters to read the data in better?
Fix the index and column headers (making these easier to use makes the rest easier)
Is the data structured well?
Are there missing values?
Do the datatypes match what you expect by looking at the head or a sample?
Are categorical variables represented in a usable way?
Does your analysis require filtering or augmenting the data?
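As a minimal sketch of this recipe (the file name data.csv and the column some_category are hypothetical placeholders, just to show the workflow):

import pandas as pd
# read with parameters that help (index_col fixes the index on read)
df = pd.read_csv('data.csv', index_col=0)
# fix the column headers so the rest is easier
df = df.rename(columns=lambda c: c.lower().replace('.', '_'))
# check structure, missing values, and datatypes
df.info()
df.sample(5)
# check that categorical variables are represented in a usable way
df['some_category'].value_counts()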
import pandas as pd
import seaborn as sns
import numpy as np

# toy DataFrames with one missing value, represented two different ways
na_toy_df_np = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, np.nan]])
na_toy_df_pd = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, pd.NA]])

# make plots look nicer and increase font size
sns.set_theme(font_scale=2)
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url,index_col=0)
rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
8.2. What is clean enough?#
This is a great question, without an easy answer.
It depends on what you want to do. This is why it’s important to have potential questions in mind if you are cleaning data for others, and why we often have to do a little more preparation even after a dataset has been “cleaned”.
8.3. Fixing Column names#
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
'Category.Two.Defects', 'Expiration', 'Certification.Body',
'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
dtype='object')
col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns}
coffee_df.rename(columns=col_name_mapper).head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | color | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Color | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df_fixedcols = coffee_df.rename(columns=col_name_mapper)
coffee_df_fixedcols.head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | color | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
1 rows × 43 columns
coffee_df_fixedcols['unit_of_measurement'].value_counts()
unit_of_measurement
m 1129
ft 182
Name: count, dtype: int64
coffee_df_fixedcols['unit_of_measurement'].replace({'m':'meters','ft':'feet'})
1 meters
2 meters
3 meters
4 meters
5 meters
...
1307 meters
1308 meters
1309 meters
1310 feet
1312 meters
Name: unit_of_measurement, Length: 1311, dtype: object
coffee_df_fixedcols['unit_of_measurement_long'] = coffee_df_fixedcols['unit_of_measurement'].replace(
{'m':'meters','ft':'feet'})
coffee_df_fixedcols.head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | unit_of_measurement_long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 | meters |
1 rows × 44 columns
8.4. Missing Values#
Dealing with missing data is a whole research area. There isn’t one solution.
In 2020 there was a whole workshop on missing values.
One organizer is the main developer of scikit-learn, the machine learning package we will use soon. In a 2020 invited talk he listed more automatic handling as an active area of research and a development goal for sklearn.
There are also many classic approaches both when training and when applying models.
An example application is in breast cancer detection.
Even in pandas, how to represent missing values symbolically is still under experimentation.
Missing values can even cause the datatypes of columns to change.
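We can see this with the toy DataFrames from the setup cell above (a quick sketch; note that pandas pads the shorter second row with a missing value):

# np.nan is a float, so columns containing it become float64
na_toy_df_np.dtypes
# pd.NA has no numpy equivalent, so its column falls back to object dtype
na_toy_df_pd.dtypes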
That said, pandas gives a few basic tools:
dropna
fillna
Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.
Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
Filling can be good if you know how to fill reasonably, but don’t have data to spare by dropping. For example (sketched below):
you can approximate with another column
you can approximate with that column from other rows
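As a sketch of both strategies on the coffee data (illustrative only: in this dataset the three altitude columns happen to be missing together, so the column-based fill would not actually recover anything here):

# approximate with another column: fill missing mean altitudes from the
# midpoint of the low and high columns (aligned by index)
alt_mid = (coffee_df_fixedcols['altitude_low_meters'] + coffee_df_fixedcols['altitude_high_meters'])/2
coffee_df_fixedcols['altitude_mean_meters'].fillna(alt_mid)
# approximate with that column from other rows: fill with the median
coffee_df_fixedcols['altitude_mean_meters'].fillna(coffee_df_fixedcols['altitude_mean_meters'].median())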
Special case: what if we’re filling a summary table? Filling with a symbol for printing can be a good choice, but not for analysis (see the sketch below).
Whatever you do, document it.
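For example, a minimal sketch of filling for display only (the summary table and the '-' placeholder are hypothetical):

# a toy summary table with a gap; a string placeholder reads well when
# printed, but would break any numeric analysis downstream
summary = pd.DataFrame({'mean_score': [82.5, np.nan], 'n_reviews': [10, 0]}, index=['groupA', 'groupB'])
summary.fillna('-')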
coffee_df_fixedcols.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1311 entries, 1 to 1312
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 1311 non-null object
1 owner 1304 non-null object
2 country_of_origin 1310 non-null object
3 farm_name 955 non-null object
4 lot_number 270 non-null object
5 mill 1001 non-null object
6 ico_number 1163 non-null object
7 company 1102 non-null object
8 altitude 1088 non-null object
9 region 1254 non-null object
10 producer 1081 non-null object
11 number_of_bags 1311 non-null int64
12 bag_weight 1311 non-null object
13 in_country_partner 1311 non-null object
14 harvest_year 1264 non-null object
15 grading_date 1311 non-null object
16 owner_1 1304 non-null object
17 variety 1110 non-null object
18 processing_method 1159 non-null object
19 aroma 1311 non-null float64
20 flavor 1311 non-null float64
21 aftertaste 1311 non-null float64
22 acidity 1311 non-null float64
23 body 1311 non-null float64
24 balance 1311 non-null float64
25 uniformity 1311 non-null float64
26 clean_cup 1311 non-null float64
27 sweetness 1311 non-null float64
28 cupper_points 1311 non-null float64
29 total_cup_points 1311 non-null float64
30 moisture 1311 non-null float64
31 category_one_defects 1311 non-null int64
32 quakers 1310 non-null float64
33 color 1044 non-null object
34 category_two_defects 1311 non-null int64
35 expiration 1311 non-null object
36 certification_body 1311 non-null object
37 certification_address 1311 non-null object
38 certification_contact 1311 non-null object
39 unit_of_measurement 1311 non-null object
40 altitude_low_meters 1084 non-null float64
41 altitude_high_meters 1084 non-null float64
42 altitude_mean_meters 1084 non-null float64
43 unit_of_measurement_long 1311 non-null object
dtypes: float64(16), int64(3), object(25)
memory usage: 460.9+ KB
8.4.1. Filling missing values#
The lot_number column has a lot of NaN values; how can we explore it?
We can look at the type:
coffee_df_fixedcols['lot_number'].dtype
dtype('O')
And we can look at the value counts.
coffee_df_fixedcols['lot_number'].value_counts()
lot_number
1 18
020/17 6
019/17 5
2 3
102 3
..
11/23/0696 1
3-59-2318 1
8885 1
5055 1
017-053-0211/ 017-053-0212 1
Name: count, Length: 221, dtype: int64
We see that a lot are ‘1’. Maybe we know that when the data was collected, if the farm only had one lot, some people recorded ‘1’ and others left it as missing. So we could fill in with ‘1’:
coffee_df_fixedcols['lot_number'].fillna('1')
1 1
2 1
3 1
4 1
5 1
...
1307 1
1308 1
1309 017-053-0211/ 017-053-0212
1310 1
1312 103
Name: lot_number, Length: 1311, dtype: object
Note that even after we called fillna, when we display the column again the original data is unchanged.
To save the filled-in column we have a few choices:
use the inplace parameter. This doesn’t offer performance advantages: it still copies the object and then reassigns the pointer, and it is under discussion to deprecate it.
write to a new DataFrame
add a column
We’ll use adding a column:
coffee_df_fixedcols['lot_number_clean'] = coffee_df_fixedcols['lot_number'].fillna('1')
coffee_df_fixedcols['lot_number_clean'].value_counts()
lot_number_clean
1 1059
020/17 6
019/17 5
102 3
103 3
...
3-59-2318 1
8885 1
5055 1
MCCFWXA15/16 1
017-053-0211/ 017-053-0212 1
Name: count, Length: 221, dtype: int64
8.4.2. Dropping missing values#
To illustrate how dropna works, we’ll use the shape attribute:
coffee_df_fixedcols.shape
(1311, 45)
coffee_df_fixedcols.dropna().shape
(130, 45)
By default, it drops any row with one or more NaN values.
We could instead tell it to only drop rows with NaN in a subset of the columns.
coffee_df_fixedcols.dropna(subset=['altitude_low_meters']).shape
(1084, 45)
coffee_alt_df = coffee_df_fixedcols.dropna(subset=['altitude_low_meters'])
In the Open Policing Project Data Summary we saw that they made a summary showing which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the thresh parameter and axis=1 to drop along columns.
n_rows, n_cols = coffee_df_fixedcols.shape
coffee_df_fixedcols.dropna(thresh = .7*n_rows, axis=1).shape
(1311, 44)
This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns.
coffee_df_fixedcols.dropna(thresh = .85*n_rows, axis=1).shape
(1311, 34)
8.5. Inconsistent values#
This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts and sort the result alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts() Series, the values from the coffee_df become the index, so we use sort_index.
Let’s look at the in_country_partner column:
coffee_df_fixedcols['in_country_partner'].value_counts().sort_index()
in_country_partner
AMECAFE 205
Africa Fine Coffee Association 49
Almacafé 178
Asociacion Nacional Del Café 155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Asociación de Cafés Especiales de Nicaragua 8
Blossom Valley International 58
Blossom Valley International\n 1
Brazil Specialty Coffee Association 67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Centro Agroecológico del Café A.C. 8
Coffee Quality Institute 7
Ethiopia Commodity Exchange 18
Instituto Hondureño del Café 60
Kenya Coffee Traders Association 22
METAD Agricultural Development plc 15
NUCOFFEE 36
Salvadoran Coffee Council 11
Specialty Coffee Ass 1
Specialty Coffee Association 295
Specialty Coffee Association of Costa Rica 42
Specialty Coffee Association of Indonesia 10
Specialty Coffee Institute of Asia 16
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Uganda Coffee Development Authority 22
Yunnan Coffee Exchange 12
Name: count, dtype: int64
We can see there’s only one Blossom Valley International\n but 58 Blossom Valley International; the former is likely a typo, especially since \n is a special character for a newline. Similarly with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.
partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df_clean = coffee_df_fixedcols.replace(partner_corrections)
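We can spot-check that the corrections took effect by re-running the sorted value counts (output omitted here; the two typo variants should now be folded into the canonical names):

# the '\n' variant and 'Specialty Coffee Ass' should no longer appear
coffee_df_clean['in_country_partner'].value_counts().sort_index()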
8.6. Example: Unpacking Jsons#
rhodyprog4ds_gh_events_url
'https://api.github.com/orgs/rhodyprog4ds/events'
gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
gh_df.head()
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'ref': 'c22', 'ref_type': 'tag', 'master_bran... | True | 2023-04-21 00:47:50+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
1 | 28565028952 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133757... | True | 2023-04-21 00:47:24+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
2 | 28508780091 | PushEvent | {'id': 41898282, 'login': 'github-actions[bot]... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:08:30+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
3 | 28508755066 | PushEvent | {'id': 41898282, 'login': 'github-actions[bot]... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:06:30+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
4 | 28508702433 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'repository_id': 592944632, 'push_id': 133481... | True | 2023-04-19 01:02:22+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
Some datasets have a nested structure.
We want to transform each of those dictionary-like values into a row in a data frame.
We can see each row is a Series type.
type(gh_df.loc[0])
pandas.core.series.Series
a= '1'
type(a)
str
Recall that base Python types can be used as functions to cast an object from one type to another.
type(int(a))
int
This works with pandas Series too:
pd.Series(gh_df.loc[0]['actor'])
id 10656079
login brownsarahm
display_login brownsarahm
gravatar_id
url https://api.github.com/users/brownsarahm
avatar_url https://avatars.githubusercontent.com/u/10656079?
dtype: object
We can use pandas apply to do the same thing to every item in a dataset (over rows or columns as the items). For example:
gh_df['actor'].apply(pd.Series).head()
id | login | display_login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? |
compared to the original:
gh_df.head(1)
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... | {'ref': 'c22', 'ref_type': 'tag', 'master_bran... | True | 2023-04-21 00:47:50+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
We want to handle several columns this way, so we’ll make a list of the names.
js_cols = ['actor','repo','payload','org']
pd.concat takes a list of DataFrames and puts them together in one DataFrame.
pd.concat([gh_df[col].apply(pd.Series) for col in js_cols],axis=1).head()
id | login | display_login | gravatar_id | url | avatar_url | id | name | url | ref | ... | before | commits | action | release | issue | id | login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | c22 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/main | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/gh-pages | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/gh-pages | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 592944632 | rhodyprog4ds/BrownSpring23 | https://api.github.com/repos/rhodyprog4ds/Brow... | refs/heads/main | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
5 rows × 29 columns
This is close, but a lot of columns have the same name. To fix this we will rename the new columns so that they have the original column name appended to the new name.
pandas has a rename method for this, and this is another job for lambdas. One caution: if we pass the lambda positionally, rename applies it to the index, whose labels here are integers, which causes the TypeError below; we need the columns keyword.
pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[35], line 1
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
Cell In[35], line 1, in <listcomp>(.0)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:5440, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
5321 def rename(
5322 self,
5323 mapper: Renamer | None = None,
(...)
5331 errors: IgnoreRaise = "ignore",
5332 ) -> DataFrame | None:
5333 """
5334 Rename columns or index labels.
5335
(...)
5438 4 3 6
5439 """
-> 5440 return super()._rename(
5441 mapper=mapper,
5442 index=index,
5443 columns=columns,
5444 axis=axis,
5445 copy=copy,
5446 inplace=inplace,
5447 level=level,
5448 errors=errors,
5449 )
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:1034, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
1027 missing_labels = [
1028 label
1029 for index, label in enumerate(replacements)
1030 if indexer[index] == -1
1031 ]
1032 raise KeyError(f"{missing_labels} not found in axis")
-> 1034 new_index = ax._transform_index(f, level=level)
1035 result._set_axis_nocheck(new_index, axis=axis_no, inplace=True, copy=False)
1036 result._clear_item_cache()
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in Index._transform_index(self, func, level)
6202 return type(self).from_arrays(values)
6203 else:
-> 6204 items = [func(x) for x in self]
6205 return Index(items, name=self.name, tupleize_cols=False)
File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in <listcomp>(.0)
6202 return type(self).from_arrays(values)
6203 else:
-> 6204 items = [func(x) for x in self]
6205 return Index(items, name=self.name, tupleize_cols=False)
Cell In[35], line 1, in <lambda>(c)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
TypeError: sequence item 0: expected str instance, int found
gh_df['actor'].apply(pd.Series).rename(columns=lambda c: '_'.join([c,'actor']))
id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | |
---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
1 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
2 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
3 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
5 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
6 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
7 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
8 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
9 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
10 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
11 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
12 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
13 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
14 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
15 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
16 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
17 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
18 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
19 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
20 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
21 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
22 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
23 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
24 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
25 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
26 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
27 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | |
28 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | |
29 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? |
json_cols_df = pd.concat([gh_df[col].apply(pd.Series).rename(columns=lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
gh_df.columns
Index(['id', 'type', 'actor', 'repo', 'payload', 'public', 'created_at',
'org'],
dtype='object')
json_cols_df.columns
Index(['id_actor', 'login_actor', 'display_login_actor', 'gravatar_id_actor',
'url_actor', 'avatar_url_actor', 'id_repo', 'name_repo', 'url_repo',
'ref_payload', 'ref_type_payload', 'master_branch_payload',
'description_payload', 'pusher_type_payload', 'repository_id_payload',
'push_id_payload', 'size_payload', 'distinct_size_payload',
'head_payload', 'before_payload', 'commits_payload', 'action_payload',
'release_payload', 'issue_payload', 'id_org', 'login_org',
'gravatar_id_org', 'url_org', 'avatar_url_org'],
dtype='object')
Then we can put the two parts of the data together. Note what happens below when we forget axis=1: concat stacks the frames vertically instead of aligning them side by side.
pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],)
id | type | public | created_at | id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | ... | before_payload | commits_payload | action_payload | release_payload | issue_payload | id_org | login_org | gravatar_id_org | url_org | avatar_url_org | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.856503e+10 | CreateEvent | True | 2023-04-21 00:47:50+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2.856503e+10 | PushEvent | True | 2023-04-21 00:47:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2.850878e+10 | PushEvent | True | 2023-04-19 01:08:30+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2.850876e+10 | PushEvent | True | 2023-04-19 01:06:30+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2.850870e+10 | PushEvent | True | 2023-04-19 01:02:22+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 2.850869e+10 | ReleaseEvent | True | 2023-04-19 01:01:45+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 2.850868e+10 | CreateEvent | True | 2023-04-19 01:00:40+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | 2.850868e+10 | PushEvent | True | 2023-04-19 01:00:20+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 2.838014e+10 | PushEvent | True | 2023-04-13 01:55:52+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | 2.838009e+10 | ReleaseEvent | True | 2023-04-13 01:51:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 | 2.838007e+10 | CreateEvent | True | 2023-04-13 01:49:36+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11 | 2.838006e+10 | PushEvent | True | 2023-04-13 01:49:22+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 | 2.835099e+10 | PushEvent | True | 2023-04-12 02:28:24+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
13 | 2.835091e+10 | ReleaseEvent | True | 2023-04-12 02:22:35+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
14 | 2.835088e+10 | CreateEvent | True | 2023-04-12 02:21:10+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | 2.835088e+10 | PushEvent | True | 2023-04-12 02:21:12+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | 2.834636e+10 | PushEvent | True | 2023-04-11 21:18:20+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 | 2.834622e+10 | PushEvent | True | 2023-04-11 21:10:21+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
18 | 2.833450e+10 | PushEvent | True | 2023-04-11 13:02:29+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
19 | 2.833432e+10 | PushEvent | True | 2023-04-11 12:56:32+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | 2.826148e+10 | PushEvent | True | 2023-04-06 23:43:27+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
21 | 2.826142e+10 | PushEvent | True | 2023-04-06 23:36:57+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | 2.820908e+10 | PushEvent | True | 2023-04-05 01:28:11+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
23 | 2.820905e+10 | ReleaseEvent | True | 2023-04-05 01:26:09+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
24 | 2.820902e+10 | CreateEvent | True | 2023-04-05 01:22:48+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
25 | 2.820901e+10 | PushEvent | True | 2023-04-05 01:22:25+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
26 | 2.820560e+10 | IssuesEvent | True | 2023-04-04 21:04:21+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
27 | 2.817440e+10 | PushEvent | True | 2023-04-03 19:24:10+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
28 | 2.817424e+10 | PushEvent | True | 2023-04-03 19:16:08+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
29 | 2.810662e+10 | ReleaseEvent | True | 2023-03-31 01:37:29+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
0 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | NaN | NaN | NaN | NaT | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | NaN | NaN | NaN | NaT | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | NaN | NaN | NaN | NaT | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
35 rows × 33 columns
With axis=1 the columns line up as intended, and finally we can save this:
gh_df_clean = pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],axis=1)
gh_df_clean.head()
id | type | public | created_at | id_actor | login_actor | display_login_actor | gravatar_id_actor | url_actor | avatar_url_actor | ... | before_payload | commits_payload | action_payload | release_payload | issue_payload | id_org | login_org | gravatar_id_org | url_org | avatar_url_org | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 28565033114 | CreateEvent | True | 2023-04-21 00:47:50+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 28565028952 | PushEvent | True | 2023-04-21 00:47:24+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 14247b91b29fdf6641b07785ab87920d1e9e26eb | [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 28508780091 | PushEvent | True | 2023-04-19 01:08:30+00:00 | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | c79137af62e22428db8a3e5614a496e85e2094a6 | [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 28508755066 | PushEvent | True | 2023-04-19 01:06:30+00:00 | 41898282.0 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | 972dcd4e3117f378c346547e20beb2905689d57d | [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 28508702433 | PushEvent | True | 2023-04-19 01:02:22+00:00 | 10656079.0 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | 9d273b140d5a2e303253bf568b3d6bddf78c42d1 | [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... | NaN | NaN | NaN | 69595187.0 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
5 rows × 33 columns
If we want to analyze this data, this is a good place to save it to disk and start the analysis in a separate notebook.
gh_df_clean.to_csv('gh_events_unpacked.csv')
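In that separate notebook, reading it back might look like this (a sketch; index_col=0 recovers the saved index):

import pandas as pd
# reload the cleaned events; nested values (e.g. commits_payload)
# come back as plain strings from a CSV
events_df = pd.read_csv('gh_events_unpacked.csv', index_col=0)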
8.7. Questions After Class#
8.7.1. How the apply function works/use cases?#
A4 will give you some examples, especially the airline dataset. We will also keep seeing it come up as we manipulate data more.
the apply docs have tiny examples that help illustrate what it does and some of how it works. The pandas faq has a section on apply and similar methods that gives some more use cases.
8.7.2. Is there a better way to see how many missing values?#
There are lots of ways, and all are fine. We used info in class because I was trying to use a method we had already seen.
info focuses on how many values are present instead of how many are missing because that makes more sense in most cases; the more common question is: are there enough values to make decisions with?
If you wanted to get counts of the missing values, you can use the pandas isna function. It is a pandas function (the docs describe it as pandas.isna), not a DataFrame method (which would be described like pandas.DataFrame.methodname).
This means we use it like:
value_to_test = 4
pd.isna(value_to_test)
False
Try it Yourself
Pass different values to this function, like: False, np.nan (requires import numpy as np), pd.NA, and 'hello'.
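A quick sketch of that exercise:

import numpy as np
import pandas as pd
# only the actual missing markers register as missing;
# False and 'hello' are ordinary values
for val in [False, np.nan, pd.NA, 'hello']:
    print(repr(val), pd.isna(val))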
help(pd.isna)
Help on function isna in module pandas.core.dtypes.missing:
isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates
whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
in object arrays, ``NaT`` in datetimelike).
Parameters
----------
obj : scalar or array-like
Object to check for null or missing values.
Returns
-------
bool or array-like of bool
For scalar input, returns a scalar boolean.
For array input, returns an array of boolean indicating whether each
corresponding element is missing.
See Also
--------
notna : Boolean inverse of pandas.isna.
Series.isna : Detect missing values in a Series.
DataFrame.isna : Detect missing values in a DataFrame.
Index.isna : Detect missing values in an Index.
Examples
--------
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog')
False
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)
True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan, 3.],
[ 4., 5., nan]])
>>> pd.isna(array)
array([[False, True, False],
[False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
... "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
dtype='datetime64[ns]', freq=None)
>>> pd.isna(index)
array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
0 1 2
0 ant bee cat
1 dog None fly
>>> pd.isna(df)
0 1 2
0 False False False
1 False True False
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
The docstring says that it returns “bool or array-like of bool”, but if we go to the website docs, which have more examples, we can see that it will return a DataFrame if we pass it a DataFrame. Then we can use the pandas.DataFrame.sum method:
pd.isna(coffee_df_clean).sum()
species 0
owner 7
country_of_origin 1
farm_name 356
lot_number 1041
mill 310
ico_number 148
company 209
altitude 223
region 57
producer 230
number_of_bags 0
bag_weight 0
in_country_partner 0
harvest_year 47
grading_date 0
owner_1 7
variety 201
processing_method 152
aroma 0
flavor 0
aftertaste 0
acidity 0
body 0
balance 0
uniformity 0
clean_cup 0
sweetness 0
cupper_points 0
total_cup_points 0
moisture 0
category_one_defects 0
quakers 1
color 267
category_two_defects 0
expiration 0
certification_body 0
certification_address 0
certification_contact 0
unit_of_measurement 0
altitude_low_meters 227
altitude_high_meters 227
altitude_mean_meters 227
unit_of_measurement_long 0
lot_number_clean 0
dtype: int64
8.7.3. In col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns}, what is the {} for?#
This is called a dictionary comprehension. It is very similar to a list comprehension, and it is one of the defined ways to build a dict type object.
We also saw one when we looked at different types in a previous class.
{char:i for i,char in enumerate('abcde')}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
enumerate is a built-in function that iterates over the items in an iterable type (list-like) and yields each value paired with its index within the structure.
This way we get each character and its position. We could use this as follows:
num_chars = {char:i for i,char in enumerate('abcde')}
alpha_data = ['a','d','e','c','b']
# look up each character's position using the dictionary
[num_chars[char] for char in alpha_data]
[0, 3, 4, 2, 1]