7. Repairing values#
So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.
Today, we’ll deal with how to fix the values within the data.
7.1. Cleaning Data review#
Instead of more practice with these manipulations, below are more
examples of cleaning data to see how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general
idea of what good data looks like and good habits for cleaning data and keeping
it reproducible.
The All Shades article is a good example. Also, here are some tips on general data management and organization.
This article is a comprehensive discussion of data cleaning.
7.1.1. A Cleaning Data Recipe#
Not everything possible, but good enough for this course:
Can you use parameters to read the data in better?
Fix the index and column headers (making these easier to use makes the rest easier)
Is the data structured well?
Are there missing values?
Do the datatypes match what you expect by looking at the head or a sample?
Are categorical variables represented in a usable way?
Does your analysis require filtering or augmenting the data?
import pandas as pd
import seaborn as sns
import numpy as np
sns.set_theme(palette= "colorblind")
# toy data set
na_toy_df = pd.DataFrame(data = [[1,3,4,5],[2 ,6, pd.NA,3]])
# coffee data
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'
coffee_df = pd.read_csv(arabica_data_url,index_col=0)
# github api data
rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
# make plots look nicer and increase font size
sns.set_theme(font_scale=2,palette='colorblind')
7.2. What is clean enough?#
This is a great question, without an easy answer.
It depends on what you want to do. This is why it’s important to have potential questions in mind if you are cleaning data for others, and why we often have to do a little more preparation after a dataset has been “cleaned”.
Dealing with missing data is a whole research area. There isn’t one solution.
In 2020 there was a whole workshop on missing values.
One organizer is the main developer of scikit-learn, the ML package we will use soon. In a 2020 invited talk he listed more automatic handling as an active area of research and a development goal for sklearn.
There are also many classic approaches both when training and when applying models.
example application in breast cancer detection
Even in pandas, how to represent missing values symbolically is still under experimentation.
Missing values can even cause the datatypes to change.
That said, Pandas gives a few basic tools:
dropna
fillna
Filling can be good if you know how to fill reasonably, but don’t have data to spare by dropping. For example
you can approximate with another column
you can approximate with that column from other rows
A special case: what if we’re filling a summary table?
Filling with a symbol for printing can be a good choice, but not for analysis.
whatever you do, document it
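As a sketch of those filling strategies (on hypothetical toy data, not the coffee data): one column can stand in for another, or a statistic of the same column can stand in for missing rows.

```python
import pandas as pd

# hypothetical toy data: two repeated measurements, the second sometimes missing
df = pd.DataFrame({'h1': [150.0, 160.0, 170.0],
                   'h2': [151.0, None, 169.0]})

# approximate with another column: the paired reading
filled_across = df['h2'].fillna(df['h1'])

# approximate with that column from other rows: here, the column mean
filled_down = df['h2'].fillna(df['h2'].mean())

print(filled_across.tolist())  # [151.0, 160.0, 169.0]
print(filled_down.tolist())    # [151.0, 160.0, 169.0]
```

For a summary table meant only for display, something like `fillna('-')` would be fine, but it would break any numeric analysis afterward.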
coffee_df.info()
7.2.1. Filling missing values#
The ‘Lot.Number’ column has a lot of NaN values; how can we explore it?
We can look at the type:
coffee_df['Lot.Number'].dtype
And we can look at the value counts.
coffee_df['Lot.Number'].value_counts()
We see that a lot are ‘1’, maybe we know that when the data was collected, if the Farm only has one lot, some people recorded ‘1’ and others left it as missing. So we could fill in with 1:
coffee_df['Lot.Number'].fillna('1')
Note that even after we called fillna, if we display the column again the original data is unchanged.
To save the filled-in column we have a few choices:
use the inplace parameter. This doesn’t offer performance advantages: it still copies the object and then reassigns the pointer, and it is under discussion to be deprecated
write to a new DataFrame
add a column
coffee_df['lot_number_clean'] = coffee_df['Lot.Number'].fillna('1')
coffee_df.head(1)
Species | Owner | Country.of.Origin | Farm.Name | Lot.Number | Mill | ICO.Number | Company | Altitude | Region | ... | Category.Two.Defects | Expiration | Certification.Body | Certification.Address | Certification.Contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | lot_number_clean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 | 1 |
1 rows × 44 columns
7.3. Dropping#
Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.
Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.
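One hedged way to check for that risk (sketched on hypothetical toy data, since the real check depends on your dataset) is to compare the fraction of missing values across groups:

```python
import pandas as pd

# hypothetical toy data: a score that is mostly missing for one country
df = pd.DataFrame({'country': ['A', 'A', 'B', 'B', 'B'],
                   'score':   [90, 85, None, None, 70]})

# fraction of missing values per group; a big imbalance suggests the data
# is not missing at random, so dropping rows would bias the results
missing_by_country = df.groupby('country')['score'].apply(lambda s: s.isna().mean())
print(missing_by_country)
# country A has 0% missing, country B about 67%
```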
We can look at dropping in this toy data set.
na_toy_df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1 | 3 | 4 | 5 |
1 | 2 | 6 | <NA> | 3 |
na_toy_df.dtypes
0 int64
1 int64
2 object
3 int64
dtype: object
na_toy_df.dropna()
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1 | 3 | 4 | 5 |
na_toy_df.dropna(axis=1)
0 | 1 | 3 | |
---|---|---|---|
0 | 1 | 3 | 5 |
1 | 2 | 6 | 3 |
na_toy_df.mean()
0 1.5
1 4.5
2 4.0
3 4.0
dtype: object
Why is this object? Column 2 mixes integers with pd.NA, so it was stored with the object datatype, and the Series of means inherits that.
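The pd.NA in column 2 forces that column to the object datatype, which propagates to the mean. A small sketch, assuming a recent pandas with nullable dtypes, of converting so the result stays numeric:

```python
import pandas as pd

na_toy_df = pd.DataFrame(data=[[1, 3, 4, 5], [2, 6, pd.NA, 3]])

# column 2 mixes ints with pd.NA, so pandas falls back to object
print(na_toy_df.dtypes[2])   # object

# convert_dtypes infers the nullable Int64 type instead
converted = na_toy_df.convert_dtypes()
print(converted.dtypes[2])   # Int64
print(converted.mean())      # numeric means, NA skipped
```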
7.3.1. Dropping missing values#
To illustrate how dropna works, we’ll use the shape attribute:
coffee_df.shape
(1311, 44)
coffee_df.dropna().shape
(130, 44)
We could instead tell it to only drop rows with NaN
in a subset of the columns.
coffee_df.dropna(subset=['altitude_low_meters']).shape
(1084, 44)
By default, it drops any row with one or more NaN
values.
In the Open Policing Project Data Summary we saw a summary that showed which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the thresh parameter and axis=1 to drop along columns.
n_rows, n_cols = coffee_df.shape
coffee_df.dropna(thresh=.7*n_rows, axis=1).shape
(1311, 43)
n_rows, _ = coffee_df.shape
7.4. Inconsistent values#
This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts and sort alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts() Series, the values from the coffee_df become the index, so we use sort_index.
Let’s look at the in_country_partner
column
coffee_df['In.Country.Partner'].value_counts()
In.Country.Partner
Specialty Coffee Association 295
AMECAFE 205
Almacafé 178
Asociacion Nacional Del Café 155
Brazil Specialty Coffee Association 67
Instituto Hondureño del Café 60
Blossom Valley International 58
Africa Fine Coffee Association 49
Specialty Coffee Association of Costa Rica 42
NUCOFFEE 36
Uganda Coffee Development Authority 22
Kenya Coffee Traders Association 22
Ethiopia Commodity Exchange 18
Specialty Coffee Institute of Asia 16
METAD Agricultural Development plc 15
Yunnan Coffee Exchange 12
Salvadoran Coffee Council 11
Specialty Coffee Association of Indonesia 10
Centro Agroecológico del Café A.C. 8
Asociación de Cafés Especiales de Nicaragua 8
Coffee Quality Institute 7
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Specialty Coffee Ass 1
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Blossom Valley International\n 1
Name: count, dtype: int64
We can see there’s only one Blossom Valley International\n but 58 Blossom Valley International; the former is likely a typo, especially since \n is a special character for a newline. Similarly with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.
partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df['in_country_partner_clean'] = coffee_df['In.Country.Partner'].replace(
to_replace=partner_corrections)
coffee_df['in_country_partner_clean'].value_counts().sort_index()
in_country_partner_clean
AMECAFE 205
Africa Fine Coffee Association 49
Almacafé 178
Asociacion Nacional Del Café 155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. 6
Asociación de Cafés Especiales de Nicaragua 8
Blossom Valley International 59
Brazil Specialty Coffee Association 67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao 1
Centro Agroecológico del Café A.C. 8
Coffee Quality Institute 7
Ethiopia Commodity Exchange 18
Instituto Hondureño del Café 60
Kenya Coffee Traders Association 22
METAD Agricultural Development plc 15
NUCOFFEE 36
Salvadoran Coffee Council 11
Specialty Coffee Association 296
Specialty Coffee Association of Costa Rica 42
Specialty Coffee Association of Indonesia 10
Specialty Coffee Institute of Asia 16
Tanzanian Coffee Board 6
Torch Coffee Lab Yunnan 2
Uganda Coffee Development Authority 22
Yunnan Coffee Exchange 12
Name: count, dtype: int64
coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
'Category.Two.Defects', 'Expiration', 'Certification.Body',
'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters',
'lot_number_clean', 'in_country_partner_clean'],
dtype='object')
coffee_df_clean = coffee_df.rename(lambda s: s.lower().replace('.','_'),axis=1)
coffee_df_clean.head(1)
species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | region | ... | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | lot_number_clean | in_country_partner_clean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | guji-hambela | ... | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 | 1 | METAD Agricultural Development plc |
1 rows × 45 columns
7.5. JSON#
Some datasets have a nested structure.
We want to transform each of those dictionary-like values into a row in a DataFrame.
course_gh_df.head(2)
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 34027166567 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 688125102, 'name': 'rhodyprog4ds/BrownF... | {'repository_id': 688125102, 'push_id': 162019... | True | 2023-12-08 21:20:17+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
1 | 33776329929 | PushEvent | {'id': 41898282, 'login': 'github-actions[bot]... | {'id': 688125102, 'name': 'rhodyprog4ds/BrownF... | {'repository_id': 688125102, 'push_id': 160517... | True | 2023-12-01 02:32:36+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
7.5.1. Casting Review#
If we have a variable that is not the type we want like this:
a ='5'
we can check type
type(a)
str
and we can use the name of the type we want, as a function to cast it to the new type.
type(int(a))
int
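The same idea applies column-wise in pandas: astype casts a whole Series at once (a minimal sketch with made-up values):

```python
import pandas as pd

s = pd.Series(['1', '2', '3'])  # numbers read in as strings
print(s.dtype)                  # object

s_int = s.astype(int)           # cast every value in the column
print(s_int.sum())              # 6
```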
7.5.2. Handling dicts within a Data Frame#
We can see each row is a Series type.
type(course_gh_df.loc[0])
pandas.core.series.Series
The individual values in the actor column are then dictionaries.
type(course_gh_df.loc[0]['actor'])
dict
We can use the series constructor to transform it.
pd.Series(course_gh_df.loc[0]['actor'])
id 10656079
login brownsarahm
display_login brownsarahm
gravatar_id
url https://api.github.com/users/brownsarahm
avatar_url https://avatars.githubusercontent.com/u/10656079?
dtype: object
We can use pandas apply to do the same thing to every item in a dataset (over rows or columns as items).
course_gh_df['actor'].apply(pd.Series).head(1)
id | login | display_login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? |
compared to the original:
course_gh_df.head(1)
id | type | actor | repo | payload | public | created_at | org | |
---|---|---|---|---|---|---|---|---|
0 | 34027166567 | PushEvent | {'id': 10656079, 'login': 'brownsarahm', 'disp... | {'id': 688125102, 'name': 'rhodyprog4ds/BrownF... | {'repository_id': 688125102, 'push_id': 162019... | True | 2023-12-08 21:20:17+00:00 | {'id': 69595187, 'login': 'rhodyprog4ds', 'gra... |
7.5.3. Unpacking at scale#
Here we see how the list comprehensions we looked at in isolation before start to come in handy.
We want to handle several columns this way, so we’ll make a list of the names.
js_col = ['actor','repo','payload','org']
pd.concat takes a list of DataFrames and puts them together in one DataFrame; see its docs for more detail.
So, we use a list comprehension to iterate over all of the columns that we want to transform, transform them, store the fixed DataFrames in a list, and concat them together into a single new DataFrame.
pd.concat([course_gh_df[col].apply(pd.Series) for col in js_col],axis=1).head(1)
id | login | display_login | gravatar_id | url | avatar_url | id | name | url | repository_id | ... | master_branch | description | pusher_type | issue | comment | id | login | gravatar_id | url | avatar_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | 688125102.0 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
1 rows × 30 columns
This is close, but a lot of columns have the same name. To fix this we will rename the new columns so that they have the original column name prepended to the new name.
Pandas has a rename method for this, and this is another job for lambdas.
pd.concat([course_gh_df[col].apply(pd.Series,).rename(
columns= lambda i_col: col + '_' + i_col )
for col in js_col],axis=1).head()
actor_id | actor_login | actor_display_login | actor_gravatar_id | actor_url | actor_avatar_url | repo_id | repo_name | repo_url | payload_repository_id | ... | payload_master_branch | payload_description | payload_pusher_type | payload_issue | payload_comment | org_id | org_login | org_gravatar_id | org_url | org_avatar_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | 688125102.0 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | 688125102.0 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | NaN | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | NaN | ... | main | NaN | user | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | 688125102 | rhodyprog4ds/BrownFall23 | https://api.github.com/repos/rhodyprog4ds/Brow... | 688125102.0 | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
5 rows × 30 columns
The rename method can take a lambda function to rename columns in a pattern. We want to combine the original column name with the new column name: col + '_' + i_col does this, where i_col is the column name after the .apply(pd.Series) and col is the column name of the original column before unpacking.
To finish off, we can first get the columns that are not in the unpacked, put them in a list, then add the two lists together before concatenating them all together.
cols_not_unpacked_list = [course_gh_df[[col for col in
course_gh_df.columns if not(col in js_col)] ]]
unpacked_cols_list = [course_gh_df[col].apply(pd.Series,).rename(
columns= lambda i_col: col + '_' + i_col )
for col in js_col]
pd.concat(cols_not_unpacked_list +unpacked_cols_list,axis=1)
id | type | public | created_at | actor_id | actor_login | actor_display_login | actor_gravatar_id | actor_url | actor_avatar_url | ... | payload_master_branch | payload_description | payload_pusher_type | payload_issue | payload_comment | org_id | org_login | org_gravatar_id | org_url | org_avatar_url | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34027166567 | PushEvent | True | 2023-12-08 21:20:17+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
1 | 33776329929 | PushEvent | True | 2023-12-01 02:32:36+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
2 | 33776253600 | ReleaseEvent | True | 2023-12-01 02:29:25+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
3 | 33776238341 | CreateEvent | True | 2023-12-01 02:28:46+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | main | NaN | user | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
4 | 33776199782 | PushEvent | True | 2023-12-01 02:27:06+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
5 | 33703248568 | IssuesEvent | True | 2023-11-29 01:47:17+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | {'url': 'https://api.github.com/repos/rhodypro... | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
6 | 33703246441 | IssuesEvent | True | 2023-11-29 01:47:08+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | {'url': 'https://api.github.com/repos/rhodypro... | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
7 | 33703246322 | IssuesEvent | True | 2023-11-29 01:47:08+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | {'url': 'https://api.github.com/repos/rhodypro... | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
8 | 33702807533 | PushEvent | True | 2023-11-29 01:18:02+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
9 | 33702796873 | ReleaseEvent | True | 2023-11-29 01:17:15+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
10 | 33702780926 | CreateEvent | True | 2023-11-29 01:16:11+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | main | NaN | user | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
11 | 33702724730 | PushEvent | True | 2023-11-29 01:12:31+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
12 | 33702608401 | PushEvent | True | 2023-11-29 01:05:06+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
13 | 33702521578 | PushEvent | True | 2023-11-29 00:59:41+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
14 | 33531427411 | PushEvent | True | 2023-11-22 02:06:46+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
15 | 33531389938 | PushEvent | True | 2023-11-22 02:04:21+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
16 | 33531338454 | PushEvent | True | 2023-11-22 02:01:20+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
17 | 33531319634 | ReleaseEvent | True | 2023-11-22 02:00:13+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
18 | 33531308241 | CreateEvent | True | 2023-11-22 01:59:30+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | main | NaN | user | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
19 | 33531300722 | PushEvent | True | 2023-11-22 01:58:59+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
20 | 33529095254 | PushEvent | True | 2023-11-21 23:18:00+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
21 | 33529014986 | PushEvent | True | 2023-11-21 23:12:42+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
22 | 33527775178 | PushEvent | True | 2023-11-21 22:01:40+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
23 | 33527670318 | PushEvent | True | 2023-11-21 21:56:30+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
24 | 33499445425 | PushEvent | True | 2023-11-21 02:54:42+00:00 | 41898282 | github-actions[bot] | github-actions | https://api.github.com/users/github-actions[bot] | https://avatars.githubusercontent.com/u/41898282? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
25 | 33499373280 | PushEvent | True | 2023-11-21 02:49:20+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
26 | 33495012063 | PushEvent | True | 2023-11-20 22:04:40+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
27 | 33492376390 | IssueCommentEvent | True | 2023-11-20 19:59:08+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | {'url': 'https://api.github.com/repos/rhodypro... | {'url': 'https://api.github.com/repos/rhodypro... | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
28 | 33491026007 | IssuesEvent | True | 2023-11-20 18:56:44+00:00 | 90425926 | MJSher | MJSher | https://api.github.com/users/MJSher | https://avatars.githubusercontent.com/u/90425926? | ... | NaN | NaN | NaN | {'url': 'https://api.github.com/repos/rhodypro... | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? | ||
29 | 33416793174 | ReleaseEvent | True | 2023-11-17 01:34:21+00:00 | 10656079 | brownsarahm | brownsarahm | https://api.github.com/users/brownsarahm | https://avatars.githubusercontent.com/u/10656079? | ... | NaN | NaN | NaN | NaN | NaN | 69595187 | rhodyprog4ds | https://api.github.com/orgs/rhodyprog4ds | https://avatars.githubusercontent.com/u/69595187? |
30 rows × 34 columns
7.6. Questions after class#
7.6.1. After you do analysis with a specific column and cleaned it for that, should you restore the original dataframe and reclean it to do a different analysis?#
You might, if the analyses are completely different and unrelated. More often, however, we would clean the whole dataset, save the cleaning script/notebook (which can hold more context), and save the cleaned dataset to a csv. Building more breadth of understanding of these practices is what you will do with the last part of A4. Your task there is to look at a few examples of cleaning that I have gathered for you and answer questions that start to build your intuition with this.
Ultimately, cleaning data is not something you can learn everything about in one shot; over time you will see more and more examples.
7.6.2. I don’t fully understand the lambda function#
If you want a technical specific understanding of it, I recommend the Python language documentation on lambda functions and the wikipedia article on anonymous functions for more breadth and other related concepts across languages.
At a practical level it is a shorthand syntax for defining a small function. For example, the following two functions do the same thing.
repeat_lambda = lambda content, reps: content*reps
def repeat_func(content, reps):
return content*reps
First, we can examine them
type(repeat_lambda), type( repeat_func)
(function, function)
they are both callable, but slightly different types.
Now we can call our functions:
repeat_lambda('a',3) == repeat_func('a',3)
True
and this is not a specific case, but always works. We can do a small random experiment to see
We’ll use the string library to get a string of the alphabet
import string
string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
We can pick even a random length, then random characters and a random number of repetitions
rand_length = np.random.randint(10)
random_content = np.random.choice(list(string.ascii_uppercase),size=rand_length)
rand_reps = np.random.randint(10)
random_content, rand_reps
(array(['U'], dtype='<U1'), 3)
We can try to apply this to the random inputs:
repeat_lambda(random_content, rand_reps) == repeat_func(random_content, rand_reps)
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
Cell In[40], line 1
----> 1 repeat_lambda(random_content, rand_reps) == repeat_func(random_content, rand_reps)
Cell In[35], line 1, in <lambda>(content, reps)
----> 1 repeat_lambda = lambda content, reps: content*reps
3 def repeat_func(content, reps):
4 return content*reps
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U1'), dtype('int64')) -> None
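This error comes from NumPy rather than the lambda: both functions fail the same way on this input, because a fixed-width NumPy string array does not support multiplication by an integer the way a str or list does. A small sketch of the failure and one workaround (converting to a plain list first):

```python
import numpy as np

random_content = np.array(['U'], dtype='<U1')

# multiplying a fixed-width string array by an int raises a TypeError
# subclass (displayed as UFuncTypeError)
caught_type_error = False
try:
    random_content * 3
except TypeError:
    caught_type_error = True
print(caught_type_error)  # True

# a plain Python list repeats under multiplication, like str does
repeated = list(random_content) * 3
print(repeated)  # ['U', 'U', 'U']
```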
7.6.3. Json use cases vs csv use cases#
Once we read the data in, there is no difference. Where the data is generated there are tradeoffs; JSON is a popular way to log activity.
7.6.4. Why are so many datasets so messy in the first place?#
7.6.5. Are there more resources to see when its appropriate to fill in missing data with certain values?#
I have not found a lot of good resources on this, unfortunately. Data Science is a complex discipline and very new especially at the undergraduate level. The first data science degrees were only at the graduate level.
The complexity lies in integrating information from computer science, statistics, and domain knowledge. Domain knowledge is going to be different in every dataset.
It is okay to not know for sure the best thing to do. The most important thing is to document what you did and why, so that you can justify the choices and consider their impact later in your analysis.
7.6.6. Can we get a more in-depth explanation of what is going on in the last piece of code you provided?#
See the step-by-step walkthrough above.
7.6.7. What is the normal percent of NAs that need to be filled for most people to get rid of that line?#
Again, unfortunately there are no fixed rules.
Missing 10% of only 50 samples might be detrimental, whereas missing 30% of 10,000 could be okay.
The threshold depends on what you are going to do with the data after cleaning.
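One hedged starting point (sketched on hypothetical toy data) is to compute the fraction missing per column and compare it against a cutoff you choose from context:

```python
import pandas as pd

# hypothetical toy data with different amounts of missingness per column
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, None, None, None],
                   'c': [1, 2, None, 4]})

# fraction of missing values in each column
frac_missing = df.isna().mean()
print(frac_missing)  # a: 0.00, b: 0.75, c: 0.25

# keep only columns under a context-chosen cutoff, e.g. 30% missing
keep = frac_missing[frac_missing <= 0.3].index.tolist()
print(keep)  # ['a', 'c']
```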