8. Repairing values#

So far, we’ve dealt with structural issues in data, but there’s a lot more to cleaning.

Today, we’ll deal with how to fix the values within the data.

8.1. Cleaning Data review#

Instead of more practice with these manipulations, below are more examples of cleaning data that show how these types of manipulations get used.
Your goal here is not to memorize every possible thing, but to build a general idea of what good data looks like and good habits for cleaning data and keeping it reproducible.

This article is a comprehensive discussion of data cleaning.

8.1.1. A Cleaning Data Recipe#

Not everything possible, but good enough for this course.

  1. Can you use parameters to read the data in better?

  2. Fix the index and column headers (making these easier to use makes the rest easier)

  3. Is the data structured well?

  4. Are there missing values?

  5. Do the datatypes match what you expect by looking at the head or a sample?

  6. Are categorical variables represented in usable way?

  7. Does your analysis require filtering or augmenting the data?
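As a sketch of step 1 on made-up data, read_csv parameters like index_col and na_values can fix the index and missing-value markers at read time (the CSV text here is hypothetical):

```python
import io

import pandas as pd

# hypothetical CSV where '?' marks a missing value
csv_text = 'id,score\n1,95\n2,?\n3,88\n'

# index_col sets the index and na_values maps '?' to NaN while reading
df = pd.read_csv(io.StringIO(csv_text), index_col='id', na_values='?')
print(df)
```

Reading the data in well up front often saves several later cleaning steps.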

import pandas as pd
import seaborn as sns
import numpy as np

# toy DataFrames for exploring the two missing-value representations
na_toy_df_np = pd.DataFrame(data = [[1, 3, 4, 5], [2, 6, np.nan]])
na_toy_df_pd = pd.DataFrame(data = [[1, 3, 4, 5], [2, 6, pd.NA]])

# make plots look nicer and increase font size
sns.set_theme(font_scale=2)
arabica_data_url = 'https://raw.githubusercontent.com/jldbc/coffee-quality-database/master/data/arabica_data_cleaned.csv'

coffee_df = pd.read_csv(arabica_data_url,index_col=0)


rhodyprog4ds_gh_events_url = 'https://api.github.com/orgs/rhodyprog4ds/events'
course_gh_df = pd.read_json(rhodyprog4ds_gh_events_url)

8.2. What is clean enough?#

This is a great question, without an easy answer.

It depends on what you want to do. This is why it’s important to have potential questions in mind if you are cleaning data for others, and why we often have to do a little bit more preparation after a dataset has been “cleaned”.

8.3. Fixing Column names#

coffee_df.columns
Index(['Species', 'Owner', 'Country.of.Origin', 'Farm.Name', 'Lot.Number',
       'Mill', 'ICO.Number', 'Company', 'Altitude', 'Region', 'Producer',
       'Number.of.Bags', 'Bag.Weight', 'In.Country.Partner', 'Harvest.Year',
       'Grading.Date', 'Owner.1', 'Variety', 'Processing.Method', 'Aroma',
       'Flavor', 'Aftertaste', 'Acidity', 'Body', 'Balance', 'Uniformity',
       'Clean.Cup', 'Sweetness', 'Cupper.Points', 'Total.Cup.Points',
       'Moisture', 'Category.One.Defects', 'Quakers', 'Color',
       'Category.Two.Defects', 'Expiration', 'Certification.Body',
       'Certification.Address', 'Certification.Contact', 'unit_of_measurement',
       'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
      dtype='object')
col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns}
coffee_df.rename(columns=col_name_mapper).head(1)
species owner country_of_origin farm_name lot_number mill ico_number company altitude region ... color category_two_defects expiration certification_body certification_address certification_contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 43 columns

coffee_df.head(1)
Species Owner Country.of.Origin Farm.Name Lot.Number Mill ICO.Number Company Altitude Region ... Color Category.Two.Defects Expiration Certification.Body Certification.Address Certification.Contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 43 columns

coffee_df_fixedcols = coffee_df.rename(columns=col_name_mapper)
coffee_df_fixedcols.head(1)
species owner country_of_origin farm_name lot_number mill ico_number company altitude region ... color category_two_defects expiration certification_body certification_address certification_contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... Green 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0

1 rows × 43 columns

coffee_df_fixedcols['unit_of_measurement'].value_counts()
unit_of_measurement
m     1129
ft     182
Name: count, dtype: int64
coffee_df_fixedcols['unit_of_measurement'].replace({'m':'meters','ft':'feet'})
1       meters
2       meters
3       meters
4       meters
5       meters
         ...  
1307    meters
1308    meters
1309    meters
1310      feet
1312    meters
Name: unit_of_measurement, Length: 1311, dtype: object
coffee_df_fixedcols['unit_of_measurement_long'] = coffee_df_fixedcols['unit_of_measurement'].replace(
                                    {'m':'meters','ft':'feet'})
coffee_df_fixedcols.head(1)
species owner country_of_origin farm_name lot_number mill ico_number company altitude region ... category_two_defects expiration certification_body certification_address certification_contact unit_of_measurement altitude_low_meters altitude_high_meters altitude_mean_meters unit_of_measurement_long
1 Arabica metad plc Ethiopia metad plc NaN metad plc 2014/2015 metad agricultural developmet plc 1950-2200 guji-hambela ... 0 April 3rd, 2016 METAD Agricultural Development plc 309fcf77415a3661ae83e027f7e5f05dad786e44 19fef5a731de2db57d16da10287413f5f99bc2dd m 1950.0 2200.0 2075.0 meters

1 rows × 44 columns

8.4. Missing Values#

Dealing with missing data is a whole research area. There isn’t one solution.

In 2020 there was a whole workshop on missing values.

One organizer is the main developer of scikit-learn, the ML package we will use soon. In a 2020 invited talk he listed more automatic handling as an active area of research and a development goal for sklearn.

There are also many classic approaches both when training and when applying models.

An example application is in breast cancer detection.

Even in pandas, how to represent missing values symbolically is still under experimentation.

Missing values can even cause a column’s datatype to change.
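For example, in a minimal demo separate from the coffee data: an integer column is upcast to float64 once np.nan appears, while pd.NA can be held in pandas’ nullable Int64 extension dtype:

```python
import numpy as np
import pandas as pd

# an all-integer column stays int64
s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# introducing np.nan forces an upcast to float64
s_nan = pd.Series([1, 2, np.nan])
print(s_nan.dtype)  # float64

# pd.NA with the nullable extension dtype keeps the values integers
s_na = pd.Series([1, 2, pd.NA], dtype='Int64')
print(s_na.dtype)  # Int64
```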

That said, Pandas gives a few basic tools:

  • dropna

  • fillna

Dropping is a good choice when you otherwise have a lot of data and the data is missing at random.

Dropping can be risky if it’s not missing at random. For example, if we saw in the coffee data that one of the scores was missing for all of the rows from one country, or even just missing more often in one country, that could bias our results.

Filling can be good if you know how to fill reasonably, but don’t have data to spare by dropping. For example

  • you can approximate with another column

  • you can approximate with that column from other rows
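Both kinds of filling can be sketched on made-up data (the column names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical altitude data with some missing values
df = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                   'alt_low': [1000.0, np.nan, 1500.0, 1600.0],
                   'alt_high': [1200.0, 1100.0, 1700.0, np.nan]})

# approximate with another column: use alt_low where alt_high is missing
df['alt_high_filled'] = df['alt_high'].fillna(df['alt_low'])

# approximate with that column from other rows: fill with the
# per-country mean of the same column
df['alt_low_filled'] = df.groupby('country')['alt_low'].transform(
    lambda col: col.fillna(col.mean()))
print(df)
```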

Special case: what if we’re filling a summary table?

  • filling with a symbol for printing can be a good choice, but not for analysis.
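For instance, with a made-up summary table, fillna can substitute a symbol in a copy meant only for display:

```python
import numpy as np
import pandas as pd

# a small summary table with a missing cell (illustrative values)
summary = pd.DataFrame({'2019': [4.2, np.nan], '2020': [4.5, 3.9]},
                       index=['aroma', 'body'])

# fill with a symbol only for the displayed copy; keep the numeric
# version for any further analysis
display_summary = summary.fillna('-')
print(display_summary)
```

The numeric original stays untouched for analysis; only the displayed copy holds the symbol.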

Whatever you do, document it.

coffee_df_fixedcols.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1311 entries, 1 to 1312
Data columns (total 44 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   species                   1311 non-null   object 
 1   owner                     1304 non-null   object 
 2   country_of_origin         1310 non-null   object 
 3   farm_name                 955 non-null    object 
 4   lot_number                270 non-null    object 
 5   mill                      1001 non-null   object 
 6   ico_number                1163 non-null   object 
 7   company                   1102 non-null   object 
 8   altitude                  1088 non-null   object 
 9   region                    1254 non-null   object 
 10  producer                  1081 non-null   object 
 11  number_of_bags            1311 non-null   int64  
 12  bag_weight                1311 non-null   object 
 13  in_country_partner        1311 non-null   object 
 14  harvest_year              1264 non-null   object 
 15  grading_date              1311 non-null   object 
 16  owner_1                   1304 non-null   object 
 17  variety                   1110 non-null   object 
 18  processing_method         1159 non-null   object 
 19  aroma                     1311 non-null   float64
 20  flavor                    1311 non-null   float64
 21  aftertaste                1311 non-null   float64
 22  acidity                   1311 non-null   float64
 23  body                      1311 non-null   float64
 24  balance                   1311 non-null   float64
 25  uniformity                1311 non-null   float64
 26  clean_cup                 1311 non-null   float64
 27  sweetness                 1311 non-null   float64
 28  cupper_points             1311 non-null   float64
 29  total_cup_points          1311 non-null   float64
 30  moisture                  1311 non-null   float64
 31  category_one_defects      1311 non-null   int64  
 32  quakers                   1310 non-null   float64
 33  color                     1044 non-null   object 
 34  category_two_defects      1311 non-null   int64  
 35  expiration                1311 non-null   object 
 36  certification_body        1311 non-null   object 
 37  certification_address     1311 non-null   object 
 38  certification_contact     1311 non-null   object 
 39  unit_of_measurement       1311 non-null   object 
 40  altitude_low_meters       1084 non-null   float64
 41  altitude_high_meters      1084 non-null   float64
 42  altitude_mean_meters      1084 non-null   float64
 43  unit_of_measurement_long  1311 non-null   object 
dtypes: float64(16), int64(3), object(25)
memory usage: 460.9+ KB

8.4.1. Filling missing values#

The lot_number column has a lot of NaN values; how can we explore it?

We can look at the type:

coffee_df_fixedcols['lot_number'].dtype
dtype('O')

And we can look at the value counts.

coffee_df_fixedcols['lot_number'].value_counts()
lot_number
1                             18
020/17                         6
019/17                         5
2                              3
102                            3
                              ..
11/23/0696                     1
3-59-2318                      1
8885                           1
5055                           1
017-053-0211/ 017-053-0212     1
Name: count, Length: 221, dtype: int64

We see that a lot are ‘1’. Maybe we know that when the data was collected, if the farm only had one lot, some people recorded ‘1’ and others left it blank. So we could fill in with ‘1’:

coffee_df_fixedcols['lot_number'].fillna('1')
1                                1
2                                1
3                                1
4                                1
5                                1
                   ...            
1307                             1
1308                             1
1309    017-053-0211/ 017-053-0212
1310                             1
1312                           103
Name: lot_number, Length: 1311, dtype: object

Note that even after we call fillna, when we display the column again the original data is unchanged. To save the filled-in column we have a few choices:

  • use the inplace parameter. This doesn’t offer performance advantages; it still copies the object and then reassigns the pointer, and it is under discussion to be deprecated

  • write to a new DataFrame

  • add a column

We’ll use adding a column:

coffee_df_fixedcols['lot_number_clean'] = coffee_df_fixedcols['lot_number'].fillna('1')
coffee_df_fixedcols['lot_number_clean'].value_counts()
lot_number_clean
1                             1059
020/17                           6
019/17                           5
102                              3
103                              3
                              ... 
3-59-2318                        1
8885                             1
5055                             1
MCCFWXA15/16                     1
017-053-0211/ 017-053-0212       1
Name: count, Length: 221, dtype: int64

8.4.2. Dropping missing values#

To illustrate how dropna works, we’ll use the shape attribute:

coffee_df_fixedcols.shape
(1311, 45)
coffee_df_fixedcols.dropna().shape
(130, 45)

By default, it drops any row with one or more NaN values.

We could instead tell it to only drop rows with NaN in a subset of the columns.

coffee_df_fixedcols.dropna(subset=['altitude_low_meters']).shape
(1084, 45)
coffee_alt_df = coffee_df_fixedcols.dropna(subset=['altitude_low_meters'])

In the Open Policing Project Data Summary we saw a summary that showed which variables had at least 70% non-missing values. We can similarly choose to keep only variables that have more than a specific threshold of data, using the thresh parameter and axis=1 to drop along columns.

n_rows, n_cols = coffee_df_fixedcols.shape
coffee_df_fixedcols.dropna(thresh = .7*n_rows, axis=1).shape
(1311, 44)

This dataset is actually in pretty good shape, but if we use a more stringent threshold it drops more columns.

coffee_df_fixedcols.dropna(thresh = .85*n_rows, axis=1).shape
(1311, 34)

8.5. Inconsistent values#

This was one of the things that many of you anticipated or had observed. A useful way to investigate this is to use value_counts and sort alphabetically by the values from the original data, so that similar ones will be consecutive in the list. Once we have the value_counts() Series, the values from coffee_df become the index, so we use sort_index.

Let’s look at the in_country_partner column

coffee_df_fixedcols['in_country_partner'].value_counts().sort_index()
in_country_partner
AMECAFE                                                                                  205
Africa Fine Coffee Association                                                            49
Almacafé                                                                                 178
Asociacion Nacional Del Café                                                             155
Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.                             6
Asociación de Cafés Especiales de Nicaragua                                                8
Blossom Valley International                                                              58
Blossom Valley International\n                                                             1
Brazil Specialty Coffee Association                                                       67
Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao      1
Centro Agroecológico del Café A.C.                                                         8
Coffee Quality Institute                                                                   7
Ethiopia Commodity Exchange                                                               18
Instituto Hondureño del Café                                                              60
Kenya Coffee Traders Association                                                          22
METAD Agricultural Development plc                                                        15
NUCOFFEE                                                                                  36
Salvadoran Coffee Council                                                                 11
Specialty Coffee Ass                                                                       1
Specialty Coffee Association                                                             295
Specialty Coffee Association of Costa Rica                                                42
Specialty Coffee Association of Indonesia                                                 10
Specialty Coffee Institute of Asia                                                        16
Tanzanian Coffee Board                                                                     6
Torch Coffee Lab Yunnan                                                                    2
Uganda Coffee Development Authority                                                       22
Yunnan Coffee Exchange                                                                    12
Name: count, dtype: int64

We can see there’s only one Blossom Valley International\n but 58 Blossom Valley International, the former is likely a typo, especially since \n is a special character for a newline. Similarly, with ‘Specialty Coffee Ass’ and ‘Specialty Coffee Association’.

partner_corrections = {'Blossom Valley International\n':'Blossom Valley International',
  'Specialty Coffee Ass':'Specialty Coffee Association'}
coffee_df_clean = coffee_df_fixedcols.replace(partner_corrections)

8.6. Example: Unpacking Jsons#

rhodyprog4ds_gh_events_url
'https://api.github.com/orgs/rhodyprog4ds/events'
gh_df = pd.read_json(rhodyprog4ds_gh_events_url)
gh_df.head()
id type actor repo payload public created_at org
0 28565033114 CreateEvent {'id': 10656079, 'login': 'brownsarahm', 'disp... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'ref': 'c22', 'ref_type': 'tag', 'master_bran... True 2023-04-21 00:47:50+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...
1 28565028952 PushEvent {'id': 10656079, 'login': 'brownsarahm', 'disp... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'repository_id': 592944632, 'push_id': 133757... True 2023-04-21 00:47:24+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...
2 28508780091 PushEvent {'id': 41898282, 'login': 'github-actions[bot]... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'repository_id': 592944632, 'push_id': 133481... True 2023-04-19 01:08:30+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...
3 28508755066 PushEvent {'id': 41898282, 'login': 'github-actions[bot]... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'repository_id': 592944632, 'push_id': 133481... True 2023-04-19 01:06:30+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...
4 28508702433 PushEvent {'id': 10656079, 'login': 'brownsarahm', 'disp... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'repository_id': 592944632, 'push_id': 133481... True 2023-04-19 01:02:22+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...

Some datasets have a nested structure.

We want to transform each one of those from a dictionary-like object into a row in a DataFrame.

We can see each row is a Series type.

type(gh_df.loc[0])
pandas.core.series.Series
a = '1'
type(a)
str

Recall that base Python types can be used as functions, to cast an object from one type to another.

type(int(a))
int

This works with Pandas Series too:

pd.Series(gh_df.loc[0]['actor'])
id                                                        10656079
login                                                  brownsarahm
display_login                                          brownsarahm
gravatar_id                                                       
url                       https://api.github.com/users/brownsarahm
avatar_url       https://avatars.githubusercontent.com/u/10656079?
dtype: object

We can use pandas apply to do the same thing to every item in a dataset (over rows or columns as items). For example:

gh_df['actor'].apply(pd.Series).head()
id login display_login gravatar_id url avatar_url
0 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
1 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
2 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
3 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
4 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?

compared to the original:

gh_df.head(1)
id type actor repo payload public created_at org
0 28565033114 CreateEvent {'id': 10656079, 'login': 'brownsarahm', 'disp... {'id': 592944632, 'name': 'rhodyprog4ds/BrownS... {'ref': 'c22', 'ref_type': 'tag', 'master_bran... True 2023-04-21 00:47:50+00:00 {'id': 69595187, 'login': 'rhodyprog4ds', 'gra...

We want to handle several columns this way, so we’ll make a list of the names.

js_cols = ['actor','repo','payload','org']

pd.concat takes a list of DataFrames and puts them together in one DataFrame.

pd.concat([gh_df[col].apply(pd.Series) for col in js_cols],axis=1).head()
id login display_login gravatar_id url avatar_url id name url ref ... before commits action release issue id login gravatar_id url avatar_url
0 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? 592944632 rhodyprog4ds/BrownSpring23 https://api.github.com/repos/rhodyprog4ds/Brow... c22 ... NaN NaN NaN NaN NaN 69595187 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
1 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? 592944632 rhodyprog4ds/BrownSpring23 https://api.github.com/repos/rhodyprog4ds/Brow... refs/heads/main ... 14247b91b29fdf6641b07785ab87920d1e9e26eb [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... NaN NaN NaN 69595187 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
2 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? 592944632 rhodyprog4ds/BrownSpring23 https://api.github.com/repos/rhodyprog4ds/Brow... refs/heads/gh-pages ... c79137af62e22428db8a3e5614a496e85e2094a6 [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... NaN NaN NaN 69595187 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
3 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? 592944632 rhodyprog4ds/BrownSpring23 https://api.github.com/repos/rhodyprog4ds/Brow... refs/heads/gh-pages ... 972dcd4e3117f378c346547e20beb2905689d57d [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... NaN NaN NaN 69595187 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
4 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? 592944632 rhodyprog4ds/BrownSpring23 https://api.github.com/repos/rhodyprog4ds/Brow... refs/heads/main ... 9d273b140d5a2e303253bf568b3d6bddf78c42d1 [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... NaN NaN NaN 69595187 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?

5 rows × 29 columns

This is close, but a lot of columns have the same name. To fix this we will rename the new columns so that they have the original column name appended to the new name.

pandas has a rename method for this, and this is another job for lambdas.

pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[35], line 1
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()

Cell In[35], line 1, in <listcomp>(.0)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/frame.py:5440, in DataFrame.rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   5321 def rename(
   5322     self,
   5323     mapper: Renamer | None = None,
   (...)
   5331     errors: IgnoreRaise = "ignore",
   5332 ) -> DataFrame | None:
   5333     """
   5334     Rename columns or index labels.
   5335 
   (...)
   5438     4  3  6
   5439     """
-> 5440     return super()._rename(
   5441         mapper=mapper,
   5442         index=index,
   5443         columns=columns,
   5444         axis=axis,
   5445         copy=copy,
   5446         inplace=inplace,
   5447         level=level,
   5448         errors=errors,
   5449     )

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/generic.py:1034, in NDFrame._rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1027         missing_labels = [
   1028             label
   1029             for index, label in enumerate(replacements)
   1030             if indexer[index] == -1
   1031         ]
   1032         raise KeyError(f"{missing_labels} not found in axis")
-> 1034 new_index = ax._transform_index(f, level=level)
   1035 result._set_axis_nocheck(new_index, axis=axis_no, inplace=True, copy=False)
   1036 result._clear_item_cache()

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in Index._transform_index(self, func, level)
   6202     return type(self).from_arrays(values)
   6203 else:
-> 6204     items = [func(x) for x in self]
   6205     return Index(items, name=self.name, tupleize_cols=False)

File /opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/pandas/core/indexes/base.py:6204, in <listcomp>(.0)
   6202     return type(self).from_arrays(values)
   6203 else:
-> 6204     items = [func(x) for x in self]
   6205     return Index(items, name=self.name, tupleize_cols=False)

Cell In[35], line 1, in <lambda>(c)
----> 1 pd.concat([gh_df[col].apply(pd.Series).rename(lambda c: '_'.join([c,col])) for col in js_cols],axis=1).head()

TypeError: sequence item 0: expected str instance, int found
The TypeError occurs because rename applies the mapper to the row index by default, and the index labels here are integers, which '_'.join cannot handle. Passing the function to the columns parameter applies it to the column names instead:

gh_df['actor'].apply(pd.Series).rename(columns=lambda c: '_'.join([c,'actor']))
id_actor login_actor display_login_actor gravatar_id_actor url_actor avatar_url_actor
0 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
1 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
2 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
3 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
4 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
5 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
6 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
7 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
8 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
9 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
10 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
11 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
12 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
13 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
14 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
15 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
16 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
17 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
18 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
19 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
20 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
21 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
22 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
23 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
24 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
25 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
26 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
27 41898282 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282?
28 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
29 10656079 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079?
json_cols_df = pd.concat([gh_df[col].apply(pd.Series).rename(columns=lambda c: '_'.join([c,col])) for col in js_cols],axis=1)
gh_df.columns
Index(['id', 'type', 'actor', 'repo', 'payload', 'public', 'created_at',
       'org'],
      dtype='object')
json_cols_df.columns
Index(['id_actor', 'login_actor', 'display_login_actor', 'gravatar_id_actor',
       'url_actor', 'avatar_url_actor', 'id_repo', 'name_repo', 'url_repo',
       'ref_payload', 'ref_type_payload', 'master_branch_payload',
       'description_payload', 'pusher_type_payload', 'repository_id_payload',
       'push_id_payload', 'size_payload', 'distinct_size_payload',
       'head_payload', 'before_payload', 'commits_payload', 'action_payload',
       'release_payload', 'issue_payload', 'id_org', 'login_org',
       'gravatar_id_org', 'url_org', 'avatar_url_org'],
      dtype='object')

Then we can put the two parts of the data together. Note that we need axis=1 so that concat aligns the two DataFrames side by side; the default axis=0 would stack them vertically and fill all of the non-overlapping columns with NaN.

pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],axis=1)
18 2.833450e+10 PushEvent True 2023-04-11 13:02:29+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 2.833432e+10 PushEvent True 2023-04-11 12:56:32+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 2.826148e+10 PushEvent True 2023-04-06 23:43:27+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 2.826142e+10 PushEvent True 2023-04-06 23:36:57+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 2.820908e+10 PushEvent True 2023-04-05 01:28:11+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 2.820905e+10 ReleaseEvent True 2023-04-05 01:26:09+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 2.820902e+10 CreateEvent True 2023-04-05 01:22:48+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 2.820901e+10 PushEvent True 2023-04-05 01:22:25+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 2.820560e+10 IssuesEvent True 2023-04-04 21:04:21+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 2.817440e+10 PushEvent True 2023-04-03 19:24:10+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 2.817424e+10 PushEvent True 2023-04-03 19:16:08+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 2.810662e+10 ReleaseEvent True 2023-03-31 01:37:29+00:00 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaT 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... NaN NaN NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
1 NaN NaN NaN NaT 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... 14247b91b29fdf6641b07785ab87920d1e9e26eb [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
2 NaN NaN NaN NaT 41898282.0 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? ... c79137af62e22428db8a3e5614a496e85e2094a6 [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
3 NaN NaN NaN NaT 41898282.0 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? ... 972dcd4e3117f378c346547e20beb2905689d57d [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
4 NaN NaN NaN NaT 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... 9d273b140d5a2e303253bf568b3d6bddf78c42d1 [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?

35 rows × 33 columns
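The 35-row result above comes from pd.concat defaulting to axis=0. A tiny sketch of the difference, with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]})
right = pd.DataFrame({'y': [3, 4]})

# default axis=0 stacks the rows: 4 rows, NaN wherever a frame lacks a column
stacked = pd.concat([left, right])
print(stacked.shape)  # (4, 2)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (2, 2)
```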

With axis=1 the columns line up the way we intended, and finally we can save this

gh_df_clean = pd.concat([gh_df[['id','type','public','created_at']],json_cols_df],axis=1)
gh_df_clean.head()
id type public created_at id_actor login_actor display_login_actor gravatar_id_actor url_actor avatar_url_actor ... before_payload commits_payload action_payload release_payload issue_payload id_org login_org gravatar_id_org url_org avatar_url_org
0 28565033114 CreateEvent True 2023-04-21 00:47:50+00:00 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... NaN NaN NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
1 28565028952 PushEvent True 2023-04-21 00:47:24+00:00 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... 14247b91b29fdf6641b07785ab87920d1e9e26eb [{'sha': '0723a9a16696f9b5ffd606678a6acb6c71ae... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
2 28508780091 PushEvent True 2023-04-19 01:08:30+00:00 41898282.0 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? ... c79137af62e22428db8a3e5614a496e85e2094a6 [{'sha': '18628aa22d10f2150ec0d725a7e45592c1b5... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
3 28508755066 PushEvent True 2023-04-19 01:06:30+00:00 41898282.0 github-actions[bot] github-actions https://api.github.com/users/github-actions[bot] https://avatars.githubusercontent.com/u/41898282? ... 972dcd4e3117f378c346547e20beb2905689d57d [{'sha': 'c79137af62e22428db8a3e5614a496e85e20... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?
4 28508702433 PushEvent True 2023-04-19 01:02:22+00:00 10656079.0 brownsarahm brownsarahm https://api.github.com/users/brownsarahm https://avatars.githubusercontent.com/u/10656079? ... 9d273b140d5a2e303253bf568b3d6bddf78c42d1 [{'sha': '14247b91b29fdf6641b07785ab87920d1e9e... NaN NaN NaN 69595187.0 rhodyprog4ds https://api.github.com/orgs/rhodyprog4ds https://avatars.githubusercontent.com/u/69595187?

5 rows × 33 columns

If we want to analyze this data, this is a good place to save it to disk and start the analysis in a separate notebook.

gh_df_clean.to_csv('gh_events_unpacked.csv')
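One thing to remember when starting that separate notebook: to_csv writes the index as the first (unnamed) column, so reading back with index_col=0 restores it. A small sketch using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

buffer = io.StringIO()
df.to_csv(buffer)        # the index becomes the first, unnamed column
buffer.seek(0)

# index_col=0 turns that first column back into the index on read
round_trip = pd.read_csv(buffer, index_col=0)
print(round_trip.equals(df))  # True
```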

8.7. Questions After Class#

8.7.1. How the apply function works/use cases?#

A4 will give you some examples, especially the airline dataset. We will also keep seeing it come up as we manipulate data more.

the apply docs have tiny examples that help illustrate what it does and some of how it works. The pandas faq has a section on apply and similar methods that gives some more use cases.
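As a quick sketch (with made-up data): apply calls a function once per element of a Series, or once per column of a DataFrame, and collects the results.

```python
import pandas as pd

prices = pd.Series([1.0, 2.5, 3.0], name='price')

# on a Series, apply calls the function once for each element
doubled = prices.apply(lambda p: p * 2)
print(doubled.tolist())  # [2.0, 5.0, 6.0]

# on a DataFrame, apply works column-by-column (or row-by-row with axis=1)
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
col_sums = df.apply(sum)
print(col_sums.tolist())  # [3, 7]
```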

8.7.2. Is there a better way to see how many missing values?#

There are lots of ways. All are fine. We used info in class because I was trying to use the way we had already seen. Info focuses on how many values are present instead of what is missing because it makes more sense in most cases. The more common question is: are there enough values to make decisions with?

If you want to get counts of the missing values, you can use the pandas isna function. Note that it is a pandas function (the docs describe it as pandas.isna), not a DataFrame method (which would be described like pandas.DataFrame.methodname). This means we use it like

value_to_test = 4
pd.isna(value_to_test)
False

Try it Yourself

pass different values like False, np.nan (after import numpy as np), pd.NA, and 'hello' to this function
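One way to check them all at once is a loop (try it yourself first; only np.nan and pd.NA count as missing):

```python
import numpy as np
import pandas as pd

# pd.isna treats the NaN/NA sentinels as missing, but not False or strings
for value in [False, np.nan, pd.NA, 'hello']:
    print(repr(value), pd.isna(value))
```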

help(pd.isna)
Help on function isna in module pandas.core.dtypes.missing:

isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'
    Detect missing values for an array-like object.
    
    This function takes a scalar or array-like object and indicates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).
    
    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.
    
    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.
    
    See Also
    --------
    notna : Boolean inverse of pandas.isna.
    Series.isna : Detect missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.
    
    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.
    
    >>> pd.isna('dog')
    False
    
    >>> pd.isna(pd.NA)
    True
    
    >>> pd.isna(np.nan)
    True
    
    ndarrays result in an ndarray of booleans.
    
    >>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
    >>> array
    array([[ 1., nan,  3.],
           [ 4.,  5., nan]])
    >>> pd.isna(array)
    array([[False,  True, False],
           [False, False,  True]])
    
    For indexes, an ndarray of booleans is returned.
    
    >>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
    ...                           "2017-07-08"])
    >>> index
    DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
                  dtype='datetime64[ns]', freq=None)
    >>> pd.isna(index)
    array([False, False,  True, False])
    
    For Series and DataFrame, the same type is returned, containing booleans.
    
    >>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
    >>> df
         0     1    2
    0  ant   bee  cat
    1  dog  None  fly
    >>> pd.isna(df)
           0      1      2
    0  False  False  False
    1  False   True  False
    
    >>> pd.isna(df[1])
    0    False
    1     True
    Name: 1, dtype: bool

The docstring says that it returns “bool or array-like of bool” but if we go to the website docs that have more examples, we can find out that it will return a DataFrame if we pass it a DataFrame. Then we can use the pandas.DataFrame.sum method.

pd.isna(coffee_df_clean).sum()
species                        0
owner                          7
country_of_origin              1
farm_name                    356
lot_number                  1041
mill                         310
ico_number                   148
company                      209
altitude                     223
region                        57
producer                     230
number_of_bags                 0
bag_weight                     0
in_country_partner             0
harvest_year                  47
grading_date                   0
owner_1                        7
variety                      201
processing_method            152
aroma                          0
flavor                         0
aftertaste                     0
acidity                        0
body                           0
balance                        0
uniformity                     0
clean_cup                      0
sweetness                      0
cupper_points                  0
total_cup_points               0
moisture                       0
category_one_defects           0
quakers                        1
color                        267
category_two_defects           0
expiration                     0
certification_body             0
certification_address          0
certification_contact          0
unit_of_measurement            0
altitude_low_meters          227
altitude_high_meters         227
altitude_mean_meters         227
unit_of_measurement_long       0
lot_number_clean               0
dtype: int64
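To see proportions instead of counts, you can take the mean of the boolean frame, since the mean of a True/False column is the fraction of True values. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

# fraction of missing values per column: 1/3 for a, 2/3 for b
print(toy.isna().mean())
```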

8.7.3. in col_name_mapper = {col_name:col_name.lower().replace('.','_') for col_name in coffee_df.columns} what is the {} for?#

This is called a dictionary comprehension. It is very similar to a list comprehension, and it is one of the defined ways to build a dict type object.

We also saw one when we looked at different types in a previous class.

{char:i for i,char in enumerate('abcde')}
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

enumerate is a built-in function that iterates over the items in an iterable type (list-like) and yields each value paired with its index within the structure.

This way we get each character and its position. We could use this as follows

num_chars = {char:i for i,char in enumerate('abcde')}
alpha_data = ['a','d','e','c','b']
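For example, the lookup table can be used to encode a list of letters as their positions in the alphabet string (a small self-contained sketch):

```python
# build the character-to-position lookup with a dict comprehension
num_chars = {char: i for i, char in enumerate('abcde')}

# use it to encode a list of letters as numbers
alpha_data = ['a', 'd', 'e', 'c', 'b']
encoded = [num_chars[char] for char in alpha_data]
print(encoded)  # [0, 3, 4, 2, 1]
```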