Class 9: Preparing Data For Analysis¶
Say hello in the Zoom chat
Join Prismia
Checking types is an important part of cleaning data; we need to figure out what is wrong with the data before we can fix it.
Remember that data cleaning involves a lot of exploration and iteration on the data. Let’s review a few data types we’ve seen.
Warmup: type review¶
Lists are enclosed in square brackets ([]); they’re ordered and iterable.
type([char for char in 'abcde'])
list
Dictionaries are made up of key:value pairs and enclosed in curly brackets ({}); they’re iterable and indexable by their keys.
type({char:i for i, char in enumerate('abcde')})
dict
This is a tuple. This data type ties items together but is not usually iterated over itself; it’s often used for pairs or small groups of related values, for example when iterating over multiple things at once with the zip function.
type(('a','b','c','d','e'))
tuple
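For example, here is a minimal sketch (not from the class notes) of how zip hands back a tuple on each step when iterating over two lists at once:
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]
# zip pairs up the items; each pair comes back as a tuple
for pair in zip(letters, numbers):
    print(type(pair), pair)   # <class 'tuple'> ('a', 1), then ('b', 2), ('c', 3)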
Splitting a string makes a list.
type('a b c d e'.split(' '))
list
Loading Data¶
Loading the data and packages as normal.
# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
safi_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
safi_df = pd.read_csv(safi_url)
Exploring Types¶
On Friday we figured out that the following function would clean one cell of the safi_df['items_owned'] column.
def separate_items(row):
    '''clean one cell of items_owned: strip the brackets and quotes and split on ";"
    '''

    return [i.strip() for i in row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
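Recall that apply with axis=1 passes each row to the function as a pd.Series, which is why the function indexes row['items_owned'] to pull out one cell. A minimal sketch with a made-up one-row DataFrame (not the class data):
mini_df = pd.DataFrame({'items_owned': ["['radio' ; 'bicycle']"]})
# each row is passed to separate_items as a Series; the cleaned list comes back
mini_df.apply(separate_items, axis=1)   # 0    [radio, bicycle]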
But when we applied it, we got an error.
safi_df.apply(separate_items,axis=1)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
<ipython-input-7-1663c693d570> in separate_items(row)
3 '''
4
----> 5 return [i.strip() for i in row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
AttributeError: 'float' object has no attribute 'replace'
Let’s figure out why. First we’ll look at the data again to remember what we expected.
safi_df['items_owned']
0 ['bicycle' ; 'television' ; 'solar_panel' ; ...
1 ['cow_cart' ; 'bicycle' ; 'radio' ; 'cow_pl...
2 ['solar_torch']
3 ['bicycle' ; 'radio' ; 'cow_plough' ; 'sola...
4 ['motorcyle' ; 'radio' ; 'cow_plough' ; 'mo...
...
126 ['motorcyle' ; 'radio' ; 'solar_panel']
127 ['car' ; 'lorry' ; 'television' ; 'radio' ;...
128 ['radio' ; 'solar_panel' ; 'solar_torch' ; ...
129 ['cow_cart' ; 'lorry' ; 'motorcyle' ; 'comp...
130 ['radio' ; 'cow_plough' ; 'solar_panel' ; '...
Name: items_owned, Length: 131, dtype: object
The error says that some item is a float; let’s look at the values of only the ones that are floats.
[item_list for item_list in safi_df['items_owned'] if type(item_list)==float]
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
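Those are missing values. In pandas, a missing value in an object column shows up as NaN, and NaN is a float, so it has no string methods like replace. A quick check:
# NaN ("not a number") is a float, which is why .replace failed on those cells
type(float('nan'))   # float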
Now we can modify the cleaning function to handle those missing values
def separate_items(row):
    '''clean one cell of items_owned: strip the brackets and quotes and split on ";";
    return an empty list if the cell is not a string (e.g. a missing value)
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []
and apply it again to see if it works now.
safi_df.apply(separate_items,axis=1)
0 [bicycle, television, solar_panel, table]
1 [cow_cart, bicycle, radio, cow_plough, solar_p...
2 [solar_torch]
3 [bicycle, radio, cow_plough, solar_panel, mobi...
4 [motorcyle, radio, cow_plough, mobile_phone]
...
126 [motorcyle, radio, solar_panel]
127 [car, lorry, television, radio, sterio, cow_pl...
128 [radio, solar_panel, solar_torch, mobile_phone]
129 [cow_cart, lorry, motorcyle, computer, televis...
130 [radio, cow_plough, solar_panel, solar_torch, ...
Length: 131, dtype: object
Dummies From Lists¶
safi_df['items_owned'] = safi_df.apply(separate_items,axis=1)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack())
bicycle | car | computer | cow_cart | cow_plough | electricity | fridge | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
130 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
621 rows × 17 columns
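The cleaned lists were stored back into items_owned; then .apply(pd.Series) spreads each list across columns and .stack() puts each item on its own row, which is why there are 621 rows here, one per item owned. A minimal sketch with a toy Series (not the class data):
toy = pd.Series([['radio', 'bicycle'], ['solar_torch']])
toy.apply(pd.Series)           # one row per household, one column per list position, NaN where a list is short
toy.apply(pd.Series).stack()   # a MultiIndexed Series with one row per (household, position) pair
pd.get_dummies(toy.apply(pd.Series).stack())   # one dummy row per item owned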
safi_df['items_owned'].shape
(131,)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).index
MultiIndex([( 0, 0),
( 0, 1),
( 0, 2),
( 0, 3),
( 1, 0),
( 1, 1),
( 1, 2),
( 1, 3),
( 1, 4),
( 1, 5),
...
(129, 8),
(129, 9),
(129, 10),
(129, 11),
(130, 0),
(130, 1),
(130, 2),
(130, 3),
(130, 4),
(130, 5)],
length=621)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)
bicycle | car | computer | cow_cart | cow_plough | electricity | fridge | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
127 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
129 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
130 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
121 rows × 17 columns
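Note that .sum(level=0) has since been deprecated in newer versions of pandas; grouping by the outer index level does the same collapse back to one row per household. A sketch of the equivalent call, under that assumption:
# newer pandas style: group the stacked dummies by household (index level 0) and sum
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).groupby(level=0).sum()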
safi_df['items_owned'].head()
0 [bicycle, television, solar_panel, table]
1 [cow_cart, bicycle, radio, cow_plough, solar_p...
2 [solar_torch]
3 [bicycle, radio, cow_plough, solar_panel, mobi...
4 [motorcyle, radio, cow_plough, mobile_phone]
Name: items_owned, dtype: object
safi_item_df = pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)
Pandas concat¶
safi_df = pd.concat([safi_df, safi_item_df],axis=1)
safi_df.shape
(131, 82)
safi_df.head()
key_id | interview_date | quest_no | start | end | province | district | ward | village | years_farm | ... | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 17 November 2016 | 1 | 2017-03-23T09:49:57.000Z | 2017-04-02T17:29:08.000Z | Manica | Manica | Bandula | God | 11 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
1 | 2 | 17 November 2016 | 1 | 2017-04-02T09:48:16.000Z | 2017-04-02T17:26:19.000Z | Manica | Manica | Bandula | God | 2 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
2 | 3 | 17 November 2016 | 3 | 2017-04-02T14:35:26.000Z | 2017-04-02T17:26:53.000Z | Manica | Manica | Bandula | God | 40 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 4 | 17 November 2016 | 4 | 2017-04-02T14:55:18.000Z | 2017-04-02T17:27:16.000Z | Manica | Manica | Bandula | God | 6 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 5 | 17 November 2016 | 5 | 2017-04-02T15:10:35.000Z | 2017-04-02T17:27:35.000Z | Manica | Manica | Bandula | God | 18 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 82 columns
Working with Dummy Variables¶
safi_df['bicycle']
0 1.0
1 1.0
2 0.0
3 1.0
4 0.0
...
126 0.0
127 0.0
128 0.0
129 0.0
130 0.0
Name: bicycle, Length: 131, dtype: float64
sum(safi_df['bicycle'])
nan
safi_df['bicycle'].sum()
60.0
safi_df['bicycle'].count()
121
safi_df['bicycle'].shape
(131,)
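Why does the built-in sum give nan while the pandas .sum method gives 60? Only 121 of the 131 households had any items listed, so after the concat the dummy columns hold NaN for the other 10 rows. Python’s built-in sum propagates NaN, while the pandas .sum method skips NaN by default (skipna=True). A minimal sketch:
vals = pd.Series([1.0, 0.0, float('nan')])
sum(vals)    # nan: the built-in sum propagates the missing value
vals.sum()   # 1.0: pandas skips NaN by default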
pd.concat([safi_df, safi_item_df],axis=0).shape
(252, 82)
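With axis=0 the two frames are stacked on top of each other instead of side by side: 131 rows from safi_df plus 121 rows from safi_item_df gives 252, and since the 17 item columns already exist in safi_df by name, the column count stays at 82. A minimal sketch of the two axes with toy frames (not the class data):
left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})
pd.concat([left, right], axis=1).shape   # (2, 2): columns placed side by side, aligned on the index
pd.concat([left, right], axis=0).shape   # (4, 2): rows stacked; columns missing from one frame are filled with NaN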
Questions after class¶
What does the get_dummies function do?¶
To illustrate, let’s first make a small DataFrame to look at.
ex_data = [['a',2,'x'],['b',5,'o'],['a',4,'o'],['c',2,'x'],['b',3,'x']]
ex_df = pd.DataFrame(data =ex_data, columns=['char','num','symbol'])
ex_df
char | num | symbol | |
---|---|---|---|
0 | a | 2 | x |
1 | b | 5 | o |
2 | a | 4 | o |
3 | c | 2 | x |
4 | b | 3 | x |
When we apply get_dummies to one column (a pd.Series), it makes a column for each unique value of that column.
First we look at that column on its own to see what it looks like. Note that there are 3 different values.
ex_df['char']
0 a
1 b
2 a
3 c
4 b
Name: char, dtype: object
Now convert it to dummy variables; we get 3 columns. For each row there is a 1 in the column corresponding to its value and zeros elsewhere.
pd.get_dummies(ex_df['char'])
a | b | c | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 1 | 0 |
We can confirm by summing across the rows; every row sums to one.
pd.get_dummies(ex_df['char']).sum(axis=1)
0 1
1 1
2 1
3 1
4 1
dtype: int64
If we sum down the columns, we get the count of each value.
pd.get_dummies(ex_df['char']).sum(axis=0)
a 2
b 2
c 1
dtype: int64
It’s the same thing as value_counts().
ex_df['char'].value_counts()
a 2
b 2
c 1
Name: char, dtype: int64
We can also apply it to just one column and append the result to the whole dataset this way.
pd.get_dummies(ex_df,columns=['char'])
num | symbol | char_a | char_b | char_c | |
---|---|---|---|---|---|
0 | 2 | x | 1 | 0 | 0 |
1 | 5 | o | 0 | 1 | 0 |
2 | 4 | o | 1 | 0 | 0 |
3 | 2 | x | 0 | 0 | 1 |
4 | 3 | x | 0 | 1 | 0 |
This way, it prefixes the new column names with the original column name, and it does the concatenation on its own so we don’t have to do it separately.
When we apply get_dummies to a whole DataFrame without indicating a column, it converts all of the columns of pandas type object.
ex_df.dtypes
char object
num int64
symbol object
dtype: object
pd.get_dummies(ex_df,)
num | char_a | char_b | char_c | symbol_o | symbol_x | |
---|---|---|---|---|---|---|
0 | 2 | 1 | 0 | 0 | 0 | 1 |
1 | 5 | 0 | 1 | 0 | 1 | 0 |
2 | 4 | 1 | 0 | 0 | 1 | 0 |
3 | 2 | 0 | 0 | 1 | 0 | 1 |
4 | 3 | 0 | 1 | 0 | 0 | 1 |
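One small caveat: depending on the pandas version, the dummy columns may come back as small integers or as booleans. If you want plain integer dummies you can ask for them explicitly with the dtype argument; a sketch:
# request integer 0/1 columns regardless of the pandas default
pd.get_dummies(ex_df, dtype=int)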