Class 9: Preparing Data For Analysis¶
Say hello in the zoom chat
join Prismia
Checking types is an important part of cleaning data: we need to figure out what is wrong with the data before we can fix it.
Remember that data cleaning involves a lot of exploration and iteration on the data. Let's review a few data types we've seen.
Warmup: type review¶
Lists are enclosed by square brackets ([]); they're ordered and iterable.
type([char for char in 'abcde'])
list
Dictionaries are made up of key:value pairs and enclosed in curly brackets ({}); they're iterable and indexable by their keys.
type({char:i for i, char in enumerate('abcde')})
dict
This is a tuple. This data type is used to tie items together, but not usually to iterate over. It's often used for pairs or small groups of things, for example when iterating over multiple things at once with the zip function.
type(('a','b','c','d','e'))
tuple
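For example, here is a small sketch (with made-up lists letters and numbers) of how zip pairs items up into tuples:
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]
# zip pairs the items up position by position; each pair comes out as a tuple
for pair in zip(letters, numbers):
    print(pair, type(pair))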
Splitting a string makes a list.
type('a b c d e'.split(' '))
list
Loading Data¶
Loading the data and packages as normal.
# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
safi_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
safi_df = pd.read_csv(safi_url)
Exploring Types¶
On Friday we figured out that the following function would clean one cell of the safi_df['items_owned'] column.
def separate_items(row):
    '''
    Split the string in the items_owned field of one row into a list of
    item names.
    '''
    return [i.strip() for i in row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
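For example, calling it on the first row, where items_owned is a string, gives a list of item names (a quick sketch; compare with the cleaned output further down):
separate_items(safi_df.loc[0])
# expected: ['bicycle', 'television', 'solar_panel', 'table']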
But when we applied it, we got an error.
safi_df.apply(separate_items,axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results
/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because
<ipython-input-7-1663c693d570> in separate_items(row)
      3     '''
      4 
----> 5     return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
AttributeError: 'float' object has no attribute 'replace'
Let’s figure out why. First we’ll look at the data again to remember what we expected.
safi_df['items_owned']
0      ['bicycle' ;  'television' ;  'solar_panel' ; ...
1      ['cow_cart' ;  'bicycle' ;  'radio' ;  'cow_pl...
2                                        ['solar_torch']
3      ['bicycle' ;  'radio' ;  'cow_plough' ;  'sola...
4      ['motorcyle' ;  'radio' ;  'cow_plough' ;  'mo...
                             ...                        
126            ['motorcyle' ;  'radio' ;  'solar_panel']
127    ['car' ;  'lorry' ;  'television' ;  'radio' ;...
128    ['radio' ;  'solar_panel' ;  'solar_torch' ;  ...
129    ['cow_cart' ;  'lorry' ;  'motorcyle' ;  'comp...
130    ['radio' ;  'cow_plough' ;  'solar_panel' ;  '...
Name: items_owned, Length: 131, dtype: object
The error says that some item is a float. Let's look at the values of only the ones that are floats.
[item_list for item_list in  safi_df['items_owned'] if type(item_list)==float]
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
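They're all nan, which is how pandas marks missing values. As an alternative check (a sketch, not what we ran in class), we could count the missing values directly; it should match the 10 nans above:
safi_df['items_owned'].isna().sum()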
Now we can modify the cleaning function to handle the missing values
def separate_items(row):
    '''
    Split the string in the items_owned field of one row into a list of
    item names; return an empty list if the value is missing (not a string).
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []
and apply it again to see if it works now.
safi_df.apply(separate_items,axis=1)
0              [bicycle, television, solar_panel, table]
1      [cow_cart, bicycle, radio, cow_plough, solar_p...
2                                          [solar_torch]
3      [bicycle, radio, cow_plough, solar_panel, mobi...
4           [motorcyle, radio, cow_plough, mobile_phone]
                             ...                        
126                      [motorcyle, radio, solar_panel]
127    [car, lorry, television, radio, sterio, cow_pl...
128      [radio, solar_panel, solar_torch, mobile_phone]
129    [cow_cart, lorry, motorcyle, computer, televis...
130    [radio, cow_plough, solar_panel, solar_torch, ...
Length: 131, dtype: object
Dummies From Lists¶
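The next cells store the cleaned lists back into the DataFrame and then turn them into dummy (one-hot) variables. As a rough sketch of what the chained steps do, using a tiny made-up Series of lists:
toy = pd.Series([['radio', 'bicycle'], ['radio']])
# apply(pd.Series) spreads each list across columns, padding shorter lists with NaN
# stack() then drops the NaN and leaves one item per row, with a two-level index
toy_long = toy.apply(pd.Series).stack()
# get_dummies makes one 0/1 column per distinct item
pd.get_dummies(toy_long)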
safi_df['items_owned'] = safi_df.apply(separate_items,axis=1)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack())
|   |   | bicycle | car | computer | cow_cart | cow_plough | electricity | fridge | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 130 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 130 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 130 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 130 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 130 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
621 rows × 17 columns
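This has one row for every individual item mentioned in any response, 621 item mentions in total, which is why it is longer than the 131 rows of the original column.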
safi_df['items_owned'].shape
(131,)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).index
MultiIndex([(  0,  0),
            (  0,  1),
            (  0,  2),
            (  0,  3),
            (  1,  0),
            (  1,  1),
            (  1,  2),
            (  1,  3),
            (  1,  4),
            (  1,  5),
            ...
            (129,  8),
            (129,  9),
            (129, 10),
            (129, 11),
            (130,  0),
            (130,  1),
            (130,  2),
            (130,  3),
            (130,  4),
            (130,  5)],
           length=621)
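The first level of the MultiIndex is the original row number and the second is the item's position within that row's list. Summing over the first level collapses this back to one row per survey response.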
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)
|   | bicycle | car | computer | cow_cart | cow_plough | electricity | fridge | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 126 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 
| 127 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 
| 128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 
| 129 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 
| 130 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 
121 rows × 17 columns
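Note that passing level to sum is deprecated in newer versions of pandas; an equivalent (a sketch, assuming a recent pandas version) is to group by the first index level and then sum:
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).groupby(level=0).sum()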
safi_df['items_owned'].head()
0            [bicycle, television, solar_panel, table]
1    [cow_cart, bicycle, radio, cow_plough, solar_p...
2                                        [solar_torch]
3    [bicycle, radio, cow_plough, solar_panel, mobi...
4         [motorcyle, radio, cow_plough, mobile_phone]
Name: items_owned, dtype: object
safi_item_df = pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)
Pandas concat¶
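To attach the per-item dummy columns to the original survey data, we concatenate the two DataFrames along the columns (axis=1); pandas lines the rows up by their index.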
safi_df = pd.concat([safi_df, safi_item_df],axis=1)
safi_df.shape
(131, 82)
safi_df.head()
|   | key_id | interview_date | quest_no | start | end | province | district | ward | village | years_farm | ... | lorry | mobile_phone | motorcyle | radio | sofa_set | solar_panel | solar_torch | sterio | table | television |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17 November 2016 | 1 | 2017-03-23T09:49:57.000Z | 2017-04-02T17:29:08.000Z | Manica | Manica | Bandula | God | 11 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 
| 1 | 2 | 17 November 2016 | 1 | 2017-04-02T09:48:16.000Z | 2017-04-02T17:26:19.000Z | Manica | Manica | Bandula | God | 2 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 
| 2 | 3 | 17 November 2016 | 3 | 2017-04-02T14:35:26.000Z | 2017-04-02T17:26:53.000Z | Manica | Manica | Bandula | God | 40 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 
| 3 | 4 | 17 November 2016 | 4 | 2017-04-02T14:55:18.000Z | 2017-04-02T17:27:16.000Z | Manica | Manica | Bandula | God | 6 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
| 4 | 5 | 17 November 2016 | 5 | 2017-04-02T15:10:35.000Z | 2017-04-02T17:27:35.000Z | Manica | Manica | Bandula | God | 18 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 
5 rows × 82 columns
Working with Dummy Variables¶
safi_df['bicycle']
0      1.0
1      1.0
2      0.0
3      1.0
4      0.0
      ... 
126    0.0
127    0.0
128    0.0
129    0.0
130    0.0
Name: bicycle, Length: 131, dtype: float64
sum(safi_df['bicycle'])
nan
safi_df['bicycle'].sum()
60.0
safi_df['bicycle'].count()
121
safi_df['bicycle'].shape
(131,)
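Why did Python's built-in sum return nan while the pandas .sum() method returned 60? safi_item_df only had 121 rows, so the concat above filled the dummy columns with NaN for the 10 responses that had no items list, and anything added to nan is nan. The pandas .sum() method skips missing values by default, and .count() counts only the non-missing entries, which is why it reports 121 rather than 131. A minimal sketch with a made-up Series:
import numpy as np
demo = pd.Series([1.0, np.nan, 1.0])
sum(demo)      # nan: the built-in sum adds every value, and nan propagates
demo.sum()     # 2.0: pandas skips missing values by default (skipna=True)
demo.count()   # 2: the number of non-missing values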
pd.concat([safi_df, safi_item_df],axis=0).shape
(252, 82)
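Concatenating along axis=0 instead stacks the two DataFrames on top of each other (131 + 121 = 252 rows), filling columns that appear in only one of them with NaN, so it is not what we want here; axis=1, which we used above, adds the dummy columns alongside the existing ones.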
Questions after class¶
What does the get_dummies function do?¶
To illustrate, let's first make a small DataFrame to look at.
ex_data = [['a',2,'x'],['b',5,'o'],['a',4,'o'],['c',2,'x'],['b',3,'x']]
ex_df = pd.DataFrame(data =ex_data, columns=['char','num','symbol'])
ex_df
|   | char | num | symbol |
|---|---|---|---|
| 0 | a | 2 | x | 
| 1 | b | 5 | o | 
| 2 | a | 4 | o | 
| 3 | c | 2 | x | 
| 4 | b | 3 | x | 
When we apply get_dummies to one column (a pd.Series), it makes a new column for each distinct value in that column.
First, let's look at the column itself to see what it looks like. Note that there are 3 different values.
ex_df['char']
0    a
1    b
2    a
3    c
4    b
Name: char, dtype: object
Now convert it to dummy variables; we get 3 columns. For each row there is a 1 in the column corresponding to that row's value and zeros elsewhere.
pd.get_dummies(ex_df['char'])
|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 0 | 0 | 
| 1 | 0 | 1 | 0 | 
| 2 | 1 | 0 | 0 | 
| 3 | 0 | 0 | 1 | 
| 4 | 0 | 1 | 0 | 
We can confirm by summing across the rows; it's all ones.
pd.get_dummies(ex_df['char']).sum(axis=1)
0    1
1    1
2    1
3    1
4    1
dtype: int64
If we sum down the columns, we get the count of each value.
pd.get_dummies(ex_df['char']).sum(axis=0)
a    2
b    2
c    1
dtype: int64
This is the same thing as value_counts().
ex_df['char'].value_counts()
a    2
b    2
c    1
Name: char, dtype: int64
We can also apply it to one column but append the result to the whole dataset this way.
pd.get_dummies(ex_df,columns=['char'])
|   | num | symbol | char_a | char_b | char_c |
|---|---|---|---|---|---|
| 0 | 2 | x | 1 | 0 | 0 | 
| 1 | 5 | o | 0 | 1 | 0 | 
| 2 | 4 | o | 1 | 0 | 0 | 
| 3 | 2 | x | 0 | 0 | 1 | 
| 4 | 3 | x | 0 | 1 | 0 | 
This way, the new column names are prefixed with the original column name, and the result is already concatenated with the rest of the DataFrame, so we don't have to do that separately.
When we apply get_dummies to a whole DataFrame without indicating a column, it converts all of the columns of pandas type object.
ex_df.dtypes
char      object
num        int64
symbol    object
dtype: object
pd.get_dummies(ex_df,)
|   | num | char_a | char_b | char_c | symbol_o | symbol_x |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | 0 | 1 | 
| 1 | 5 | 0 | 1 | 0 | 1 | 0 | 
| 2 | 4 | 1 | 0 | 0 | 1 | 0 | 
| 3 | 2 | 0 | 0 | 1 | 0 | 1 | 
| 4 | 3 | 0 | 1 | 0 | 0 | 1 |