Class 9: Preparing Data For Analysis

  • Say hello in the zoom chat

  • join Prismia

Checking types is an important part of cleaning data, we need to figure out what is wrong with data before we can fix it.

Remember that data cleaning is a lot of exploration and iteration in the data. Let’s review a few data types we’ve seen.

Warmup: type review

Lists are enclosed by square brackets([]), they’re ordered and iterable.

type([char for char in 'abcde'])
list

Dictionaries are comprised of key:value pairs and enclosed in curly brackets ({}) they’re iterable and indexable by the keys.

type({char:i for i, char in enumerate('abcde')})
dict

This is a tuple, this data type is used to tie together items, but not usually to iterate over. It’s used for pairs of things often or groups to iterate over multiple things at once. For example with the zip function.

type(('a','b','c','d','e'))
tuple

Splitting a string makes a list.

 type('a b c d e'.split(' '))
list

Loading Data

Loading the data and packages as normal.

# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns
safi_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
safi_df = pd.read_csv(safi_url)

Exploring Types

On Friday we figured out that the following function would clean one cell of the safi_df['items_owned'] column.

def separate_items(row):
    '''
    '''

    return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

But when we applied it, we got an error.

safi_df.apply(separate_items,axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-7-1663c693d570> in separate_items(row)
      3     '''
      4 
----> 5     return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

AttributeError: 'float' object has no attribute 'replace'

Let’s figure out why. First we’ll look at the data again to remember what we expected.

safi_df['items_owned']
0      ['bicycle' ;  'television' ;  'solar_panel' ; ...
1      ['cow_cart' ;  'bicycle' ;  'radio' ;  'cow_pl...
2                                        ['solar_torch']
3      ['bicycle' ;  'radio' ;  'cow_plough' ;  'sola...
4      ['motorcyle' ;  'radio' ;  'cow_plough' ;  'mo...
                             ...                        
126            ['motorcyle' ;  'radio' ;  'solar_panel']
127    ['car' ;  'lorry' ;  'television' ;  'radio' ;...
128    ['radio' ;  'solar_panel' ;  'solar_torch' ;  ...
129    ['cow_cart' ;  'lorry' ;  'motorcyle' ;  'comp...
130    ['radio' ;  'cow_plough' ;  'solar_panel' ;  '...
Name: items_owned, Length: 131, dtype: object

The error says that some item is a float, let’s look at the value of only the ones that are float

[item_list for item_list in  safi_df['items_owned'] if type(item_list)==float]
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Now we can modify the cleaning function

def separate_items(row):
    '''
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []

and apply it again, to see if it works now.

safi_df.apply(separate_items,axis=1)
0              [bicycle, television, solar_panel, table]
1      [cow_cart, bicycle, radio, cow_plough, solar_p...
2                                          [solar_torch]
3      [bicycle, radio, cow_plough, solar_panel, mobi...
4           [motorcyle, radio, cow_plough, mobile_phone]
                             ...                        
126                      [motorcyle, radio, solar_panel]
127    [car, lorry, television, radio, sterio, cow_pl...
128      [radio, solar_panel, solar_torch, mobile_phone]
129    [cow_cart, lorry, motorcyle, computer, televis...
130    [radio, cow_plough, solar_panel, solar_torch, ...
Length: 131, dtype: object

Dummies From Lists

safi_df['items_owned'] = safi_df.apply(separate_items,axis=1)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack())
bicycle car computer cow_cart cow_plough electricity fridge lorry mobile_phone motorcyle radio sofa_set solar_panel solar_torch sterio table television
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
130 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
5 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

621 rows × 17 columns

safi_df['items_owned'].shape
(131,)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).index
MultiIndex([(  0,  0),
            (  0,  1),
            (  0,  2),
            (  0,  3),
            (  1,  0),
            (  1,  1),
            (  1,  2),
            (  1,  3),
            (  1,  4),
            (  1,  5),
            ...
            (129,  8),
            (129,  9),
            (129, 10),
            (129, 11),
            (130,  0),
            (130,  1),
            (130,  2),
            (130,  3),
            (130,  4),
            (130,  5)],
           length=621)
pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)
bicycle car computer cow_cart cow_plough electricity fridge lorry mobile_phone motorcyle radio sofa_set solar_panel solar_torch sterio table television
0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
1 1 0 0 1 1 0 0 0 1 0 1 0 1 1 0 1 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
3 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0
4 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
126 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0
127 0 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 1
128 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0
129 0 0 1 1 1 1 0 1 1 1 1 0 1 1 1 0 1
130 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0

121 rows × 17 columns

safi_df['items_owned'].head()
0            [bicycle, television, solar_panel, table]
1    [cow_cart, bicycle, radio, cow_plough, solar_p...
2                                        [solar_torch]
3    [bicycle, radio, cow_plough, solar_panel, mobi...
4         [motorcyle, radio, cow_plough, mobile_phone]
Name: items_owned, dtype: object
safi_item_df = pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)

Pandas concat

safi_df = pd.concat([safi_df, safi_item_df],axis=1)
safi_df.shape
(131, 82)
safi_df.head()
key_id interview_date quest_no start end province district ward village years_farm ... lorry mobile_phone motorcyle radio sofa_set solar_panel solar_torch sterio table television
0 1 17 November 2016 1 2017-03-23T09:49:57.000Z 2017-04-02T17:29:08.000Z Manica Manica Bandula God 11 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0
1 2 17 November 2016 1 2017-04-02T09:48:16.000Z 2017-04-02T17:26:19.000Z Manica Manica Bandula God 2 ... 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0
2 3 17 November 2016 3 2017-04-02T14:35:26.000Z 2017-04-02T17:26:53.000Z Manica Manica Bandula God 40 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 4 17 November 2016 4 2017-04-02T14:55:18.000Z 2017-04-02T17:27:16.000Z Manica Manica Bandula God 6 ... 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
4 5 17 November 2016 5 2017-04-02T15:10:35.000Z 2017-04-02T17:27:35.000Z Manica Manica Bandula God 18 ... 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 82 columns

Working with Dummy Variables

safi_df['bicycle']
0      1.0
1      1.0
2      0.0
3      1.0
4      0.0
      ... 
126    0.0
127    0.0
128    0.0
129    0.0
130    0.0
Name: bicycle, Length: 131, dtype: float64
sum(safi_df['bicycle'])
nan
safi_df['bicycle'].sum()
60.0
safi_df['bicycle'].count()
121
safi_df['bicycle'].shape
(131,)
pd.concat([safi_df, safi_item_df],axis=0).shape
(252, 82)

Questions after class

What does the get_dummies function do?

To illustrate, let’s first make a small dataframe to look at it

ex_data = [['a',2,'x'],['b',5,'o'],['a',4,'o'],['c',2,'x'],['b',3,'x']]
ex_df = pd.DataFrame(data =ex_data, columns=['char','num','symbol'])
ex_df
char num symbol
0 a 2 x
1 b 5 o
2 a 4 o
3 c 2 x
4 b 3 x

When we apply get_dummies to one column (a pd.Series) it makes a column for each value of that column. First we look at that column to focus on what it looks like. Note that there are 3 different values.

ex_df['char']
0    a
1    b
2    a
3    c
4    b
Name: char, dtype: object

Now conver it to dummy variables, now we have 3 columns. For each row there is a 1 in the column corresponding to the value and zeros elsewhere.

pd.get_dummies(ex_df['char'])
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0

We can confirm by summing across the rows, it’s all ones.

pd.get_dummies(ex_df['char']).sum(axis=1)
0    1
1    1
2    1
3    1
4    1
dtype: int64

If we sum down the columns, we get the count of each value

pd.get_dummies(ex_df['char']).sum(axis=0)
a    2
b    2
c    1
dtype: int64

it’s the same thing as value_counts()

ex_df['char'].value_counts()
a    2
b    2
c    1
Name: char, dtype: int64

We can also apply it one column, but append it to the whole dataset this way.

pd.get_dummies(ex_df,columns=['char'])
num symbol char_a char_b char_c
0 2 x 1 0 0
1 5 o 0 1 0
2 4 o 1 0 0
3 2 x 0 0 1
4 3 x 0 1 0

This way, it prepends the column values with the original column name. This way concatenates on it’s own and we don’t have to do it separately.

When we apply get_dummies on a whole DataFrame, without indicating a column it converts all of the columns of pandas type object

ex_df.dtypes
char      object
num        int64
symbol    object
dtype: object
pd.get_dummies(ex_df,)
num char_a char_b char_c symbol_o symbol_x
0 2 1 0 0 0 1
1 5 0 1 0 1 0
2 4 1 0 0 1 0
3 2 0 0 1 0 1
4 3 0 1 0 0 1