Class 9: Preparing Data For Analysis¶

Say hello in the zoom chat
join Prismia

Checking types is an important part of cleaning data, we need to figure out what is wrong with data before we can fix it.

Remember that data cleaning is a lot of exploration and iteration in the data. Let’s review a few data types we’ve seen.

Warmup: type review¶

Lists are enclosed by square brackets([]), they’re ordered and iterable.

type([char for char in 'abcde'])

list

Dictionaries are comprised of key:value pairs and enclosed in curly brackets ({}) they’re iterable and indexable by the keys.

type({char:i for i, char in enumerate('abcde')})

dict

This is a tuple, this data type is used to tie together items, but not usually to iterate over. It’s used for pairs of things often or groups to iterate over multiple things at once. For example with the zip function.

type(('a','b','c','d','e'))

tuple

Splitting a string makes a list.

 type('a b c d e'.split(' '))

list

Loading Data¶

Loading the data and packages as normal.

# %load http://drsmb.co/310
import pandas as pd
import seaborn as sns

safi_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
safi_df = pd.read_csv(safi_url)

Exploring Types¶

On Friday we figured out that the following function would clean one cell of the safi_df['items_owned'] column.

def separate_items(row):
    '''
    '''

    return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

But when we applied it, we got an error.

safi_df.apply(separate_items,axis=1)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-7-1663c693d570> in separate_items(row)
      3     '''
      4 
----> 5     return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

AttributeError: 'float' object has no attribute 'replace'

Let’s figure out why. First we’ll look at the data again to remember what we expected.

safi_df['items_owned']

    ['bicycle' ;  'television' ;  'solar_panel' ; ...
    ['cow_cart' ;  'bicycle' ;  'radio' ;  'cow_pl...
                                      ['solar_torch']
    ['bicycle' ;  'radio' ;  'cow_plough' ;  'sola...
    ['motorcyle' ;  'radio' ;  'cow_plough' ;  'mo...
                             ...                        
          ['motorcyle' ;  'radio' ;  'solar_panel']
  ['car' ;  'lorry' ;  'television' ;  'radio' ;...
  ['radio' ;  'solar_panel' ;  'solar_torch' ;  ...
  ['cow_cart' ;  'lorry' ;  'motorcyle' ;  'comp...
  ['radio' ;  'cow_plough' ;  'solar_panel' ;  '...
Name: items_owned, Length: 131, dtype: object

The error says that some item is a float, let’s look at the value of only the ones that are float

[item_list for item_list in  safi_df['items_owned'] if type(item_list)==float]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Now we can modify the cleaning function

def separate_items(row):
    '''
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []

and apply it again, to see if it works now.

safi_df.apply(separate_items,axis=1)

            [bicycle, television, solar_panel, table]
    [cow_cart, bicycle, radio, cow_plough, solar_p...
                                        [solar_torch]
    [bicycle, radio, cow_plough, solar_panel, mobi...
         [motorcyle, radio, cow_plough, mobile_phone]
                             ...                        
                    [motorcyle, radio, solar_panel]
  [car, lorry, television, radio, sterio, cow_pl...
    [radio, solar_panel, solar_torch, mobile_phone]
  [cow_cart, lorry, motorcyle, computer, televis...
  [radio, cow_plough, solar_panel, solar_torch, ...
Length: 131, dtype: object

Dummies From Lists¶

safi_df['items_owned'] = safi_df.apply(separate_items,axis=1)

pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack())

		bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
130	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
	4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
	5	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0

621 rows × 17 columns

safi_df['items_owned'].shape

(131,)

pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).index

MultiIndex([(  0,  0),
            (  0,  1),
            (  0,  2),
            (  0,  3),
            (  1,  0),
            (  1,  1),
            (  1,  2),
            (  1,  3),
            (  1,  4),
            (  1,  5),
            ...
            (129,  8),
            (129,  9),
            (129, 10),
            (129, 11),
            (130,  0),
            (130,  1),
            (130,  2),
            (130,  3),
            (130,  4),
            (130,  5)],
           length=621)

pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)

	bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	1
1	1	0	0	1	1	0	0	0	1	0	1	0	1	1	0	1	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3	1	0	0	0	1	0	0	0	1	0	1	0	1	0	0	0	0
4	0	0	0	0	1	0	0	0	1	1	1	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
126	0	0	0	0	0	0	0	0	0	1	1	0	1	0	0	0	0
127	0	1	0	0	1	1	1	1	1	0	1	1	0	1	1	1	1
128	0	0	0	0	0	0	0	0	1	0	1	0	1	1	0	0	0
129	0	0	1	1	1	1	0	1	1	1	1	0	1	1	1	0	1
130	0	0	0	0	1	0	0	0	1	0	1	0	1	1	0	1	0

121 rows × 17 columns

safi_df['items_owned'].head()

          [bicycle, television, solar_panel, table]
  [cow_cart, bicycle, radio, cow_plough, solar_p...
                                      [solar_torch]
  [bicycle, radio, cow_plough, solar_panel, mobi...
       [motorcyle, radio, cow_plough, mobile_phone]
Name: items_owned, dtype: object

safi_item_df = pd.get_dummies(safi_df['items_owned'].apply(pd.Series).stack()).sum(level=0)

Pandas concat¶

safi_df = pd.concat([safi_df, safi_item_df],axis=1)

safi_df.shape

(131, 82)

safi_df.head()

	key_id	interview_date	quest_no	start	end	province	district	ward	village	years_farm	...	mobile_phone	motorcyle	radio	solar_panel	solar_torch	table	television
0	1	17 November 2016	1	2017-03-23T09:49:57.000Z	2017-04-02T17:29:08.000Z	Manica	Manica	Bandula	God	11	...	0.0	0.0	0.0	1.0	0.0	1.0	1.0
1	2	17 November 2016	1	2017-04-02T09:48:16.000Z	2017-04-02T17:26:19.000Z	Manica	Manica	Bandula	God	2	...	1.0	0.0	1.0	1.0	1.0	1.0	0.0
2	3	17 November 2016	3	2017-04-02T14:35:26.000Z	2017-04-02T17:26:53.000Z	Manica	Manica	Bandula	God	40	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0
3	4	17 November 2016	4	2017-04-02T14:55:18.000Z	2017-04-02T17:27:16.000Z	Manica	Manica	Bandula	God	6	...	1.0	0.0	1.0	1.0	0.0	0.0	0.0
4	5	17 November 2016	5	2017-04-02T15:10:35.000Z	2017-04-02T17:27:35.000Z	Manica	Manica	Bandula	God	18	...	1.0	1.0	1.0	0.0	0.0	0.0	0.0

5 rows × 82 columns

Working with Dummy Variables¶

safi_df['bicycle']

    1.0
    1.0
    0.0
    1.0
    0.0
      ... 
  0.0
  0.0
  0.0
  0.0
  0.0
Name: bicycle, Length: 131, dtype: float64

sum(safi_df['bicycle'])

nan

safi_df['bicycle'].sum()

60.0

safi_df['bicycle'].count()

safi_df['bicycle'].shape

(131,)

pd.concat([safi_df, safi_item_df],axis=0).shape

(252, 82)

Questions after class¶

What does the `get_dummies` function do?¶

To illustrate, let’s first make a small dataframe to look at it

ex_data = [['a',2,'x'],['b',5,'o'],['a',4,'o'],['c',2,'x'],['b',3,'x']]
ex_df = pd.DataFrame(data =ex_data, columns=['char','num','symbol'])
ex_df

	char	num	symbol
0	a	2	x
1	b	5	o
2	a	4	o
3	c	2	x
4	b	3	x

When we apply get_dummies to one column (a pd.Series) it makes a column for each value of that column. First we look at that column to focus on what it looks like. Note that there are 3 different values.

ex_df['char']

  a
  b
  a
  c
  b
Name: char, dtype: object

Now conver it to dummy variables, now we have 3 columns. For each row there is a 1 in the column corresponding to the value and zeros elsewhere.

pd.get_dummies(ex_df['char'])

	a	b	c
0	1	0	0
1	0	1	0
2	1	0	0
3	0	0	1
4	0	1	0

We can confirm by summing across the rows, it’s all ones.

pd.get_dummies(ex_df['char']).sum(axis=1)

  1
  1
  1
  1
  1
dtype: int64

If we sum down the columns, we get the count of each value

pd.get_dummies(ex_df['char']).sum(axis=0)

a    2
b    2
c    1
dtype: int64

it’s the same thing as value_counts()

ex_df['char'].value_counts()

a    2
b    2
c    1
Name: char, dtype: int64

We can also apply it one column, but append it to the whole dataset this way.

pd.get_dummies(ex_df,columns=['char'])

	num	symbol	char_a	char_b	char_c
0	2	x	1	0	0
1	5	o	0	1	0
2	4	o	1	0	0
3	2	x	0	0	1
4	3	x	0	1	0

This way, it prepends the column values with the original column name. This way concatenates on it’s own and we don’t have to do it separately.

When we apply get_dummies on a whole DataFrame, without indicating a column it converts all of the columns of pandas type object

ex_df.dtypes

char      object
num        int64
symbol    object
dtype: object

pd.get_dummies(ex_df,)

	num	char_a	char_b	char_c	symbol_o	symbol_x
0	2	1	0	0	0	1
1	5	0	1	0	1	0
2	4	1	0	0	1	0
3	2	0	0	1	0	1
4	3	0	1	0	0	1

		bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
130	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
	4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
	5	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0

	bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	1
1	1	0	0	1	1	0	0	0	1	0	1	0	1	1	0	1	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3	1	0	0	0	1	0	0	0	1	0	1	0	1	0	0	0	0
4	0	0	0	0	1	0	0	0	1	1	1	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
126	0	0	0	0	0	0	0	0	0	1	1	0	1	0	0	0	0
127	0	1	0	0	1	1	1	1	1	0	1	1	0	1	1	1	1
128	0	0	0	0	0	0	0	0	1	0	1	0	1	1	0	0	0
129	0	0	1	1	1	1	0	1	1	1	1	0	1	1	1	0	1
130	0	0	0	0	1	0	0	0	1	0	1	0	1	1	0	1	0

		bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
130	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
	2	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
	3	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
	4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0
	5	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0

	bicycle	car	computer	cow_cart	cow_plough	electricity	fridge	lorry	mobile_phone	motorcyle	radio	sofa_set	solar_panel	solar_torch	sterio	table	television
0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	1
1	1	0	0	1	1	0	0	0	1	0	1	0	1	1	0	1	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3	1	0	0	0	1	0	0	0	1	0	1	0	1	0	0	0	0
4	0	0	0	0	1	0	0	0	1	1	1	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
126	0	0	0	0	0	0	0	0	0	1	1	0	1	0	0	0	0
127	0	1	0	0	1	1	1	1	1	0	1	1	0	1	1	1	1
128	0	0	0	0	0	0	0	0	1	0	1	0	1	1	0	0	0
129	0	0	1	1	1	1	0	1	1	1	1	0	1	1	1	0	1
130	0	0	0	0	1	0	0	0	1	0	1	0	1	1	0	1	0

Programming for Data Science at URI Fall 2020