Class 8: Visualization and Starting to Clean Data

Announcements

  1. closing notebooks

    • recall that they don’t stop when you close the tabs

    • you need to stop it at the terminal it launched from

    • with: ctr + c as it says when you first launch a notebook

  2. restart and rerun notebooks

    • notebooks are a continuous REPL as long as you have it open

    • if you change code, you could have, for example a variable that is no longer defined in the notebook as written, but that still exists in memory, so code that depends on it will still run, for now, but not after you restart next (eg if we run it while grading)

    • to check, restart and rerun your notebook

  3. say hello on zoom for attendance

Setup

First we import packages and load data as normal.

import pandas as pd
import seaborn as sns

data_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
sns.set(font_scale=1.25)

I’ve added one new command here, Seaborn’s []set function](https://seaborn.pydata.org/generated/seaborn.set.html). It sets a bunch of theme aspects for plotting, here I used it to increase the font size.

Warmup Activity

How recreate this plot?

image to copy

First, load the data

safi_df = pd.read_csv(data_url)

This is a displot.

sns.displot(data=safi_df, x='members_count',row='parents_liv',
            col = 'sp_parents_liv',hue='village',kind='kde')
<seaborn.axisgrid.FacetGrid at 0x7f710ba0c690>
../_images/2020-09-25_7_1.png

This allows comparisons of how the village, respondent’s parents living (parents_liv), and the spouse’s (sp_parents_liv) influence the number of members of the household (members_count). The kind='kde' parameter uses the underlying kdeplot to apply kernel density estimation

We can understand better what this does, by comparing what happens when we take it out.

sns.displot(data=safi_df, x='members_count',row='parents_liv',
            col = 'sp_parents_liv',hue='village')
<seaborn.axisgrid.FacetGrid at 0x7f70cab6fdd0>
../_images/2020-09-25_9_1.png

What about the rest of the columns?

We’ve used this SAFI dataset a lot, but we’ve only used a few of the many columns. Let’s look at all of them.

safi_df.columns
Index(['key_id', 'interview_date', 'quest_no', 'start', 'end', 'province',
       'district', 'ward', 'village', 'years_farm', 'agr_assoc', 'note2',
       'no_membrs', 'members_count', 'remittance_money', 'years_liv',
       'parents_liv', 'sp_parents_liv', 'grand_liv', 'sp_grand_liv',
       'respondent_roof_type', 'respondent_wall_type',
       'respondent_wall_type_other', 'respondent_floor_type', 'window_type',
       'buildings_in_compound', 'rooms', 'other_buildings', 'no_plots',
       'plots_count', 'water_use', 'no_group_count', 'yes_group_count',
       'no_enough_water', 'months_no_water', 'period_use', 'exper_other',
       'other_meth', 'res_change', 'memb_assoc', 'resp_assoc', 'fees_water',
       'affect_conflicts', 'note', 'need_money', 'money_source',
       'money_source_other', 'crops_contr', 'emply_lab', 'du_labour',
       'liv_owned', 'liv_owned_other', 'liv_count', 'poultry',
       'du_look_aftr_cows', 'items_owned', 'items_owned_other', 'no_meals',
       'months_lack_food', 'no_food_mitigation', 'gps_Latitude',
       'gps_Longitude', 'gps_Altitude', 'gps_Accuracy', 'instanceID'],
      dtype='object')

We can look at the first few row to recall what sort of data are available in each column.

safi_df.head()
key_id interview_date quest_no start end province district ward village years_farm ... items_owned items_owned_other no_meals months_lack_food no_food_mitigation gps_Latitude gps_Longitude gps_Altitude gps_Accuracy instanceID
0 1 17 November 2016 1 2017-03-23T09:49:57.000Z 2017-04-02T17:29:08.000Z Manica Manica Bandula God 11 ... ['bicycle' ; 'television' ; 'solar_panel' ; ... NaN 2 ['Jan'] ['na' ; 'rely_less_food' ; 'reduce_meals' ; ... -19.112259 33.483456 698 14.0 uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1 2 17 November 2016 1 2017-04-02T09:48:16.000Z 2017-04-02T17:26:19.000Z Manica Manica Bandula God 2 ... ['cow_cart' ; 'bicycle' ; 'radio' ; 'cow_pl... NaN 2 ['Jan' ; 'Sept' ; 'Oct' ; 'Nov' ; 'Dec'] ['na' ; 'reduce_meals' ; 'restrict_adults' ;... -19.112477 33.483416 690 19.0 uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2 3 17 November 2016 3 2017-04-02T14:35:26.000Z 2017-04-02T17:26:53.000Z Manica Manica Bandula God 40 ... ['solar_torch'] NaN 2 ['Jan' ; 'Feb' ; 'Mar' ; 'Oct' ; 'Nov' ; ... ['na' ; 'restrict_adults' ; 'lab_ex_food'] -19.112108 33.483450 674 13.0 uuid:193d7daf-9582-409b-bf09-027dd36f9007
3 4 17 November 2016 4 2017-04-02T14:55:18.000Z 2017-04-02T17:27:16.000Z Manica Manica Bandula God 6 ... ['bicycle' ; 'radio' ; 'cow_plough' ; 'sola... NaN 2 ['Sept' ; 'Oct' ; 'Nov' ; 'Dec'] ['na' ; 'reduce_meals' ; 'restrict_adults' ;... -19.112229 33.483424 679 5.0 uuid:148d1105-778a-4755-aa71-281eadd4a973
4 5 17 November 2016 5 2017-04-02T15:10:35.000Z 2017-04-02T17:27:35.000Z Manica Manica Bandula God 18 ... ['motorcyle' ; 'radio' ; 'cow_plough' ; 'mo... NaN 2 ['Aug' ; 'Sept' ; 'Oct' ; 'Nov'] ['na' ; 'go_forest' ; 'migrate'] -19.112217 33.483425 689 10.0 uuid:2c867811-9696-4966-9866-f35c3e97d02d

5 rows × 65 columns

It might be interesting to use the items_owned column to compare find out if there are differences between farms based on what items they have.

Maybe we’d want to see how many farms own each item, maybe we want to know how many farms own bicycle. Let’s try value_counts out like we used before.

safi_df['items_owned'].value_counts()
['radio']                                                                                                                           4
['mobile_phone']                                                                                                                    4
['bicycle' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'table' ;  'mobile_phone']                                                3
['bicycle' ;  'radio' ;  'mobile_phone']                                                                                            3
['bicycle' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'solar_torch' ;  'mobile_phone']                                          3
                                                                                                                                   ..
['radio' ;  'cow_plough' ;  'table' ;  'mobile_phone']                                                                              1
['motorcyle' ;  'television' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'mobile_phone']                                         1
['radio' ;  'table' ;  'sofa_set' ;  'mobile_phone']                                                                                1
['cow_cart' ;  'motorcyle' ;  'bicycle' ;  'television' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'solar_torch' ;  'table']    1
['motorcyle' ;  'bicycle' ;  'radio' ;  'table' ;  'mobile_phone']                                                                  1
Name: items_owned, Length: 95, dtype: int64

The problem is this gives the count of each list of items instead of each item. What we’d want instead is one column for each item, where the values in the column are 1 when the farm owns that item and 0 when they don’t. This representation is sometimes called 1 hot encoding and other times (including in pandas) it’s called dummy variables.

Using Dummy Variables

Let’s try this function out, first on the village column.

pd.get_dummies(data= safi_df,columns=['village'])
key_id interview_date quest_no start end province district ward years_farm agr_assoc ... months_lack_food no_food_mitigation gps_Latitude gps_Longitude gps_Altitude gps_Accuracy instanceID village_Chirodzo village_God village_Ruaca
0 1 17 November 2016 1 2017-03-23T09:49:57.000Z 2017-04-02T17:29:08.000Z Manica Manica Bandula 11 no ... ['Jan'] ['na' ; 'rely_less_food' ; 'reduce_meals' ; ... -19.112259 33.483456 698 14.0 uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef 0 1 0
1 2 17 November 2016 1 2017-04-02T09:48:16.000Z 2017-04-02T17:26:19.000Z Manica Manica Bandula 2 yes ... ['Jan' ; 'Sept' ; 'Oct' ; 'Nov' ; 'Dec'] ['na' ; 'reduce_meals' ; 'restrict_adults' ;... -19.112477 33.483416 690 19.0 uuid:099de9c9-3e5e-427b-8452-26250e840d6e 0 1 0
2 3 17 November 2016 3 2017-04-02T14:35:26.000Z 2017-04-02T17:26:53.000Z Manica Manica Bandula 40 no ... ['Jan' ; 'Feb' ; 'Mar' ; 'Oct' ; 'Nov' ; ... ['na' ; 'restrict_adults' ; 'lab_ex_food'] -19.112108 33.483450 674 13.0 uuid:193d7daf-9582-409b-bf09-027dd36f9007 0 1 0
3 4 17 November 2016 4 2017-04-02T14:55:18.000Z 2017-04-02T17:27:16.000Z Manica Manica Bandula 6 no ... ['Sept' ; 'Oct' ; 'Nov' ; 'Dec'] ['na' ; 'reduce_meals' ; 'restrict_adults' ;... -19.112229 33.483424 679 5.0 uuid:148d1105-778a-4755-aa71-281eadd4a973 0 1 0
4 5 17 November 2016 5 2017-04-02T15:10:35.000Z 2017-04-02T17:27:35.000Z Manica Manica Bandula 18 no ... ['Aug' ; 'Sept' ; 'Oct' ; 'Nov'] ['na' ; 'go_forest' ; 'migrate'] -19.112217 33.483425 689 10.0 uuid:2c867811-9696-4966-9866-f35c3e97d02d 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
126 127 18 May 2017 126 2017-05-18T04:13:37.000Z 2017-05-18T04:35:47.000Z Manica Manica Bandula 5 yes ... ['Oct' ; 'Nov' ; 'Dec'] ['rely_less_food' ; 'lab_ex_food'] -19.112194 33.483379 700 7.0 uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965 0 0 1
127 128 04 June 2017 193 2017-06-04T09:36:20.000Z 2017-06-04T10:13:32.000Z Manica Manica Bandula 10 no ... ['none'] ['na'] -19.112157 33.483390 720 9.0 uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df 0 0 1
128 129 04 June 2017 194 2017-06-04T10:13:36.000Z 2017-06-04T10:32:06.000Z Manica Manica Bandula 5 no ... ['Sept' ; 'Oct' ; 'Nov'] ['lab_ex_food'] -19.112271 33.483471 719 10.0 uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf 0 0 1
129 130 04 June 2017 199 2017-06-04T10:33:55.000Z 2017-06-04T10:52:22.000Z Manica Manica Bandula 17 yes ... ['Nov' ; 'Dec'] ['rely_less_food' ; 'limit_variety' ; 'reduc... -19.112278 33.483390 711 5.0 uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b 1 0 0
130 131 04 June 2017 200 2017-06-04T10:52:46.000Z 2017-06-04T11:08:13.000Z Manica Manica Bandula 20 yes ... ['Oct' ; 'Nov'] ['rely_less_food' ; 'restrict_adults' ; 'bor... -19.112183 33.483372 681 20.0 uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7 1 0 0

131 rows × 67 columns

Instead of scrolling, we can isolate the new columns we just created.

safi_df_villages = pd.get_dummies(data= safi_df,columns=['village'])
keep_cols = [col for col in safi_df_villages.columns if 'village_' in col]
safi_df_villages[keep_cols]
village_Chirodzo village_God village_Ruaca
0 0 1 0
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
... ... ... ...
126 0 0 1
127 0 0 1
128 0 0 1
129 1 0 0
130 1 0 0

131 rows × 3 columns

Note that now we have 3 columns, one for each village. This column was already usable, but it was well formatted an useful to illustrate what get_dummies does.

Note

try this out and compare the two, test out get_dummies on another column to be sure you know what it does.

Now, we can try it on items_owned. _this time we’ll filter the columns for display and inspection right away

safi_df_items = pd.get_dummies(data= safi_df,columns=['items_owned'])
keep_cols = [col for col in safi_df_items.columns if 'village_' in col]
safi_df_items[keep_cols]
0
1
2
3
4
...
126
127
128
129
130

131 rows × 0 columns

This still isn’t quite what we want, because each value in the items_owned column looks like a list but it’s actually a string, we can check that with the type function.

type(safi_df['items_owned'][0])
str

Cleaning One Cell of Data

So, let’s clean the value from one cell of that column and see what else is going on. First, we save it to a variable.

farm1_items = safi_df['items_owned'][0]

Let’s look at it first.

farm1_items
"['bicycle' ;  'television' ;  'solar_panel' ;  'table']"

We could try first casting it to a list.

list(farm1_items)
['[',
 "'",
 'b',
 'i',
 'c',
 'y',
 'c',
 'l',
 'e',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 't',
 'e',
 'l',
 'e',
 'v',
 'i',
 's',
 'i',
 'o',
 'n',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 's',
 'o',
 'l',
 'a',
 'r',
 '_',
 'p',
 'a',
 'n',
 'e',
 'l',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 't',
 'a',
 'b',
 'l',
 'e',
 "'",
 ']']

That doesn’t quite do it though, let’s try separating it with the split method at the ‘;’.

farm1_items.split(';')
["['bicycle' ", "  'television' ", "  'solar_panel' ", "  'table']"]

This still has some extra characters, in it though. We should remove those, probably before we do the split so that we don’t end up with empty items in the list.

farm1_items.replace('[','').replace(']','').replace("'","")
'bicycle ;  television ;  solar_panel ;  table'

That is now ready to split.

farm1_items.replace('[','').replace(']','').replace("'","").split(';')
['bicycle ', '  television ', '  solar_panel ', '  table']

There are some empty spaces lest. Python has a string function strip that removes leading and trailing (on the ends) whitespace (spaces, tabs, new lines), we have to apply it to each individual item of that list above. We can use a list comprehension for this.

[i.strip() for i in  farm1_items.replace('[','').replace(']','').replace("'","").split(';')]
['bicycle', 'television', 'solar_panel', 'table']

Applying this to the rest of the data

We can put that in a function so that we can reuse it.

def separate_items(row):
    '''
    clean the items owned column of one row of the safi_df
    '''

    return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

Pandas provides us special function to apply a function to a dataframe along a given axis.

 safi_df.apply(separate_items,axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-19-f7c175567037> in separate_items(row)
      4     '''
      5 
----> 6     return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
      7 

AttributeError: 'float' object has no attribute 'replace'

This error means that there are some elements of that column that is a float instead of a string. We should check what that might be. We can check with a list comprehension and look at the ones that are float

[item_list for item_list in safi_df['items_owned'] if type(item_list)==float]
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Since the floats are nan, we know that our function more or less is the right thing to do, but we need to modify our function to accommodate those values. We can replace them with an empty list.

def separate_items(row):
    '''
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []

We can try this again:

safi_df['items_owned_clean'] = safi_df.apply(separate_items,axis=1)

it runs successfully and we can look at the output.

safi_df[['items_owned_clean']].head()
items_owned_clean
0 [bicycle, television, solar_panel, table]
1 [cow_cart, bicycle, radio, cow_plough, solar_p...
2 [solar_torch]
3 [bicycle, radio, cow_plough, solar_panel, mobi...
4 [motorcyle, radio, cow_plough, mobile_phone]

This looks all good and we’ll pick up from here on Monday.