Class 8: Visualization and Starting to Clean Data¶

Announcements¶

closing notebooks
- recall that they don’t stop when you close the tabs
- you need to stop it at the terminal it launched from
- with: ctr + c as it says when you first launch a notebook
restart and rerun notebooks
- notebooks are a continuous REPL as long as you have it open
- if you change code, you could have, for example a variable that is no longer defined in the notebook as written, but that still exists in memory, so code that depends on it will still run, for now, but not after you restart next (eg if we run it while grading)
- to check, restart and rerun your notebook
say hello on zoom for attendance

Setup¶

First we import packages and load data as normal.

import pandas as pd
import seaborn as sns

data_url = 'https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_full_shortname.csv'
sns.set(font_scale=1.25)

I’ve added one new command here, Seaborn’s []set function](https://seaborn.pydata.org/generated/seaborn.set.html). It sets a bunch of theme aspects for plotting, here I used it to increase the font size.

Warmup Activity¶

How recreate this plot?

image to copy

First, load the data

safi_df = pd.read_csv(data_url)

This is a displot.

sns.displot(data=safi_df, x='members_count',row='parents_liv',
            col = 'sp_parents_liv',hue='village',kind='kde')

<seaborn.axisgrid.FacetGrid at 0x7f710ba0c690>

This allows comparisons of how the village, respondent’s parents living (parents_liv), and the spouse’s (sp_parents_liv) influence the number of members of the household (members_count). The kind='kde' parameter uses the underlying kdeplot to apply kernel density estimation

We can understand better what this does, by comparing what happens when we take it out.

sns.displot(data=safi_df, x='members_count',row='parents_liv',
            col = 'sp_parents_liv',hue='village')

<seaborn.axisgrid.FacetGrid at 0x7f70cab6fdd0>

What about the rest of the columns?¶

We’ve used this SAFI dataset a lot, but we’ve only used a few of the many columns. Let’s look at all of them.

safi_df.columns

Index(['key_id', 'interview_date', 'quest_no', 'start', 'end', 'province',
       'district', 'ward', 'village', 'years_farm', 'agr_assoc', 'note2',
       'no_membrs', 'members_count', 'remittance_money', 'years_liv',
       'parents_liv', 'sp_parents_liv', 'grand_liv', 'sp_grand_liv',
       'respondent_roof_type', 'respondent_wall_type',
       'respondent_wall_type_other', 'respondent_floor_type', 'window_type',
       'buildings_in_compound', 'rooms', 'other_buildings', 'no_plots',
       'plots_count', 'water_use', 'no_group_count', 'yes_group_count',
       'no_enough_water', 'months_no_water', 'period_use', 'exper_other',
       'other_meth', 'res_change', 'memb_assoc', 'resp_assoc', 'fees_water',
       'affect_conflicts', 'note', 'need_money', 'money_source',
       'money_source_other', 'crops_contr', 'emply_lab', 'du_labour',
       'liv_owned', 'liv_owned_other', 'liv_count', 'poultry',
       'du_look_aftr_cows', 'items_owned', 'items_owned_other', 'no_meals',
       'months_lack_food', 'no_food_mitigation', 'gps_Latitude',
       'gps_Longitude', 'gps_Altitude', 'gps_Accuracy', 'instanceID'],
      dtype='object')

We can look at the first few row to recall what sort of data are available in each column.

safi_df.head()

	key_id	interview_date	quest_no	start	end	province	district	ward	village	years_farm	...	items_owned	items_owned_other	no_meals	months_lack_food	no_food_mitigation	gps_Latitude	gps_Longitude	gps_Altitude	gps_Accuracy	instanceID
0	1	17 November 2016	1	2017-03-23T09:49:57.000Z	2017-04-02T17:29:08.000Z	Manica	Manica	Bandula	God	11	...	['bicycle' ; 'television' ; 'solar_panel' ; ...	NaN	2	['Jan']	['na' ; 'rely_less_food' ; 'reduce_meals' ; ...	-19.112259	33.483456	698	14.0	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	2	17 November 2016	1	2017-04-02T09:48:16.000Z	2017-04-02T17:26:19.000Z	Manica	Manica	Bandula	God	2	...	['cow_cart' ; 'bicycle' ; 'radio' ; 'cow_pl...	NaN	2	['Jan' ; 'Sept' ; 'Oct' ; 'Nov' ; 'Dec']	['na' ; 'reduce_meals' ; 'restrict_adults' ;...	-19.112477	33.483416	690	19.0	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	17 November 2016	3	2017-04-02T14:35:26.000Z	2017-04-02T17:26:53.000Z	Manica	Manica	Bandula	God	40	...	['solar_torch']	NaN	2	['Jan' ; 'Feb' ; 'Mar' ; 'Oct' ; 'Nov' ; ...	['na' ; 'restrict_adults' ; 'lab_ex_food']	-19.112108	33.483450	674	13.0	uuid:193d7daf-9582-409b-bf09-027dd36f9007
3	4	17 November 2016	4	2017-04-02T14:55:18.000Z	2017-04-02T17:27:16.000Z	Manica	Manica	Bandula	God	6	...	['bicycle' ; 'radio' ; 'cow_plough' ; 'sola...	NaN	2	['Sept' ; 'Oct' ; 'Nov' ; 'Dec']	['na' ; 'reduce_meals' ; 'restrict_adults' ;...	-19.112229	33.483424	679	5.0	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	17 November 2016	5	2017-04-02T15:10:35.000Z	2017-04-02T17:27:35.000Z	Manica	Manica	Bandula	God	18	...	['motorcyle' ; 'radio' ; 'cow_plough' ; 'mo...	NaN	2	['Aug' ; 'Sept' ; 'Oct' ; 'Nov']	['na' ; 'go_forest' ; 'migrate']	-19.112217	33.483425	689	10.0	uuid:2c867811-9696-4966-9866-f35c3e97d02d

5 rows × 65 columns

It might be interesting to use the items_owned column to compare find out if there are differences between farms based on what items they have.

Maybe we’d want to see how many farms own each item, maybe we want to know how many farms own bicycle. Let’s try value_counts out like we used before.

safi_df['items_owned'].value_counts()

['radio']                                                                                                                           4
['mobile_phone']                                                                                                                    4
['bicycle' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'table' ;  'mobile_phone']                                                3
['bicycle' ;  'radio' ;  'mobile_phone']                                                                                            3
['bicycle' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'solar_torch' ;  'mobile_phone']                                          3
                                                                                                                                   ..
['radio' ;  'cow_plough' ;  'table' ;  'mobile_phone']                                                                              1
['motorcyle' ;  'television' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'mobile_phone']                                         1
['radio' ;  'table' ;  'sofa_set' ;  'mobile_phone']                                                                                1
['cow_cart' ;  'motorcyle' ;  'bicycle' ;  'television' ;  'radio' ;  'cow_plough' ;  'solar_panel' ;  'solar_torch' ;  'table']    1
['motorcyle' ;  'bicycle' ;  'radio' ;  'table' ;  'mobile_phone']                                                                  1
Name: items_owned, Length: 95, dtype: int64

The problem is this gives the count of each list of items instead of each item. What we’d want instead is one column for each item, where the values in the column are 1 when the farm owns that item and 0 when they don’t. This representation is sometimes called 1 hot encoding and other times (including in pandas) it’s called dummy variables.

Using Dummy Variables¶

Let’s try this function out, first on the village column.

pd.get_dummies(data= safi_df,columns=['village'])

	key_id	interview_date	quest_no	start	end	province	district	ward	years_farm	agr_assoc	...	months_lack_food	no_food_mitigation	gps_Latitude	gps_Longitude	gps_Altitude	gps_Accuracy	instanceID	village_Chirodzo	village_God	village_Ruaca
0	1	17 November 2016	1	2017-03-23T09:49:57.000Z	2017-04-02T17:29:08.000Z	Manica	Manica	Bandula	11	no	...	['Jan']	['na' ; 'rely_less_food' ; 'reduce_meals' ; ...	-19.112259	33.483456	698	14.0	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef	0	1	0
1	2	17 November 2016	1	2017-04-02T09:48:16.000Z	2017-04-02T17:26:19.000Z	Manica	Manica	Bandula	2	yes	...	['Jan' ; 'Sept' ; 'Oct' ; 'Nov' ; 'Dec']	['na' ; 'reduce_meals' ; 'restrict_adults' ;...	-19.112477	33.483416	690	19.0	uuid:099de9c9-3e5e-427b-8452-26250e840d6e	0	1	0
2	3	17 November 2016	3	2017-04-02T14:35:26.000Z	2017-04-02T17:26:53.000Z	Manica	Manica	Bandula	40	no	...	['Jan' ; 'Feb' ; 'Mar' ; 'Oct' ; 'Nov' ; ...	['na' ; 'restrict_adults' ; 'lab_ex_food']	-19.112108	33.483450	674	13.0	uuid:193d7daf-9582-409b-bf09-027dd36f9007	0	1	0
3	4	17 November 2016	4	2017-04-02T14:55:18.000Z	2017-04-02T17:27:16.000Z	Manica	Manica	Bandula	6	no	...	['Sept' ; 'Oct' ; 'Nov' ; 'Dec']	['na' ; 'reduce_meals' ; 'restrict_adults' ;...	-19.112229	33.483424	679	5.0	uuid:148d1105-778a-4755-aa71-281eadd4a973	0	1	0
4	5	17 November 2016	5	2017-04-02T15:10:35.000Z	2017-04-02T17:27:35.000Z	Manica	Manica	Bandula	18	no	...	['Aug' ; 'Sept' ; 'Oct' ; 'Nov']	['na' ; 'go_forest' ; 'migrate']	-19.112217	33.483425	689	10.0	uuid:2c867811-9696-4966-9866-f35c3e97d02d	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
126	127	18 May 2017	126	2017-05-18T04:13:37.000Z	2017-05-18T04:35:47.000Z	Manica	Manica	Bandula	5	yes	...	['Oct' ; 'Nov' ; 'Dec']	['rely_less_food' ; 'lab_ex_food']	-19.112194	33.483379	700	7.0	uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965	0	0	1
127	128	04 June 2017	193	2017-06-04T09:36:20.000Z	2017-06-04T10:13:32.000Z	Manica	Manica	Bandula	10	no	...	['none']	['na']	-19.112157	33.483390	720	9.0	uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df	0	0	1
128	129	04 June 2017	194	2017-06-04T10:13:36.000Z	2017-06-04T10:32:06.000Z	Manica	Manica	Bandula	5	no	...	['Sept' ; 'Oct' ; 'Nov']	['lab_ex_food']	-19.112271	33.483471	719	10.0	uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf	0	0	1
129	130	04 June 2017	199	2017-06-04T10:33:55.000Z	2017-06-04T10:52:22.000Z	Manica	Manica	Bandula	17	yes	...	['Nov' ; 'Dec']	['rely_less_food' ; 'limit_variety' ; 'reduc...	-19.112278	33.483390	711	5.0	uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b	1	0	0
130	131	04 June 2017	200	2017-06-04T10:52:46.000Z	2017-06-04T11:08:13.000Z	Manica	Manica	Bandula	20	yes	...	['Oct' ; 'Nov']	['rely_less_food' ; 'restrict_adults' ; 'bor...	-19.112183	33.483372	681	20.0	uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7	1	0	0

131 rows × 67 columns

Instead of scrolling, we can isolate the new columns we just created.

safi_df_villages = pd.get_dummies(data= safi_df,columns=['village'])
keep_cols = [col for col in safi_df_villages.columns if 'village_' in col]
safi_df_villages[keep_cols]

	village_Chirodzo	village_God	village_Ruaca
0	0	1	0
1	0	1	0
2	0	1	0
3	0	1	0
4	0	1	0
...	...	...	...
126	0	0	1
127	0	0	1
128	0	0	1
129	1	0	0
130	1	0	0

131 rows × 3 columns

Note that now we have 3 columns, one for each village. This column was already usable, but it was well formatted an useful to illustrate what get_dummies does.

Note

try this out and compare the two, test out get_dummies on another column to be sure you know what it does.

Now, we can try it on items_owned. _this time we’ll filter the columns for display and inspection right away

safi_df_items = pd.get_dummies(data= safi_df,columns=['items_owned'])
keep_cols = [col for col in safi_df_items.columns if 'village_' in col]
safi_df_items[keep_cols]


0
1
2
3
4
...
126
127
128
129
130

131 rows × 0 columns

This still isn’t quite what we want, because each value in the items_owned column looks like a list but it’s actually a string, we can check that with the type function.

type(safi_df['items_owned'][0])

str

Cleaning One Cell of Data¶

So, let’s clean the value from one cell of that column and see what else is going on. First, we save it to a variable.

farm1_items = safi_df['items_owned'][0]

Let’s look at it first.

farm1_items

"['bicycle' ;  'television' ;  'solar_panel' ;  'table']"

We could try first casting it to a list.

list(farm1_items)

['[',
 "'",
 'b',
 'i',
 'c',
 'y',
 'c',
 'l',
 'e',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 't',
 'e',
 'l',
 'e',
 'v',
 'i',
 's',
 'i',
 'o',
 'n',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 's',
 'o',
 'l',
 'a',
 'r',
 '_',
 'p',
 'a',
 'n',
 'e',
 'l',
 "'",
 ' ',
 ';',
 ' ',
 ' ',
 "'",
 't',
 'a',
 'b',
 'l',
 'e',
 "'",
 ']']

That doesn’t quite do it though, let’s try separating it with the split method at the ‘;’.

farm1_items.split(';')

["['bicycle' ", "  'television' ", "  'solar_panel' ", "  'table']"]

This still has some extra characters, in it though. We should remove those, probably before we do the split so that we don’t end up with empty items in the list.

farm1_items.replace('[','').replace(']','').replace("'","")

'bicycle ;  television ;  solar_panel ;  table'

That is now ready to split.

farm1_items.replace('[','').replace(']','').replace("'","").split(';')

['bicycle ', '  television ', '  solar_panel ', '  table']

There are some empty spaces lest. Python has a string function strip that removes leading and trailing (on the ends) whitespace (spaces, tabs, new lines), we have to apply it to each individual item of that list above. We can use a list comprehension for this.

[i.strip() for i in  farm1_items.replace('[','').replace(']','').replace("'","").split(';')]

['bicycle', 'television', 'solar_panel', 'table']

Applying this to the rest of the data¶

We can put that in a function so that we can reuse it.

def separate_items(row):
    '''
    clean the items owned column of one row of the safi_df
    '''

    return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]

Pandas provides us special function to apply a function to a dataframe along a given axis.

 safi_df.apply(separate_items,axis=1)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-26c2d810f773> in <module>
----> 1 safi_df.apply(separate_items,axis=1)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7766             kwds=kwds,
   7767         )
-> 7768         return op.get_result()
   7769 
   7770     def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
    183             return self.apply_raw()
    184 
--> 185         return self.apply_standard()
    186 
    187     def apply_empty_result(self):

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
    274 
    275     def apply_standard(self):
--> 276         results, res_index = self.apply_series_generator()
    277 
    278         # wrap results

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
    288             for i, v in enumerate(series_gen):
    289                 # ignore SettingWithCopy here in case the user mutates
--> 290                 results[i] = self.f(v)
    291                 if isinstance(results[i], ABCSeries):
    292                     # If we have a view on v, we need to make a copy because

<ipython-input-19-f7c175567037> in separate_items(row)
      4     '''
      5 
----> 6     return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
      7 

AttributeError: 'float' object has no attribute 'replace'

This error means that there are some elements of that column that is a float instead of a string. We should check what that might be. We can check with a list comprehension and look at the ones that are float

[item_list for item_list in safi_df['items_owned'] if type(item_list)==float]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Since the floats are nan, we know that our function more or less is the right thing to do, but we need to modify our function to accommodate those values. We can replace them with an empty list.

def separate_items(row):
    '''
    '''
    if type(row['items_owned'])==str:
        return [i.strip() for i in  row['items_owned'].replace('[','').replace(']','').replace("'","").split(';')]
    else:
        return []

We can try this again:

safi_df['items_owned_clean'] = safi_df.apply(separate_items,axis=1)

it runs successfully and we can look at the output.

safi_df[['items_owned_clean']].head()

	items_owned_clean
0	[bicycle, television, solar_panel, table]
1	[cow_cart, bicycle, radio, cow_plough, solar_p...
2	[solar_torch]
3	[bicycle, radio, cow_plough, solar_panel, mobi...
4	[motorcyle, radio, cow_plough, mobile_phone]

This looks all good and we’ll pick up from here on Monday.

	village_Chirodzo	village_God	village_Ruaca
0	0	1	0
1	0	1	0
2	0	1	0
3	0	1	0
4	0	1	0
...	...	...	...
126	0	0	1
127	0	0	1
128	0	0	1
129	1	0	0
130	1	0	0

	village_Chirodzo	village_God	village_Ruaca
0	0	1	0
1	0	1	0
2	0	1	0
3	0	1	0
4	0	1	0
...	...	...	...
126	0	0	1
127	0	0	1
128	0	0	1
129	1	0	0
130	1	0	0

Programming for Data Science at URI Fall 2020