Class 5: Accessing Data, continued¶

Today’s agenda:

warm up/ review
announcements
working with dataframes
the power of functions as objects
(maybe) exploratory data analysis

Try it out –>

Read the tables off of the syllabus course map page with read_html and make a list of the shapes of all of the tables on the page. Save the output to a variable and paste the value of that variable as your answer to the question.

import pandas as pd
[df.shape for df in pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')]

[(14, 3), (15, 5), (15, 15), (15, 6)]

Announcements¶

annotated notes are up
beginning portfolio prompts and instructions are up
Assignment due Sunday,
office hours will remain Fridays
TA office hours posted.

More Pandas¶

We’ll go back to the SAFI dataset from Wednesday.

safi_df = pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')

We downloaded the data into memory, but we can also write it to disk.

safi_df.to_csv('safi_clean.csv')

It will go to the same folder as the notebook, but we can also use a relative path. If we make a data folder in the folder where we’ve saved the notebook, we can write the file there instead.

safi_df.to_csv('data/safi_clean.csv')

Now we can read it in using the same path

safi_df2= pd.read_csv('data/safi_clean.csv')

Note that now it has an extra column

safi_df2.head(2)

	Unnamed: 0	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e

safi_df.head(2)

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e

We can prevent this by writing it out with the index parameter set to False

safi_df.to_csv('data/safi_clean.csv',index=False)

Now when we read it in, there’s no extra column.

safi_df3 = pd.read_csv('data/safi_clean.csv')
safi_df3.head(3)

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007

Recall, we indexed a column with the name in square brackets

safi_df['village']

         God
         God
         God
         God
         God
         ...   
     Ruaca
     Ruaca
     Ruaca
  Chirodzo
  Chirodzo
Name: village, Length: 131, dtype: object

To index rows, we can use loc

safi_df.loc[3]

key_ID                                                                  4
village                                                               God
interview_date                                       2016-11-17T00:00:00Z
no_membrs                                                               7
years_liv                                                               6
respondent_wall_type                                          burntbricks
rooms                                                                   1
memb_assoc                                                            NaN
affect_conflicts                                                      NaN
liv_count                                                               2
items_owned             bicycle;radio;cow_plough;solar_panel;mobile_phone
no_meals                                                                2
months_lack_food                                         Sept;Oct;Nov;Dec
instanceID                      uuid:148d1105-778a-4755-aa71-281eadd4a973
Name: 3, dtype: object

To select a range, use :

safi_df.loc[3:5]

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
3	4	God	2016-11-17T00:00:00Z	7	6	burntbricks	1	NaN	NaN	2	bicycle;radio;cow_plough;solar_panel;mobile_phone	2	Sept;Oct;Nov;Dec	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	God	2016-11-17T00:00:00Z	7	40	burntbricks	1	NaN	NaN	4	motorcyle;radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:2c867811-9696-4966-9866-f35c3e97d02d
5	6	God	2016-11-17T00:00:00Z	3	3	muddaub	1	NaN	NaN	1	NaN	2	Aug;Sept;Oct	uuid:daa56c91-c8e3-44c3-a663-af6a49a2ca70

You only have to have a number on one side of the colon, it will go from the beginnig up to that number like this:

safi_df.loc[:4]

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007
3	4	God	2016-11-17T00:00:00Z	7	6	burntbricks	1	NaN	NaN	2	bicycle;radio;cow_plough;solar_panel;mobile_phone	2	Sept;Oct;Nov;Dec	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	God	2016-11-17T00:00:00Z	7	40	burntbricks	1	NaN	NaN	4	motorcyle;radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:2c867811-9696-4966-9866-f35c3e97d02d

With two :: we can also set an increment

safi_df.loc[::5]

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
5	6	God	2016-11-17T00:00:00Z	3	3	muddaub	1	NaN	NaN	1	NaN	2	Aug;Sept;Oct	uuid:daa56c91-c8e3-44c3-a663-af6a49a2ca70
10	11	God	2016-11-21T00:00:00Z	6	20	sunbricks	1	NaN	NaN	2	radio;cow_plough	2	Oct;Nov	uuid:d29b44e3-3348-4afc-aa4d-9eb34c89d483
15	16	God	2016-11-24T00:00:00Z	6	47	muddaub	1	NaN	NaN	4	radio;cow_plough;solar_panel;solar_torch	3	Jan;Feb	uuid:d17db52f-4b87-4768-b534-ea8f9704c565
20	21	God	2016-11-21T00:00:00Z	8	20	burntbricks	1	no	never	3	NaN	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:6570a7d0-6a0b-452c-aa2e-922500e35749
25	26	Ruaca	2016-11-21T00:00:00Z	3	20	burntbricks	2	no	never	2	radio;cow_plough;table;mobile_phone	2	none	uuid:1c54ee24-22c4-4ee9-b1ad-42d483c08e2e
30	31	Ruaca	2016-11-21T00:00:00Z	3	2	muddaub	1	NaN	NaN	1	NaN	3	none	uuid:cb06eb49-dd39-4150-8bbe-a599e074afe8
35	36	Chirodzo	2016-11-17T00:00:00Z	6	23	sunbricks	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	3	none	uuid:c90eade0-1148-4a12-8c0e-6387a36f45b1
40	41	God	2016-11-17T00:00:00Z	7	22	muddaub	1	NaN	NaN	2	motorcyle;bicycle;radio;cow_plough;table	3	Oct;Nov	uuid:b3ba34d8-eea1-453d-bc73-c141bcbbc5e5
45	46	Chirodzo	2016-11-17T00:00:00Z	10	42	burntbricks	2	no	once	2	motorcyle;computer;television;sterio;solar_pan...	2	Sept;Oct;Nov	uuid:35f297e0-aa5d-4149-9b7b-4965004cfc37
50	51	Chirodzo	2016-11-16T00:00:00Z	5	30	muddaub	1	NaN	NaN	1	radio	3	Oct;Nov	uuid:18ac8e77-bdaf-47ab-85a2-e4c947c9d3ce
55	56	Chirodzo	2016-11-16T00:00:00Z	12	23	burntbricks	2	yes	never	2	motorcyle;bicycle;mobile_phone	3	none	uuid:973c4ac6-f887-48e7-aeaf-4476f2cfab76
60	61	Chirodzo	2016-11-16T00:00:00Z	10	14	muddaub	1	yes	more_once	3	cow_cart;motorcyle;bicycle;television;radio;co...	3	Jan;Feb;Dec	uuid:2401cf50-8859-44d9-bd14-1bf9128766f2
65	66	Chirodzo	2016-11-16T00:00:00Z	10	37	burntbricks	3	yes	frequently	4	cow_cart;motorcyle;bicycle;television;radio;co...	3	none	uuid:a457eab8-971b-4417-a971-2e55b8702816
70	71	Ruaca	2016-11-18T00:00:00Z	6	14	burntbricks	1	yes	more_once	3	radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:761f9c49-ec93-4932-ba4c-cc7b78dfcef1
75	155	God	2016-11-24T00:00:00Z	4	4	burntbricks	1	NaN	NaN	1	electricity	2	Jan;Sept;Oct;Nov;Dec	uuid:77b3021b-a9d6-4276-aaeb-5bfcfd413852
80	182	God	2016-11-25T00:00:00Z	7	21	muddaub	3	no	more_once	2	solar_panel	3	Jan;Feb;Nov;Dec	uuid:394033e8-a6e2-4e39-bfac-458753a1ed78
85	197	God	2016-11-28T00:00:00Z	5	19	burntbricks	2	no	more_once	3	bicycle;television;radio;cow_plough;solar_torc...	2	Nov	uuid:85c99fd2-775f-40c9-8654-68223f59d091
90	73	Ruaca	2017-04-26T00:00:00Z	7	9	burntbricks	2	yes	more_once	3	cow_cart;motorcyle;bicycle;television;radio;co...	3	Jan;Sept;Oct	uuid:ac3da862-9e6c-4962-94b6-f4c31624f207
95	101	God	2017-04-27T00:00:00Z	3	4	muddaub	1	no	never	1	bicycle;solar_torch	3	Sept;Oct;Nov	uuid:3c174acd-e431-4523-9ad6-eb14cddca805
100	104	Ruaca	2017-04-28T00:00:00Z	14	52	sunbricks	1	yes	never	4	cow_cart;bicycle;cow_plough	3	Jan;Feb;Dec	uuid:bb2bb365-7d7d-4fe9-9353-b21269676119
105	113	Ruaca	2017-05-03T00:00:00Z	11	26	burntbricks	3	no	never	4	cow_cart;motorcyle;bicycle;radio;cow_plough;so...	3	none	uuid:01210861-aba1-4268-98d0-0260e05f5155
110	108	God	2017-05-11T00:00:00Z	15	22	burntbricks	2	no	never	4	cow_cart;bicycle;radio;cow_plough;solar_panel;...	3	Aug;Sept;Oct;Nov	uuid:e4f4d6ba-e698-45a5-947f-ba6da88cc22b
115	150	Ruaca	2017-05-18T00:00:00Z	7	8	muddaub	1	no	never	1	mobile_phone	3	Sept;Oct;Nov	uuid:92613d0d-e7b1-4d62-8ea4-451d7cd0a982
120	167	Ruaca	2017-06-03T00:00:00Z	8	24	muddaub	1	no	never	3	motorcyle;radio;cow_plough;solar_panel;solar_t...	2	Jan;Nov;Dec	uuid:a9d1a013-043b-475d-a71b-77ed80abe970
125	192	Chirodzo	2017-06-03T00:00:00Z	9	20	burntbricks	1	no	once	1	bicycle;television;radio;sterio;solar_panel;so...	3	Jan;Nov;Dec	uuid:f94409a6-e461-4e4c-a6fb-0072d3d58b00
130	200	Chirodzo	2017-06-04T00:00:00Z	8	20	burntbricks	2	NaN	NaN	3	radio;cow_plough;solar_panel;solar_torch;table...	3	Oct;Nov	uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7

These can be combined to index a subset at an increment.

We can index columns in two ways, as we did on Wednesday

safi_df['village'].head(2)

0    God
1    God
Name: village, dtype: object

Or using a .

safi_df.village.head(2)

0    God
1    God
Name: village, dtype: object

We can select multiple columns, using a list of column names. We can define the list inline.

safi_df[['village','no_membrs','years_liv']].head(2)

	village	no_membrs	years_liv
0	God	3	4
1	God	7	9

or in a separate variable

columns_of_interest = ['village','no_membrs','years_liv']
safi_df[columns_of_interest].head(2)

	village	no_membrs	years_liv
0	God	3	4
1	God	7	9

Functions are objects¶

syllabus_df_list =  pd.read_html('https://rhodyprog4ds.github.io/BrownFall20/syllabus/course_map.html')

And we can put them in a dictionary. lambda functions are special functions defined in a single line.

greetingl = lambda name: 'hello ' + name
greetingl('sarah')

'hello sarah'

is the same as

def greetingf(name):
    return 'hello ' + name
greetingf('sarah')

'hello sarah'

So, we can define a function in a dictionary like this:

view_rows = {0: lambda df: print(df.head()),
            1: lambda df: print(df.tail())}

The len function works on all iterables

for df in syllabus_df_list:
    num_row = len(df)
    view_rows[num_row%2](df)

  Unnamed: 0_level_0                         topics  \
                week             Unnamed: 1_level_1   
                1         [admin, python review]   
                2    Loading data, Python review   
                3      Exploratory Data Analysis   
                4                  Data Cleaning   
                5  Databases, Merging DataFrames   

                            skills  
                Unnamed: 2_level_1  
                        process  
   [access, prepare, summarize]  
         [summarize, visualize]  
[prepare, summarize, visualize]  
 [access, construct, summarize]  
   Unnamed: 0_level_0                                              skill  \
              keyword                                 Unnamed: 1_level_1   
         evaluate                         Evaluate model performance   
         optimize                          Optimize model parameters   
          compare                                     compare models   
     unstructured                            model unstructured data   
         workflow  use industry standard data science tools and w...   

                                              Level 1  \
                                   Unnamed: 2_level_1   
Explain basic performance metrics for differen...   
Identify when model parameters need to be opti...   
              Qualitatively compare model classes   
Identify options for representing text data an...   
Solve well strucutred problems with a single t...   

                                              Level 2  \
                                   Unnamed: 3_level_1   
Apply basic model evaluation metrics to a held...   
Manually optimize basic model parameters such ...   
Compare model classes in specific terms and fi...   
Apply at least one representation to transform...   
Solve semi-strucutred, completely specified pr...   

                                              Level 3  
                                   Unnamed: 4_level_1  
Evaluate a model with multiple metrics and cro...  
Select optimal parameters based of mutiple qua...  
Evaluate tradeoffs between different model com...  
apply multiple representations and compare and...  
Scope, choose an appropriate tool pipeline and...  
   Unnamed: 0_level_0                 A1                 A2  \
              keyword Unnamed: 1_level_1 Unnamed: 2_level_1   
         evaluate                  0                  0   
         optimize                  0                  0   
          compare                  0                  0   
     unstructured                  0                  0   
         workflow                  0                  0   

                   A3                 A4                 A5  \
   Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   

                   A6                 A7                 A8  \
   Unnamed: 6_level_1 Unnamed: 7_level_1 Unnamed: 8_level_1   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   
                0                  0                  0   

                   A9                 A10                 A11  \
   Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1   
                0                   1                   1   
                0                   1                   1   
                0                   0                   1   
                0                   0                   0   
                0                   1                   1   

                   A12                 A13       # Assignments  
   Unnamed: 12_level_1 Unnamed: 13_level_1 Unnamed: 14_level_1  
                 0                   0                   2  
                 0                   0                   2  
                 0                   1                   2  
                 1                   1                   2  
                 1                   1                   4  
   Unnamed: 0_level_0                                            Level 3  \
              keyword                                 Unnamed: 1_level_1   
         evaluate  Evaluate a model with multiple metrics and cro...   
         optimize  Select optimal parameters based of mutiple qua...   
          compare  Evaluate tradeoffs between different model com...   
     unstructured  apply multiple representations and compare and...   
         workflow  Scope, choose an appropriate tool pipeline and...   

                   P1                 P2                 P3                 P4  
   Unnamed: 2_level_1 Unnamed: 3_level_1 Unnamed: 4_level_1 Unnamed: 5_level_1  
                0                  1                  1                  0  
                0                  0                  1                  1  
                0                  0                  1                  1  
                0                  0                  1                  1  
                0                  0                  1                  1  

The beginning of Exploratory Data Analysis¶

Pandas will give us descriptive statistics

safi_df.describe()

	key_ID	no_membrs	years_liv	rooms	liv_count	no_meals
count	131.000000	131.00000	131.000000	131.000000	131.000000	131.000000
mean	85.473282	7.19084	23.053435	1.740458	2.366412	2.603053
std	63.151628	3.17227	16.913041	1.092547	1.082775	0.491143
min	1.000000	2.00000	1.000000	1.000000	1.000000	2.000000
25%	32.500000	5.00000	12.000000	1.000000	1.000000	2.000000
50%	66.000000	7.00000	20.000000	1.000000	2.000000	3.000000
75%	138.000000	9.00000	27.500000	2.000000	3.000000	3.000000
max	202.000000	19.00000	96.000000	8.000000	5.000000	3.000000

The statistics of the key_ID column don’t make a lot of sense. We can avoid that by making it the index

safi_df.head()

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007
3	4	God	2016-11-17T00:00:00Z	7	6	burntbricks	1	NaN	NaN	2	bicycle;radio;cow_plough;solar_panel;mobile_phone	2	Sept;Oct;Nov;Dec	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	God	2016-11-17T00:00:00Z	7	40	burntbricks	1	NaN	NaN	4	motorcyle;radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:2c867811-9696-4966-9866-f35c3e97d02d

the inplace parameter of a pandas functions applies the operation to the DataFrame in memory, but then the function returns nothing, but if we display after that, we see that now the key_ID column is now the index.

safi_df.set_index('key_ID',inplace=True)
safi_df.head(2)

	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
key_ID
1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e

and if we describe again, we see it doesn’t compute on that column

safi_df.describe()

	no_membrs	years_liv	rooms	liv_count	no_meals
count	131.00000	131.000000	131.000000	131.000000	131.000000
mean	7.19084	23.053435	1.740458	2.366412	2.603053
std	3.17227	16.913041	1.092547	1.082775	0.491143
min	2.00000	1.000000	1.000000	1.000000	2.000000
25%	5.00000	12.000000	1.000000	1.000000	2.000000
50%	7.00000	20.000000	1.000000	2.000000	3.000000
75%	9.00000	27.500000	2.000000	3.000000	3.000000
max	19.00000	96.000000	8.000000	5.000000	3.000000

We can also call any of those on one column or one statistic.

safi_df['rooms'].mean()

1.7404580152671756

Pandas also has some built in plotting functions.

safi_df.plot.scatter('no_membrs','rooms')

<AxesSubplot:xlabel='no_membrs', ylabel='rooms'>

After Class Questions¶

More Practice¶

These additional questions are for if you want more practice with things we’ve done this week, before class next week.

Programming for Data Science at URI Fall 2020