Class 4: Pandas

Today we will:

Remember, Programming is a Practice

  • if you’re curious about something try it

  • you don’t need me to give you answers about how code works; the interpreter will tell you

  • if you don’t remember details, remember you can get help from Jupyter

with a ? after the function name, without the ()

print?

or using the tab key inside the () for a function

print()

or from core Python, with the help function

help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
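
The help text lists keyword arguments we can try right away. Here is a small sketch that uses sep, end, and file from the signature above; the in-memory buffer is only there so we can inspect exactly what print wrote.

```python
import io

# Point `file` at an in-memory text buffer instead of the screen.
buffer = io.StringIO()

# sep goes between the values, end is appended after the last one.
print("a", "b", "c", sep="-", end="!\n", file=buffer)

# Read back exactly what print wrote.
printed = buffer.getvalue()
print(repr(printed))
```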

Data in Pandas

We can import pandas again as before

import pandas as pd

and we can read in data.

pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')
key_ID village interview_date no_membrs years_liv respondent_wall_type rooms memb_assoc affect_conflicts liv_count items_owned no_meals months_lack_food instanceID
0 1 God 2016-11-17T00:00:00Z 3 4 muddaub 1 NaN NaN 1 bicycle;television;solar_panel;table 2 Jan uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1 1 God 2016-11-17T00:00:00Z 7 9 muddaub 1 yes once 3 cow_cart;bicycle;radio;cow_plough;solar_panel;... 2 Jan;Sept;Oct;Nov;Dec uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2 3 God 2016-11-17T00:00:00Z 10 15 burntbricks 1 NaN NaN 1 solar_torch 2 Jan;Feb;Mar;Oct;Nov;Dec uuid:193d7daf-9582-409b-bf09-027dd36f9007
3 4 God 2016-11-17T00:00:00Z 7 6 burntbricks 1 NaN NaN 2 bicycle;radio;cow_plough;solar_panel;mobile_phone 2 Sept;Oct;Nov;Dec uuid:148d1105-778a-4755-aa71-281eadd4a973
4 5 God 2016-11-17T00:00:00Z 7 40 burntbricks 1 NaN NaN 4 motorcyle;radio;cow_plough;mobile_phone 2 Aug;Sept;Oct;Nov uuid:2c867811-9696-4966-9866-f35c3e97d02d
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
126 126 Ruaca 2017-05-18T00:00:00Z 3 7 burntbricks 1 no more_once 3 motorcyle;radio;solar_panel 3 Oct;Nov;Dec uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965
127 193 Ruaca 2017-06-04T00:00:00Z 7 10 cement 3 no more_once 3 car;lorry;television;radio;sterio;cow_plough;s... 3 none uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df
128 194 Ruaca 2017-06-04T00:00:00Z 4 5 muddaub 1 no more_once 1 radio;solar_panel;solar_torch;mobile_phone 3 Sept;Oct;Nov uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf
129 199 Chirodzo 2017-06-04T00:00:00Z 7 17 burntbricks 2 yes more_once 2 cow_cart;lorry;motorcyle;computer;television;r... 3 Nov;Dec uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b
130 200 Chirodzo 2017-06-04T00:00:00Z 8 20 burntbricks 2 NaN NaN 3 radio;cow_plough;solar_panel;solar_torch;table... 3 Oct;Nov uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7

131 rows × 14 columns

To be able to use this, we need to save it to a variable.

safi_df = pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')

This is an excerpt from the SAFI dataset.
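
If you want to try the difference between displaying and saving without a network connection, here is a minimal sketch. The two-row csv_text is a made-up sample that borrows a few SAFI column names, and sample_df is a name chosen only for this example.

```python
import io

import pandas as pd

# Hypothetical two-row sample that borrows a few SAFI column names.
csv_text = "key_ID,village,no_membrs\n1,God,3\n3,God,10\n"

# Without assignment, the DataFrame is created and then discarded.
pd.read_csv(io.StringIO(csv_text))

# With assignment, we can keep working with the result.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)
```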

An important thing to do is to check datatypes; this is how we know what we can do with a variable.

type(safi_df)
pandas.core.frame.DataFrame

Another important thing to check is the size of the dataset.

safi_df.shape
(131, 14)

Recall that you can also tab complete

safi_df.shape
(131, 14)
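
Since shape is a tuple of (rows, columns), we can unpack it into two variables. A small sketch with a made-up DataFrame (sample_df and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up data, only for illustration.
sample_df = pd.DataFrame({"village": ["God", "God", "Ruaca"], "rooms": [1, 1, 3]})

# shape is a plain tuple, so it unpacks like any other tuple.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() of a DataFrame gives the number of rows.
print(len(sample_df))
```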

To see the first 5 rows of the dataset, use the head() method

safi_df.head()
key_ID village interview_date no_membrs years_liv respondent_wall_type rooms memb_assoc affect_conflicts liv_count items_owned no_meals months_lack_food instanceID
0 1 God 2016-11-17T00:00:00Z 3 4 muddaub 1 NaN NaN 1 bicycle;television;solar_panel;table 2 Jan uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1 1 God 2016-11-17T00:00:00Z 7 9 muddaub 1 yes once 3 cow_cart;bicycle;radio;cow_plough;solar_panel;... 2 Jan;Sept;Oct;Nov;Dec uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2 3 God 2016-11-17T00:00:00Z 10 15 burntbricks 1 NaN NaN 1 solar_torch 2 Jan;Feb;Mar;Oct;Nov;Dec uuid:193d7daf-9582-409b-bf09-027dd36f9007
3 4 God 2016-11-17T00:00:00Z 7 6 burntbricks 1 NaN NaN 2 bicycle;radio;cow_plough;solar_panel;mobile_phone 2 Sept;Oct;Nov;Dec uuid:148d1105-778a-4755-aa71-281eadd4a973
4 5 God 2016-11-17T00:00:00Z 7 40 burntbricks 1 NaN NaN 4 motorcyle;radio;cow_plough;mobile_phone 2 Aug;Sept;Oct;Nov uuid:2c867811-9696-4966-9866-f35c3e97d02d

We can call this function with a value to change how many rows are returned

safi_df.head(3)
key_ID village interview_date no_membrs years_liv respondent_wall_type rooms memb_assoc affect_conflicts liv_count items_owned no_meals months_lack_food instanceID
0 1 God 2016-11-17T00:00:00Z 3 4 muddaub 1 NaN NaN 1 bicycle;television;solar_panel;table 2 Jan uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1 1 God 2016-11-17T00:00:00Z 7 9 muddaub 1 yes once 3 cow_cart;bicycle;radio;cow_plough;solar_panel;... 2 Jan;Sept;Oct;Nov;Dec uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2 3 God 2016-11-17T00:00:00Z 10 15 burntbricks 1 NaN NaN 1 solar_torch 2 Jan;Feb;Mar;Oct;Nov;Dec uuid:193d7daf-9582-409b-bf09-027dd36f9007

To know how this works, we can view the documentation for the function

help(safi_df.head)
Help on method head in module pandas.core.generic:

head(n: 'int' = 5) -> 'FrameOrSeries' method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `n` rows, equivalent to ``df[:-n]``.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot
    6      shark
    7      whale
    8      zebra
    
    Viewing the first 5 lines
    
    >>> df.head()
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    
    Viewing the first `n` lines (three in this case)
    
    >>> df.head(3)
          animal
    0  alligator
    1        bee
    2     falcon
    
    For negative values of `n`
    
    >>> df.head(-3)
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot

Since the signature says n: 'int' = 5, we know that the default value of the parameter n is 5. When a parameter has a default value, we can call the function without passing a value for it.
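
Default values work the same way in functions we write ourselves. The function first_items below is a hypothetical stand-in, not part of pandas; it only mimics the idea behind head on a plain list.

```python
def first_items(values, n=5):
    """Return the first n items of a list; n defaults to 5."""
    return values[:n]


letters = list("abcdefgh")

default_result = first_items(letters)       # n falls back to 5
explicit_result = first_items(letters, 3)   # n passed by position
keyword_result = first_items(letters, n=3)  # n passed by name

print(default_result)
print(explicit_result)
```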

To view the last few lines, we use tail

safi_df.tail()
key_ID village interview_date no_membrs years_liv respondent_wall_type rooms memb_assoc affect_conflicts liv_count items_owned no_meals months_lack_food instanceID
126 126 Ruaca 2017-05-18T00:00:00Z 3 7 burntbricks 1 no more_once 3 motorcyle;radio;solar_panel 3 Oct;Nov;Dec uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965
127 193 Ruaca 2017-06-04T00:00:00Z 7 10 cement 3 no more_once 3 car;lorry;television;radio;sterio;cow_plough;s... 3 none uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df
128 194 Ruaca 2017-06-04T00:00:00Z 4 5 muddaub 1 no more_once 1 radio;solar_panel;solar_torch;mobile_phone 3 Sept;Oct;Nov uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf
129 199 Chirodzo 2017-06-04T00:00:00Z 7 17 burntbricks 2 yes more_once 2 cow_cart;lorry;motorcyle;computer;television;r... 3 Nov;Dec uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b
130 200 Chirodzo 2017-06-04T00:00:00Z 8 20 burntbricks 2 NaN NaN 3 radio;cow_plough;solar_panel;solar_torch;table... 3 Oct;Nov uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7
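
Like head, tail accepts a number of rows. A quick sketch on a made-up DataFrame (the key_ID values here are invented for the example):

```python
import pandas as pd

# Made-up data, only for illustration.
sample_df = pd.DataFrame({"key_ID": [1, 3, 4, 5, 6, 7, 8]})

# Ask for the last 2 rows instead of the default 5.
last_two = sample_df.tail(2)
print(last_two)

# With no argument, tail returns 5 rows, just like head.
print(len(sample_df.tail()))
```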

We can also get an Index for the columns of the DataFrame.

safi_df.columns
Index(['key_ID', 'village', 'interview_date', 'no_membrs', 'years_liv',
       'respondent_wall_type', 'rooms', 'memb_assoc', 'affect_conflicts',
       'liv_count', 'items_owned', 'no_meals', 'months_lack_food',
       'instanceID'],
      dtype='object')

An Index is an ordered sequence of labels, so we can index into it by position

Try it Yourself

How would you view the name of the 3rd column?

First the correct answer:

safi_df.columns[2]
'interview_date'
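
Positions start at 0, which is why the 3rd column is at position 2. The sketch below uses an empty DataFrame with a few of the SAFI column names (sample_df is an assumption for illustration) and also shows negative positions and slices.

```python
import pandas as pd

# Empty DataFrame that only carries column names, for illustration.
sample_df = pd.DataFrame(columns=["key_ID", "village", "interview_date", "no_membrs"])

third = sample_df.columns[2]            # counting starts at 0
last = sample_df.columns[-1]            # negative positions count from the end
first_two = list(sample_df.columns[:2])  # slices work too

print(third, last, first_two)
```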

Now some misconceptions:

safi_df['interview_date']
0      2016-11-17T00:00:00Z
1      2016-11-17T00:00:00Z
2      2016-11-17T00:00:00Z
3      2016-11-17T00:00:00Z
4      2016-11-17T00:00:00Z
               ...         
126    2017-05-18T00:00:00Z
127    2017-06-04T00:00:00Z
128    2017-06-04T00:00:00Z
129    2017-06-04T00:00:00Z
130    2017-06-04T00:00:00Z
Name: interview_date, Length: 131, dtype: object

Indexing with the column name will return the values in the column, not its name

safi_df.columns(2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-bd02c7e8a4a6> in <module>
----> 1 safi_df.columns(2)

TypeError: 'Index' object is not callable

Using () raises an error because columns is an attribute, which is referenced as is, with no (). We get a TypeError because only callable objects, such as functions and methods, can be called with (); the Index stored in columns is a value, not a function.
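
Python's built-in callable function lets us check this before we try it. A small sketch, again on a made-up DataFrame:

```python
import pandas as pd

# Made-up data, only for illustration.
sample_df = pd.DataFrame({"village": ["God", "Ruaca"]})

# Methods can be called with (); attributes that hold values cannot.
head_is_callable = callable(sample_df.head)
columns_is_callable = callable(sample_df.columns)
print(head_is_callable, columns_is_callable)

# Calling the Index anyway raises the TypeError shown above.
try:
    sample_df.columns(0)
except TypeError as err:
    message = str(err)
    print(message)
```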

pd.DataFrame.columns[2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-40e277f3074e> in <module>
----> 1 pd.DataFrame.columns[2]

TypeError: 'pandas._libs.properties.AxisProperty' object is not subscriptable

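This doesn’t work because columns holds an Index only on an object of type pandas.DataFrame, such as safi_df. pd.DataFrame is the class itself, and on the class columns is a property definition (the AxisProperty named in the error), which cannot be indexed.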

We can see what the type of pd.DataFrame is with the type function.

type(pd.DataFrame)
type
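
To make the class versus object distinction concrete, here is a short sketch. sample_df is a made-up DataFrame; isinstance is the built-in way to ask whether an object was made from a given class.

```python
import pandas as pd

# Made-up data, only for illustration.
sample_df = pd.DataFrame({"village": ["God"]})

# The class itself is of type `type`.
class_type = type(pd.DataFrame)

# An object made from the class is of type DataFrame.
object_type = type(sample_df)

print(class_type, object_type)
print(isinstance(sample_df, pd.DataFrame))

# On the object, columns is a real Index that we can index into.
print(sample_df.columns[0])
```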

Knowing about types is helpful for the individual columns of a dataset as well.

safi_df.dtypes
key_ID                   int64
village                 object
interview_date          object
no_membrs                int64
years_liv                int64
respondent_wall_type    object
rooms                    int64
memb_assoc              object
affect_conflicts        object
liv_count                int64
items_owned             object
no_meals                 int64
months_lack_food        object
instanceID              object
dtype: object

Note that it uses int64 for the columns of whole numbers and object for the columns that hold text.
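
The dtype tells us what we can do with a column. The sketch below uses helper functions from pandas.api.types on a made-up DataFrame; note that newer pandas versions may report a dedicated string dtype rather than object for text columns, so the check here asks only whether a column is numeric.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Made-up data, only for illustration.
sample_df = pd.DataFrame({"village": ["God", "Ruaca"], "rooms": [1, 3]})

rooms_is_int = is_integer_dtype(sample_df["rooms"])
village_is_numeric = is_numeric_dtype(sample_df["village"])
print(rooms_is_int, village_is_numeric)

# Arithmetic such as a mean makes sense for the integer column.
mean_rooms = sample_df["rooms"].mean()
print(mean_rooms)
```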

safi_df.head(2)
key_ID village interview_date no_membrs years_liv respondent_wall_type rooms memb_assoc affect_conflicts liv_count items_owned no_meals months_lack_food instanceID
0 1 God 2016-11-17T00:00:00Z 3 4 muddaub 1 NaN NaN 1 bicycle;television;solar_panel;table 2 Jan uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1 1 God 2016-11-17T00:00:00Z 7 9 muddaub 1 yes once 3 cow_cart;bicycle;radio;cow_plough;solar_panel;... 2 Jan;Sept;Oct;Nov;Dec uuid:099de9c9-3e5e-427b-8452-26250e840d6e

We might want to look at what villages were included in the data.

pd.unique(safi_df['village'])
array(['God', 'Chirodzo', 'Ruaca'], dtype=object)

We can also get a count of how many times each value appears

safi_df['village'].value_counts()
Ruaca       49
God         43
Chirodzo    39
Name: village, dtype: int64
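
Two related tools are worth trying here: value_counts accepts normalize=True to return proportions instead of counts, and nunique gives the number of distinct values. A sketch on a made-up Series:

```python
import pandas as pd

# Made-up data, only for illustration.
villages = pd.Series(["God", "Ruaca", "Ruaca", "Chirodzo"], name="village")

counts = villages.value_counts()                # how many of each
shares = villages.value_counts(normalize=True)  # fraction of each
n_distinct = villages.nunique()                 # number of distinct values

print(counts)
print(shares)
print(n_distinct)
```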

Try it Yourself!

How many surveyed farms have wall type muddaub?

Both 45 and 46 count as good answers: 45 entries are recorded exactly as muddaub, and one more is recorded with a leading space.

safi_df['respondent_wall_type'].value_counts()
burntbricks     65
muddaub         45
sunbricks       17
 burntbricks     2
 muddaub         1
cement           1
Name: respondent_wall_type, dtype: int64
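
The separate rows for burntbricks and muddaub with a leading space suggest that stray whitespace splits what should be one category. One possible cleanup, sketched on a made-up Series, is to strip whitespace with the str.strip method before counting:

```python
import pandas as pd

# Made-up data with leading spaces, mimicking the pattern in the output above.
walls = pd.Series(["muddaub", " muddaub", "burntbricks", " burntbricks", "cement"])

# Counting the raw values treats " muddaub" and "muddaub" as different.
raw_counts = walls.value_counts()

# Stripping whitespace first merges them into one category.
clean_counts = walls.str.strip().value_counts()

print(raw_counts)
print(clean_counts)
```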

Review and Further reading

If you’ve made it this far, let me know how you found these notes.