Class 4: Pandas¶

Today we will:

Remember, Programming is a Practice¶

If you’re curious about something, try it.
You don’t need me to give you answers about how code works; the interpreter will tell you.
If you don’t remember details, remember that you can get help from Jupyter:

with a ? after the function name, without the ()

print?

or by pressing Shift+Tab with the cursor inside the () of a function

print()

or from core Python, with the help function

help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
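The keyword arguments listed in this help text can be tried out directly. The following is a minimal sketch, not part of the original notes; the names buffer and result are made up for illustration, and the output is written to an in-memory buffer only so that it can be inspected afterwards.

```python
import io

# Write to an in-memory buffer instead of the screen so the result can be inspected.
buffer = io.StringIO()

# sep is inserted between the values; end is appended after the last value.
print("a", "b", "c", sep="-", end="!\n", file=buffer)

result = buffer.getvalue()
print(repr(result))  # 'a-b-c!\n'
```

Leaving out file=buffer sends the same text to the screen, which is the default described in the help text.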

Data in Pandas¶

We can import pandas again as before

import pandas as pd

and we can read in data.

pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007
3	4	God	2016-11-17T00:00:00Z	7	6	burntbricks	1	NaN	NaN	2	bicycle;radio;cow_plough;solar_panel;mobile_phone	2	Sept;Oct;Nov;Dec	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	God	2016-11-17T00:00:00Z	7	40	burntbricks	1	NaN	NaN	4	motorcyle;radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:2c867811-9696-4966-9866-f35c3e97d02d
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
126	126	Ruaca	2017-05-18T00:00:00Z	3	7	burntbricks	1	no	more_once	3	motorcyle;radio;solar_panel	3	Oct;Nov;Dec	uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965
127	193	Ruaca	2017-06-04T00:00:00Z	7	10	cement	3	no	more_once	3	car;lorry;television;radio;sterio;cow_plough;s...	3	none	uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df
128	194	Ruaca	2017-06-04T00:00:00Z	4	5	muddaub	1	no	more_once	1	radio;solar_panel;solar_torch;mobile_phone	3	Sept;Oct;Nov	uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf
129	199	Chirodzo	2017-06-04T00:00:00Z	7	17	burntbricks	2	yes	more_once	2	cow_cart;lorry;motorcyle;computer;television;r...	3	Nov;Dec	uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b
130	200	Chirodzo	2017-06-04T00:00:00Z	8	20	burntbricks	2	NaN	NaN	3	radio;cow_plough;solar_panel;solar_torch;table...	3	Oct;Nov	uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7

131 rows × 14 columns

To be able to use this, we need to save it to a variable.

safi_df = pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')

This is an excerpt from the SAFI dataset.
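The difference between displaying a DataFrame and saving it can be tried without a network connection. This is a small sketch under stated assumptions: csv_text and small_df are hypothetical names, and the three rows are made-up stand-ins rather than the real SAFI file.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: a header row plus three data rows.
csv_text = """key_ID,village,no_membrs
1,God,3
2,God,7
3,Ruaca,10
"""

# Without an assignment, the DataFrame is displayed once and then discarded.
# Assigning it to a name lets us keep working with it in later cells.
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)  # (3, 3)
```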

It is important to check datatypes; this is how we know what things we can do with a variable.

type(safi_df)

pandas.core.frame.DataFrame

An important thing to check is the size of the dataset.

safi_df.shape

(131, 14)

Recall that you can also tab complete

safi_df.shape

(131, 14)
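Because shape is a tuple of (rows, columns), it can be unpacked into two variables. A minimal sketch, assuming a small hypothetical DataFrame named example_df instead of the SAFI data:

```python
import pandas as pd

# Hypothetical data: four rows and two columns.
example_df = pd.DataFrame(
    {"village": ["God", "Ruaca", "Chirodzo", "God"], "rooms": [1, 3, 2, 1]}
)

# shape is an attribute (no parentheses) holding (number of rows, number of columns).
n_rows, n_cols = example_df.shape
print(n_rows, n_cols)  # 4 2

# len() of a DataFrame gives the number of rows only.
row_count = len(example_df)
```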

To see the first 5 rows of the dataset, use the head() function

safi_df.head()

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007
3	4	God	2016-11-17T00:00:00Z	7	6	burntbricks	1	NaN	NaN	2	bicycle;radio;cow_plough;solar_panel;mobile_phone	2	Sept;Oct;Nov;Dec	uuid:148d1105-778a-4755-aa71-281eadd4a973
4	5	God	2016-11-17T00:00:00Z	7	40	burntbricks	1	NaN	NaN	4	motorcyle;radio;cow_plough;mobile_phone	2	Aug;Sept;Oct;Nov	uuid:2c867811-9696-4966-9866-f35c3e97d02d

We can call this function with a value to change how many rows are returned

safi_df.head(3)

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e
2	3	God	2016-11-17T00:00:00Z	10	15	burntbricks	1	NaN	NaN	1	solar_torch	2	Jan;Feb;Mar;Oct;Nov;Dec	uuid:193d7daf-9582-409b-bf09-027dd36f9007

To know how this works, we can view the documentation for the function

help(safi_df.head)

Help on method head in module pandas.core.generic:

head(n: 'int' = 5) -> 'FrameOrSeries' method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `n` rows, equivalent to ``df[:-n]``.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >>> df
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot
    6      shark
    7      whale
    8      zebra
    
    Viewing the first 5 lines
    
    >>> df.head()
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    
    Viewing the first `n` lines (three in this case)
    
    >>> df.head(3)
          animal
    0  alligator
    1        bee
    2     falcon
    
    For negative values of `n`
    
    >>> df.head(-3)
          animal
    0  alligator
    1        bee
    2     falcon
    3       lion
    4     monkey
    5     parrot

Since the signature says n: 'int' = 5, we know that the default value of the parameter n is 5. When a parameter has a default value, we can call the function without passing a value for it.
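Default values work the same way in functions we write ourselves. The sketch below is an illustration only: first_items is a hypothetical function that imitates the idea of head on a plain list; it is not how pandas implements head.

```python
def first_items(values, n=5):
    """Return the first n items of a list; n defaults to 5."""
    return values[:n]


letters = list("abcdefgh")

default_call = first_items(letters)       # n is not given, so the default 5 is used
explicit_call = first_items(letters, 3)   # n is given by position
keyword_call = first_items(letters, n=3)  # n is given by name

print(default_call)   # ['a', 'b', 'c', 'd', 'e']
print(explicit_call)  # ['a', 'b', 'c']
```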

To view the last few lines, we use tail

safi_df.tail()

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
126	126	Ruaca	2017-05-18T00:00:00Z	3	7	burntbricks	1	no	more_once	3	motorcyle;radio;solar_panel	3	Oct;Nov;Dec	uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965
127	193	Ruaca	2017-06-04T00:00:00Z	7	10	cement	3	no	more_once	3	car;lorry;television;radio;sterio;cow_plough;s...	3	none	uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df
128	194	Ruaca	2017-06-04T00:00:00Z	4	5	muddaub	1	no	more_once	1	radio;solar_panel;solar_torch;mobile_phone	3	Sept;Oct;Nov	uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf
129	199	Chirodzo	2017-06-04T00:00:00Z	7	17	burntbricks	2	yes	more_once	2	cow_cart;lorry;motorcyle;computer;television;r...	3	Nov;Dec	uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b
130	200	Chirodzo	2017-06-04T00:00:00Z	8	20	burntbricks	2	NaN	NaN	3	radio;cow_plough;solar_panel;solar_torch;table...	3	Oct;Nov	uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7

We can also get an Index for the columns of the DataFrame.

safi_df.columns

Index(['key_ID', 'village', 'interview_date', 'no_membrs', 'years_liv',
       'respondent_wall_type', 'rooms', 'memb_assoc', 'affect_conflicts',
       'liv_count', 'items_owned', 'no_meals', 'months_lack_food',
       'instanceID'],
      dtype='object')

An Index is an ordered sequence that supports positional indexing with [], so we can index into it.

Try it Yourself

How would you view the name of the 3rd column?

First the correct answer:

safi_df.columns[2]

'interview_date'

Now some misconceptions:

safi_df['interview_date']

0      2016-11-17T00:00:00Z
1      2016-11-17T00:00:00Z
2      2016-11-17T00:00:00Z
3      2016-11-17T00:00:00Z
4      2016-11-17T00:00:00Z
               ...         
126    2017-05-18T00:00:00Z
127    2017-06-04T00:00:00Z
128    2017-06-04T00:00:00Z
129    2017-06-04T00:00:00Z
130    2017-06-04T00:00:00Z
Name: interview_date, Length: 131, dtype: object

Indexing the DataFrame with the column name returns the values in that column, not the name of the column.

safi_df.columns(2)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-bd02c7e8a4a6> in <module>
----> 1 safi_df.columns(2)

TypeError: 'Index' object is not callable

Using () raises an error because columns is an attribute, which is referenced as is, with no (). We get a TypeError because () tries to call the object, and only callable objects, such as functions and methods, can be called. The Index stored in columns is a value, not a function.
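Python's built-in callable function reports whether () can be used on an object. A minimal sketch, assuming a small hypothetical DataFrame named example_df:

```python
import pandas as pd

# Hypothetical data for illustration.
example_df = pd.DataFrame({"village": ["God", "Ruaca"], "rooms": [1, 3]})

# columns is an attribute holding an Index, so it cannot be called.
columns_is_callable = callable(example_df.columns)

# head is a method, so it can be called with ().
head_is_callable = callable(example_df.head)

print(columns_is_callable)  # False
print(head_is_callable)     # True
```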

pd.DataFrame.columns[2]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-40e277f3074e> in <module>
----> 1 pd.DataFrame.columns[2]

TypeError: 'pandas._libs.properties.AxisProperty' object is not subscriptable

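This doesn’t work because columns only holds column names for an object of type pandas.DataFrame, such as safi_df. pd.DataFrame is the class itself, not a particular DataFrame, so pd.DataFrame.columns is a property definition rather than an Index, and it cannot be subscripted.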

We can see what the type of pd.DataFrame is with the type function.

type(pd.DataFrame)

type
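The distinction between the class pd.DataFrame and an object of that class can be checked directly. A minimal sketch, assuming a small hypothetical DataFrame named example_df:

```python
import pandas as pd

# Hypothetical data for illustration.
example_df = pd.DataFrame({"village": ["God", "Ruaca"], "rooms": [1, 3]})

# pd.DataFrame is a class, so its type is type.
class_type = type(pd.DataFrame)

# example_df is an object (instance) of that class.
is_dataframe = isinstance(example_df, pd.DataFrame)

# Only the instance has actual column names to index into.
second_column = example_df.columns[1]

print(class_type)     # <class 'type'>
print(is_dataframe)   # True
print(second_column)  # rooms
```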

Knowing about types is helpful for the individual columns of a dataset as well.

safi_df.dtypes

key_ID                   int64
village                 object
interview_date          object
no_membrs                int64
years_liv                int64
respondent_wall_type    object
rooms                    int64
memb_assoc              object
affect_conflicts        object
liv_count                int64
items_owned             object
no_meals                 int64
months_lack_food        object
instanceID              object
dtype: object

Note that pandas reports int64 for the columns of whole numbers and object for the columns that hold text.
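The column types determine which operations make sense; for example, a mean is only meaningful for numeric columns. The sketch below assumes a small hypothetical DataFrame named example_df and uses select_dtypes to keep only the numeric columns.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns.
example_df = pd.DataFrame(
    {"village": ["God", "Ruaca"], "rooms": [1, 3], "years_liv": [4.5, 10.0]}
)

# Keep only the columns with a numeric dtype (integers and floats).
numeric_df = example_df.select_dtypes(include="number")
numeric_columns = list(numeric_df.columns)

# Numeric columns support arithmetic summaries such as the mean.
mean_rooms = example_df["rooms"].mean()

print(numeric_columns)  # ['rooms', 'years_liv']
print(mean_rooms)       # 2.0
```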

safi_df.head(2)

	key_ID	village	interview_date	no_membrs	years_liv	respondent_wall_type	rooms	memb_assoc	affect_conflicts	liv_count	items_owned	no_meals	months_lack_food	instanceID
0	1	God	2016-11-17T00:00:00Z	3	4	muddaub	1	NaN	NaN	1	bicycle;television;solar_panel;table	2	Jan	uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
1	1	God	2016-11-17T00:00:00Z	7	9	muddaub	1	yes	once	3	cow_cart;bicycle;radio;cow_plough;solar_panel;...	2	Jan;Sept;Oct;Nov;Dec	uuid:099de9c9-3e5e-427b-8452-26250e840d6e

We might want to look at what villages were included in the data.

pd.unique(safi_df['village'])

array(['God', 'Chirodzo', 'Ruaca'], dtype=object)

We can also get a count of the number of times each value occurs.

safi_df['village'].value_counts()

Ruaca       49
God         43
Chirodzo    39
Name: village, dtype: int64

Try it Yourself!

How many surveyed farms have wall type muddaub?

46 or 45 count as good answers: 45 rows contain exactly 'muddaub', and 1 more row contains ' muddaub' with a leading space.

safi_df['respondent_wall_type'].value_counts()

burntbricks     65
muddaub         45
sunbricks       17
 burntbricks     2
 muddaub         1
cement           1
Name: respondent_wall_type, dtype: int64
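The separate rows for ' burntbricks' and ' muddaub' come from leading spaces in the data. One possible way to merge them, shown here as a sketch on a small hypothetical Series named wall_types rather than on the SAFI data, is to strip the whitespace before counting.

```python
import pandas as pd

# Hypothetical values; two of them have a leading space.
wall_types = pd.Series(
    ["muddaub", " muddaub", "burntbricks", "muddaub", " burntbricks"]
)

# Counting the raw values treats ' muddaub' and 'muddaub' as different.
raw_counts = wall_types.value_counts()

# str.strip() removes leading and trailing whitespace from every value.
clean_counts = wall_types.str.strip().value_counts()

print(raw_counts["muddaub"])    # 2
print(clean_counts["muddaub"])  # 3
```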

Review and Further reading¶

reading data with pandas
Python built-in functions, and in particular the type function
Pandas DataFrames
value_counts

If you’ve made it this far, let me know how you found these notes.

Programming for Data Science at URI Fall 2020
