Class 4: Pandas
Today we will:
Remember, Programming is a Practice
If you're curious about something, try it.
You don't need me to give you answers about how code works; the interpreter will tell you.
If you don't remember the details, you can get help from Jupyter by putting a ? after the function name, without the ():
print?
or by using the tab key inside the () of a function:
print()
or from core Python, with the help function:
help(print)
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Data in Pandas
We can import pandas again, as before:
import pandas as pd
Then we can read in data:
pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')
key_ID | village | interview_date | no_membrs | years_liv | respondent_wall_type | rooms | memb_assoc | affect_conflicts | liv_count | items_owned | no_meals | months_lack_food | instanceID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | God | 2016-11-17T00:00:00Z | 3 | 4 | muddaub | 1 | NaN | NaN | 1 | bicycle;television;solar_panel;table | 2 | Jan | uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef |
1 | 1 | God | 2016-11-17T00:00:00Z | 7 | 9 | muddaub | 1 | yes | once | 3 | cow_cart;bicycle;radio;cow_plough;solar_panel;... | 2 | Jan;Sept;Oct;Nov;Dec | uuid:099de9c9-3e5e-427b-8452-26250e840d6e |
2 | 3 | God | 2016-11-17T00:00:00Z | 10 | 15 | burntbricks | 1 | NaN | NaN | 1 | solar_torch | 2 | Jan;Feb;Mar;Oct;Nov;Dec | uuid:193d7daf-9582-409b-bf09-027dd36f9007 |
3 | 4 | God | 2016-11-17T00:00:00Z | 7 | 6 | burntbricks | 1 | NaN | NaN | 2 | bicycle;radio;cow_plough;solar_panel;mobile_phone | 2 | Sept;Oct;Nov;Dec | uuid:148d1105-778a-4755-aa71-281eadd4a973 |
4 | 5 | God | 2016-11-17T00:00:00Z | 7 | 40 | burntbricks | 1 | NaN | NaN | 4 | motorcyle;radio;cow_plough;mobile_phone | 2 | Aug;Sept;Oct;Nov | uuid:2c867811-9696-4966-9866-f35c3e97d02d |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126 | 126 | Ruaca | 2017-05-18T00:00:00Z | 3 | 7 | burntbricks | 1 | no | more_once | 3 | motorcyle;radio;solar_panel | 3 | Oct;Nov;Dec | uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965 |
127 | 193 | Ruaca | 2017-06-04T00:00:00Z | 7 | 10 | cement | 3 | no | more_once | 3 | car;lorry;television;radio;sterio;cow_plough;s... | 3 | none | uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df |
128 | 194 | Ruaca | 2017-06-04T00:00:00Z | 4 | 5 | muddaub | 1 | no | more_once | 1 | radio;solar_panel;solar_torch;mobile_phone | 3 | Sept;Oct;Nov | uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf |
129 | 199 | Chirodzo | 2017-06-04T00:00:00Z | 7 | 17 | burntbricks | 2 | yes | more_once | 2 | cow_cart;lorry;motorcyle;computer;television;r... | 3 | Nov;Dec | uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b |
130 | 200 | Chirodzo | 2017-06-04T00:00:00Z | 8 | 20 | burntbricks | 2 | NaN | NaN | 3 | radio;cow_plough;solar_panel;solar_torch;table... | 3 | Oct;Nov | uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7 |
131 rows × 14 columns
To be able to use this data, we need to save it to a variable.
safi_df = pd.read_csv('https://raw.githubusercontent.com/brownsarahm/python-socialsci-files/master/data/SAFI_clean.csv')
This is an excerpt from the SAFI dataset.
Another important thing to do is to check data types; the type of a variable tells us what we can do with it.
type(safi_df)
pandas.core.frame.DataFrame
An important thing to check is the size of the dataset.
safi_df.shape
(131, 14)
Recall that you can also use tab completion to fill in the attribute name:
safi_df.shape
(131, 14)
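As a small sketch of what shape holds, the example below builds a tiny, made-up DataFrame (the name toy_df and its values are illustrative assumptions, not part of the SAFI data) and unpacks the (rows, columns) tuple.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration (not the SAFI data)
toy_df = pd.DataFrame({
    'village': ['God', 'Ruaca', 'Chirodzo'],
    'rooms': [1, 3, 2],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)
```

Because shape is a plain tuple, it can be unpacked or indexed like any other tuple.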
To see the first 5 rows of the dataset, use the head() method:
safi_df.head()
key_ID | village | interview_date | no_membrs | years_liv | respondent_wall_type | rooms | memb_assoc | affect_conflicts | liv_count | items_owned | no_meals | months_lack_food | instanceID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | God | 2016-11-17T00:00:00Z | 3 | 4 | muddaub | 1 | NaN | NaN | 1 | bicycle;television;solar_panel;table | 2 | Jan | uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef |
1 | 1 | God | 2016-11-17T00:00:00Z | 7 | 9 | muddaub | 1 | yes | once | 3 | cow_cart;bicycle;radio;cow_plough;solar_panel;... | 2 | Jan;Sept;Oct;Nov;Dec | uuid:099de9c9-3e5e-427b-8452-26250e840d6e |
2 | 3 | God | 2016-11-17T00:00:00Z | 10 | 15 | burntbricks | 1 | NaN | NaN | 1 | solar_torch | 2 | Jan;Feb;Mar;Oct;Nov;Dec | uuid:193d7daf-9582-409b-bf09-027dd36f9007 |
3 | 4 | God | 2016-11-17T00:00:00Z | 7 | 6 | burntbricks | 1 | NaN | NaN | 2 | bicycle;radio;cow_plough;solar_panel;mobile_phone | 2 | Sept;Oct;Nov;Dec | uuid:148d1105-778a-4755-aa71-281eadd4a973 |
4 | 5 | God | 2016-11-17T00:00:00Z | 7 | 40 | burntbricks | 1 | NaN | NaN | 4 | motorcyle;radio;cow_plough;mobile_phone | 2 | Aug;Sept;Oct;Nov | uuid:2c867811-9696-4966-9866-f35c3e97d02d |
We can call this method with a value to change how many rows are returned:
safi_df.head(3)
key_ID | village | interview_date | no_membrs | years_liv | respondent_wall_type | rooms | memb_assoc | affect_conflicts | liv_count | items_owned | no_meals | months_lack_food | instanceID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | God | 2016-11-17T00:00:00Z | 3 | 4 | muddaub | 1 | NaN | NaN | 1 | bicycle;television;solar_panel;table | 2 | Jan | uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef |
1 | 1 | God | 2016-11-17T00:00:00Z | 7 | 9 | muddaub | 1 | yes | once | 3 | cow_cart;bicycle;radio;cow_plough;solar_panel;... | 2 | Jan;Sept;Oct;Nov;Dec | uuid:099de9c9-3e5e-427b-8452-26250e840d6e |
2 | 3 | God | 2016-11-17T00:00:00Z | 10 | 15 | burntbricks | 1 | NaN | NaN | 1 | solar_torch | 2 | Jan;Feb;Mar;Oct;Nov;Dec | uuid:193d7daf-9582-409b-bf09-027dd36f9007 |
To know how this works, we can view the documentation for the function
help(safi_df.head)
Help on method head in module pandas.core.generic:
head(n: 'int' = 5) -> 'FrameOrSeries' method of pandas.core.frame.DataFrame instance
Return the first `n` rows.
This function returns the first `n` rows for the object based
on position. It is useful for quickly testing if your object
has the right type of data in it.
For negative values of `n`, this function returns all rows except
the last `n` rows, equivalent to ``df[:-n]``.
Parameters
----------
n : int, default 5
Number of rows to select.
Returns
-------
same type as caller
The first `n` rows of the caller object.
See Also
--------
DataFrame.tail: Returns the last `n` rows.
Examples
--------
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
... 'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
Viewing the first 5 lines
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
Viewing the first `n` lines (three in this case)
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
For negative values of `n`
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
Since the signature says n = 5, we know that the default value of the parameter n is 5. When a parameter has a default value, we can call the function without passing a value for it.
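To see default values at work outside of pandas, here is a minimal sketch. The function first_items is hypothetical, a simplified stand-in for head that works on a plain list.

```python
def first_items(values, n=5):
    """Return the first n items of a list (a simplified stand-in for head)."""
    return values[:n]


letters = list('abcdefgh')

default_result = first_items(letters)      # n falls back to its default, 5
explicit_result = first_items(letters, 3)  # n is set explicitly to 3

print(default_result)
print(explicit_result)
```

Calling the function without n uses the default from the signature, just as safi_df.head() does.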
To view the last few lines, we use tail
safi_df.tail()
key_ID | village | interview_date | no_membrs | years_liv | respondent_wall_type | rooms | memb_assoc | affect_conflicts | liv_count | items_owned | no_meals | months_lack_food | instanceID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
126 | 126 | Ruaca | 2017-05-18T00:00:00Z | 3 | 7 | burntbricks | 1 | no | more_once | 3 | motorcyle;radio;solar_panel | 3 | Oct;Nov;Dec | uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965 |
127 | 193 | Ruaca | 2017-06-04T00:00:00Z | 7 | 10 | cement | 3 | no | more_once | 3 | car;lorry;television;radio;sterio;cow_plough;s... | 3 | none | uuid:5ccc2e5a-ea90-48b5-8542-69400d5334df |
128 | 194 | Ruaca | 2017-06-04T00:00:00Z | 4 | 5 | muddaub | 1 | no | more_once | 1 | radio;solar_panel;solar_torch;mobile_phone | 3 | Sept;Oct;Nov | uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf |
129 | 199 | Chirodzo | 2017-06-04T00:00:00Z | 7 | 17 | burntbricks | 2 | yes | more_once | 2 | cow_cart;lorry;motorcyle;computer;television;r... | 3 | Nov;Dec | uuid:ffc83162-ff24-4a87-8709-eff17abc0b3b |
130 | 200 | Chirodzo | 2017-06-04T00:00:00Z | 8 | 20 | burntbricks | 2 | NaN | NaN | 3 | radio;cow_plough;solar_panel;solar_torch;table... | 3 | Oct;Nov | uuid:aa77a0d7-7142-41c8-b494-483a5b68d8a7 |
We can also get an Index for the columns of the DataFrame:
safi_df.columns
Index(['key_ID', 'village', 'interview_date', 'no_membrs', 'years_liv',
'respondent_wall_type', 'rooms', 'memb_assoc', 'affect_conflicts',
'liv_count', 'items_owned', 'no_meals', 'months_lack_food',
'instanceID'],
dtype='object')
An Index object is an ordered sequence, so we can index into it by position.
Try it Yourself
How would you view the name of the 3rd column?
First the correct answer:
safi_df.columns[2]
'interview_date'
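The same positional indexing can be tried on a small example. This is a sketch with a made-up DataFrame (toy_df and its three columns are assumptions for illustration only).

```python
import pandas as pd

# Made-up data with three columns, used only for illustration
toy_df = pd.DataFrame({
    'key_ID': [1, 2],
    'village': ['God', 'Ruaca'],
    'rooms': [1, 3],
})

# Python counts from 0, so position 2 is the third column
third_name = toy_df.columns[2]

# Negative positions count from the end, and slices return several names
last_name = toy_df.columns[-1]
first_two = list(toy_df.columns[:2])

print(third_name, last_name, first_two)
```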
Now some misconceptions:
safi_df['interview_date']
0 2016-11-17T00:00:00Z
1 2016-11-17T00:00:00Z
2 2016-11-17T00:00:00Z
3 2016-11-17T00:00:00Z
4 2016-11-17T00:00:00Z
...
126 2017-05-18T00:00:00Z
127 2017-06-04T00:00:00Z
128 2017-06-04T00:00:00Z
129 2017-06-04T00:00:00Z
130 2017-06-04T00:00:00Z
Name: interview_date, Length: 131, dtype: object
Indexing with the column name returns the values in that column, not the name of the column.
safi_df.columns(2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-bd02c7e8a4a6> in <module>
----> 1 safi_df.columns(2)
TypeError: 'Index' object is not callable
Using () raises an error because columns is an attribute, which is referenced as is, with no (). We get a TypeError because only callable objects, such as functions and methods, can be called with (); an attribute like columns holds a value (here an Index), not a function.
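One way to check this before calling something is Python's built-in callable function. The sketch below uses a made-up DataFrame (toy_df is an assumption for illustration).

```python
import pandas as pd

# Made-up data, used only for illustration
toy_df = pd.DataFrame({'village': ['God', 'Ruaca'], 'rooms': [1, 3]})

# head is a method, so it can be called with ()
head_is_callable = callable(toy_df.head)

# columns is an attribute holding an Index value, so it cannot be called
columns_is_callable = callable(toy_df.columns)

print(head_is_callable, columns_is_callable)
```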
pd.DataFrame.columns[2]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-40e277f3074e> in <module>
----> 1 pd.DataFrame.columns[2]
TypeError: 'pandas._libs.properties.AxisProperty' object is not subscriptable
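This doesn't work because columns is an attribute of an object of type pandas.DataFrame, that is, of a specific DataFrame such as safi_df. pd.DataFrame is the class itself, not a DataFrame that holds data, so pd.DataFrame.columns has no column names to index into.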
We can see what the type of pd.DataFrame is with the type function.
type(pd.DataFrame)
type
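The difference between a class and an object created from it can be sketched in plain Python. The Table class below is hypothetical and far simpler than a DataFrame; it only illustrates why an attribute works on an instance but not on the class.

```python
class Table:
    """A hypothetical, minimal stand-in for a DataFrame-like class."""

    def __init__(self, data):
        self._data = data

    @property
    def columns(self):
        # Computed from the data stored on one particular instance
        return list(self._data.keys())


table = Table({'village': ['God'], 'rooms': [1]})

instance_columns = table.columns   # a list of names from this instance
class_columns = Table.columns      # the property object itself, with no data

try:
    Table.columns[0]
    class_error = None
except TypeError as err:
    # Accessing the attribute on the class gives nothing to index into
    class_error = str(err)

print(instance_columns)
print(type(Table))
print(class_error)
```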
Knowing about types is helpful for the individual columns of a dataset as well.
safi_df.dtypes
key_ID int64
village object
interview_date object
no_membrs int64
years_liv int64
respondent_wall_type object
rooms int64
memb_assoc object
affect_conflicts object
liv_count int64
items_owned object
no_meals int64
months_lack_food object
instanceID object
dtype: object
Note that pandas reports int64 for the columns of whole numbers and object for the columns that hold text.
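As a rough sketch of why column types matter, the example below checks which columns of a made-up DataFrame are numeric (toy_df is an assumption for illustration; the exact dtype names pandas prints can vary by version and platform).

```python
import pandas as pd

# Made-up data, used only for illustration
toy_df = pd.DataFrame({
    'village': ['God', 'Ruaca', 'Ruaca'],
    'rooms': [1, 3, 2],
})

# Helper functions in pd.api.types test what kind of data a column holds
rooms_is_numeric = pd.api.types.is_numeric_dtype(toy_df['rooms'])
village_is_numeric = pd.api.types.is_numeric_dtype(toy_df['village'])

# Arithmetic such as a mean is meaningful for the numeric column
mean_rooms = toy_df['rooms'].mean()

print(rooms_is_numeric, village_is_numeric, mean_rooms)
```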
safi_df.head(2)
key_ID | village | interview_date | no_membrs | years_liv | respondent_wall_type | rooms | memb_assoc | affect_conflicts | liv_count | items_owned | no_meals | months_lack_food | instanceID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | God | 2016-11-17T00:00:00Z | 3 | 4 | muddaub | 1 | NaN | NaN | 1 | bicycle;television;solar_panel;table | 2 | Jan | uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef |
1 | 1 | God | 2016-11-17T00:00:00Z | 7 | 9 | muddaub | 1 | yes | once | 3 | cow_cart;bicycle;radio;cow_plough;solar_panel;... | 2 | Jan;Sept;Oct;Nov;Dec | uuid:099de9c9-3e5e-427b-8452-26250e840d6e |
We might want to look at what villages were included in the data.
pd.unique(safi_df['village'])
array(['God', 'Chirodzo', 'Ruaca'], dtype=object)
We can also count the number of occurrences of each value:
safi_df['village'].value_counts()
Ruaca 49
God 43
Chirodzo 39
Name: village, dtype: int64
Try it Yourself!
How many surveyed farms have wall type muddaub?
Either 46 or 45 counts as a good answer, because muddaub appears under two separate labels in the counts below (45 and 1).
safi_df['respondent_wall_type'].value_counts()
burntbricks 65
muddaub 45
sunbricks 17
burntbricks 2
muddaub 1
cement 1
Name: respondent_wall_type, dtype: int64
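The output above lists burntbricks and muddaub twice. One possible cause, assumed here rather than confirmed by these notes, is stray whitespace around some of the values. The sketch below uses made-up values (wall_types is hypothetical) to show how such near-duplicates split the counts and how stripping whitespace merges them.

```python
import pandas as pd

# Made-up values; note the trailing space in the third entry
wall_types = pd.Series(['muddaub', 'burntbricks', 'muddaub ', 'burntbricks', 'cement'])

# 'muddaub' and 'muddaub ' are different strings, so they are counted separately
raw_counts = wall_types.value_counts()

# Stripping leading and trailing whitespace first merges them into one label
clean_counts = wall_types.str.strip().value_counts()

print(raw_counts)
print(clean_counts)
```

Whether this explains the SAFI output would need to be checked against the data itself, for example by inspecting the unique values of the column.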
Review and Further Reading
Python built-in functions, and in particular the type function
If you’ve made it this far, let me know how you found these notes.