5. Visualization#
Data visualization is about using plots to convey information and get a better understanding of the data.
5.1. Plotting in Python#
There are several popular plotting libraries:
matplotlib: low level plotting tools
seaborn: high level plotting with opinionated defaults
ggplot: plotting based on the ggplot library in R.
Plus, pandas has a plot method.
Pandas and seaborn use matplotlib under the hood.
Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.
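One way to see that pandas and seaborn build on matplotlib is to check what their plotting calls return. This is a minimal sketch, assuming a small made-up DataFrame (`demo_df` and its values are hypothetical, not the course data):

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# hypothetical toy data, only for illustration
demo_df = pd.DataFrame({'year': [2000, 2000, 2001, 2002, 2002, 2002],
                        'value': [1.0, 2.0, 1.5, 3.0, 2.5, 4.0]})

# the pandas plot method returns a matplotlib Axes object
pandas_ax = demo_df['year'].plot(kind='hist')
print(isinstance(pandas_ax, matplotlib.axes.Axes))

# start a new figure so the two plots do not overlap
plt.figure()

# seaborn's scatterplot also returns a matplotlib Axes object
seaborn_ax = sns.scatterplot(data=demo_df, x='year', y='value')
print(isinstance(seaborn_ax, matplotlib.axes.Axes))
```

Because both libraries return matplotlib objects, matplotlib methods can be used to adjust a plot made with either one.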
5.2. Figure and axis level plots#
add the image to your notebook with the following:
![summary of plot types](https://seaborn.pydata.org/_images/function_overview_8_0.png)
5.3. Anatomy of a figure#
*this was drawn with code
add the image to your notebook with the following:
![annotated graph](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)
we will load pandas and seaborn
import pandas as pd
import seaborn as sns
and continue with the same data
carbon_data_url = 'https://github.com/rfordatascience/tidytuesday/raw/master/data/2024/2024-05-21/emissions.csv'
and the added column we made last class
carbon_df = pd.read_csv(carbon_data_url)
carbon_df['commodity_simple'] = carbon_df['commodity'].apply(lambda s: 'Coal' if 'Coal' in s else s)
carbon_df.head()
| | year | parent_entity | parent_type | commodity | production_value | production_unit | total_emissions_MtCO2e | commodity_simple |
|---|---|---|---|---|---|---|---|---|
| 0 | 1962 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 0.91250 | Million bbl/yr | 0.363885 | Oil & NGL |
| 1 | 1962 | Abu Dhabi National Oil Company | State-owned Entity | Natural Gas | 1.84325 | Bcf/yr | 0.134355 | Natural Gas |
| 2 | 1963 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 1.82500 | Million bbl/yr | 0.727770 | Oil & NGL |
| 3 | 1963 | Abu Dhabi National Oil Company | State-owned Entity | Natural Gas | 4.42380 | Bcf/yr | 0.322453 | Natural Gas |
| 4 | 1964 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 7.30000 | Million bbl/yr | 2.911079 | Oil & NGL |
5.4. How are the samples distributed in time?#
We have not yet worked with the year column. An important first thing we might want to know
is how the measurements are distributed in time.
From last class, we might try value_counts
carbon_df['year'].value_counts()
year
2021 238
2022 238
2018 237
2019 236
2020 235
...
1858 3
1857 3
1856 3
1854 3
1863 3
Name: count, Length: 169, dtype: int64
but it’s a little hard to read. A histogram might be better
carbon_df['year'].plot(kind='hist')
<Axes: ylabel='Frequency'>
Here we see that there are a lot more samples in more recent years.
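Two small adjustments can make both views easier to read: sorting the counts by year, and choosing the number of histogram bins. This is a minimal sketch on made-up data (`toy_years` is a hypothetical stand-in for the year column, and the choice of 4 bins is arbitrary):

```python
import pandas as pd

# hypothetical toy years, standing in for carbon_df['year']
toy_years = pd.Series([1990, 2020, 2021, 2021, 2005, 2020, 2021, 1990],
                      name='year')

# sort_index orders the counts by year instead of by frequency
counts_by_year = toy_years.value_counts().sort_index()
print(counts_by_year)

# bins controls how many intervals the histogram groups the years into
hist_ax = toy_years.plot(kind='hist', bins=4)
```

With many distinct years, the sorted counts are still long, which is why the histogram remains the quicker summary.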
5.5. Anatomy of a plot#
The above figure comes from matplotlib’s Anatomy of a Figure page, which includes the code to generate that figure.
5.6. Figure and axis level plots#
5.7. Changing colors#
sns.set_palette('colorblind')
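To see what this call changes, it can help to look at the colors themselves. The sketch below uses only seaborn's `color_palette` and `set_palette`; the variable names are made up for illustration:

```python
import seaborn as sns

# color_palette returns the colors in a named palette as RGB tuples
colorblind_colors = sns.color_palette('colorblind')
print(len(colorblind_colors))
print(colorblind_colors[0])

# set_palette makes that palette the default for later plots;
# color_palette with no arguments returns the current default
sns.set_palette('colorblind')
current_colors = sns.color_palette()
print(current_colors[0])
```

In a notebook, leaving `sns.color_palette('colorblind')` as the last line of a cell displays the colors as swatches.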
5.8. Emissions by type#
sns.catplot(data=carbon_df,x='commodity_simple', y='total_emissions_MtCO2e')
<seaborn.axisgrid.FacetGrid at 0x7fc6d46e3a30>
sns.catplot(data=carbon_df,x='commodity_simple', y='total_emissions_MtCO2e',kind='bar')
<seaborn.axisgrid.FacetGrid at 0x7fc7185a3cd0>
sns.catplot(data=carbon_df,x='commodity_simple', y='total_emissions_MtCO2e',kind='bar',
col = 'parent_type')
<seaborn.axisgrid.FacetGrid at 0x7fc6d44d86d0>
sns.catplot(data=carbon_df,x='commodity_simple', y='total_emissions_MtCO2e',kind='bar',
col = 'parent_type',hue='commodity_simple')
<seaborn.axisgrid.FacetGrid at 0x7fc6d2096490>
Example okay questions:
which parent type has the most consistent emissions across commodity type?
which parent type has highest emission?
Example good questions
which type of emissions should be targeted for interventions (the highest)?
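By default, the bar heights in a `kind='bar'` plot are the mean of the `y` variable within each category, so questions like these can also be checked numerically with a groupby. This is a minimal sketch on made-up data (the numbers and the `'Hypothetical Type'` label are invented; only the column names match the real data):

```python
import pandas as pd

# hypothetical toy data that reuses the real column names
toy_df = pd.DataFrame({
    'parent_type': ['State-owned Entity', 'State-owned Entity',
                    'Hypothetical Type', 'Hypothetical Type'],
    'commodity_simple': ['Coal', 'Natural Gas', 'Coal', 'Natural Gas'],
    'total_emissions_MtCO2e': [10.0, 2.0, 6.0, 4.0]})

# mean emissions per parent type and commodity, like the faceted bar plot
mean_emissions = toy_df.groupby(['parent_type', 'commodity_simple'])[
    'total_emissions_MtCO2e'].mean()
print(mean_emissions)

# parent type with the highest mean emissions overall
type_means = toy_df.groupby('parent_type')['total_emissions_MtCO2e'].mean()
highest_type = type_means.idxmax()
print(highest_type)
```

The plot and the table answer the same question; the table gives exact values, while the plot makes the comparison visible at a glance.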
5.9. Emissions over time?#
sns.relplot(data=carbon_df, x='year', y='total_emissions_MtCO2e',
hue ='parent_entity',row ='parent_type')
<seaborn.axisgrid.FacetGrid at 0x7fc6d2106ee0>
5.10. Variable types and data types#
Related but not the same.
Data types are literal, related to the representation in the computer.
there can be int16, int32, int64
We can also have mathematical types of numbers
Integers can be positive, 0, or negative.
Reals are continuous, infinite possibilities.
Variable types are about the meaning in a conceptual sense.
categorical (can take a discrete number of values, could be used to group data, could be a string or integer; unordered)
continuous (can take on any possible value, always a number)
binary (like data type boolean, but could be represented as yes/no, true/false, or 1/0, could be categorical also, but often makes sense to calculate rates)
ordinal (ordered, but appropriately categorical)
we’ll focus on the first two most of the time. Some values that are technically only integers range high enough that we treat them more like continuous most of the time.
carbon_df.columns
Index(['year', 'parent_entity', 'parent_type', 'commodity', 'production_value',
'production_unit', 'total_emissions_MtCO2e', 'commodity_simple'],
dtype='object')
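The distinction between data types and variable types can be sketched in pandas. The example below assumes a small made-up DataFrame (`toy_df`, its values, and the `size` column are hypothetical) and shows the stored data types alongside one way to record that a variable is categorical or ordinal:

```python
import pandas as pd

# hypothetical toy data; the size column is invented for illustration
toy_df = pd.DataFrame({
    'year': [1962, 1963, 1964],
    'commodity_simple': ['Oil & NGL', 'Natural Gas', 'Coal'],
    'total_emissions_MtCO2e': [0.36, 0.73, 2.91],
    'size': ['small', 'large', 'medium']})

# data types: how each column is represented in the computer
print(toy_df.dtypes)

# a categorical variable can be stored with the category dtype
toy_df['commodity_simple'] = toy_df['commodity_simple'].astype('category')

# an ordinal variable is categorical with a meaningful order
size_order = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
toy_df['size'] = toy_df['size'].astype(size_order)
print(toy_df['size'].min())
```

Note that `year` is stored as an integer, but whether to treat it as continuous or categorical depends on the question being asked, not on the data type.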
5.11. Questions After Class#
Class Response Summary:
5.11.1. To what degree should we be familiarizing ourselves with these different kinds of graphs?#
There is also a full-semester data visualization class, so we will not cover everything. It is useful to know a few basic ones, and we will look to see that you can create and correctly interpret at least 3 different kinds.
5.11.2. Should I upload all parts of A2 today if I plan to go to office hours tomorrow? Or just the finished parts?#
All of it with your questions written in the file(s).
5.11.3. is there ways to overlap the different parent types into the same graph?#
This is called a stacked bar graph. There are examples in the seaborn tutorials for displot, with the important caveat that stacking can make some things hard to see. You can also stack with the low-level features.
5.11.4. is the ggplot option just the same method names as the version in R? or is the syntax updated to be similar also?#
I think it’s mostly matching method names, attribute names, and conceptual ideas. Python libraries all have to use Python syntax.
5.11.5. Is the peer review just for assignment 3 or will we have the option to do it for future assignments?#
Probably only 3, but possibly a couple more.
5.11.6. What is the typical range of sizes for a good dataset for this assignment#
A hundred to maybe 2000. You do not need more than that, and too many can make it slow.
5.11.7. if we don’t get any achievements on an assignment are we able to revise them to get an achievement?#
If you are very close, yes. If you are not very close, you will get advice that we recommend you apply on future assignments.
5.11.8. What is a numpy array?#
A data type that one of the attributes of a DataFrame takes. See the glossary entry for numpy array and the intro to DataFrames.
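As a quick check, the `values` attribute of a DataFrame is a numpy array. This is a minimal sketch on a small made-up DataFrame (`toy_df` is hypothetical and only borrows two of the real column names):

```python
import numpy as np
import pandas as pd

# hypothetical toy data that reuses two real column names
toy_df = pd.DataFrame({'production_value': [0.9125, 1.84325],
                       'total_emissions_MtCO2e': [0.363885, 0.134355]})

# the values attribute holds the data as a numpy array
toy_values = toy_df.values
print(type(toy_values))
print(toy_values.shape)
```

The array keeps only the data itself; the column names and the index stay with the DataFrame.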