5. Visualization#

Data visualization is about using plots to convey information and get a better understanding of the data.

5.1. Plotting in Python#

There are several popular plotting libraries:

  • matplotlib: low level plotting tools

  • seaborn: high level plotting with opinionated defaults

  • ggplot: plotting based on the ggplot library in R.

In addition, pandas has a plot method.

Pandas and seaborn use matplotlib under the hood.

Seaborn and ggplot both assume the data is set up as a DataFrame. Getting started with seaborn is the simplest, so we’ll use that.
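To make "under the hood" concrete, here is a minimal sketch showing that both the pandas plot method and a seaborn function hand back a matplotlib object. The name toy_df and its values are made up for illustration; they are not the class data.

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# made-up toy data, only for illustration
toy_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

# the pandas plot method returns a matplotlib Axes object
pandas_ax = toy_df.plot(x="x", y="y", kind="scatter")

# seaborn's scatterplot also returns a matplotlib Axes object
seaborn_ax = sns.scatterplot(data=toy_df, x="x", y="y")

print(type(pandas_ax))
print(isinstance(seaborn_ax, matplotlib.axes.Axes))
```

Because both return matplotlib objects, matplotlib methods such as set_title can be used to adjust plots made with either library.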

5.2. Figure and axis level plots#

[Figure: summary of plot types]

Add the image to your notebook with the following:

![summary of plot types](https://seaborn.pydata.org/_images/function_overview_8_0.png)
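As a rough sketch of the difference, an axes-level seaborn function returns a single matplotlib Axes, while a figure-level function returns a FacetGrid that can hold several axes. The toy data and the column name group are assumptions made up for this example.

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# made-up toy data, only for illustration
toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 3, 5, 4, 6],
    "group": ["a", "a", "a", "b", "b", "b"],
})

# axes-level: draws onto one matplotlib Axes and returns it
ax = sns.scatterplot(data=toy_df, x="x", y="y")

# figure-level: creates its own figure and returns a FacetGrid,
# here with one Axes per value of the `group` column
grid = sns.relplot(data=toy_df, x="x", y="y", col="group")

print(type(ax).__name__)
print(type(grid).__name__)
print(grid.axes.shape)
```

The figure-level functions are the ones that accept col and row, which is why we use them when we want to split a plot into panels.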

5.3. Anatomy of a figure#

[Figure: annotated graph]

*This figure was drawn with code.

Add the image to your notebook with the following:

![annotated graph](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)

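Figure vs. axes: the figure is the whole image, while an axes is a single plotting area inside it. One figure can contain several axes.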
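A small sketch of the figure vs. axes distinction, using matplotlib directly: one figure that contains two axes. The plotted numbers and titles are made up.

```python
import matplotlib.pyplot as plt

# one figure (the whole image) holding two axes (the individual plotting areas)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title("left axes")

axes[1].bar(["a", "b"], [3, 5])
axes[1].set_title("right axes")

# a title for the figure as a whole, separate from each axes title
fig.suptitle("one figure, two axes")

print(len(fig.axes))
```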

We will load pandas and seaborn:

import pandas as pd
import seaborn as sns

And continue with the same data:

carbon_data_url = 'https://github.com/rfordatascience/tidytuesday/raw/master/data/2024/2024-05-21/emissions.csv'

And add the column we made last class:

# load the data, then collapse every coal subtype into a single 'Coal' label
carbon_df = pd.read_csv(carbon_data_url)
carbon_df['commodity_simple'] = carbon_df['commodity'].apply(lambda s: 'Coal' if 'Coal' in s else s)
carbon_df.head()
|   | year | parent_entity | parent_type | commodity | production_value | production_unit | total_emissions_MtCO2e | commodity_simple |
|---|------|---------------|-------------|-----------|------------------|-----------------|------------------------|------------------|
| 0 | 1962 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 0.91250 | Million bbl/yr | 0.363885 | Oil & NGL |
| 1 | 1962 | Abu Dhabi National Oil Company | State-owned Entity | Natural Gas | 1.84325 | Bcf/yr | 0.134355 | Natural Gas |
| 2 | 1963 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 1.82500 | Million bbl/yr | 0.727770 | Oil & NGL |
| 3 | 1963 | Abu Dhabi National Oil Company | State-owned Entity | Natural Gas | 4.42380 | Bcf/yr | 0.322453 | Natural Gas |
| 4 | 1964 | Abu Dhabi National Oil Company | State-owned Entity | Oil & NGL | 7.30000 | Million bbl/yr | 2.911079 | Oil & NGL |
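To see what the simplified column does without downloading the data, here is a sketch on a tiny made-up DataFrame. The commodity values are illustrative, and the mask-based version is an alternative we did not use in class, shown only for comparison.

```python
import pandas as pd

# made-up commodity values, only for illustration
toy_df = pd.DataFrame(
    {"commodity": ["Oil & NGL", "Bituminous Coal", "Natural Gas", "Thermal Coal"]}
)

# apply a function to each value: any string containing 'Coal' becomes 'Coal'
toy_df["simple_apply"] = toy_df["commodity"].apply(
    lambda s: "Coal" if "Coal" in s else s
)

# alternative: build a boolean Series, then replace the values where it is True
is_coal = toy_df["commodity"].str.contains("Coal")
toy_df["simple_mask"] = toy_df["commodity"].mask(is_coal, "Coal")

print(toy_df)
```

Both columns hold the same values; the apply version is closer to what we wrote last class.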

5.4. How are the samples distributed in time?#

We have not yet worked with the year column. An important first thing we might want to know is how the measurements are distributed in time.

From last class, we might try value_counts:

carbon_df['year'].value_counts()
year
2021    238
2022    238
2018    237
2019    236
2020    235
       ... 
1858      3
1857      3
1856      3
1854      3
1863      3
Name: count, Length: 169, dtype: int64
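Part of what makes this output hard to read is that value_counts sorts by count, not by year. One option, sketched here on a made-up Series rather than the class data, is to follow it with sort_index so the rows come out in time order.

```python
import pandas as pd

# made-up years, only for illustration
toy_years = pd.Series([2021, 2020, 2021, 2019, 2021, 2020], name="year")

# default: sorted by how often each year appears, most frequent first
counts_by_frequency = toy_years.value_counts()

# sorted by the index (the year itself), so it reads in time order
counts_by_year = toy_years.value_counts().sort_index()

print(counts_by_frequency)
print(counts_by_year)
```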

But it’s a little hard to read. A histogram might be better:

carbon_df['year'].plot(kind='hist')
<Axes: ylabel='Frequency'>
[Figure: histogram of the year column]

Here we see that there are a lot more samples in more recent years.
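How a histogram looks depends on how many bins it uses. As a sketch on made-up years (not the class data), the bins parameter controls this, and the bar heights always add up to the number of samples.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up years, only for illustration
toy_years = pd.Series(
    [1990, 1995, 2000, 2005, 2010, 2015, 2018, 2019, 2020, 2021], name="year"
)

# create the figure and axes explicitly, then let pandas draw on that axes
fig, ax = plt.subplots()
toy_years.plot(kind="hist", bins=5, ax=ax)
ax.set_xlabel("year")

# each bar is a rectangle whose height is the count in that bin
bar_heights = [patch.get_height() for patch in ax.patches]
print(bar_heights)
```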

5.5. Anatomy of a plot#

[Figure: annotated graph]

The above figure comes from matplotlib’s Anatomy of a Figure page, which includes the code to generate that figure.

5.6. Figure and axis level plots#

[Figure: summary of plot types]

5.7. Changing colors#

sns.set_palette('colorblind')
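To look at what a palette contains before setting it, a minimal sketch: color_palette returns the list of colors, and set_palette makes that list the default for the plots that follow.

```python
import seaborn as sns

# look up the palette without changing any settings
palette = sns.color_palette("colorblind")

print(len(palette))          # how many colors the palette has
print(palette.as_hex()[:3])  # the first few colors as hex codes

# make it the default for all following plots
sns.set_palette("colorblind")
```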

5.8. Emissions by type#

# default catplot kind: a strip plot with one point per sample
sns.catplot(data=carbon_df, x='commodity_simple', y='total_emissions_MtCO2e')
<seaborn.axisgrid.FacetGrid at 0x7fc6d46e3a30>
[Figure: strip plot of total_emissions_MtCO2e by commodity_simple]

# bar plot: one bar per commodity
sns.catplot(data=carbon_df, x='commodity_simple', y='total_emissions_MtCO2e', kind='bar')
<seaborn.axisgrid.FacetGrid at 0x7fc7185a3cd0>
[Figure: bar plot of total_emissions_MtCO2e by commodity_simple]

# one column of plots per parent type
sns.catplot(data=carbon_df, x='commodity_simple', y='total_emissions_MtCO2e', kind='bar',
            col='parent_type')
<seaborn.axisgrid.FacetGrid at 0x7fc6d44d86d0>
[Figure: bar plots of total_emissions_MtCO2e by commodity_simple, one panel per parent_type]

# color the bars by commodity as well
sns.catplot(data=carbon_df, x='commodity_simple', y='total_emissions_MtCO2e', kind='bar',
            col='parent_type', hue='commodity_simple')
<seaborn.axisgrid.FacetGrid at 0x7fc6d2096490>
[Figure: bar plots of total_emissions_MtCO2e by commodity_simple, one panel per parent_type, colored by commodity_simple]
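When reading the bar plots, it helps to know that with kind='bar' seaborn by default draws the mean of y for each category, with error bars around it. A sketch on made-up numbers (not the class data) computes the same means with groupby so they can be checked against the bar heights.

```python
import pandas as pd
import seaborn as sns

# made-up emissions values, only for illustration
toy_df = pd.DataFrame({
    "commodity_simple": ["Coal", "Coal", "Oil & NGL", "Oil & NGL",
                         "Natural Gas", "Natural Gas"],
    "total_emissions_MtCO2e": [10.0, 30.0, 5.0, 15.0, 2.0, 4.0],
})

# the mean per category, which is what each bar's height represents by default
group_means = toy_df.groupby("commodity_simple")["total_emissions_MtCO2e"].mean()
print(group_means)

# the matching bar plot
grid = sns.catplot(data=toy_df, x="commodity_simple",
                   y="total_emissions_MtCO2e", kind="bar")
```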

Example okay questions:

  • which parent type has the most consistent emissions across commodity type?

  • which parent type has the highest emissions?

Example good questions:

  • which type of emissions should be targeted for interventions (the highest)?

5.9. Emissions over time?#

sns.relplot(data=carbon_df, x='year', y='total_emissions_MtCO2e',
            hue='parent_entity', row='parent_type')
<seaborn.axisgrid.FacetGrid at 0x7fc6d2106ee0>
[Figure: scatter plots of total_emissions_MtCO2e by year, one row per parent_type, colored by parent_entity]

5.10. Variable types and data types#

Related but not the same.

Data types are literal, related to the representation in the computer.

For example, there can be int16, int32, and int64.

We can also have mathematical types of numbers:

  • Integers can be positive, 0, or negative.

  • Reals are continuous, infinite possibilities.

Variable types are about the meaning in a conceptual sense.

  • categorical (can take a discrete number of values, could be used to group data, could be a string or integer; unordered)

  • continuous (can take on any possible value, always a number)

  • binary (like data type boolean, but could be represented as yes/no, true/false, or 1/0, could be categorical also, but often makes sense to calculate rates)

  • ordinal (ordered, but otherwise categorical)

We’ll focus on the first two most of the time. Some values that are technically only integers range high enough that we treat them more like continuous most of the time.

carbon_df.columns
Index(['year', 'parent_entity', 'parent_type', 'commodity', 'production_value',
       'production_unit', 'total_emissions_MtCO2e', 'commodity_simple'],
      dtype='object')
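A sketch of how the two ideas show up in pandas, on a made-up DataFrame. The size_label column is hypothetical and only there to show an ordinal variable. The dtypes attribute reports the data type, while deciding the variable type is up to us; counting unique values with nunique can help.

```python
import pandas as pd

# made-up values; size_label is a hypothetical column, only for illustration
toy_df = pd.DataFrame({
    "year": [1962, 1963, 1964, 1964],
    "commodity": ["Oil & NGL", "Natural Gas", "Oil & NGL", "Natural Gas"],
    "total_emissions_MtCO2e": [0.36, 0.13, 2.91, 0.32],
    "size_label": ["small", "large", "medium", "small"],
})

# data types: how each column is represented in the computer
print(toy_df.dtypes)

# few unique values is a hint that a column might be categorical
print(toy_df.nunique())

# we can tell pandas that a column is categorical
toy_df["commodity"] = toy_df["commodity"].astype("category")

# or ordinal: categorical with a meaningful order
size_order = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
toy_df["size_label"] = toy_df["size_label"].astype(size_order)
print(toy_df["size_label"].max())
```

Note that year is stored as an integer, but whether we treat it as continuous or categorical depends on the question we are asking.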

5.11. Questions After Class#

Class Response Summary:

5.11.1. To what degree should we be familiarizing ourselves with these different kinds of graphs?#

There is also a full-semester data visualization class, so we will not cover everything. It is useful to know a few basic ones, and we will look to see that you can create and correctly interpret at least 3 different kinds.

5.11.2. Should I upload all parts of A2 today if I plan to go to office hours tomorrow? Or just the finished parts?#

All of it with your questions written in the file(s).

5.11.3. Are there ways to overlap the different parent types into the same graph?#

This is called a stacked bar graph. There are examples in the seaborn tutorials for displot, with the important caveat that stacking can make some things hard to see. You can also stack with the low level features.
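One way to sketch a stacked bar graph, assuming we want summed emissions, is to pivot the data so each parent type is a column and then use the pandas plot method with stacked=True. The numbers below are made up, and this is just one possible approach, not the one from the seaborn tutorials.

```python
import matplotlib.pyplot as plt
import pandas as pd

# made-up values, only for illustration
toy_df = pd.DataFrame({
    "commodity_simple": ["Coal", "Coal", "Coal",
                         "Oil & NGL", "Oil & NGL", "Oil & NGL"],
    "parent_type": ["State-owned Entity", "State-owned Entity",
                    "Investor-owned Company", "State-owned Entity",
                    "Investor-owned Company", "Investor-owned Company"],
    "total_emissions_MtCO2e": [10.0, 5.0, 20.0, 7.0, 3.0, 4.0],
})

# one row per commodity, one column per parent type, summed emissions in the cells
totals = toy_df.pivot_table(index="commodity_simple", columns="parent_type",
                            values="total_emissions_MtCO2e", aggfunc="sum",
                            fill_value=0)
print(totals)

# stacked=True puts the parent types on top of each other within each bar
fig, ax = plt.subplots()
totals.plot(kind="bar", stacked=True, ax=ax)
ax.set_ylabel("total_emissions_MtCO2e")
```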

5.11.4. Is the ggplot option just the same method names as the version in R? Or is the syntax updated to be similar also?#

I think it's mostly matching method names, attribute names, and conceptual ideas. Python libraries all have to use Python syntax.

5.11.5. Is the peer review just for assignment 3 or will we have the option to do it for future assignments?#

Probably only 3, but possibly a couple more.

5.11.6. What is the typical range of sizes for a good dataset for this assignment?#

From a hundred to maybe 2000. You do not need more than that, and too many can make it slow.

5.11.7. If we don’t get any achievements on an assignment, are we able to revise it to get an achievement?#

If you are very close, yes. If you are not very close, you will get advice that we recommend you apply on future assignments.

5.11.8. What is a numpy array?#

It is the data type of one of the attributes of a DataFrame (values). See the glossary entry for numpy array and the intro to DataFrames.
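As a quick check, here is a sketch on a made-up DataFrame showing that the values attribute (and the to_numpy method) give back a numpy array.

```python
import numpy as np
import pandas as pd

# made-up toy data, only for illustration
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# the values attribute holds the data without the row and column labels
array = toy_df.to_numpy()

print(type(toy_df.values))
print(array.shape)
```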