Class 10: Cleaning review and Ray Summit Keynotes

Say hello in the Zoom chat
Join Prismia
Sign up so you can watch the Ray Summit talks on pandas and scikit-learn

import pandas as pd

# %load http://drsmb.co/310
data_url = 'https://github.com/rhodyprog4ds/inclass-data/raw/main/ca_dds_summary.xlsx'

Let’s look at the data

pd.read_excel(data_url)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-cb7bb7f4d96e> in <module>
----> 1 pd.read_excel(data_url)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    334     if not isinstance(io, ExcelFile):
    335         should_close = True
--> 336         io = ExcelFile(io, storage_options=storage_options, engine=engine)
    337     elif engine and engine != io.engine:
    338         raise ValueError(

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1101             if ext != "xls" and xlrd_version >= "2":
   1102                 raise ValueError(
-> 1103                     f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
   1104                     f"only the xls format is supported. Install openpyxl instead."
   1105                 )

ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
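
The error above comes from the environment rather than from the data: xlrd 2.0 and later read only legacy .xls files, so an .xlsx workbook needs openpyxl installed (for example with pip install openpyxl). A minimal sketch of one way to make the engine choice explicit follows. The names pick_excel_engine and read_excel_with_engine are hypothetical helpers, not part of pandas, and the sketch assumes openpyxl is available.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

data_url = 'https://github.com/rhodyprog4ds/inclass-data/raw/main/ca_dds_summary.xlsx'


def pick_excel_engine(url_or_path):
    """Hypothetical helper: choose a pandas Excel engine from the file extension."""
    suffix = PurePosixPath(urlparse(url_or_path).path).suffix.lower()
    # xlrd >= 2.0 reads only legacy .xls files; openpyxl handles .xlsx
    return 'xlrd' if suffix == '.xls' else 'openpyxl'


def read_excel_with_engine(url_or_path, **kwargs):
    """Hypothetical wrapper that passes the engine to pd.read_excel explicitly."""
    import pandas as pd  # imported here so pick_excel_engine works without pandas
    return pd.read_excel(url_or_path, engine=pick_excel_engine(url_or_path), **kwargs)


print(pick_excel_engine(data_url))  # openpyxl
```

With openpyxl installed, calling read_excel_with_engine(data_url) should load the sheet instead of raising the ValueError; recent pandas versions also select openpyxl for .xlsx files on their own once it is installed.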

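We can read multiple rows in as the header by passing a list of row numbers to header; pandas then builds multi-level column labels from those rows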

pd.read_excel(data_url,header=list(range(4)))

ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
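
To see what a list-valued header does without depending on the Excel file, here is a small sketch on made-up CSV text (read_csv accepts the same header argument as read_excel). The toy column names and values are invented for illustration and are not taken from the class dataset, which uses four header rows rather than two.

```python
import io

import pandas as pd

# Two header rows followed by two data rows (made-up values)
csv_text = (
    "spending,spending,count\n"
    "mean,median,total\n"
    "1,2,3\n"
    "4,5,6\n"
)

# header=[0, 1] tells pandas to combine rows 0 and 1 into multi-level column labels
toy_df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

print(toy_df.columns.nlevels)  # number of header levels
print(toy_df.columns)          # each column label is a tuple, one entry per level
```

For the class data, header=list(range(4)) does the same thing with rows 0 through 3.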

Looks good, let’s save this to a DataFrame

ca_dds_df = pd.read_excel(data_url,header=list(range(4)))

ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.

ca_dds_df.head()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-8d811b295c4b> in <module>
----> 1 ca_dds_df.head()

NameError: name 'ca_dds_df' is not defined

ca_dds_df.columns

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-2f51e10b6c51> in <module>
----> 1 ca_dds_df.columns

NameError: name 'ca_dds_df' is not defined

Ray Summit Notes

Contribute things you learned here

Pandas, by Wes McKinney

Pandas was designed for doing data science on your laptop
It is tightly coupled to NumPy, which is one reason it is not very fast, especially with strings
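
One concrete way to see the NumPy coupling behind this note: numeric columns are stored in typed NumPy arrays, while strings have traditionally been stored as generic Python objects. The sketch below sets dtype=object explicitly, since newer pandas versions may choose a dedicated string type by default; the example values are made up.

```python
import pandas as pd

# Numeric data is backed by a typed NumPy array
numbers = pd.Series([1.5, 2.5, 3.5])

# Strings stored the traditional way: an array of references to Python str objects
words = pd.Series(["ray", "pandas", "sklearn"], dtype=object)

print(numbers.to_numpy().dtype)  # float64
print(words.to_numpy().dtype)    # object

# String methods run on each Python object in turn rather than on a packed numeric buffer
print(words.str.upper().tolist())  # ['RAY', 'PANDAS', 'SKLEARN']
```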

Scikit Learn

Data science for the many not the mighty
Machine learning for all
