Class 10: Cleaning review and Ray Summit Keynotes

  • Say hello in the Zoom chat

  • Join Prismia

  • Sign up so you can watch the Ray Summit talks on pandas and scikit-learn

import pandas as pd
# %load http://drsmb.co/310
data_url = 'https://github.com/rhodyprog4ds/inclass-data/raw/main/ca_dds_summary.xlsx'

Let’s look at the data

pd.read_excel(data_url)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-cb7bb7f4d96e> in <module>
----> 1 pd.read_excel(data_url)

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    297                 )
    298                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299             return func(*args, **kwargs)
    300 
    301         return wrapper

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/io/excel/_base.py in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
    334     if not isinstance(io, ExcelFile):
    335         should_close = True
--> 336         io = ExcelFile(io, storage_options=storage_options, engine=engine)
    337     elif engine and engine != io.engine:
    338         raise ValueError(

/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
   1101             if ext != "xls" and xlrd_version >= "2":
   1102                 raise ValueError(
-> 1103                     f"Your version of xlrd is {xlrd_version}. In xlrd >= 2.0, "
   1104                     f"only the xls format is supported. Install openpyxl instead."
   1105                 )

ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
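
The error message above names the cause: xlrd 2.0 and later reads only .xls files, so an .xlsx file needs a different engine. Here is a minimal sketch of one way around it, assuming openpyxl has been installed (for example with pip install openpyxl). The tiny workbook built in memory is made up so the example runs without a download, and passing engine explicitly is a choice made here, not something the class notebook did.

```python
import io

import pandas as pd

# Hypothetical stand-in data, not the class dataset
toy_df = pd.DataFrame({"ethnicity": ["A", "B"], "count": [3, 5]})

# Write an .xlsx workbook into memory using the openpyxl engine
buffer = io.BytesIO()
toy_df.to_excel(buffer, index=False, engine="openpyxl")
buffer.seek(0)

# Read it back, naming the engine so xlrd is never consulted
toy_read = pd.read_excel(buffer, engine="openpyxl")
print(toy_read)
```

The same engine argument can be added to the pd.read_excel(data_url) call once openpyxl is available in the environment.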

We can read multiple rows in as the header by passing a list of row positions to the header parameter

pd.read_excel(data_url,header=list(range(4)))

Once the read succeeds, let’s save the result to a DataFrame

ca_dds_df = pd.read_excel(data_url,header=list(range(4)))
ca_dds_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-8d811b295c4b> in <module>
----> 1 ca_dds_df.head()

NameError: name 'ca_dds_df' is not defined
ca_dds_df.columns
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-2f51e10b6c51> in <module>
----> 1 ca_dds_df.columns

NameError: name 'ca_dds_df' is not defined
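
When several rows are read in as the header, pandas stores the column labels as a MultiIndex, so each column name is a tuple with one entry per header row. Below is a minimal sketch of what .columns looks like in that case and one way to flatten it. The two-row CSV text and its labels are invented for illustration (read_csv accepts the same header list as read_excel and needs no spreadsheet engine), and joining the levels with an underscore is just one possible choice.

```python
import io

import pandas as pd

# Hypothetical data with two header rows, standing in for the spreadsheet
toy_csv = io.StringIO(
    "expenditures,expenditures,consumers\n"
    "mean,median,count\n"
    "100,90,4\n"
    "200,180,6\n"
)

# A list of row positions tells pandas to build the header from all of them
toy_df = pd.read_csv(toy_csv, header=[0, 1])
print(toy_df.columns.tolist())

# Flatten the tuples into single strings for easier column access
flat_df = toy_df.copy()
flat_df.columns = ["_".join(levels) for levels in toy_df.columns]
print(flat_df.columns.tolist())
```

With four header rows, as in header=list(range(4)), each tuple would have four entries instead of two.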

Ray Summit Notes

Contribute things you learned here

Pandas, by Wes McKinney

  • Pandas was designed to do data science on your laptop

  • It was designed to be tightly coupled to NumPy, which is why it’s not very fast, especially with strings

Scikit Learn

  • Data science for the many not the mighty

  • Machine learning for all