Class 10: Cleaning review and Ray Summit Keynotes
Say hello in the Zoom chat
Join Prismia
Sign up so you can watch the Ray Summit talks about pandas and scikit-learn
import pandas as pd
# %load http://drsmb.co/310
data_url = 'https://github.com/rhodyprog4ds/inclass-data/raw/main/ca_dds_summary.xlsx'
Let’s look at the data
pd.read_excel(data_url)
ValueError: Your version of xlrd is 2.0.1. In xlrd >= 2.0, only the xls format is supported. Install openpyxl instead.
The read fails in this environment because xlrd 2.0 and later only reads .xls files, and this file is .xlsx. Installing openpyxl (for example, pip install openpyxl) and re-running the cell resolves it. The same error applies to every read_excel call on this file until openpyxl is installed.
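The error message above points at the engine pandas uses to read the file. As a minimal sketch, assuming you want to pick the engine explicitly rather than rely on the default, the helper below maps a file extension to an engine name. `pick_excel_engine` and `ENGINE_BY_EXTENSION` are hypothetical names for illustration, not part of pandas; `"xlrd"` and `"openpyxl"` are real values of the `engine` parameter of `pd.read_excel`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup table (not part of pandas): file extension -> engine name
ENGINE_BY_EXTENSION = {
    ".xls": "xlrd",       # legacy format, still handled by xlrd
    ".xlsx": "openpyxl",  # xlrd >= 2.0 no longer reads this format
    ".xlsm": "openpyxl",
}


def pick_excel_engine(path_or_url):
    """Return an engine name to pass to pd.read_excel for this file."""
    # Take the extension from the path part, so URLs work as well as file names
    suffix = PurePosixPath(urlparse(path_or_url).path).suffix.lower()
    if suffix not in ENGINE_BY_EXTENSION:
        raise ValueError(f"no engine known for extension {suffix!r}")
    return ENGINE_BY_EXTENSION[suffix]


data_url = 'https://github.com/rhodyprog4ds/inclass-data/raw/main/ca_dds_summary.xlsx'
engine = pick_excel_engine(data_url)
print(engine)

# With openpyxl installed, the read would then be:
# pd.read_excel(data_url, engine=engine)
```

Passing the engine explicitly is optional once openpyxl is installed, but it makes the dependency visible in the notebook.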
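We can read multiple rows in as the header by passing a list of row indices to the header parameter; the resulting columns form a MultiIndex with one level per header row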
pd.read_excel(data_url, header=list(range(4)))
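To see what a multi-row header produces without downloading the spreadsheet, here is a small sketch on made-up data. It uses `pd.read_csv` on an inline string as a stand-in, since its `header` parameter accepts a list of row indices the same way `pd.read_excel` does. The column names and values are invented for illustration and are not taken from ca_dds_summary.xlsx.

```python
import io

import pandas as pd

# Made-up table with two header rows (the real file uses four)
raw = (
    "spending,spending,consumers\n"
    "mean,median,count\n"
    "100.0,90.0,5\n"
    "200.0,180.0,7\n"
)

# header=[0, 1] stacks the first two rows into a two-level column index
toy_df = pd.read_csv(io.StringIO(raw), header=[0, 1])

print(type(toy_df.columns).__name__)
print(toy_df.columns.nlevels)

# Select a single column with a tuple, one entry per header level
print(toy_df[("spending", "mean")].tolist())
```

Each column is addressed by a tuple of its header values, which is why the columns attribute of a frame read this way shows tuples rather than plain strings.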
Looks good, so let’s save this to a DataFrame
ca_dds_df = pd.read_excel(data_url, header=list(range(4)))
ca_dds_df.head()
NameError: name 'ca_dds_df' is not defined
This error follows from the xlrd failure: ca_dds_df was never assigned because the read_excel call did not complete.
ca_dds_df.columns
NameError: name 'ca_dds_df' is not defined
Ray Summit Notes
Contribute things you learned here
Pandas, by Wes
Pandas was designed to do data science on your laptop
It’s tightly coupled to NumPy by design, which is why it’s not very fast, especially with strings
scikit-learn
Data science for the many not the mighty
Machine learning for all