1. Welcome & What is Data Science#

1.1. Prismia Chat#

We will use Prismia Chat to monitor your participation in class and to gather information. Features:

  • instructor only

  • reply to you directly

  • share responses for all

1.2. How this class will work#

Participatory Live Coding

What is a topic you want to use data to learn about?

Debugging is both technical and a soft skill

1.3. Programming for Data Science vs other Programming#

The audience is different, so the form is different.

In Data Science our product is more often a report than a program.

Sometimes there will be points in the notes that were not made in class due to time, or that respond to questions that came at the end of class.

Also, in data science we are using code to interact with data, instead of having a plan in advance.

So programming for data science is more like writing: it has a narrative flow and is made to be seen, more than some other programming that you may have done.

1.4. Jupyter Lab and Jupyter notebooks#

Launch a jupyter lab server:

  • on Windows, use anaconda terminal

  • on Mac/Linux, use terminal

  • cd path/to/where/you/save/notes

  • enter jupyter lab

1.4.1. What just happened?#

  • launched a local web server

  • opened a new browser tab pointed to it

a diagram depicting a terminal window launching a local web server that reports back to the terminal and serves jupyter in the browser

1.4.2. A jupyter notebook tour#

A Jupyter notebook has two modes. When you first open it, it is in command mode. The current mode is shown in the bottom right of the screen. Each box is a cell; the highlighted cell is gray when in command mode.

When you press a key in command mode, it works like a shortcut. For example, p shows the command search menu.

If you press enter (or return) or click on the highlighted cell (the box we can type in), it changes to edit mode.

There are two types of cells that we will use: code and markdown. You can change the type in command mode with y for code and m for markdown, or with the cell type menu at the top of the notebook.

This is a markdown cell

  • we can make

  • itemized lists of

  • bullet points

  1. and we can make numbered

  2. lists, and not have to worry

  3. about renumbering them

  4. if we add a step in the middle later

# this is a comment in a code cell
3+9
12

the output here is the value returned by the Python interpreter for the last line of the cell

We can set variables

name = 'sarah'

The notebook displays nothing when we do an assignment, because it returns nothing

we can put a variable there to see it

name
'sarah'
name
course = 'csc310'

it only does that for the last line, so this one displays nothing

Important

In class, we ran these cells out of order and noticed how the value does not update unless we run the new version

name*3
'sarahsarahsarah'
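The out-of-order behavior noted above can be mimicked in a plain script. This is only a sketch: each commented step stands in for a notebook cell, and the variable names `before_rerun` and `after_rerun` and the second name `'surbhi'` are chosen here for illustration.

```python
# Each step stands in for one notebook cell.
name = 'sarah'            # cell 1: set the variable
repeated = name * 3       # cell 2: use the variable

name = 'surbhi'           # cell 1 is edited and run again

# cell 2 has not been rerun, so its result still reflects the old value
before_rerun = repeated
print(before_rerun)

# rerunning cell 2 picks up the new value
repeated = name * 3
after_rerun = repeated
print(after_rerun)
```

The kernel only remembers what was actually run, in the order it was run, which is why restarting the kernel and running all cells from the top is a useful check.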

Common command mode actions:

  • m: switch cell to markdown

  • y: switch cell to code

  • a: add a cell above

  • b: add a cell below

  • c: copy cell

  • v: paste the cell

  • 0 + 0: restart kernel

  • p: command menu

use enter/return to get to edit mode

1.5. Getting Help in Jupyter#

Getting help is important in programming

When your cursor is inside the () of a function, hold the shift key and press tab to open a popup with information. If you press tab twice, the popup gets bigger, and pressing it three times opens a separate popup window.

Python has a print function and we can use the help in jupyter to learn about how to use it in different ways.

print(name,course)
sarah csc310

The first line says that it can take multiple values, because it says *args, sep. The * means multiple.

It also has a keyword argument (must be used like argument=value and has a default) described as sep=' '. This means that by default it adds a space as above.

The help also tells us about other parameters, like the sep one

print(name,course,sep="_")
sarah_csc310

We can print the docstring out as a whole, instead of using shift + tab to view it.

help(print)
Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.

Concatenating the strings with + produces the same output as the sep example above

print(name + '_'+ course)
sarah_csc310

but if we want to join more values, using the sep parameter becomes more helpful.

print(name,course,'hello',"bye", sep='_')
sarah_csc310_hello_bye
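The docstring above also lists end and file, which the examples so far have not used. As a minimal sketch, the block below sends the output to an in-memory `io.StringIO` buffer (an illustrative choice, not something from class) so that the exact characters print produces can be inspected.

```python
import io

name = 'sarah'
course = 'csc310'

# file= redirects the output; StringIO is an in-memory text stream
buffer = io.StringIO()
print(name, course, sep='_', end='!\n', file=buffer)

# repr shows the separator and the custom ending explicitly
captured = buffer.getvalue()
print(repr(captured))
```

Because sep, end, and file are keyword arguments with defaults, any of them can be left out and the call still works.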

Basic programming is a prereq and we will go faster soon, but the goal of this review was to understand notebooks, getting help, and reading docstrings

1.6. What is Data Science?#

Data Science is the combination of computer science, statistics, and domain expertise.

Venn diagram of CS, Stats, & domain expertise with DS at the center

statistics is the type of math we use to make sense of data. Formally, a statistic is just a function of data.

computer science is so that we can manipulate, visualize, and automate the inferences we make.

domain expertise helps us have the intuition to know if what we did worked right. Statistics must be interpreted in context; the relevant context determines what they mean and which are valid. The context will say whether automating something is safe or not, and it can help us tell whether our code actually worked right or not.
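The statement above that a statistic is just a function of data can be made concrete with a small sketch. The function names and the numbers below are made up for illustration; they are not from class.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


# made-up numbers, used only to illustrate the idea
commute_minutes = [10, 20, 30, 60]

# each statistic takes the data in and gives a single number back
print(mean(commute_minutes))
print(value_range(commute_minutes))
```

Whether a mean or a range is the right summary for commute times is a question the code cannot answer; that is where the domain context comes in.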

1.6.1. In this class,#

venn diagram of CS, Stats, & domain expertise with DS at the center, w/310 location marked

We’ll focus on the programming as our main means of studying data science, but we will use bits of the other parts. In particular, you’re encouraged to choose datasets that you have domain expertise about, or that you want to learn about.

But there are many definitions. We’ll use this one, but you may come across others.

1.6.2. How does data science happen?#

The most common way to think about what doing data science means is to think of this pipeline. It is from the perspective of the data: these are all of the things that happen to the data.

DS pipeline: collect, clean, explore, model, deploy

Another way to think about it

DS process: 3 major phases (prepare,build,finish) with many sub-phases. Prepare:set goals, explore, wrangle, assess; Build: Plan, analyze, engineer, optimize, execute; Finish: Deliver, revise, wrap up

1.6.3. how we’ll cover Data Science, in depth#

DS pipeline: collect, clean, explore, model, deploy

  • collect: Discuss only a little; Minimal programming involved

  • clean: Cover the main programming techniques; Some requires domain knowledge beyond scope of course

  • explore: Cover the main programming techniques; Some requires domain knowledge beyond scope of course

  • model: Cover the main programming, basic idea of models; How to use models, not how learning algorithms work

  • deploy: A little bit at the end, but a lot of preparation for decision making around deployment

1.6.4. how we’ll cover it in, time#

DS pipeline: collect, clean, explore, model, deploy

We’ll cover exploratory data analysis before cleaning because those tools will help us check how we’ve cleaned the data.

1.7. Python Review#

Official source on python:

We will go quickly through these focusing on pythonic style, because the prerequisite is a programming course.

1.8. Functions#

def greeting(name):
    '''
    say hi to a person
    
    Parameters
    ----------
    name : string
        the name of the person to greet
    '''
    return 'hi ' + name

A few things to note:

  • the def keyword starts a function

  • then the name of the function

  • parameters in () then :

  • the body is indented

  • the first thing in the body should be a docstring, enclosed in ''', which is a multiline string that documents the function

  • returning is more reliable than printing in a function

In python, PEP 257 says how to write a docstring, but it is very broad.

In Data Science, numpydoc style docstrings are popular.
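As a sketch of where the docstring ends up, the function from above is repeated here with an added Returns section. That section is an assumption about how one might extend the docstring in numpydoc style, not something from class. Python stores the docstring on the function's `__doc__` attribute, which is the text that help displays.

```python
def greeting(name):
    '''
    say hi to a person

    Parameters
    ----------
    name : string
        the name of the person to greet

    Returns
    -------
    message : string
        the greeting, 'hi ' followed by the name
    '''
    return 'hi ' + name


# the docstring is stored on the function object itself
print(greeting.__doc__)
```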

Once the cell with the function definition is run, we can use the function

greeting(name)
'hi sarah'
print(greeting('surbhi'))
hi surbhi
assert greeting('sarah') == 'hi sarah'

With a return, this works to check that the function does the right thing.

When the assert condition is true, nothing is displayed; when it is false, it raises an AssertionError.
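To see both outcomes side by side, here is a minimal sketch. `greeting_print` is a hypothetical variant written only for this example; it prints instead of returning, which is why the second assert fails. The try/except is there so the block runs to the end instead of stopping at the error.

```python
def greeting(name):
    return 'hi ' + name


def greeting_print(name):
    # hypothetical variant that prints instead of returning
    print('hi ' + name)


# passes silently because the returned value matches
assert greeting('sarah') == 'hi sarah'

# the printing variant returns None, so there is nothing to compare
printed_result = greeting_print('sarah')

try:
    assert printed_result == 'hi sarah', 'greeting_print returned None'
except AssertionError as error:
    failure_message = str(error)
    print('assertion failed:', failure_message)
```

This is one reason returning is more reliable than printing: a returned value can be checked, stored, or passed to another function.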

1.9. Conditionals#

def greeting2(name,formal=False):
    '''
    say hi to a person
    
    Parameters
    ----------
    name : string
        the name of the person to greet
    formal : bool
        if the greeting should be formal (hello) or not (hi)
    '''
    if formal: 
        message = 'hello  ' + name
    else:
        message = 'hi ' + name
    return message

key points in this function:

  • an if also has the conditional part indented

  • for a bool variable we can just use the variable

  • we can set a default value

because of the default value we do not have to pass the second argument:

greeting2(name)
'hi sarah'
greeting2(name,True)
'hello  sarah'
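The second argument can also be passed by name, the same way sep was passed to print earlier. The sketch below repeats the body of greeting2 so that it runs on its own; the variable names `by_keyword` and `by_position` are chosen here for illustration.

```python
def greeting2(name, formal=False):
    # same logic as the function above, repeated so this block runs on its own
    if formal:
        message = 'hello  ' + name
    else:
        message = 'hi ' + name
    return message


name = 'sarah'

# passing the second argument by keyword makes the call easier to read
by_keyword = greeting2(name, formal=True)
by_position = greeting2(name, True)
print(by_keyword == by_position)

# stating the default explicitly is the same as leaving it out
print(greeting2(name, formal=False) == greeting2(name))
```

A bare True in a call gives the reader no hint about what it controls, so the keyword form is often easier to follow.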

1.10. Hints#

Reading chapter 1 of Think Like a Data Scientist will help you with the data science definition part of the assignment.

Think Like a Data Scientist is written for practitioners, not as a textbook for a class. It does not require a lot of prerequisite background, but the sections of it that I assign will help you build a better mental picture of what doing Data Science is about.

Only the first assignment will be due this fast; it’s a short review and setup assignment. It’s due quickly so that we know that you have everything set up and have the prerequisite material before we start new material next week.