What is Data Science?

Data Science is the combination of computer science, statistics, and domain expertise, pictured as a Venn diagram of CS, Stats, & domain expertise with DS at the center.

Statistics is the type of math we use to make sense of data. Formally, a statistic is just a function of data.
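To make "a statistic is just a function of data" concrete, here is a minimal sketch. The measurements are made up for illustration, and value_range is a hypothetical helper name, not something from the course materials. The point is only that each statistic takes the data in and returns a summary.

```python
import statistics

# Hypothetical data: five made-up measurements, used only for illustration.
heights_cm = [160.0, 172.5, 168.0, 181.0, 175.5]


def value_range(data):
    """A hand-written statistic: the gap between the largest and smallest value."""
    return max(data) - min(data)


# Each of these is a function applied to the data, so each is a statistic.
mean_height = statistics.mean(heights_cm)
median_height = statistics.median(heights_cm)
height_range = value_range(heights_cm)

print(mean_height, median_height, height_range)
```

Whether any of these numbers is meaningful depends on what the data represent, which is where domain expertise comes in.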

Computer science lets us manipulate data, visualize it, and automate the inferences we make.

Domain expertise gives us the intuition to know whether what we did worked. A statistic must be interpreted in context; the relevant context determines what a statistic means and which statistics are valid. The context also tells us whether automating something is safe, and it can help us tell whether our code actually worked.

In this class

venn diagram of CS, Stats, & domain expertise with DS at the center, w/310 location marked

We’ll focus on the programming as our main means of studying data science, but we will use bits of the other parts. In particular, you’re encouraged to choose datasets that you have domain expertise about, or that you want to learn about.

But there are many definitions. We’ll use this one, but you may come across others.

How does data science happen?

The most common way to think about what doing data science means is as a pipeline. It is written from the perspective of the data: these are all of the things that happen to the data.

DS pipeline: collect, clean, explore, model, deploy

Another way to think about it

DS process: 3 major phases (prepare, build, finish) with many sub-phases. Prepare: set goals, explore, wrangle, assess; Build: plan, analyze, engineer, optimize, execute; Finish: deliver, revise, wrap up

How we’ll cover Data Science, in depth

collect: Discuss only a little; minimal programming involved
clean: Cover the main programming techniques; some of it requires domain knowledge beyond the scope of the course
explore: Cover the main programming techniques; some of it requires domain knowledge beyond the scope of the course
model: Cover the main programming and the basic idea of models; how to use models, not how learning algorithms work
deploy: A little bit at the end, but a lot of preparation for decision making around deployment

How we’ll cover it, in time

We’ll cover exploratory data analysis before cleaning because those tools will help us check how we’ve cleaned the data.

How this class will work

Participatory Live Coding

What is a topic you want to use data to learn about?

Intro to Jupyter Notebooks

Programming for Data Science vs other Programming

The audience is different, so the form is different.

In Data Science our product is more often a report than a program.

Sometimes there will be points in the notes that were not made in class, either due to time or in response to questions that came at the end of class.

Also, in data science we use code to interact with data, rather than working from a plan made in advance.

So programming for data science is more like writing: it has a narrative flow and is made to be read, more so than some other programming that you may have done.

Jupyter Lab and Jupyter notebooks

Launch a jupyter lab server:

on Windows, use anaconda terminal
on Mac/Linux, use terminal
cd path/to/where/you/save/notes
enter jupyter lab

What just happened?

launched a local web server
opened a new browser tab pointed to it

a diagram depicting a terminal window launching a local web server that reports back to the terminal and serves jupyter in the browser

A Jupyter notebook tour

A Jupyter notebook has two modes. When you first open it, it is in command mode. The mode is shown in the bottom right of the screen. Each box is a cell, and the highlighted cell is gray when in command mode.

When you press a key in command mode, it works like a shortcut. For example, p shows the command search menu.

If you press enter (or return) or click on the highlighted cell, it changes to edit mode. Cells are the boxes we can type in.

There are two types of cells that we will use: code and markdown. You can change the type in command mode with y for code and m for markdown, or with the cell type menu at the top of the notebook.

This is a markdown cell

we can make
itemized lists of
bullet points

and we can make numbered
lists, and not have to worry
about renumbering them
if we add a step in the middle later

What is Data Science

venn diagram of CS, Stats, & domain expertise with DS at the center

Jupyter Introduction

hello

bold

can you read this

5+4

9

The output here is the value returned by the Python interpreter for the last line of the cell.

We can set variables

name = 'sarah'

The notebook displays nothing when we do an assignment, because it returns nothing.

We can put a variable on the last line of a cell to see its value.

name

'sarah'
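As a small recap of the display rules, here is a sketch of a single cell that mixes the cases. The variable total is introduced only for this illustration.

```python
name = 'sarah'      # assignment: a notebook displays nothing for this line
5 + 4               # evaluated, but not displayed because it is not the last line
print(name)         # print writes its output wherever it appears in the cell
total = 5 + 4       # another assignment, so still nothing is displayed
total               # last line is an expression: the notebook displays its value
```

Run as one notebook cell, this prints sarah and then displays 9 as the cell's output. Run as a plain Python script, only the print line produces any output.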

Getting Help

# this is a bad idea
# help = 'use the help function'
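The commented-out assignment is a bad idea because it would reuse the name help for a string, hiding the built-in function. Here is a minimal sketch of what goes wrong. The shadowing is kept inside a hypothetical function, shadowed_help_demo, so it does not affect the rest of the session.

```python
def shadowed_help_demo():
    # Inside this function, `help` now names a string, not the built-in function.
    help = 'use the help function'
    try:
        help(print)  # calling a string fails
    except TypeError as error:
        return str(error)


error_message = shadowed_help_demo()
print(error_message)

# The shadowing was local to the function, so the built-in is still usable here.
print(callable(help))
```

If the same assignment were run at the top level of a notebook, help would stay a string for later cells until the variable was removed with del help.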

help(print)

Help on built-in function print in module builtins:

print(*args, sep=' ', end='\n', file=None, flush=False)
    Prints the values to a stream, or to sys.stdout by default.

    sep
      string inserted between values, default a space.
    end
      string appended after the last value, default a newline.
    file
      a file-like object (stream); defaults to the current sys.stdout.
    flush
      whether to forcibly flush the stream.

print.__doc__

'Prints the values to a stream, or to sys.stdout by default.\n\n  sep\n    string inserted between values, default a space.\n  end\n    string appended after the last value, default a newline.\n  file\n    a file-like object (stream); defaults to the current sys.stdout.\n  flush\n    whether to forcibly flush the stream.'
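The text that help displays comes from the function's docstring, which is what __doc__ holds. As a sketch, the same thing works for functions we write ourselves. Here greet is a hypothetical function made up for this example.

```python
def greet(name):
    """Return a short greeting for name."""
    return 'hello ' + name


# The docstring is stored on the function object itself.
doc_text = greet.__doc__
print(doc_text)

# help() formats the signature and that same docstring.
help(greet)
```

Writing a docstring for your own functions means help and shift + tab can describe them too.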

We can also get help with shift + tab inside of any ().

print(name)

sarah

major='EE'

Modify the following code so that it prints out the two things on separate lines

print(name,major)

print(name,major)

sarah EE

print(name,major,sep='\n')

sarah
EE
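The other keyword parameters listed in the help output work the same way as sep. Here is a short sketch that uses sep, end, and file together. Writing into an io.StringIO buffer is an assumption made for this example, so that the exact characters print produces can be inspected.

```python
import io

name = 'sarah'
major = 'EE'

# A file-like object that collects whatever print writes to it.
buffer = io.StringIO()

# sep goes between the values; end is appended after the last one.
print(name, major, sep='\n', file=buffer)
print(name, major, sep=', ', end='!\n', file=buffer)

captured = buffer.getvalue()
print(repr(captured))
```

Looking at the repr of the captured text makes the newline characters visible, which helps when checking what sep and end actually inserted.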