Getting Started w/ Python and Jupyter
Congratulations! If you’ve made it this far in the setup, you’ve already come a long way on this first day of class.
This is a Jupyter Notebook running inside JupyterLab. In this environment, you can run Python code alongside Markdown text, which means you can write code and also write prose explaining that code. This paradigm is called Literate Programming, and it’s one of the most common ways of coding for data analysis.
You’re now ready to run your first bit of Python code. We’ll start with the print() function, which allows us to output any bit of text. To run the cell below, click anywhere on the code, and then hit SHIFT-ENTER.
print("Hello world!")
Hello world!
Now that you can run code, you can also write some of your own. Modify the cell below to print whatever text you’d like, then run it.
print("Your text here")
Your text here
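As a small extension of the cell above, print() also accepts variables and more than one value at a time. The sketch below shows both; the names course and week, and their values, are made up for illustration.

```python
# Hypothetical values, chosen only for illustration
course = "Data Analysis"
week = 1

# print() joins several arguments with a single space by default
print("Welcome to", course)

# An f-string inserts variable values directly into the text
message = f"This is week {week} of {course}."
print(message)
```

Try changing the two values and re-running the cell to see the output change.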
In a nutshell, that’s what we’ll be doing all semester: write a bit of code, run it in a Jupyter Notebook cell, and write text to explain what you did.
Basics of Working with Data
We’ll have lots of time together next week to learn the ropes with Python and our data analysis workflow. For today, follow along with the cells below to see how the basics work in action.
The first thing we’ll need to do is import some libraries. For data mining, the essential libraries are Pandas (for organizing and analyzing data), NumPy (for fast numerical computation), and Altair (for creating nice graphs and plots). All of these are pre-installed with Anaconda. To import them, you use Python’s import statement:
import numpy as np
import pandas as pd
import altair as alt
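If one of these imports fails with a ModuleNotFoundError, the library is probably not installed in the environment your notebook is using. As a rough check, the sketch below asks Python whether each library can be found without actually importing it. The helper name check_libraries is hypothetical, not part of any library.

```python
import importlib.util


def check_libraries(names):
    """Return a dict mapping each library name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The three libraries imported above
status = check_libraries(["numpy", "pandas", "altair"])

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Anything reported as missing would need to be installed before the import cell above will run.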
Now that we have some libraries, we need some data to begin working. Most of the time in this class, you’ll be working with data in separate CSV files, but for today we’ll use a JSON file.
penguins = pd.read_json('https://raw.githubusercontent.com/vega/vega-datasets/main/data/penguins.json')
In the command above, we loaded the Palmer Penguins sample dataset, a classic bit of sample data used in machine learning. We saved it in a variable called “penguins.” To take a peek at the data, we can simply type penguins into an empty cell:
penguins
 | Species | Island | Beak Length (mm) | Beak Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | None |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | None |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE |
344 rows × 7 columns
What you see above is a DataFrame: a data type in Pandas that stores and organizes data in rows and columns. Think of it like a spreadsheet that you can work with in Python. As you can see above, we have 344 rows (one per penguin), and we know the lengths of those penguins’ bills (labeled “Beak” in the column names) and flippers, their species, their weight, and so on. A few rows, such as row 3, have missing measurements, which Pandas displays as NaN. That’s a lot to work with!
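To get a feel for a DataFrame without printing the whole thing, you can ask it a few questions about itself. Since most of the data in this class will arrive as CSV, the sketch below reads the first five rows of the table above from a small in-memory CSV string (a stand-in for a real file, not the full dataset) and then inspects the result. The variable names csv_text and sample are made up for this example.

```python
import io

import pandas as pd

# The first five rows shown in the table above, written out as CSV text.
# Empty fields stand for the missing values in row 3.
csv_text = """\
Species,Island,Beak Length (mm),Beak Depth (mm),Flipper Length (mm),Body Mass (g),Sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
Adelie,Torgersen,,,,,
Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns): (5, 7)
print(list(sample.columns))  # the column names
print(sample.isna().sum())   # number of missing values in each column
```

The same three checks work on the full penguins DataFrame, where the shape matches the "344 rows × 7 columns" summary printed under the table.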
We can use this data in a lot of different ways (and we will!). For starters, you can run basic summary statistics easily in Pandas. Let’s find the average flipper length for all of our penguins:
penguins['Flipper Length (mm)'].mean()
200.91520467836258
On average, the flippers of our penguins are about 201 mm long!
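One detail worth knowing: some rows in the table have NaN in the flipper column, and Pandas leaves those out of the average by default. The sketch below illustrates this with just the five flipper lengths from the first rows of the table, so its result differs from the full-dataset average above.

```python
import pandas as pd

# Flipper lengths from rows 0-4 of the table above; row 3 is missing
flippers = pd.Series([181.0, 186.0, 195.0, None, 193.0], name="Flipper Length (mm)")

print(flippers.count())             # 4 (missing values are not counted)
print(flippers.mean())              # 188.75 (average of the four known values)
print(flippers.mean(skipna=False))  # nan (a missing value spoils the result)
```

Skipping missing values is usually what you want for a quick summary, but it is good practice to check how many values were actually used.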
In addition to calculations, with the Altair library we can easily make nice-looking visualizations. For example, let’s compare bill length and flipper length among all our penguins. Are they correlated?
alt.Chart(penguins, title="Comparing Beak Length and Flipper Length").mark_point().encode(
    x=alt.X('Beak Length (mm):Q'),
    y=alt.Y('Flipper Length (mm):Q'),
).interactive()
Because we created an interactive plot, you can scroll to zoom in on the area where the points are, just like a map. In a couple of weeks, you’ll learn how to make these plots yourself.
And the two variables do seem to be correlated! As bill length goes up, flipper length also tends to go up. This makes sense: bigger penguins would tend to have bigger bills and flippers.
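A plot suggests a correlation, and Pandas can also put a number on it. The sketch below computes the Pearson correlation coefficient using only the eight complete rows visible in the table preview earlier, so the value is illustrative and will not match the one for the full dataset.

```python
import pandas as pd

# The eight complete rows visible in the table preview (not the full dataset)
sample = pd.DataFrame({
    "Beak Length (mm)": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2, 49.9],
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 215.0, 222.0, 212.0, 213.0],
})

# Pearson correlation ranges from -1 to 1.
# Values near 1 mean the two measurements tend to rise together.
r = sample["Beak Length (mm)"].corr(sample["Flipper Length (mm)"])
print(round(r, 2))
```

Running the same .corr() call on the two columns of the full penguins DataFrame gives the figure for all the penguins.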
But wait, the points on the scatter plot above seem to be clustering into three different groups. What could be going on here? To find out, we can add a color variable to this graph. Let’s look at the different species of penguins that we have:
alt.Chart(penguins, title="Comparing Beak Length and Flipper Length").mark_point().encode(
    x=alt.X('Beak Length (mm):Q'),
    y=alt.Y('Flipper Length (mm):Q'),
    color=alt.Color('Species:N'),
).interactive()
Aha! This makes sense, too. Not only are bill length and flipper length correlated, but there are also groups within those values that correspond to the different species of penguins. We’d definitely expect species to have distinct size ranges.
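The species groups can be checked with numbers as well as colors. The sketch below uses groupby to average the measurements within each species. It uses only the eight complete rows visible in the table preview earlier, which happen to cover just two of the species, so treat the output as an illustration of the technique rather than a result.

```python
import pandas as pd

# The eight complete rows visible in the table preview (not the full dataset)
sample = pd.DataFrame({
    "Species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "Beak Length (mm)": [39.1, 39.5, 40.3, 36.7, 46.8, 50.4, 45.2, 49.9],
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 215.0, 222.0, 212.0, 213.0],
})

# Split the rows by species, then average each measurement within each group
species_means = sample.groupby("Species")[
    ["Beak Length (mm)", "Flipper Length (mm)"]
].mean()
print(species_means)
```

In this small sample the Gentoo rows average longer beaks and flippers than the Adelie rows, which lines up with the separate clusters in the colored plot.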
Over the course of this semester, you’ll learn how to do everything we just did above and a lot more. We’ll start with exploratory data analysis and work our way toward advanced machine learning techniques. Along the way we’ll explore data from a wide variety of fields and disciplines. What you’ve done today is a great first step!