Pandas Primer¶

Pandas is a popular and useful data analysis library. The name comes from "panel data," with a nod to "Python data analysis."

Pandas is the go-to package for working with panel data, CSVs, data stored in text files, Excel sheets, and even SQL databases. If you're coming from Excel, you'll find the Pandas approach fairly intuitive.

In this primer, we'll cover the basics.
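To give a feel for loading data, here's a minimal sketch of `pd.read_csv()`. To keep it self-contained (no file on disk), it reads a CSV held in a string via `io.StringIO`; in practice you'd pass a file path instead.

```python
import io
import pandas as pd

# A CSV in a string stands in for a file on disk;
# pd.read_csv accepts a path or any file-like object.
csv_text = "Person,Age\nJohn,24\nMyla,26\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```

The same pattern works for `pd.read_excel()` and `pd.read_sql()` with the appropriate source.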

Imports:

In [30]:
import pandas as pd
import numpy as np
import os

You'll usually see Pandas abbreviated as pd and NumPy as np.

In Pandas, data is stored as either a Series (one-dimensional) or a DataFrame (two-dimensional).
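A quick sketch of the relationship between the two (standalone example data): a Series is a single labeled column, and each column of a dataframe is itself a Series.

```python
import pandas as pd

# A Series is one-dimensional: a column of values with an index
ages = pd.Series([24, 21, 33], name="Age")
print(ages.max())  # 33

# A dataframe column is itself a Series
df = pd.DataFrame({"Age": [24, 21, 33]})
print(type(df["Age"]))  # <class 'pandas.core.series.Series'>
```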

You'll usually import data into a dataframe, but to begin we'll create this basic dataframe from scratch using pd.DataFrame().

In [4]:
df = pd.DataFrame({"Person":["John", "Myla", "Lewis", "John", "Myla"],
                   "Age": [24., np.nan, 21., 33, 26],
                   "Single": [False, True, True, True, False]})
In [5]:
df
Out[5]:
Person Age Single
0 John 24.0 False
1 Myla NaN True
2 Lewis 21.0 True
3 John 33.0 True
4 Myla 26.0 False

View just the column headers:

In [6]:
df.columns
Out[6]:
Index(['Person', 'Age', 'Single'], dtype='object')

Select individual columns:

In [7]:
df['Age']
Out[7]:
0    24.0
1     NaN
2    21.0
3    33.0
4    26.0
Name: Age, dtype: float64

Select multiple columns by passing a list of column names (note the double brackets):

In [8]:
df[['Person','Age']]
Out[8]:
Person Age
0 John 24.0
1 Myla NaN
2 Lewis 21.0
3 John 33.0
4 Myla 26.0

Select rows by index position using iloc:

In [9]:
df.iloc[1:3]
Out[9]:
Person Age Single
1 Myla NaN True
2 Lewis 21.0 True
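iloc can also take a column position, which pulls out a single value (a standalone sketch with its own small dataframe):

```python
import pandas as pd

df = pd.DataFrame({"Person": ["John", "Myla", "Lewis"],
                   "Age": [24.0, 22.0, 21.0]})

# Row 1, column 0 -> a single value
print(df.iloc[1, 0])  # Myla

# Rows 0-1, all columns -> a smaller dataframe
print(df.iloc[0:2])
```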

Select rows by value using loc and a boolean condition:

In [10]:
df.loc[df['Single'] == True]
Out[10]:
Person Age Single
1 Myla NaN True
2 Lewis 21.0 True
3 John 33.0 True

Select based on multiple conditions:

In [11]:
df.loc[(df['Age'] > 20) & (df['Age'] < 30)]
Out[11]:
Person Age Single
0 John 24.0 False
2 Lewis 21.0 True
4 Myla 26.0 False
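The & above means "and." You can also combine conditions with | for "or" (each condition still needs its own parentheses). A standalone sketch:

```python
import pandas as pd

df = pd.DataFrame({"Person": ["John", "Myla", "Lewis"],
                   "Age": [24.0, 22.0, 33.0]})

# | is "or": keep rows where Age < 23 OR Age > 30
out = df.loc[(df["Age"] < 23) | (df["Age"] > 30)]
print(out["Person"].tolist())  # ['Myla', 'Lewis']
```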

Make a new dataframe based on a selection:

In [12]:
newDF = df.loc[(df['Age'] > 20) & (df['Age'] < 30)]

A note on "copying"... The selection above is really a view into the original dataframe, not an independent copy. Modifying it this way can unintentionally affect the original, which may give you unexpected results down the line if you plan on using the original again. What?? Yeah.

Pandas will also raise warnings if you do it the above way...

The preferred way to split off a selection of a dataframe is with the .copy() method:

In [14]:
newDF2 = df.loc[(df['Age'] > 20) & (df['Age'] < 30)].copy()

And here is the difference. Try this simple modification:

In [13]:
newDF['Age'] = newDF['Age'] + 1
C:\Users\phwh9568\AppData\Local\Temp\ipykernel_34720\1414648428.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newDF['Age'] = newDF['Age'] +1

That's the SettingWithCopyWarning: Pandas can't guarantee whether newDF is a view or a copy, so it warns you that the assignment may not behave as expected.

Okay, try again using newDF2:

In [15]:
newDF2['Age'] = newDF2['Age'] + 1
In [16]:
newDF2
Out[16]:
Person Age Single
0 John 25.0 False
2 Lewis 22.0 True
4 Myla 27.0 False
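No warning this time. A quick sanity check of the difference, as a standalone sketch: modifying a .copy() leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"Age": [24.0, 21.0, 33.0]})

# .copy() gives an independent dataframe; edits don't touch df
sub = df.loc[df["Age"] < 30].copy()
sub["Age"] = sub["Age"] + 1

print(df["Age"].tolist())   # [24.0, 21.0, 33.0] -- unchanged
print(sub["Age"].tolist())  # [25.0, 22.0]
```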

There are a TON of built-in dataframe methods that help you with munging, manipulating, statistics, and more. See the docs: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html.

Some simple descriptive statistics...

.min()
.max()
.mean()
.sum()
.cumsum()
.mode()
.median()
.count()
.std()
.unique()
.describe()
In [28]:
df['Age'].describe()
Out[28]:
count     4.00000
mean     26.00000
std       5.09902
min      21.00000
25%      23.25000
50%      25.00000
75%      27.75000
max      33.00000
Name: Age, dtype: float64
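Notice the count of 4 above: most of these statistics skip NaN values by default. A standalone sketch with a couple of the listed methods:

```python
import pandas as pd
import numpy as np

ages = pd.Series([24.0, np.nan, 21.0, 33.0, 26.0])

print(ages.mean())   # NaN skipped: (24+21+33+26)/4 = 26.0
print(ages.count())  # 4 -- count ignores NaN
print(ages.sum())    # 104.0
```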

Dealing with missing data:

Lots of approaches, but commonly:

.isna() and .fillna()

In [29]:
df.isna()
Out[29]:
Person Age Single
0 False False False
1 False True False
2 False False False
3 False False False
4 False False False

Note! By default, methods like .fillna() return a new dataframe and leave the original unchanged.

Two ways to "save" a change:

df = df.fillna(20.0)

OR

df.fillna(20.0, inplace=True)
In [55]:
df.fillna(20.0, inplace=True)
In [362]:
df
Out[362]:
Person Age Single
0 John 24.0 False
1 Myla 20.0 True
2 Lewis 21.0 True
3 John 33.0 True
4 Myla 26.0 False

Creating new columns:

In [404]:
df['chainsaw'] = 'vrooooooom'

Math problem:

In [405]:
df['result'] = df['Age'] ** 2

String methods:

Concat new string to a column value to form a new column:

In [406]:
df['email'] = df['Person'] + '@cuboulder.rules'

Split a string column using .str.split():

Note the documentation. Use the expand parameter to split into columns.

Split email on @.

In [407]:
df['email'].str.split('@', expand=True)
Out[407]:
0 1
0 John cuboulder.rules
1 Myla cuboulder.rules
2 Lewis cuboulder.rules
3 John cuboulder.rules
4 Myla cuboulder.rules

What did .str.split() return?

How do we select an element by its index position?

Make a domain column:

In [408]:
df['domain'] = df['email'].str.split('@', expand=True)[1]
df
In [409]:
df
Out[409]:
Person Age Single chainsaw result email domain
0 John 24.0 False vrooooooom 576.0 John@cuboulder.rules cuboulder.rules
1 Myla 20.0 True vrooooooom 400.0 Myla@cuboulder.rules cuboulder.rules
2 Lewis 21.0 True vrooooooom 441.0 Lewis@cuboulder.rules cuboulder.rules
3 John 33.0 True vrooooooom 1089.0 John@cuboulder.rules cuboulder.rules
4 Myla 26.0 False vrooooooom 676.0 Myla@cuboulder.rules cuboulder.rules

Delete a column using .drop()

In [410]:
df.drop(columns='chainsaw', inplace=True)
In [411]:
df
Out[411]:
Person Age Single result email domain
0 John 24.0 False 576.0 John@cuboulder.rules cuboulder.rules
1 Myla 20.0 True 400.0 Myla@cuboulder.rules cuboulder.rules
2 Lewis 21.0 True 441.0 Lewis@cuboulder.rules cuboulder.rules
3 John 33.0 True 1089.0 John@cuboulder.rules cuboulder.rules
4 Myla 26.0 False 676.0 Myla@cuboulder.rules cuboulder.rules

Delete multiple columns.
Feed the columns parameter a list.

In [412]:
df.drop(columns=['result','email','domain'], inplace=True)
In [54]:
df
Out[54]:
Person Age Single
0 John 24.0 False
1 Myla NaN True
2 Lewis 21.0 True
3 John 33.0 True
4 Myla 26.0 False

Applying a custom function to a column.¶

Sometimes, you just have to do something your way! Imagine you have a "formula" you want to apply across all the rows in a column, Excel-style. You may be tempted to iterate through the rows (and you can, using .iterrows()), but this is not the most efficient way.

You can use .apply() to apply a custom function to all rows in a column.

Let's start with a quick side trip on functions:

In [31]:
def mingle(age, single):
    if age < 30:
        if single == True:
            result = 'READY TO MINGLE'
        else: 
            result = "Home by 8pm"
    else:
        result = "Home by 8pm"
    
    return result

Does this function work? Let's try!

In [45]:
mingle(30,True)
Out[45]:
'Home by 8pm'

Okay, now we'll use df.apply() with a lambda function to run this on our dataframe, passing the Age and Single columns as input.

(Don't get too hung up on the details right now :-) ).

In [59]:
df['status'] = df.apply(lambda row: mingle(row['Age'], row['Single']), axis=1)
In [60]:
df
Out[60]:
Person Age Single status
0 John 24.0 False Home by 8pm
1 Myla 20.0 True READY TO MINGLE
2 Lewis 21.0 True READY TO MINGLE
3 John 33.0 True Home by 8pm
4 Myla 26.0 False Home by 8pm
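As an aside (not what the notebook does above): simple conditional logic like mingle() can often be vectorized with numpy.where, which is typically much faster than .apply() on large dataframes. A standalone sketch of the same logic:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Age": [24.0, 20.0, 21.0, 33.0, 26.0],
                   "Single": [False, True, True, True, False]})

# Vectorized equivalent of mingle(): under 30 AND single
df["status"] = np.where((df["Age"] < 30) & df["Single"],
                        "READY TO MINGLE", "Home by 8pm")
print(df["status"].tolist())
```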

Reshaping data¶

First, what is the shape?

In [416]:
df.shape
Out[416]:
(5, 3)

Transpose using .T

In [417]:
trans = df.T
trans
Out[417]:
0 1 2 3 4
Person John Myla Lewis John Myla
Age 24.0 20.0 21.0 33.0 26.0
Single False True True True False

Pivoting and Pivot Tables

In [443]:
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                        "bar", "bar", "bar", "bar"],
                    "B": ["one", "one", "one", "two", "two",
                        "one", "one", "two", "two"],
                    "C": ["small", "large", "large", "small",
                        "small", "large", "small", "small", "large"],
                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
In [435]:
df
Out[435]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7

You can create Excel-style pivot tables, which are useful for aggregating:

In [436]:
aggDF = pd.pivot_table(df, values='D', index='A', aggfunc='sum')
aggDF
Out[436]:
D
A
bar 22
foo 11
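pivot_table can also spread a second grouping key across the columns via the columns parameter (this goes a step beyond the cell above; standalone sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                   "C": ["small", "large", "small", "large"],
                   "D": [1, 2, 5, 4]})

# One row per A value, one column per C value, summing D
table = pd.pivot_table(df, values="D", index="A", columns="C", aggfunc="sum")
print(table)
```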

Rename a column using .rename()

In [437]:
aggDF = aggDF.rename(columns={'D':'sumD'})

Merge the aggregated data back to our original dataframe:

In [438]:
df = pd.merge(df, aggDF, on='A')
In [439]:
df
Out[439]:
A B C D sumD
0 foo one small 1 11
1 foo one large 2 11
2 foo one large 2 11
3 foo two small 3 11
4 foo two small 3 11
5 bar one large 4 22
6 bar one small 5 22
7 bar two small 6 22
8 bar two large 7 22
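By the way, the pivot-then-merge pattern above can also be done in one step with groupby().transform(), which returns one value per row, already aligned to the original dataframe (an alternative, shown here as a standalone sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                   "D": [1, 2, 4, 7]})

# transform('sum') broadcasts each group's sum back to its rows
df["sumD"] = df.groupby("A")["D"].transform("sum")
print(df["sumD"].tolist())  # [3, 3, 11, 11]
```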

Now, what's the proportion of each individual D to sumD?

In [440]:
df['Dperc'] = df['D']/df['sumD']*100
In [441]:
df
Out[441]:
A B C D sumD Dperc
0 foo one small 1 11 9.090909
1 foo one large 2 11 18.181818
2 foo one large 2 11 18.181818
3 foo two small 3 11 27.272727
4 foo two small 3 11 27.272727
5 bar one large 4 22 18.181818
6 bar one small 5 22 22.727273
7 bar two small 6 22 27.272727
8 bar two large 7 22 31.818182

Similar: .groupby()

In [444]:
df.groupby(['A','C'])[['D']].sum()
Out[444]:
D
A C
bar large 11
small 11
foo large 4
small 7
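.groupby() can also apply several aggregations at once via .agg() (a standalone sketch):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                   "D": [1, 2, 4, 7]})

# One row per group, one column per aggregation
out = df.groupby("A")["D"].agg(["sum", "mean"])
print(out)
```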

That's all for this lesson... questions?