Pandas Primer¶
Pandas is a popular and useful data analysis library. The name comes from "panel data", an econometrics term for multi-dimensional structured datasets, and is also a nod to "Python data analysis".
Pandas is the go-to package for working with tabular data, whether it comes from CSVs, text files, Excel sheets, or even SQL databases (there's a quick example after the imports below). If you're coming from Excel, you'll find the Pandas approach fairly intuitive.
In this primer, we'll cover the basics.
Imports:
import pandas as pd
import numpy as np
import os
You'll usually see Pandas abbreviated as pd and NumPy abbreviated as np.
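Since Pandas can read those file formats directly, loading data is usually a one-liner. The file names below are hypothetical placeholders, just to show the pattern:
# Hypothetical file names -- swap in your own paths
csv_df = pd.read_csv('my_data.csv')        # comma- or tab-separated text files
excel_df = pd.read_excel('my_data.xlsx')   # Excel sheets (for .xlsx this needs the openpyxl package)
# pd.read_sql() works similarly, given a query and a database connection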
In Pandas, data is stored as either a "series" (one-dimensional) or a "dataframe" (two-dimensional).
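A series on its own can be built directly; here's a minimal sketch with made-up values:
ages = pd.Series([24, 21, 33], name='Age')
ages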
You'll usually import data into a dataframe, but to begin we'll create this basic dataframe from scratch using pd.DataFrame().
df = pd.DataFrame({"Person":["John", "Myla", "Lewis", "John", "Myla"],
"Age": [24., np.nan, 21., 33, 26],
"Single": [False, True, True, True, False]})
df
|   | Person | Age | Single |
|---|---|---|---|
| 0 | John | 24.0 | False |
| 1 | Myla | NaN | True |
| 2 | Lewis | 21.0 | True |
| 3 | John | 33.0 | True |
| 4 | Myla | 26.0 | False |
View just the column headers:
df.columns
Index(['Person', 'Age', 'Single'], dtype='object')
Select individual columns:
df['Age']
0    24.0
1     NaN
2    21.0
3    33.0
4    26.0
Name: Age, dtype: float64
Select multiple columns by passing a list of column names (note the double brackets):
df[['Person','Age']]
|   | Person | Age |
|---|---|---|
| 0 | John | 24.0 |
| 1 | Myla | NaN |
| 2 | Lewis | 21.0 |
| 3 | John | 33.0 |
| 4 | Myla | 26.0 |
Select rows by index position using iloc:
df.iloc[1:3]
|   | Person | Age | Single |
|---|---|---|---|
| 1 | Myla | NaN | True |
| 2 | Lewis | 21.0 | True |
Select rows based on their values using loc:
df.loc[df['Single'] == True]
|   | Person | Age | Single |
|---|---|---|---|
| 1 | Myla | NaN | True |
| 2 | Lewis | 21.0 | True |
| 3 | John | 33.0 | True |
Select based on multiple conditions:
df.loc[(df['Age'] > 20) & (df['Age'] < 30)]
|   | Person | Age | Single |
|---|---|---|---|
| 0 | John | 24.0 | False |
| 2 | Lewis | 21.0 | True |
| 4 | Myla | 26.0 | False |
Make a new dataframe based on a selection:
newDF = df.loc[(df['Age'] > 20) & (df['Age'] < 30)]
A note on "copying": a selection made this way is really a view of the original dataframe rather than an independent copy. You can sometimes unintentionally modify the original through it, which may give you unexpected results down the line if you plan on using the original again, and Pandas will raise some oblique warnings when you do it this way.
The preferred way to split off a selection of a dataframe is with the .copy() method:
newDF2 = df.loc[(df['Age'] > 20) & (df['Age'] < 30)].copy()
And here is the difference. Try this simple modification:
newDF['Age'] = newDF['Age'] + 1
C:\Users\phwh9568\AppData\Local\Temp\ipykernel_34720\1414648428.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newDF['Age'] = newDF['Age'] +1
That warning is Pandas telling us we're setting values on a view of the original dataframe. Okay, try again using newDF2, which was made with .copy():
newDF2['Age'] = newDF2['Age'] + 1
newDF2
|   | Person | Age | Single |
|---|---|---|---|
| 0 | John | 25.0 | False |
| 2 | Lewis | 22.0 | True |
| 4 | Myla | 27.0 | False |
There are a TON of built-in dataframe methods that help with munging, manipulating, statistics, and more. See the docs: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html.
Some simple descriptive statistics (a few of these are sketched in the example after this list):
.min()
.max()
.mean()
.sum()
.cumsum()
.mode()
.median()
.count()
.std()
.unique()
.describe()
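For instance, here's a quick sketch of a few of these applied to the Age column. The values in the comments follow from the current ages (24, NaN, 21, 33, 26); missing values are skipped by default:
df['Age'].min()    # 21.0
df['Age'].max()    # 33.0
df['Age'].mean()   # 26.0
df['Age'].sum()    # 104.0
df['Age'].count()  # 4 -- the NaN isn't counted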
df['Age'].describe()
count     4.00000
mean     26.00000
std       5.09902
min      21.00000
25%      23.25000
50%      25.00000
75%      27.75000
max      33.00000
Name: Age, dtype: float64
Check for missing values with .isna():
df.isna()
|   | Person | Age | Single |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | True | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
Note: would a change like .fillna() be permanent? By itself, df.fillna(20.0) returns a new dataframe and leaves df untouched. There are two ways to "save" a change:
df = df.fillna(20.0)
OR
df.fillna(20.0, inplace=True)
df
|   | Person | Age | Single |
|---|---|---|---|
| 0 | John | 24.0 | False |
| 1 | Myla | 20.0 | True |
| 2 | Lewis | 21.0 | True |
| 3 | John | 33.0 | True |
| 4 | Myla | 26.0 | False |
Creating new columns:
df['chainsaw'] = 'vrooooooom'
Math problem:
df['result'] = df['Age'] ** 2
String methods:
Concatenate a string onto a column's values to form a new column:
df['email'] = df['Person'] + '@cuboulder.rules'
Split a string column using .str.split(). Per the documentation, the expand parameter splits the result into separate columns. Here we split email on @:
df['email'].str.split('@',expand=True)
|   | 0 | 1 |
|---|---|---|
| 0 | John | cuboulder.rules |
| 1 | Myla | cuboulder.rules |
| 2 | Lewis | cuboulder.rules |
| 3 | John | cuboulder.rules |
| 4 | Myla | cuboulder.rules |
What did .str.split() return? Another dataframe, with its columns labeled 0 and 1. We can select column 1 by its index position and use it to make a domain column:
df['domain'] = df['email'].str.split('@',expand=True)[1]
df
|   | Person | Age | Single | chainsaw | result | email | domain |
|---|---|---|---|---|---|---|---|
| 0 | John | 24.0 | False | vrooooooom | 576.0 | John@cuboulder.rules | cuboulder.rules |
| 1 | Myla | 20.0 | True | vrooooooom | 400.0 | Myla@cuboulder.rules | cuboulder.rules |
| 2 | Lewis | 21.0 | True | vrooooooom | 441.0 | Lewis@cuboulder.rules | cuboulder.rules |
| 3 | John | 33.0 | True | vrooooooom | 1089.0 | John@cuboulder.rules | cuboulder.rules |
| 4 | Myla | 26.0 | False | vrooooooom | 676.0 | Myla@cuboulder.rules | cuboulder.rules |
Delete a column using .drop()
df.drop(columns='chainsaw', inplace=True)
df
|   | Person | Age | Single | result | email | domain |
|---|---|---|---|---|---|---|
| 0 | John | 24.0 | False | 576.0 | John@cuboulder.rules | cuboulder.rules |
| 1 | Myla | 20.0 | True | 400.0 | Myla@cuboulder.rules | cuboulder.rules |
| 2 | Lewis | 21.0 | True | 441.0 | Lewis@cuboulder.rules | cuboulder.rules |
| 3 | John | 33.0 | True | 1089.0 | John@cuboulder.rules | cuboulder.rules |
| 4 | Myla | 26.0 | False | 676.0 | Myla@cuboulder.rules | cuboulder.rules |
Delete multiple columns by feeding the columns parameter a list:
df.drop(columns=['result','email','domain'], inplace=True)
df
|   | Person | Age | Single |
|---|---|---|---|
| 0 | John | 24.0 | False |
| 1 | Myla | 20.0 | True |
| 2 | Lewis | 21.0 | True |
| 3 | John | 33.0 | True |
| 4 | Myla | 26.0 | False |
Applying a custom function to a column.¶
Sometimes you just have to do something your own way! Imagine you have a "formula" you want to apply across all the rows in a column, Excel-style. You may be tempted to iterate through the rows (and you can, using .iterrows()), but that is not the most efficient approach. You can use .apply() to run a custom function on every row of a column instead.
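For contrast, here's roughly what the row-by-row version looks like. This is only a sketch: the AgeNextYear column is made up for illustration and dropped right away.
# Slower, row-by-row approach -- fine for small data, best avoided on big dataframes
for idx, row in df.iterrows():
    df.loc[idx, 'AgeNextYear'] = row['Age'] + 1
df = df.drop(columns='AgeNextYear')  # tidy up; we only wanted to see the pattern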
Let's start with a quick side trip on functions:
def mingle(age, single):
    if age < 30:
        if single == True:
            result = 'READY TO MINGLE'
        else:
            result = "Home by 8pm"
    else:
        result = "Home by 8pm"
    return result
Does this function work? Let's try!
mingle(25, True)
'READY TO MINGLE'
mingle(30, True)
'Home by 8pm'
Okay, now we'll use df.apply() with a lambda function to run this on our dataframe, feeding in the Age and Single columns as input. (Don't get too hung up on the details right now :-) )
df['status'] = df.apply(lambda row: mingle(row['Age'], row['Single']), axis=1)
df
|   | Person | Age | Single | status |
|---|---|---|---|---|
| 0 | John | 24.0 | False | Home by 8pm |
| 1 | Myla | 20.0 | True | READY TO MINGLE |
| 2 | Lewis | 21.0 | True | READY TO MINGLE |
| 3 | John | 33.0 | True | Home by 8pm |
| 4 | Myla | 26.0 | False | Home by 8pm |
Check the dataframe's dimensions (rows, columns) with .shape:
df.shape
(5, 4)
Transpose using .T
trans = df.T
trans
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Person | John | Myla | Lewis | John | Myla |
| Age | 24.0 | 20.0 | 21.0 | 33.0 | 26.0 |
| Single | False | True | True | True | False |
| status | Home by 8pm | READY TO MINGLE | READY TO MINGLE | Home by 8pm | Home by 8pm |
Pivoting and Pivot Tables¶
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small", "large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
df
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | small | 1 |
| 1 | foo | one | large | 2 |
| 2 | foo | one | large | 2 |
| 3 | foo | two | small | 3 |
| 4 | foo | two | small | 3 |
| 5 | bar | one | large | 4 |
| 6 | bar | one | small | 5 |
| 7 | bar | two | small | 6 |
| 8 | bar | two | large | 7 |
You can create Excel-style pivot tables, which are useful for aggregating (a fuller crosstab variant is sketched after the output below):
aggDF = pd.pivot_table(df,values='D', index='A', aggfunc='sum')
aggDF
| A | D |
|---|---|
| bar | 22 |
| foo | 11 |
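For a more Excel-like crosstab, you can also spread a second grouping across the columns. A quick sketch, with the summed values (which follow from the same data) shown in the comments:
pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')
# C    large  small
# A
# bar     11     11
# foo      4      7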
Rename a column using .rename()
aggDF = aggDF.rename(columns={'D':'sumD'})
Merge the aggregated data back to our original dataframe:
df = pd.merge(df, aggDF, on='A')
df
|   | A | B | C | D | sumD |
|---|---|---|---|---|---|
| 0 | foo | one | small | 1 | 11 |
| 1 | foo | one | large | 2 | 11 |
| 2 | foo | one | large | 2 | 11 |
| 3 | foo | two | small | 3 | 11 |
| 4 | foo | two | small | 3 | 11 |
| 5 | bar | one | large | 4 | 22 |
| 6 | bar | one | small | 5 | 22 |
| 7 | bar | two | small | 6 | 22 |
| 8 | bar | two | large | 7 | 22 |
Now, what percentage of sumD does each individual D make up?
df['Dperc'] = df['D']/df['sumD']*100
df
|   | A | B | C | D | sumD | Dperc |
|---|---|---|---|---|---|---|
| 0 | foo | one | small | 1 | 11 | 9.090909 |
| 1 | foo | one | large | 2 | 11 | 18.181818 |
| 2 | foo | one | large | 2 | 11 | 18.181818 |
| 3 | foo | two | small | 3 | 11 | 27.272727 |
| 4 | foo | two | small | 3 | 11 | 27.272727 |
| 5 | bar | one | large | 4 | 22 | 18.181818 |
| 6 | bar | one | small | 5 | 22 | 22.727273 |
| 7 | bar | two | small | 6 | 22 | 27.272727 |
| 8 | bar | two | large | 7 | 22 | 31.818182 |
A similar tool is .groupby(), which groups rows by one or more columns and then aggregates them (a comparison with the pivot table above follows the output):
df.groupby(['A','C'])[['D']].sum()
| A | C | D |
|---|---|---|
| bar | large | 11 |
| bar | small | 11 |
| foo | large | 4 |
| foo | small | 7 |
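As a quick check of how .groupby() relates to the pivot table above, grouping on A alone reproduces the same sums (a sketch; the output is shown in the comments):
df.groupby('A')[['D']].sum()
#       D
# A
# bar  22
# foo  11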
That's all for this lesson... questions?