Pandas Mastery: Data Analysis Made Easy
Aman Saurav
Pandas is the most popular Python library for data analysis and manipulation. Built on top of NumPy, it provides high-level data structures and tools for working with structured data.
Installation
pip install pandas
# With visualization support
pip install pandas matplotlib seaborn
# Verify
python -c "import pandas; print(pandas.__version__)"
Core Data Structures
Series (1D)
import pandas as pd
import numpy as np
# From list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# 0 1.0
# 1 3.0
# 2 5.0
# 3 NaN
# 4 6.0
# 5 8.0
# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b']) # 20
# From dictionary
d = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(d)
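One Series behavior worth knowing early: arithmetic aligns on index labels, not positions. A minimal sketch (the labels here are arbitrary):
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)
# a     NaN   <- 'a' exists only in s1
# b    12.0
# c    23.0
# d     NaN   <- 'd' exists only in s2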
DataFrame (2D)
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
# Name Age City
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Charlie 35 Chicago
# From CSV
df = pd.read_csv('data.csv')
# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# From SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM table_name", conn)  # use a real table name; 'table' is a reserved SQL word
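If you want to try the SQL round trip without an existing database, here is a self-contained sketch using an in-memory SQLite database; the employees table and its columns are made up for illustration:
conn = sqlite3.connect(':memory:')  # throwaway in-memory database
pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']}).to_sql('employees', conn, index=False)
df = pd.read_sql_query("SELECT * FROM employees", conn)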
Viewing Data
# First/last rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
# Info
df.info() # Column types, non-null counts
df.describe() # Statistical summary
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types
# Quick stats (numeric_only avoids errors on string columns in pandas 2+)
df.count() # Non-null values
df.mean(numeric_only=True) # Mean of numeric columns
df.median(numeric_only=True) # Median
df.std(numeric_only=True) # Standard deviation
Selecting Data
Column Selection
# Single column (Series)
df['Name']
df.Name # Dot notation (only when the name is a valid identifier with no spaces)
# Multiple columns (DataFrame)
df[['Name', 'Age']]
# All columns except one
df.drop('City', axis=1)
Row Selection
# By position (.iloc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[[0, 2]] # Rows 0 and 2
# By label (.loc)
df.loc[0] # Row with index 0
df.loc[0:2] # Rows 0 to 2 (inclusive!)
df.loc[df['Age'] > 25] # Conditional
# Boolean indexing
df[df['Age'] > 25]
df[df['City'] == 'NYC']
df[(df['Age'] > 25) & (df['City'] == 'NYC')]
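As an alternative to boolean masks, df.query() takes the condition as a string, which some find more readable for compound filters. This is equivalent to the last mask above:
df.query('Age > 25 and City == "NYC"')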
Cell Selection
# Single cell
df.loc[0, 'Name'] # 'Alice'
df.iloc[0, 0] # 'Alice'
# Multiple cells
df.loc[0:2, ['Name', 'Age']]
df.iloc[0:2, 0:2]
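When you only need a single scalar, .at and .iat are faster single-cell counterparts of .loc and .iloc:
df.at[0, 'Name'] # Label-based scalar access
df.iat[0, 0] # Position-based scalar access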
Data Manipulation
Adding Columns
# New column
df['Salary'] = [50000, 60000, 70000]
# Calculated column
df['Age_Plus_10'] = df['Age'] + 10
# Conditional column
df['Senior'] = df['Age'] > 30
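For conditional columns with more than a True/False result, np.where and pd.cut are common tools. A sketch (the column names Level and Age_Group, and the bin edges, are arbitrary choices):
# Two-way condition
df['Level'] = np.where(df['Age'] > 30, 'Senior', 'Junior')
# Binning into ranges (bin edges are illustrative)
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 60, 120], labels=['Young', 'Middle', 'Older'])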
Modifying Data
# Update single value
df.loc[0, 'Age'] = 26
# Update multiple values
df.loc[df['City'] == 'NYC', 'City'] = 'New York'
# Apply function
df['Age'] = df['Age'].apply(lambda x: x + 1)
# Map values
df['City'] = df['City'].map({'NYC': 'New York', 'LA': 'Los Angeles'})
# Replace
df['Age'] = df['Age'].replace(25, 26) # assign back; inplace on a single column is unreliable
Deleting Data
# Drop column
df.drop('Salary', axis=1, inplace=True)
# Drop row
df.drop(0, axis=0, inplace=True)
# Drop multiple columns
df.drop(['Age', 'City'], axis=1, inplace=True)
# Drop by condition
df = df[df['Age'] > 25]
Handling Missing Data
# Check for missing values
df.isnull() # Boolean DataFrame
df.isnull().sum() # Count per column
# Drop missing values
df.dropna() # Drop rows with any NaN
df.dropna(axis=1) # Drop columns with any NaN
df.dropna(thresh=2) # Keep rows with at least 2 non-NaN
# Fill missing values
df.fillna(0) # Fill with 0
df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their mean
df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df.bfill() # Backward fill
# Interpolate
df.interpolate() # Linear interpolation
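Putting these together on a tiny frame makes the behavior concrete. Note that a forward fill cannot fill a leading NaN, because there is nothing before it to copy:
df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]})
print(df_nan.isnull().sum()) # A: 1, B: 1
print(df_nan.ffill())
# A becomes [1.0, 1.0, 3.0]; B stays [NaN, 5.0, 6.0] (the leading NaN survives)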
Grouping and Aggregation
# Sample data
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 70000, 80000, 55000]
})
# Group by
grouped = df.groupby('Department')
# Aggregations (select the numeric column first to avoid errors on string columns)
grouped['Salary'].mean() # Mean salary by department
grouped['Salary'].sum() # Total salary by department
grouped['Salary'].count() # Count by department
grouped['Salary'].min() # Minimum by department
grouped['Salary'].max() # Maximum by department
# Multiple aggregations
grouped['Salary'].agg(['mean', 'sum', 'count'])
# Custom aggregations
grouped.agg({
    'Salary': ['mean', 'sum'],
    'Employee': 'count'
})
# Group by multiple columns
df.groupby(['Department', 'Employee']).mean()
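Named aggregation (pandas 0.25+) lets you set the output column names in the same call; avg_salary and headcount below are just names I chose:
df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Employee', 'count')
)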
Sorting
# Sort by column
df.sort_values('Age') # Ascending
df.sort_values('Age', ascending=False) # Descending
# Sort by multiple columns
df.sort_values(['Department', 'Salary'], ascending=[True, False])
# Sort by index
df.sort_index()
Merging and Joining
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})
# Inner join (default)
merged = pd.merge(df1, df2, on='ID')
# ID Name Salary
# 0 1 Alice 50000
# 1 2 Bob 60000
# Left join
merged = pd.merge(df1, df2, on='ID', how='left')
# ID Name Salary
# 0 1 Alice 50000.0
# 1 2 Bob 60000.0
# 2 3 Charlie NaN
# Right join
merged = pd.merge(df1, df2, on='ID', how='right')
# Outer join
merged = pd.merge(df1, df2, on='ID', how='outer')
# Concatenate
combined = pd.concat([df1, df2], axis=0) # Stack vertically
combined = pd.concat([df1, df2], axis=1) # Stack horizontally
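When debugging a merge, indicator=True adds a _merge column telling you which side each row came from:
pd.merge(df1, df2, on='ID', how='outer', indicator=True)
# _merge is 'both', 'left_only', or 'right_only' for each row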
Pivot Tables
# Sample sales data
df = pd.DataFrame({
    'Date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
})
# Pivot table
pivot = df.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum'
)
# A B
# Date
# 2024-01 100 150
# 2024-02 120 180
# Multiple aggregations
pivot = df.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc=['sum', 'mean']
)
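The inverse of a pivot is melt, which turns wide data back into long form, one row per (Date, Product) pair. A sketch that unpivots the single-aggregation table (wide and long_df are names I've chosen):
wide = df.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')
long_df = wide.reset_index().melt(id_vars='Date', var_name='Product', value_name='Sales')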
String Operations
df = pd.DataFrame({'Name': ['Alice Smith', 'Bob Jones', 'Charlie Brown']})
# String methods
df['Name'].str.lower() # 'alice smith'
df['Name'].str.upper() # 'ALICE SMITH'
df['Name'].str.title() # 'Alice Smith'
df['Name'].str.len() # 11
df['Name'].str.contains('Smith') # True, False, False
df['Name'].str.startswith('A') # True, False, False
df['Name'].str.split(' ') # ['Alice', 'Smith']
# Extract first name
df['First_Name'] = df['Name'].str.split(' ').str[0]
# Replace
df['Name'].str.replace('Smith', 'Johnson')
# Strip whitespace
df['Name'].str.strip()
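For anything beyond fixed substrings, str.extract pulls out regex groups, and named groups become column names:
# One group -> Series (with expand=False)
df['Last_Name'] = df['Name'].str.extract(r'(\w+)$', expand=False)
# Named groups -> DataFrame with 'first' and 'last' columns
df['Name'].str.extract(r'(?P<first>\w+)\s+(?P<last>\w+)')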
Date/Time Operations
# Create date range
dates = pd.date_range('2024-01-01', periods=10, freq='D')
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['DayName'] = df['Date'].dt.day_name()
# Time delta
df['Days_Since'] = (pd.Timestamp.now() - df['Date']).dt.days
# Resample time series (requires a datetime index; select a numeric column)
df.set_index('Date')['Sales'].resample('ME').mean() # Monthly average ('M' is deprecated in newer pandas)
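Closely related to resampling is a rolling window over a datetime index. A minimal sketch with made-up daily data (daily and 7d_avg are names I've chosen):
daily = pd.DataFrame({'Sales': range(10)}, index=pd.date_range('2024-01-01', periods=10, freq='D'))
daily['7d_avg'] = daily['Sales'].rolling(7).mean() # NaN until 7 observations exist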
Practical Examples
Data Cleaning
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['Name'])
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Change data types
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
# Strip whitespace from strings
df['Name'] = df['Name'].str.strip()
# Handle outliers
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Salary'] >= Q1 - 1.5*IQR) & (df['Salary'] <= Q3 + 1.5*IQR)]
Data Analysis
# Correlation matrix (numeric columns only)
df.corr(numeric_only=True)
# Cross-tabulation
pd.crosstab(df['Department'], df['City'])
# Value counts
df['City'].value_counts()
# Unique values
df['City'].unique()
df['City'].nunique()
# Percentiles
df['Salary'].quantile([0.25, 0.5, 0.75])
Export Data
# To CSV
df.to_csv('output.csv', index=False)
# To Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# To SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
# To JSON
df.to_json('output.json', orient='records')
# To HTML
df.to_html('output.html', index=False)
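For large datasets, Parquet is usually a better interchange format than CSV (columnar, typed, compressed). Note it needs an optional dependency:
# To Parquet (requires pyarrow or fastparquet)
df.to_parquet('output.parquet', index=False)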
Performance Tips
1. Use Vectorized Operations
# ❌ Slow: Loop
for i in range(len(df)):
    df.loc[i, 'New'] = df.loc[i, 'A'] + df.loc[i, 'B']
# ✅ Fast: Vectorized
df['New'] = df['A'] + df['B']
2. Use Categories for Repeated Strings
# ❌ Memory inefficient: repeated strings stored as generic objects
df['City'].dtype # object
# ✅ Memory efficient
df['City'] = df['City'].astype('category')
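You can verify the saving yourself with memory_usage(deep=True), which counts the actual string storage:
before = df['City'].memory_usage(deep=True)
df['City'] = df['City'].astype('category')
after = df['City'].memory_usage(deep=True)
print(before, after) # categories typically use far less for repeated strings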
3. Read Only Needed Columns
# ❌ Read all columns
df = pd.read_csv('large_file.csv')
# ✅ Read specific columns
df = pd.read_csv('large_file.csv', usecols=['Name', 'Age'])
4. Use Chunking for Large Files
# Read in chunks
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    # Process each chunk as it streams in
    processed = chunk[chunk['Age'] > 25]
    chunks.append(processed)
df = pd.concat(chunks)
Common Pitfalls
1. SettingWithCopyWarning
# ❌ Chained assignment
df[df['Age'] > 25]['Salary'] = 50000 # Warning!
# ✅ Use .loc
df.loc[df['Age'] > 25, 'Salary'] = 50000
2. Inplace Operations
# Returns None!
result = df.sort_values('Age', inplace=True)
print(result) # None
# Better: Don't use inplace
df = df.sort_values('Age')
3. Index Confusion
# After filtering, index may have gaps
df = df[df['Age'] > 25]
print(df.index) # [0, 2, 4, 7, ...]
# Reset index
df = df.reset_index(drop=True)
print(df.index) # [0, 1, 2, 3, ...]
Resources
- Official documentation: pandas.pydata.org
- "10 Minutes to pandas": the quick-start guide in the official docs
- The official pandas cheat sheet (linked from pandas.pydata.org)
Conclusion
Pandas is essential for:
- ✅ Data cleaning and preparation
- ✅ Exploratory data analysis
- ✅ Data transformation
- ✅ Statistical analysis
Key Takeaways:
- Use vectorized operations for speed
- Master .loc and .iloc for selection
- Understand groupby for aggregations
- Handle missing data appropriately
- Use categories for memory efficiency
Master Pandas, and you’ll be able to handle any data analysis task in Python!