Pandas Mastery: Data Analysis Made Easy
Aman Saurav
Pandas is the most popular Python library for data analysis and manipulation. Built on top of NumPy, it provides high-level data structures and tools for working with structured data.
Installation
pip install pandas
# With visualization support
pip install pandas matplotlib seaborn
# Verify
python -c "import pandas; print(pandas.__version__)"
Core Data Structures
Series (1D)
import pandas as pd
import numpy as np
# From list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# 0 1.0
# 1 3.0
# 2 5.0
# 3 NaN
# 4 6.0
# 5 8.0
# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b']) # 20
# From dictionary
d = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(d)
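One Series behavior worth knowing early: arithmetic aligns on index labels, not positions. A minimal sketch (the labels here are arbitrary):
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2)
# a     NaN   <- 'a' exists only in s1
# b    12.0
# c    23.0
# d     NaN   <- 'd' exists only in s2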
DataFrame (2D)
# From dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
# Name Age City
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Charlie 35 Chicago
# From CSV
df = pd.read_csv('data.csv')
# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# From SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM table_name", conn)  # use a real table name; 'table' is a reserved SQL word
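If you want to try the SQL round trip without an existing database, here is a self-contained sketch using an in-memory SQLite database; the employees table and its columns are made up for illustration:
conn = sqlite3.connect(':memory:')  # throwaway in-memory database
pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']}).to_sql('employees', conn, index=False)
df = pd.read_sql_query("SELECT * FROM employees", conn)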
Viewing Data
# First/last rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
# Info
df.info() # Column types, non-null counts
df.describe() # Statistical summary
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types
# Quick stats (numeric_only avoids errors on string columns in pandas 2+)
df.count() # Non-null values
df.mean(numeric_only=True) # Mean of numeric columns
df.median(numeric_only=True) # Median
df.std(numeric_only=True) # Standard deviation
Selecting Data
Column Selection
# Single column (Series)
df['Name']
df.Name # Dot notation (only when the name is a valid identifier with no spaces)
# Multiple columns (DataFrame)
df[['Name', 'Age']]
# All columns except one
df.drop('City', axis=1)
Row Selection
# By position (.iloc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[[0, 2]] # Rows 0 and 2
# By label (.loc)
df.loc[0] # Row with index 0
df.loc[0:2] # Rows 0 to 2 (inclusive!)
df.loc[df['Age'] > 25] # Conditional
# Boolean indexing
df[df['Age'] > 25]
df[df['City'] == 'NYC']
df[(df['Age'] > 25) & (df['City'] == 'NYC')]
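As an alternative to boolean masks, df.query() takes the condition as a string, which some find more readable for compound filters. This is equivalent to the last mask above:
df.query('Age > 25 and City == "NYC"')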
Cell Selection
# Single cell
df.loc[0, 'Name'] # 'Alice'
df.iloc[0, 0] # 'Alice'
# Multiple cells
df.loc[0:2, ['Name', 'Age']]
df.iloc[0:2, 0:2]
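When you only need a single scalar, .at and .iat are faster single-cell counterparts of .loc and .iloc:
df.at[0, 'Name'] # Label-based scalar access
df.iat[0, 0] # Position-based scalar access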
Data Manipulation
Adding Columns
# New column
df['Salary'] = [50000, 60000, 70000]
# Calculated column
df['Age_Plus_10'] = df['Age'] + 10
# Conditional column
df['Senior'] = df['Age'] > 30
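For conditional columns with more than a True/False result, np.where and pd.cut are common tools. A sketch (the column names Level and Age_Group, and the bin edges, are arbitrary choices):
# Two-way condition
df['Level'] = np.where(df['Age'] > 30, 'Senior', 'Junior')
# Binning into ranges (bin edges are illustrative)
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 60, 120], labels=['Young', 'Middle', 'Older'])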
Modifying Data
# Update single value
df.loc[0, 'Age'] = 26
# Update multiple values
df.loc[df['City'] == 'NYC', 'City'] = 'New York'
# Apply function
df['Age'] = df['Age'].apply(lambda x: x + 1)
# Map values
df['City'] = df['City'].map({'NYC': 'New York', 'LA': 'Los Angeles'})
# Replace
df['Age'] = df['Age'].replace(25, 26) # assign back; inplace on a single column is unreliable
Deleting Data
# Drop column
df.drop('Salary', axis=1, inplace=True)
# Drop row
df.drop(0, axis=0, inplace=True)
# Drop multiple columns
df.drop(['Age', 'City'], axis=1, inplace=True)
# Drop by condition
df = df[df['Age'] > 25]
Handling Missing Data
# Check for missing values
df.isnull() # Boolean DataFrame
df.isnull().sum() # Count per column
# Drop missing values
df.dropna() # Drop rows with any NaN
df.dropna(axis=1) # Drop columns with any NaN
df.dropna(thresh=2) # Keep rows with at least 2 non-NaN
# Fill missing values
df.fillna(0) # Fill with 0
df.fillna(df.mean(numeric_only=True)) # Fill numeric columns with their mean
df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df.bfill() # Backward fill
# Interpolate
df.interpolate() # Linear interpolation
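Putting these together on a tiny frame makes the behavior concrete. Note that a forward fill cannot fill a leading NaN, because there is nothing before it to copy:
df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, 6.0]})
print(df_nan.isnull().sum()) # A: 1, B: 1
print(df_nan.ffill())
# A becomes [1.0, 1.0, 3.0]; B stays [NaN, 5.0, 6.0] (the leading NaN survives)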
Grouping and Aggregation
# Sample data
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 70000, 80000, 55000]
})
# Group by
grouped = df.groupby('Department')
# Aggregations (select the numeric column first to avoid errors on string columns)
grouped['Salary'].mean() # Mean salary by department
grouped['Salary'].sum() # Total salary by department
grouped['Salary'].count() # Count by department
grouped['Salary'].min() # Minimum by department
grouped['Salary'].max() # Maximum by department
# Multiple aggregations
grouped['Salary'].agg(['mean', 'sum', 'count'])
# Custom aggregations
grouped.agg({
    'Salary': ['mean', 'sum'],
    'Employee': 'count'
})
# Group by multiple columns
df.groupby(['Department', 'Employee']).mean()
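Named aggregation (pandas 0.25+) lets you set the output column names in the same call; avg_salary and headcount below are just names I chose:
df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Employee', 'count')
)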
Sorting
# Sort by column
df.sort_values('Age') # Ascending
df.sort_values('Age', ascending=False) # Descending
# Sort by multiple columns
df.sort_values(['Department', 'Salary'], ascending=[True, False])
# Sort by index
df.sort_index()
Merging and Joining
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})
# Inner join (default)
merged = pd.merge(df1, df2, on='ID')
# ID Name Salary
# 0 1 Alice 50000
# 1 2 Bob 60000
# Left join
merged = pd.merge(df1, df2, on='ID', how='left')
# ID Name Salary
# 0 1 Alice 50000.0
# 1 2 Bob 60000.0
# 2 3 Charlie NaN
# Right join
merged = pd.merge(df1, df2, on='ID', how='right')
# Outer join
merged = pd.merge(df1, df2, on='ID', how='outer')
# Concatenate
combined = pd.concat([df1, df2], axis=0) # Stack vertically
combined = pd.concat([df1, df2], axis=1) # Stack horizontally
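When debugging a merge, indicator=True adds a _merge column telling you which side each row came from:
pd.merge(df1, df2, on='ID', how='outer', indicator=True)
# _merge is 'both', 'left_only', or 'right_only' for each row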
Pivot Tables
# Sample sales data
df = pd.DataFrame({
    'Date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
})
# Pivot table
pivot = df.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc='sum'
)
# A B
# Date
# 2024-01 100 150
# 2024-02 120 180
# Multiple aggregations
pivot = df.pivot_table(
    values='Sales',
    index='Date',
    columns='Product',
    aggfunc=['sum', 'mean']
)
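The inverse of a pivot is melt, which turns wide data back into long form, one row per (Date, Product) pair. A sketch that unpivots the single-aggregation table (wide and long_df are names I've chosen):
wide = df.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')
long_df = wide.reset_index().melt(id_vars='Date', var_name='Product', value_name='Sales')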
String Operations
df = pd.DataFrame({'Name': ['Alice Smith', 'Bob Jones', 'Charlie Brown']})
# String methods
df['Name'].str.lower() # 'alice smith'
df['Name'].str.upper() # 'ALICE SMITH'
df['Name'].str.title() # 'Alice Smith'
df['Name'].str.len() # 11
df['Name'].str.contains('Smith') # True, False, False
df['Name'].str.startswith('A') # True, False, False
df['Name'].str.split(' ') # ['Alice', 'Smith']
# Extract first name
df['First_Name'] = df['Name'].str.split(' ').str[0]
# Replace
df['Name'].str.replace('Smith', 'Johnson')
# Strip whitespace
df['Name'].str.strip()
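For anything beyond fixed substrings, str.extract pulls out regex groups, and named groups become column names:
# One group -> Series (with expand=False)
df['Last_Name'] = df['Name'].str.extract(r'(\w+)$', expand=False)
# Named groups -> DataFrame with 'first' and 'last' columns
df['Name'].str.extract(r'(?P<first>\w+)\s+(?P<last>\w+)')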
Date/Time Operations
# Create date range
dates = pd.date_range('2024-01-01', periods=10, freq='D')
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['DayName'] = df['Date'].dt.day_name()
# Time delta
df['Days_Since'] = (pd.Timestamp.now() - df['Date']).dt.days
# Resample time series (requires a datetime index; select a numeric column)
df.set_index('Date')['Sales'].resample('ME').mean() # Monthly average ('M' is deprecated in newer pandas)
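Closely related to resampling is a rolling window over a datetime index. A minimal sketch with made-up daily data (daily and 7d_avg are names I've chosen):
daily = pd.DataFrame({'Sales': range(10)}, index=pd.date_range('2024-01-01', periods=10, freq='D'))
daily['7d_avg'] = daily['Sales'].rolling(7).mean() # NaN until 7 observations exist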
Practical Examples
Data Cleaning
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['Name'])
# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)
# Change data types
df['Age'] = df['Age'].astype(int)
df['Date'] = pd.to_datetime(df['Date'])
# Strip whitespace from strings
df['Name'] = df['Name'].str.strip()
# Handle outliers
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Salary'] >= Q1 - 1.5*IQR) & (df['Salary'] <= Q3 + 1.5*IQR)]
Data Analysis
# Correlation matrix (numeric columns only)
df.corr(numeric_only=True)
# Cross-tabulation
pd.crosstab(df['Department'], df['City'])
# Value counts
df['City'].value_counts()
# Unique values
df['City'].unique()
df['City'].nunique()
# Percentiles
df['Salary'].quantile([0.25, 0.5, 0.75])
Export Data
# To CSV
df.to_csv('output.csv', index=False)
# To Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)
# To SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
# To JSON
df.to_json('output.json', orient='records')
# To HTML
df.to_html('output.html', index=False)
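For large datasets, Parquet is usually a better interchange format than CSV (columnar, typed, compressed). Note it needs an optional dependency:
# To Parquet (requires pyarrow or fastparquet)
df.to_parquet('output.parquet', index=False)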
Performance Tips
1. Use Vectorized Operations
# ❌ Slow: Loop
for i in range(len(df)):
    df.loc[i, 'New'] = df.loc[i, 'A'] + df.loc[i, 'B']
# ✅ Fast: Vectorized
df['New'] = df['A'] + df['B']
2. Use Categories for Repeated Strings
# ❌ Memory inefficient: repeated strings stored as generic objects
df['City'].dtype # object
# ✅ Memory efficient
df['City'] = df['City'].astype('category')
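You can verify the saving yourself with memory_usage(deep=True), which counts the actual string storage:
before = df['City'].memory_usage(deep=True)
df['City'] = df['City'].astype('category')
after = df['City'].memory_usage(deep=True)
print(before, after) # categories typically use far less for repeated strings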
3. Read Only Needed Columns
# ❌ Read all columns
df = pd.read_csv('large_file.csv')
# ✅ Read specific columns
df = pd.read_csv('large_file.csv', usecols=['Name', 'Age'])
4. Use Chunking for Large Files
# Read in chunks
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    # Process each chunk as it streams in
    processed = chunk[chunk['Age'] > 25]
    chunks.append(processed)
df = pd.concat(chunks)
Common Pitfalls
1. SettingWithCopyWarning
# ❌ Chained assignment
df[df['Age'] > 25]['Salary'] = 50000 # Warning!
# ✅ Use .loc
df.loc[df['Age'] > 25, 'Salary'] = 50000
2. Inplace Operations
# Returns None!
result = df.sort_values('Age', inplace=True)
print(result) # None
# Better: Don't use inplace
df = df.sort_values('Age')
3. Index Confusion
# After filtering, index may have gaps
df = df[df['Age'] > 25]
print(df.index) # [0, 2, 4, 7, ...]
# Reset index
df = df.reset_index(drop=True)
print(df.index) # [0, 1, 2, 3, ...]
Resources
- Official documentation: pandas.pydata.org
- "10 Minutes to pandas": the quick-start guide in the official docs
- The official pandas cheat sheet (linked from pandas.pydata.org)
Conclusion
Pandas is essential for:
- ✅ Data cleaning and preparation
- ✅ Exploratory data analysis
- ✅ Data transformation
- ✅ Statistical analysis
Key Takeaways:
- Use vectorized operations for speed
- Master .loc and .iloc for selection
- Understand groupby for aggregations
- Handle missing data appropriately
- Use categories for memory efficiency
Master Pandas, and you’ll be able to handle any data analysis task in Python!