Working with Pandas DataFrames: Aggregating and Grouping
When working with pandas DataFrames, it’s often necessary to perform aggregations and groupings of data. In this article, we’ll explore how to do so using the groupby function and provide examples for common use cases.
Introduction to GroupBy
The groupby method splits a DataFrame into groups based on the values in one or more columns. Each group is a subset of the original rows, and an operation such as a sum or a mean can then be applied to each group separately.
For example, let’s say we have a DataFrame containing sales data for different regions:
| Region | Sales |
|---|---|
| North | 1000 |
| South | 2000 |
| East | 3000 |
| West | 4000 |
We can use groupby to group this data by region and calculate the total sales for each region.
import pandas as pd
# Create a sample DataFrame
data = {'Region': ['North', 'South', 'East', 'West'],
'Sales': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data)
# Group by Region and calculate total Sales
grouped_df = df.groupby('Region')['Sales'].sum()
print(grouped_df)
Output:
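| Region | Sales |
|---|---|
| East | 3000 |
| North | 1000 |
| South | 2000 |
| West | 4000 |
In this example, we grouped the data by Region and calculated the sum of Sales for each group. The result is a Series indexed by Region, and because groupby sorts the group keys by default, the regions appear in alphabetical order. Each region occurs only once in the sample data, so every sum equals the original value.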
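Grouping is easier to see when a key repeats. The following is a minimal sketch with made-up figures (the values are assumptions, not data from this article) in which each region appears twice, so sum() actually combines rows. It also shows the sort=False argument of groupby, which keeps groups in order of first appearance instead of sorting the keys.

```python
import pandas as pd

# Hypothetical data: each region appears twice, so the sum combines rows
df = pd.DataFrame({
    'Region': ['West', 'East', 'West', 'East'],
    'Sales': [4000, 3000, 500, 250],
})

# Default behaviour: group keys are sorted (East before West)
totals = df.groupby('Region')['Sales'].sum()

# sort=False keeps the groups in order of first appearance (West before East)
totals_unsorted = df.groupby('Region', sort=False)['Sales'].sum()

print(totals)
print(totals_unsorted)
```

Both results hold the same totals; only the order of the index differs.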
Aggregate Functions
When using groupby, you can apply various aggregate functions to your data. These functions determine how to calculate the values for each group. Some common aggregate functions include:
- mean(): Calculate the mean value for each group.
- max(): Calculate the maximum value for each group.
- min(): Calculate the minimum value for each group.
- sum(): Calculate the sum of values for each group.
For example, let’s say we have a DataFrame containing temperatures in different months:
| Month | Temperature |
|---|---|
| Jan | 10 |
| Feb | 20 |
| Mar | 30 |
We can use groupby to group this data by month and calculate the maximum temperature for each month.
import pandas as pd
# Create a sample DataFrame
data = {'Month': ['Jan', 'Feb', 'Mar'],
'Temperature': [10, 20, 30]}
df = pd.DataFrame(data)
# Group by Month and calculate max Temperature
max_temp_df = df.groupby('Month')['Temperature'].max()
print(max_temp_df)
Output:
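| Month | Temperature |
|---|---|
| Feb | 20 |
| Jan | 10 |
| Mar | 30 |
In this example, we grouped the data by Month and calculated the maximum Temperature for each group. The months appear in alphabetical rather than calendar order because groupby sorts the keys as strings by default.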
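Several aggregate functions can also be applied in one pass. As a rough sketch, assuming hypothetical readings with more than one temperature per month, agg accepts a list of function names and returns one column per function:

```python
import pandas as pd

# Hypothetical readings: several temperatures per month
df = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar'],
    'Temperature': [10, 14, 20, 16, 30],
})

# A list of function names yields one output column per function
summary = df.groupby('Month')['Temperature'].agg(['min', 'max', 'mean'])
print(summary)
```

The result is a DataFrame indexed by Month with the columns min, max, and mean.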
Non-Grouped Columns
A DataFrame passed to groupby usually contains columns other than the grouping key. These columns do not determine which groups are created. If you select a specific column after groupby, as in the examples so far, the remaining columns are simply left out of the result.
For example, let’s say we have a DataFrame containing sales data for different regions and want to calculate the total sales for each region while including some non-grouped columns.
import pandas as pd
# Create a sample DataFrame
data = {'Region': ['North', 'South', 'East', 'West'],
'Sales': [1000, 2000, 3000, 4000],
'Other Column': ['X', 'Y', 'Z', 'A']}
df = pd.DataFrame(data)
# Group by Region and calculate total Sales
grouped_df = df.groupby('Region')['Sales'].sum()
print(grouped_df)
Output:
| Region | Sales |
|---|---|
| East | 3000 |
| North | 1000 |
| South | 2000 |
| West | 4000 |
In this example, we grouped the data by Region and calculated the sum of Sales. Other Column was neither a grouping key nor selected for aggregation, so it does not appear in the result.
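If an extra column should survive the aggregation, one option is pandas named aggregation, where each keyword passed to agg defines an output column from a (source column, function) pair. The sketch below is an illustration only: the Manager column and its values are hypothetical, and taking 'first' is only meaningful when the value is constant within each group.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [1000, 2000, 500, 250],
    'Manager': ['Ann', 'Bob', 'Ann', 'Bob'],  # hypothetical extra column
})

# Named aggregation: keyword = (source column, aggregate function).
# as_index=False keeps Region as a regular column instead of the index.
result = df.groupby('Region', as_index=False).agg(
    total_sales=('Sales', 'sum'),
    manager=('Manager', 'first'),
)
print(result)
```

When the extra column varies within a group, 'first' picks an arbitrary representative, which is the situation the sort-and-merge approach in the Solution section addresses.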
Merging DataFrames
When working with multiple DataFrames, you may need to merge them together based on a common column. In our previous examples, we only worked with one DataFrame at a time.
Let’s say we have two DataFrames: df1 containing sales data and df2 containing customer information. We want to merge these DataFrames together based on the Region column.
import pandas as pd
# Create sample DataFrames
data1 = {'Region': ['North', 'South', 'East', 'West'],
'Sales': [1000, 2000, 3000, 4000]}
df1 = pd.DataFrame(data1)
data2 = {'Region': ['North', 'South', 'East', 'West'],
'Customer ID': [1, 2, 3, 4]}
df2 = pd.DataFrame(data2)
# Merge DataFrames based on Region
merged_df = df1.merge(df2, on='Region')
print(merged_df)
Output:
| Region | Sales | Customer ID |
|---|---|---|
| North | 1000 | 1 |
| South | 2000 | 2 |
| East | 3000 | 3 |
| West | 4000 | 4 |
In this example, we merged df1 and df2 together based on the Region column.
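The example above works cleanly because both DataFrames contain exactly the same regions. As a hedged sketch of what happens otherwise, the hypothetical frames below share only two regions. By default merge performs an inner join and drops unmatched keys, while how='left' keeps every row of the left frame and indicator=True adds a _merge column showing where each row came from.

```python
import pandas as pd

# Hypothetical frames whose Region values only partly overlap
sales = pd.DataFrame({'Region': ['North', 'South', 'East'],
                      'Sales': [1000, 2000, 3000]})
customers = pd.DataFrame({'Region': ['North', 'South', 'West'],
                          'Customer ID': [1, 2, 4]})

# Default how='inner': only regions present in both frames survive
inner = sales.merge(customers, on='Region')

# how='left': keep all sales rows; missing customer data becomes NaN
left = sales.merge(customers, on='Region', how='left', indicator=True)

print(inner)
print(left)
```

Here East has no match in customers, so it is absent from the inner result and appears with a missing Customer ID in the left result.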
Solution
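The underlying problem is how to aggregate and group data in a pandas DataFrame while bringing along columns that are neither aggregated nor used for grouping. In the example below, we want each month's highest and lowest temperature together with the week in which each occurred. The solution uses the sort_values, drop_duplicates, and merge functions to achieve this.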
Here’s the complete code:
import pandas as pd
# Create sample DataFrame
data = {'month': pd.Series(['jan', 'jan', 'feb', 'feb']),
'week' : pd.Series(['wk1', 'wk2', 'wk1', 'wk2']),
'high_temp' : pd.Series([10, 20, 30, 20]),
'low_temp' : pd.Series([4, 5, 23, 40])}
df = pd.DataFrame(data)
# Highest high_temp per month: sort ascending, then keep each month's last row
high = df.sort_values('high_temp').drop_duplicates('month', keep='last')[['month', 'high_temp', 'week']]
# Lowest low_temp per month: sort ascending, then keep each month's first row
low = df.sort_values('low_temp').drop_duplicates('month', keep='first')[['month', 'low_temp', 'week']]
# Merge on month; the suffixes rename the two week columns
# to week_high_temp and week_low_temp
new_df = high.merge(low, on='month', suffixes=('_high_temp', '_low_temp'))
print(new_df)
Output:
| month | high_temp | week_high_temp | low_temp | week_low_temp |
|---|---|---|---|---|
| jan | 20 | wk2 | 4 | wk1 |
| feb | 30 | wk1 | 23 | wk1 |
This solution builds two intermediate DataFrames. The first sorts by high_temp and keeps the last row for each month, which is the row with the highest high_temp. The second sorts by low_temp and keeps the first row for each month, which is the row with the lowest low_temp. Finally, it merges the two on month. Because both sides carry a week column, the suffixes argument turns them into week_high_temp and week_low_temp.
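As an alternative to sorting and dropping duplicates, the same result can be sketched with groupby and idxmax/idxmin, which return the row label of each group's maximum or minimum. This is one possible variant rather than the method used above. It assumes the DataFrame has a unique index, and the intermediate names high_idx, low_idx, high, and low are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['jan', 'jan', 'feb', 'feb'],
    'week': ['wk1', 'wk2', 'wk1', 'wk2'],
    'high_temp': [10, 20, 30, 20],
    'low_temp': [4, 5, 23, 40],
})

# Row labels of each month's highest high_temp and lowest low_temp
high_idx = df.groupby('month')['high_temp'].idxmax()
low_idx = df.groupby('month')['low_temp'].idxmin()

# Look up the full rows so the matching week comes along
high = df.loc[high_idx, ['month', 'high_temp', 'week']]
low = df.loc[low_idx, ['month', 'low_temp', 'week']]

# Both sides have a week column, so the suffixes disambiguate them
result = high.merge(low, on='month', suffixes=('_high_temp', '_low_temp'))
print(result)
```

The values match the table above, although the rows come out in sorted key order (feb before jan) because groupby sorts the months.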
Last modified on 2025-02-18