Splitting Object Data into New Columns in a DataFrame Using pandas and the json_normalize() Function

Splitting Object Data into New Columns in a DataFrame
===========================================================

In this article, we will explore how to split object data into new columns in a pandas DataFrame. We will use the pd.json_normalize() function to achieve this.

Background

Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures and functions designed to make working with structured data easy and efficient. A DataFrame column can also hold arbitrary Python objects (dtype object), such as dictionaries or parsed JSON records.

When dealing with object data, it’s often useful to extract specific values from these objects and create new columns in a DataFrame. This can be particularly useful when working with complex data structures, such as JSON data.

The Problem


Let’s consider an example DataFrame df that contains business data with attributes:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'attributes': [
        {"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True"},
        {"HasTV":"False", "Ambience":{'romantic': False, 'intimate': False,},"Price":"2"},
    ],
    'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'],
})

We want one new column per attribute, holding that attribute's value for each business (identified by business_id). If a business does not have a given attribute, we want to assign it a default value of False.

Solution


To achieve this, we will use the pd.json_normalize() function, which flattens dictionaries (including nested ones) into a table with one column per key.
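As a quick illustration of what pd.json_normalize() does on its own, the sketch below passes it a list of dictionaries shaped like the attributes column. The sample values are trimmed for the example. Nested keys are flattened into dotted column names, and keys missing from a record come back as NaN.

```python
import pandas as pd

# Two records shaped like the `attributes` column (values trimmed for illustration)
records = [
    {"HasTV": "False", "Caters": "True"},
    {"HasTV": "False", "Ambience": {"romantic": False, "intimate": False}, "Price": "2"},
]

# Each dict becomes one row; nested dicts are flattened with "." as the separator
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat)
```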

Function to Normalize JSON Data


We will define a small helper, json_normalize(), that wraps pd.json_normalize(). It first checks whether the input is a string and, if so, parses it into a dictionary with json.loads(). It then passes the dictionary to pd.json_normalize(), which returns a one-row DataFrame.

import json
import pandas as pd

def json_normalize(s):
    # Attributes stored as JSON text are parsed into a dict first
    if isinstance(s, str):
        s = json.loads(s)
    # Flatten the dict (nested keys become dotted column names)
    return pd.json_normalize(s)
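To see why the string check matters, here is a minimal sketch that feeds the helper the same record twice, once as JSON text and once as a dictionary. The two inputs are hypothetical. Note that json.loads() expects valid JSON (double-quoted keys, lowercase false), so a string holding a Python-style dict with single quotes would not parse this way.

```python
import json
import pandas as pd

def json_normalize(s):
    # Same helper as above, repeated so this snippet runs on its own
    if isinstance(s, str):
        s = json.loads(s)
    return pd.json_normalize(s)

# Hypothetical inputs: the same record as a JSON string and as a dict
as_text = '{"HasTV": "False", "Ambience": {"romantic": false}}'
as_dict = {"HasTV": "False", "Ambience": {"romantic": False}}

row_from_text = json_normalize(as_text)
row_from_dict = json_normalize(as_dict)

# Both calls yield a one-row DataFrame with the same flattened columns
print(row_from_text)
print(row_from_dict)
```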

Creating New Columns


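For a handful of known attributes, the simplest approach does not need the helper at all: we can read each key from the attributes column with the dictionary .get() method, which lets us supply a default for businesses that lack the attribute.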

df['WiFi'] = df['attributes'].apply(lambda x: x.get('WiFi', 'False'))
df['HasTV'] = df['attributes'].apply(lambda x: x.get('HasTV', 'False'))
df['RestaurantsTableService'] = df['attributes'].apply(lambda x: x.get('RestaurantsTableService', 'False'))
df['Caters'] = df['attributes'].apply(lambda x: x.get('Caters', 'False'))
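The four assignments above follow the same pattern, so they can be sketched as a loop over the keys of interest. The DataFrame below is a trimmed stand-in for the article's data, and the short business_id values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "attributes": [
        {"HasTV": "False", "Caters": "True"},
        {"HasTV": "False", "Price": "2"},
    ],
    "business_id": ["a1", "b2"],  # hypothetical short ids
})

# Flat attributes to pull out; missing ones fall back to the string 'False'
flat_keys = ["WiFi", "HasTV", "RestaurantsTableService", "Caters"]

for key in flat_keys:
    # One lookup per row, with the same default the article uses
    df[key] = [d.get(key, "False") for d in df["attributes"]]

print(df[["business_id"] + flat_keys])
```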

Nested attributes can be read the same way by chaining .get() calls, with an empty dictionary as the intermediate default:

df['Ambience.romantic'] = df['attributes'].apply(lambda x: x.get('Ambience', {}).get('romantic', 'False'))
df['Ambience.intimate'] = df['attributes'].apply(lambda x: x.get('Ambience', {}).get('intimate', 'False'))
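For deeper nesting, chained .get() calls become hard to read. One possible alternative is a small lookup helper that walks a dotted path. The name get_nested and its signature are hypothetical, not part of pandas, and this sketch assumes a boolean False default rather than the string 'False' used above.

```python
import pandas as pd

def get_nested(d, path, default=False):
    """Walk a dotted `path` through nested dicts; return `default` if a key is missing."""
    current = d
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

# Trimmed sample: the first business has no Ambience entry at all
attributes = pd.Series([
    {"HasTV": "False"},
    {"Ambience": {"romantic": False, "intimate": True}},
])

romantic = attributes.apply(lambda d: get_nested(d, "Ambience.romantic"))
intimate = attributes.apply(lambda d: get_nested(d, "Ambience.intimate"))

print(romantic.tolist())
print(intimate.tolist())
```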

Creating the Final DataFrame


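Listing every key by hand does not scale, so to build the full table we can apply the json_normalize() helper to each row and stack the resulting one-row DataFrames with pd.concat().

# Normalize each row's attributes, then stack the one-row frames keyed by business_id
df2 = pd.concat([json_normalize(a) for a in df['attributes']], keys=df['business_id'].tolist()).droplevel(1)

This creates a new DataFrame df2 indexed by business_id, with one column per attribute. Nested keys appear as dotted names such as Ambience.romantic, and attributes a business lacks appear as NaN, which can then be replaced with a default such as False.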
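As an alternative sketch, the whole column can be flattened in a single pd.json_normalize() call and then placed next to business_id. The data below is a trimmed stand-in with hypothetical ids. Casting to object before filling is a defensive choice made here so that False can sit alongside string values regardless of how pandas infers the column types.

```python
import pandas as pd

df = pd.DataFrame({
    "attributes": [
        {"HasTV": "False", "Caters": "True"},
        {"HasTV": "False", "Ambience": {"romantic": False}, "Price": "2"},
    ],
    "business_id": ["a1", "b2"],  # hypothetical short ids
})

# Flatten all attribute dicts at once; row order matches df
flat = pd.json_normalize(df["attributes"].tolist()).astype(object)

# Attributes a business lacks are NaN after normalization; use False as the default
flat = flat.where(flat.notna(), False)

# Both frames share the default 0..n-1 index, so a column-wise concat lines rows up
result = pd.concat([df[["business_id"]], flat], axis=1)

print(result)
```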

Example Use Cases


Here are some example use cases for this technique:

  • JSON Data: When working with JSON data, it’s often useful to extract specific values and create new columns.
  • Nested Objects: When dealing with nested objects, we can use the .get() method to access nested attributes.

Conclusion


In conclusion, splitting object data into new columns in a pandas DataFrame is a common task when working with complex data structures. By using pd.json_normalize(), optionally wrapped in a small helper that parses JSON strings, or plain .get() lookups, we can create one column per attribute that holds the value for each business and a default where the attribute is missing.

This technique can be applied to various use cases, including working with JSON data and nested objects. With practice and experience, you’ll become proficient in handling complex data structures and extracting valuable insights from your data.


Last modified on 2024-08-22