Replacing Countries with 'Other' in a Pandas DataFrame


In this tutorial, we will walk through replacing specific values in a pandas DataFrame column based on a condition. In our example, every country other than ‘India’ and ‘U.S.A’ is replaced with ‘Other’.

Introduction

Pandas is a powerful library used for data manipulation and analysis in Python. It provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.

One of the common use cases of pandas is replacing values in a column based on certain conditions. In this tutorial, we will explore how to replace countries other than ‘India’ and ‘U.S.A’ with ‘Other’ in a pandas DataFrame.

Background

Before diving into the code, let’s review the data structures and methods we will use.

Series: A Series is a one-dimensional labeled array of values. It can be thought of as a column in a spreadsheet.
DataFrame: A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Each row represents a single observation, and each column represents a variable.
isin() function: The isin() method returns a boolean Series indicating, for each element, whether it is present in the given collection of values.
mask() function: The mask() method takes a boolean condition as its first argument and an optional replacement value as its second. Where the condition is True, the original value is replaced; where it is False, the original value is kept. If no replacement is given, the replaced positions become missing values (NaN).
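The behavior of isin() and mask() described above can be seen on a small Series. This is a minimal sketch; the Series name countries and its four values are made up for illustration.

```python
import pandas as pd

# Illustrative data, not the tutorial's DataFrame
countries = pd.Series(["India", "France", "U.S.A", "Morocco"])

# isin() gives True where the value appears in the list
keep = countries.isin(["India", "U.S.A"])

# mask() replaces values where the condition is True.
# With no replacement argument, those positions become NaN.
masked = countries.mask(~keep)

# With a replacement argument, they receive that value instead.
replaced = countries.mask(~keep, "Other")

print(keep.tolist())
print(masked.tolist())
print(replaced.tolist())
```

Note the ~ operator: it inverts the boolean Series so that the condition is True for the countries we want to replace.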

Step 1: Importing Libraries

Before we begin, let’s import the necessary libraries.

import pandas as pd
import numpy as np

pandas is the library we will use for data manipulation and analysis.
numpy is used for generating random numbers.

Step 2: Creating a Sample DataFrame

Let’s create a sample DataFrame with countries in column ‘Q0_0’ and some other columns.

# Create a sample DataFrame
df = pd.DataFrame({
    'Q0_0': ["India", "Algeria", "India", "U.S.A", "Morocco", "Tunisia", "U.S.A", "France", "Russia", "Algeria"],
    'Q1_1': [np.random.randint(1,100) for i in range(10)],
    'Q1_2': np.random.random(10),
    'Q1_3': np.random.randint(2, size=10),
    'Q2_1': [np.random.randint(1,100) for i in range(10)],
    'Q2_2': np.random.random(10),
    'Q2_3': np.random.randint(2, size=10)
})

This DataFrame has 10 rows and 7 columns: the country column ‘Q0_0’ and six numeric columns filled with random values.
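Because the numeric columns are random, their values change on every run. If reproducible output is useful, one option is a seeded generator. The sketch below assumes NumPy's Generator API (np.random.default_rng) and an arbitrary seed of 42; it builds a shortened version of the sample with only three numeric columns.

```python
import numpy as np
import pandas as pd

# The seed value 42 is arbitrary; any fixed integer gives repeatable output
rng = np.random.default_rng(42)

df = pd.DataFrame({
    "Q0_0": ["India", "Algeria", "India", "U.S.A", "Morocco",
             "Tunisia", "U.S.A", "France", "Russia", "Algeria"],
    "Q1_1": rng.integers(1, 100, size=10),  # integers from 1 to 99
    "Q1_2": rng.random(10),                 # floats in [0, 1)
    "Q1_3": rng.integers(0, 2, size=10),    # 0 or 1
})

# shape is (rows, columns)
print(df.shape)
```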

Step 3: Replacing Countries

Now, let’s replace countries other than ‘India’ and ‘U.S.A’ with ‘Other’.

# Replace countries other than 'India' and 'U.S.A' with 'Other'
df["Q0_0"] = df["Q0_0"].mask(~df["Q0_0"].isin(["India", "U.S.A"])).fillna("Other")

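This code works in two steps. First, isin() marks the rows whose country is ‘India’ or ‘U.S.A’, and the ~ operator inverts that result so it is True for every other country. mask() then sets those rows to NaN, because no replacement value is passed. Second, fillna() replaces the resulting missing values with ‘Other’.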
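The same result can be reached in a single step by giving mask() a replacement value, or by using where(), which keeps values where the condition is True. The following sketch compares the three forms on the country column only; the variable names two_step, one_step, and with_where are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Q0_0": ["India", "Algeria", "India", "U.S.A", "Morocco",
             "Tunisia", "U.S.A", "France", "Russia", "Algeria"],
})

# True for the countries we want to keep
keep = df["Q0_0"].isin(["India", "U.S.A"])

# Form used in this tutorial: mask to NaN, then fill
two_step = df["Q0_0"].mask(~keep).fillna("Other")

# mask() with an explicit replacement value
one_step = df["Q0_0"].mask(~keep, "Other")

# where() keeps values where the condition is True, so no inversion is needed
with_where = df["Q0_0"].where(keep, "Other")

print(two_step.equals(one_step) and one_step.equals(with_where))
```

Which form to use is mostly a matter of readability; where() avoids the ~ inversion, which some readers find easier to follow.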

Step 4: Printing the DataFrame

Finally, let’s print the modified DataFrame.

# Print the modified DataFrame
print(df)

This displays the modified DataFrame, in which every country other than ‘India’ and ‘U.S.A’ has been replaced with ‘Other’. The original values are no longer shown, because the ‘Q0_0’ column was overwritten in Step 3.
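Besides printing the whole DataFrame, a quick way to confirm the replacement is to count the distinct values left in the column. This is a minimal sketch that rebuilds only the country column from the sample data and uses value_counts().

```python
import pandas as pd

df = pd.DataFrame({
    "Q0_0": ["India", "Algeria", "India", "U.S.A", "Morocco",
             "Tunisia", "U.S.A", "France", "Russia", "Algeria"],
})

# Same replacement as in Step 3
df["Q0_0"] = df["Q0_0"].mask(~df["Q0_0"].isin(["India", "U.S.A"])).fillna("Other")

# Count how many rows fall into each remaining category
counts = df["Q0_0"].value_counts()
print(counts)
```

For the sample data, only three categories should remain: ‘India’, ‘U.S.A’, and ‘Other’.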

Conclusion

In this tutorial, we have learned how to replace specific values in a pandas DataFrame column based on a condition. We used an example where countries other than ‘India’ and ‘U.S.A’ are replaced with ‘Other’. This is a common task when grouping rare categories before analysis. The pattern combines isin() to build the condition, mask() to blank out the non-matching rows, and fillna() to assign the replacement value.

Note that this code assumes that the country names are exact matches. If you want to replace countries based on partial matches or other criteria, you will need to adjust the code accordingly.
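As one possible adjustment, the names can be normalized before matching so that differences in case, surrounding whitespace, or periods do not matter. The sketch below is only an illustration: the input values and the normalization rules (strip, remove periods, lowercase) are assumptions, and real data may need different rules.

```python
import pandas as pd

# Hypothetical messy input, not part of the tutorial's sample data
raw = pd.Series(["india", " India ", "U.S.A", "USA", "France"])

# Normalize: trim whitespace, drop periods, lowercase
normalized = (
    raw.str.strip()
       .str.replace(".", "", regex=False)
       .str.lower()
)

# Match against normalized forms of the countries to keep
keep = normalized.isin(["india", "usa"])

# Keep the original spelling where matched, otherwise use 'Other'
result = raw.where(keep, "Other")
print(result.tolist())
```

This version keeps the original spelling of matched rows. To standardize the spelling as well, the normalized values could be mapped back to canonical names in a further step.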

Last modified on 2025-02-09