JOIN In Pandas

This post is a practical, SQL-first guide to understanding:

LEFT JOIN equivalent in pandas
Which table’s column appears in the output
What happens when both tables have the same column names
How to keep only LEFT table data
When you should not merge at all

LEFT JOIN: SQL vs Pandas

SQL

SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;

Pandas Equivalent

result = pd.merge(
    orders,
    customers,
    how="left",
    on="customer_id"
)

✔ All rows from the left table are preserved
✔ Matching rows from the right table are added
✔ Non-matching right columns become NaN

Which Table Does the Join Column Come From?

👉 Always from the LEFT table

What Happens When Both Tables Have the Same Column Names?

This is where most SQL users get tripped up.

Example

left_df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "status": ["Active", "Active", "Inactive"]
})

right_df = pd.DataFrame({
    "customer_id": [101, 102],
    "status": ["Gold", "Silver"]
})

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id"
)

Output

   customer_id status_x status_y
0          101   Active     Gold
1          102   Active   Silver
2          103 Inactive       NaN

Meaning

status_x → from LEFT table
status_y → from RIGHT table
_x and _y are pandas’ default conflict resolution

SQL hides this with aliases.
Pandas makes it explicit.

Using Custom Suffixes (Strongly Recommended)

Never accept _x and _y in production code.

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("_orders", "_customers")
)

Output

customer_id | status_orders | status_customers

Readable. Maintainable. Safe.

What If Join Columns Have Different Names?

pd.merge(
    left_df,
    right_df,
    how="left",
    left_on="customer_id",
    right_on="cust_id"
)

➡️ Both key columns appear.
Usually, you validate and drop the right one:

result.drop(columns=["cust_id"], inplace=True)

How to Keep ONLY LEFT Table Data (Very Common Requirement)

Sometimes you need:

LEFT JOIN logic
❌ No columns from the right table
❌ No _x, _y

Here are the right ways to do it.

✅ Method 1: Merge + Explicitly Keep LEFT Columns (Best Practice)

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("_left", "_right")
)

result = result[
    [col for col in result.columns 
     if col.endswith("_left") or col == "customer_id"]
]

result.columns = [col.replace("_left", "") for col in result.columns]

✔ LEFT join preserved
✔ RIGHT table used only for matching
✔ Clean final schema

✅ Method 2: Disposable Suffix Technique (Simple & Practical)

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("", "_drop")
)

result = result.drop(
    columns=[col for col in result.columns if col.endswith("_drop")]
)

🧠 Treat the right table as temporary validation data.

✅ Method 3: Don’t Merge at All (SQL-Style Optimization)

If your goal is only filtering, merging is unnecessary.

SQL

SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NOT NULL;

Pandas

result = left_df[
    left_df["customer_id"].isin(right_df["customer_id"])
]

🚀 Faster
🚀 Cleaner
🚀 No column conflicts

When to Use What

Requirement	Best Solution
Need right-side data	`merge()`
Only filter left rows	`isin()`
Same column names	suffix + drop
Production pipelines	rename before merge

What NOT to Do

pd.merge(left_df, right_df, how="left", on="customer_id")

…and then live with:

status_x | status_y

That’s technical debt.

SQL vs Pandas: The Mindset Shift

SQL is declarative
Pandas is explicit

Pandas forces you to:

Think about column ownership
Be intentional about schemas
Clean up after joins

Once you accept this, pandas becomes incredibly powerful.

SQL → Pandas Join Mapping

SQL	Pandas
LEFT JOIN	`merge(how="left")`
INNER JOIN	`merge(how="inner")`
RIGHT JOIN	`merge(how="right")`
FULL OUTER JOIN	`merge(how="outer")`

Final Takeaway

If you only want LEFT table data, don’t blindly merge.

Filter when possible
Merge intentionally
Drop aggressively
Never accept _x / _y in final outputs

This one habit alone will save you hours of debugging in pandas.

Karthi Baskar Blogs

Search This Blog

JOIN In Pandas

LEFT JOIN: SQL vs Pandas

SQL

Pandas Equivalent

Which Table Does the Join Column Come From?

What Happens When Both Tables Have the Same Column Names?

Example

Output

Meaning

Using Custom Suffixes (Strongly Recommended)

Output

What If Join Columns Have Different Names?

How to Keep ONLY LEFT Table Data (Very Common Requirement)

✅ Method 1: Merge + Explicitly Keep LEFT Columns (Best Practice)

✅ Method 2: Disposable Suffix Technique (Simple & Practical)

✅ Method 3: Don’t Merge at All (SQL-Style Optimization)

SQL

Pandas

When to Use What

What NOT to Do

SQL vs Pandas: The Mindset Shift

SQL → Pandas Join Mapping

Final Takeaway

Comments

Post a Comment