JOIN In Pandas

This post is a practical, SQL-first guide to understanding:

  • LEFT JOIN equivalent in pandas

  • Which table’s column appears in the output

  • What happens when both tables have the same column names

  • How to keep only LEFT table data

  • When you should not merge at all


LEFT JOIN: SQL vs Pandas

SQL

SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;

Pandas Equivalent

result = pd.merge(
    orders,
    customers,
    how="left",
    on="customer_id"
)

✔ All rows from the left table are preserved
✔ Matching rows from the right table are added
✔ Non-matching right columns become NaN


Which Table Does the Join Column Come From?

πŸ‘‰ Always from the LEFT table


What Happens When Both Tables Have the Same Column Names?

This is where most SQL users get tripped up.

Example

left_df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "status": ["Active", "Active", "Inactive"]
})

right_df = pd.DataFrame({
    "customer_id": [101, 102],
    "status": ["Gold", "Silver"]
})

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id"
)

Output

   customer_id status_x status_y
0          101   Active     Gold
1          102   Active   Silver
2          103 Inactive       NaN

Meaning

  • status_x → from LEFT table

  • status_y → from RIGHT table

  • _x and _y are pandas’ default conflict resolution

SQL hides this with aliases.
Pandas makes it explicit.


Using Custom Suffixes (Strongly Recommended)

Never accept _x and _y in production code.

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("_orders", "_customers")
)

Output

customer_id | status_orders | status_customers

Readable. Maintainable. Safe.


What If Join Columns Have Different Names?

pd.merge(
    left_df,
    right_df,
    how="left",
    left_on="customer_id",
    right_on="cust_id"
)

➡️ Both key columns appear.
Usually, you validate and drop the right one:

result.drop(columns=["cust_id"], inplace=True)

How to Keep ONLY LEFT Table Data (Very Common Requirement)

Sometimes you need:

  • LEFT JOIN logic

  • ❌ No columns from the right table

  • ❌ No _x, _y

Here are the right ways to do it.


✅ Method 1: Merge + Explicitly Keep LEFT Columns (Best Practice)

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("_left", "_right")
)

result = result[
    [col for col in result.columns 
     if col.endswith("_left") or col == "customer_id"]
]

result.columns = [col.replace("_left", "") for col in result.columns]

✔ LEFT join preserved
✔ RIGHT table used only for matching
✔ Clean final schema


✅ Method 2: Disposable Suffix Technique (Simple & Practical)

result = pd.merge(
    left_df,
    right_df,
    how="left",
    on="customer_id",
    suffixes=("", "_drop")
)

result = result.drop(
    columns=[col for col in result.columns if col.endswith("_drop")]
)

🧠 Treat the right table as temporary validation data.


✅ Method 3: Don’t Merge at All (SQL-Style Optimization)

If your goal is only filtering, merging is unnecessary.

SQL

SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NOT NULL;

Pandas

result = left_df[
    left_df["customer_id"].isin(right_df["customer_id"])
]

πŸš€ Faster
πŸš€ Cleaner
πŸš€ No column conflicts


When to Use What

RequirementBest Solution
Need right-side data    merge()
Only filter left rows    isin()
Same column names        suffix + drop
Production pipelinesrename before merge

What NOT to Do

pd.merge(left_df, right_df, how="left", on="customer_id")

…and then live with:

status_x | status_y

That’s technical debt.


SQL vs Pandas: The Mindset Shift

  • SQL is declarative

  • Pandas is explicit

Pandas forces you to:

  • Think about column ownership

  • Be intentional about schemas

  • Clean up after joins

Once you accept this, pandas becomes incredibly powerful.


SQL → Pandas Join Mapping

SQLPandas
LEFT JOINmerge(how="left")
INNER JOINmerge(how="inner")
RIGHT JOINmerge(how="right")
FULL OUTER JOINmerge(how="outer")

Final Takeaway

If you only want LEFT table data, don’t blindly merge.

  • Filter when possible

  • Merge intentionally

  • Drop aggressively

  • Never accept _x / _y in final outputs

This one habit alone will save you hours of debugging in pandas.

Comments