This post is a practical, SQL-first guide to understanding:
LEFT JOIN equivalent in pandas
Which table’s column appears in the output
What happens when both tables have the same column names
How to keep only LEFT table data
When you should not merge at all
LEFT JOIN: SQL vs Pandas
SQL
SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;
Pandas Equivalent
result = pd.merge(
orders,
customers,
how="left",
on="customer_id"
)
✔ All rows from the left table are preserved
✔ Matching rows from the right table are added
✔ Non-matching right columns become NaN
Which Table Does the Join Column Come From?
π Always from the LEFT table
What Happens When Both Tables Have the Same Column Names?
This is where most SQL users get tripped up.
Example
left_df = pd.DataFrame({
"customer_id": [101, 102, 103],
"status": ["Active", "Active", "Inactive"]
})
right_df = pd.DataFrame({
"customer_id": [101, 102],
"status": ["Gold", "Silver"]
})
result = pd.merge(
left_df,
right_df,
how="left",
on="customer_id"
)
Output
customer_id status_x status_y
0 101 Active Gold
1 102 Active Silver
2 103 Inactive NaN
Meaning
status_x→ from LEFT tablestatus_y→ from RIGHT table_xand_yare pandas’ default conflict resolution
SQL hides this with aliases.
Pandas makes it explicit.
Using Custom Suffixes (Strongly Recommended)
Never accept _x and _y in production code.
result = pd.merge(
left_df,
right_df,
how="left",
on="customer_id",
suffixes=("_orders", "_customers")
)
Output
customer_id | status_orders | status_customers
Readable. Maintainable. Safe.
What If Join Columns Have Different Names?
pd.merge(
left_df,
right_df,
how="left",
left_on="customer_id",
right_on="cust_id"
)
➡️ Both key columns appear.
Usually, you validate and drop the right one:
result.drop(columns=["cust_id"], inplace=True)
How to Keep ONLY LEFT Table Data (Very Common Requirement)
Sometimes you need:
LEFT JOIN logic
❌ No columns from the right table
❌ No
_x,_y
Here are the right ways to do it.
✅ Method 1: Merge + Explicitly Keep LEFT Columns (Best Practice)
result = pd.merge(
left_df,
right_df,
how="left",
on="customer_id",
suffixes=("_left", "_right")
)
result = result[
[col for col in result.columns
if col.endswith("_left") or col == "customer_id"]
]
result.columns = [col.replace("_left", "") for col in result.columns]
✔ LEFT join preserved
✔ RIGHT table used only for matching
✔ Clean final schema
✅ Method 2: Disposable Suffix Technique (Simple & Practical)
result = pd.merge(
left_df,
right_df,
how="left",
on="customer_id",
suffixes=("", "_drop")
)
result = result.drop(
columns=[col for col in result.columns if col.endswith("_drop")]
)
π§ Treat the right table as temporary validation data.
✅ Method 3: Don’t Merge at All (SQL-Style Optimization)
If your goal is only filtering, merging is unnecessary.
SQL
SELECT *
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NOT NULL;
Pandas
result = left_df[
left_df["customer_id"].isin(right_df["customer_id"])
]
π Faster
π Cleaner
π No column conflicts
When to Use What
| Requirement | Best Solution |
|---|---|
| Need right-side data | merge() |
| Only filter left rows | isin() |
| Same column names | suffix + drop |
| Production pipelines | rename before merge |
What NOT to Do
pd.merge(left_df, right_df, how="left", on="customer_id")
…and then live with:
status_x | status_y
That’s technical debt.
SQL vs Pandas: The Mindset Shift
SQL is declarative
Pandas is explicit
Pandas forces you to:
Think about column ownership
Be intentional about schemas
Clean up after joins
Once you accept this, pandas becomes incredibly powerful.
SQL → Pandas Join Mapping
| SQL | Pandas |
|---|---|
| LEFT JOIN | merge(how="left") |
| INNER JOIN | merge(how="inner") |
| RIGHT JOIN | merge(how="right") |
| FULL OUTER JOIN | merge(how="outer") |
Final Takeaway
If you only want LEFT table data, don’t blindly merge.
Filter when possible
Merge intentionally
Drop aggressively
Never accept
_x/_yin final outputs
This one habit alone will save you hours of debugging in pandas.
Comments
Post a Comment