From 48 Hours to 6 Minutes: My Journey Optimizing a Pandas Reconciliation Process for Large-Scale Data

10 / Jul / 2025 by Vijay Pal Vishwakarma

Introduction

I recently went through a wild but rewarding ride optimizing a heavy pandas-based reconciliation workflow. What started as a slow, clunky piece of code eating up over 48 hours ended up polished into a lean, mean 6-minute machine.
I wanted to share my story – not as a tutorial, but as a real-world case study from the trenches. If you’ve ever felt your pandas code was taking forever, this post is for you.

The Problem: Reconciling Huge DataFrames

I was working with two large pandas DataFrames – let’s call them df1 and df2 – that I needed to reconcile under two categories:

  1. Exact Match
  2. Tolerance-Based Match

The reconciliation keys were mostly PAN, TAN, TAXABLE_VALUE, TDS and DATE, but for the second category, the TAXABLE_VALUE & TDS match allowed a tolerance window rather than an exact match. The date tolerance applied in both categories (i.e., date was never a strict equality match). A minimal sketch of both rules follows the list below.

Here’s how it looked:

  • category_1: Grouped by [PAN, TAN, TAXABLE_VALUE, TDS]
  • category_2: Grouped by just [PAN, TAN], since we wanted more flexibility with TAXABLE_VALUE & TDS tolerance
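
To make the two rules concrete, here is a minimal sketch of the match conditions for a single candidate pair of rows. The tolerance values (VALUE_TOL, TDS_TOL) and the date window (LOWER_BOUND, UPPER_BOUND) are illustrative placeholders, not the actual production thresholds.

# Illustrative match rules for one (df1 row, df2 row) candidate pair.
# VALUE_TOL, TDS_TOL, LOWER_BOUND, UPPER_BOUND are placeholder values.
VALUE_TOL, TDS_TOL = 10, 5        # assumed tolerance on TAXABLE_VALUE and TDS
LOWER_BOUND, UPPER_BOUND = -3, 3  # assumed date-tolerance window, in days

def is_exact_match(r1, r2):
    # Category 1: PAN, TAN, TAXABLE_VALUE and TDS must match exactly;
    # only the date may drift within the window.
    delta_days = (r2["date"] - r1["date"]).days
    return (
        r1["PAN"] == r2["PAN"] and r1["TAN"] == r2["TAN"]
        and r1["TAXABLE_VALUE"] == r2["TAXABLE_VALUE"] and r1["TDS"] == r2["TDS"]
        and LOWER_BOUND <= delta_days <= UPPER_BOUND
    )

def is_tolerance_match(r1, r2):
    # Category 2: PAN and TAN must match; TAXABLE_VALUE, TDS and the date
    # may each differ within their respective tolerance windows.
    delta_days = (r2["date"] - r1["date"]).days
    return (
        r1["PAN"] == r2["PAN"] and r1["TAN"] == r2["TAN"]
        and abs(r1["TAXABLE_VALUE"] - r2["TAXABLE_VALUE"]) <= VALUE_TOL
        and abs(r1["TDS"] - r2["TDS"]) <= TDS_TOL
        and LOWER_BOUND <= delta_days <= UPPER_BOUND
    )

These checks are written row-wise purely for readability; as the rest of this post shows, the production path ends up avoiding row-wise logic entirely.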

A Look into Data Complexity

The data wasn’t just large – it was messy and unbalanced, which made the problem more challenging:

  • 3.9 million rows in one DataFrame (df1) and 770,000 rows in the other DataFrame (df2)
  • Duplicate entries on both sides for the same PAN-TAN combinations
  • Multiple TANs under a single PAN, not always aligned between df1 and df2
  • Skewed distribution – some PAN-TAN groups had thousands of rows, others had only a few

This resulted in combinatorial explosions when matching duplicate groups – especially during tolerance-based reconciliation – and made naive multiprocessing very inefficient.
Below are samples of how the actual data looks:

df1:

PAN          TAN      TAXABLE_VALUE   TDS    date
ABCDE1234F   TAN001   1000            100    2023-04-01
ABCDE1234F   TAN001   1000            100    2023-04-01
ABCDE1234F   TAN002   2000            200    2023-04-05
ABCDE1234F   TAN002   3000            300    2023-04-10

df2:

PAN          TAN      TAXABLE_VALUE   TDS    date
ABCDE1234F   TAN001   1000            100    2023-04-02
ABCDE1234F   TAN001   999             101    2023-04-03
ABCDE1234F   TAN002   1995            198    2023-04-06
ABCDE1234F   TAN004   4000            400    2023-04-11
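
If you want to follow along, the sample frames above can be recreated in a few lines:

import pandas as pd

# Recreate the sample data shown in the tables above.
df1 = pd.DataFrame({
    "PAN": ["ABCDE1234F"] * 4,
    "TAN": ["TAN001", "TAN001", "TAN002", "TAN002"],
    "TAXABLE_VALUE": [1000, 1000, 2000, 3000],
    "TDS": [100, 100, 200, 300],
    "date": pd.to_datetime(["2023-04-01", "2023-04-01", "2023-04-05", "2023-04-10"]),
})
df2 = pd.DataFrame({
    "PAN": ["ABCDE1234F"] * 4,
    "TAN": ["TAN001", "TAN001", "TAN002", "TAN004"],
    "TAXABLE_VALUE": [1000, 999, 1995, 4000],
    "TDS": [100, 101, 198, 400],
    "date": pd.to_datetime(["2023-04-02", "2023-04-03", "2023-04-06", "2023-04-11"]),
})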

 

First Attempt

The initial implementation had two separate methods, one for each category. Each ran independently in its own multiprocessing pool. I was relying heavily on .apply(lambda…) for logic like date tolerance, and the performance was… let’s just say, not ideal.

from datetime import datetime as dt

def complex_check(row, category):
    .... .... ....
    # Row-wise date-tolerance check: parse both date strings and compare the gap in days.
    time_delta = (dt.strptime(row[DATE_X], "%Y-%m-%d") - dt.strptime(row[DATE_Y], "%Y-%m-%d")).days

    if not time_delta or LOWER_BOUND <= time_delta <= UPPER_BOUND:
        return category
    return "X"
# Here, data and df3 are derived from df1 and df2, respectively.
.... .... ....
data.set_index(data[KEY_INDEX], inplace=True)
df3.set_index(df3[KEY_INDEX], inplace=True)
df3 = pd.merge(data, df3, left_index=True, right_index=True)
if df3.empty:
    continue
.... ... ....
df3[CATEGORY] = df3.apply(lambda record: complex_check(record, category), axis=1)
.... .... ....

Task parallelism is driven by the unique values in the KEY_INDEX column of df1. Let’s explore this further below.

.... .... ....
df1.set_index(df1[KEY_INDEX], inplace=True)
df2.set_index(df2[KEY_INDEX], inplace=True)
jobs = []
with Pool(processes=ie.n_processes) as pool:
    for key_index in df1[KEY_INDEX].unique():
        data = df1[df1[KEY_INDEX] == key_index]
        df3 = df2[df2[KEY_INDEX] == key_index]
        if not data.empty and not df3.empty:
            jobs.append(pool.apply_async(partial(func, data, df3, category, ... .... ....)))
    # Collect results before the with-block terminates the pool.
    if jobs:
        df = pd.concat([job.get() for job in jobs])
.... .... ....
.... .... ....
Execution Time:
Category 1: 46.53 hours
Category 2: 1.40 hours
[Execution Flow Diagram: Before Optimization]

Refactoring & Optimizations

I knew I had to fix this. So I took a phased approach, breaking down the bottlenecks and tackling them one by one.

[Execution Flow Diagram: After Optimization]

Case 1: Merging Parallel Workflows

Problem: I was running both categories separately, duplicating effort and wasting resources.
Fix: Instead of parallelizing the categories, I parallelized each PAN-TAN combination, and then handled both categories together in each process.

# Here, we are determining how many tasks will be processed in parallel for the chunks of PAN-TAN.
.... .... ....
groups_df1 = dict(list(df1.groupby([PAN, TAN])))
groups_df2 = dict(list(df2.groupby([PAN, TAN])))
common_keys = groups_df1.keys() & groups_df2.keys()
max_workers = int(os.cpu_count() * 0.85)

manager = multiprocessing.Manager()
shared_list = manager.list()

grouped_task = [
    (shared_list, key, groups_df1.pop(key), groups_df2.pop(key), ..., ..., ....) for key in common_keys
]
with ProcessPoolExecutor(max_workers=max_workers) as executor:
    results = list(executor.map(process_pan_tan_chunk, grouped_task))

if results:
    .... .... .... 
    .... .... ....
del results, grouped_task, common_keys, groups_df1, groups_df2
.... .... ....
Result:
Execution time dropped from 46+ hours to 15 hours.
Not amazing, but hey — progress!
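
The worker itself is elided above. Conceptually, each process_pan_tan_chunk call handles both categories for its PAN-TAN slice and appends the outcome to the shared list. Here is a rough, simplified sketch using the same illustrative tolerances as earlier; the joins shown are just one way to express the matching, and the real function takes more arguments and carries more logic than this.

import pandas as pd

# Simplified sketch of a worker that runs BOTH categories for one chunk.
# VALUE_TOL, TDS_TOL, LOWER_BOUND, UPPER_BOUND are illustrative placeholders;
# the date columns are assumed to already be datetime.
VALUE_TOL, TDS_TOL = 10, 5
LOWER_BOUND, UPPER_BOUND = -3, 3

def process_pan_tan_chunk(task):
    shared_list, key, chunk_df1, chunk_df2 = task[:4]  # real task tuple has extra items

    # Category 1: exact join on the value columns, then apply the date tolerance.
    cat1 = chunk_df1.merge(chunk_df2,
                           on=["PAN", "TAN", "TAXABLE_VALUE", "TDS"],
                           suffixes=("_x", "_y"))
    delta1 = (cat1["date_x"] - cat1["date_y"]).dt.days
    cat1 = cat1[delta1.between(LOWER_BOUND, UPPER_BOUND)]

    # Category 2: looser join on PAN-TAN only, then value/TDS/date tolerances.
    cat2 = chunk_df1.merge(chunk_df2, on=["PAN", "TAN"], suffixes=("_x", "_y"))
    delta2 = (cat2["date_x"] - cat2["date_y"]).dt.days
    cat2 = cat2[
        (cat2["TAXABLE_VALUE_x"] - cat2["TAXABLE_VALUE_y"]).abs().le(VALUE_TOL)
        & (cat2["TDS_x"] - cat2["TDS_y"]).abs().le(TDS_TOL)
        & delta2.between(LOWER_BOUND, UPPER_BOUND)
    ]

    shared_list.append({"key": key, "category_1": cat1, "category_2": cat2})
    return key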

Case 2: Ditching .apply() for Vectorization

Problem: I was doing too much in .apply(), especially around date tolerance.
Fix: I rewrote those pieces using pandas vectorized operations – things like broadcasting ranges and using pd.merge_asof() instead of row-wise checks.

# Here is the vectorized version of the .apply(lambda .... ....) date check
.... .... ....
df3 = pd.merge(data, df3, left_index=True, right_index=True)
if df3.empty:
    continue

df3["delta_days"] = (pd.to_datetime(df3[DATE_X]) - pd.to_datetime(df3[DATE_Y])).dt.days
df3[CATEGORY] = np.where((df3["delta_days"] >= LOWER_BOUND) & (df3["delta_days"] <= UPPER_BOUND), category, "X")
.... .... ....
Result:
Boom - down to 2 hours. That’s a 7x improvement.
Vectorization is king in pandas, no kidding.
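
One more note on the pd.merge_asof mention above: when each row only needs its single nearest counterpart by date, merge_asof can replace the many-to-many join entirely. A hedged sketch, assuming both frames are pre-sorted by date, the date columns are already datetime, and a symmetric window of a few days (the 3-day value is illustrative, not the production setting):

import pandas as pd

# Sketch: nearest-date matching within a tolerance using pd.merge_asof.
left = df1.sort_values("date")    # merge_asof requires sorting on the join key
right = df2.sort_values("date")

matched = pd.merge_asof(
    left, right,
    on="date",                       # nearest-date join key
    by=["PAN", "TAN"],               # only pair rows within the same PAN-TAN group
    tolerance=pd.Timedelta(days=3),  # illustrative date window
    direction="nearest",
    suffixes=("_x", "_y"),
)
# Unmatched left rows survive with NaNs in the right-hand columns.

Keep in mind that merge_asof pairs each left row with at most one right row, so it suits deduplicated data better than the duplicate-heavy groups described earlier.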

Case 3: Smart Chunking + Resource Balancing

Problem: My chunks were uneven. Some processes had massive data loads; others were idle.
Fix: I wrote a dynamic chunk manager that split PAN-TAN groups more intelligently based on available memory and CPU cores. Now, multiple small chunks could be handled per process without starving or overloading the system.

# Here, we finalize the chunk size and decide how many tasks will run in parallel, aiming for as balanced resource usage as possible.
.... .... ....
grouped_task = []
temp_df1 = None
temp_df2 = None
total_rows_df1 = sum(groups_df1[k].shape[0] for k in common_keys)
total_rows_df2 = sum(groups_df2[k].shape[0] for k in common_keys)
total_rows = max(total_rows_df1, total_rows_df2)
chunk_size = min(MIN_CHUNK_SIZE, total_rows // max_workers)

for key in common_keys:
    if temp_df1 is None:
        temp_df1 = groups_df1.pop(key)
        temp_df2 = groups_df2.pop(key)
    else:
        temp_df1 = pd.concat([temp_df1, groups_df1.pop(key)], ignore_index=True)
        temp_df2 = pd.concat([temp_df2, groups_df2.pop(key)], ignore_index=True)
    if temp_df1.shape[0] >= chunk_size or temp_df2.shape[0] >= chunk_size:
        grouped_task.append((shared_list, key, temp_df1, temp_df2, ..., ...., ....))
        temp_df1 = None
        temp_df2 = None
if temp_df1 is not None:
    grouped_task.append((shared_list, key, temp_df1, temp_df2, ..., ..., ....))
    temp_df1 = None
    temp_df2 = None
.... .... ....
Result:
Execution time fell to just 6 minutes.
Yes, from 48 hours to 6 minutes. And no fancy tools -- just pandas, multiprocessing, and smart engineering.
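
A small aside on the “available memory” part of the chunk manager, which is elided above: one possible way to cap the worker count by both CPU and memory is sketched below. It assumes the psutil library and a rough per-worker memory estimate, neither of which appears in the original code.

import os
import psutil  # assumed third-party dependency, not used in the original snippet

# Cap workers by CPU and by an estimate of peak memory per worker, so large
# chunks don't push the machine into swap.
def pick_max_workers(est_bytes_per_worker, cpu_fraction=0.85):
    cpu_cap = max(1, int(os.cpu_count() * cpu_fraction))
    mem_cap = max(1, psutil.virtual_memory().available // est_bytes_per_worker)
    return int(min(cpu_cap, mem_cap))

# Example: assume each worker peaks at roughly 2 GB.
max_workers = pick_max_workers(est_bytes_per_worker=2 * 1024 ** 3)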

Conclusion

  • Avoid the use of .apply() unless you really have to. It’s intuitive, but slow for big data.
  • Chunk your data smartly – think in groups, especially if your business logic allows for clean separation (like PAN-TAN in my case).
  • Don’t over-parallelize – multiprocessing is powerful, but only when you balance memory, I/O, and CPU smartly.
  • Merge related logic where possible to avoid redundant data loads.
  • Vectorized code isn’t just faster – it’s cleaner and easier to maintain.

