{"id":72948,"date":"2025-07-10T11:44:15","date_gmt":"2025-07-10T06:14:15","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=72948"},"modified":"2025-07-14T12:16:13","modified_gmt":"2025-07-14T06:46:13","slug":"from-48-hours-to-6-minutes-my-journey-optimizing-a-pandas-reconciliation-process-for-large-scale-data","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/from-48-hours-to-6-minutes-my-journey-optimizing-a-pandas-reconciliation-process-for-large-scale-data\/","title":{"rendered":"From 48 Hours to 6 Minutes: My Journey Optimizing a Pandas Reconciliation Process for Large-Scale Data"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p>I recently went through a wild but rewarding ride optimizing a heavy pandas-based reconciliation workflow. What started as a slow, clunky piece of code eating up over <strong>48 hours<\/strong>, ended up getting polished into a lean, mean <strong>6-minutes<\/strong> machine.<br \/>\nI wanted to share my story &#8211; not as a tutorial, but as a real-world case study from the trenches. If you\u2019ve ever felt your pandas code was taking forever, this post is for you.<\/p>\n<h3><strong>The Problem: Reconciling Huge DataFrames<\/strong><\/h3>\n<p>I was working with two large pandas DataFrames &#8211; let\u2019s call them df1 and df2, and to reconcile under two categories:<\/p>\n<ol>\n<li>Exact Match<\/li>\n<li>Tolerance-Based Match<\/li>\n<\/ol>\n<p>The reconciliation keys were mostly PAN, TAN, TAXABLE_VALUE, TDS and DATE, but for the second category, TAXABLE_VALUE &amp; TDS match allowed a tolerance window &#8211; not an exact match. 
The date tolerance was applicable in both categories (i.e., not a strict equality match on date).<\/p>\n<p>Here\u2019s how it looked:<\/p>\n<ul>\n<li>category_1: Grouped by [PAN, TAN, TAXABLE_VALUE, TDS]<\/li>\n<li>category_2: Grouped by just [PAN, TAN], since we wanted more flexibility with TAXABLE_VALUE &amp; TDS tolerance<\/li>\n<\/ul>\n<h3>A Look into Data Complexity<\/h3>\n<p>The data wasn\u2019t just large &#8211; it was messy and unbalanced, which made the problem more challenging:<\/p>\n<ul>\n<li>3.9 million rows in one DataFrame (df1) and 770,000 rows in the other DataFrame (df2)<\/li>\n<li>Duplicate entries on both sides for the same PAN-TAN combinations<\/li>\n<li>Multiple TANs under a single PAN, not always aligned between df1 and df2<\/li>\n<li>Skewed distribution &#8211; some PAN-TAN groups had thousands of rows, others had only a few<\/li>\n<\/ul>\n<p>This resulted in combinatorial explosions when matching duplicate groups &#8211; especially during tolerance-based reconciliation &#8211; and made naive multiprocessing very inefficient.<br \/>\nBelow are samples of how the actual data looks:<\/p>\n<p>df1-<\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 168px;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\"><span style=\"color: #000000;\"><strong>PAN<\/strong><\/span><\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\"><span style=\"color: #000000;\"><strong>TAN<\/strong><\/span><\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\"><span style=\"color: #000000;\"><strong>TAXABLE_VALUE<\/strong><\/span><\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\"><span style=\"color: #000000;\"><strong>TDS<\/strong><\/span><\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\"><span style=\"color: #000000;\"><strong>date<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: 
center;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">TAN001<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">1000<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">100<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">2023-04-01<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">TAN001<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">1000<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">100<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">2023-04-01<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">TAN002<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">2000<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">200<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">2023-04-05<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">TAN002<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">3000<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">300<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">2023-04-10<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<\/tr>\n<tr style=\"height: 
24px;\">\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<td style=\"width: 20%; height: 24px; text-align: center;\">&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>df2-<\/p>\n<table style=\"border-collapse: collapse; width: 100%; height: 144px;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\"><span style=\"color: #000000;\"><strong>PAN<\/strong><\/span><\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\"><span style=\"color: #000000;\"><strong>TAN<\/strong><\/span><\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\"><span style=\"color: #000000;\"><strong>TAXABLE_VALUE<\/strong><\/span><\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\"><span style=\"color: #000000;\"><strong>TDS<\/strong><\/span><\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\"><span style=\"color: #000000;\"><strong>date<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">TAN001<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">1000<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">100<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">2023-04-02<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">TAN001<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">999<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">101<\/td>\n<td style=\"width: 20%; text-align: center; height: 
24px;\">2023-04-03<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">TAN002<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">1995<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">198<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">2023-04-06<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\">ABCDE1234F<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">TAN004<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">4000<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">400<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">2023-04-11<\/td>\n<\/tr>\n<tr style=\"height: 24px;\">\n<td style=\"width: 20%; text-align: center; height: 24px;\">&#8230;<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">&#8230;<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">&#8230;<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">&#8230;<\/td>\n<td style=\"width: 20%; text-align: center; height: 24px;\">&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<h3>First Attempt<\/h3>\n<p>The initial implementation had two separate methods, one for each category. Each ran independently in its own multiprocessing pool. I was relying heavily on .apply(lambda&#8230;) for logic like date tolerance, and the performance was&#8230; let\u2019s just say, not ideal.<\/p>\n<pre>def complex_check(row, category):\r\n    .... .... 
....\r\n\u00a0 \u00a0 time_delta = (dt.strptime(row[DATE_X], \"%Y-%m-%d\") - dt.strptime(row[DATE_Y], \"%Y-%m-%d\")).days\r\n\r\n\u00a0 \u00a0 if not time_delta or LOWER_BOUND &lt;= time_delta &lt;= UPPER_BOUND:\r\n\u00a0 \u00a0 \u00a0 \u00a0 return category\r\n\u00a0 \u00a0 return \"X\"\r\n<\/pre>\n<pre># Here, data and df3 are derived from df1 and df2, respectively.\r\n.... .... ....\r\ndata.set_index(data[KEY_INDEX], inplace=True)\r\ndf3.set_index(df3[KEY_INDEX], inplace=True)\r\ndf3 = pd.merge(data, df3, left_index=True, right_index=True)\r\nif df3.empty:\r\n\u00a0 \u00a0 continue\r\n.... ... ....\r\ndf3[CATEGORY] = df3.apply(lambda record: complex_check(record, category), axis=1)\r\n.... .... ....<\/pre>\n<p>Task parallelism is driven by the unique values in the KEY_INDEX column of df1. Let\u2019s explore this further below.<\/p>\n<pre>.... .... ....\r\ndf1.set_index(df1[KEY_INDEX], inplace=True)\r\ndf2.set_index(df2[KEY_INDEX], inplace=True)\r\njobs = []\r\nwith Pool(processes=ie.n_processes) as pool:\r\n\u00a0 \u00a0 for key_index in df1[KEY_INDEX].unique():\r\n\u00a0 \u00a0 \u00a0 \u00a0 data = df1[df1[KEY_INDEX] == key_index]\r\n\u00a0 \u00a0 \u00a0 \u00a0 df3 = df2[df2[KEY_INDEX] == key_index]\r\n\u00a0 \u00a0 \u00a0 \u00a0 if not data.empty and not df3.empty:\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 jobs.append(pool.apply_async(partial(func, data, df3, category, ... .... ....)))\r\nif jobs:\r\n\u00a0 \u00a0 df = pd.concat([job.get() for job in jobs])\r\n.... .... ....\r\n.... .... 
....<\/pre>\n<pre><strong>Execution Time:\r\nCategory 1: 46.53 hours\r\nCategory 2: 1.40 hours<\/strong><\/pre>\n<div id=\"attachment_72990\" style=\"width: 959px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72990\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72990\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/07\/Flow-Diagram_-Before-Optimization_1.png\" alt=\"Execution Flow Diagram: Before Optimization\" width=\"949\" height=\"786\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/07\/Flow-Diagram_-Before-Optimization_1.png 949w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Flow-Diagram_-Before-Optimization_1-300x248.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Flow-Diagram_-Before-Optimization_1-768x636.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Flow-Diagram_-Before-Optimization_1-624x517.png 624w\" sizes=\"(max-width: 949px) 100vw, 949px\" \/><p id=\"caption-attachment-72990\" class=\"wp-caption-text\">Execution Flow Diagram: Before Optimization<\/p><\/div>\n<h3>Refactoring &amp; Optimizations<\/h3>\n<p>I knew I had to fix this. 
So I took a phased approach, breaking down the bottlenecks and tackling them one by one.<\/p>\n<div id=\"attachment_72991\" style=\"width: 970px\" class=\"wp-caption aligncenter\"><img aria-describedby=\"caption-attachment-72991\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-72991\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2025\/07\/Execution-Flow-Diagram_-After-Optimization_1.png\" alt=\"Execution Flow Diagram: After Optimization\" width=\"960\" height=\"720\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2025\/07\/Execution-Flow-Diagram_-After-Optimization_1.png 960w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Execution-Flow-Diagram_-After-Optimization_1-300x225.png 300w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Execution-Flow-Diagram_-After-Optimization_1-768x576.png 768w, \/blog\/wp-ttn-blog\/uploads\/2025\/07\/Execution-Flow-Diagram_-After-Optimization_1-624x468.png 624w\" sizes=\"(max-width: 960px) 100vw, 960px\" \/><p id=\"caption-attachment-72991\" class=\"wp-caption-text\">Execution Flow Diagram: After Optimization<\/p><\/div>\n<h3><strong>Case 1: Merging Parallel Workflows<\/strong><\/h3>\n<p><strong>Problem<\/strong>: I was running both categories separately, duplicating effort and wasting resources.<br \/>\n<strong>Fix<\/strong>: Instead of parallelizing the categories, I parallelized each PAN-TAN combination, and then handled both categories together in each process.<\/p>\n<pre># Here, we are determining how many tasks will be processed in parallel for the chunks of PAN-TAN.\r\n.... .... ....\r\ngroups_df1 = dict(list(df1.groupby([PAN, TAN])))\r\ngroups_df2 = dict(list(df2.groupby([PAN, TAN])))\r\ncommon_keys = groups_df1.keys() &amp; groups_df2.keys()\r\nmax_workers = int(os.cpu_count() * 0.85)\r\n\r\nmanager = multiprocessing.Manager()\r\nshared_list = manager.list()\r\n\r\ngrouped_task = [\r\n    (shared_list, key, groups_df1.pop(key), groups_df2.pop(key), ..., ..., ....) 
for key in common_keys\r\n]\r\nwith ProcessPoolExecutor(max_workers=max_workers) as executor:\r\n\u00a0 \u00a0 results = list(executor.map(process_pan_tan_chunk, grouped_task))\r\n\r\nif results:\r\n\u00a0 \u00a0 .... .... ....\u00a0\r\n\u00a0 \u00a0 .... .... ....\r\ndel results, grouped_task, common_keys, groups_df1, groups_df2\r\n.... .... ....<\/pre>\n<pre><strong>Result:\r\nExecution time dropped from 46+ hours to 15 hours.\r\nNot amazing, but hey \u2014 progress!<\/strong><\/pre>\n<h3><strong>Case 2: Ditching .apply() for Vectorization<\/strong><\/h3>\n<p><strong>Problem<\/strong>: I was doing too much in .apply(), especially around date tolerance.<br \/>\n<strong>Fix<\/strong>: I rewrote those pieces using pandas vectorized operations &#8211; things like broadcasting ranges and using .merge_asof() instead of row-wise checks.<\/p>\n<pre># Here, vectorized version of .apply(lambda .... ....)\r\n.... .... ....\r\ndf3 = pd.merge(data, df3, left_index=True, right_index=True)\r\nif df3.empty:\r\n\u00a0 \u00a0 continue\r\n\r\ndf3[\"delta_days\"] = (pd.to_datetime(df3[DATE_X]) - pd.to_datetime(df3[DATE_Y])).dt.days\r\ndf3[CATEGORY] = np.where((df3[\"delta_days\"] &gt;= LOWER_BOUND) &amp; (df3[\"delta_days\"] &lt;= UPPER_BOUND), category, \"X\")\r\n.... .... ....<\/pre>\n<pre><strong>Result:\r\nBoom - down to 2 hours. That\u2019s a 7x improvement.\r\nVectorization is king in pandas, no kidding.<\/strong><\/pre>\n<h3><strong>Case 3: Smart Chunking + Resource Balancing<\/strong><\/h3>\n<p><strong>Problem<\/strong>: My chunks were uneven. Some processes had massive data loads; others were idle.<br \/>\n<strong>Fix<\/strong>: I wrote a dynamic chunk manager that split <em>PAN-TAN<\/em> groups more intelligently based on available memory and CPU cores. 
Now, multiple small chunks could be handled per process without starving or overloading the system.<\/p>\n<pre># Here, we finalize the chunk size and decide how many tasks will run in parallel, aiming for resource usage that is as balanced as possible.\r\n.... .... ....\r\ngrouped_task = []\r\ntemp_df1 = None\r\ntemp_df2 = None\r\ntotal_rows_df1 = sum(groups_df1[k].shape[0] for k in common_keys)\r\ntotal_rows_df2 = sum(groups_df2[k].shape[0] for k in common_keys)\r\ntotal_rows = max(total_rows_df1, total_rows_df2)\r\nchunk_size = min(MIN_CHUNK_SIZE, total_rows \/\/ max_workers)\r\n\r\nfor key in common_keys:\r\n\u00a0 \u00a0 if temp_df1 is None:\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df1 = groups_df1.pop(key)\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df2 = groups_df2.pop(key)\r\n\u00a0 \u00a0 else:\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df1 = pd.concat([temp_df1, groups_df1.pop(key)], ignore_index=True)\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df2 = pd.concat([temp_df2, groups_df2.pop(key)], ignore_index=True)\r\n\u00a0 \u00a0 if temp_df1.shape[0] &gt;= chunk_size or temp_df2.shape[0] &gt;= chunk_size:\r\n\u00a0 \u00a0 \u00a0 \u00a0 grouped_task.append((shared_list, key, temp_df1, temp_df2, ..., ...., ....))\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df1 = None\r\n\u00a0 \u00a0 \u00a0 \u00a0 temp_df2 = None\r\n\r\n# Flush any leftover rows that never reached chunk_size.\r\nif temp_df1 is not None:\r\n\u00a0 \u00a0 grouped_task.append((shared_list, key, temp_df1, temp_df2, ..., ..., ....))\r\n\u00a0 \u00a0 temp_df1 = None\r\n\u00a0 \u00a0 temp_df2 = None\r\n.... .... ....<\/pre>\n<pre><strong>Result:\r\nExecution time fell to just 6 minutes.\r\nYes, from 48 hours to 6 minutes. And no fancy tools -- just pandas, multiprocessing, and smart engineering.<\/strong><\/pre>\n<h2>Conclusion<\/h2>\n<ul>\n<li>Avoid the use of .apply() unless you really have to. 
It\u2019s intuitive, but slow for big data.<\/li>\n<li>Chunk your data smartly &#8211; think in groups, especially if your business logic allows for clean separation (like PAN-TAN in my case).<\/li>\n<li>Don\u2019t over-parallelize &#8211; multiprocessing is powerful, but only when you balance memory, I\/O, and CPU smartly.<\/li>\n<li>Merge related logic where possible to avoid redundant data loads.<\/li>\n<li>Vectorized code isn&#8217;t just faster &#8211; it&#8217;s cleaner and easier to maintain.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction I recently went through a wild but rewarding ride optimizing a heavy pandas-based reconciliation workflow. What started as a slow, clunky piece of code eating up over 48 hours ended up getting polished into a lean, mean 6-minute machine. I wanted to share my story &#8211; not as a tutorial, but as a real-world [&hellip;]<\/p>\n","protected":false},"author":2102,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":53},"categories":[5879],"tags":[1197,7553],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72948"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/2102"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=72948"}],"version-history":[{"count":17,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72948\/revisions"}],"predecessor-version":[{"id":73239,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/72948\/revisions\/73239"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=72948"}],"wp:term":[{"taxonomy":"catego
ry","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=72948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=72948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}