{"id":79942,"date":"2026-06-04T11:17:28","date_gmt":"2026-06-04T05:47:28","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=79942"},"modified":"2026-06-08T18:38:42","modified_gmt":"2026-06-08T13:08:42","slug":"moving-beyond-the-messy-middle-how-i-automated-regulatory-data-pipeline-with-snowflake-and-streamlit","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/moving-beyond-the-messy-middle-how-i-automated-regulatory-data-pipeline-with-snowflake-and-streamlit\/","title":{"rendered":"Moving Beyond the Messy Middle: How I Automated Regulatory Data Pipeline with Snowflake and Streamlit"},"content":{"rendered":"<p>Every data engineer knows the dread of a process that looks perfect on paper but relies entirely on manual intervention.<\/p>\n<p>Not long ago, I was managing a critical regulatory compliance data pipeline for a client. The raw ingredients were complex, multi-format regulatory list updates provided in Excel and CSV files. The destination was Snowflake, where all our business transformations and mapping logic needed to live.<\/p>\n<p>The middle, though? It was a manual mess.<\/p>\n<p>Files were hand-carried to Azure Blob Storage, manually loaded into Snowflake external stages, and processed via ad-hoc scripts. It worked, but it absolutely didn&#8217;t scale. It was error-prone, completely lacked version control, and kept our data consumers waiting.<\/p>\n<p>Here is how we turned that brittle, manual workflow into an automated, production-grade enterprise data pipeline using Snowflake, Streamlit, and FastAPI.<\/p>\n<h2><strong>The Core Challenge: Standardizing the Unstandardized<\/strong><\/h2>\n<p>Regulatory lists\u2014like chemical substance inventories and restriction lists\u2014are notoriously messy. They are full of multi-value fields (like semi-colon-separated CAS numbers), irregular date formats, and unpredictable schema drift.<\/p>\n<p>To bring order to this chaos, I established a strict architectural foundation inside Snowflake before even touching the automation layer.<\/p>\n<p>1. Hardening the Ingestion Layer<\/p>\n<p>We designed a custom Snowflake file format to handle the nuances of incoming UTF-8 encoded files, ensuring that a single malformed row wouldn&#8217;t bring the entire pipeline to a screeching halt.<\/p>\n<div id=\"attachment_79934\" style=\"width: 310px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79934\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79934 size-medium\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.1-300x184.png\" alt=\"Image\" width=\"300\" height=\"184\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.1-300x184.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.1.png 511w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-79934\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p>2. Schema-on-Read with Variadic Data<\/p>\n<p>Instead of forcing raw, unstable file structures directly into rigid relational columns, we used Snowflake&#8217;s OBJECT_CONSTRUCT to ingest the data as a structured JSON payload. This shielded our staging tables from sudden, unexpected column changes.<\/p>\n<div id=\"attachment_79935\" style=\"width: 737px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79935\" decoding=\"async\" loading=\"lazy\" class=\"wp-image-79935 size-full\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.2.png\" alt=\"Image\" width=\"727\" height=\"367\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.2.png 727w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.2-300x151.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.2-624x315.png 624w\" sizes=\"(max-width: 727px) 100vw, 727px\" \/><\/a><p id=\"caption-attachment-79935\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p>Why this matters:<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<strong> OBJECT_CONSTRUCT<\/strong> standardizes downstream consumption into flexible key-value pairs, making schema drift a non-issue.<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <strong>ON_ERROR = &#8216;CONTINUE&#8217;<\/strong> guarantees pipeline resilience, isolating bad records without killing the entire ingestion batch.<\/p>\n<h2><strong>Centralizing the Logic: Why I Used UDFs<\/strong><\/h2>\n<p>Regulatory transformations require highly specific mapping rules, like parsing conditional values or handling complex string logic. Embedding this directly inside execution scripts is a recipe for code duplication and maintenance nightmares.<\/p>\n<p>Instead, we encapsulated the mapping logic into a centralized User-Defined Function (UDF) in Snowflake. This gave us a few massive advantages:<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 One Source of Truth: All regulatory mapping rules live in a single, easily auditable repository.<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Reusability: The exact same mapping logic applies to both historical processing and real-time ingestion.<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Cleaner Pipelines: It keeps our primary SQL ingestion syntax modular and incredibly easy to maintain.<\/p>\n<h2>Enter Streamlit: Giving Control Back to the Business<\/h2>\n<p>While the database foundation was solid, we still had to solve the human problem: how do non-technical business stakeholders upload new lists without touching Azure Storage or running SQL commands?<\/p>\n<p>We solved this by building an interactive internal web application using Streamlit.<\/p>\n<div id=\"attachment_79936\" style=\"width: 695px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79936\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79936\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.3.png\" alt=\"Image\" width=\"685\" height=\"109\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.3.png 685w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.3-300x48.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.3-624x99.png 624w\" sizes=\"(max-width: 685px) 100vw, 685px\" \/><\/a><p id=\"caption-attachment-79936\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p><strong>Why Streamlit Changed the Game<\/strong><\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Familiar, Intuitive UI: Users simply drag and drop their updated Excel or CSV regulatory lists right into the browser.<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Embedded Automation: The moment a file is uploaded, the Streamlit app interacts with Snowflake behind the scenes via native Python connectors to run the staging copy and execute the transformation queries.<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Bulletproof Auditing: To prevent accidental data overwrites, we implemented an IS_LATEST flagging strategy. Every upload generates a unique FILE_VERSION_ID and captures an INGESTED_AT timestamp, ensuring we never lose historical lineage.<\/p>\n<h2>Data Modernization: Secure Views and API Readiness<\/h2>\n<p>Regulatory data is highly sensitive and constantly requested by downstream enterprise applications. To protect it, we avoided letting third-party tools query our physical base tables directly. Instead, we built a Secure View Layer that handles both data parsing and access control.<\/p>\n<div id=\"attachment_79937\" style=\"width: 610px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79937\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79937\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.4.png\" alt=\"Image\" width=\"600\" height=\"396\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.4.png 600w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.4-300x198.png 300w\" sizes=\"(max-width: 600px) 100vw, 600px\" \/><\/a><p id=\"caption-attachment-79937\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p>The Engineering Behind the View<\/p>\n<p>1.\u00a0\u00a0\u00a0\u00a0 Dynamic Flattening: We used LATERAL FLATTEN to split multi-CAS number strings (e.g., 123-45-6; 789-10-1) into clean, distinct rows on the fly.<\/p>\n<p>2.\u00a0\u00a0\u00a0\u00a0 Regex Cleansing: A strict REGEXP_LIKE filter guarantees that only syntactically valid CAS numbers are exposed to downstream systems.<\/p>\n<p>3.\u00a0\u00a0\u00a0\u00a0 Data Security: Because it\u2019s a Secure View, Snowflake prevents external consumers from analyzing the underlying base tables, ensuring tight data governance.<\/p>\n<h2>Exposing Data Safely via FastAPI<\/h2>\n<p>To transform this warehouse asset into an agile data service, we wrapped the Snowflake Secure View in a lightweight, asynchronous FastAPI application.<\/p>\n<p><strong>Secure Connection Engineering<\/strong><\/p>\n<p>To keep credentials secure, we decoupled our configuration from the application logic entirely by using an isolated .env environment file.<\/p>\n<p>Here is the clean implementation pattern we utilized to connect safely to Snowflake:<\/p>\n<div id=\"attachment_79939\" style=\"width: 688px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79939\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79939\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.5.png\" alt=\"Image\" width=\"678\" height=\"548\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.5.png 678w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.5-300x242.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.5-624x504.png 624w\" sizes=\"(max-width: 678px) 100vw, 678px\" \/><\/a><p id=\"caption-attachment-79939\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p><strong>The Production API Endpoints<\/strong><\/p>\n<p>With our database connection handled, we exposed real-time endpoint paths to deliver data dynamically to external teams.<\/p>\n<div id=\"attachment_79940\" style=\"width: 995px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79940\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79940\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.6.png\" alt=\"Image\" width=\"985\" height=\"635\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.6.png 985w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.6-300x193.png 300w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.6-768x495.png 768w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.6-624x402.png 624w\" sizes=\"(max-width: 985px) 100vw, 985px\" \/><\/a><p id=\"caption-attachment-79940\" class=\"wp-caption-text\">Image<\/p><\/div>\n<h2><strong>Lessons from the Field: Troubleshooting Infrastructure<\/strong><\/h2>\n<p>Moving a pipeline from manual execution to automated, local code always uncovers a few environmental quirks. During development, we hit a couple of classic infrastructure roadblocks:<\/p>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Account Identifier Subtleties: Snowflake&#8217;s account identifier format can be deceptive. A simple typo like company.azure-westeurope instead of the correct, cloud-provider-specific format company.west-europe.azure will result in frustrating network timeouts. If you run into this, verify your settings natively using:<\/p>\n<div id=\"attachment_79941\" style=\"width: 323px\" class=\"wp-caption alignnone\"><img aria-describedby=\"caption-attachment-79941\" decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-79941\" src=\"https:\/\/www.tothenew.com\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.7.png\" alt=\"Image\" width=\"313\" height=\"58\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.7.png 313w, \/blog\/wp-ttn-blog\/uploads\/2026\/06\/blog.7-300x56.png 300w\" sizes=\"(max-width: 313px) 100vw, 313px\" \/><\/a><p id=\"caption-attachment-79941\" class=\"wp-caption-text\">Image<\/p><\/div>\n<p>\u00b7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Network &amp; Firewall Whitelisting: Local API setups require strict coordination with IT security policies. When connecting over enterprise VPNs, you have to ensure your public IP routing matches the client&#8217;s Snowflake Network Policies and that traffic can seamlessly traverse port 443.<\/p>\n<h1>The Value Delivered<\/h1>\n<p>By eliminating the manual data-shuffling middleman and wrapping our data platform inside user-friendly web frameworks and secure APIs, we completely changed how the business interacts with this data:<\/p>\n<table style=\"border-collapse: collapse; width: 100%;\">\n<tbody>\n<tr>\n<td style=\"width: 46.7241%;\"><strong>Before: Manual Process<\/strong><\/td>\n<td style=\"width: 53.2759%;\"><strong>After: Modern Automated Stack<\/strong><\/td>\n<\/tr>\n<tr>\n<td style=\"width: 46.7241%;\">1. Hand-carried file uploads took hours and introduced human error.<br \/>\n2. Inconsistent transformation scripts caused frequent schema discrepancies.<br \/>\n3. Overwriting older files erased critical historical audit trails.<br \/>\n4. Monolithic access patterns exposed raw base tables to consumers.<\/td>\n<td style=\"width: 53.2759%;\">1. Streamlit App allows self-service uploads with instant, automatic processing.<br \/>\n2. Snowflake UDFs enforce central, standardized mapping logic.<br \/>\n3. Audit Versioning (IS_LATEST) preserves permanent tracking and history.<br \/>\n4. FastAPI &amp; Secure Views expose data safely via production-ready REST endpoints.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Building modern data pipelines isn&#8217;t just about moving data from point A to point B\u2014it&#8217;s about creating clean, secure, and accessible entry points that empower both business users and downstream software. By combining Snowflake, Streamlit, and FastAPI, we took a slow, painful chore and turned it into a robust, hands-off enterprise asset.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Every data engineer knows the dread of a process that looks perfect on paper but relies entirely on manual intervention. Not long ago, I was managing a critical regulatory compliance data pipeline for a client. The raw ingredients were complex, multi-format regulatory list updates provided in Excel and CSV files. The destination was Snowflake, where [&hellip;]<\/p>\n","protected":false},"author":1789,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":0},"categories":[6194],"tags":[1358,5236],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79942"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1789"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=79942"}],"version-history":[{"count":2,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79942\/revisions"}],"predecessor-version":[{"id":80019,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/79942\/revisions\/80019"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=79942"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=79942"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=79942"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}