Like many data engineers, I’ve spent a good chunk of my time dealing with a problem that sounds simple on paper but is messy in reality: reliably moving data from source systems into an analytics platform. In one of my recent projects, I worked on setting up data integration using Airbyte, and this post is a reflection on that...
Let me tell you about the moment I realized I’d been overcomplicating things for years. I was working on a pipeline in Snowflake. You know the type — a multi-stage transformation process where a few base tables feed into intermediate tables, some reconciliation happens, and eventually it all lands in a final reporting layer. I’d...
I recently converted a Python script that relied on Pandas DataFrames to use PySpark DataFrames instead. The main goal was to move data manipulation from the single-machine context of Pandas to the distributed processing offered by PySpark. This shift to PySpark DataFrames enables us to enhance scalability and efficiency by...
