Solving ETL Dependency Bottlenecks with GitHub Actions

12 / Feb / 2026 by Keha Gupta

Introduction

In modern data platforms, ETL pipelines are rarely independent. They are deeply interconnected: one pipeline’s output becomes another pipeline’s input. In one of our production projects, we faced a classic orchestration challenge: pipelines were scheduled by clock time rather than by the actual completion of their upstream dependencies, which only works when execution times are predictable.

This article walks through:

  • The real problem we encountered

  • Why ETL schedulers alone were insufficient

  • How external orchestration changed everything

  • Why we chose GitHub Actions

  • How we implemented it in practice


The Problem: Time-Based Scheduling vs Reality

We were using an ETL tool with multiple pipelines, and each pipeline’s start time was chosen to leave room for its upstream dependencies to finish.

A simplified version of our setup looked like this:

  • Pipeline P1

    • Scheduled at 1:00 PM

    • Typical runtime: 1–2 hours

  • Pipeline P2

    • Scheduled at 3:00 PM

    • Depends on P1’s output

Where Things Started Breaking

This approach assumes:

  • P1 will always finish before 3 PM

  • Execution time is predictable

But in real-world ETL systems:

  • Source data volume fluctuates

  • Network latency changes

  • Downstream systems slow down

  • Unexpected retries happen

This led to two major problems:

Scenario 1:

If P1 finished in 1 hour, it completed at 2 PM.
But P2 would still wait until 3 PM, wasting a full hour of idle time.

Scenario 2:

If P1 ran longer—say 2.5 hours—it was still running at 3 PM.
As a result:

  • P2 failed outright

  • P2 was blocked

  • Or P2 ran on incomplete data

This created operational instability and required manual intervention.
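To make the two scenarios concrete, here is a small arithmetic sketch. The function name `schedule_gap_minutes` and the calendar date are illustrative assumptions; only the 1:00 PM and 3:00 PM start times and the 1-hour and 2.5-hour runtimes come from the setup above.

```python
from datetime import datetime, timedelta


def schedule_gap_minutes(p1_start: datetime, p1_runtime: timedelta, p2_start: datetime) -> float:
    """Minutes between P1's actual finish and P2's scheduled start.

    Positive: P2 sits idle for that long.
    Negative: P2 starts while P1 is still running.
    """
    p1_finish = p1_start + p1_runtime
    return (p2_start - p1_finish).total_seconds() / 60


# Arbitrary date; only the clock times matter here.
p1_start = datetime(2026, 1, 1, 13, 0)  # 1:00 PM
p2_start = datetime(2026, 1, 1, 15, 0)  # 3:00 PM

fast_gap = schedule_gap_minutes(p1_start, timedelta(hours=1), p2_start)
slow_gap = schedule_gap_minutes(p1_start, timedelta(hours=2.5), p2_start)

print(fast_gap)  # 60.0  -> Scenario 1: an hour of idle time
print(slow_gap)  # -30.0 -> Scenario 2: P2 starts 30 minutes before P1 finishes
```

No fixed start time for P2 makes both numbers zero at once, which is why the rest of this article moves away from clock-based triggers.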


Why Native ETL Schedulers Weren’t Enough

Most ETL tools offer:

  • Cron-based scheduling

  • Basic dependency handling

  • Trigger-on-success options (within limits)

However, these features struggle when:

  • Pipelines have variable runtimes

  • Dependencies span multiple environments

  • You need dynamic orchestration

  • Monitoring and retries must be centralized

At their core, ETL schedulers are time-driven.
Our problem needed an event-driven solution.


The Core Insight: Orchestration ≠ Scheduling

This was the turning point for us.

Scheduling answers:

When should something start?

Orchestration answers:

What should happen next, and under what conditions?

We didn’t need better schedules.
We needed a system that could say:

“Start Pipeline P2 only after Pipeline P1 has actually completed successfully.”


Why External Orchestration Wins

By moving orchestration outside the ETL tool, we gained:

  • Decoupling from rigid schedules

  • Centralized control

  • Clear dependency graphs

  • Better failure visibility

  • Easier retries and alerts

We evaluated several options—but GitHub Actions stood out.


What Is GitHub Actions?

GitHub Actions is a workflow automation tool built directly into GitHub.

It is commonly used for:

  • CI/CD pipelines

  • Automated testing

  • Code deployment

Beyond CI/CD, GitHub Actions is a general-purpose workflow engine that can:

  • Run jobs sequentially

  • Trigger external systems via APIs

  • Wait for conditions

  • Fail fast or retry

  • Log everything with timestamps

That makes it powerful for ETL orchestration.


Why GitHub Actions Worked for Our ETL Pipelines

  • Sequential execution

  • Conditional flow based on success/failure

  • Event-driven execution

  • Full visibility into pipeline runs with logs

  • External control, without ETL-tool lock-in

Instead of guessing execution times, we started reacting to actual pipeline state.


High-Level Architecture

Our new orchestration flow looked like this:

GitHub Actions Workflow → Trigger Pipeline P1 → Poll P1 status (RUNNING → SUCCESS) → Trigger Pipeline P2 → Trigger P3 → P4 → …

Each pipeline starts only after the previous one finishes successfully.

No clocks.
No guessing.
No wasted time.
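The flow above boils down to a loop that triggers a pipeline, waits for its result, and stops at the first failure. The following is a minimal sketch of that idea, not our production code. `run_chain`, `trigger`, and `wait_for_success` are hypothetical names, and the two callables stand in for whatever your ETL tool's API offers.

```python
from typing import Callable, Iterable, List


def run_chain(
    pipelines: Iterable[str],
    trigger: Callable[[str], None],
    wait_for_success: Callable[[str], bool],
) -> List[str]:
    """Run pipelines strictly in order and stop at the first one that does not succeed.

    Returns the names of the pipelines that completed successfully.
    """
    completed: List[str] = []
    for name in pipelines:
        trigger(name)
        if not wait_for_success(name):
            break  # downstream pipelines are never triggered
        completed.append(name)
    return completed


# Usage with in-memory stand-ins: P2 "fails", so P3 is never triggered.
triggered: List[str] = []
result = run_chain(
    ["P1", "P2", "P3"],
    trigger=triggered.append,
    wait_for_success=lambda name: name != "P2",
)
print(triggered)  # ['P1', 'P2']
print(result)     # ['P1']
```

Passing the trigger and wait logic in as functions keeps the ordering rule separate from any particular ETL tool, and makes the rule easy to test without network access.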


How We Implemented It

Step 1: Using APIs

ETL tools typically provide REST APIs that let you:

  • Trigger jobs

  • Fetch execution status

  • Capture success or failure

Each API call identifies the target pipeline and authenticates using:

  • Project

  • Environment

  • Job name

  • Authentication token
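As an illustration of how those four values might come together, here is a sketch that builds (but does not send) a trigger request with Python's standard library. The URL layout, the bearer-token header, and the name `build_trigger_request` are assumptions for this example. Your ETL tool's API documentation defines the real endpoint and authentication scheme.

```python
import json
import urllib.request
from urllib.parse import quote


def build_trigger_request(
    base_url: str, project: str, environment: str, job_name: str, token: str
) -> urllib.request.Request:
    """Build a POST request that would trigger one pipeline run."""
    # Hypothetical URL layout; replace with your tool's documented endpoint.
    url = (
        f"{base_url}/projects/{quote(project, safe='')}"
        f"/environments/{quote(environment, safe='')}"
        f"/jobs/{quote(job_name, safe='')}/run"
    )
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body as a placeholder
        headers={
            # Assumed scheme; some tools use API keys or basic auth instead.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_trigger_request("https://etl.example.com/api", "sales", "prod", "P1", "dummy-token")
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)`. In a workflow, the token would come from an environment variable rather than from a literal in the code.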


Step 2: Creating a GitHub Actions Workflow

We created a workflow YAML file in our repository:

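name: ETL Orchestration

on:
    workflow_dispatch:
    schedule:
        # Daily at 13:00; GitHub Actions evaluates cron schedules in UTC
        - cron: "0 13 * * *"

jobs:
    run-etl:
        runs-on: ubuntu-latest

        # P1_API and P2_API must be available as environment variables,
        # for example mapped from repository secrets.
        steps:
            # Check out the repository so that check_status.sh is available
            - name: Check out repository
              uses: actions/checkout@v4

            - name: Trigger Pipeline P1
              run: curl --fail -X POST "$P1_API"

            # A non-zero exit code fails this step and skips the steps after it
            - name: Wait for P1 to complete
              run: ./check_status.sh P1

            - name: Trigger Pipeline P2
              run: curl --fail -X POST "$P2_API"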


Step 3: Dependency Validation

Instead of triggering the next pipeline at a fixed time, we:

  • Polled the ETL tool’s status API

  • Checked execution state

  • Proceeded only when status was SUCCESS

If a pipeline failed:

  • The GitHub Actions job failed

  • Downstream pipelines were not triggered

  • Alerts could be sent immediately via PagerDuty, Slack, etc.
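The polling behaviour described in this step can be sketched as a small loop. This is an assumption-laden illustration rather than the contents of our status script. `poll_until_success` and `get_status` are hypothetical names, and only the RUNNING and SUCCESS states come from the flow described earlier, so any other state is treated as a failure here.

```python
import time
from typing import Callable


def poll_until_success(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 4 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until get_status() returns "SUCCESS".

    Raises RuntimeError if the pipeline ends in any other terminal state,
    and TimeoutError if it is still running when the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "SUCCESS":
            return
        if status != "RUNNING":
            # Assumption: anything other than RUNNING or SUCCESS means failure.
            raise RuntimeError(f"pipeline ended in state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("pipeline did not finish before the timeout")
        sleep(poll_seconds)


# Usage with a canned sequence of states instead of a real API call.
states = iter(["RUNNING", "RUNNING", "SUCCESS"])
poll_until_success(lambda: next(states), poll_seconds=0, sleep=lambda seconds: None)
print("P1 succeeded")
```

When a script like this runs inside a workflow step, an uncaught exception exits with a non-zero code. That fails the step, and by default the steps after it are skipped. The timeout guards against a pipeline that never leaves the RUNNING state.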


Step 4: Observability & Control

GitHub Actions provided:

  • Clear execution logs

  • Start and end timestamps

  • Pipeline-level visibility

  • Single platform to monitor everything

  • Control over whether to continue to the next step or stop when a step fails

We could easily add:

  • Slack notifications / PagerDuty calls

  • Retry logic

  • Conditional branching
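Retry logic, for instance, can live in the script that a workflow step calls. Below is a minimal sketch with exponential backoff. The helper name `retry`, the default of three attempts, and the 5-second base delay are illustrative assumptions, not values from our setup.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the failure surface
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Usage: a trigger that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_trigger() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network error")
    return "triggered"


delays: List[float] = []
outcome = retry(flaky_trigger, sleep=delays.append)
print(outcome, delays)  # triggered [5.0, 10.0]
```

Retries suit transient errors such as a dropped connection while triggering a job. A pipeline that fails on bad data should usually fail the workflow instead of being retried blindly.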


Why This Was Better Than Native Scheduling

Feature                 | ETL Scheduler | GitHub Actions
Time-based runs         | Yes           | Yes
State-based execution   | Limited       | Strong
Dynamic dependencies    | Weak          | Native
Sequential control      | Limited       | Built-in
Failure visibility      | Moderate      | Excellent
Central orchestration   | No            | Yes

Real Production Impact

After switching to GitHub Actions:

  • Pipelines started as soon as dependencies were completed

  • No idle waiting between jobs

  • No blocked executions

  • No manual restarts

We moved from a time-driven ETL system to a state-driven or event-driven data pipeline.


Key Takeaways

  • ETL schedulers work well for simple cases, but for complex dependencies an external orchestrator such as GitHub Actions gives a cleaner high-level architecture

  • Time-based scheduling is unreliable when pipeline runtimes vary

  • External orchestration brings flexibility and reliability

  • GitHub Actions is not just for CI/CD—it’s a powerful workflow engine

  • State-driven orchestration leads to faster, safer pipelines
