Solving ETL Dependency Bottlenecks with GitHub Actions

Introduction

In modern data platforms, ETL pipelines are rarely independent. They are deeply interconnected: one pipeline’s output becomes another pipeline’s input. In one of our production projects, we faced a classic orchestration challenge. Pipelines were scheduled by clock time rather than by actual completion, which only works when execution times are predictable. In practice, ours were not.

This article walks through:

The real problem we encountered
Why ETL schedulers alone were insufficient
How external orchestration changed everything
Why we chose GitHub Actions
How we implemented it in practice

The Problem: Time-Based Scheduling vs Reality

We were using an ETL tool with multiple pipelines whose start times were staggered to respect their dependencies.

A simplified version of our setup looked like this:

Pipeline P1
- Scheduled at 1:00 PM
- Typical runtime: 1–2 hours
Pipeline P2
- Scheduled at 3:00 PM
- Depends on P1’s output

Where Things Started Breaking

This approach assumes:

P1 will always finish before 3 PM
Execution time is predictable

But in real-world ETL systems:

Source data volume fluctuates
Network latency changes
Downstream systems slow down
Unexpected retries happen

This led to two major problems:

Scenario 1:

If P1 finished in 1 hour, it completed at 2 PM.
But P2 would still wait until 3 PM, wasting a full hour of idle time.

Scenario 2:

If P1 ran longer—say 2.5 hours—it was still running at 3 PM.
As a result:

P2 either failed
Or was blocked
Or ran on incomplete data

This created operational instability and required manual intervention.
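To make the two scenarios concrete, here is a small worked calculation. It is only an illustrative sketch: the function name `schedule_gap` and the calendar date are assumptions, and only the clock times (1:00 PM start, 3:00 PM downstream schedule) come from the setup above.

```python
from datetime import datetime, timedelta

def schedule_gap(p1_start, p1_runtime, p2_start):
    """Return (idle, overrun) for a fixed downstream schedule.

    idle:    how long P2 waits after P1 has already finished
    overrun: how long P1 is still running after P2's scheduled start
    """
    p1_end = p1_start + p1_runtime
    if p1_end <= p2_start:
        return p2_start - p1_end, timedelta(0)
    return timedelta(0), p1_end - p2_start

day = datetime(2024, 1, 1)            # arbitrary date; only the clock times matter
p1_start = day.replace(hour=13)       # P1 scheduled at 1:00 PM
p2_start = day.replace(hour=15)       # P2 scheduled at 3:00 PM

scenario_1 = schedule_gap(p1_start, timedelta(hours=1), p2_start)
scenario_2 = schedule_gap(p1_start, timedelta(hours=2.5), p2_start)

print(f"Scenario 1: idle={scenario_1[0]}, overrun={scenario_1[1]}")  # idle=1:00:00, overrun=0:00:00
print(f"Scenario 2: idle={scenario_2[0]}, overrun={scenario_2[1]}")  # idle=0:00:00, overrun=0:30:00
```

Either number being non-zero is a cost: idle time delays downstream data, and overrun means P2 starts against output that is not ready.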

Why Native ETL Schedulers Weren’t Enough

Most ETL tools offer:

Cron-based scheduling
Basic dependency handling
Trigger-on-success options (within limits)

However, these features struggle when:

Pipelines have variable runtimes
Dependencies span multiple environments
You need dynamic orchestration
Monitoring and retries must be centralized

At their core, ETL schedulers are time-driven.
Our problem needed an event-driven solution.

The Core Insight: Orchestration ≠ Scheduling

This was the turning point for us.

Scheduling answers:

When should something start?

Orchestration answers:

What should happen next, and under what conditions?

We didn’t need better schedules.
We needed a system that could say:

“Start Pipeline P2 only after Pipeline P1 has actually completed successfully.”

Why External Orchestration Wins

By moving orchestration outside the ETL tool, we gained:

Decoupling from rigid schedules
Centralized control
Clear dependency graphs
Better failure visibility
Easier retries and alerts

We evaluated several options—but GitHub Actions stood out.

What Is GitHub Actions?

GitHub Actions is a workflow automation tool built directly into GitHub.

It is commonly used for:

CI/CD pipelines
Automated testing
Code deployment

Underneath those use cases, GitHub Actions is a general-purpose workflow engine that can:

Run jobs sequentially
Trigger external systems via APIs
Wait for conditions
Fail fast or retry
Log everything with timestamps

That makes it powerful for ETL orchestration.

Why GitHub Actions Worked for Our ETL Pipelines

Sequential execution
Conditional flow based on success/failure
Event-driven execution
Full visibility into pipeline runs with logs
External control, without ETL-tool lock-in

Instead of guessing execution times, we started reacting to actual pipeline state.

High-Level Architecture

Our new orchestration flow looked like this:

GitHub Actions Workflow → Trigger Pipeline P1 → Poll P1 status (RUNNING → SUCCESS) → Trigger Pipeline P2 → Trigger P3 → P4 → …

Each pipeline starts only after the previous one finishes successfully.

No clocks.
No guessing.
No wasted time.
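The flow above can be sketched as a small driver that a workflow step might run. This is a minimal sketch, not our production code: `run_chain`, `fake_trigger`, and `fake_wait` are hypothetical names, and the two callables stand in for whatever the ETL tool's API actually offers.

```python
from typing import Callable, Iterable, List

def run_chain(
    pipelines: Iterable[str],
    trigger: Callable[[str], str],
    wait_for_success: Callable[[str], bool],
) -> List[str]:
    """Run pipelines strictly in order; stop at the first one that does not succeed."""
    completed: List[str] = []
    for name in pipelines:
        run_id = trigger(name)                # start the pipeline and get a run handle
        if not wait_for_success(run_id):      # block until the run reaches a final state
            raise RuntimeError(f"{name} did not succeed; completed so far: {completed}")
        completed.append(name)
    return completed

# Stand-ins so the sketch runs without a real ETL tool (assumptions, not a real API).
def fake_trigger(name: str) -> str:
    return f"run-{name}"

def fake_wait(run_id: str) -> bool:
    return run_id != "run-P3"                 # pretend P3 fails

print(run_chain(["P1", "P2"], fake_trigger, fake_wait))  # ['P1', 'P2']
```

Because the loop raises on the first unsuccessful run, nothing downstream of a failed pipeline is ever triggered.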

How We Implemented It

Step 1: Using APIs

Most ETL tools provide REST APIs that let an external caller:

Trigger jobs
Fetch execution status
Capture success or failure

Each API call identifies the pipeline and authenticates using:

Project
Environment
Job name
Authentication token
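As a rough illustration of how those four values fit into API calls, the sketch below builds the trigger and status requests with Python's standard library. The base URL, the path layout, and the Bearer-token header are all assumptions made for illustration; a real ETL tool's API will differ, so check its documentation for the actual endpoints.

```python
import json
import urllib.request
from dataclasses import dataclass
from urllib.parse import quote

BASE_URL = "https://etl.example.com/api"   # hypothetical host, not a real service

@dataclass(frozen=True)
class PipelineRef:
    project: str
    environment: str
    job_name: str

def _job_url(ref: PipelineRef) -> str:
    # Hypothetical path layout: project / environment / job name.
    parts = (quote(p, safe="") for p in (ref.project, ref.environment, ref.job_name))
    return "{}/projects/{}/environments/{}/jobs/{}".format(BASE_URL, *parts)

def build_trigger_request(ref: PipelineRef, token: str) -> urllib.request.Request:
    """Request that would start a run of the job (assumed to be a POST)."""
    return urllib.request.Request(
        f"{_job_url(ref)}/runs",
        data=json.dumps({}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

def build_status_request(ref: PipelineRef, run_id: str, token: str) -> urllib.request.Request:
    """Request that would fetch the state of one run (assumed to be a GET)."""
    return urllib.request.Request(
        f"{_job_url(ref)}/runs/{quote(run_id, safe='')}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

p1 = PipelineRef(project="sales", environment="prod", job_name="P1")
request = build_trigger_request(p1, token="example-token")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

In a workflow, the token would typically come from a repository secret exposed to the step as an environment variable rather than being written into the script.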

Step 2: Creating a GitHub Actions Workflow

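We created a workflow YAML file in our repository (GitHub Actions reads workflows from the .github/workflows/ directory) that triggers the pipelines in dependency order.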

Step 3: Dependency Validation

Instead of triggering the next pipeline immediately, we:

Polled the status API
Checked the execution state
Proceeded only when the status was SUCCESS

If a pipeline failed:

The GitHub Actions job failed
Downstream pipelines were not triggered
Alerts could be sent immediately via PagerDuty, Slack, etc.
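The polling step might look roughly like the sketch below. It is written under assumptions: the status names other than RUNNING and SUCCESS (`FAILED`, `CANCELLED`), the 30-second interval, and the 4-hour timeout are placeholders, and `get_status` stands in for a real call to the ETL tool's status API.

```python
import time
from typing import Callable

TERMINAL_FAILURES = {"FAILED", "CANCELLED"}   # assumed names; use the tool's real states

def wait_for_success(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 4 * 3600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the run succeeds (True), fails (False), or the timeout passes (False)."""
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        status = get_status()
        if status == "SUCCESS":
            return True
        if status in TERMINAL_FAILURES:
            return False
        sleep(poll_seconds)                   # still RUNNING: wait, then ask again
    return False                              # timed out without reaching SUCCESS

# Simulated status sequence so the sketch runs without a real API.
statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
ok = wait_for_success(lambda: next(statuses), poll_seconds=0)

exit_code = 0 if ok else 1                    # a real script would call sys.exit(exit_code)
print(exit_code)                              # 0
```

A step whose process exits with a non-zero code is marked as failed by GitHub Actions, which is what stops the later steps from triggering downstream pipelines.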

Step 4: Observability & Control

GitHub Actions provided:

Clear execution logs
Start and end timestamps
Pipeline-level visibility
A single platform to monitor everything
Control over whether to continue with the next step or stop when a step fails

We could easily add:

Slack notifications or PagerDuty calls
Retry logic
Conditional branching
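As one possible shape for the retry and notification pieces, here is a minimal sketch. The function names, the attempt count, and the linear backoff are assumptions; the webhook URL is a placeholder. The only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import time
import urllib.request
from typing import Callable

def run_with_retries(
    run_once: Callable[[], bool],
    max_attempts: int = 3,
    backoff_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it returns True; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if run_once():
            return attempt
        if attempt < max_attempts:
            sleep(backoff_seconds * attempt)  # linear backoff between attempts
    raise RuntimeError(f"failed after {max_attempts} attempts")

def build_slack_alert(webhook_url: str, pipeline: str, detail: str) -> urllib.request.Request:
    """Build (but do not send) a Slack incoming-webhook message about a failure."""
    payload = {"text": f"Pipeline {pipeline} failed: {detail}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Simulated run that fails once and then succeeds; sleeping is skipped for the demo.
outcomes = iter([False, True])
attempts = run_with_retries(lambda: next(outcomes), sleep=lambda seconds: None)
print(attempts)  # 2

# Placeholder webhook URL; sending would be urllib.request.urlopen(alert).
alert = build_slack_alert("https://hooks.slack.com/services/T000/B000/XXXX", "P1", "timed out")
```

Retrying only makes sense for pipelines that are safe to rerun; for the rest, alerting and stopping is the safer default.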

Why This Was Better Than Native Scheduling

Feature	ETL Scheduler	GitHub Actions
Time-based runs	Yes	Yes
State-based execution	Limited	Strong
Dynamic dependencies	Weak	Native
Sequential control	Limited	Built-in
Failure visibility	Moderate	Excellent
Central orchestration	No	Yes

Real Production Impact

After switching to GitHub Actions:

Pipelines started as soon as dependencies were completed
No idle waiting between jobs
No blocked executions
No manual restarts

We moved from a time-driven ETL system to a state-driven (event-driven) data pipeline.

Key Takeaways

ETL schedulers work well for simple cases, but for complex dependencies an external orchestrator such as GitHub Actions gives a cleaner high-level architecture
Time-based scheduling becomes unreliable when pipeline runtimes vary
External orchestration brings flexibility and reliability
GitHub Actions is not just for CI/CD—it’s a powerful workflow engine
State-driven orchestration leads to faster, safer pipelines