Solving ETL Dependency Bottlenecks with GitHub Actions
Introduction
In modern data platforms, ETL pipelines are rarely independent. They are deeply interconnected: one pipeline’s output becomes another pipeline’s input. In one of our production projects, we faced a classic orchestration challenge: pipelines were scheduled by time rather than by actual completion, which breaks down when execution times are not predictable.
This article walks through:
- The real problem we encountered
- Why ETL schedulers alone were insufficient
- How external orchestration changed everything
- Why we chose GitHub Actions
- How we implemented it in practice
The Problem: Time-Based Scheduling vs Reality
We were using an ETL tool and had multiple pipelines scheduled based on dependencies.
A simplified version of our setup looked like this:
- Pipeline P1
  - Scheduled at 1:00 PM
  - Typical runtime: 1–2 hours
- Pipeline P2
  - Scheduled at 3:00 PM
  - Depends on P1’s output
Where Things Started Breaking
This approach assumes:
- P1 will always finish before 3 PM
- Execution time is predictable
But in real-world ETL systems:
- Source data volume fluctuates
- Network latency changes
- Downstream systems slow down
- Unexpected retries happen
This led to two major inefficiencies:
Scenario 1:
If P1 finished in 1 hour, it completed at 2 PM.
But P2 would still wait until 3 PM, wasting a full hour of idle time.
Scenario 2:
If P1 ran longer—say 2.5 hours—it was still running at 3 PM.
As a result:
- P2 either failed
- Or was blocked
- Or ran on incomplete data
This created operational instability and required manual intervention.
Why Native ETL Schedulers Weren’t Enough
Most ETL tools offer:
- Cron-based scheduling
- Basic dependency handling
- Trigger-on-success options (within limits)
However, these features struggle when:
- Pipelines have variable runtimes
- Dependencies span multiple environments
- You need dynamic orchestration
- Monitoring and retries must be centralized
At their core, ETL schedulers are time-driven.
Our problem needed an event-driven solution.
The Core Insight: Orchestration ≠ Scheduling
This was the turning point for us.
Scheduling answers:
When should something start?
Orchestration answers:
What should happen next, and under what conditions?
We didn’t need better schedules.
We needed a system that could say:
“Start Pipeline P2 only after Pipeline P1 has actually completed successfully.”
Why External Orchestration Wins
By moving orchestration outside the ETL tool, we gained:
- Decoupling from rigid schedules
- Centralized control
- Clear dependency graphs
- Better failure visibility
- Easier retries and alerts
We evaluated several options—but GitHub Actions stood out.
What Is GitHub Actions?
GitHub Actions is a workflow automation tool built directly into GitHub.
It is commonly used for:
- CI/CD pipelines
- Automated testing
- Code deployment
At its core, though, GitHub Actions is a general-purpose workflow engine that can:
- Run jobs sequentially
- Trigger external systems via APIs
- Wait for conditions
- Fail fast or retry
- Log everything with timestamps
That makes it powerful for ETL orchestration.
Why GitHub Actions Worked for Our ETL Pipelines
- Sequential execution
- Conditional flow based on success/failure
- Event-driven execution
- Full visibility into pipeline runs with logs
- External control, without ETL-tool lock-in
Instead of guessing execution times, we started reacting to actual pipeline state.
High-Level Architecture
Our new orchestration flow looked like this:
GitHub Actions Workflow → Trigger Pipeline P1 → Poll P1 status (RUNNING → SUCCESS) → Trigger Pipeline P2 → Trigger P3 → P4 → …
Each pipeline starts only after the previous one finishes successfully.
No clocks.
No guessing.
No wasted time.
How We Implemented It
Step 1: Using APIs
Most ETL tools expose REST APIs that let you:
- Trigger jobs
- Fetch execution status
- Capture success or failure
Each pipeline is identified by:
- Project
- Environment
- Job name
- Authentication token
Step 2: Creating a GitHub Actions Workflow
We created a workflow YAML file in our repository:
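A simplified sketch of that workflow is shown below. The job names, the `scripts/run_pipeline.py` helper, and the secret name are illustrative placeholders; the key idea is that `needs:` makes P2 start only after P1’s job has succeeded:

```yaml
# .github/workflows/etl-orchestration.yml
# Illustrative sketch -- script path and secret name are placeholders.
name: ETL Orchestration

on:
  schedule:
    - cron: "0 13 * * *"   # kick off the chain at 13:00 UTC
  workflow_dispatch:        # allow manual runs

jobs:
  run-p1:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger P1 and wait for SUCCESS
        run: python scripts/run_pipeline.py --job p1
        env:
          ETL_API_TOKEN: ${{ secrets.ETL_API_TOKEN }}

  run-p2:
    needs: run-p1           # starts only if run-p1 succeeded
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger P2 and wait for SUCCESS
        run: python scripts/run_pipeline.py --job p2
        env:
          ETL_API_TOKEN: ${{ secrets.ETL_API_TOKEN }}
```

Additional pipelines (P3, P4, …) become additional jobs, each declaring `needs:` on its predecessor.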
Step 3: Dependency Validation
Instead of immediately triggering the next pipeline, we:
- Polled the status API
- Checked the execution state
- Proceeded only when the status was SUCCESS
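The polling step can be sketched as a small function. The state names (`SUCCESS`, `FAILED`, `CANCELLED`) and the timeout defaults are assumptions; `fetch_status` stands in for whatever status call your ETL tool exposes:

```python
# Sketch of the dependency gate: poll until the upstream run reaches a
# terminal state, and report success only for SUCCESS. State names and
# timeouts are assumptions to adapt to your ETL tool.
import time


def wait_for_success(fetch_status, timeout_s=4 * 3600, poll_interval_s=60):
    """Poll fetch_status() until it returns a terminal state.

    Returns True only for SUCCESS; False for FAILED/CANCELLED or on
    timeout, so the calling GitHub Actions step can exit non-zero and
    block downstream jobs.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "SUCCESS":
            return True
        if status in ("FAILED", "CANCELLED"):
            return False
        time.sleep(poll_interval_s)  # still RUNNING/QUEUED -- keep waiting
    return False  # timed out
```

Exiting non-zero on failure is what lets GitHub Actions’ own dependency handling (`needs:`) stop the rest of the chain.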
If a pipeline failed:
- The GitHub Actions job failed
- Downstream pipelines were not triggered
- Alerts could be sent immediately via PagerDuty, Slack, etc.
Step 4: Observability & Control
GitHub Actions provided:
- Clear execution logs
- Start and end timestamps
- Pipeline-level visibility
- A single platform to monitor everything
- Control over whether to continue or stop when a step fails
We could easily add:
- Slack notifications / PagerDuty calls
- Retry logic
- Conditional branching
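As one example of how little is needed, a failure alert is a single extra step using the built-in `if: failure()` condition. This is an illustrative fragment to append to a job’s `steps:` list; the `SLACK_WEBHOOK_URL` secret name is a placeholder for an incoming-webhook URL:

```yaml
# Illustrative fragment: append to a job's `steps:` list.
# SLACK_WEBHOOK_URL is a placeholder secret for a Slack incoming webhook.
- name: Notify Slack on failure
  if: failure()   # runs only when a previous step in the job failed
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d "{\"text\": \"ETL orchestration failed in ${{ github.workflow }}\"}"
```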
Why This Was Better Than Native Scheduling
| Feature | ETL Scheduler | GitHub Actions |
|---|---|---|
| Time-based runs | Yes | Yes |
| State-based execution | Limited | Strong |
| Dynamic dependencies | Weak | Native |
| Sequential control | Limited | Built-in |
| Failure visibility | Moderate | Excellent |
| Central orchestration | No | Yes |
Real Production Impact
After switching to GitHub Actions:
- Pipelines started as soon as their dependencies completed
- No idle waiting between jobs
- No blocked executions
- No manual restarts
We moved from a time-driven ETL system to a state-driven or event-driven data pipeline.
Key Takeaways
- ETL schedulers work well for simple cases, but complex dependencies call for an external orchestrator such as GitHub Actions
- Time-based scheduling breaks down when runtimes vary
- External orchestration brings flexibility and reliability
- GitHub Actions is not just for CI/CD—it’s a powerful workflow engine
- State-driven orchestration leads to faster, safer pipelines
