Introduction
In modern data platforms, ETL pipelines are rarely independent. They are deeply interconnected: one pipeline’s output becomes another pipeline’s input. In one of our production projects, we faced a classic orchestration challenge. Pipelines were scheduled by clock time rather than by the actual completion of their upstream dependencies, which only works when execution times are predictable.
This article walks through:
- The real problem we encountered
- Why ETL schedulers alone were insufficient
- How external orchestration changed everything
- Why we chose GitHub Actions
- How we implemented it in practice
The Problem: Time-Based Scheduling vs Reality
We were using an ETL tool, with multiple pipelines scheduled at fixed times chosen to respect their dependencies.
A simplified version of our setup looked like this:
- Pipeline P1
  - Scheduled at 1:00 PM
  - Typical runtime: 1–2 hours
- Pipeline P2
  - Scheduled at 3:00 PM
  - Depends on P1’s output
Where Things Started Breaking
This approach assumes:
- P1 will always finish before 3 PM
- Execution time is predictable
But in real-world ETL systems:
- Source data volume fluctuates
- Network latency changes
- Downstream systems slow down
- Unexpected retries happen
This led to two recurring problems:
Scenario 1:
If P1 finished in 1 hour, it completed at 2 PM.
But P2 would still wait until 3 PM, wasting a full hour of idle time.
Scenario 2:
If P1 ran longer, say 2.5 hours, it was still running at 3 PM.
As a result, P2 did one of the following:
- Failed
- Was blocked
- Ran on incomplete data
This created operational instability and required manual intervention.
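To make the arithmetic of the two scenarios concrete, here is a small sketch that computes how long P2 sits idle, or how long it overlaps with a still-running P1, under a fixed schedule. The function name `schedule_gap` and the calendar date are illustrative assumptions; only the clock times come from the setup above.

```python
from datetime import datetime, timedelta

def schedule_gap(p1_start, p1_runtime, p2_scheduled):
    """Return (idle, overlap) between P1's actual finish and P2's fixed start."""
    p1_end = p1_start + p1_runtime
    zero = timedelta(0)
    idle = max(p2_scheduled - p1_end, zero)     # P2 waits although P1 is done
    overlap = max(p1_end - p2_scheduled, zero)  # P2 starts while P1 still runs
    return idle, overlap

day = datetime(2024, 1, 1)           # arbitrary date; only the times matter
p1_start = day.replace(hour=13)      # P1 scheduled at 1:00 PM
p2_scheduled = day.replace(hour=15)  # P2 scheduled at 3:00 PM

fast = schedule_gap(p1_start, timedelta(hours=1), p2_scheduled)
slow = schedule_gap(p1_start, timedelta(hours=2, minutes=30), p2_scheduled)

print("Scenario 1 idle:", fast[0], "overlap:", fast[1])
print("Scenario 2 idle:", slow[0], "overlap:", slow[1])
```

In Scenario 1 the idle time is one hour. In Scenario 2 the overlap is thirty minutes, which is the window in which P2 could read incomplete data.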
Why Native ETL Schedulers Weren’t Enough
Most ETL tools offer:
- Cron-based scheduling
- Basic dependency handling
- Trigger-on-success options (within limits)
However, these features struggle when:
- Pipelines have variable runtimes
- Dependencies span multiple environments
- You need dynamic orchestration
- Monitoring and retries must be centralized
At their core, ETL schedulers are time-driven.
Our problem needed an event-driven solution.
The Core Insight: Orchestration ≠ Scheduling
This was the turning point for us.
Scheduling answers:
When should something start?
Orchestration answers:
What should happen next, and under what conditions?
We didn’t need better schedules.
We needed a system that could say:
“Start Pipeline P2 only after Pipeline P1 has actually completed successfully.”
Why External Orchestration Wins
By moving orchestration outside the ETL tool, we gained:
- Decoupling from rigid schedules
- Centralized control
- Clear dependency graphs
- Better failure visibility
- Easier retries and alerts
We evaluated several options—but GitHub Actions stood out.
What Is GitHub Actions?
GitHub Actions is a workflow automation tool built directly into GitHub.
It is commonly used for:
- CI/CD pipelines
- Automated testing
- Code deployment
More broadly, GitHub Actions is a general-purpose workflow engine that can:
- Run jobs sequentially
- Trigger external systems via APIs
- Wait for conditions (for example, by polling inside a step)
- Fail fast or retry (retries are typically scripted inside a step)
- Log everything with timestamps
That makes it powerful for ETL orchestration.
Why GitHub Actions Worked for Our ETL Pipelines
- Sequential execution
- Conditional flow based on success/failure
- Event-driven execution
- Full visibility into pipeline runs with logs
- External control, without ETL-tool lock-in
Instead of guessing execution times, we started reacting to actual pipeline state.
High-Level Architecture
Our new orchestration flow looked like this:
GitHub Actions workflow → Trigger Pipeline P1 → Poll P1 status (RUNNING → SUCCESS) → Trigger Pipeline P2 → Poll P2 status (RUNNING → SUCCESS) → Trigger P3 → P4 → …
Each pipeline starts only after the previous one finishes successfully.
No clocks.
No guessing.
No wasted time.
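The flow above can be sketched as a short loop. This is a minimal illustration, not our production code: `run_and_wait` is a hypothetical callable that triggers one pipeline and blocks until it reaches a terminal state, and the `FAILED` state name is an assumption (only `RUNNING` and `SUCCESS` appear in our flow). A dictionary stands in for the real ETL API so the example runs on its own.

```python
from typing import Callable, Iterable, List

def run_chain(pipelines: Iterable[str],
              run_and_wait: Callable[[str], str]) -> List[str]:
    """Run pipelines in order and stop at the first one that does not succeed.

    Returns the names of the pipelines that completed successfully.
    """
    completed = []
    for name in pipelines:
        state = run_and_wait(name)
        if state != "SUCCESS":
            raise RuntimeError(
                f"{name} ended in state {state}; completed so far: {completed}"
            )
        completed.append(name)
    return completed

# Stand-in for a real ETL API: P3 is made to fail on purpose.
fake_results = {"P1": "SUCCESS", "P2": "SUCCESS", "P3": "FAILED", "P4": "SUCCESS"}
triggered = []

def fake_run_and_wait(name):
    triggered.append(name)
    return fake_results[name]

try:
    run_chain(["P1", "P2", "P3", "P4"], fake_run_and_wait)
except RuntimeError as err:
    print(err)

print("Triggered:", triggered)  # P4 is never triggered because P3 failed
```

The important property is that a pipeline is triggered only from inside the loop, after its predecessor has reported success, so no start time is ever configured.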
How We Implemented It
Step 1: Using APIs
Most ETL tools provide REST APIs that let you:
- Trigger jobs
- Fetch execution status
- Capture success or failure
Each API call identifies the pipeline and authenticates with:
- Project
- Environment
- Job name
- Authentication token
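As a rough sketch of what such calls might look like, the following builds (but does not send) a trigger request and a status request from those four values. The base URL, the path layout, the `run_id` parameter, and the bearer-token header are all assumptions made for illustration; every ETL tool defines its own endpoints, so check your tool's API reference.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://etl.example.com/api"  # hypothetical base URL

def _job_url(project, environment, job_name):
    # Hypothetical path layout; quote() keeps names with spaces or slashes safe.
    parts = (quote(p, safe="") for p in (project, environment, job_name))
    return "{}/projects/{}/environments/{}/jobs/{}".format(BASE_URL, *parts)

def _auth_headers(token):
    return {"Authorization": f"Bearer {token}",
            "Content-Type": "application/json"}

def build_trigger_request(project, environment, job_name, token):
    """Build a POST request that would start a new run of the job."""
    url = _job_url(project, environment, job_name) + "/runs"
    return urllib.request.Request(url, data=b"{}",
                                  headers=_auth_headers(token), method="POST")

def build_status_request(project, environment, job_name, run_id, token):
    """Build a GET request that would fetch the state of one run."""
    url = _job_url(project, environment, job_name) + "/runs/" + quote(run_id, safe="")
    return urllib.request.Request(url, headers=_auth_headers(token), method="GET")

req = build_trigger_request("sales", "prod", "P1", "dummy-token")
print(req.get_method(), req.full_url)
```

Sending a request is then a call to `urllib.request.urlopen(req)`. Keeping construction separate from sending makes the URL and header logic easy to check without network access.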
Step 2: Creating a GitHub Actions Workflow
We created a workflow YAML file in our repository to trigger the pipelines and check their status through these APIs.
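A workflow step usually does little more than pass credentials to a script and run it. As an illustration of that script's entry point, the sketch below reads its settings from an environment-like mapping (a workflow's `env:` block is where repository secrets would be exposed) and turns any failure into exit code 1, which is what marks a GitHub Actions step as failed. The variable names `ETL_PROJECT`, `ETL_ENVIRONMENT`, and `ETL_TOKEN` and the `run_pipelines` callable are hypothetical.

```python
REQUIRED = ("ETL_PROJECT", "ETL_ENVIRONMENT", "ETL_TOKEN")  # hypothetical names

def load_settings(env):
    """Collect the required settings from an environment-like mapping."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

def main(env, run_pipelines):
    """Return 0 if every pipeline succeeded, 1 otherwise."""
    try:
        run_pipelines(load_settings(env))
    except Exception as err:
        # "::error::" is the GitHub Actions workflow command for an error annotation.
        print(f"::error::{err}")
        return 1
    return 0

# Demo with stand-in values. A real script would end with
# sys.exit(main(os.environ, run_pipelines)) so the step fails on exit code 1.
ok = main({"ETL_PROJECT": "sales", "ETL_ENVIRONMENT": "prod", "ETL_TOKEN": "x"},
          lambda settings: None)
bad = main({"ETL_PROJECT": "sales"}, lambda settings: None)
print(ok, bad)
```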
Step 3: Dependency Validation
Instead of triggering the next pipeline immediately, we:
- Polled the status API
- Checked the execution state
- Proceeded only when the status was SUCCESS
If a pipeline failed:
- The GitHub Actions job failed
- Downstream pipelines were not triggered
- Alerts could be sent immediately via PagerDuty, Slack, etc.
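The polling step can be sketched as a loop with a timeout. This is one possible shape, under several assumptions: `get_status` is a hypothetical callable wrapping the status API, `FAILED` and `CANCELLED` are assumed names for terminal failure states, and the 30-second interval and 3-hour timeout are placeholder values. The `sleep` and `clock` parameters exist only so the example can run without real waiting.

```python
import time

TERMINAL_FAILURES = {"FAILED", "CANCELLED"}  # assumed state names

def wait_for_success(get_status, poll_seconds=30.0, timeout_seconds=3 * 3600,
                     sleep=time.sleep, clock=time.monotonic):
    """Block until get_status() returns SUCCESS.

    Raises RuntimeError on a terminal failure state and TimeoutError if the
    pipeline is still not finished when the timeout expires.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "SUCCESS":
            return
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"pipeline ended in state {status}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)

# Demo with scripted statuses and a recording stand-in for sleep.
statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
waits = []
wait_for_success(lambda: next(statuses), poll_seconds=30, sleep=waits.append)
print("waited", len(waits), "times before SUCCESS")
```

Because both failure paths raise, the calling script exits with an error, the job fails, and nothing downstream is triggered. The timeout matters in practice: without it, a pipeline stuck in RUNNING would keep the workflow polling until the runner's own job limit is hit.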
Step 4: Observability & Control
GitHub Actions provided:
- Clear execution logs
- Start and end timestamps
- Pipeline-level visibility
- Single platform to monitor everything
- Control over whether to run the next step or stop after a failure
We could easily add:
- Slack notifications or PagerDuty calls
- Retry logic
- Conditional branching
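Retry logic, for instance, can live in a small wrapper around any pipeline step. The sketch below is an assumption about how one might write it, not a GitHub Actions feature: `action` is any callable that raises on failure, `notify` is a hook where a Slack or PagerDuty call could be plugged in, and the attempt count and doubling delay are placeholder choices.

```python
import time

def run_with_retries(action, max_attempts=3, base_delay=60.0,
                     notify=print, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    The wait doubles after every failed attempt. The last error is re-raised
    once the attempts are used up, so the surrounding job still fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as err:
            if attempt == max_attempts:
                notify(f"giving up after {attempt} attempts: {err}")
                raise
            delay = base_delay * 2 ** (attempt - 1)
            notify(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)

# Demo: an action that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "SUCCESS"

messages, delays = [], []
result = run_with_retries(flaky, base_delay=60,
                          notify=messages.append, sleep=delays.append)
print(result, delays)
```

Retrying only makes sense for steps that are safe to repeat, so a wrapper like this is better applied to status checks and idempotent triggers than to jobs that append data.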
Why This Was Better Than Native Scheduling
| Feature | ETL Scheduler | GitHub Actions |
|---|---|---|
| Time-based runs | Yes | Yes |
| State-based execution | Limited | Strong |
| Dynamic dependencies | Weak | Native |
| Sequential control | Limited | Built-in |
| Failure visibility | Moderate | Excellent |
| Central orchestration | No | Yes |
Real Production Impact
After switching to GitHub Actions:
- Pipelines started as soon as their dependencies completed
- No idle waiting between jobs
- No blocked executions
- No manual restarts
We moved from a time-driven ETL system to a state-driven or event-driven data pipeline.
Key Takeaways
- ETL schedulers work well for simple cases, but for complex dependencies an external orchestrator such as GitHub Actions gives a cleaner high-level architecture
- Time-based scheduling is unreliable when runtimes vary
- External orchestration brings flexibility and reliability
- GitHub Actions is not just for CI/CD; it’s a powerful workflow engine
- State-driven orchestration leads to faster, safer pipelines