Downtime Is The Enemy: Our Battle-Tested Zero-Downtime JFrog Migration Story
“In the world of continuous delivery, downtime is the enemy of progress.”
Have you ever been sitting peacefully in your office when, suddenly, you are notified that a migration is about to begin for the most critical tool in the stack, the one the entire project depends on? What tool could it be? None other than Artifactory.
So let's dive into a journey of ups and downs that ended in a successful change.

Moving forward, let's give you some background context on who we serve, what our architecture looked like, and why Artifactory was so important for the organisation.
In the world of betting and gaming services, seamless digital operations are critical to ensuring an uninterrupted experience for users. During certain periods, the anticipated system load can spike up to five times the usual levels due to the nature of the business.
Just before one such peak season, we were notified about a major upgrade to our production setup of JFrog Artifactory. Before diving into the changes, let's first understand what JFrog is and why it plays a crucial role in an organization.

JFrog Artifactory is a universal repository manager that centralizes artifact storage, management, and distribution across various technologies. It integrates with CI/CD pipelines, supports multiple package formats (Docker, Maven, npm, etc.), and offers high availability, scalability, and security. As part of JFrog’s DevOps platform, Artifactory streamlines workflows, enhances build stability, and accelerates software delivery.
Using a repository manager offers several benefits:
- Reduces remote downloads, saving bandwidth and time.
- Enhances build stability by minimizing reliance on external repositories.
- Optimizes performance by caching artifacts from remote repositories locally.
- Provides better control over consumed and shared artifacts.
- Enables efficient binary sharing without rebuilding from source.
- Centralizes artifact storage for easy access across teams.
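To make the "centralized storage" point concrete, here is a hedged sketch of how build machines typically consume artifacts through Artifactory instead of public registries; the hostname and the repository names (docker-local, npm-virtual) are placeholders, not our real ones:

```bash
# Hypothetical example: consuming artifacts through Artifactory.
# Hostname and repository names are placeholders.

# Docker: authenticate once, then pull images from the Artifactory registry
docker login xyz.artifact.com
docker pull xyz.artifact.com/docker-local/my-service:1.0.0

# npm: resolve packages via an Artifactory npm repository so they are
# cached centrally instead of being fetched from npmjs.org on every build
npm config set registry https://xyz.artifact.com/artifactory/api/npm/npm-virtual/
npm install
```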
The key benefit of Artifactory for us is its role as a centralized storage system for Docker images, packages, and other artifacts, making it essential to our workflow. Now, you might be wondering—what major upgrades were we concerned about? Let me list them for you.
- Major OS upgrade – Amazon Linux 2 to Amazon Linux 2023.
- Introduction of a new private load balancer for on-premise server access.
- Security updates – CrowdStrike.
- Network hardening rules – updates to Security Groups.
- Creation of alarms.
With the gaming season approaching, minimizing downtime was crucial to avoid production impact. The main challenge was upgrading JFrog without data loss, ensuring continuity with the same database and S3 binary store—this marked the start of our journey.
Before diving into the steps we took, let's look at a diagram of our current architecture to get a basic idea of the changes we were going to make.

As you can see, it was a highly available ECS setup, and one thing worth mentioning is that it was deployed using CloudFormation. We were then asked to come up with a plan for a zero-downtime upgrade, as there was no way to change the OS of the running servers in place; they would have to be replaced with new Amazon Linux 2023 instances.
Want to know what we came up with? We decided to build a parallel setup of the same JFrog by deploying a new ECS service and ASG, without disturbing the old ones, pointing at the same RDS database and S3 binary store. That way, on-premise testing could go smoothly and no downtime was expected during the upgrade.
In our current setup, all the JFrog-related files were baked into an AMI, and we wanted to remove that dependency. The major hurdle we came across was figuring out which files are actually necessary to bootstrap JFrog. Let me make this easy and list the important files for you (a minimal example follows the list).
- binarystore.xml – filestore configuration; in our case, artifacts are stored in an S3 bucket
- master.key
- join.key
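For illustration, here is a minimal, hedged sketch of what an S3-backed binarystore.xml can look like; the bucket name, path prefix, and region are placeholders, and the exact chain template depends on the Artifactory version in use:

```bash
# Hypothetical example only: writes a minimal S3-backed binarystore.xml.
# Bucket name, path prefix, and region are placeholders.
cat > /root/jfrog_home/etc/artifactory/binarystore.xml <<'EOF'
<config version="2">
    <chain template="s3-storage-v3-direct"/>
    <provider id="s3-storage-v3" type="s3-storage-v3">
        <endpoint>s3.amazonaws.com</endpoint>
        <bucketName>my-artifactory-filestore</bucketName>
        <path>artifactory/filestore</path>
        <region>ap-southeast-2</region>
    </provider>
</config>
EOF
```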
So, keeping this in mind, we came up with a new architecture. Let me show you the changes with a diagram.

As you can see, multiple changes have been made to the architecture. Let me sum up, in points, the changes and the new resources that have come up.
- A new Auto Scaling Group: It scales up new AL2023-based EC2 instances with the ecs.config for the same ECS cluster; the user data handles the CrowdStrike and New Relic integrations and copies the above-mentioned JFrog files from S3 to the source paths on the EC2 host that get mounted into the containers. Not to forget, the default.conf file for Nginx is also pulled from the S3 bucket (see the user-data sketch after this list).
- S3 config bucket: The new S3 config bucket contains all the important files carried over from the old running setup, i.e.:
1. ecs.config – cluster configurations
2. master.key
3. join.key
4. binarystore.xml
5. default.conf – Nginx config from the old setup.
- New ECS service: A new ECS service with new private and public load balancers and target groups, mounting the same JFrog files into the container volumes.
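To tie the pieces together, here is a hedged sketch of what the user data on the new AL2023 instances roughly does; the bucket name is a placeholder, and the agent installation steps are omitted since they are organisation-specific:

```bash
#!/bin/bash
# Sketch of the new ASG's user data (placeholder bucket name).

# Join the existing "artifactory" ECS cluster using the stored ecs.config
aws s3 cp s3://artifactory-config-bucket/ecs.config /etc/ecs/ecs.config

# Pull the JFrog bootstrap files and the Nginx config from the S3 config bucket
mkdir -p /root/jfrog_home/etc/artifactory /root/jfrog_home/etc/security
aws s3 cp s3://artifactory-config-bucket/binarystore.xml /root/jfrog_home/etc/artifactory/binarystore.xml
aws s3 cp s3://artifactory-config-bucket/master.key /root/jfrog_home/etc/security/master.key
aws s3 cp s3://artifactory-config-bucket/join.key /root/jfrog_home/etc/security/join.key
aws s3 cp s3://artifactory-config-bucket/default.conf /root/default.conf

# CrowdStrike and New Relic agent installation steps would follow here.
```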
These changes created a parallel setup based on Amazon Linux 2023 with the latest security updates. Now comes the part where we scale up the new infrastructure with the new container instances, and scale down and purge the old services.
We divided the entire scenario into five phases:
- Pre Deployment
- In Deployment
- PVT
- DNS Change
- CleanUp
The Pre-Deployment and In-Deployment phases were largely covered by the setup above, and we followed the steps below.
Pre Deployment:
The following steps are to be followed during the rollout of the changes to the JFrog Artifactory setup (a command-level sketch follows these steps):
- Log in to your Artifactory account.
- Take a backup (snapshot) of the Artifactory RDS instance.
- Create and deploy two CFN templates, artifactory-sg.yaml and artifactory-config-s3.yaml, first. These create the required security groups for RDS, the private ALB, and EC2, plus the S3 bucket used to store binarystore.xml, ecs.config, default.conf, master.key, and join.key.
- Copy the files listed below from the already-running EC2 hosts to the new Artifactory S3 config bucket:
/root/default.conf
/root/jfrog_home/etc/artifactory/binarystore.xml
/root/jfrog_home/etc/security/master.key
/root/jfrog_home/etc/security/join.key
/etc/ecs/ecs.config
Note: The files are transferred to S3 because that removes any dependency on files living on a particular server, where a fault could eventually cause them to be lost.
- Now, go to your Artifactory console and verify that the spare license has been added. The reason for this is that we are running an HA setup: the two existing tasks already consume licenses, so adding a new node to the same HA setup requires an additional license. (Log in to https://jfrog.com/start-free/ to get a 1-month free license.)
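For reference, the pre-deployment steps map roughly to the following commands; the stack names, DB identifier, and bucket name are placeholders for the real ones in our account:

```bash
# Rough CLI equivalents of the pre-deployment steps (placeholder names).

# 1. Snapshot the Artifactory RDS instance before touching anything
aws rds create-db-snapshot \
  --db-instance-identifier artifactory-rds \
  --db-snapshot-identifier artifactory-pre-migration-$(date +%Y%m%d)

# 2. Deploy the security-group and config-bucket CFN templates
aws cloudformation deploy --template-file artifactory-sg.yaml --stack-name artifactory-sg
aws cloudformation deploy --template-file artifactory-config-s3.yaml --stack-name artifactory-config-s3

# 3. From the old EC2 host, copy the bootstrap files into the new config bucket
aws s3 cp /root/default.conf s3://artifactory-config-bucket/default.conf
aws s3 cp /root/jfrog_home/etc/artifactory/binarystore.xml s3://artifactory-config-bucket/binarystore.xml
aws s3 cp /root/jfrog_home/etc/security/master.key s3://artifactory-config-bucket/master.key
aws s3 cp /root/jfrog_home/etc/security/join.key s3://artifactory-config-bucket/join.key
aws s3 cp /etc/ecs/ecs.config s3://artifactory-config-bucket/ecs.config
```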
In Deployment:
- We deployed two CFNs: one for the ASG and the other for the ECS service with new private and public ALBs.
- We created two CloudFormation templates named artifactory-ha-asg.yaml and artifactory-service-v2.yaml.
- Using artifactory-ha-asg.yaml, we deploy the first CFN, which creates an ASG and registers a new container instance on our "artifactory" cluster with the latest Amazon Linux 2023 AMI, along with the CrowdStrike and New Relic installations and the path-based updates.
Note: The container paths would be:
Nginx: /etc/nginx/default.conf
JFrog: /var/opt/jfrog/artifactory
- Using artifactory-service-v2.yaml, we deploy the second CFN, which creates our new service with both the private and public load balancers (see the deployment sketch below).
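A hedged sketch of the deployment phase with the AWS CLI follows; the stack names are placeholders, and the real templates take whatever parameters they were written with:

```bash
# Deploy the new AL2023 ASG that registers container instances on the cluster
aws cloudformation deploy \
  --template-file artifactory-ha-asg.yaml \
  --stack-name artifactory-ha-asg \
  --capabilities CAPABILITY_NAMED_IAM

# Deploy the new ECS service with its private and public ALBs
aws cloudformation deploy \
  --template-file artifactory-service-v2.yaml \
  --stack-name artifactory-service-v2 \
  --capabilities CAPABILITY_NAMED_IAM

# Confirm the new AL2023 container instance has registered on the cluster
aws ecs list-container-instances --cluster artifactory
```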
Now we will move on to the PVT, DNS Change, and Cleanup steps:
PVT:
- Once the CFNs are deployed successfully, create an /etc/hosts entry pointing xyz.artifact.com to the new LB's IPs.
- Run the test cases: AWS health checks, local Docker pull and push, npm install, git-lfs, console login, etc.
- Verify on-prem terminal access to the private load balancer (a command-level sketch of these checks follows).
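Here is a rough sketch of how the PVT can be run from a terminal against the new load balancer before any DNS change; the IP address and repository names are placeholders:

```bash
# PVT sketch: force the hostname to resolve to the new ALB, then test.
# The IP and repository names are placeholders.
echo "10.0.1.25  xyz.artifact.com" | sudo tee -a /etc/hosts

# Artifactory health check (expects "OK")
curl -s https://xyz.artifact.com/artifactory/api/system/ping

# Docker pull/push through the new setup
docker login xyz.artifact.com
docker pull xyz.artifact.com/docker-local/my-service:1.0.0

# npm resolution through the new setup
npm install --registry https://xyz.artifact.com/artifactory/api/npm/npm-virtual/
```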
DNS Change:
- Go to the artifactory cluster → artifactory ECS service (the old service) → Update service, and scale the desired count down from 2 to 1.
- Once the above service is down to 1 task, drain the old host that has no task running and detach that instance from the ASG without selecting the Replace instance option (this automatically decreases the ASG's desired capacity to 1).
- Increase the new host count to 2, then go to the new ECS service and increase its desired count to 2.
- Wait for the new tasks to come up and perform the PVT again.
- After a successful PVT, update the old service again and set its desired and minimum counts to 0.
- Go to xyz.artifact.com/ui/admin/license_management/license, verify that the new IPs are visible there, and perform the PVT again.
- Then go to Route 53 → the xyz.artifact.com.au hosted zone and update the xyz.artifact.com.au record with the new LB's DNS name (CLI equivalents are sketched below).
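Although we performed the cut-over through the console, the steps map roughly to these CLI calls; the service, ASG, instance, and hosted-zone identifiers are placeholders:

```bash
# Scale the old service down to 1, then drain and detach the idle old host
aws ecs update-service --cluster artifactory --service artifactory-old --desired-count 1
aws ecs update-container-instances-state --cluster artifactory \
  --container-instances <old-instance-arn> --status DRAINING
aws autoscaling detach-instances --auto-scaling-group-name artifactory-old-asg \
  --instance-ids <old-instance-id> --should-decrement-desired-capacity

# Scale the new hosts and the new service up to 2, then rerun the PVT
aws autoscaling set-desired-capacity --auto-scaling-group-name artifactory-ha-asg --desired-capacity 2
aws ecs update-service --cluster artifactory --service artifactory-v2 --desired-count 2

# After a clean PVT, park the old service and switch DNS to the new ALB
aws ecs update-service --cluster artifactory --service artifactory-old --desired-count 0
aws route53 change-resource-record-sets --hosted-zone-id <zone-id> \
  --change-batch file://switch-to-new-alb.json
```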
Cleanup:
- Delete the CFN stacks of the old setup (see the sketch below).
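Assuming the old resources live in their own stacks, the cleanup is a couple of delete-stack calls; the stack names below are placeholders:

```bash
# Remove the old setup once the new one has been stable for a while
aws cloudformation delete-stack --stack-name artifactory-old-asg
aws cloudformation delete-stack --stack-name artifactory-old-service
```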
The final JFrog architecture looked like this:

This is how we finally ended up with a highly available Artifactory setup, migrated with zero downtime. We hope these findings help you with your own infrastructure, and that downtime, the enemy of progress, does not hinder your work.
You can connect with us via LinkedIn for more info.
