GitOps for Industrial Fleets: ArgoCD vs Flux Production Tutorial
Last Updated: April 29, 2026
Industrial IoT deployments operate at a completely different scale, and under completely different constraints, than cloud-native SaaS. When you’re managing 1,000+ edge clusters across factories, energy plants, and retail sites, GitOps for industrial fleets demands a fundamentally different approach from pushing Helm charts to EKS in a single region. You face disconnected networks, regulatory approval gates, OT certification requirements, and the need for plant-by-plant rollout safety.
In this tutorial, we’ll walk through production-grade GitOps for industrial fleets using ArgoCD and Flux, cover the architectural decisions that matter at scale, and show you exactly how to handle disconnected sites, multi-site sync waves, and rollback patterns that keep your fleet compliant and safe.
What Makes Industrial Fleet GitOps Different?
Cloud GitOps assumes fast, reliable connectivity and autonomous cluster behavior. Industrial fleets operate under entirely different constraints:
| Challenge | Impact | GitOps Solution |
|---|---|---|
| Disconnected sites (no internet 24/7) | Clusters can’t fetch manifests; manifests stale in offline mode | Image mirrors, manifest snapshots, sync windows, sneakernet patterns |
| Slow, metered links (satellite, cellular) | Pulling 500MB image registries fails or costs $$$ | Delta sync, minimal image footprints, local caching layers |
| Regulatory drift gates | Can’t auto-sync production changes; need approval before any drift correction | Sync modes (Manual vs Automatic), OPA Gatekeeper policies, audit logging |
| OT certification | Changes to safety-critical workloads require documented approval trail | GitOps audit log (who changed what, when), signed commits (sigstore), immutable webhooks |
| Plant-by-plant rollout | One misconfiguration can shut down 10 lines; need canary validation | Sync waves, ApplicationSets with cluster maturity tiers, staged rollouts |
| At-scale observability | 1,000 clusters = 1,000 divergent states; classic drift detection fails | Flux + Prometheus metrics, ArgoCD metrics aggregation, fleet-wide health dashboards |
The decision between ArgoCD and Flux hinges on your fleet topology and risk tolerance.
Decision: Hub-and-Spoke vs Per-Cluster Pull
ArgoCD Hub-and-Spoke (centralized control plane, child applications per cluster):
– Single control plane in your data center or cloud region pulls git and templates applications to each edge cluster
– ApplicationSets generate child applications dynamically (cluster generator, git generator, matrix)
– Best for: centralized oversight, compliance auditing, multi-cloud fleets with heterogeneous clusters
Flux Per-Cluster Pull (each cluster self-manages from git):
– Every edge cluster runs Flux source-controller, kustomization-controller, helm-controller
– Clusters independently pull their own manifests, resolve dependencies, and sync
– Best for: high-autonomy edge sites, disconnected regions, minimal hub resource footprint
Our recommendation for 1,000+ industrial sites: hybrid. Run ArgoCD hub for policy, compliance auditing, and multi-tenant governance. Deploy Flux on edge clusters for self-healing and resilience during hub downtime.
Why Standard Cloud GitOps Breaks at Industrial Scale
Single hub failure cascades to all 1,000 clusters. In cloud environments, control plane unavailability is momentary. In factories, it means production lines go dark. Your hub needs HA, but every edge cluster must remain operational during hub downtime. Pure ArgoCD hub-and-spoke fails without Flux agents on edges.
Bandwidth costs spiral. Pulling a container image in your data center costs nothing. At a remote factory with satellite uplink, pulling the same image to 100 clusters costs $$$ per gigabyte. You need image mirrors and delta sync, not vanilla registry pulls.
Regulatory approval isn’t instant. Cloud teams auto-sync because change advisory is Slack-based. Industrial sites require documented approval: who changed what, when, and why. Every sync must be traceable to a git commit with a signature. Auto-sync without audit trails becomes a liability.
Disconnected operation is mandatory, not nice-to-have. Cloud fleets assume connectivity. Industrial sites operate in zones with no internet for days (underground facilities, ships, oil rigs). Your GitOps system must function in true offline mode—not “slow connectivity” mode, but “zero connectivity for 72 hours” mode.
This is why industrial GitOps demands hybrid patterns: centralized policy (ArgoCD hub) + decentralized reconciliation (Flux on edges).
Architecture: Industrial Fleet Topology

Your fleet looks like this:
– Hub Control Plane (data center): ArgoCD server, Prometheus, Grafana, policy engine
– Regional Sync Nodes (edge DCs or on-prem): Flux controllers, local image mirrors, Git repo caches
– 10 Plants × 100 cells/site: Kubernetes edge clusters, minimal footprint, high autonomy
At scale, you can’t tolerate a single hub failure taking down all 1,000 clusters. Flux on the edge gives you resilience; ArgoCD hub gives you compliance visibility.
ArgoCD ApplicationSet for Fleet-Scale Deployment
ApplicationSets are ArgoCD’s answer to templating applications across hundreds of clusters. Instead of hand-writing 1,000 Application resources, you describe a generator and the ApplicationSet controller creates and manages all the child apps.
Example 1: Cluster Generator (List of Clusters)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-core-services
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: plant-001-line-a
            region: us-midwest
            tier: production
          - cluster: plant-001-line-b
            region: us-midwest
            tier: production
          - cluster: plant-002-assembly
            region: us-south
            tier: production
  template:
    metadata:
      name: '{{cluster}}-core'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: 'fleet/core-services/{{tier}}'
      destination:
        server: 'https://{{cluster}}.internal:6443'
        namespace: kube-system
      syncPolicy:
        automated:
          prune: false     # Regulatory requirement: don't auto-delete
          selfHeal: false
        syncOptions:
          - CreateNamespace=true
This creates a separate Application resource per cluster, each pulling from its tier-specific directory in git.
Example 2: Git Generator with Directory Traversal
For massive fleets (1,000+ clusters), listing every cluster in YAML becomes unmaintainable. Instead, use the Git generator:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-all-workloads
spec:
  generators:
    - git:
        repoURL: https://git.company.local/industrial-fleet
        revision: main
        directories:
          - path: 'clusters/*'
  template:
    metadata:
      name: '{{path.basenameNormalized}}'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: '{{path}}'
      destination:
        server: 'https://{{path.basename}}.internal:6443'
        namespace: default
ApplicationSet auto-discovers all subdirectories and creates child apps. Add a new cluster? Push a new directory to git. ArgoCD handles the rest.
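A repository layout that satisfies the clusters/* directory pattern above might look like the sketch below (directory and file names are illustrative); each cluster directory becomes one Application:
industrial-fleet/
├── clusters/
│   ├── plant-001-line-a/        # → Application "plant-001-line-a"
│   │   ├── kustomization.yaml
│   │   └── core-services.yaml
│   ├── plant-001-line-b/
│   └── plant-002-assembly/
└── fleet/
    └── core-services/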
Example 3: Matrix Generator (Multi-Dimension Templating)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-multi-region-workloads
spec:
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - cluster: plant-001-line-a
                  tier: canary
                - cluster: plant-001-line-b
                  tier: production
                - cluster: plant-002-assembly
                  tier: production
          - list:
              elements:
                - workload: monitoring
                - workload: networking
                - workload: storage
  template:
    metadata:
      name: '{{cluster}}-{{workload}}'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: 'workloads/{{workload}}/{{tier}}'
      destination:
        server: 'https://{{cluster}}.internal:6443'
        namespace: '{{workload}}'
This generates nine Applications automatically (three clusters × three workloads), combining cluster tiers and workload types.
Flux Multi-Tenancy & Kustomization Layers
Flux takes the opposite approach: each edge cluster runs its own reconciliation loop. No hub dependency.
Flux GitRepository Source
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: fleet-manifests
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.company.local/industrial-fleet.git
  ref:
    branch: main
  suspend: false
Kustomization with Multi-Layer Customization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: core-services
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: fleet-manifests
  path: ./fleet/core-services
  prune: false     # Regulatory: no auto-deletion
  wait: true
  timeout: 5m
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-config    # supplies CLUSTER_NAME, REGION, TIER
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: coredns
      namespace: kube-system
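The substituteFrom reference assumes each cluster carries a small cluster-config ConfigMap, seeded at provisioning time; a minimal sketch (values are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
  namespace: flux-system
data:
  CLUSTER_NAME: plant-001-line-a
  REGION: us-midwest
  TIER: production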
Helm Release with Dependency Management
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: industrial-edge-monitoring
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: '57.0.0'
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  values:
    prometheus:
      prometheusSpec:
        retention: 72h
        resources:
          requests:
            memory: 512Mi
            cpu: 250m
  dependsOn:
    - name: storage-class
      namespace: flux-system
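The HelmRelease references a prometheus-community HelmRepository in flux-system; a matching source definition is sketched below (the URL is the public upstream chart repo, swap it for your local mirror on disconnected sites):
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts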
Handling Disconnected Sites: Mirror Registries & Manifest Snapshots
The moment a site loses internet, your GitOps sync stops. For industrial fleets, this is unacceptable.
Pattern 1: Local Image Mirror (Harbor or Artifactory)
apiVersion: v1
kind: Secret
metadata:
  name: registry-mirror
  namespace: flux-system
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
      "auths": {
        "local-harbor.plant-001.internal:5000": {
          "auth": "base64-encoded-credentials"
        }
      }
    }
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: edge-apps
  namespace: flux-system
spec:
  image: local-harbor.plant-001.internal:5000/industrial/workload
  interval: 5m
  secretRef:
    name: registry-mirror
Deployment workflow:
1. Hub builds images → pushes to central registry
2. Regional sync node mirrors images to local harbors (scheduled, off-peak; see the CronJob sketch after this list)
3. Edge clusters pull only from local mirrors (no outbound registry calls)
4. If internet drops → edge clusters still have cached images for 30–90 days
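Step 2 can be implemented in several ways; one sketch, assuming a regional sync-node cluster and hypothetical registry hostnames, is a CronJob that copies tags with skopeo during off-peak hours:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mirror-industrial-workload
  namespace: registry-sync
spec:
  schedule: "0 2 * * *"            # 02:00 local time, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: skopeo
              image: quay.io/skopeo/stable:latest
              args:
                - copy
                - --all            # copy all architectures
                # --src-creds/--dest-creds omitted here; mount registry credentials in production
                - docker://registry.company.local/industrial/workload:v2.0
                - docker://local-harbor.plant-001.internal:5000/industrial/workload:v2.0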
Pattern 2: Manifest Snapshots with Git-Based Caching
# On hub: daily snapshot
git clone https://git.company.local/industrial-fleet.git && cd industrial-fleet
mkdir -p snapshots
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
kustomize build fleet/core-services > snapshots/core-services.yaml
helm template monitoring prometheus-community/kube-prometheus-stack \
  --values fleet/monitoring/values.yaml > snapshots/monitoring.yaml
git add snapshots/
git commit -m "Daily manifest snapshot $(date +%Y-%m-%d)"
git push origin HEAD:offline-snapshots
On disconnected edge clusters:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: fleet-manifests-offline
  namespace: flux-system
spec:
  interval: 24h
  url: https://git.company.local/industrial-fleet.git
  ref:
    branch: offline-snapshots
If the cluster loses its connection, Flux keeps reconciling from the last artifact it fetched; the 24h interval only bounds how often it retries the remote.
Pattern 3: Sneakernet & USB Sync (Extreme Disconnection)
For truly isolated sites (satellite downlink, no local internet):
# Hub prepares USB drive
mkdir -p /mnt/usb/fleet-sync-$(date +%Y%m%d)/manifests
cp -r snapshots/* /mnt/usb/fleet-sync-$(date +%Y%m%d)/manifests/
docker save local-harbor.plant-001.internal:5000/industrial/workload \
  | gzip > /mnt/usb/fleet-sync-$(date +%Y%m%d)/images.tar.gz

# Field engineer plugs in the USB drive at the site
cd /mnt/usb/fleet-sync-20260429
docker load < images.tar.gz
kubectl apply -f manifests/
Not glamorous, but essential for remote energy sites and offshore platforms.
Sync Waves: Plant-by-Plant Rollout with Approval Gates
You cannot afford a misconfiguration affecting all 1,000 clusters simultaneously. Use sync waves to stage rollouts:

ArgoCD Sync Waves
---
# Wave 1: Canary plant (single-line validation)
apiVersion: v1
kind: ConfigMap
metadata:
  name: core-config-v2
  annotations:
    argocd.argoproj.io/sync-wave: "1"
data:
  version: "2.0"
---
# Wave 3: Regional production (5 plants)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-workload
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 5
  selector:
    matchLabels:
      app: fleet-workload
  template:
    metadata:
      labels:
        app: fleet-workload
    spec:
      containers:
        - name: workload
          image: local-harbor.plant-001.internal:5000/industrial/workload:v2.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
---
# Wave 5: Full fleet rollout
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-workload-full
  annotations:
    argocd.argoproj.io/sync-wave: "5"
spec:
  replicas: 995    # selector/template omitted for brevity
Execution timeline:
Wave 1 (T+0): Canary plant (1 cluster)
Wave 2 (T+24h): Manual approval check (can be enforced with a sync window; see the sketch after this timeline)
Wave 3 (T+24h): 5 regional production plants
Wave 4 (T+48h): Manual approval check
Wave 5 (T+48h): Remaining 994 clusters (full fleet)
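The manual approval checks in waves 2 and 4 can be backed by ArgoCD sync windows on the AppProject, so production-tier applications only sync inside an approved window. A sketch, assuming a fleet-production project and an illustrative schedule:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: fleet-production
  namespace: argocd
spec:
  syncWindows:
    - kind: allow
      schedule: '0 6 * * 2'      # Tuesdays 06:00, after change-board approval
      duration: 4h
      applications:
        - '*-core'
      manualSync: true            # permit operator-triggered syncs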
Rollback Patterns: Git Revert vs ArgoCD Rollback
Industrial deployments demand rollback speed. When something goes wrong at 2 AM, you can’t wait 30 minutes.
Fast Rollback: ArgoCD Sync to Previous Revision
# Auto-sync must be disabled before rolling back (argocd app set plant-001-core --sync-policy none)
argocd app history plant-001-core
# Output:
# ID  DATE                  REVISION
# 12  2026-04-29 13:00:00Z  def5678 (GitOps sync)
# 13  2026-04-29 14:30:00Z  abc1234 (GitOps sync)

# Roll back to the previous deployed revision by history ID
argocd app rollback plant-001-core 12
Time to recover: 30 seconds.
Persistent Rollback: Git Revert + Commit
git log --oneline | head -5
git revert abc1234 --no-edit
git push origin main
# ArgoCD detects the new commit (via webhook or on its next repo poll) and syncs the revert
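On Flux-managed edges, the equivalent fast rollback is pinning the GitRepository to the last known-good commit until the revert lands on main; spec.ref.commit takes precedence over the branch. A sketch (the SHA is a hypothetical placeholder):
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: fleet-manifests
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.company.local/industrial-fleet.git
  ref:
    branch: main
    commit: def5678    # hypothetical known-good SHA; remove to resume tracking main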
Policy Gates: OPA Gatekeeper & Kyverno Policies
OPA Gatekeeper: Require Resource Requests
# Rego for the Gatekeeper ConstraintTemplate backing the K8sRequiredResources kind
package k8srequiredresources

violation[{"msg": msg}] {
  container := input.review.object.spec.template.spec.containers[_]
  not container.resources.requests
  msg := "Containers must define resource requests"
}
Deploy the constraint:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-resources
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    excludedNamespaces: ["kube-system"]
Capacity Planning: Sizing Your Hub and Edge Agents
ArgoCD Hub Sizing (1,000 ApplicationSet-managed clusters):
– CPU: 2-4 cores (application controller reconciles 333 apps/minute)
– Memory: 2-4 GB (ArgoCD server + etcd)
– Persistent Storage: 100 GB (etcd history, logs, metrics)
– Network: 1 Gbps minimum (webhook payloads, metric exports)
Flux Agent Sizing (per edge cluster, minimal):
– CPU: 100m per controller (source-controller, kustomization-controller, helm-controller)
– Memory: 256 MB per controller
– Persistent Storage: 10 GB (git cache, image cache)
Flux is lightweight enough to run on edge clusters with 2 cores and 4 GB RAM.
Git Repository Performance:
With 1,000 clusters polling every 5 minutes:
– 200 polling requests/minute
– ~1 GB/month of manifest diffs
Solutions: GitHub Enterprise ($231/month), Gitea with replication, GitLab HA (~$2,000/year).
Security Model: Signed Commits, Image Signatures, and RBAC
Git Commit Signing (sigstore):
# One-time setup: sign commits keylessly with sigstore gitsign
git config --local commit.gpgsign true
git config --local gpg.x509.program gitsign
git config --local gpg.format x509

git commit -S -m "Deploy fleet workload v2.0"
git push origin main
Image Signature Verification (sigstore cosign):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - 'myregistry/industrial-*'
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
                      -----END PUBLIC KEY-----
RBAC: Per-Cluster Service Accounts
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-plant-001-line-a
  namespace: argocd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argocd-plant-001-line-a
rules:
  - apiGroups: ['apps']
    resources: ['deployments', 'statefulsets']
    verbs: ['get', 'list', 'patch', 'update']
# Intentionally restrictive: no wildcards, no delete verb
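The ServiceAccount and ClusterRole above still need a binding before the hub can act with them; a minimal sketch:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-plant-001-line-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: argocd-plant-001-line-a
subjects:
  - kind: ServiceAccount
    name: argocd-plant-001-line-a
    namespace: argocd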
Drift Detection at Scale
ArgoCD Drift Metrics:
# Applications out of sync
count(argocd_app_info{sync_status="OutOfSync"})

# Sync success rate across the fleet (last hour)
sum(increase(argocd_app_sync_total{phase="Succeeded"}[1h]))
  / sum(increase(argocd_app_sync_total[1h]))

# Applications stuck out of sync (pair with a `for:` duration in the alert rule)
argocd_app_info{sync_status="OutOfSync"} == 1
Alerting Strategy:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fleet-drift-alerts
  namespace: monitoring
spec:
  groups:
    - name: gitops-fleet
      rules:
        - alert: ClusterOutOfSyncLong
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 1h
          annotations:
            summary: "Application out of sync for more than 1 hour"
        - alert: SyncFailureRate
          expr: |
            (sum by (name) (increase(argocd_app_sync_total{phase="Failed"}[1h]))
              / sum by (name) (increase(argocd_app_sync_total[1h]))) > 0.1
          for: 10m
          annotations:
            summary: "Application has > 10% sync failures"
Observability: Fleet Health Dashboard
apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-health-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Industrial Fleet GitOps Health",
        "panels": [
          {
            "title": "Clusters In Sync",
            "targets": [{"expr": "count(argocd_app_info{sync_status=\"Synced\"})"}]
          },
          {
            "title": "Clusters Out of Sync",
            "targets": [{"expr": "count(argocd_app_info{sync_status=\"OutOfSync\"})"}]
          },
          {
            "title": "Failed Syncs (Last 24h)",
            "targets": [{"expr": "sum(increase(argocd_app_sync_total{phase=\"Failed\"}[24h]))"}]
          }
        ]
      }
    }
Production Checklist: Going Live with 1,000+ Clusters
Infrastructure & Networking
– [ ] Hub control plane: HA (3+ replicas), persistent storage (etcd + PVC)
– [ ] Flux agents: resource limits enforced (CPU: 250m, Mem: 512Mi)
– [ ] Image mirrors: deployed in all regions, sync jobs scheduled (off-peak)
– [ ] Git repo: redundancy (GitHub Enterprise, GitLab HA, Gitea with replication)
– [ ] Network policies: ingress to hub restricted to known clusters only
Compliance & Audit
– [ ] Git commits signed (sigstore): all manifests signed, keys rotated quarterly
– [ ] Audit logging: all syncs logged to Loki, retention = 1 year (regulatory)
– [ ] OPA Gatekeeper policies: deployed, tested, enforced (no permissive mode)
– [ ] RBAC: ArgoCD service accounts per cluster, no wildcard permissions
– [ ] Backup: ApplicationSet snapshots, Git repo backups (daily), etcd backups (hourly)
Operational Readiness
– [ ] Runbooks: rollback procedure (doc + tested), incident communication plan
– [ ] Observability: Prometheus + Loki + Grafana dashboards live, alerts configured
– [ ] Disaster recovery: RTO ≤ 1 hour, RPO ≤ 15 min, tested quarterly
– [ ] Canary validation: 5-plant canary sync tested, approval gates documented
– [ ] Disconnected site testing: local mirror failover tested, sneakernet tested
Tooling & Automation
– [ ] GitOps CLI: argocd or flux CLI integrated into release pipeline
– [ ] CI/CD gates: all manifests linted (Kube-linter, Kubescape), no merges without passing checks (see the pipeline sketch after these checklists)
– [ ] Image scanning: all container images scanned for CVEs before registry push
– [ ] Helm dependency updates: automated Renovate/Dependabot PRs, reviewed weekly
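A hedged sketch of the CI lint gate from the checklist, assuming kube-linter and kubescape are installed on the runner, manifests live under clusters/, and each cluster directory has a kustomization.yaml (flag names may vary by tool version):
#!/usr/bin/env bash
# ci-lint-gate.sh — fail the merge pipeline if any manifest check fails
set -euo pipefail

# Render every cluster overlay and lint the rendered output
for dir in clusters/*/; do
  kustomize build "$dir" > "/tmp/$(basename "$dir").yaml"
done

kube-linter lint /tmp/*.yaml                           # static best-practice checks
kubescape scan /tmp/*.yaml --severity-threshold high   # security/compliance scan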
Real-World Deployment Example: 300-Cluster Retail Fleet
Let’s walk through a concrete example: deploying a payment processing update to 300 retail stores across North America.
Week 1: Preparation
- Create feature branch payment-v2.1 in git
- Update manifests: modify the Deployment image, bump payment-processing to v2.1
- Push to git (requires a signed commit + peer review)
- Automated CI: run Kube-linter, Kubescape, and image scanning; all must pass
- Add a canary entry to the ApplicationSet:
  - cluster: test-store-001-nyc
    tier: canary
Week 2: Canary Deployment
- Engineer creates a PR to merge payment-v2.1 into main
- ArgoCD previews the ApplicationSet changes in dry-run mode
- After approval, merge to main
- ArgoCD syncs the payment-v2.1-core app to test-store-001-nyc (canary)
- Team monitors for 48 hours:
  – Payment transaction volume: normal
  – Error logs: zero payment failures
  – CPU/Memory: no degradation
  – “Card declined” error rate: at baseline
Week 3: Regional Rollout (Wave 3)
- If the canary passes, add the second ApplicationSet entry:
  - cluster: [store-002-boston, store-003-philadelphia, ...]
    tier: production-wave-1
- Merge to main; ArgoCD syncs to 50 stores
- Monitor the metrics dashboard for the regional aggregate:
  – Payment success rate should remain > 99.9%
  – No spike in customer complaints
- After 24h of validation, proceed to Wave 5
Week 4: Full Fleet (Wave 5)
- Add the final ApplicationSet entry:
  - cluster: [all remaining 250 stores]
    tier: production-full
- Merge; ArgoCD syncs payment-v2.1 to all 300 stores over a 4-hour window
- Post-deployment validation:
  – Smoke tests: every store can process a test transaction
  – Metrics: payment volume smooth across all stores
  – Support tickets: monitor Slack for escalations
Rollback (if needed)
If a bug surfaces:
- Emergency hotfix: revert the git commit
  git revert abc1234 --no-edit
  git push origin main
- ArgoCD detects the new commit and syncs the rollback to all clusters on its next poll (or immediately via webhook)
- The payment service rolls back to v2.0 fleet-wide
- Post-incident: review why the canary missed the bug
This entire process, from canary to full-fleet rollout, takes four weeks with zero downtime. A manual rollout to 300 stores would take months and risk human error at every site.
Common Industrial GitOps Failure Modes & Fixes
Scenario 1: Hub Loses Access to Edge Cluster
Symptoms: ArgoCD shows Application status “Unknown”, sync stuck for > 2 hours.
Root causes:
– Firewall rule deleted by security team
– Network outage affecting WAN link to remote plant
– TLS certificate rotation failed; webhook authentication rejected
Mitigation:
1. Deploy Flux agents on all edges; they survive hub downtime
2. Configure NetworkPolicy to allow-list only the necessary flows (see the sketch after this list)
3. Monitor hub disk space; alert at 80% etcd capacity
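For mitigation 2, a sketch of an allow-list policy on the hub’s argocd namespace that only admits traffic from known plant WAN ranges (the CIDR and port are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-ingress-allowlist
  namespace: argocd
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.42.0.0/16     # illustrative plant WAN range
      ports:
        - protocol: TCP
          port: 443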
Scenario 2: Git Repository Becomes Unavailable
Symptoms: All edge clusters report GitRepository reconciliation failures (source unreachable) within a few polling intervals.
Root causes:
– GitHub/GitLab experiences regional outage
– Gitea storage drive failed
– Manifest file accidentally deleted from git
Mitigation:
1. Deploy git mirroring: every edge region caches manifests every 2 hours
2. Rely on Flux applying the last fetched artifact during an outage; for planned outages, suspend the source (flux suspend source git fleet-manifests) to silence reconcile alerts
3. Implement manifest validation in pre-commit hooks
Scenario 3: Image Registry Mirror Fills Up (Disk Full)
Symptoms: Cluster workloads can’t pull images; pods stuck in Pending/ImagePullBackOff.
Root causes:
– Mirror retention policy too lax (old tags are never pruned)
– New workload requires 10 GB image, mirror only has 20 GB available
Mitigation:
# Configure Harbor/Nexus cleanup policies:
#   Retention: keep only the 5 most recent tags per repository
#   Cleanup: run daily at 2 AM, delete untagged images older than 30 days

# Monitor disk usage via the Harbor storage API (admin credentials required)
curl -s https://local-harbor.plant-001.internal:5000/api/v2.0/systeminfo/volumes | jq '.storage'

# Alert at 85% capacity; trigger auto-cleanup at 90%
Scenario 4: OT Team Manually Changes Cluster (Drift)
Symptoms: Operator logs into cluster and runs kubectl patch deployment workload --patch='{"spec":{"replicas":1}}' to temporarily reduce load during peak production hours. GitOps shows Application is “OutOfSync”.
Problem: Auto-sync would immediately revert the change, restarting production.
Solution:
1. Disable automated sync (syncPolicy.automated) for the production tier (manual sync required)
2. Set up a Slack alert for the DevOps team when drift is detected (see the notifications sketch after this list)
3. Document a 4-hour “drift window” where OT team can make manual changes
4. Implement approval gates: OT supervisor must approve manual changes in ArgoCD UI before auto-sync
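For step 2, one option is ArgoCD’s notifications ConfigMap; a sketch defining a custom drift trigger and Slack template (the channel name and token secret are assumptions):
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token            # resolved from argocd-notifications-secret
  template.drift-detected: |
    message: "Drift detected on {{.app.metadata.name}}: manual change awaiting OT review."
  trigger.on-drift: |
    - when: app.status.sync.status == 'OutOfSync'
      send: [drift-detected]
Applications (or the ApplicationSet template) then opt in with the annotation notifications.argoproj.io/subscribe.on-drift.slack: ot-drift-alerts.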
FAQ
Q: Should I use ArgoCD or Flux for my 500 industrial clusters?
A: If you need centralized policy enforcement and audit trails → ArgoCD. If you prioritize cluster autonomy and hub independence → Flux. For regulatory environments, start with ArgoCD hub + Flux edge for resilience.
Q: How do I handle image updates across 1,000 clusters without overwhelming my registry?
A: Use regional mirrors with staggered sync jobs. Hub pushes to central registry once daily. Regional sync nodes pull and mirror during off-peak hours (e.g., 2–4 AM local time). Edge clusters always pull from local mirrors, never the internet.
Q: Can I roll back a bad deployment in under 30 seconds?
A: Yes. Use argocd app rollback <app> <id> (ArgoCD) or pin the Flux GitRepository to the last known-good commit (spec.ref.commit). Both achieve roughly a 30-second RTO.
Q: How do I enforce that only signed images run on production?
A: Deploy Kyverno ClusterPolicy with sigstore verification. Every pod creation is denied unless the image is signed with an approved key.
Q: What’s the compliance story for GitOps?
A: Every sync is a git commit (immutable audit log). Every approval gate is documented. Use OPA Gatekeeper + sigstore for policy enforcement. Log all syncs to Loki with 1-year retention. This satisfies ISO 27001, IEC 62443, and OT security frameworks.
Helm + Kustomize Layering for Industrial Deployments
Raw YAML becomes unmaintainable at 1,000 clusters. Use Helm for templating plus Kustomize for per-cluster overlays to achieve scale with maintainable, per-site configuration.
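A sketch of the overlay layer under these assumptions: a rendered Helm base checked into fleet/core-services/base, plus a per-plant kustomization.yaml that patches only what differs:
# clusters/plant-001-line-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../fleet/core-services/base
patches:
  - target:
      kind: Deployment
      name: fleet-workload
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 2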
