GitOps for Industrial Fleets: ArgoCD vs Flux Production Tutorial

Last Updated: April 29, 2026

Industrial IoT deployments operate under a completely different scale and set of constraints than cloud-native SaaS. When you’re managing 1,000+ edge clusters across factories, energy plants, and retail sites, GitOps for industrial fleets demands a fundamentally different approach from pushing Helm charts to EKS in a single region. You face disconnected networks, regulatory approval gates, OT certification requirements, and the need for plant-by-plant rollout safety.

In this tutorial, we’ll walk through production-grade GitOps for industrial fleets using ArgoCD and Flux, cover the architectural decisions that matter at scale, and show you exactly how to handle disconnected sites, multi-site sync waves, and rollback patterns that keep your fleet compliant and safe.

What Makes Industrial Fleet GitOps Different?

Cloud GitOps assumes fast, reliable connectivity and autonomous cluster behavior. Industrial fleets operate under entirely different constraints:

Disconnected sites (no internet 24/7)
– Impact: clusters can’t fetch manifests; manifests go stale in offline mode
– GitOps solution: image mirrors, manifest snapshots, sync windows, sneakernet patterns

Slow, metered links (satellite, cellular)
– Impact: pulling 500 MB from image registries fails or runs up serious bandwidth costs
– GitOps solution: delta sync, minimal image footprints, local caching layers

Regulatory drift gates
– Impact: can’t auto-sync production changes; approval is required before any drift correction
– GitOps solution: sync modes (manual vs. automatic), OPA Gatekeeper policies, audit logging

OT certification
– Impact: changes to safety-critical workloads require a documented approval trail
– GitOps solution: GitOps audit log (who changed what, when), signed commits (sigstore), immutable webhooks

Plant-by-plant rollout
– Impact: one misconfiguration can shut down 10 lines; canary validation is mandatory
– GitOps solution: sync waves, ApplicationSets with cluster maturity tiers, staged rollouts

At-scale observability
– Impact: 1,000 clusters means 1,000 divergent states; classic drift detection fails
– GitOps solution: Flux + Prometheus metrics, ArgoCD metrics aggregation, fleet-wide health dashboards

The decision between ArgoCD and Flux hinges on your fleet topology and risk tolerance.

Decision: Hub-and-Spoke vs Per-Cluster Pull

ArgoCD Hub-and-Spoke (centralized control plane, child applications per cluster):
– Single control plane in your data center or cloud region pulls git and templates applications to each edge cluster
– ApplicationSets generate child applications dynamically (cluster generator, git generator, matrix)
– Best for: centralized oversight, compliance auditing, multi-cloud fleets with heterogeneous clusters

Flux Per-Cluster Pull (each cluster self-manages from git):
– Every edge cluster runs Flux source-controller, kustomization-controller, helm-controller
– Clusters independently pull their own manifests, resolve dependencies, and sync
– Best for: high-autonomy edge sites, disconnected regions, minimal hub resource footprint

Our recommendation for 1,000+ industrial sites: hybrid. Run ArgoCD hub for policy, compliance auditing, and multi-tenant governance. Deploy Flux on edge clusters for self-healing and resilience during hub downtime.

Why Standard Cloud GitOps Breaks at Industrial Scale

Single hub failure cascades to all 1,000 clusters. In cloud environments, control plane unavailability is momentary. In factories, it means production lines go dark. Your hub needs HA, but every edge cluster must remain operational during hub downtime. Pure ArgoCD hub-and-spoke fails without Flux agents on edges.

Bandwidth costs spiral. Pulling a container image inside your data center costs nothing. At a remote factory on a satellite uplink, every pull is metered by the gigabyte, multiplied across 100 clusters. You need image mirrors and delta sync, not vanilla registry pulls.

Regulatory approval isn’t instant. Cloud teams auto-sync because change advisory is Slack-based. Industrial sites require documented approval: who changed what, when, and why. Every sync must be traceable to a git commit with a signature. Auto-sync without audit trails becomes a liability.

Disconnected operation is mandatory, not nice-to-have. Cloud fleets assume connectivity. Industrial sites operate in zones with no internet for days (underground facilities, ships, oil rigs). Your GitOps system must function in true offline mode—not “slow connectivity” mode, but “zero connectivity for 72 hours” mode.

This is why industrial GitOps demands hybrid patterns: centralized policy (ArgoCD hub) + decentralized reconciliation (Flux on edges).


Architecture: Industrial Fleet Topology


Your fleet looks like this:
Hub Control Plane (data center): ArgoCD server, Prometheus, Grafana, policy engine
Regional Sync Nodes (edge DCs or on-prem): Flux controllers, local image mirrors, Git repo caches
10 Plants × 100 cells/site: Kubernetes edge clusters, minimal footprint, high autonomy

At scale, you can’t tolerate a single hub failure taking down all 1,000 clusters. Flux on the edge gives you resilience; ArgoCD hub gives you compliance visibility.


ArgoCD ApplicationSet for Fleet-Scale Deployment

ApplicationSets are ArgoCD’s answer to templating applications across hundreds of clusters. Instead of creating 1,000 Application resources by hand, you describe a generator and the ApplicationSet controller creates and manages all the child apps.

Example 1: Cluster Generator (List of Clusters)

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-core-services
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: plant-001-line-a
            region: us-midwest
            tier: production
          - cluster: plant-001-line-b
            region: us-midwest
            tier: production
          - cluster: plant-002-assembly
            region: us-south
            tier: production
  template:
    metadata:
      name: '{{cluster}}-core'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: 'fleet/core-services/{{tier}}'
      destination:
        server: 'https://{{cluster}}.internal:6443'
        namespace: kube-system
      syncPolicy:
        automated:
          prune: false  # Regulatory requirement: don't auto-delete
          selfHeal: false
        syncOptions:
          - CreateNamespace=true

This creates a separate Application resource for each cluster, pulling from tier-specific directories in git.

Example 2: Git Generator with Directory Traversal

For massive fleets (1,000+ clusters), listing every cluster in YAML becomes unmaintainable. Instead, use the Git generator:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-all-workloads
spec:
  generators:
    - git:
        repoURL: https://git.company.local/industrial-fleet
        revision: main
        directories:
          - path: 'clusters/*/manifests'
  template:
    metadata:
      name: '{{path.basenameNormalized}}'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: '{{path}}'
      destination:
        server: 'https://{{path.basename}}.internal:6443'
        namespace: default

ApplicationSet auto-discovers all subdirectories and creates child apps. Add a new cluster? Push a new directory to git. ArgoCD handles the rest.
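In practice, onboarding looks like the following sketch. The cluster name, paths, and kustomization contents are illustrative, and a scratch local repo stands in for the real fleet repo:

```shell
repo=/tmp/fleet-onboard-demo        # scratch stand-in for the real fleet repo
rm -rf "$repo" && git init -q "$repo" && cd "$repo"
git config user.email fleet-bot@company.local
git config user.name fleet-bot

# A new cluster is just a new directory matching 'clusters/*/manifests'
mkdir -p clusters/plant-003-packaging/manifests
cat > clusters/plant-003-packaging/manifests/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../fleet/core-services/production
EOF

git add clusters/ && git commit -q -m "Onboard plant-003-packaging"
# After `git push`, the next ApplicationSet reconcile discovers the new
# directory and creates the child Application automatically.
```

No ArgoCD CLI interaction is needed; the git commit is the onboarding event, which is exactly what makes the pattern auditable.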

Example 3: Matrix Generator (Multi-Dimension Templating)

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: fleet-multi-region-workloads
spec:
  generators:
    - matrix:
        generators:
          - list:
              elements:
                - tier: canary
                  cluster: plant-001-line-a
                - tier: production
                  cluster: plant-001-line-b
                - tier: production
                  cluster: plant-002-assembly
          - list:
              elements:
                - workload: monitoring
                - workload: networking
                - workload: storage
  template:
    metadata:
      name: '{{cluster}}-{{workload}}'
    spec:
      project: default
      source:
        repoURL: https://git.company.local/industrial-fleet
        targetRevision: main
        path: 'workloads/{{workload}}/{{tier}}'
      destination:
        server: 'https://{{cluster}}.internal:6443'
        namespace: '{{workload}}'

This generates nine child Applications automatically (3 clusters × 3 workloads), each pairing a cluster tier with a workload type.


Flux Multi-Tenancy & Kustomization Layers

Flux takes the opposite approach: each edge cluster runs its own reconciliation loop. No hub dependency.

Flux GitRepository Source

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-manifests
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.company.local/industrial-fleet.git
  ref:
    branch: main
  suspend: false

Kustomization with Multi-Layer Customization

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: core-services
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: fleet-manifests
  path: ./fleet/core-services
  prune: false  # Regulatory: no auto-deletion
  wait: true
  timeout: 5m
  postBuild:
    substitute:
      CLUSTER_NAME: ${CLUSTER_NAME}
      REGION: ${REGION}
      TIER: ${TIER}
    substituteFrom:
      - kind: ConfigMap
        name: cluster-config
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: core-dns
      namespace: kube-system

Helm Release with Dependency Management

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: industrial-edge-monitoring
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: '57.0.0'
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  values:
    prometheus:
      prometheusSpec:
        retention: 72h
        resources:
          requests:
            memory: 512Mi
            cpu: 250m
  dependsOn:
    - name: storage-class
      namespace: flux-system

Handling Disconnected Sites: Mirror Registries & Manifest Snapshots

The moment a site loses internet, your GitOps sync stops. For industrial fleets, this is unacceptable.

Pattern 1: Local Image Mirror (Harbor or Artifactory)

apiVersion: v1
kind: Secret
metadata:
  name: registry-mirror
  namespace: flux-system
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {
      "auths": {
        "local-harbor.plant-001.internal:5000": {
          "auth": "base64-encoded-credentials"
        }
      }
    }
---
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: edge-apps
  namespace: flux-system
spec:
  image: local-harbor.plant-001.internal:5000/industrial/workload
  interval: 5m
  secretRef:
    name: registry-mirror

Deployment workflow:
1. Hub builds images → pushes to central registry
2. Regional sync node mirrors images to local harbors (scheduled, off-peak)
3. Edge clusters pull only from local mirrors (no outbound registry calls)
4. If internet drops → edge clusters still have cached images for 30–90 days
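Step 2 can be sketched as a CronJob on the regional sync node running skopeo sync. Registry hosts, repository names, and the schedule are assumptions, and credential/TLS flags are omitted:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mirror-industrial-workload
  namespace: flux-system
spec:
  schedule: "0 2 * * *"   # off-peak, local time
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: skopeo
              image: quay.io/skopeo/stable
              args:
                - sync
                - --src=docker
                - --dest=docker
                # copies all tags of the source repository
                - central-registry.company.local/industrial/workload
                # into the destination registry namespace
                - local-harbor.plant-001.internal:5000/industrial
```

skopeo only transfers layers the destination is missing, which matters on metered links.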

Pattern 2: Manifest Snapshots with Git-Based Caching

# On hub: daily snapshot committed to a snapshots branch
git clone https://git.company.local/industrial-fleet.git
cd industrial-fleet
mkdir -p snapshots
kustomize build fleet/core-services > snapshots/core-services.yaml
helm template monitoring prometheus-community/kube-prometheus-stack \
  --values fleet/monitoring/values.yaml > snapshots/monitoring.yaml

git add snapshots/
git commit -m "Daily manifest snapshot $(date +%Y-%m-%d)"
git push origin HEAD:offline-snapshots

On disconnected edge clusters:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-manifests-offline
spec:
  interval: 24h
  url: https://git.company.local/industrial-fleet.git
  ref:
    branch: offline-snapshots

If the cluster loses its connection, Flux keeps reconciling from the last fetched snapshot artifact; the 24-hour interval only controls how often it retries the remote.

Pattern 3: Sneakernet & USB Sync (Extreme Disconnection)

For truly isolated sites (satellite downlink, no local internet):

# Hub prepares USB drive
mkdir -p /mnt/usb/fleet-sync-$(date +%Y%m%d)
cp -r snapshots/manifests/* /mnt/usb/fleet-sync-$(date +%Y%m%d)/
docker save local-harbor.plant-001.internal:5000/industrial/workload \
  | gzip > /mnt/usb/fleet-sync-$(date +%Y%m%d)/images.tar.gz

# Field engineer plugs in USB at site
cd /mnt/usb/fleet-sync-20260429
docker load < images.tar.gz
kubectl apply -f manifests/

Not glamorous, but essential for remote energy sites and offshore platforms.
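One small addition worth making to the USB workflow is a checksum manifest, so the field engineer can detect a corrupted or tampered payload before running kubectl apply. A sketch, with a throwaway directory standing in for the USB mount:

```shell
# Simulate the USB payload in a throwaway directory (stands in for /mnt/usb)
usb=/tmp/fleet-usb-demo
rm -rf "$usb" && mkdir -p "$usb/manifests"
echo "kind: ConfigMap" > "$usb/manifests/demo.yaml"   # stand-in payload

# At the hub: record checksums alongside the payload
( cd "$usb" && find manifests -type f -exec sha256sum {} + > SHA256SUMS )

# At the site: verify before applying; a non-zero exit means a bad payload
( cd "$usb" && sha256sum -c SHA256SUMS )   # prints "manifests/demo.yaml: OK"
```

For tamper evidence (not just corruption), sign the SHA256SUMS file at the hub and verify the signature at the site as well.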


Sync Waves: Plant-by-Plant Rollout with Approval Gates

You cannot afford a misconfiguration affecting all 1,000 clusters simultaneously. Use sync waves to stage rollouts:


ArgoCD Sync Waves

---
# Wave 1: Canary plant (single-line validation)
apiVersion: v1
kind: ConfigMap
metadata:
  name: core-config-v2
  annotations:
    argocd.argoproj.io/sync-wave: "1"
data:
  version: "2.0"
---
# Wave 3: Regional production (5 plants)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-workload
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: workload
        image: local-harbor.plant-001.internal:5000/industrial/workload:v2.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
---
# Wave 5: Full fleet rollout
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fleet-workload-full
  annotations:
    argocd.argoproj.io/sync-wave: "5"
spec:
  replicas: 995

Note that sync waves order resources within a single Application; staging the same change across plants combines waves with per-tier ApplicationSets and manual approval gates. Execution timeline:

Wave 1 (T+0):   Canary plant (1 cluster)
Wave 2 (T+24h): Manual approval check
Wave 3 (T+24h): 5 regional production plants
Wave 4 (T+48h): Manual approval check
Wave 5 (T+48h): Remaining 994 clusters (full fleet)
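The manual approval checks at waves 2 and 4 can be enforced rather than remembered: an ArgoCD AppProject sync window blocks automated syncs outside an allowed slot, so advancing the rollout requires a deliberate manual sync. A sketch (project name, schedule, and app pattern are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production-fleet
  namespace: argocd
spec:
  sourceRepos:
    - https://git.company.local/industrial-fleet
  destinations:
    - server: '*'
      namespace: '*'
  syncWindows:
    - kind: allow
      schedule: '0 2 * * *'   # nightly 2 AM window
      duration: 4h
      applications:
        - 'plant-*'
      manualSync: true        # operators may still sync manually outside the window
```

Automated syncs only proceed inside the window; everything else leaves an explicit, attributable manual action in the audit log.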

Rollback Patterns: Git Revert vs ArgoCD Rollback

Industrial deployments demand rollback speed. When something goes wrong at 2 AM, you can’t wait 30 minutes.

Fast Rollback: ArgoCD Sync to Previous Revision

argocd app history plant-001-core

# Output:
# ID  DATE                  REVISION
# 10  2026-04-29 13:00:00Z  def5678
# 11  2026-04-29 14:30:00Z  abc1234

# Roll back by history ID, not git SHA. Disable auto-sync first,
# or ArgoCD will immediately re-sync back to git HEAD.
argocd app rollback plant-001-core 10

Time to recover: ~30 seconds.

Persistent Rollback: Git Revert + Commit

git log --oneline | head -5
git revert abc1234 --no-edit
git push origin main

# ArgoCD detects new commit, syncs within 1m

Policy Gates: OPA Gatekeeper & Kyverno Policies

OPA Gatekeeper: Require Resource Requests

Gatekeeper packages Rego in a ConstraintTemplate; a constraint instance then applies it to the resources you choose:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresources
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResources
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresources

        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          not container.resources.requests
          msg := sprintf("container %v must define resource requests", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-resources
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    excludedNamespaces: ["kube-system"]

Capacity Planning: Sizing Your Hub and Edge Agents

ArgoCD Hub Sizing (1,000 ApplicationSet-managed clusters):
– CPU: 2–4 cores (the application controller reconciles ~333 apps/minute at the default 3-minute resync)
– Memory: 2–4 GB (API server, repo server, Redis cache)
– Persistent Storage: 100 GB (application history, logs, metrics)
– Network: 1 Gbps minimum (webhook payloads, metric exports)

Flux Agent Sizing (per edge cluster, minimal):
– CPU: 100m per controller (source-controller, kustomization-controller, helm-controller)
– Memory: 256 MB per controller
– Persistent Storage: 10 GB (git cache, image cache)

Flux is lightweight enough to run on edge clusters with 2 cores and 4 GB RAM.
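Those limits can be pinned at bootstrap time with a patch in the flux-system kustomization.yaml, using Flux's standard patching mechanism (the resource numbers are the assumptions listed above):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Apply the same resource budget to every Flux controller
  - target:
      kind: Deployment
      labelSelector: app.kubernetes.io/part-of=flux
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all   # placeholder; the labelSelector picks the real targets
      spec:
        template:
          spec:
            containers:
              - name: manager
                resources:
                  requests:
                    cpu: 100m
                    memory: 256Mi
                  limits:
                    cpu: 250m
                    memory: 512Mi
```

Because the patch lives in git alongside gotk-components.yaml, the limits are themselves GitOps-managed and survive Flux upgrades.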

Git Repository Performance:
With 1,000 clusters polling every 5 minutes:
– 200 polling requests/minute
– ~1 GB/month of manifest diffs

Solutions: GitHub Enterprise ($231/month), Gitea with replication, GitLab HA (~$2,000/year).


Security Model: Signed Commits, Image Signatures, and RBAC

Git Commit Signing (sigstore):

# One-time setup: route git signing through gitsign (sigstore keyless signing)
git config --global commit.gpgsign true
git config --global gpg.format x509
git config --global gpg.x509.program gitsign

git commit -m "Deploy fleet workload v2.0"   # signed with your OIDC identity
git push origin main

Image Signature Verification (sigstore cosign):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-images
spec:
  validationFailureAction: enforce
  rules:
    - name: check-signature
      match:
        resources:
          kinds:
            - Pod
      verifyImages:
        - imageReferences:
            - 'myregistry/industrial-*'
          cosign:
            key: |-
              -----BEGIN PUBLIC KEY-----
              MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
              -----END PUBLIC KEY-----

RBAC: Per-Cluster Service Accounts

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-plant-001-line-a
  namespace: argocd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argocd-plant-001-line-a
rules:
  - apiGroups: ['apps']
    resources: ['deployments', 'statefulsets']
    verbs: ['get', 'list', 'patch', 'update']
  # Intentionally restrictive: no wildcard, no deletions

Drift Detection at Scale

ArgoCD Drift Metrics:

# Applications out of sync
count(argocd_app_info{sync_status="OutOfSync"})

# Sync success rate per destination cluster (last hour)
sum by (dest_server) (increase(argocd_app_sync_total{phase="Succeeded"}[1h]))
  / sum by (dest_server) (increase(argocd_app_sync_total[1h]))

# Applications stuck out of sync (pair with a `for:` duration in the alert;
# ArgoCD exposes no direct "time since last sync" metric)
argocd_app_info{sync_status="OutOfSync"} == 1

Alerting Strategy:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fleet-drift-alerts
spec:
  groups:
    - name: gitops-fleet
      rules:
        - alert: ClusterOutOfSyncLong
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 1h
          annotations:
            summary: "Application out of sync > 1 hour"

        - alert: SyncFailureRate
          expr: |
            (increase(argocd_app_sync_total{phase="Failed"}[1h])
              / increase(argocd_app_sync_total[1h])) > 0.1
          for: 10m
          annotations:
            summary: "Application has > 10% sync failures"

Observability: Fleet Health Dashboard

apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-health-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Industrial Fleet GitOps Health",
        "panels": [
          {
            "title": "Applications In Sync",
            "targets": [{"expr": "count(argocd_app_info{sync_status=\"Synced\"})"}]
          },
          {
            "title": "Applications Out of Sync",
            "targets": [{"expr": "count(argocd_app_info{sync_status=\"OutOfSync\"})"}]
          },
          {
            "title": "Failed Syncs (Last 24h)",
            "targets": [{"expr": "sum(increase(argocd_app_sync_total{phase=\"Failed\"}[24h]))"}]
          }
        ]
      }
    }

Production Checklist: Going Live with 1,000+ Clusters

Infrastructure & Networking
– [ ] Hub control plane: HA (3+ replicas), persistent storage (etcd + PVC)
– [ ] Flux agents: resource limits enforced (CPU: 250m, Mem: 512Mi)
– [ ] Image mirrors: deployed in all regions, sync jobs scheduled (off-peak)
– [ ] Git repo: redundancy (GitHub Enterprise, GitLab HA, Gitea with replication)
– [ ] Network policies: ingress to hub restricted to known clusters only

Compliance & Audit
– [ ] Git commits signed (sigstore): all manifests signed, keys rotated quarterly
– [ ] Audit logging: all syncs logged to Loki, retention = 1 year (regulatory)
– [ ] OPA Gatekeeper policies: deployed, tested, enforced (no permissive mode)
– [ ] RBAC: ArgoCD service accounts per cluster, no wildcard permissions
– [ ] Backup: ApplicationSet snapshots, Git repo backups (daily), etcd backups (hourly)

Operational Readiness
– [ ] Runbooks: rollback procedure (doc + tested), incident communication plan
– [ ] Observability: Prometheus + Loki + Grafana dashboards live, alerts configured
– [ ] Disaster recovery: RTO ≤ 1 hour, RPO ≤ 15 min, tested quarterly
– [ ] Canary validation: 5-plant canary sync tested, approval gates documented
– [ ] Disconnected site testing: local mirror failover tested, sneakernet tested

Tooling & Automation
– [ ] GitOps CLI: argocd or flux CLI integrated into release pipeline
– [ ] CI/CD gates: all manifests linted (Kube-linter, Kubescape), no merges without passing checks
– [ ] Image scanning: all container images scanned for CVEs before registry push
– [ ] Helm dependency updates: automated Renovate/Dependabot PRs, reviewed weekly


Real-World Deployment Example: 300-Cluster Retail Fleet

Let’s walk through a concrete example: deploying a payment processing update to 300 retail stores across North America.

Week 1: Preparation

  1. Create feature branch payment-v2.1 in git
  2. Update manifests: modify Deployment image, bump payment-processing to v2.1
  3. Push to git (requires signed commit + peer review)
  4. Automated CI: run Kube-linter, Kubescape, image scanning; all must pass
  5. Create ApplicationSet for canary:
     - cluster: test-store-001-nyc
       tier: canary

Week 2: Canary Deployment

  1. Engineer creates PR to merge payment-v2.1 into main
  2. ArgoCD automatically previews ApplicationSet changes in dry-run mode
  3. After approval, merge to main
  4. ArgoCD syncs payment-v2.1-core app to test-store-001-nyc (canary)
  5. Team monitors for 48 hours:
    – Payment transaction volume: normal
    – Error logs: zero payment failures
    – CPU/Memory: no degradation
    – Logs for “card declined” errors: baseline

Week 3: Regional Rollout (Wave 3)

  1. If canary passes, create 2nd ApplicationSet:
     - cluster: [store-002-boston, store-003-philadelphia, ...]
       tier: production-wave-1
  2. Merge to main; ArgoCD syncs to 50 stores
  3. Monitor metrics dashboard for regional aggregate:
    – Payment success rate should remain > 99.9%
    – No spike in customer complaints
  4. After 24h validation, proceed to Wave 5

Week 4: Full Fleet (Wave 5)

  1. Final ApplicationSet:
     - cluster: [all remaining 250 stores]
       tier: production-full
  2. Merge; ArgoCD syncs payment-v2.1 to all 300 stores over 4-hour window
  3. Post-deployment validation:
    – Smoke tests: every store can process a test transaction
    – Metrics: payment volume smooth across all stores
    – Support tickets: monitor Slack for escalations

Rollback (if needed)

If a bug surfaces:

  1. Emergency hotfix: revert git commit
     git revert abc1234 --no-edit
     git push origin main
  2. ArgoCD detects new commit, syncs rollback to all clusters within 1 minute
  3. Payment service automatically rolls back to v2.0 cluster-wide
  4. Post-incident: review why canary missed the bug

This entire process, from canary to full-fleet rollout, takes 1 week with zero downtime. Manual rollout to 300 stores? That would take months and risk human error at every site.


Common Industrial GitOps Failure Modes & Fixes

Scenario 1: Hub Loses Access to Edge Cluster

Symptoms: ArgoCD shows Application status “Unknown”, sync stuck for > 2 hours.

Root causes:
– Firewall rule deleted by security team
– Network outage affecting WAN link to remote plant
– TLS certificate rotation failed; webhook authentication rejected

Mitigation:
1. Deploy Flux agents on all edges; they survive hub downtime
2. Configure NetworkPolicy to whitelist only necessary flows
3. Monitor hub disk space; alert at 80% etcd capacity

Scenario 2: Git Repository Becomes Unavailable

Symptoms: All edge clusters show “GitRepository.Source ReconcileError” after 5 minutes.

Root causes:
– GitHub/GitLab experiences regional outage
– Gitea storage drive failed
– Manifest file accidentally deleted from git

Mitigation:
1. Deploy git mirroring: every edge region caches manifests every 2 hours
2. Rely on Flux's artifact cache: when the GitRepository source fails, Kustomizations keep reconciling from the last fetched revision (reserve suspend: true for planned maintenance)
3. Implement manifest validation in pre-commit hooks

Scenario 3: Image Registry Mirror Fills Up (Disk Full)

Symptoms: Cluster workloads can’t pull images; pods sit in Pending or ImagePullBackOff.

Root causes:
– Mirror retention policy too aggressive
– New workload requires 10 GB image, mirror only has 20 GB available

Mitigation:

# Configure Harbor/Nexus cleanup policies
# Retention: keep only 5 most recent image tags
# Cleanup: run daily at 2 AM, delete unused images older than 30 days

# Monitor disk usage
docker exec harbor-core curl -s http://localhost/api/v2.0/systeminfo/volumes | jq '.storage'

# Alert: trigger at 85% capacity
# Auto-cleanup: trigger at 90% capacity

Scenario 4: OT Team Manually Changes Cluster (Drift)

Symptoms: Operator logs into cluster and runs kubectl patch deployment workload --patch='{"spec":{"replicas":1}}' to temporarily reduce load during peak production hours. GitOps shows Application is “OutOfSync”.

Problem: Auto-sync would immediately revert the change, restarting production.

Solution:
1. Disable automated sync (syncPolicy.automated) for the production tier, so drift correction requires an explicit manual sync
2. Set up Slack webhook to alert DevOps team when drift detected
3. Document a 4-hour “drift window” where OT team can make manual changes
4. Implement approval gates: OT supervisor must approve manual changes in ArgoCD UI before auto-sync
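For the replica patch specifically, ArgoCD can be told to tolerate that one field instead of flagging drift at all, using ignoreDifferences on the Application. A sketch (app name illustrative; source/destination fields omitted):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: plant-001-core
  namespace: argocd
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas   # OT may scale manually without triggering OutOfSync
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true  # also leave the field untouched during sync
```

This narrows the tolerated drift to a single, documented field; all other manual changes still surface as OutOfSync and flow through the approval gates above.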


FAQ

Q: Should I use ArgoCD or Flux for my 500 industrial clusters?
A: If you need centralized policy enforcement and audit trails → ArgoCD. If you prioritize cluster autonomy and hub independence → Flux. For regulatory environments, start with ArgoCD hub + Flux edge for resilience.

Q: How do I handle image updates across 1,000 clusters without overwhelming my registry?
A: Use regional mirrors with staggered sync jobs. Hub pushes to central registry once daily. Regional sync nodes pull and mirror during off-peak hours (e.g., 2–4 AM local time). Edge clusters always pull from local mirrors, never the internet.

Q: Can I roll back a bad deployment in under 30 seconds?
A: Yes. Use argocd app rollback <app> <history-id> (ArgoCD) or pin the Flux GitRepository ref to the last known-good commit. Both achieve roughly a 30-second RTO.
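On the Flux side, the fastest rollback is pinning the GitRepository to the last known-good commit; the SHA below reuses the illustrative history from the rollback section:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: fleet-manifests
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.company.local/industrial-fleet.git
  ref:
    commit: def5678   # last known-good revision (illustrative; use the full SHA)
```

Remove the commit pin (restoring ref.branch) once the fix has landed on main, or the cluster will stay frozen at that revision.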

Q: How do I enforce that only signed images run on production?
A: Deploy Kyverno ClusterPolicy with sigstore verification. Every pod creation is denied unless the image is signed with an approved key.

Q: What’s the compliance story for GitOps?
A: Every sync is a git commit (immutable audit log). Every approval gate is documented. Use OPA Gatekeeper + sigstore for policy enforcement. Log all syncs to Loki with 1-year retention. This satisfies ISO 27001, IEC 62443, and OT security frameworks.


Helm + Kustomize Layering for Industrial Deployments

Raw YAML becomes unmaintainable at 1,000 clusters. Use Helm for templating plus Kustomize for per-cluster overlays to achieve scale without duplicating manifests for every site.
