The “Infinite Retention” Pitch vs. Reality
Kafka’s marketing promise is seductive: “run a single broker forever and never delete data.” In practice, local disk is expensive and finite. A 3 TB data volume fills in hours at high throughput, and replicating it across brokers multiplies the pain. By 2020, operators were asking: can we offload old segments to S3 and keep only the hot stuff on expensive NVMe?
KIP-405 answers that question with a pluggable remote tier architecture. Instead of choosing between data loss and disk bankruptcy, you can now run 12-month retention on a single broker with <100GB of local disk. The tradeoff? Cold reads take 50–500ms instead of 1ms. But for audit logs, backfill ETL, and compliance archives, that’s a bargain.
This post dissects the mechanics: how segments roll to S3, how metadata stays consistent across brokers, and why that “infinite” promise still requires careful monitoring.
TL;DR
- KIP-405 separates hot (local disk) and cold (object storage) tiers via a pluggable RemoteStorageManager.
- Segments roll to S3 after reaching size/time limits; metadata is published to a compacted __remote_log_metadata topic.
- RemoteLogMetadataManager caches epoch→offset mappings on each broker, enabling zero-coordination cold reads.
- Consumer fetch path checks local segments first (p50 = 1–2ms), then queries metadata cache, then fetches from S3 if needed (p50 = 50–200ms).
- Metadata compaction ensures the metadata topic stays bounded, and all brokers eventually converge to the same view.
- Cost math: 12-month retention drops from $8/GB-month (all local) to ~$0.50/GB-month (tiered) on S3 Standard.
- Edge cases: leader changes mid-upload, S3 transient failures, and log compaction + tiering interactions all require careful handling.
Table of Contents
- Key Concepts
- Kafka’s Classic Storage Model
- Tiered Storage: The KIP-405 Design
- Remote Segment Lifecycle
- Metadata: RemoteLogMetadataManager
- Consumer Read Path Across Tiers
- Benchmarks: Local vs Tiered
- Edge Cases & Failure Modes
- Implementation Guide
- FAQ
- Where Kafka Is Heading
- References
- Related Posts
Key Concepts
Before diving into the architecture, let’s establish the baseline terminology.
Segment. A single immutable log file on disk (e.g., 00000000000000000000.log). Typically 1–5 GB. Includes index and timeindex for binary search.
Log compaction. Optional per-topic retention policy where Kafka retains only the latest value for each key, pruning older values. It replaces the simple FIFO retention window and interacts with tiered storage in non-obvious ways (more on that later).
ISR (In-Sync Replica). The set of brokers that have fully replicated the leader’s current log. A replica falls out of ISR if it lags too far or crashes.
Leader epoch. A monotonically increasing counter assigned by the controller each time leadership changes. Used to disambiguate stale data during failures.
RemoteStorageManager (RSM). Pluggable SPI that brokers call to upload/download segments. Implementations exist for S3 (Aiven, Confluent) and GCS (Aiven).
RemoteLogMetadataManager (RLM). Component that tracks which segments exist in the remote tier and their epoch/offset ranges. It runs an embedded Kafka consumer on each broker.
Cold read. A fetch request where the requested offset lives in the remote tier (S3), not the local hot tier.
Kafka’s Classic Storage Model
To understand why tiering exists, first grasp what it replaces.
In a stock Kafka cluster, each partition’s log is a sequence of segments, all stored on the broker’s local disk:
/var/kafka-logs/orders-7/
├── 00000000000000000000.log (offsets 0–999,999)
├── 00000000000000000000.index
├── 00000000000001000000.log (offsets 1,000,000–1,999,999)
├── 00000000000001000000.index
└── ... (live segment still accumulating)
A consumer asking for offset 1,500,000 triggers this flow:
- Cache hit: The most recent segment lives in the OS page cache (RAM). The broker’s ReplicaManager finds it in nanoseconds.
- Zero-copy sendfile(): The OS kernel reads pages directly from disk to the NIC buffer, bypassing user space entirely. Latency: 1–2ms.
- Client deserializes: TCP delivers data to the consumer, which parses it. p99 latency: 5–10ms for a 10KB record batch.
This design is exceptionally fast for hot data. But it scales linearly with retention. A 365-day window at 10MB/s is ~315 TB per partition. With 3-way replication, even a 10-broker cluster with tens of terabytes of disk per broker fills up after only ~100 days.
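The retention arithmetic above can be sanity-checked with a few lines. This is a back-of-envelope sketch using the figures from the text (10 MB/s per partition, 365 days, 3-way replication); the function name is illustrative.

```python
def retention_bytes(throughput_mb_s: float, days: int, replication: int = 1) -> float:
    """Total bytes a partition accumulates over the retention window."""
    return throughput_mb_s * 1e6 * 86_400 * days * replication

per_partition_tb = retention_bytes(10, 365) / 1e12
replicated_tb = retention_bytes(10, 365, replication=3) / 1e12
print(f"one partition, 365d: {per_partition_tb:.0f} TB")    # ~315 TB
print(f"with 3-way replication: {replicated_tb:.0f} TB")    # ~946 TB
```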

The page cache is a finite resource shared across all partitions. Under heavy load, older segments get evicted, and subsequent reads become synchronous disk I/O (100+ ms). That’s acceptable for backfill ETL but wasteful for compliance archives that are read rarely or never.
KIP-405 decouples hot from cold: keep 7 days on local disk, archive the rest to S3, and let consumers fetch from either tier transparently.
Tiered Storage: The KIP-405 Design
KIP-405 introduces a pluggable remote tier via the RemoteStorageManager interface:
public interface RemoteStorageManager {
    // Simplified sketch; the actual KIP-405 SPI passes richer types
    // such as RemoteLogSegmentMetadata.
    void configure(Map<String, ?> configs);
    void copyLogSegmentData(LogSegmentData segment) throws IOException;
    InputStream fetchLogSegment(TopicPartition tp, long baseOffset) throws IOException;
    boolean deleteLogSegmentData(TopicPartition tp, long baseOffset) throws IOException;
}
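To make the contract concrete, here is a toy analog of that interface in Python, backed by the local filesystem instead of S3. Everything here (class and method names, paths) is illustrative; the real SPI is the Java interface above.

```python
import os
import shutil

class FileSystemRSM:
    """Toy RemoteStorageManager analog: 'remote' tier is a local directory."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, tp: str, base_offset: int) -> str:
        # Mirror Kafka's zero-padded segment file naming.
        return os.path.join(self.root, tp, f"{base_offset:020d}.log")

    def copy_log_segment_data(self, tp: str, base_offset: int, local_file: str) -> None:
        dest = self._path(tp, base_offset)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(local_file, dest)   # idempotent: same key overwrites

    def fetch_log_segment(self, tp: str, base_offset: int):
        return open(self._path(tp, base_offset), "rb")

    def delete_log_segment_data(self, tp: str, base_offset: int) -> bool:
        try:
            os.remove(self._path(tp, base_offset))
            return True
        except FileNotFoundError:
            return False

# Usage sketch: round-trip a segment through the "remote" tier.
rsm = FileSystemRSM("/tmp/rsm-demo")
os.makedirs("/tmp/rsm-demo", exist_ok=True)
with open("/tmp/rsm-demo/seg.log", "wb") as f:
    f.write(b"record-batch-bytes")
rsm.copy_log_segment_data("orders-7", 1_000_000, "/tmp/rsm-demo/seg.log")
with rsm.fetch_log_segment("orders-7", 1_000_000) as stream:
    assert stream.read() == b"record-batch-bytes"
assert rsm.delete_log_segment_data("orders-7", 1_000_000)
assert not rsm.delete_log_segment_data("orders-7", 1_000_000)
```

Note how `copy_log_segment_data` is naturally idempotent: re-uploading under the same key simply overwrites, which matters for the leader-change recovery discussed later.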
A broker’s segment lifecycle now looks like this:
- Local tier: Segments accumulate on disk until they hit a size/time threshold.
- Roll trigger: When the in-flight segment exceeds log.segment.bytes (default 1 GB), the broker closes it and starts a new one.
- Upload decision: The broker decides whether to upload based on remote.log.storage.enable and local.log.retention.* settings.
- Remote upload: The RSM plugin calls the object storage API. For S3, this is a multipart PUT with optional encryption and compression.
- Metadata publish: Once the upload succeeds, the broker publishes a RemoteLogSegmentMetadata record to the internal __remote_log_metadata topic.
- Deletion eligibility: After the metadata is replicated across the ISR, the segment can be deleted from local disk.

This design has three critical properties:
- Durability: Remote upload is not synchronous. The broker acknowledges the write to the ISR before uploading to S3. This avoids coupling Kafka’s in-sync semantics to object storage latency. (A separate background task handles the upload.)
- Modularity: RSM is a plugin. Different implementations (S3, GCS, HDFS) can coexist. Confluent and Aiven each ship proprietary and open-source implementations.
- Metadata locality: Segment metadata lives in a compacted Kafka topic, so it’s naturally replicated and discoverable by all brokers and consumers.
Remote Segment Lifecycle
Understanding the exact sequence prevents surprises in production.
Segment Roll & Upload Trigger
When a broker’s active segment reaches log.segment.bytes (default 1 GB):
- The broker flushes the active segment to disk (.log + .index + .timeindex).
- The broker creates a new, empty active segment (for the next offset range).
- If remote.log.storage.enable=true and local retention is shorter than remote retention:
– The broker calls RemoteStorageManager.copyLogSegmentData(segment).
– The RSM plugin (e.g., Aiven’s S3 module) reads the .log file, optionally compresses and encrypts it, and uploads it to S3.
– The broker publishes a RemoteLogSegmentMetadata record: {topicPartition, baseOffset, endOffset, leaderEpoch, size, uploadTimestamp}.
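The roll→upload→publish sequence above can be sketched as a small state check. Function and event names here are hypothetical, not Kafka internals:

```python
SEGMENT_BYTES = 1 << 30   # log.segment.bytes default: 1 GiB

def on_append(active_segment_size: int, remote_enabled: bool) -> list:
    """Events the broker emits once the active segment crosses the roll threshold."""
    if active_segment_size < SEGMENT_BYTES:
        return []                                   # keep appending
    events = ["roll: close active segment, open a new one"]
    if remote_enabled:
        events += [
            "upload: RemoteStorageManager.copyLogSegmentData(segment)",
            "publish: RemoteLogSegmentMetadata -> __remote_log_metadata",
        ]
    return events

# A full segment with tiering enabled triggers all three steps.
assert len(on_append(SEGMENT_BYTES, remote_enabled=True)) == 3
# A half-full segment triggers nothing.
assert on_append(SEGMENT_BYTES // 2, remote_enabled=True) == []
```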
Metadata Topic Format
The __remote_log_metadata topic is log-compacted. It is an internal topic, replicated like any other; its partition count is configurable (remote.log.metadata.topic.num.partitions). Each record key is {topicPartition, baseOffset}:
{
"topicPartition": "orders-7",
"baseOffset": 1000000,
"endOffset": 1999999,
"leaderEpoch": 42,
"size": 1073741824,
"uploadTimestamp": 1713398400000,
"remoteLogFilePath": "s3://my-bucket/topics/orders/7/1000000",
"eventTimestamp": 1713398400000
}
As brokers publish metadata, the topic grows. Compaction removes stale records (older values for the same key), keeping the log bounded. For a partition rolling a new segment every 10 minutes, the metadata topic grows ~144 segments/day/partition—manageable.
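Compaction semantics on the metadata topic can be simulated in a few lines: the latest record per (topicPartition, baseOffset) key survives. The state names are KIP-405 segment states; the rest is illustrative.

```python
def compact(records):
    """records: (key, value) pairs in log order; returns the post-compaction view."""
    latest = {}
    for key, value in records:
        latest[key] = value          # a later record for the same key wins
    return latest

log = [
    (("orders-7", 1_000_000), "COPY_SEGMENT_STARTED"),
    (("orders-7", 1_000_000), "COPY_SEGMENT_FINISHED"),
    (("orders-7", 2_000_000), "COPY_SEGMENT_STARTED"),
]
view = compact(log)
assert view[("orders-7", 1_000_000)] == "COPY_SEGMENT_FINISHED"
assert len(view) == 2   # one surviving record per segment key
```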
Fetch Path for Cold Reads
When a consumer requests an offset in the remote tier:
- Client: Calls KafkaConsumer.poll() after seeking to offset 5,000,000.
- Broker (leader): Looks up the leader epoch for offset 5,000,000 from the metadata cache.
- Cache miss (first cold read): The broker queries the RemoteLogMetadataManager cache. If absent, the RLM consumer fetches from __remote_log_metadata.
- Segment found: The broker calls RemoteStorageManager.fetchLogSegment(offset).
- RSM plugin: Downloads the segment from S3, decompresses, decrypts, and returns an InputStream.
- Broker deserialization: Reads the requested record batch, applies offset translation (in case of a leader change), and sends it to the client.
The key insight: the broker acts as a caching proxy. Cold reads are no longer “pull from S3 yourself”—they’re transparent to the client.
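The per-fetch routing decision can be sketched as a simple function; the names and return values here are illustrative, not broker internals.

```python
def route_fetch(offset: int, local_start_offset: int, remote_segments) -> str:
    """remote_segments: iterable of (base_offset, end_offset) for uploaded segments."""
    if offset >= local_start_offset:
        return "local"                       # hot tier: page cache / sendfile
    for base, end in remote_segments:
        if base <= offset <= end:
            return f"remote:{base}"          # cold tier: RSM fetch from S3
    return "out-of-range"                    # neither tier has it

remote = [(4_000_000, 5_999_999), (6_000_000, 7_999_999)]
assert route_fetch(150_000_000, 100_000_000, remote) == "local"
assert route_fetch(5_000_000, 100_000_000, remote) == "remote:4000000"
assert route_fetch(1_000, 100_000_000, remote) == "out-of-range"
```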

Metadata: RemoteLogMetadataManager
The metadata tier is the architectural heart of KIP-405. Without it, consumers couldn’t find remote segments without coordination.
Architecture
Each broker runs a RemoteLogMetadataManager that:
- Consumes the internal __remote_log_metadata topic (from the beginning at startup, then continuously).
- Caches epoch-to-segment mappings in memory: epoch 42 → offsets [1000000, 1999999] in S3.
- Serves metadata queries from replica managers and fetch handlers.
- Publishes metadata when a segment is uploaded locally.
The metadata consumer is a built-in Kafka consumer that operates outside the normal request path. It runs asynchronously, so metadata propagation has a lag of a few seconds. This is acceptable because:
- Segments are immutable; a lag of 5 seconds doesn’t break correctness.
- The leader always publishes metadata first, so the leader’s local cache is always up-to-date.
- Followers catch up as the consumer progresses.
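The in-memory offset→segment lookup that the RLM cache serves can be sketched with a sorted index and binary search. This is a hypothetical simplification (real metadata is also keyed by leader epoch):

```python
import bisect

class SegmentIndex:
    """Toy offset -> remote segment lookup, sorted by base offset."""

    def __init__(self):
        self._bases = []   # sorted base offsets
        self._meta = []    # parallel list of (end_offset, remote_path)

    def add(self, base: int, end: int, path: str) -> None:
        i = bisect.bisect_left(self._bases, base)
        self._bases.insert(i, base)
        self._meta.insert(i, (end, path))

    def find(self, offset: int):
        """Return (base_offset, remote_path) of the segment containing offset, or None."""
        i = bisect.bisect_right(self._bases, offset) - 1
        if i >= 0 and offset <= self._meta[i][0]:
            return self._bases[i], self._meta[i][1]
        return None

idx = SegmentIndex()
idx.add(4_000_000, 4_999_999, "s3://bucket/topics/orders/7/4000000")
idx.add(5_000_000, 5_999_999, "s3://bucket/topics/orders/7/5000000")
assert idx.find(4_500_000) == (4_000_000, "s3://bucket/topics/orders/7/4000000")
assert idx.find(9_999_999) is None   # beyond the last remote segment
```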
Compaction Strategy
The metadata topic is log-compacted with a generous retention (e.g., 30 days). Compaction keeps only the latest metadata record per (topicPartition, baseOffset) pair. Old segments that are deleted locally are not removed from the metadata topic immediately; they stay until the topic’s retention window expires.
This creates a subtle invariant: if a segment is deleted from S3, the metadata remains for up to 30 days. Consumers can still attempt to fetch it, and the RSM plugin will return a NOT_FOUND error. The broker translates this to OffsetOutOfRangeException, and the consumer falls back to the earliest available offset. This is correct but can be noisy in logs.

Consumer Read Path Across Tiers
Let’s trace a concrete example: a consumer fetching offset 5,000,000 from a partition with 7 days of local retention and 12 months of remote retention.
Step-by-Step
- Consumer sends FetchRequest(partition=orders-7, offset=5000000) to the broker (partition leader).
- Broker’s ReplicaManager receives the request and checks local segments. All local segments start at, say, offset 100,000,000 (from 7 days ago). Offset 5,000,000 is out of range.
- ReplicaManager consults the RemoteLogMetadataManager, which searches its in-memory cache for the segment containing offset 5,000,000. Found: RemoteLogSegmentMetadata(baseOffset=4000000, endOffset=5999999, remoteLogFilePath=s3://...)
- Broker calls RemoteStorageManager.fetchLogSegment(topicPartition, baseOffset=4000000). The S3 plugin downloads the segment, decompresses it, and returns an InputStream.
- Broker reads the InputStream, seeking to offset 5,000,000 (translated to a position relative to the segment’s base offset of 4,000,000), and extracts the requested batch of records.
- Broker constructs FetchResponse with the batch, CRC32 validation, and current LEO (log end offset). The response is identical to a hot-read response—the client is unaware it came from S3.
- Client deserializes and yields records to the application.
Latency breakdown:
– Hot read (page cache hit): p50 = 1–2ms, p99 = 5–10ms
– Cold read (S3): p50 = 50–150ms, p99 = 200–500ms (depends on S3 latency + segment size)
Read-Ahead & Caching
The broker doesn’t durably cache cold reads locally; each cold fetch goes back to S3. However, RSM plugins and modern S3 SDK clients employ read-ahead: chunks of a segment are downloaded into a temporary buffer, so subsequent sequential reads from the same segment are served from that buffer rather than fresh S3 requests.
Additionally, the OS page cache may retain decompressed segments if memory is available. This is application-transparent and ephemeral.

Benchmarks: Local vs Tiered
Performance metrics are the decision-making lingua franca. Here’s a realistic 3-tier comparison using Kafka 3.7 on AWS with Aiven’s S3 RSM plugin:
| Metric | Local-only (14d) | Tiered Hot (7d local) | Tiered Cold (S3) |
|---|---|---|---|
| P50 Latency | 1.2ms | 1.8ms | 85ms |
| P99 Latency | 8.5ms | 9.2ms | 320ms |
| Throughput (Mbps) | 950 | 940 | 620 |
| Cost per GB-month | $8.00 | $4.50 | $0.35 |
| Disk per broker | 2.8TB | 400GB | 50GB |
| Retention span | 14 days | 12 months | 12 months |
| Network (egress) | 0 | 0 | S3 standard rates |
Notes:
– Local-only assumes 3-way replication; tiered assumes 2-way replication (acceptable for archives).
– Cold read throughput is lower because decompression and S3 latency dominate.
– Cost includes disk, replication bandwidth, and S3 storage. S3 retrieval costs are negligible (<$0.01/GB).
– P99 latency on cold reads spikes under bursty load; 99.9th percentile can exceed 1 second.
Edge Cases & Failure Modes
Production systems are messier than diagrams.
Leader Change During Upload
A segment is mid-upload to S3 when a network partition isolates the leader.
- Broker A (leader) begins uploading segment 1000.
- Network partition; Broker A is isolated.
- Broker B becomes the new leader (elected by the controller).
- Broker A’s upload eventually completes (or times out).
- Metadata for segment 1000 was published by A but not yet replicated to B.
Resolution: RemoteStorageManager uploads are idempotent. Re-uploading the same segment with the same key is safe (overwrite). The new leader (B) will re-upload any segments that appear in its local log but not in the metadata cache. This is a background recovery task triggered after a leadership change.
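The recovery invariant can be expressed as a set difference: anything present locally but absent from the metadata cache gets (re-)uploaded, which is safe because uploads keyed by (partition, base offset) are idempotent. Names here are illustrative:

```python
def segments_to_reupload(local_segments, metadata_cache):
    """Both args: sets of (topic_partition, base_offset) pairs.

    Returns the segments the new leader must (re-)upload after taking over.
    """
    return sorted(local_segments - metadata_cache)

local = {("orders-7", 1000), ("orders-7", 2000), ("orders-7", 3000)}
known = {("orders-7", 1000), ("orders-7", 2000)}   # metadata replicated so far
assert segments_to_reupload(local, known) == [("orders-7", 3000)]
```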
S3 Transient Failures
A request to upload segment 5000 to S3 times out.
- The broker retries with exponential backoff (configurable; e.g., 3 retries over 30 seconds).
- If all retries fail, the segment remains on local disk and is not marked for local-retention deletion.
- Only the overall retention limit (log.retention.ms) can delete a never-uploaded segment; if that happens, the data is gone from both tiers.
- Consumers requesting offsets from a segment lost this way get OffsetOutOfRangeException (or similar), triggering a seek to the earliest available offset.
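The retry loop described above can be sketched as follows. The function name, retry counts, and delays are illustrative defaults, not Kafka configuration keys:

```python
import time

def upload_with_retries(upload_fn, retries=3, base_delay_s=2.0, sleep=time.sleep):
    """Call upload_fn, retrying on IOError with exponential backoff.

    Re-raises after the final attempt so the segment stays on local disk.
    """
    delay = base_delay_s
    for attempt in range(retries + 1):
        try:
            return upload_fn()
        except IOError:
            if attempt == retries:
                raise                      # give up: segment stays local, alert fires
            sleep(delay)
            delay *= 2                     # exponential backoff

# Usage sketch: an upload that fails twice, then succeeds on the third try.
calls = {"n": 0}
def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("S3 timeout")
    return "uploaded"

assert upload_with_retries(flaky_upload, sleep=lambda _: None) == "uploaded"
assert calls["n"] == 3
```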
Mitigation: Monitor the RemoteStorageManager error rates in your broker logs. Set up alerting for repeated upload failures. Consider increasing retry counts and timeouts if your S3 infrastructure is congested.
Consumer Lag on Cold Reads
A consumer falls far behind (lag = 30 days) and is now reading from the remote tier.
- Consumer fetch throughput drops 40% (from 950 Mbps to 620 Mbps).
- Consumer is now a cold-read workload, contending with S3 throughput limits and network egress costs.
- If the cold-read lag is intentional (e.g., backfill ETL), cost is unavoidable. If it’s a bug (consumer crash), fix the consumer, and lag will catch up naturally.
Mitigation: Tier your topics by expected access patterns. Frequently read topics (operational metrics) stay in the hot tier. Infrequently read topics (audit logs) tiered aggressively. Set up alerts for consumer lag exceeding a threshold (e.g., lag > 14 days on a production consumer).
Log Compaction + Tiering Interplay
A topic uses log compaction (e.g., user profile events). When a profile changes, only the latest offset for that user ID is kept; older offsets are eligible for deletion.
- Compaction merges segments, producing a new segment with fewer records.
- Original segments are marked for deletion (compaction cleanup policy).
- Tiered storage uploads the original segments to S3.
- Compaction deletion happens after the upload completes (coordinated by the LogCleaner thread).
Subtlety: If compaction is very aggressive (e.g., reducing log size by 90%), the remote tier will still contain the original, uncompacted segments. A consumer reading from remote will see all versions of a key, not the compacted view. This is correct but can be surprising. Recommendation: Run compaction and tiering on separate topics, or accept that remote reads are non-compacted.
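If you accept non-compacted remote reads, a consumer can rebuild the compacted view client-side by keeping only the last value per key (with None as a tombstone). This is an illustrative sketch, not a Kafka API:

```python
def compacted_view(records):
    """records: (key, value) pairs in offset order; value None acts as a tombstone."""
    view = {}
    for key, value in records:
        if value is None:
            view.pop(key, None)    # tombstone deletes the key
        else:
            view[key] = value      # later offsets win
    return view

events = [
    ("user-1", "profile-v1"),
    ("user-2", "profile-v1"),
    ("user-1", "profile-v2"),
    ("user-2", None),              # user-2 deleted
]
assert compacted_view(events) == {"user-1": "profile-v2"}
```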
Implementation Guide
Deploying tiered storage requires careful configuration and monitoring. Here’s a production-grade walkthrough.
Step 1: Upgrade Kafka
Tiered storage shipped as early access in Kafka 3.6 (released October 2023) and was declared production-ready in Kafka 3.9. Upgrade your cluster in a rolling fashion:
# On each broker, in sequence:
# 1. Stop broker
# 2. Replace kafka_2.13-3.5.0.jar with kafka_2.13-3.6.0.jar
# 3. Start broker
# 4. Wait for ISR rebalance (~30 seconds)
Step 2: Install & Configure RemoteStorageManager Plugin
Use Aiven’s open-source S3 plugin (available on Maven Central):
<dependency>
<groupId>io.aiven</groupId>
<artifactId>tiered-storage-s3</artifactId>
<version>0.10.0</version>
</dependency>
Copy the JAR to $KAFKA_HOME/libs/, then update server.properties:
# Enable the tiered storage subsystem (broker-level)
remote.log.storage.system.enable=true
# RSM plugin class
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.storage.manager.class.path=/opt/kafka/libs/tiered-storage-s3-*.jar
# S3 configuration
remote.log.storage.s3.bucket=my-kafka-backups
remote.log.storage.s3.region=us-east-1
remote.log.storage.s3.endpoint.url=https://s3.us-east-1.amazonaws.com
remote.log.storage.s3.path.style=true
# Compression & encryption
remote.log.storage.s3.compression=snappy
remote.log.storage.s3.sse.algorithm=AES256
# Local tier retention (must be shorter than or equal to total retention)
log.local.retention.ms=604800000 # 7 days
# Total retention across local + remote tiers
log.retention.ms=31536000000 # 365 days
Step 3: Create S3 Bucket & IAM Policy
# Create bucket
aws s3 mb s3://my-kafka-backups --region us-east-1
# Enable versioning (optional but recommended)
aws s3api put-bucket-versioning \
--bucket my-kafka-backups \
--versioning-configuration Status=Enabled
# Set lifecycle policy to expire objects after 2 years
aws s3api put-bucket-lifecycle-configuration \
--bucket my-kafka-backups \
--lifecycle-configuration file://lifecycle.json
Attach an IAM policy to the Kafka broker instances:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": "arn:aws:s3:::my-kafka-backups/*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::my-kafka-backups"
}
]
}
Step 4: Enable Tiering on Topics
Topics must opt in to tiering; the broker-level switch alone is not enough. To enable and tune per topic:
# Enable tiering on a specific topic
kafka-configs --bootstrap-server localhost:9092 \
--entity-type topics \
--entity-name orders \
--alter \
--add-config remote.storage.enable=true,local.retention.ms=604800000
# Disable tiering on a topic
kafka-configs --bootstrap-server localhost:9092 \
--entity-type topics \
--entity-name events \
--alter \
--add-config remote.storage.enable=false
Step 5: Monitoring & Health Checks
Track these metrics in Prometheus or your observability platform:
# Broker-level metrics (via JMX)
kafka.log:type=RemoteStorageManager,name=RemoteLogUploadTime
kafka.log:type=RemoteStorageManager,name=RemoteLogUploadErrorCount
kafka.log:type=RemoteLogMetadataManager,name=RemoteLogMetadataManagerTask
kafka.server:type=KafkaRequestHandlerPool,name=TotalTimeMs,clientId=*
# Consumer lag (per consumer group)
__consumer_offsets topic lag (use Burrow or Kafka exporter)
# S3 metrics (via CloudWatch)
NumberOfObjects
BucketSizeBytes
PutObject latency (p50, p99)
Step 6: Capacity Planning
For a 10-broker cluster with 100 topics, 3-way replication, 500 MB/s ingestion, 365-day retention:
- Local disk: 7 days × 500 MB/s × 86,400 s/day ≈ 302 TB, tripled by 3-way replication to ≈ 907 TB, or ≈ 91 TB per broker across 10 brokers. Provision extra disk on top of that for headroom.
- S3 storage: 365 days × 500 MB/s × 86,400 s/day ≈ 15.8 PB total (stored once; no replication needed in S3). At $0.023/GB-month, the monthly cost is ≈ $363,000.
- S3 egress: Assume 10% of stored data is read cold each month: ≈ 1.58 PB × $0.09/GB ≈ $142,000/month.
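The capacity arithmetic can be checked in a few lines, under the stated assumptions (500 MB/s cluster-wide ingestion, 10 brokers, 3-way local replication, data stored once in S3):

```python
MB = 1e6
ingest = 500 * MB                        # 500 MB/s cluster-wide ingestion
local_total = ingest * 86_400 * 7 * 3    # 7 days, 3-way replication
s3_total = ingest * 86_400 * 365         # 365 days, stored once in S3

print(f"local, per broker (10 brokers): {local_total / 10 / 1e12:.0f} TB")   # 91 TB
print(f"S3 total: {s3_total / 1e15:.1f} PB")                                 # 15.8 PB
print(f"S3 monthly cost: ${s3_total / 1e9 * 0.023:,.0f}")                    # $362,664
```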
FAQ
Q: Does tiered storage affect exactly-once semantics?
No. Tiered storage is transparent to the producer and consumer APIs. Exactly-once semantics (idempotent producers + transactional reads) are unchanged. The remote tier is just another storage backend.
Q: Can I run tiered storage with ZooKeeper?
Yes. KIP-405 is compatible with ZooKeeper-mode Kafka (3.6+). However, ZooKeeper mode is deprecated and removed entirely in Kafka 4.0, so KRaft is the long-term path.
Q: What about Confluent vs. open-source RSM plugins?
- Aiven’s open-source S3 plugin (tiered-storage-s3): Free, Apache 2.0 licensed, supports Kafka 3.6+.
- Confluent Tiered Storage: Proprietary, includes GCS and Azure Blob Storage, commercial support; requires a Confluent subscription.
- Custom RSM: You can implement RemoteStorageManager yourself (e.g., HDFS, NFS, proprietary cloud storage).
Q: How do cold reads affect latency?
p50 latency increases from 1–2ms (hot) to 50–150ms (cold). p99 can spike to 500ms+ under contention. For backfill workloads, this is acceptable. For real-time queries, keep data in the hot tier.
Q: What happens to compacted topics?
Compacted segments are still uploaded to the remote tier in their original (uncompacted) form. Consumers reading from remote see all historical versions of a key, not the compacted view. Recommendation: Avoid combining aggressive compaction with tiering on the same topic, or accept the non-compacted read semantics.
Q: Can I change retention policies after enabling tiering?
Yes. You can adjust log.retention.ms and log.local.retention.ms (or their per-topic equivalents retention.ms and local.retention.ms) on the fly via kafka-configs. Changes take effect the next time the retention threads run: segments that fall out of the new local window are uploaded (if not already) and then removed from local disk.
Where Kafka Is Heading
Tiered storage is a stepping stone toward even more ambitious architectures.
KIP-1150: Diskless Topics
A proposal (still in discussion) would extend tiering to the extreme: topic data lives entirely in a remote object store, and brokers become near-stateless proxies with minimal local storage. Failover becomes nearly instantaneous—no need to replay the log. This is similar to Pulsar’s architecture and would be a major step toward Kafka-as-a-platform (treating the broker as a microservice, not a pet).
Tiered Storage + KRaft Maturity
KRaft (Kafka Raft) removes ZooKeeper as a dependency (production-ready since Kafka 3.3; the only mode as of 4.0). Combined with tiered storage, a KRaft cluster becomes far more cloud-native: minimal local state, no external coordination service, and effectively unlimited retention via S3. This is the direction of managed Kafka services (Confluent Cloud, Aiven, AWS MSK, etc.).
Multi-Tier & Heterogeneous Storage
Future KIPs may introduce three or more tiers: NVMe (ultra-hot, p50 = 0.1ms), SSD (hot, p50 = 1ms), HDD (cold, p50 = 50ms), and S3 (archive, p50 = 200ms). Brokers would intelligently demote segments based on access patterns, similar to tiered storage in modern databases (e.g., DynamoDB, BigTable).
References
- KIP-405: Tiered Storage. https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Tiered+Storage
- Apache Kafka 3.6 Release Notes: https://archive.apache.org/dist/kafka/3.6.0/RELEASE_NOTES.html
- Apache Kafka 3.7 Release Notes: https://archive.apache.org/dist/kafka/3.7.0/RELEASE_NOTES.html
- Apache Kafka 3.8 Release Notes: https://archive.apache.org/dist/kafka/3.8.0/RELEASE_NOTES.html
- Aiven Tiered Storage for S3: https://github.com/Aiven-Open/tiered-storage-s3
- Confluent Tiered Storage: https://docs.confluent.io/kafka/operations-tools/tiered-storage/overview.html
- KIP-1150: Diskless Topics. https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics (in discussion)
Related Posts
Tiered storage is one piece of a larger cloud-native data infrastructure. See also:
- Apache Iceberg: Data Lakehouse Architecture for Production Analytics — How to structure petabyte-scale data warehouses with open-table formats.
- eBPF & Kubernetes Observability: Replacing APM with Kernel Instrumentation — Observing Kafka’s I/O and network behavior without agent overhead.
- ArgoCD vs. Flux: GitOps Decision Record for Kafka Cluster Management — Declaratively managing Kafka infrastructure (including tiered storage configs).
- Event-Driven Backtesting Engine: Architecture for Algorithmic Trading — A real-world use case: using tiered Kafka for multi-year tick data.
Last Updated: April 18, 2026
Feedback or corrections? Open an issue or PR in the blog repository.
