Educational systems analysis only — not financial or trading advice.
Architecture at a glance





Educational systems analysis only — not financial or trading advice.
Streaming Trade Data into Iceberg: 2026 Architecture for Fintech
Trade-data lakes used to be a quiet corner of a fintech stack. Surveillance teams pulled overnight extracts, TCA decks shipped on Tuesday, and the only thing the regulator wanted was a tidy CSV every quarter. That world is gone. By 2026 the surveillance window is intraday, regulators expect bit-for-bit reproducibility of an order book on any past second, and quants want the same data their machine-learning pipeline trains on to be the same table the compliance officer queries an hour later. The architecture that absorbs all of that — without three copies of every byte — is converging on a single pattern: streaming trade data into Apache Iceberg, materialized directly from Kafka via Tableflow-style pipelines, governed by an open catalog, queried by whichever engine the team prefers.
This post walks the architecture end to end. We will look at the reference design, how Iceberg snapshot semantics give surveillance teams real time travel, how schema evolution survives FIX tag churn, what “near real-time” actually means when Tableflow is in the loop, where the money goes, and where the sharp edges hide. The goal is not a vendor pitch — it is a clean systems analysis that a head of platform engineering at a broker-dealer, an exchange, or a digital-asset venue can use to pressure-test their own roadmap.
Context: What a 2026 Fintech Trade-Data Lake Has to Do
A modern trade-data platform serves four overlapping workloads, and the architecture has to keep all four honest.
Surveillance and market-abuse detection is the most time-sensitive. Spoofing, layering, wash trading, marking the close — the patterns regulators care about live in the relationship between orders, cancels, modifies, and executions across milliseconds. The European Securities and Markets Authority and equivalent national bodies expect surveillance signals to be investigated promptly, with full reconstruction of the order book at the moment of interest. That reconstruction is the bit that breaks naive architectures: you need every event, in order, with no resampling.
Regulatory reporting and time travel is the long tail. Under MiFID II RTS 24, EU investment firms must retain order-related records for at least five years in a form that allows competent authorities to reconstruct order book activity, and the Consolidated Audit Trail (CAT) in the US imposes similar retention on broker-dealers and exchanges through FINRA CAT LLC. The retention window is years, the access pattern is “give me everything that touched ISIN X between 14:00:00 and 14:00:10 on a Tuesday two years ago”, and the storage bill cannot eat the P&L.
Transaction cost analysis (TCA) sits in between. Buy-side desks want to see slippage versus arrival price, implementation shortfall, venue toxicity, child-order behaviour — usually within minutes of the parent order finishing, sometimes intraday. TCA pipelines join order, execution, and market-data tables that were written by completely different services.
Post-trade analytics, quant research, and ML feature pipelines are the fourth workload. They want the same source of truth — the same partitioning, the same schema, the same access controls — without having to copy a petabyte of Parquet into a private bucket every Monday.
The job of the platform is to make those four workloads share one canonical table, with the right latency tier and the right access controls each time. That is exactly the problem Iceberg-on-streams was designed for.
The Reference Architecture
The reference design splits cleanly into six layers: edge ingest, the Kafka streaming backbone, a Tableflow-style materializer, the Iceberg lakehouse, an open catalog, and a fan-out of query engines. The high-level shape is in the diagram below.

Edge: FIX, OUCH, ITCH and Drop Copies
The first hop is the boring one that everything depends on. Order entry comes in over FIX 4.4 or 5.0 SP2 from buy-side clients, or over native exchange protocols like Nasdaq’s OUCH for order entry and ITCH for the matching-engine market-data feed. Internal OMS and EMS systems publish drop-copy streams that mirror every order, modify, cancel, and execution leaving the firm.
The decoders normalize these into protobuf or Avro records with a schema-registry-managed contract — never JSON for hot-path market data, because the wire-cost and parse-cost differences at a million messages per second are real. Timestamps are captured at three points: gateway ingress, OMS acceptance, and Kafka publish, all in nanoseconds where the upstream protocol supplies them. Losing nanosecond precision at this layer is unrecoverable later.
Tableflow: Kafka to Iceberg
Tableflow is Confluent’s name for the pattern of materializing a Kafka topic directly into an Iceberg table without writing a custom Flink or Spark job to do it. AWS MSK has a similar capability, and Redpanda’s Iceberg topics ship the same idea from the broker side. Open-source variants exist via Kafka Connect’s Iceberg sink. The point is not the brand — it is that the materializer reads from the topic, batches records into Parquet, writes them to object storage, and commits an Iceberg snapshot on a configured cadence, typically every 15 to 60 seconds.
This is the move that removes an entire class of plumbing code from a fintech stack. The schema in the registry becomes the schema of the Iceberg table. The ordering guarantees of the Kafka partition become the row order in the snapshot. Watermarks and exactly-once semantics are the materializer’s problem, not yours.
Compaction Strategy
Streaming writes produce small files — that is physics. A topic flushing every 15 seconds on 64 partitions writes 64 Parquet files per flush, each maybe 10–50 MB. Within a day you have a hundred thousand files, your query planner spends more time listing than reading, and Trino’s split scheduler starts to suffer.
Compaction is the answer, but it has to be configured deliberately. The pattern that works in practice is a tiered compaction job: bin-pack small files into 256–512 MB targets every hour for the current day, then run a sort-based rewrite overnight that orders rows by (symbol, ts_ns) to make point-in-time scans cheap. Iceberg’s rewrite_data_files procedure handles both modes. The compaction job is itself a writer, so it produces new snapshots — which is fine, because Iceberg snapshots are cheap until you keep too many of them.
Catalog: Polaris, Nessie, REST
The catalog is the boring-but-load-bearing part of the architecture. It is what tells every query engine where the current snapshot lives, who is allowed to read which columns, and how to discover tables across schemas. In 2026 the live options are Apache Polaris (the project Snowflake donated to the ASF, formerly Snowflake Open Catalog), Project Nessie (Dremio’s catalog with git-style branching), the AWS Glue Iceberg REST endpoint, and Unity Catalog from Databricks now that it has opened up Iceberg REST.
The architectural decision is to expose the Iceberg REST Catalog specification, not a vendor-proprietary API, so that Trino, Snowflake, Databricks, Flink, Spark, and DuckDB can all talk to the same catalog without translation. Access control belongs in the catalog layer, not in each query engine — row-level filters for desk-segregation, column masks for PII fields like client account numbers, and per-table audit logs that the compliance team owns.
Time Travel for Surveillance
This is the feature that makes Iceberg worth the migration for a surveillance team. Every commit produces a new snapshot, and every snapshot is queryable by ID, by timestamp, or as the input to a changelog scan. The pattern is illustrated below.

Imagine an alert fires at 09:31:00 UTC for a suspect cancel burst on a thinly traded mid-cap. The analyst’s first three queries are now trivial SQL:
-- Order book state exactly when the burst began
SELECT * FROM orders
FOR TIMESTAMP AS OF TIMESTAMP '2026-06-01 09:30:30 UTC'
WHERE symbol = 'XYZ';
-- Everything that changed between t and t+30s
SELECT * FROM orders.changes
WHERE snapshot_id BETWEEN <s3> AND <s4>
AND symbol = 'XYZ';
-- Diff: what was cancelled in that window
SELECT order_id, cancel_reason
FROM orders.changes
WHERE op = 'delete' AND snapshot_id = <s4>;
Why this beats nightly extracts is straightforward. A nightly Parquet drop gives you one state — end of day — and forces you to reconstruct intraday transitions from event logs that may have been resampled, lost, or shipped on a different cadence. Iceberg time travel gives you the snapshot the writer committed, byte for byte, including the equality deletes and position deletes that represent corrections and cancellations. The auditor’s question “what did the book look like at 09:30:30.250” becomes a single query against a specific snapshot_id, and that ID is an immutable artifact you can attach to the evidence pack.
Retention is the lever you have to set carefully. MiFID II RTS 24 requires a minimum of five years for order records, and many firms hold seven to match SEC Rule 17a-4 and the CAT retention requirement. Keeping every 15-second snapshot for seven years is wasteful and will explode your metadata. The pattern that works:
- Hot tier, last 7 days: keep every snapshot, expire none — supports same-week investigations and intraday changelog replay.
- Warm tier, days 8–90: retain hourly snapshots — enough granularity for almost every reconstruction request without exploding the manifest tree.
- Cold tier, days 91 to 7 years: retain end-of-day snapshots plus any snapshot the surveillance system flagged — meets the regulatory requirement and keeps the manifest count manageable.
Iceberg’s expire_snapshots procedure handles the lifecycle. The metadata-tree compaction (rewrite_manifests) is the unsung hero that keeps planning latency flat as the table ages.
Schema Evolution and Order Book Quirks
Trade data schemas drift constantly. New FIX tags appear when a venue launches a new order type, internal teams add algorithmic-strategy IDs, regulators introduce new fields (the lit/dark venue indicator under MiFID II’s transparency regime is one example), and someone always wants to widen a price decimal when a new asset class lands on the platform. Iceberg’s schema evolution rules — backed by stable field IDs rather than column names — are what keep historical reads working through all of that.

The rules worth memorizing:
- Add column: always safe. Old data reads NULL for the new field. This is how you absorb a new FIX tag without a backfill.
- Drop column: safe at the metadata level. The data files still contain the bytes; the catalog stops surfacing them. If you actually want the storage back, run a rewrite.
- Rename column: safe because Iceberg tracks columns by ID. Renaming
sidetoorder_sidedoes not rewrite a single byte. - Type promotion: allowed in one direction only —
int→long,float→double,decimal(P,S)→decimal(P',S)withP' ≥ P. Going the other way — narrowing a long back to an int — is blocked and would require a full rewrite.
The trap a lot of teams hit on first contact is mixing schema evolution with partition evolution. Iceberg supports both, but partition evolution requires rewriting old data into the new partition spec if you want the optimizer to take advantage of the new layout. The pragmatic pattern is to set a partition spec you can live with for a year (hour-bucketed for orders, day-bucketed plus symbol-bucket for market data) and resist the temptation to evolve it on a whim.
The order-book quirk that surprises people is corrections. When a trade is busted or a price is corrected post-trade, you do not want to in-place mutate the original row — you want the audit trail. Iceberg’s row-level operations via equality deletes or position deletes give you that. The original row stays in the snapshot it was written to; the deletion is a new file in a new snapshot; the time-travel query at the original timestamp still shows the original (pre-bust) data, which is exactly what the regulator expects.
Latency Budget
The most important thing to be honest about: streaming-into-Iceberg is near real-time, not real-time. Tableflow-class materializers commit on a cadence measured in tens of seconds, not microseconds. That latency budget is fine for surveillance, TCA, and reporting. It is not fine for live algorithmic-trading decision loops, live risk, or anything that has to react inside a single market quote. The diagram below shows the breakdown.

The default Tableflow path runs roughly: 2 ms producer ACK, 5 ms Kafka commit, 15 s batching window in the materializer, 18 s Iceberg manifest commit, 20 s catalog refresh, 25 s before a downstream query engine can see the snapshot. The exact numbers vary with batch size, partition count, and object-store latency, but 10 to 60 seconds end-to-end is the right mental model.
For workloads that need sub-second, the architecture pattern is to fan out, not push harder. The same Kafka topic feeds a Flink job in parallel with the Tableflow materializer. Flink keeps the live state (running VWAP, intraday position, risk exposure) in its state backend and pushes derived outputs to Redis or directly to the trading UI. Flink can also write Iceberg micro-commits more aggressively than Tableflow — sub-second commits are technically possible, but the metadata churn rate makes them a poor default. The clean three-tier model is: Tier 1 (live, <100 ms, Flink state); Tier 2 (near real-time, 10–60 s, Tableflow → Iceberg); Tier 3 (batch, minutes, Spark compaction → Iceberg).
Cost Engineering
A petabyte of trade history is no longer expensive to store — it is expensive to query badly. The cost decisions worth obsessing over are storage tiering, partition design, and engine choice, in that order.

Storage tiering maps cleanly onto the snapshot retention policy from earlier. Hot S3 Standard runs around $23/TB/month; S3 Infrequent Access and equivalents drop to around $12/TB/month; Glacier Instant Retrieval drops the cold tier to roughly $4/TB/month with millisecond first-byte for the rare regulator query. Iceberg’s manifest layer does not care which class an underlying object lives in, so lifecycle policies in the object store can do the heavy lifting transparently.
Partition design is where most teams pay the most preventable cost. Partitioning naively by symbol creates tens of thousands of tiny directories, kills file-listing performance, and burns money. The pattern that works is bucket(64, symbol) + hour(ts) for orders/executions and bucket(128, symbol) + day(ts) for market data, sorted by (symbol, ts_ns) after compaction. Bloom filters on order_id and client_id accelerate point lookups for surveillance investigations at marginal storage cost.
Engine choice is a workload-fit question, not a religious one. Trino is cheap, BYO-compute, and excellent for ad-hoc surveillance and TCA queries. Snowflake-on-Iceberg gives you premium ergonomics and elastic warehouses for the desk that does not want to run anything. Databricks SQL with Photon shines when you need ML pipelines alongside SQL. DuckDB on an analyst’s laptop is a perfectly valid read path for cold-data investigations and costs nothing. The Iceberg REST catalog makes all of them peers.
Trade-offs and Gotchas
The honest list of sharp edges:
- Equality deletes and the merge-on-read tax. Corrections, busts, and late-arriving reference data are handled with equality deletes, which the reader has to merge at query time. Heavy delete loads degrade scan performance. The compaction job should rewrite deletes into the data files on a regular cadence.
- Vendor compatibility drift. “Iceberg-compatible” varies by version. Iceberg v2 with deletes is well supported; v3 with row lineage and variant types is rolling out across engines through 2026 at different speeds. Pin your spec version and validate every engine you intend to use.
- Replay storms. Re-running a Tableflow pipeline from offset 0 against a years-old topic creates a huge volume of small commits and can saturate object-store request limits. Throttle replay and use a separate target table for backfills.
- Catalog as a single point of coordination. A misconfigured catalog can take down every reader at once. Run it HA, alert on commit latency, and keep emergency-read paths via direct snapshot-ID queries.
- Encryption and key rotation. Object-store-side encryption is table stakes; Iceberg’s table-level KMS keys with per-column encryption are the path for PII columns. Plan rotation up front.
Practical Recommendations
A pragmatic adoption checklist for a 2026 fintech platform team:
- Start with one high-value workload — surveillance is usually the strongest business case — and migrate the table for that workload to Iceberg first.
- Standardize on the Iceberg REST Catalog spec from day one; do not lock the catalog to a single vendor.
- Use Tableflow (or its equivalent) as the default ingest path; keep Flink for sub-second tiers and stateful aggregations.
- Set snapshot retention by tier (hot/warm/cold) and automate
expire_snapshotsplusrewrite_manifests. - Pick a partition spec you can live with for 12 months and resist evolution unless query latency demands it.
- Run compaction on an hourly cadence for the current day and a sort-based rewrite overnight.
- Treat schema evolution as a first-class change-management process — additions are cheap, but every change needs a registry update and a downstream-impact note.
- Map storage tiers to the retention policy; let the object store’s lifecycle rules do the work.
- Document the engine matrix — which engine is allowed to read which table, which is allowed to write — and audit it quarterly.
FAQ
Iceberg vs kdb+ for trade data — do I still need kdb+?
For tick-level, sub-millisecond, in-memory analytics on the current trading session — kdb+ and its peers (QuestDB, Vertica, ClickHouse) are still excellent and well-suited. Iceberg-on-streams is not trying to replace that tier. The architectural pattern is to keep the low-latency store for live workloads and use Iceberg as the canonical, long-retention, multi-engine source of truth for everything from the last few minutes back through the regulatory window. Many firms run both.
Can I do tick-level analytics directly on Iceberg?
Yes, with the right partitioning and engine. Sorted Parquet with (symbol, ts_ns) ordering, bloom filters on order_id, and an engine that respects file-level statistics (Trino, DuckDB, Spark with vectorized reads) will give you tens-of-milliseconds query latency on a single symbol’s daily ticks. It will not match an in-memory column store, but it is more than adequate for TCA, surveillance, and research.
Tableflow vs Flink — when do I pick which?
Tableflow if your target is “near real-time table” and you do not need stateful transforms — most surveillance and reporting workloads. Flink if you need stateful joins, windowed aggregations, sub-second latency, or any non-trivial business logic between Kafka and Iceberg. They coexist: most architectures end up running both on the same topics.
How long should I retain Iceberg snapshots?
Match retention to your tiers. Days 0–7: every snapshot. Days 8–90: hourly. Days 91 to your regulatory floor (5 years for MiFID II RTS 24, 7+ for many CAT-bound firms): daily snapshots plus any flagged for investigation. Expire snapshots aggressively outside those tiers and rewrite manifests after each pass.
Does this architecture work for digital-asset venues?
Yes — the protocol layer differs (WebSocket order entry, internal matching engines, on-chain settlement) but the streaming-into-Iceberg pattern is unchanged. The reporting regime (MiCA in the EU, evolving SEC and CFTC frameworks in the US) is moving in the same direction as MiFID II RTS 24, and the architecture survives the rule shifts because Iceberg’s time travel is the answer to “what did the book look like at T” regardless of which regulator is asking.
Does Snowflake Open Catalog or Polaris matter for fintech?
The catalog you pick determines who can read your tables and how access control is enforced. Polaris (formerly Snowflake Open Catalog), Nessie, Glue Iceberg REST, and Unity Catalog all expose the same Iceberg REST spec — choose based on operational fit, not feature differentiation. The architecturally important thing is that the spec is open.
Further Reading
- TWAP execution algorithm architecture for 2026 — companion piece on the execution side of the same data flow.
- Apache Pulsar geo-replication architecture for 2026 — when Kafka is not the right backbone and you need active-active geo-replicated streams.
Author
The IoT Digital Twin PLM editorial team writes systems analysis on the data, control, and integration layers that real industrial and financial platforms run on in production. We do not offer trading advice, financial advice, or predictions.
