This article is a systems and architecture analysis for engineering audiences. It is not financial, legal, or compliance advice.
Real-Time Fraud Detection Architecture: Sub-100ms Scoring
A card-not-present authorization arrives. Somewhere in the span of a human blink, a system must decide whether to let it through, challenge it, or decline it. A real-time fraud detection architecture is the set of components that turns that raw event into a calibrated risk decision before the payment rail commits — and it has to do so while the cardholder is still staring at a spinner. The hard part is not the model. The hard part is delivering a fresh, leak-free feature vector and a calibrated score inside a budget measured in tens of milliseconds, then acting on it without drowning the operations team in false alarms. This post is a reference architecture: four tiers, a worked latency budget, and the trade-offs that decide whether the system catches fraud or merely annoys good customers.
What this covers: the four-tier pipeline (ingest, features, scoring, decision), a sub-100ms latency breakdown, online/offline feature parity, the recall-versus-false-positive bind, and the failure modes that quietly erode accuracy over time.
Context and Background
For two decades, much fraud screening ran in batch. Transactions settled on rails that took a day or more, so a model could score overnight, analysts could review queues in the morning, and a suspicious charge could still be clawed back before money truly moved. That slack is gone. Instant-payment rails — FedNow in the United States, UPI in India, SEPA Instant in Europe — settle irrevocably in seconds. Card networks push authorization round-trips that the issuer is expected to answer in well under a second. When the rail is irrevocable and synchronous, a decision that arrives after settlement is not a decision; it is a post-mortem.
This is the latency-accuracy bind. You cannot move accuracy up by spending more time, because the time does not exist. Card-not-present fraud — the dominant loss category as chip cards pushed fraud away from the physical point of sale — compounds the problem: there is no card present to inspect, only a stream of attributes that an attacker can partially forge. The signal that separates a legitimate stranger from a fraudster often lives in behavior across many events (this device, this hour, this merchant, this velocity) rather than in any single field. Surfacing that behavioral signal in time is an architecture problem before it is a modeling one.
Batch scoring fails here for a structural reason: it scores the world as it looked at the last checkpoint, not as it looks now. A fraud ring that opens an account and runs forty transactions in ninety seconds is invisible to a model whose newest feature is an hour old. Real-time fraud detection exists to close that gap — to compute behavioral features as events arrive and score against them synchronously. The streaming substrate that makes this possible is the same one used across modern event systems; if you want the rails context, see our walk-through of real-time payment infrastructure across FedNow, UPI, and SEPA Instant. For the streaming engine itself, the Apache Flink documentation is the canonical reference on event-time windowing and exactly-once state, the two properties this architecture leans on hardest.
The Four-Tier Reference Architecture

A real-time fraud detection architecture decomposes into four tiers: a streaming ingest layer that captures and enriches events, an online feature layer that serves fresh aggregates with read latency in single-digit milliseconds, a scoring layer that assembles a feature vector and runs a model under a strict budget, and a decision layer that maps the score to an action. Each tier has its own latency, consistency, and failure profile, and the boundaries between them are where most real systems break.
Figure 1: The four-tier pipeline — ingest enriches raw auth events, the online feature layer serves velocity and graph aggregates, the scoring layer produces a risk score, and the decision layer maps that score to approve, step-up, or decline while emitting labels back to a case queue.
The diagram traces one event from capture to decision and shows the feedback edge that most diagrams omit: the decision and its eventual outcome flow back as labels. That loop is not decoration — it is how the system learns what it got wrong, and its latency (the label-delay problem) shapes everything upstream. Read the figure left-to-right as the synchronous hot path, and treat the bottom edge as the asynchronous learning path that runs on a different clock.
Tier 1 — Streaming ingest: capture and enrichment
The first tier turns a payment authorization into a structured, enriched event on a durable log. In practice this is a topic on a streaming bus — Kafka or an equivalent — that fronts the rest of the pipeline. Durability matters: if the bus loses events, the feature layer’s aggregates silently drift, and you will not know until your recall quietly degrades. The Apache Kafka documentation describes the partitioning and retention semantics that make this log both the system of record for in-flight events and the replay source when you need to rebuild state.
Enrichment happens here because it is cheap and reusable. The raw event carries an amount, a card token, a merchant identifier, and a timestamp. The ingest tier attaches derived context: geolocation from IP, device fingerprint, bank identification number (BIN) metadata, merchant category, and any tokenized identity links. Doing this once at ingest — rather than repeatedly at scoring time — keeps the hot path lean and ensures every downstream consumer sees the same enriched view. A subtle rule: enrichment must be deterministic and side-effect-free, because the same event may be replayed during recovery, and a non-deterministic enrichment (a lookup against a mutable table) reintroduces the very training/serving skew the architecture is trying to eliminate.
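The determinism rule can be made concrete with a minimal sketch. It assumes reference data is pinned to an immutable, versioned snapshot, so replaying the same event against the same snapshot yields the same enriched record. The type names, field names, and sample tables below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class RawAuth:
    card_token: str
    merchant_id: str
    amount_cents: int
    event_time_ms: int
    bin_prefix: str


@dataclass(frozen=True)
class EnrichedAuth:
    raw: RawAuth
    bin_country: str
    merchant_category: str
    snapshot_version: str  # records which reference snapshot was used


def enrich(event: RawAuth,
           bin_table: Mapping[str, str],
           mcc_table: Mapping[str, str],
           snapshot_version: str) -> EnrichedAuth:
    """Pure function: output depends only on the event and the pinned snapshot."""
    return EnrichedAuth(
        raw=event,
        bin_country=bin_table.get(event.bin_prefix, "UNKNOWN"),
        merchant_category=mcc_table.get(event.merchant_id, "UNKNOWN"),
        snapshot_version=snapshot_version,
    )


# Hypothetical read-only reference snapshots, pinned to the label "v1".
BIN_V1 = MappingProxyType({"411111": "US"})
MCC_V1 = MappingProxyType({"m-42": "5411"})

event = RawAuth("tok_abc", "m-42", 1299, 1_700_000_000_000, "411111")
first = enrich(event, BIN_V1, MCC_V1, "v1")
replayed = enrich(event, BIN_V1, MCC_V1, "v1")  # same inputs, same output
```

Stamping the snapshot version on the enriched event is one way to make a replay reproducible: recovery can look up the snapshot the original run used instead of whatever the table contains today.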
Tier 2 — Online feature layer: parity and point-in-time correctness
The second tier is where most of the discriminative signal is born and where most of the subtle bugs live. Behavioral fraud features are aggregates over windows: count of transactions on this card in the last sixty seconds, sum of amounts to this merchant in the last hour, number of distinct devices on this account today, time since the last decline. These velocity and aggregation features are computed by the streaming engine as events flow and written into an online store — a low-latency key-value system (Redis, a managed feature store’s online tier, or similar) that the scoring layer can read in single-digit milliseconds.
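As an illustration of a velocity feature, here is a minimal in-memory sketch of an exact sliding-window count per key. It assumes events arrive in event-time order; a real streaming engine would handle out-of-order arrival and persist state, which this sketch does not attempt. The class name and key format are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Exact per-key count of events in the trailing window (now - window, now]."""

    def __init__(self, window_ms: int):
        self.window_ms = window_ms
        self._events = defaultdict(deque)  # key -> event timestamps, oldest first

    def add(self, key: str, event_time_ms: int) -> None:
        # Assumption: timestamps for a given key arrive in non-decreasing order.
        self._events[key].append(event_time_ms)

    def count(self, key: str, now_ms: int) -> int:
        timestamps = self._events[key]
        cutoff = now_ms - self.window_ms
        # Evict everything that has aged out of the window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps)


counter = SlidingWindowCounter(window_ms=60_000)
for t in (0, 10_000, 50_000):
    counter.add("card:tok_abc", t)

in_window_early = counter.count("card:tok_abc", now_ms=55_000)  # all three events
in_window_late = counter.count("card:tok_abc", now_ms=61_000)   # event at t=0 has expired
```

Memory here grows with the number of events per key inside the window, which is the cost that motivates approximate structures for high-cardinality features.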
Two correctness properties dominate here. Online/offline parity means the feature value the model sees in production is computed by the same logic that produced the feature in training. If the offline pipeline computes “transactions in last hour” with a SQL window and the online pipeline approximates it with a slightly different boundary, the model trains on one distribution and serves on another — training/serving skew, the single most common cause of a model that looks great offline and underperforms live. Point-in-time correctness means the training set only ever sees feature values that were knowable at the moment of the historical event; leaking a future aggregate into a training row inflates offline metrics and collapses in production. A purpose-built feature store exists largely to enforce these two properties; the Feast feature store documentation is a good primary reference on the online/offline split and point-in-time joins, and we go deeper in our feature store architecture guide.
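Point-in-time correctness reduces to an "as of" lookup: for each historical event, take the latest feature value whose timestamp is not later than the event. The sketch below uses only the standard library and assumes the feature history for a key is available as a time-sorted list; the function names are illustrative and are not the API of any feature store.

```python
from bisect import bisect_right


def feature_as_of(history, event_time_ms):
    """history: list of (timestamp_ms, value) sorted by timestamp.

    Returns the latest value with timestamp <= event_time_ms, or None if the
    feature did not exist yet. Values written later are never visible.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time_ms)
    return history[idx - 1][1] if idx else None


def build_training_rows(history, labeled_events):
    """labeled_events: list of (event_time_ms, label). Joins each to its as-of value."""
    return [(feature_as_of(history, t), label) for t, label in labeled_events]


# Feature value snapshots for one card: (write time, txn_count_1h)
history = [(1_000, 1), (2_000, 2), (5_000, 3)]
rows = build_training_rows(history, [(3_000, 0), (6_000, 1), (500, 0)])
```

A naive join on the key alone would attach the newest value (3) to every row, including the event at 3,000 ms that could only have seen 2. That is the leak the as-of lookup prevents.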
Tier 3 — Scoring: gradient-boosted trees plus graph signals
The third tier assembles the feature vector and produces a risk score. For tabular fraud features, gradient-boosted decision trees (GBDT — XGBoost, LightGBM, CatBoost) remain the workhorse: they handle mixed-type features, tolerate missing values, train fast, and serve in well under a millisecond for a single row once loaded into memory. They are also interpretable enough that an analyst can ask why a transaction scored high and get a defensible answer from feature attributions.
Graph signals layer on top. Fraud is relational — rings share devices, addresses, and funding instruments — and a graph neural network (GNN) or simpler graph features can surface “this new account is two hops from a known mule” in a way no per-transaction feature captures. The pragmatic pattern is to compute graph embeddings or risk propagation scores asynchronously, materialize them into the online feature store, and read them as just another feature at scoring time, so the synchronous path stays a fast tabular lookup plus a tree ensemble. For the modeling background, the survey “Graph Neural Networks for Financial Fraud Detection” (arXiv:2411.05815) catalogs how relational structure is exploited without forcing a full graph traversal into the hot path. Model serving here is a tight in-process or sidecar inference call; the budget does not tolerate a network hop to a heavyweight model server for the primary score.
Tier 4 — Decision: rules, thresholds, and step-up auth
The fourth tier turns a number into an action. A raw score is not a decision — the decision layer combines the model score with deterministic rules (hard blocks for sanctioned BINs, velocity ceilings, merchant-specific policy) and score thresholds that map risk bands to outcomes: approve silently, challenge with step-up authentication (3-D Secure, one-time passcode), or decline. Rules sit alongside the model rather than inside it because some constraints are policy, not probability, and must hold regardless of what the model thinks. The decision layer also owns case management: high-risk events that are not auto-declined land in an analyst queue, and every outcome — approved-and-good, declined-and-fraud, challenged-and-abandoned — becomes a label that flows back to training. This tier is where business cost meets statistical risk, and it is the right place to tune the recall/false-positive trade-off rather than baking it into the model.
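A minimal sketch of that layering follows: policy rules are checked first and hold regardless of the score, then score bands map to an action. The blocked BIN, the velocity ceiling, and both thresholds are hypothetical placeholder values, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"
    DECLINE = "decline"


@dataclass(frozen=True)
class Txn:
    bin_prefix: str
    amount_cents: int
    card_txn_count_60s: int


# Hypothetical policy values for illustration only.
BLOCKED_BINS = frozenset({"999999"})
VELOCITY_CEILING_60S = 10
STEP_UP_AT = 0.30
DECLINE_AT = 0.85


def decide(txn: Txn, score: float) -> Action:
    # Policy rules run first: they are constraints, not probabilities.
    if txn.bin_prefix in BLOCKED_BINS:
        return Action.DECLINE
    if txn.card_txn_count_60s > VELOCITY_CEILING_60S:
        return Action.DECLINE
    # Score bands map calibrated risk to an outcome.
    if score >= DECLINE_AT:
        return Action.DECLINE
    if score >= STEP_UP_AT:
        return Action.STEP_UP
    return Action.APPROVE


borderline = decide(Txn("411111", 4_500, 2), score=0.42)  # falls in the step-up band
```

Because the thresholds live here rather than in the model artifact, they can be changed without retraining.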
Latency Budget, Features, and Tuning
The whole pipeline lives or dies on a budget. If the issuer must answer in roughly 100 milliseconds end-to-end, the fraud decision is one tenant inside that envelope and gets only a slice. The table below is illustrative — real numbers depend on hardware, colocation, and model size — but it shows how the budget is typically apportioned and where the pressure points are.
| Stage | Illustrative budget (ms) |
|---|---|
| Network in and request parse | 5 |
| Ingest enrichment (cached) | 8 |
| Online feature read | 12 |
| Feature vector assembly | 6 |
| Model inference (GBDT + graph lookup) | 15 |
| Rules and threshold evaluation | 4 |
| Decision serialization and response | 5 |
| Headroom and jitter buffer | 45 |
Table values are illustrative, not benchmarks. The headroom line is the honest part: you do not budget to the mean, you budget to the tail. A system whose p50 is 25ms but whose p99 is 140ms misses a 100ms budget on at least one transaction in a hundred, and possibly more, because p99 says nothing about how many requests land between 100ms and 140ms. At scale, one in a hundred is a large absolute number of mishandled authorizations. Tail latency is the enemy; garbage-collection pauses, a cold cache, a feature store node failover, or a noisy-neighbor effect all show up at p99, not p50. The discipline is to measure and govern the tail, set timeouts that degrade gracefully (an explicit default-deny or default-allow policy when the model times out), and treat the timeout path as a first-class design decision rather than an afterthought.
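The arithmetic behind the table and the tail argument can be checked with a short sketch. The stage budgets are copied from the illustrative table; the latency samples are made up to show how a healthy median can coexist with a missed budget. The nearest-rank percentile is one common definition, used here as an assumption.

```python
import math

# Stage budgets from the illustrative table above (milliseconds).
BUDGET_MS = {
    "network_in_and_parse": 5,
    "ingest_enrichment": 8,
    "online_feature_read": 12,
    "feature_vector_assembly": 6,
    "model_inference": 15,
    "rules_and_thresholds": 4,
    "serialize_and_respond": 5,
    "headroom_and_jitter": 45,
}
ENVELOPE_MS = sum(BUDGET_MS.values())


def percentile(samples, pct: int):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def miss_rate(samples, budget_ms):
    """Fraction of requests that exceeded the budget."""
    return sum(1 for s in samples if s > budget_ms) / len(samples)


# Made-up samples: 97 fast requests and 3 slow ones.
latencies_ms = [25] * 97 + [140] * 3
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
missed = miss_rate(latencies_ms, ENVELOPE_MS)
```

In this made-up sample the median is 25 ms and the p99 is 140 ms, yet 3% of requests miss the envelope. The percentile alone understates the miss rate, which is why the miss rate against the budget is worth tracking directly.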

Figure 2: The synchronous scoring path as a sequence — gateway to ingest to feature read to model to decision — every hop inside the same budget. The online feature read and model inference are the two stages most likely to blow the tail.
Figure 2 makes the serial dependency explicit: each hop waits on the previous one, so latencies add. There is no parallelism to hide behind on the critical path, which is why caching enrichment and keeping the model in-process matter so much — every avoidable network round-trip is budget you do not get back.
A useful discipline is to separate the budget into “must complete” and “best effort” stages. The score and the rules must complete inside the envelope; an expensive enrichment that is occasionally slow can be made best-effort, falling back to a cached or default value rather than blocking the decision. This is the same reasoning that drives keeping the primary GBDT model in-process: a network hop to a remote model server adds not just its own latency but a second tail-risk distribution on top of yours, and two independent tails compound badly at p99. Co-locating the model with the feature cache, and reserving remote calls for the asynchronous graph and embedding computations, is the structural move that keeps the synchronous path predictable.
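The compounding effect is easy to quantify under the independence assumption stated above. If each serial stage stays under its own p99 with probability 0.99, the chance that at least one stage lands in its tail grows with the number of stages. This is a back-of-the-envelope sketch, not a latency model; real stages are often correlated.

```python
def p_any_stage_in_tail(n_stages: int, quantile: float = 0.99) -> float:
    """Probability that at least one of n independent stages exceeds its own quantile."""
    return 1.0 - quantile ** n_stages


one_stage = p_any_stage_in_tail(1)   # about 1% of requests
two_stages = p_any_stage_in_tail(2)  # about 2% of requests
five_stages = p_any_stage_in_tail(5)  # about 4.9% of requests
```

Adding a remote model server roughly doubles the share of requests that hit some stage's tail, before counting the extra latency of the hop itself.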
Feature freshness versus completeness
Freshness is a feature-store property with a direct accuracy consequence. A velocity feature that updates every event catches a burst attack; one that updates every minute does not. But perfect freshness is expensive: maintaining exact sliding-window counts for millions of keys in real time is heavy. Many systems trade exactness for speed using approximate structures — count-min sketches for frequency, HyperLogLog for distinct counts — accepting bounded error in exchange for constant memory and constant-time reads. The architectural decision is per-feature: a “distinct devices today” count can tolerate approximation; a “this exact card declined twice in the last ten seconds” rule probably cannot. Document the freshness guarantee of every feature, because a stale feature is not a missing feature — it is a confidently wrong one.
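To make the approximation trade-off concrete, here is a minimal count-min sketch using only the standard library. It is a teaching sketch with hypothetical default sizes, not a tuned implementation. Its estimates can overcount because of hash collisions but never undercount, which is the kind of bounded error the paragraph refers to.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory frequency estimator: estimates are >= the true count."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One independent-looking hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # The minimum across rows is the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._cells(key))


sketch = CountMinSketch()
for _ in range(5):
    sketch.add("card:tok_abc")
sketch.add("card:tok_xyz", 2)
estimate = sketch.estimate("card:tok_abc")
```

Memory is `width * depth` counters no matter how many keys are seen. That is why the structure suits high-cardinality velocity features and not rules that need an exact count.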

Figure 3: One transformation definition feeds two paths — a streaming compute path that materializes the online store for serving, and a batch backfill path that produces point-in-time-correct training data. Sharing the definition is what prevents training/serving skew.
Figure 3 is the parity story in one picture. The single most important property is that the box labeled “feature transform logic” is one definition, not two implementations that happen to agree today and drift apart next quarter. When teams maintain separate online and offline feature code, skew is not a risk — it is a certainty on a long enough timeline.
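A minimal sketch of the single-definition idea: one function holds the window logic, and the online and offline paths are thin wrappers around it. All names are hypothetical, and a production system would express the shared definition in whatever form its feature store supports.

```python
HOUR_MS = 3_600_000


def txn_count_in_window(timestamps_ms, as_of_ms, window_ms=HOUR_MS):
    """The one feature definition: events in (as_of - window, as_of]."""
    return sum(1 for t in timestamps_ms if as_of_ms - window_ms < t <= as_of_ms)


def online_features(recent_timestamps_ms, now_ms):
    """Serving path: evaluated at request time."""
    return {"txn_count_1h": txn_count_in_window(recent_timestamps_ms, now_ms)}


def offline_features(full_history_ms, label_times_ms):
    """Backfill path: the same definition evaluated as of each historical event."""
    return [
        {"txn_count_1h": txn_count_in_window(full_history_ms, t)}
        for t in label_times_ms
    ]


history = [0, 1_000_000, 3_000_000, 4_000_000]
served = online_features(history, now_ms=4_000_000)
backfilled = offline_features(history, [4_000_000])[0]
parity_holds = served == backfilled
```

A parity check like the last line, run on replayed traffic, is a cheap way to catch the two paths drifting apart.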
Recall versus false-positive rate
This is the trade-off that defines the system’s character. False positive rate tuning is not a knob you set once; it is a continuous negotiation between two asymmetric costs. A missed fraud (false negative) costs the chargeback plus operational handling. A blocked legitimate transaction (false positive) costs the immediate lost sale, the support contact, and, most expensively, the long-term erosion of customer trust. Industry analyses of payment friction often report that false declines cost issuers and merchants more in aggregate than the fraud they prevent, because legitimate volume dwarfs fraud volume.
Because fraud is rare, raw accuracy is a useless metric — a model that approves everything is 99.x% accurate and catches zero fraud. The honest metrics are precision and recall, usually summarized as a precision-recall curve and operating points like “recall at a fixed 0.1% false-positive rate.” You pick an operating point on that curve based on cost, not on a default 0.5 threshold. Two transactions with the same score can warrant different actions: a $5 charge at a low-risk merchant might be approved while a $5,000 charge with the same score is challenged. That is why the decision tier, not the model, owns the threshold — it can layer transaction value, merchant policy, and step-up availability on top of the raw score.
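Choosing an operating point at a fixed false-positive rate can be sketched in a few lines. The ten scored transactions below are invented, and the 15% cap is chosen only so the toy data produces a visible result; a production cap would be far tighter, as the text notes.

```python
def confusion(scores, labels, threshold):
    """Counts (tp, fp, fn, tn) when every score >= threshold is flagged."""
    tp = fp = fn = tn = 0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            fp += 1
        elif is_fraud:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def operating_point(scores, labels, max_fpr):
    """Lowest threshold whose false-positive rate stays within max_fpr."""
    best = None
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion(scores, labels, threshold)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        if fpr > max_fpr:
            break  # FPR only grows as the threshold drops further
        best = {
            "threshold": threshold,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
            "fpr": fpr,
        }
    return best


# Invented toy data: 3 fraud cases among 10 transactions.
scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
chosen = operating_point(scores, labels, max_fpr=0.15)
approve_all_accuracy = labels.count(0) / len(labels)
```

On this toy set, approving everything is 70% accurate and catches nothing, while the chosen operating point catches all three fraud cases at the cost of one flagged legitimate transaction.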
It helps to think in expected-cost terms rather than in raw error rates. Each possible action carries an expected loss: the sum, over that action’s possible outcomes, of each outcome’s probability times its cost. Approving a transaction risks the expected fraud loss; declining it risks the expected value of a lost legitimate sale plus trust erosion; challenging it carries a smaller friction cost plus the risk that a genuine customer abandons the challenge. The optimal action for a given score is the one with the lowest expected cost, and because the costs differ by transaction value and customer segment, the optimal threshold is not a single global number but a surface. Encoding that surface in the decision tier, rather than approximating it with one fixed cutoff, is what separates a system that merely scores from one that actually optimizes the business outcome.
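The expected-cost comparison can be written down directly. Every cost parameter below is a hypothetical placeholder. The sketch also adds one assumption that the text does not state: a fraudster passes the challenge with some probability `p_bypass`, so challenging is not a free substitute for declining.

```python
def expected_costs(p_fraud, amount, *, margin_rate=0.10, trust_cost=5.0,
                   friction_cost=0.20, p_abandon=0.15, p_bypass=0.10):
    """Expected loss of each action for one transaction (hypothetical cost model)."""
    p_legit = 1.0 - p_fraud
    lost_sale = margin_rate * amount  # value lost when a good customer is turned away
    return {
        # Approve: lose the full amount if it turns out to be fraud.
        "approve": p_fraud * amount,
        # Challenge: friction, abandoned good customers, and fraud that gets through.
        "challenge": friction_cost
                     + p_legit * p_abandon * lost_sale
                     + p_fraud * p_bypass * amount,
        # Decline: lose the sale and some trust whenever the customer was legitimate.
        "decline": p_legit * (lost_sale + trust_cost),
    }


def best_action(p_fraud, amount, **cost_overrides):
    table = expected_costs(p_fraud, amount, **cost_overrides)
    return min(table, key=table.get)


small_low_risk = best_action(p_fraud=0.02, amount=5.0)
large_same_score = best_action(p_fraud=0.02, amount=5_000.0)
large_high_risk = best_action(p_fraud=0.95, amount=5_000.0)
```

With these placeholder costs, the same 2% score yields approve for a $5 charge and challenge for a $5,000 charge. That is the value-dependent surface the paragraph describes.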
Calibration and feedback loops
A score is only useful for thresholding if it is calibrated: among transactions scored near 0.9, roughly 90% should turn out to be fraud. GBDT outputs are often miscalibrated and benefit from a post-hoc step (isotonic regression or Platt scaling) so thresholds map to real probabilities and you can reason about expected cost. Calibration drifts as the population shifts, so it is re-checked, not set once.
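Re-checking calibration does not require refitting anything. A reliability table compares the mean predicted score in each bin with the observed fraud rate in that bin. The sketch below uses equal-width bins and summarizes the gap as expected calibration error; that metric is introduced here as an illustration and is not prescribed by the text.

```python
def reliability_table(scores, labels, n_bins=10):
    """Rows of (bin_index, count, mean_score, observed_rate) for non-empty bins."""
    counts = [0] * n_bins
    score_sums = [0.0] * n_bins
    label_sums = [0] * n_bins
    for score, label in zip(scores, labels):
        i = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[i] += 1
        score_sums[i] += score
        label_sums[i] += label
    return [
        (i, counts[i], score_sums[i] / counts[i], label_sums[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


def expected_calibration_error(scores, labels, n_bins=10):
    """Count-weighted average gap between mean score and observed rate."""
    total = len(scores)
    return sum(
        (count / total) * abs(mean_score - rate)
        for _, count, mean_score, rate in reliability_table(scores, labels, n_bins)
    )


# Invented examples: ten transactions all scored 0.9.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A gap that widens over successive windows of labeled traffic is a signal to refit the calibration step, even if ranking quality has not changed.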
The feedback loop has a cruel property: label delay. The ground truth for “was this fraud” arrives late — chargebacks can take weeks. So the model is always training on a lagged, partially-labeled view of the world while attackers adapt in real time. Architecturally, this means you cannot wait for clean labels to react to a new attack; you combine fast, noisy signals (step-up abandonment, manual analyst flags, velocity spikes) for early warning with slow, clean labels (confirmed chargebacks) for retraining. The event-driven plumbing that carries these labels back is the same backbone described in our event-driven backtesting engine architecture, where replaying historical events without leaking the future is the central discipline — exactly the constraint point-in-time correctness imposes here.
Trade-offs, Gotchas, and What Goes Wrong

Figure 4: The decision tier as a band-based flow — low scores approve silently, medium scores trigger step-up, high scores decline, and every branch emits an outcome label into the retrain loop. The step-up branch is where false positives become recoverable instead of lost.
The most pervasive failure is training/serving skew: the model serves on features computed slightly differently than it trained on. It hides because offline metrics look fine; you only see it as unexplained live underperformance. Sharing one feature definition across paths (Figure 3) is the structural fix; monitoring the live distribution of each feature against its training distribution is the detection mechanism.
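One common way to compare a feature's live distribution against its training distribution is the population stability index (PSI), computed over shared histogram bins. PSI is named here as an illustrative choice; the text does not prescribe a specific drift metric, and the bin counts below are invented.

```python
import math


def population_stability_index(training_counts, live_counts, eps=1e-6):
    """PSI over matching histogram bins; 0.0 means identical distributions."""
    train_total = sum(training_counts)
    live_total = sum(live_counts)
    psi = 0.0
    for train_n, live_n in zip(training_counts, live_counts):
        # Floor the proportions so empty bins do not produce log(0).
        p_train = max(train_n / train_total, eps)
        p_live = max(live_n / live_total, eps)
        psi += (p_live - p_train) * math.log(p_live / p_train)
    return psi


# Invented histograms of "txn_count_1h" over the same four bins.
training_hist = [50, 30, 15, 5]
live_hist_stable = [500, 300, 150, 50]   # same shape, larger sample
live_hist_shifted = [20, 30, 30, 20]     # mass moved toward high counts

stable_psi = population_stability_index(training_hist, live_hist_stable)
shifted_psi = population_stability_index(training_hist, live_hist_shifted)
```

The alert threshold is a per-feature judgment call. Values around 0.25 are a frequently quoted rule of thumb for a significant shift, but treat that as a starting assumption and not as a standard.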
Concept drift and adversarial adaptation are the next layer. Unlike a recommender, a fraud model faces an adversary who actively probes for the decision boundary and moves once they find it. A model frozen for six months is a model whose blind spots have been mapped and exploited. This forces frequent retraining, champion-challenger deployment, and drift monitoring on both inputs and outcomes — and it means a one-time accuracy number is meaningless without a decay curve next to it.
Cold-start on new accounts is structural: a brand-new account has no velocity history, so the behavioral features that carry most of the signal are empty. The mitigations are graph features (the account may be relationally close to known fraud even with zero transaction history), consortium/network signals, and a deliberately more conservative policy for thin-history accounts — accepting more friction where you have less evidence.
Alert fatigue is the human failure mode. Set thresholds too aggressively and the case queue floods, analysts rubber-stamp to clear backlog, and real fraud slips through in the noise. The fix is partly statistical (precision at the queue’s capacity, not just recall) and partly operational (ranking the queue by expected loss so the scarce human attention lands on the highest-value cases).
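Ranking the queue by expected loss and measuring precision at the queue's capacity can be sketched as follows. The cases are invented, and `is_fraud` stands for the label that arrives later, so precision is computed retrospectively.

```python
def rank_by_expected_loss(cases, capacity):
    """Return the top `capacity` cases ordered by p_fraud * amount, highest first."""
    ranked = sorted(cases, key=lambda c: c["p_fraud"] * c["amount"], reverse=True)
    return ranked[:capacity]


def precision_at_capacity(reviewed_cases):
    """Share of reviewed cases that were actually fraud (known only after labels arrive)."""
    if not reviewed_cases:
        return 0.0
    return sum(1 for c in reviewed_cases if c["is_fraud"]) / len(reviewed_cases)


# Invented queue: analysts can review two cases in this window.
cases = [
    {"id": "A", "p_fraud": 0.90, "amount": 20.0, "is_fraud": True},
    {"id": "B", "p_fraud": 0.30, "amount": 2_000.0, "is_fraud": True},
    {"id": "C", "p_fraud": 0.60, "amount": 500.0, "is_fraud": False},
    {"id": "D", "p_fraud": 0.95, "amount": 5.0, "is_fraud": True},
]
reviewed = rank_by_expected_loss(cases, capacity=2)
reviewed_ids = [c["id"] for c in reviewed]
queue_precision = precision_at_capacity(reviewed)
```

Ranking by score alone would put the $5 and $20 cases first. Ranking by expected loss sends the two reviews to the cases where a miss is most expensive.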
Finally, the online-feature consistency problem: under failover, replay, or a partial outage, the online store can serve a stale or partial aggregate, and the model scores confidently on bad input. Designing the timeout and degradation path — what the decision tier does when a feature is missing or the model times out — is not an edge case to handle later; it is core to the architecture.
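An explicit degradation path might look like the sketch below. The policy it encodes (approve small amounts, step up everything else when the score is unavailable) is a hypothetical example, not a recommendation. The required feature names, the value cutoff, and the use of the built-in `TimeoutError` to signal a blown model deadline are all assumptions.

```python
REQUIRED_FEATURES = ("amount_cents", "card_txn_count_60s")


def decide_or_degrade(features, score_fn, low_value_cents=2_000):
    """Score normally when inputs are complete; otherwise apply a stated fallback policy."""
    missing = [name for name in REQUIRED_FEATURES if features.get(name) is None]
    if not missing:
        try:
            return {"path": "scored", "score": score_fn(features)}
        except TimeoutError:
            reason = "model_timeout"
    else:
        reason = "missing:" + ",".join(missing)

    # Hypothetical fallback: allow only small amounts, challenge the rest.
    amount = features.get("amount_cents")
    if amount is not None and amount <= low_value_cents:
        action = "approve"
    else:
        action = "step_up"
    return {"path": "degraded", "reason": reason, "action": action}


def slow_model(_features):
    # Stand-in for an inference call that exceeded its deadline.
    raise TimeoutError("inference deadline exceeded")


normal = decide_or_degrade({"amount_cents": 900, "card_txn_count_60s": 1}, lambda f: 0.12)
timed_out = decide_or_degrade({"amount_cents": 50_000, "card_txn_count_60s": 1}, slow_model)
stale_store = decide_or_degrade({"amount_cents": 900, "card_txn_count_60s": None}, lambda f: 0.12)
```

Recording the `reason` on every degraded decision makes the fallback rate itself a metric to monitor, so a silent feature-store problem shows up as a visible change in decision paths.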
Practical Recommendations
Treat the budget as a contract. Decide the end-to-end latency target first, apportion it per stage, and measure the p99 of every stage continuously — not the mean. The tail is where the budget is missed, and the tail is what your customers feel.
Build features once. One transformation definition, materialized to both an online store for serving and an offline store for point-in-time-correct training. This single decision eliminates the most common and most expensive class of bug in the system.
Put thresholds in the decision tier, not the model. Keep the model a calibrated probability source and let the decision layer combine that probability with transaction value, policy rules, and step-up availability. This keeps the recall/false-positive trade-off tunable without retraining.
A short engineering checklist:
- [ ] End-to-end latency budget defined, apportioned per stage, and p99-monitored.
- [ ] Single feature definition feeding both online serving and offline training.
- [ ] Point-in-time-correct training joins; no future leakage in any feature.
- [ ] Graph signals materialized asynchronously, read synchronously as tabular features.
- [ ] Model outputs calibrated; thresholds set from a precision-recall operating point, not a default 0.5.
- [ ] Explicit timeout and degradation policy for feature-read and model-inference failures.
- [ ] Drift monitoring on inputs and outcomes; champion-challenger retraining cadence.
- [ ] Case queue ranked by expected loss to contain alert fatigue.
- [ ] Conservative policy path for cold-start, thin-history accounts.
Frequently Asked Questions
What latency does real-time fraud detection need?
It is bounded by the payment rail, not by preference. Card authorization round-trips and instant-payment rails typically expect a synchronous answer in well under a second, and the fraud decision is only one tenant of that envelope — often a budget in the tens of milliseconds. The governing number is the p99, not the average: a system that is fast on median but slow at the tail mishandles a steady fraction of transactions. Designing the timeout and degradation path is as important as the happy-path latency, because the rail will not wait.
How do feature stores help fraud detection?
A feature store solves two correctness problems that otherwise quietly wreck accuracy. It enforces online/offline parity, so the feature the model serves on is computed by the same logic it trained on, and it enforces point-in-time correctness, so training rows never see values that were not knowable at the historical moment. It also provides the low-latency online serving layer that lets the scoring tier read fresh velocity and aggregation features in single-digit milliseconds. Without it, teams maintain two feature implementations that inevitably drift, producing training/serving skew.
How do you reduce false positives in fraud detection?
Move the threshold off the default and onto a deliberate operating point on the precision-recall curve, chosen by cost rather than by a 0.5 cutoff. Calibrate the model so scores are real probabilities you can reason about. Push value-aware and policy-aware logic into the decision tier so identical scores can yield different actions by transaction value or merchant. Most importantly, use step-up authentication as a middle path — challenging a borderline transaction recovers a would-be false decline instead of losing the customer outright.
Where do graph neural networks fit?
Graph methods capture relational fraud — rings sharing devices, addresses, or funding instruments — that no per-transaction feature can see. The practical pattern keeps them off the synchronous critical path: compute graph embeddings or risk-propagation scores asynchronously, materialize them into the online feature store, and read them at scoring time as ordinary tabular features. This gives the relational signal without forcing an expensive graph traversal into a tens-of-milliseconds budget. They are especially valuable for cold-start accounts, where transaction history is empty but relational proximity to known fraud may not be.
Batch versus real-time fraud scoring — which wins?
They answer different questions. Batch scoring is fine when the rail is reversible and you have hours to claw back a bad transaction; it is also where heavy, slow analytics and model training live. Real-time scoring is mandatory when the rail settles irrevocably in seconds and the decision must precede settlement. Most production systems run both: a real-time hot path for the synchronous approve/challenge/decline decision, and a batch cold path for retraining, graph computation, and deep investigation. The two share feature definitions to stay consistent.
