This is one of the classic TikTok Machine Learning Engineer (MLE) system design interview questions:
“We process millions of financial transactions daily and need a system that can detect fraud in real time, scale gracefully, and learn from new fraud patterns. Design such a system end-to-end.”
Functional Requirements
1. Real-Time Transaction Monitoring
The system must ingest and evaluate transactions within milliseconds. Each incoming transaction is processed through a streaming pipeline: it’s normalized, enriched with metadata, analyzed by a detection engine, and finally routed for alerts if suspicious.
The challenge here is latency: we must guarantee that the end-to-end path from ingestion to fraud decision takes under 300ms.
2. Fraud Detection Algorithms
A hybrid approach is essential:
- Rule-based engine for known fraud behaviors (e.g., “multiple high-value transfers in 30 minutes”).
- Machine Learning models trained on historical data to detect unknown fraud patterns.
For the ML pipeline, features are derived from a user’s historical activity (rolling averages, spend deltas, geo/device frequency) and transaction-specific features (amount, merchant type, location).
Scoring must be stateless, fast, and explainable — especially under regulatory scrutiny.
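To make the feature derivation concrete, here is a minimal sketch; the `Txn` fields and the `history` keys (e.g., `avg_spend_30d`, the seen-country set) are illustrative assumptions rather than a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    amount: float
    merchant_type: str
    country: str
    device_id: str

def build_features(txn: Txn, history: dict) -> dict:
    """Combine transaction-level features with history-derived aggregates."""
    avg_spend = history.get("avg_spend_30d", 0.0)
    return {
        "amount": txn.amount,
        "spend_delta": txn.amount - avg_spend,
        "spend_ratio": txn.amount / avg_spend if avg_spend else 0.0,
        "is_new_country": int(txn.country not in history.get("countries", set())),
        "is_new_device": int(txn.device_id not in history.get("devices", set())),
    }
```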
3. Alert Generation
When a transaction’s risk score surpasses a predefined threshold, an alert is generated. This alert includes metadata like:
- Triggered rules
- Model score breakdown
- Anomaly metrics
Alerts are ranked by severity (e.g., risk score × transaction amount) and routed to fraud analysts or automated blocks.
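A minimal sketch of severity-based ranking using an in-process priority queue; the alert field names are assumptions:

```python
import heapq

def severity(risk_score: float, amount: float) -> float:
    # Severity heuristic from the text: risk score scaled by transaction amount.
    return risk_score * amount

alert_queue: list = []  # max-priority queue emulated with negative severity

def enqueue_alert(alert: dict) -> None:
    heapq.heappush(alert_queue, (-severity(alert["risk_score"], alert["amount"]), alert["id"], alert))

def next_alert() -> dict:
    # Highest-severity alert is reviewed (or auto-blocked) first.
    return heapq.heappop(alert_queue)[-1]
```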
4. User and Transaction Profiling
User profiles are maintained with:
- Historical aggregates (e.g., average transaction size, geo dispersion)
- Temporal patterns (e.g., time-of-day frequency)
- Behavioral fingerprints (device ID churn, login location variance)
These profiles are updated asynchronously via background workers and stored in a fast-access cache with persistence to a relational database.
5. Reporting and Analytics
Fraud detection is not only about real-time decisions — it’s also about trends.
The system needs to support:
- Dashboards showing fraud trends over time
- Ad-hoc queries for transaction patterns
- KPIs like false-positive rate, model latency, and fraud catch rate
To support this, we pre-aggregate data daily by user, country, and category. Heavy queries run on a dedicated analytical store.
6. Integration with External Sources
We enrich transaction data with external feeds like:
- Sanction lists
- Compromised card databases
- Geo-IP mappings
These are pre-joined as part of the enrichment phase and cached to minimize lookup latency.
7. Scalability
The system must scale linearly with traffic growth. Stateless components (e.g., model inference, enrichment) are scaled horizontally, while stateful components (e.g., user profiles) use consistent hashing and sharding.
8. Audit and Logging
All decisions are logged with:
- Timestamped records
- Model version
- Feature vector used for scoring
- Any triggered rules
Logs are tamper-proof, signed, and archived for regulatory compliance.
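One way to make each log entry tamper-evident is to sign it, for example with an HMAC over the canonicalized record. The field names and key handling below are illustrative; a production system would fetch the signing key from a KMS:

```python
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-managed-secret"  # assumption: sourced from a KMS in practice

def audit_record(txn_id: str, model_version: str, features: dict, rules: list) -> dict:
    record = {
        "ts": time.time(),
        "txn_id": txn_id,
        "model_version": model_version,
        "features": features,
        "triggered_rules": rules,
    }
    # Sign the canonical JSON so any later modification invalidates the signature.
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return record
```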
Non-Functional Requirements
- Performance: Sub-300ms end-to-end latency target for transaction decisions.
- Reliability: Use redundant microservices, graceful degradation, and circuit breakers for service isolation.
- Scalability: Stateless services behind load balancers; sharded stores for stateful components like profiles.
- Security: Encrypt all PII at rest and in transit; use tokenization for sensitive fields.
- Accuracy: ML pipelines monitored via AUC-ROC, precision@K; feedback loop from analyst labels improves models.
- Extensibility: Add new features and models with hot reload support; feature definitions versioned and decoupled from code.
- Compliance: Store historical data snapshots for at least one year; enable selective access control for audit logs.
Data Flow: What Happens to a Transaction?
1. Ingestion
A transaction is received via a REST or gRPC call. It’s validated (schema, data types, sanity checks) and placed onto a queue for asynchronous processing.
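A minimal validation-and-enqueue sketch; an in-process queue stands in for the real message broker, and the required fields are assumptions:

```python
import queue

REQUIRED_FIELDS = {"txn_id": str, "user_id": str, "amount": (int, float), "currency": str}
ingest_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def ingest(payload: dict) -> bool:
    """Schema, type, and sanity checks, then hand off for asynchronous processing."""
    for field, types in REQUIRED_FIELDS.items():
        if field not in payload or not isinstance(payload[field], types):
            return False  # reject: schema or type violation
    if payload["amount"] <= 0:
        return False      # sanity check
    ingest_queue.put(payload)  # stand-in for Kafka/PubSub in this sketch
    return True
```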
2. Enrichment
The transaction is enriched with:
- User’s historical profile
- Geo-IP and device metadata
- External blacklist matches
Enrichment is cached aggressively for low-latency access and persisted for audits.
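A cache-aside sketch for enrichment lookups; the TTL and the `loader` callback (e.g., a geo-IP or sanction-list client) are illustrative:

```python
import time

class EnrichmentCache:
    """Cache-aside lookups for slow external feeds (sanction lists, geo-IP, etc.)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict = {}            # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]               # cache hit
        value = loader(key)               # cache miss: call the slow external source
        self._store[key] = (time.time() + self.ttl, value)
        return value

# usage sketch: geo_cache.get(ip_address, geoip_lookup)
```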
3. Detection
The enriched transaction is passed to:
- The rule engine (for known patterns)
- The ML model scorer (for predictive patterns)
Both return scores, reasons, and confidence metrics.
4. Decision
If the risk exceeds a threshold, the system generates an alert:
- Stored in the alert DB
- Routed to the alert queue (priority queue)
- Sent to analyst dashboard or automated handler
5. Profile Update
The user’s profile is asynchronously updated. We use write-ahead logs to ensure consistency in case of crashes.
Fraud Detection Engine: Mechanism Deep Dive
Rule-Based Detection
Rules are stored as ASTs or DSL expressions:
IF amount > 3 * avg_daily_spend AND location != last_known_country THEN FLAG
Evaluated using an in-memory interpreter. Rules can be hot-deployed and tested in dry-run mode before production rollout.
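As a sketch, rules can be represented as named predicates with a dry-run flag (a real engine would compile the DSL into ASTs rather than hard-code Python lambdas); the thresholds and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]   # evaluated against the enriched transaction
    dry_run: bool = True                # dry-run rules log matches but never flag

# Illustrative encoding of the rule above.
high_spend_abroad = Rule(
    name="high_spend_abroad",
    predicate=lambda t: t["amount"] > 3 * t["avg_daily_spend"]
    and t["location"] != t["last_known_country"],
)

def evaluate(rules: list, txn: dict) -> list:
    """Return names of production rules that fire; dry-run matches are only logged."""
    fired = []
    for rule in rules:
        if rule.predicate(txn):
            if rule.dry_run:
                print(f"[dry-run] {rule.name} matched txn {txn.get('txn_id')}")
            else:
                fired.append(rule.name)
    return fired
```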
ML-Based Detection
Models are trained on millions of historical transactions and retrained weekly.
Feature engineering includes:
- Time deltas: `time_since_last_txn`, `spend_in_last_hour`
- Aggregations: `avg_spend_30d`, `txn_count_by_device`
- Behavioral shifts: `change_in_device_id_freq`
For explainability, we use:
- SHAP values for feature impact
- Model score logs for audit and debugging
Inference must be sub-20ms, so models are optimized and hosted in-memory with batching enabled.
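A hedged sketch of the SHAP logging step; `model` and `X_row` (a one-row pandas DataFrame) are assumed to come from the training and scoring pipelines, and the output shape of `shap_values` varies by model type (a single 2-D array is assumed here, as with XGBoost/LightGBM-style models):

```python
import shap

# `model` is the trained fraud classifier, `X_row` the feature frame for one transaction.
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(X_row)   # assumed shape: (1, n_features)

# Persist per-feature impact next to the score for audits and analyst review.
audit_payload = {
    "model_version": "v42",                    # illustrative version tag
    "shap": dict(zip(X_row.columns, contributions[0])),
}
```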
Internal APIs and Component Interactions
Transaction Processor
- Receives transaction events
- Calls enrichment, fraud detection, and routing
User Profile Service
- Provides real-time profile lookups and async updates
Alert Engine
- Accepts flagged transactions, ranks them, and queues for review
Analytics Module
- Generates daily aggregates and dashboards
Model Service
- Exposes inference API for fraud models
All components interact via internal gRPC or async messaging (Pub/Sub-style), with strict versioning and timeouts for resilience.
Storage Breakdown and Estimations
- Transactions: 1 million/day × 1 KB ≈ 1 GB/day ≈ 365 GB/year
- Alerts: ~10K/day × ~500 bytes ≈ 5 MB/day ≈ 1.8 GB/year
- Profiles: 10M users × 500 bytes ≈ 5 GB
- Total storage footprint: ~370 GB for the first year (compressed, with cold-storage archiving for long-term data)
Data is partitioned by time (for queries) and user ID (for profile access).
Scaling and Trade-offs
1. Stateless vs Stateful Services
Stateless Services (e.g., ML scoring, rule evaluation):
- Easy to scale horizontally by adding instances behind a load balancer.
- Immutable inputs (transaction + features) → deterministic outputs.
- Can autoscale based on CPU usage or queue depth.
Stateful Services (e.g., user profile service, alert deduplication):
- Maintain rolling statistics, time series, and dynamic user behavior data.
- Scale via sharding: hash-based partitioning on `userId` ensures each instance owns a unique key range (see the sketch below).
- For consistency, implement quorum writes or use a single writer per shard to prevent race conditions during profile updates.
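A minimal sketch of the hash-based routing, assuming a fixed shard count; a production system would typically use consistent hashing so shards can be added without mass re-keying:

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash-based partitioning: the same user always lands on the same shard."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# e.g., all profile writes for user "u-123" are routed to shard_for("u-123"),
# which gives us a single writer per key range.
```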
2. Feature Engineering: Precompute vs Real-Time
Precomputed Features:
- Batched every N minutes.
- Great for high-latency aggregations (e.g., spend in last 30 days).
- Stored in a key-value store (e.g., `userId → features`), easily accessible at scoring time.
- Reduces runtime computation and standardizes input to the ML model.
Real-Time Features:
- Calculated on the fly from recent activity streams (e.g., #txns in the last 30 seconds, velocity score, geo anomaly delta).
- More accurate for detecting real-time fraud patterns (e.g., card testing or bot attacks).
- Require in-memory buffers and approximate algorithms (e.g., sliding windows, Bloom filters, count-min sketch).
Trade-off:
Precomputed features trade freshness for speed; real-time features trade speed for precision.
Strategy: Use hybrid composition: build the feature set as F = F_precomputed ∪ F_realtime (see the sketch below).
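A sketch of the hybrid composition, using a sliding-window counter as the real-time signal; the window size and feature names are illustrative:

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowCounter:
    """Real-time feature: number of transactions seen in the last `window` seconds."""

    def __init__(self, window: float = 30.0):
        self.window = window
        self.events: deque = deque()

    def add(self, ts: Optional[float] = None) -> None:
        self.events.append(ts if ts is not None else time.time())

    def count(self, now: Optional[float] = None) -> int:
        now = now if now is not None else time.time()
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()              # expire events outside the window
        return len(self.events)

def compose_features(precomputed: dict, realtime_counter: SlidingWindowCounter) -> dict:
    # F = F_precomputed ∪ F_realtime: merge cached aggregates with fresh signals.
    return {**precomputed, "txns_last_30s": realtime_counter.count()}
```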
3. Batch vs Online Model Training
Batch Learning:
- Trained nightly/weekly on full historical dataset.
- Enables robust, stable models with long-term patterns.
- Expensive to retrain; deployed via CI/CD.
Online Learning:
- Lightweight models (e.g., SGD, passive-aggressive) that adapt to drift with streaming data.
- Can learn from user feedback in near real-time.
- Prone to overfitting if not regularized or if noise is introduced.
Trade-off:
Batch models offer stability; online models offer reactivity.
Strategy:
- Use online learners to maintain short-term memory.
- Periodically consolidate into a new batch model with a blend of long-term and short-term data.
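A minimal online-learner sketch using scikit-learn's `SGDClassifier` with `partial_fit`; the loss and regularization settings are illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learner kept alongside the batch model; hyperparameters are illustrative.
online_model = SGDClassifier(loss="log_loss", alpha=1e-4)
CLASSES = np.array([0, 1])  # not-fraud / fraud

def on_feedback(feature_vector: np.ndarray, label: int) -> None:
    """Incrementally adapt to analyst feedback, one labeled transaction at a time."""
    online_model.partial_fit(feature_vector.reshape(1, -1), [label], classes=CLASSES)
```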
4. Consistency vs Availability
- Transaction Ingestion: Must guarantee at-least-once delivery (e.g., via message queues with replay support).
- Profile Updates: Can afford eventual consistency — a user’s profile may lag a few seconds behind but will converge.
- Alert Generation: Must ensure exactly-once semantics to avoid duplicate alerts for the same event (deduplication via idempotency keys or transactional write guarantees).
- Trade-off: Not all components need strong consistency — use consistency where correctness matters (e.g., fraud decisions, logs) and availability where tolerance exists (e.g., behavioral profile sync).
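A minimal sketch of idempotency-key deduplication, which gives exactly-once effects on top of at-least-once delivery; the in-memory set stands in for an atomic store (e.g., a Redis set-if-absent):

```python
processed_keys: set = set()   # in production: Redis/DB with atomic set-if-absent

def publish_alert_once(event_id: str, alert: dict, publish) -> bool:
    """Suppress duplicate alerts when the same event is delivered more than once."""
    if event_id in processed_keys:
        return False           # duplicate delivery: drop silently
    publish(alert)             # write to alert DB / priority queue
    processed_keys.add(event_id)
    return True
```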
5. Horizontal Scaling Patterns
- Sharding: Partition data by `userId`, `region`, or `deviceId`. This isolates load and increases cache hit rates.
- Parallelization: Fraud detection can be run in parallel over transaction streams, with each stream worker being independently scalable.
- Backpressure: Use message queues to absorb bursts and allow consumers to scale gradually.
6. Cache vs Persistent Storage
- Hot data (e.g., user profile, recent txns) in an LRU cache with a TTL.
- Cold data (e.g., historical txns) lives in a partitioned database (or data lake).
- Write-behind cache design ensures low-latency reads with durability.
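A write-behind cache sketch; `db_writer` is an assumed persistence callback, and a real deployment would pair the pending buffer with a write-ahead log so queued writes survive a crash:

```python
import queue
import threading

class WriteBehindCache:
    """Reads hit memory; writes are acknowledged immediately and flushed asynchronously."""

    def __init__(self, db_writer):
        self.cache: dict = {}
        self.pending: "queue.Queue[tuple]" = queue.Queue()
        self.db_writer = db_writer                    # callable(key, value) that persists to the DB
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value) -> None:
        self.cache[key] = value
        self.pending.put((key, value))                # durability deferred to the flusher

    def get(self, key):
        return self.cache.get(key)

    def _flush_loop(self) -> None:
        while True:
            key, value = self.pending.get()
            self.db_writer(key, value)
```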
Handling Failures
1. Server Failures (Microservices)
Problem: A node running fraud scoring or profile update crashes.
Impact: May miss scoring a transaction, or user profile may be outdated.
Recovery Mechanism:
- Stateless services are redeployed instantly with auto-healing orchestration (e.g., K8s health probes + restart).
- Stateful services (e.g., profile cache) reload from DB snapshot.
- Transaction events are retained in a persistent queue (e.g., Kafka), so failed events are reprocessed by surviving consumers.
Design Tip: Use at-least-once event processing + idempotent operations to ensure correctness on retries.
2. Model Drift or Degradation
Problem: Fraud model accuracy drops, leading to spike in false positives or negatives.
Detection:
- Live monitoring of model metrics (e.g., fraud catch rate, analyst override rate).
- Statistical tests (e.g., KS statistic, Population Stability Index (PSI)) to detect feature distribution shifts (see the PSI sketch below).
Mitigation:
- Rollback to last good model version (models are versioned).
- Trigger retraining pipeline.
- Fall back to rule-based engine for critical decisions while model retrains.
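For the PSI check mentioned above, a minimal sketch; the bin count and the 0.2 threshold are conventional choices, not requirements:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live feature distributions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_cnt, _ = np.histogram(expected, bins=edges)
    a_cnt, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_cnt / e_cnt.sum(), 1e-6, None)   # avoid division by zero / log(0)
    a_pct = np.clip(a_cnt / a_cnt.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb: PSI > 0.2 suggests significant drift worth investigating.
```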
3. Alert Flooding or Amplification
Problem: Misconfigured rule or model triggers excessive alerts, overwhelming analysts.
Detection:
- Spike detection on alert volume per time window.
- Entropy check: are many alerts nearly identical?
Mitigation:
- Throttle or group similar alerts (alert coalescing).
- Use alert suppression windows to avoid re-alerting for similar conditions within N minutes (see the sketch below).
- Support circuit breakers in the rule engine to disable problematic rules dynamically.
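A minimal suppression-window sketch; the fingerprint scheme (e.g., rule + user + merchant) and the window length are assumptions:

```python
import time

_last_alerted: dict = {}   # alert fingerprint -> last emission time

def should_emit(fingerprint: str, suppression_window_s: float = 600.0) -> bool:
    """Drop near-identical alerts emitted within the suppression window."""
    now = time.time()
    last = _last_alerted.get(fingerprint)
    if last is not None and now - last < suppression_window_s:
        return False          # suppressed: same condition alerted recently
    _last_alerted[fingerprint] = now
    return True
```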
4. Data Corruption / Bad Input
Problem: Malformed transaction payloads or corrupted profiles cause model crashes or misclassification.
Mitigation:
- Use strict input validation on ingestion.
- Fail-safe scoring: if enrichment fails or data is missing, fall back to minimal scoring with default weights.
- Log and quarantine malformed events for reprocessing.
5. Profile Cache Inconsistency
Problem: User profile is stale or partially updated, leading to incorrect fraud scores.
Mitigation:
- Use versioned profiles with last-updated timestamp.
- On a cache miss, fall back to a DB load and warm the cache.
- For multi-shard writes, use two-phase commit or version vector-based conflict resolution.
6. Queue Overload / Backpressure
Problem: Ingestion exceeds processing capacity, leading to delays in scoring.
Mitigation:
- Auto-scale consumers based on queue lag.
- Implement backpressure strategy: drop low-risk txns during overload or delay non-urgent events.
- Prioritize queue segments: high-risk txn types can be given dedicated high-throughput paths.
7. Database Downtime
Problem: Primary database unavailable; writes fail or data cannot be read.
Mitigation:
- Use read replicas for non-critical reads (e.g., dashboards).
- Enable automatic failover with replicas promoted to primary.
- All writes go through retry queues or write-ahead buffers that sync once DB is back up.
8. Disaster Recovery / Region-Wide Failures
Problem: Natural disaster or cloud region outage.
Mitigation:
- Active-active deployment across multiple regions.
- DNS-based failover for routing.
- Cross-region replication of transaction logs and user profiles.
- Conduct regular DR drills to test RTO (recovery time objective) and RPO (recovery point objective).