This is one of the classic TikTok Machine Learning Engineer (MLE) system design interview questions:
“We process millions of financial transactions daily and need a system that can detect fraud in real time, scale gracefully, and learn from new fraud patterns. Design such a system end-to-end.”
Functional Requirements
1. Real-Time Transaction Monitoring
The system must ingest and evaluate transactions within milliseconds. Each incoming transaction is processed through a streaming pipeline: it’s normalized, enriched with metadata, analyzed by a detection engine, and finally routed for alerts if suspicious.
The challenge here is latency: we must guarantee that the end-to-end path from ingestion to fraud decision takes under 300ms.
2. Fraud Detection Algorithms
A hybrid approach is essential:
- Rule-based engine for known fraud behaviors (e.g., “multiple high-value transfers in 30 minutes”).
- Machine Learning models trained on historical data to detect unknown fraud patterns.
For the ML pipeline, features are derived from a user’s historical activity (rolling averages, spend deltas, geo/device frequency) and transaction-specific features (amount, merchant type, location).
Scoring must be stateless, fast, and explainable — especially under regulatory scrutiny.
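To make the feature derivation concrete, here is a minimal sketch; the `Txn` fields and the `history` keys (e.g., `avg_spend_30d`, the seen-country set) are illustrative assumptions rather than a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class Txn:
    amount: float
    merchant_type: str
    country: str
    device_id: str

def build_features(txn: Txn, history: dict) -> dict:
    """Combine transaction-level features with history-derived aggregates."""
    avg_spend = history.get("avg_spend_30d", 0.0)
    return {
        "amount": txn.amount,
        "spend_delta": txn.amount - avg_spend,
        "spend_ratio": txn.amount / avg_spend if avg_spend else 0.0,
        "is_new_country": int(txn.country not in history.get("countries", set())),
        "is_new_device": int(txn.device_id not in history.get("devices", set())),
    }
```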
3. Alert Generation
When a transaction’s risk score surpasses a predefined threshold, an alert is generated. This alert includes metadata like:
- Triggered rules
- Model score breakdown
- Anomaly metrics
Alerts are ranked by severity (e.g., risk score × transaction amount) and routed to fraud analysts or automated blocks.
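A minimal sketch of severity-based ranking using an in-process priority queue; the alert field names are assumptions:

```python
import heapq

def severity(risk_score: float, amount: float) -> float:
    # Severity heuristic from the text: risk score scaled by transaction amount.
    return risk_score * amount

alert_queue: list = []  # max-priority queue emulated with negative severity

def enqueue_alert(alert: dict) -> None:
    heapq.heappush(alert_queue, (-severity(alert["risk_score"], alert["amount"]), alert["id"], alert))

def next_alert() -> dict:
    # Highest-severity alert is reviewed (or auto-blocked) first.
    return heapq.heappop(alert_queue)[-1]
```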
4. User and Transaction Profiling
User profiles are maintained with:
- Historical aggregates (e.g., average transaction size, geo dispersion)
- Temporal patterns (e.g., time-of-day frequency)
- Behavioral fingerprints (device ID churn, login location variance)
These profiles are updated asynchronously via background workers and stored in a fast-access cache with persistence to a relational database.
5. Reporting and Analytics
Fraud detection is not only about real-time decisions — it’s also about trends.
The system needs to support:
- Dashboards showing fraud trends over time
- Ad-hoc queries for transaction patterns
- KPIs like false-positive rate, model latency, and fraud catch rate
To support this, we pre-aggregate data daily by user, country, and category. Heavy queries run on a dedicated analytical store.
6. Integration with External Sources
We enrich transaction data with external feeds like:
- Sanction lists
- Compromised card databases
- Geo-IP mappings
These are pre-joined as part of the enrichment phase and cached to minimize lookup latency.
7. Scalability
The system must scale linearly with traffic growth. Stateless components (e.g., model inference, enrichment) are scaled horizontally, while stateful components (e.g., user profiles) use consistent hashing and sharding.
8. Audit and Logging
All decisions are logged with:
- Timestamped records
- Model version
- Feature vector used for scoring
- Any triggered rules
Logs are tamper-proof, signed, and archived for regulatory compliance.
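One way to make each log entry tamper-evident is to sign it, for example with an HMAC over the canonicalized record. The field names and key handling below are illustrative; a production system would fetch the signing key from a KMS:

```python
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-managed-secret"  # assumption: sourced from a KMS in practice

def audit_record(txn_id: str, model_version: str, features: dict, rules: list) -> dict:
    record = {
        "ts": time.time(),
        "txn_id": txn_id,
        "model_version": model_version,
        "features": features,
        "triggered_rules": rules,
    }
    # Sign the canonical JSON so any later modification invalidates the signature.
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return record
```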
Non-Functional Requirements
- Performance: Sub-300ms end-to-end latency target for transaction decisions.
- Reliability: Use redundant microservices, graceful degradation, and circuit breakers for service isolation.
- Scalability: Stateless services behind load balancers; sharded stores for stateful components like profiles.
- Security: Encrypt all PII at rest and in transit; use tokenization for sensitive fields.
- Accuracy: ML pipelines monitored via AUC-ROC, precision@K; feedback loop from analyst labels improves models.
- Extensibility: Add new features and models with hot reload support; feature definitions versioned and decoupled from code.
- Compliance: Store historical data snapshots for at least one year; enable selective access control for audit logs.
Data Flow: What Happens to a Transaction?
1. Ingestion
A transaction is received via a REST or gRPC call. It’s validated (schema, data types, sanity checks) and placed onto a queue for asynchronous processing.
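A minimal validation-and-enqueue sketch; an in-process queue stands in for the real message broker, and the required fields are assumptions:

```python
import queue

REQUIRED_FIELDS = {"txn_id": str, "user_id": str, "amount": (int, float), "currency": str}
ingest_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def ingest(payload: dict) -> bool:
    """Schema, type, and sanity checks, then hand off for asynchronous processing."""
    for field, types in REQUIRED_FIELDS.items():
        if field not in payload or not isinstance(payload[field], types):
            return False  # reject: schema or type violation
    if payload["amount"] <= 0:
        return False      # sanity check
    ingest_queue.put(payload)  # stand-in for Kafka/PubSub in this sketch
    return True
```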
2. Enrichment
The transaction is enriched with:
- User’s historical profile
- Geo-IP and device metadata
- External blacklist matches
Enrichment is cached aggressively for low-latency access and persisted for audits.
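A cache-aside sketch for enrichment lookups; the TTL and the `loader` callback (e.g., a geo-IP or sanction-list client) are illustrative:

```python
import time

class EnrichmentCache:
    """Cache-aside lookups for slow external feeds (sanction lists, geo-IP, etc.)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict = {}            # key -> (expires_at, value)

    def get(self, key, loader):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]               # cache hit
        value = loader(key)               # cache miss: call the slow external source
        self._store[key] = (time.time() + self.ttl, value)
        return value

# usage sketch: geo_cache.get(ip_address, geoip_lookup)
```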
3. Detection
The enriched transaction is passed to:
- The rule engine (for known patterns)
- The ML model scorer (for predictive patterns)
Both return scores, reasons, and confidence metrics.
4. Decision
If the risk exceeds a threshold, the system generates an alert:
- Stored in the alert DB
- Routed to the alert queue (priority queue)
- Sent to analyst dashboard or automated handler
5. Profile Update
The user’s profile is asynchronously updated. We use write-ahead logs to ensure consistency in case of crashes.
Fraud Detection Engine: Mechanism Deep Dive
Rule-Based Detection
Rules are stored as ASTs or DSL expressions:
IF amount > 3 * avg_daily_spend AND location != last_known_country THEN FLAG
Evaluated using an in-memory interpreter. Rules can be hot-deployed and tested in dry-run mode before production rollout.
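As a sketch, rules can be represented as named predicates with a dry-run flag (a real engine would compile the DSL into ASTs rather than hard-code Python lambdas); the thresholds and field names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]   # evaluated against the enriched transaction
    dry_run: bool = True                # dry-run rules log matches but never flag

# Illustrative encoding of the rule above.
high_spend_abroad = Rule(
    name="high_spend_abroad",
    predicate=lambda t: t["amount"] > 3 * t["avg_daily_spend"]
    and t["location"] != t["last_known_country"],
)

def evaluate(rules: list, txn: dict) -> list:
    """Return names of production rules that fire; dry-run matches are only logged."""
    fired = []
    for rule in rules:
        if rule.predicate(txn):
            if rule.dry_run:
                print(f"[dry-run] {rule.name} matched txn {txn.get('txn_id')}")
            else:
                fired.append(rule.name)
    return fired
```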
ML-Based Detection
Models are trained on millions of historical transactions and retrained weekly.
Feature engineering includes:
- Time deltas: `time_since_last_txn`, `spend_in_last_hour`
- Aggregations: `avg_spend_30d`, `txn_count_by_device`
- Behavioral shifts: `change_in_device_id_freq`
For explainability, we use:
- SHAP values for feature impact
- Model score logs for audit and debugging
Inference must be sub-20ms, so models are optimized and hosted in-memory with batching enabled.
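A hedged sketch of the SHAP logging step; `model` and `X_row` (a one-row pandas DataFrame) are assumed to come from the training and scoring pipelines, and the output shape of `shap_values` varies by model type (a single 2-D array is assumed here, as with XGBoost/LightGBM-style models):

```python
import shap

# `model` is the trained fraud classifier, `X_row` the feature frame for one transaction.
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(X_row)   # assumed shape: (1, n_features)

# Persist per-feature impact next to the score for audits and analyst review.
audit_payload = {
    "model_version": "v42",                    # illustrative version tag
    "shap": dict(zip(X_row.columns, contributions[0])),
}
```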
Internal APIs and Component Interactions
Transaction Processor
- Receives transaction events
- Calls enrichment, fraud detection, and routing
User Profile Service
- Provides real-time profile lookups and async updates
Alert Engine
- Accepts flagged transactions, ranks them, and queues for review
Analytics Module
- Generates daily aggregates and dashboards
Model Service
- Exposes inference API for fraud models
All components interact via internal gRPC or async messaging (Pub/Sub-style), with strict versioning and timeouts for resilience.
Storage Breakdown and Estimations
- Transactions: 1 million/day × 1 KB ≈ 1 GB/day ≈ 365 GB/year
- Alerts: ~10K/day × ~500 bytes ≈ 5 MB/day ≈ 1.8 GB/year
- Profiles: 10M users × 500 bytes ≈ 5 GB
- Total storage footprint: ~370 GB for the first year (compressed, with cold-storage archiving for long-term data)
Data is partitioned by time (for queries) and user ID (for profile access).
Scaling and Trade-offs
1. Stateless vs Stateful Services
Stateless Services (e.g., ML scoring, rule evaluation):
- Easy to scale horizontally by adding instances behind a load balancer.
- Immutable inputs (transaction + features) → deterministic outputs.
- Can autoscale based on CPU usage or queue depth.
Stateful Services (e.g., user profile service, alert deduplication):
- Maintain rolling statistics, time series, and dynamic user behavior data.
- Scale via sharding: hash-based partitioning on `userId` ensures each instance owns a unique key range (see the sketch below).
- For consistency, implement quorum writes or use a single writer per shard to prevent race conditions during profile updates.
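A minimal sketch of the hash-based routing, assuming a fixed shard count; a production system would typically use consistent hashing so shards can be added without mass re-keying:

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash-based partitioning: the same user always lands on the same shard."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# e.g., all profile writes for user "u-123" are routed to shard_for("u-123"),
# which gives us a single writer per key range.
```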
2. Feature Engineering: Precompute vs Real-Time
Precomputed Features:
- Batched every N minutes.
- Great for high-latency aggregations (e.g., spend in last 30 days).
- Stored in a key-value store (e.g., `userId → features`), easily accessible at scoring time.
- Reduces runtime computation and standardizes input to the ML model.
Real-Time Features:
- Calculated on the fly from recent activity streams (e.g., #txns in the last 30 seconds, velocity score, geo anomaly delta).
- More accurate for detecting real-time fraud patterns (e.g., card testing or bot attacks).
- Require in-memory buffers and approximate algorithms (e.g., sliding windows, Bloom filters, count-min sketch).
Trade-off:
Precomputed features trade freshness for speed; real-time features trade speed for precision.
Strategy: Use hybrid composition: build the feature set as F = F_precomputed ∪ F_realtime (see the sketch below).
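A sketch of the hybrid composition, using a sliding-window counter as the real-time signal; the window size and feature names are illustrative:

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowCounter:
    """Real-time feature: number of transactions seen in the last `window` seconds."""

    def __init__(self, window: float = 30.0):
        self.window = window
        self.events: deque = deque()

    def add(self, ts: Optional[float] = None) -> None:
        self.events.append(ts if ts is not None else time.time())

    def count(self, now: Optional[float] = None) -> int:
        now = now if now is not None else time.time()
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()              # expire events outside the window
        return len(self.events)

def compose_features(precomputed: dict, realtime_counter: SlidingWindowCounter) -> dict:
    # F = F_precomputed ∪ F_realtime: merge cached aggregates with fresh signals.
    return {**precomputed, "txns_last_30s": realtime_counter.count()}
```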
3. Batch vs Online Model Training
Batch Learning:
- Trained nightly/weekly on full historical dataset.
- Enables robust, stable models with long-term patterns.
- Expensive to retrain; deployed via CI/CD.
Online Learning:
- Lightweight models (e.g., SGD, passive-aggressive) that adapt to drift with streaming data.
- Can learn from user feedback in near real-time.
- Prone to overfitting if not regularized or if noise is introduced.
Trade-off:
Batch models offer stability; online models offer reactivity.
Strategy:
- Use online learners to maintain short-term memory.
- Periodically consolidate into a new batch model with a blend of long-term and short-term data.
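A minimal online-learner sketch using scikit-learn's `SGDClassifier` with `partial_fit`; the loss and regularization settings are illustrative:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online learner kept alongside the batch model; hyperparameters are illustrative.
online_model = SGDClassifier(loss="log_loss", alpha=1e-4)
CLASSES = np.array([0, 1])  # not-fraud / fraud

def on_feedback(feature_vector: np.ndarray, label: int) -> None:
    """Incrementally adapt to analyst feedback, one labeled transaction at a time."""
    online_model.partial_fit(feature_vector.reshape(1, -1), [label], classes=CLASSES)
```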
4. Consistency vs Availability
- Transaction Ingestion: Must guarantee at-least-once delivery (e.g., via message queues with replay support).
- Profile Updates: Can afford eventual consistency — a user’s profile may lag a few seconds behind but will converge.
- Alert Generation: Must ensure exactly-once semantics to avoid duplicate alerts for the same event (deduplication via idempotency keys or transactional write guarantees).
- Trade-off: Not all components need strong consistency — use consistency where correctness matters (e.g., fraud decisions, logs) and availability where tolerance exists (e.g., behavioral profile sync).
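A minimal sketch of idempotency-key deduplication, which gives exactly-once effects on top of at-least-once delivery; the in-memory set stands in for an atomic store (e.g., a Redis set-if-absent):

```python
processed_keys: set = set()   # in production: Redis/DB with atomic set-if-absent

def publish_alert_once(event_id: str, alert: dict, publish) -> bool:
    """Suppress duplicate alerts when the same event is delivered more than once."""
    if event_id in processed_keys:
        return False           # duplicate delivery: drop silently
    publish(alert)             # write to alert DB / priority queue
    processed_keys.add(event_id)
    return True
```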
5. Horizontal Scaling Patterns
- Sharding: Partition data by `userId`, `region`, or `deviceId`. This isolates load and increases cache hit rates.
- Parallelization: Fraud detection can be run in parallel over transaction streams, with each stream worker being independently scalable.
- Backpressure: Use message queues to absorb bursts and allow consumers to scale gradually.
6. Cache vs Persistent Storage
- Hot data (e.g., user profile, recent txns) in an LRU cache with a TTL.
- Cold data (e.g., historical txns) lives in a partitioned database (or data lake).
- Write-behind cache design ensures low-latency reads with durability.
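A write-behind cache sketch; `db_writer` is an assumed persistence callback, and a real deployment would pair the pending buffer with a write-ahead log so queued writes survive a crash:

```python
import queue
import threading

class WriteBehindCache:
    """Reads hit memory; writes are acknowledged immediately and flushed asynchronously."""

    def __init__(self, db_writer):
        self.cache: dict = {}
        self.pending: "queue.Queue[tuple]" = queue.Queue()
        self.db_writer = db_writer                    # callable(key, value) that persists to the DB
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def put(self, key, value) -> None:
        self.cache[key] = value
        self.pending.put((key, value))                # durability deferred to the flusher

    def get(self, key):
        return self.cache.get(key)

    def _flush_loop(self) -> None:
        while True:
            key, value = self.pending.get()
            self.db_writer(key, value)
```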
Handling Failures
1. Server Failures (Microservices)
Problem: A node running fraud scoring or profile update crashes.
Impact: May miss scoring a transaction, or user profile may be outdated.
Recovery Mechanism:
- Stateless services are redeployed instantly with auto-healing orchestration (e.g., K8s health probes + restart).
- Stateful services (e.g., profile cache) reload from DB snapshot.
- Transaction events are retained in a persistent queue (e.g., Kafka), so failed events are reprocessed by surviving consumers.
Design Tip: Use at-least-once event processing + idempotent operations to ensure correctness on retries.
2. Model Drift or Degradation
Problem: Fraud model accuracy drops, leading to spike in false positives or negatives.
Detection:
- Live monitoring of model metrics (e.g., fraud catch rate, analyst override rate).
- Statistical tests (e.g., KS statistic, Population Stability Index (PSI)) to detect feature distribution shifts (see the PSI sketch below).
Mitigation:
- Rollback to last good model version (models are versioned).
- Trigger retraining pipeline.
- Fall back to rule-based engine for critical decisions while model retrains.
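For the PSI check mentioned above, a minimal sketch; the bin count and the 0.2 threshold are conventional choices, not requirements:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live feature distributions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_cnt, _ = np.histogram(expected, bins=edges)
    a_cnt, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_cnt / e_cnt.sum(), 1e-6, None)   # avoid division by zero / log(0)
    a_pct = np.clip(a_cnt / a_cnt.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Common rule of thumb: PSI > 0.2 suggests significant drift worth investigating.
```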
3. Alert Flooding or Amplification
Problem: Misconfigured rule or model triggers excessive alerts, overwhelming analysts.
Detection:
- Spike detection on alert volume per time window.
- Entropy check: are many alerts nearly identical?
Mitigation:
- Throttle or group similar alerts (alert coalescing).
- Use alert suppression windows to avoid re-alerting for similar conditions within N minutes (see the sketch below).
- Support circuit breakers in the rule engine to disable problematic rules dynamically.
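A minimal suppression-window sketch; the fingerprint scheme (e.g., rule + user + merchant) and the window length are assumptions:

```python
import time

_last_alerted: dict = {}   # alert fingerprint -> last emission time

def should_emit(fingerprint: str, suppression_window_s: float = 600.0) -> bool:
    """Drop near-identical alerts emitted within the suppression window."""
    now = time.time()
    last = _last_alerted.get(fingerprint)
    if last is not None and now - last < suppression_window_s:
        return False          # suppressed: same condition alerted recently
    _last_alerted[fingerprint] = now
    return True
```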
4. Data Corruption / Bad Input
Problem: Malformed transaction payloads or corrupted profiles cause model crashes or misclassification.
Mitigation:
- Use strict input validation on ingestion.
- Fail-safe scoring: if enrichment fails or data is missing, fall back to minimal scoring with default weights.
- Log and quarantine malformed events for reprocessing.
5. Profile Cache Inconsistency
Problem: User profile is stale or partially updated, leading to incorrect fraud scores.
Mitigation:
- Use versioned profiles with last-updated timestamp.
- On a cache miss, fall back to a DB load and warm the cache.
- For multi-shard writes, use two-phase commit or version vector-based conflict resolution.
6. Queue Overload / Backpressure
Problem: Ingestion exceeds processing capacity, leading to delays in scoring.
Mitigation:
- Auto-scale consumers based on queue lag.
- Implement backpressure strategy: drop low-risk txns during overload or delay non-urgent events.
- Prioritize queue segments: high-risk txn types can be given dedicated high-throughput paths.
7. Database Downtime
Problem: Primary database unavailable; writes fail or data cannot be read.
Mitigation:
- Use read replicas for non-critical reads (e.g., dashboards).
- Enable automatic failover with replicas promoted to primary.
- All writes go through retry queues or write-ahead buffers that sync once DB is back up.
8. Disaster Recovery / Region-Wide Failures
Problem: Natural disaster or cloud region outage.
Mitigation:
- Active-active deployment across multiple regions.
- DNS-based failover for routing.
- Cross-region replication of transaction logs and user profiles.
- Conduct regular DR drills to test RTO (recovery time objective) and RPO (recovery point objective).