TikTok Software Engineer Interview: Design a Scalable Live Streaming Platform

Live streaming has become a cornerstone of digital engagement — from gaming broadcasts to live tutorials and real-time concerts. Building a scalable, low-latency, and resilient platform requires careful design across video delivery, real-time interaction, data storage, and system architecture. Here’s how to approach it.

Design a Scalable Live Streaming Platform

Functional Requirements

  • User Management: Registration, login, profile editing.
  • Live Streaming: Streamers can start/stop streams; viewers can watch in real time.
  • Real-Time Chat: Two-way communication during live sessions.
  • Discovery: Streams are searchable by category and metadata.
  • Moderation: Filters for inappropriate content and messages.
  • Analytics: Basic viewership stats for streamers.

Non-Functional Requirements

  • Support for 1M+ concurrent viewers
  • <5s latency from streamer to viewer
  • 99.9% uptime target
  • Horizontal scalability
  • Consistent performance for a globally distributed audience
  • Strong security & privacy guarantees
  • Maintainable and extensible architecture

Traffic & Storage Estimates

Bandwidth (Peak)

Assuming peak viewers split across two renditions (roughly 70% on a 2 Mbps stream, 30% on a 4 Mbps stream):

  • 560K viewers @ 2 Mbps = 1,120 Gbps
  • 240K viewers @ 4 Mbps = 960 Gbps
  • Total bandwidth: ~2,080 Gbps

Storage

  • Stream recordings: ~17.6 TB/day
  • Monthly retention (30 days): ~528 TB

Chat

  • 75M messages/day, ~1.1 TB/month

Metadata

  • User data: 8 GB
  • Stream metadata: 0.3 GB/month
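
These estimates are straightforward to sanity-check. A quick back-of-the-envelope script (the ~500-byte average stored chat message is an assumption):

  # Back-of-the-envelope check of the estimates above.

  # Peak bandwidth: two viewer cohorts at different renditions.
  bw_gbps = (560_000 * 2 + 240_000 * 4) / 1_000   # Mbps -> Gbps
  print(f"Peak bandwidth: {bw_gbps:,.0f} Gbps")   # ~2,080 Gbps

  # Storage: daily recordings retained for 30 days.
  print(f"30-day retention: {17.6 * 30:,.0f} TB") # ~528 TB

  # Chat: 75M messages/day at an assumed ~500 bytes per stored message.
  chat_tb = 75_000_000 * 500 * 30 / 1e12
  print(f"Chat storage: ~{chat_tb:.1f} TB/month") # ~1.1 TB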

Core API Design

Users

  • POST /api/users/register
  • POST /api/users/login
  • GET /api/users/{userId}

Streams

  • POST /api/streams/start
  • POST /api/streams/{id}/stop
  • GET /api/streams/{id}

Chat

  • POST /api/streams/{id}/chat
  • GET /api/streams/{id}/chat

Discovery & Analytics

  • GET /api/streams/search
  • GET /api/streams/{id}/analytics
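
A minimal sketch of the stream endpoints, assuming a FastAPI service (handler bodies, the in-memory store, and the ingest URL are illustrative placeholders; auth and validation omitted):

  from uuid import uuid4
  from fastapi import FastAPI

  app = FastAPI()
  streams: dict[str, dict] = {}  # in-memory stand-in for the Streams table

  @app.post("/api/streams/start")
  def start_stream(title: str, category: str) -> dict:
      stream_id = str(uuid4())
      streams[stream_id] = {"title": title, "category": category, "status": "live"}
      return {"id": stream_id, "ingest_url": f"rtmp://ingest.example.com/live/{stream_id}"}

  @app.post("/api/streams/{stream_id}/stop")
  def stop_stream(stream_id: str) -> dict:
      streams[stream_id]["status"] = "ended"
      return streams[stream_id]

  @app.get("/api/streams/{stream_id}")
  def get_stream(stream_id: str) -> dict:
      return streams[stream_id]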

Database Schema Overview

Tables:

  • Users: id, username, email, password_hash, created_at
  • Streams: id, user_id, title, category, quality, status, started_at, ended_at, viewer_count
  • ChatMessages: id, stream_id, user_id, message, timestamp
  • Followers: user_id, follower_id
  • Analytics: stream_id, viewer metrics, timestamp
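
A condensed version of this schema as DDL (SQLite syntax so the sketch is self-contained; production would more likely be PostgreSQL with indexes, plus the Followers and Analytics tables):

  import sqlite3

  DDL = """
  CREATE TABLE users (
      id            INTEGER PRIMARY KEY,
      username      TEXT UNIQUE NOT NULL,
      email         TEXT UNIQUE NOT NULL,
      password_hash TEXT NOT NULL,
      created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );
  CREATE TABLE streams (
      id           INTEGER PRIMARY KEY,
      user_id      INTEGER REFERENCES users(id),
      title        TEXT,
      category     TEXT,
      quality      TEXT,
      status       TEXT,
      started_at   TIMESTAMP,
      ended_at     TIMESTAMP,
      viewer_count INTEGER DEFAULT 0
  );
  CREATE TABLE chat_messages (
      id         INTEGER PRIMARY KEY,
      stream_id  INTEGER REFERENCES streams(id),
      user_id    INTEGER REFERENCES users(id),
      message    TEXT NOT NULL,
      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );
  """

  conn = sqlite3.connect(":memory:")
  conn.executescript(DDL)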

High-Level System Architecture

  • API Gateway: Unified entry point, handles routing & auth
  • User Service: Auth, profile, user data
  • Streaming Service: Manages stream lifecycle, metadata
  • Media Server: Ingests, transcodes, segments video, pushes to CDN
  • CDN: Global video delivery, edge caching
  • Chat Service: WebSocket-powered, real-time message exchange
  • Discovery Service: Search, recommendation engine
  • Analytics Service: Real-time metrics and aggregation
  • DB Cluster: Separate clusters for users, streams, chat, analytics
  • Load Balancers: Distribute traffic across services and regions

Media Server Breakdown

  • Ingestion: Accepts RTMP streams; assigns unique stream keys
  • Transcoding: Converts to multiple resolutions using FFmpeg
  • Segmenting: Slices into HLS/DASH segments
  • CDN Push: Uploads segments to the CDN origin
  • Auto-scaling: Based on concurrent stream count
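
The transcoding step is essentially a thin wrapper around FFmpeg. A sketch (the ingest URL, stream key, and single 720p rendition are illustrative; real deployments emit a full bitrate ladder plus a master playlist):

  import subprocess

  def transcode_to_hls(ingest_url: str, out_dir: str) -> subprocess.Popen:
      """Pull one RTMP ingest and emit a 720p HLS rendition."""
      cmd = [
          "ffmpeg", "-i", ingest_url,
          "-c:v", "libx264", "-b:v", "2500k", "-s", "1280x720",
          "-c:a", "aac", "-b:a", "128k",
          "-f", "hls",
          "-hls_time", "4",        # 4-second segments
          "-hls_list_size", "6",   # sliding-window playlist
          "-hls_segment_filename", f"{out_dir}/seg_%05d.ts",
          f"{out_dir}/720p.m3u8",
      ]
      return subprocess.Popen(cmd)

  proc = transcode_to_hls("rtmp://ingest.example.com/live/STREAM_KEY", "/var/hls/demo")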

Chat Service Breakdown

  • WebSocket Gateway: Persistent connection management
  • Message Queue: Redis/Kafka for buffering
  • Worker Pool: Validates, persists, and broadcasts
  • Moderation: Regex filters, banned word detection
  • Scalability: Sticky sessions or pub/sub for sharding
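
A minimal sketch of the gateway wired to Redis pub/sub, assuming the websockets (>= 11) and redis-py (>= 4.2) packages; room joining, validation, and moderation hooks are simplified:

  import asyncio
  import redis.asyncio as redis
  import websockets

  r = redis.Redis()

  async def handle(ws):
      """One viewer connection: relay between the socket and a Redis channel."""
      stream_id = await ws.recv()            # first frame names the room to join
      channel = f"chat:{stream_id}"

      async def fan_in():                    # viewer -> room
          async for raw in ws:
              await r.publish(channel, raw)  # moderation would run first

      async def fan_out():                   # room -> viewer
          ps = r.pubsub()
          await ps.subscribe(channel)
          async for msg in ps.listen():
              if msg["type"] == "message":
                  await ws.send(msg["data"].decode())

      await asyncio.gather(fan_in(), fan_out())

  async def main():
      async with websockets.serve(handle, "0.0.0.0", 8765):
          await asyncio.Future()             # run forever

  asyncio.run(main())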

Trade-Off Analysis

RTMP vs. WebRTC

  • RTMP is well-supported by OBS and other broadcasting tools. It provides a predictable ingestion workflow, but paired with HLS/DASH packaging it typically yields several seconds of glass-to-glass latency.
  • WebRTC offers ultra-low latency (<1s) but requires complex peer-to-peer or SFU setups, consumes more server resources, and has limited support in mainstream broadcasting tools.
  • Decision: Use RTMP for ingestion, HLS/DASH for delivery. Consider WebRTC for ultra-low latency 1:1 interactions (e.g., mentoring, auctions).

CDN vs. Custom Edge Network

  • Commercial CDNs like Akamai or Cloudflare handle global distribution, DDoS protection, and TLS termination.
  • Building a custom edge network gives tighter control over caching and cost optimization, but it requires heavy investment in infrastructure, monitoring, and redundancy planning.
  • Decision: Use commercial CDN with fallback to origin. Explore hybrid strategies once scale and cost justify it.

WebSockets vs. Polling

  • WebSockets provide low-latency bi-directional communication; ideal for real-time chat.
  • Polling is easier to implement but introduces delays and consumes unnecessary compute/bandwidth.
  • WebSocket challenges include scaling with millions of connections — often solved using sticky sessions or pub/sub systems like Redis, NATS, or Kafka.

Relational vs. NoSQL Databases

  • SQL (e.g., PostgreSQL) offers ACID guarantees and strong consistency, suitable for user data and stream metadata.
  • NoSQL (e.g., Cassandra, MongoDB) may be used for high-write throughput components like chat logs and analytics.
  • Redis is ideal for ephemeral data like live viewer counts.
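
Live viewer counts, for example, map naturally onto Redis counters. A sketch (key names and the 60-second heartbeat TTL are assumptions):

  import redis

  r = redis.Redis()

  def on_viewer_join(stream_id: str) -> int:
      key = f"viewers:{stream_id}"
      count = r.incr(key)
      r.expire(key, 60)  # self-heals: the key vanishes if heartbeats stop
      return count

  def on_viewer_leave(stream_id: str) -> int:
      return r.decr(f"viewers:{stream_id}")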

Horizontal vs. Vertical Scaling

  • Horizontal scaling supports load distribution and failure isolation but requires services to be stateless or externally coordinated (e.g., distributed session storage).
  • Vertical scaling simplifies architecture but is limited by hardware constraints and increases failure blast radius.
  • Decision: Design for horizontal scaling with auto-scaling groups and stateless microservices.

Automated vs. Manual Moderation

  • NLP-based filters and ML classifiers provide scalable first-pass moderation.
  • Human moderators ensure nuanced decision-making but are costly to scale as the user base grows.
  • Decision: Combine AI-driven filtering with a manual review and user flagging system.
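
The first-pass filter can be as simple as a compiled regex over a banned-word list, with anything ambiguous escalated to classifiers or human review (the word list is obviously illustrative):

  import re

  BANNED = ["badword1", "badword2"]  # real lists are much larger and curated
  PATTERN = re.compile("|".join(re.escape(w) for w in BANNED), re.IGNORECASE)

  def first_pass(message: str) -> str:
      """Cheap regex pass; undecided cases go to ML or human review."""
      return "blocked" if PATTERN.search(message) else "allowed"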

Failure Modes & Mitigations

1. Media Server Overload

  • Symptoms: Dropped frames, stream disconnects
  • Root cause: Spike in concurrent ingest streams without enough transcode resources
  • Mitigations: Implement autoscaling groups using metrics like CPU/network usage; throttle ingest if needed; prioritize streamers based on account tier.
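
The scaling decision itself can follow the proportional rule used by Kubernetes' Horizontal Pod Autoscaler (the 70% CPU target is an assumed tuning choice):

  import math

  def desired_replicas(current: int, utilization: float, target: float = 0.70) -> int:
      """Proportional scaling: replicas grow with observed/target utilization."""
      return max(1, math.ceil(current * utilization / target))

  print(desired_replicas(10, 0.95))  # 10 transcode nodes at 95% CPU -> 14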

2. CDN Outage or Degradation

  • Symptoms: Users experience buffering or complete video failure
  • Mitigations: Multi-CDN setup using failover logic or DNS-based geo load balancing (e.g., NS1, Route 53); allow fallback to origin servers with reduced capacity.

3. Chat Service Bottlenecks

  • Symptoms: Delayed messages or chat server crashes during popular events
  • Mitigations: Use distributed message queues; shard chat rooms; rate limit users; pre-warm nodes for anticipated large events.
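
Per-user rate limiting is typically a token bucket; a sketch (the 1 msg/sec refill rate and burst of 5 are assumed values):

  import time

  class TokenBucket:
      """Per-user chat limiter: ~1 message/sec with small bursts allowed."""
      def __init__(self, rate: float = 1.0, burst: int = 5):
          self.rate, self.burst = rate, burst
          self.tokens, self.last = float(burst), time.monotonic()

      def allow(self) -> bool:
          now = time.monotonic()
          self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
          self.last = now
          if self.tokens >= 1:
              self.tokens -= 1
              return True
          return False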

4. Database Hotspots

  • Symptoms: High latency or failure during heavy write/read operations (e.g., chat flood)
  • Mitigations: Partition large tables (e.g., ChatMessages), implement read replicas, use Redis or Memcached for high-read endpoints.
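
For hot read paths such as stream metadata, a cache-aside pattern keeps the database off the critical path. A sketch (fetch_from_db is a hypothetical accessor, stubbed here; the 30-second TTL is an assumption):

  import json
  import redis

  r = redis.Redis()

  def fetch_from_db(stream_id: str) -> dict:
      return {"id": stream_id, "title": "demo", "status": "live"}  # stub

  def get_stream_metadata(stream_id: str) -> dict:
      """Cache-aside: serve hot reads from Redis, fall back to the database."""
      key = f"stream:{stream_id}"
      cached = r.get(key)
      if cached is not None:
          return json.loads(cached)
      row = fetch_from_db(stream_id)
      r.set(key, json.dumps(row), ex=30)  # short TTL bounds staleness
      return row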

5. Single Point of Failure (SPOF)

  • Examples: A single API gateway or DB instance going down
  • Mitigations: Use redundant load balancers (e.g., HAProxy/NLB); ensure DB clusters are multi-master or offer failover nodes; practice chaos testing.

6. Content Moderation Failures

  • Symptoms: Abuse slipping through filters, or false positives frustrating legitimate users
  • Mitigations: Audit moderation logs; continuously retrain classifiers; allow user appeals; provide moderator dashboards with escalation paths.

7. Security Breaches

  • Attack vectors: Token hijacking, RTMP stream key leaks, unauthorized DB access
  • Mitigations: Rotate JWT secrets and stream keys; encrypt sensitive data at rest and in transit (TLS everywhere); implement WAFs and RASP; conduct regular pen tests.
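
Stream key rotation, for instance, limits the blast radius of a leak. A sketch using opaque, expiring keys (the key naming and 24-hour TTL are assumptions):

  import secrets
  import redis

  r = redis.Redis()

  def issue_stream_key(user_id: str, ttl_hours: int = 24) -> str:
      """Mint an opaque ingest key; only the current key is honored."""
      key = secrets.token_urlsafe(32)
      r.set(f"stream_key:{user_id}", key, ex=ttl_hours * 3600)  # auto-expires
      return key

  def validate_stream_key(user_id: str, presented: str) -> bool:
      stored = r.get(f"stream_key:{user_id}")
      return stored is not None and secrets.compare_digest(stored.decode(), presented)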

Final Thoughts

Designing a robust live streaming platform demands not only scalable video infrastructure but also a well-integrated set of real-time, analytics, and content moderation services. Prioritize low-latency delivery, global scalability, and a modular microservice architecture to deliver a seamless user experience.

View the comprehensive solution: https://bugfree.ai/system-design/live-streaming-platform

Practice the system design question: https://bugfree.ai/practice/system-design/live-streaming-platform
