  System Design: How to Design Twitter

    Catalog

    • Clarify the requirements
    • Capacity Estimation
    • System APIs
    • High-level System Design
    • Data Storage
    • Scalability

    Step1: Clarify the requirements

    Clarify requirements and goals of the system

    • Requirements
    • Traffic size (e.g. Daily Active Users)

    Nobody expects you to design a complete system in 30-40 minutes.

    Discuss the functionalities and align with the interviewer on which components to focus on.

    Type1: Functional Requirement

    1. Tweet
      • a. Create
      • b. Delete
    2. Timeline/Feed
      • a. Home
      • b. User
    3. Follow a user
    4. Like a tweet
    5. Search tweets
      ...


    Type2: Non-Functional Requirement

    • Consistency
      • Every read receives the most recent write or an error
      • Sacrifice: settle for eventual consistency
    • Availability
      • Every request receives a response, without the guarantee that it contains the most recent write
      • The system should stay scalable, with low-latency performance
    • Partition tolerance (fault tolerance)
      • The system continues to operate despite an arbitrary number of messages being dropped by the network between nodes

    Step2: Capacity Estimation

    Assumption:
    - 200 million DAU, 100 million new tweets per day
    - Each user visits the home timeline 5 times and other users' timelines 3 times per day
    - Each timeline/page has 20 tweets
    - Each tweet is 280 bytes, plus 30 bytes of metadata
    - Per photo: 200 KB; 20% of tweets have images
    - Per video: 2 MB; 10% of tweets have videos; 30% of videos get watched

    Storage Estimate

    • Write size daily:
      • Text:
        • 100M new tweets * (280 + 30) bytes/tweet ≈ 31 GB/day
      • Image:
        • 100M * 20% * 200 KB = 4 TB/day
      • Video:
        • 100M * 10% * 2 MB = 20 TB/day
    • Total:
      • ≈ 24 TB/day
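
    These figures are easy to sanity-check. Below is a minimal back-of-the-envelope script; it uses only the assumptions listed above (Python, for illustration):

    # Daily write-side storage, from the assumptions above.
    NEW_TWEETS_PER_DAY = 100_000_000

    TWEET_BYTES = 280 + 30          # tweet text + metadata
    IMAGE_BYTES = 200_000           # 200 KB per photo
    VIDEO_BYTES = 2_000_000         # 2 MB per video
    IMAGE_RATIO, VIDEO_RATIO = 0.20, 0.10

    text_b  = NEW_TWEETS_PER_DAY * TWEET_BYTES                # ~31 GB/day
    image_b = NEW_TWEETS_PER_DAY * IMAGE_RATIO * IMAGE_BYTES  # ~4 TB/day
    video_b = NEW_TWEETS_PER_DAY * VIDEO_RATIO * VIDEO_BYTES  # ~20 TB/day

    GB, TB = 1e9, 1e12
    print(f"text  {text_b / GB:.0f} GB/day   image {image_b / TB:.0f} TB/day")
    print(f"video {video_b / TB:.0f} TB/day  total {(text_b + image_b + video_b) / TB:.1f} TB/day")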

    Bandwidth Estimate (Social Networking => read heavy)

    Daily read tweet volume:
    - 200M * (5 home visits + 3 user visits) * 20 tweets/page = 32B tweets/day

    Daily read bandwidth:

    • Text: 32B * 280 bytes / 86400 s ≈ 100 MB/s
    • Image: 32B * 20% * 200 KB / 86400 s ≈ 14 GB/s
    • Video: 32B * 10% * 30% watched * 2 MB / 86400 s ≈ 20 GB/s
    • Total: ≈ 35 GB/s
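
    The read side can be checked the same way; the exact arithmetic lands a couple of GB/s above the rounded 35 GB/s, which is fine at this precision. A continuation of the sketch above, with the constants restated so it runs on its own:

    # Daily read-side bandwidth, from the same assumptions.
    DAU = 200_000_000
    IMAGE_BYTES, VIDEO_BYTES = 200_000, 2_000_000
    IMAGE_RATIO, VIDEO_RATIO, WATCH_RATIO = 0.20, 0.10, 0.30
    SECONDS_PER_DAY = 86_400

    reads = DAU * (5 + 3) * 20    # 32B tweets read per day

    text_bw  = reads * 280 / SECONDS_PER_DAY                                      # ~0.1 GB/s
    image_bw = reads * IMAGE_RATIO * IMAGE_BYTES / SECONDS_PER_DAY                # ~14.8 GB/s
    video_bw = reads * VIDEO_RATIO * WATCH_RATIO * VIDEO_BYTES / SECONDS_PER_DAY  # ~22.2 GB/s

    print(f"total ~{(text_bw + image_bw + video_bw) / 1e9:.0f} GB/s")             # ~37 GB/s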

    Step3: System APIs

    postTweet(userToken, string tweet)
    
    deleteTweet(userToken, string tweetId)
    
    likeOrUnlikeTweet(userToken, string tweetId, bool like)
    
    readHomeTimeLine(userToken, int pageSize, opt string pageToken)
    
    readUserTimeLine(userToken, int pageSize, opt string pageToken)
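
    To make the pagination contract concrete, here is a minimal sketch of readHomeTimeLine using a cursor-style pageToken. The authenticate helper and the timeline/tweet stores are hypothetical names used only for illustration:

    import base64
    from dataclasses import dataclass

    @dataclass
    class Page:
        tweets: list            # up to pageSize tweet objects
        next_page_token: str    # opaque cursor; empty string when exhausted

    def read_home_timeline(user_token: str, page_size: int, page_token: str = "") -> Page:
        user_id = authenticate(user_token)   # hypothetical auth helper
        # The opaque token encodes the last tweet id already returned, so the
        # client never handles raw offsets and the server may change the
        # cursor format at any time.
        after = int(base64.b64decode(page_token)) if page_token else None
        ids = timeline_store.fetch(user_id, after=after, limit=page_size)   # hypothetical store
        next_token = (base64.b64encode(str(ids[-1]).encode()).decode()
                      if len(ids) == page_size else "")
        return Page(tweets=tweet_store.get_many(ids), next_page_token=next_token)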
    

    Step4: High-Level System Design

    • post tweets


    • user timeline (push/pull mode)


    https://medium.com/@winapp/read-fast-with-fan-out-write-f25257117297

    Home Timeline

    Fan out on write

    • Not efficient for users with a huge number of followers (like Taylor Swift)


    Hybrid Solution

    • Non-hot users:
      • fan out on write (push)
    • Hot users:
      • fan in on read (pull): at timeline-request time, read the hot users' recent tweets from the tweets cache and aggregate them with the precomputed results from non-hot users (see the sketch below)

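    A sketch of the hybrid path, assuming snowflake-style tweet ids (sortable by creation time); is_hot_user, the follower/followee stores, and both caches are illustrative names:

    FOLLOWER_THRESHOLD = 10_000   # assumed cutoff separating "hot" users

    def fan_out(author_id: int, tweet_id: int) -> None:
        if is_hot_user(author_id):   # e.g. follower count > FOLLOWER_THRESHOLD
            # Hot user: a single write into the author's own tweet cache;
            # followers pull it at read time (fan in on read).
            hot_tweet_cache.append(author_id, tweet_id)
        else:
            # Non-hot user: push the id into every follower's precomputed
            # home timeline (fan out on write).
            for follower_id in follower_store.followers(author_id):
                home_timeline_cache.prepend(follower_id, tweet_id)

    def home_timeline(user_id: int, limit: int = 20) -> list:
        pushed = home_timeline_cache.read(user_id, limit)         # precomputed part
        pulled = [t for hot in followee_store.hot_followees(user_id)
                    for t in hot_tweet_cache.recent(hot, limit)]  # pulled part
        # Time-sortable ids make the merge a plain sort by recency.
        return sorted(pushed + pulled, reverse=True)[:limit]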

    Step5: Data Storage


    Principles

    • SQL database:
      • e.g. user table
    • NoSQL database:
      • e.g. timelines
    • File system:
      • media files: image, audio, video
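
    As a small illustration of the split (the columns and key layout are assumptions, not a fixed schema):

    import sqlite3

    # SQL side: small, relational, consistency-sensitive data such as users
    # and follow relations.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (
            user_id    INTEGER PRIMARY KEY,
            handle     TEXT UNIQUE NOT NULL,
            created_at TEXT NOT NULL
        );
        CREATE TABLE follows (
            follower_id INTEGER NOT NULL,
            followee_id INTEGER NOT NULL,
            PRIMARY KEY (follower_id, followee_id)
        );
    """)

    # NoSQL side: a timeline is an append-mostly list keyed by user id, a
    # natural fit for a key-value / wide-column store.
    home_timeline = {"user:42": [1003, 1002, 1001]}   # user_id -> recent tweet ids

    # File system / blob storage: media bytes live outside the database; the
    # tweet record keeps only a URL (e.g. "https://cdn.example.com/img/1003.jpg").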

    Step6: Scalability

    • Identify potential bottlenecks
    • Discuss solutions, focusing on tradeoffs
      • Data sharding
        • data store, cache
      • Load balancing
        • user <-> application server
        • application server <-> cache server
        • application server <-> db
      • Data caching
        • read heavy

    Sharding

    Why?

    • impossible to store/process all the data on a single machine

    How?

    • Break large tables into smaller shards on multiple servers

    Pros

    • Horizontal scaling

    Cons

    • Complexity (distributed queries, resharding, ...)

    Option 1: Shard by tweets' creation time

    Pros:

    • Limited shards to query

    Cons:

    • Hot/Cold data issue
    • New shards fill up quickly

    Option 2: Shard by hash(userId): store all of a user's data on a single shard

    Pros:

    • Simple
    • Query user timeline is straightforward

    Cons:

    • Home timeline still needs to query multiple shards (followees are spread out)
    • Non-uniform distribution of storage (some users tweet far more than others)
    • Hot users make hot shards
    • Availability: one shard failure takes all of its users' data offline

    Option 3: Shard by hash(tweetId)

    Pros:

    • uniform distribution
    • high availability

    Cons:

    • need to query all shards to generate a user/home timeline (mitigated by a cache layer)
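
    A sketch of option 3's routing and the scatter-gather read it forces (the shards list and its recent_tweets method are hypothetical):

    from itertools import chain

    NUM_SHARDS = 64

    def shard_of(tweet_id: int) -> int:
        # Uniform placement: consecutive tweet ids spread across shards.
        return hash(tweet_id) % NUM_SHARDS

    def user_timeline(user_id: int, limit: int = 20) -> list:
        # Scatter: any shard may hold some of this user's tweets...
        partials = (shard.recent_tweets(user_id, limit) for shard in shards)
        # ...gather: merge per-shard results, keep the newest overall.
        return sorted(chain.from_iterable(partials), reverse=True)[:limit]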

    Caching

    Why?

    • social networks have heavy read traffic
    • queries can be slow and costly

    How?

    • store hot/precomputed data in memory, so reads become much faster

    Timeline service

    • user timeline: user_id -> {tweet_id}
    • home timeline: user_id -> {tweet_id}
    • tweets: tweet_id -> tweet
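
    A minimal sketch of these three maps on top of Redis (Redis itself and the key names are assumptions; any in-memory KV store works):

    import redis

    r = redis.Redis()   # assumed local instance

    def on_new_tweet(author_id: int, tweet_id: int, tweet_json: str) -> None:
        r.set(f"tweet:{tweet_id}", tweet_json)      # tweets: tweet_id -> tweet
        r.lpush(f"user_tl:{author_id}", tweet_id)   # user timeline: user_id -> {tweet_id}
        r.ltrim(f"user_tl:{author_id}", 0, 799)     # cap each cached timeline

    def user_timeline_page(user_id: int, page: int = 0, size: int = 20) -> list:
        ids = r.lrange(f"user_tl:{user_id}", page * size, (page + 1) * size - 1)
        return [r.get(f"tweet:{int(i)}") for i in ids]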

    Topics:

    • caching policy
    • sharding
    • performance

    Reference

    https://www.youtube.com/watch?v=PMCdWr6ejpw&list=PLLuMmzMTgVK4RuSJjXUxjeUt3-vSyA1Or&index=1
