  • A replay of the Amazon Virginia data center outage

    A while ago, Amazon's data center in Virginia went down, leaving many websites built on Amazon's cloud services unreachable and triggering a great deal of reaction and discussion. At the time I assumed service would be back to normal within a few hours, yet the recovery ended up being measured in days. Below are the key points in the timeline, followed by a few thoughts of my own:

    At 12:47 AM PDT on April 21st, an incorrect traffic shift operation led to the disaster.

    At 2:40 AM PDT on April 21st, the team deployed a change that disabled all new Create Volume requests in the affected Availability Zone, and by 2:50 AM PDT, latencies and error rates for all other EBS related APIs recovered.

    By 5:30 AM PDT on April 21st, error rates and latencies again increased for EBS API calls across the Region.

    At 8:20 AM PDT on April 21st, the team began disabling all communication between the degraded EBS cluster in the affected Availability Zone and the EBS control plane.

    At 11:30 AM PDT on April 21st, the team developed a way to prevent EBS servers in the degraded EBS cluster from futilely contacting other servers. Latencies and error rates for new EBS-backed EC2 instances declined rapidly and returned to near-normal at noon PDT. About 13% of the volumes in the affected Availability Zone were in this “stuck” (out of service) state.

    At 2:00 AM PDT on April 22nd, the team successfully started adding significant amounts of new capacity and working through the replication backlog.

    At 12:30 PM PDT on April 22nd, all but about 2.2% of the volumes in the affected Availability Zone were restored.

    At 11:30 AM PDT on April 23rd, the team began steadily processing the backlog.

    At 6:15 PM PDT on April 23rd, API access to EBS resources was restored in the affected Availability Zone.

    At 3:00 PM PDT on April 24th, the team began restoring the remaining volumes. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state.


    For Amazon, 0.07% is just a number, but for some sites it means death. I wonder how Amazon will compensate those websites.

    Reading through the whole recovery process felt like reading a thriller. I don't know how Amazon's support engineers got through those days; they must have been exhausted in body and mind.

    Meanwhile, incidents like this probably play out at different companies all the time, but when one happens at Amazon its significance is far greater, because Amazon has become a platform. In other words, the whole world is watching you.

    Looking back over the whole incident, can it simply be blamed on a botched operation? It doesn't seem that simple. The botched operation exposed shortcomings and latent bugs in the design of EBS and its control plane, which may not be a bad thing for Amazon's continued smooth operation; otherwise, the cost would have been far higher had the same problem surfaced later, while serving more and larger applications. One cannot take comfort in thinking that, absent the operator error, the incident would never have happened. As the saying goes, all roads lead to Rome: as long as the flaw exists, then as data scale grows some condition will eventually trigger that bug, and the disaster will eventually arrive.

    Digging into the root cause of the incident, it feels like Amazon did not take the network partition problem seriously enough, or at least did not test for it sufficiently. When I previously saw Google emphasize network partitions I did not pay much attention either; now it is clear that such problems are rare, but when they do appear they are deadly. There is also the issue of request-handling priority. I suspect many logic/proxy server programs run into a similar problem: when some backend servers block, the request queue fills up and normal requests can no longer be served. The industry already has plenty of solutions for this (see the sketch below), but my point is that the solution itself is not the most important part; what matters is whether you recognize that "this" is a problem.
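    To make that second problem concrete, here is a minimal sketch in Go. It is my own illustration, not Amazon's actual design; the names (proxy, submit, the "healthy"/"stuck" backends, the queue depth) are all hypothetical. The idea is simply that each backend gets its own bounded queue and requests to a saturated backend are rejected immediately, so a blocked backend cannot starve requests that could still be served normally.

```go
// A minimal sketch (not Amazon's actual design) of per-backend bounded queues
// in a proxy: a blocked backend fills only its own queue and new requests to it
// are rejected fast, so healthy backends keep serving.
package main

import (
	"errors"
	"fmt"
	"time"
)

// request is a hypothetical unit of work routed to one backend.
type request struct {
	backend string
	payload string
}

// proxy keeps an independent bounded queue per backend instead of one shared
// queue, so a stuck backend cannot exhaust capacity for everyone else.
type proxy struct {
	queues map[string]chan request
}

var errBackendSaturated = errors.New("backend queue full, rejecting request")

func newProxy(backends []string, queueDepth int) *proxy {
	p := &proxy{queues: make(map[string]chan request)}
	for _, b := range backends {
		q := make(chan request, queueDepth)
		p.queues[b] = q
		go worker(b, q)
	}
	return p
}

// submit does a non-blocking enqueue: if the target backend's queue is full,
// the request is rejected immediately instead of blocking the caller.
func (p *proxy) submit(r request) error {
	select {
	case p.queues[r.backend] <- r:
		return nil
	default:
		return errBackendSaturated
	}
}

// worker simulates a backend; "stuck" stands in for the degraded cluster and
// never makes progress, while "healthy" serves normally.
func worker(name string, q chan request) {
	for r := range q {
		if name == "stuck" {
			time.Sleep(time.Hour) // blocked backend: requests pile up only in its own queue
		}
		fmt.Printf("%s served %s\n", name, r.payload)
	}
}

func main() {
	p := newProxy([]string{"healthy", "stuck"}, 2)
	for i := 0; i < 5; i++ {
		for _, b := range []string{"healthy", "stuck"} {
			if err := p.submit(request{backend: b, payload: fmt.Sprintf("req-%d", i)}); err != nil {
				fmt.Printf("%s: %v\n", b, err)
			}
		}
	}
	time.Sleep(100 * time.Millisecond) // let the healthy worker drain its queue
}
```

    Running this, the "stuck" backend's requests start getting rejected as soon as its small queue fills, while every request to the "healthy" backend is served. The design choice being illustrated is isolation plus fast rejection, which is one common way the industry addresses the queue-saturation problem described above.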

    http://aws.amazon.com/message/65648/
