  • Cloud Native Infrastructure, Chapter 7 (3)

    Application Requirements on Infrastructure

    Cloud native applications expect more from infrastructure than just executing a binary. They need abstractions, isolation, and guarantees about how they’ll run and be managed. In return, they are required to provide hooks and APIs that allow the infrastructure to manage them. To be successful, there needs to be a symbiotic relationship.

    We defined cloud native applications in Chapter 1, and just discussed some life cycle requirements. Now let’s look at more expectations they have of an infrastructure built to run them:

    • Runtime and isolation (isolation of resources such as CPU, memory, and disk)
    • Resource allocation and scheduling
    • Environment isolation (e.g., dev, test, beta, and production)
    • Service discovery
    • State management
    • Monitoring and logging
    • Metrics aggregation
    • Debugging and tracing

    All of these should be default options for services or provided from self-service APIs. We will explain each requirement in more detail to make sure the expectations are clearly defined.

    Application Runtime and Isolation


    Traditional applications only needed a kernel and possibly an interpreter to run. Cloud native applications still need that, but they also need to be isolated from the operating system and other applications where they run. Isolation enables multiple applications to run on the same server and control their dependencies and resources.


    Application isolation is sometimes called multitenancy. That term can be used for multiple applications running on the same server and for multiple users running applications in a shared cluster. The users can be running verified, trusted code, or they may be running code you have no control over and do not trust.

    To be cloud native does not require using containers. Netflix pioneered many of the cloud native patterns, and when the company transitioned to running on a public cloud, it used VMs as its deployment artifact, not containers. FaaS services (e.g., AWS Lambda, a commercial serverless offering) are another popular cloud native technology for packaging and deploying code. In most cases, they use containers for application isolation, but the container packaging is hidden from the user.

    What Is a Container?
    There are many different implementations of containers. Docker popularized the term container to describe a way to package and run an application in an isolated environment. Fundamentally, containers use kernel primitives or hardware features to isolate processes on a single operating system.
    Levels of container isolation can vary, but usually it means the application runs with an isolated root filesystem, namespaces, and resource allocation (e.g., CPU and RAM) separate from other processes on the same server. The container format has been adopted by many projects and led to the creation of the Open Container Initiative (OCI), which defines standards on how to package and run an application container.


    Isolation also puts a burden on the engineers writing the application. They are now responsible for declaring all software dependencies. If they fail to do so, the application will not run because necessary libraries will not be available.


    Containers are often chosen for cloud native applications because better tooling, processes, and orchestration tools have emerged for managing them. While containers are currently the easiest way to implement runtime and resource isolation, this has not always been the case and likely will not always be.

    Resource Allocation and Scheduling


    Historically, applications would provide rough estimates around minimum system requirements, and it was the responsibility of a human to figure out where the application could run. Human scheduling can take a long time to prepare the operating system and dependencies for the application to run.


    The deployment can be automated through configuration management and provisioning, but it still requires a human to verify resources and tag a server to run the application. Cloud native infrastructure relies on dependency isolation and allows applications to run wherever resources are available.


    With isolation, as long as a system has available processing, storage, and access to dependencies, applications can be scheduled anywhere. Dynamic scheduling removes the human bottleneck from making decisions that are better left to machines. A cluster scheduler gathers resource information from all systems and figures out the best place to run the application.
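    The placement decision described above can be sketched in a few lines. This is a minimal illustration, not how any particular scheduler works: the `Node` and `Request` types, the field names, and the "most free CPU left over" scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float   # CPU cores currently unallocated
    free_mem_mb: int  # memory currently unallocated, in MiB


@dataclass
class Request:
    app: str
    cpu: float
    mem_mb: int


def schedule(request: Request, nodes: List[Node]) -> Optional[Node]:
    """Pick a node that fits the request; return None if nothing fits.

    Scoring rule (an assumption for this sketch): prefer the node that
    would have the most free CPU left after placement.
    """
    candidates = [
        n for n in nodes
        if n.free_cpu >= request.cpu and n.free_mem_mb >= request.mem_mb
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: n.free_cpu - request.cpu)
    # Reserve the resources so the next request sees updated capacity.
    best.free_cpu -= request.cpu
    best.free_mem_mb -= request.mem_mb
    return best


nodes = [Node("node-a", 2.0, 4096), Node("node-b", 8.0, 16384)]
placed = schedule(Request("web", cpu=1.5, mem_mb=2048), nodes)
print(placed.name if placed else "unschedulable")
```

    The caller asks for resources, not for a server; which node ends up running the application is the scheduler’s decision.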


    Having humans schedule application placement doesn’t scale. Humans get sick, take vacations (or at least they should), and are generally bottlenecks. As scale and complexity increase, it also becomes impossible for a human to remember where applications are running. Many companies try to scale by hiring more people. This exacerbates the problem because then scheduling needs to be coordinated between multiple people. Eventually, human scheduling resorts to keeping a spreadsheet (or similar solution) of where each application runs.


    Dynamic scheduling doesn’t mean operators have no control. There are still ways an operator can override or force a scheduling decision based on knowledge the scheduler may not have. Overrides and manual resource scheduling should be provided via an API and not a meeting request.

    Solving these problems is one of the main reasons Google wrote its internal cluster scheduler named Borg. In the Borg research paper, Google points out that: “Borg provides three main benefits: it (1) hides the details of resource management and failure handling so its users can focus on application development instead; (2) operates with very high reliability and availability, and supports applications that do the same; and (3) lets us run workloads across tens of thousands of machines effectively.”


    The role of a scheduler in any cloud native environment is much the same. Fundamentally, it needs to abstract away the many machines and allow users to request resources, not servers.

    Environment Isolation


    When applications are made of many services, infrastructure needs to provide a way to have defined isolation with all dependencies. Separating dependencies is traditionally managed by duplicating servers, networks, or clusters into development or testing environments. Infrastructure should be able to logically separate dependencies through application environments without full cluster duplication.


    Logically splitting environments allows for better utilization of hardware, less duplication of automation, and easier testing for the application. On some occasions, a separate testing environment is required (e.g., when low-level changes need to be made). However, application testing is not a situation where a full duplicate infrastructure should be required.

    Environments can be traditional permanent dev, test, stage, and production, or they can be dynamic and branch- or commit-based. They can even be segments of the production environment with features enabled via dynamic configuration and selective routing to the instances.

    Environments should consist of all the data, services, and network resources needed by the application. This includes things such as databases, file shares, and any external services. Cloud native infrastructure can create environments with very low overhead.

    Infrastructure should be able to provision the environment however it’s used. Applications should follow best practices to allow flexible configuration to support environments and discover the endpoints for supporting services through service discovery.
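    One common way to keep configuration flexible across environments is to read it from environment variables that the infrastructure sets per environment. The sketch below assumes hypothetical variable names (`APP_ENV`, `DATABASE_URL`, `LOG_LEVEL`) and illustrative default values; none of them come from the text.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    environment: str
    database_url: str
    log_level: str


def load_settings(env=os.environ) -> Settings:
    """Build settings from environment variables, with dev defaults.

    The variable names and defaults are hypothetical examples.
    """
    return Settings(
        environment=env.get("APP_ENV", "dev"),
        database_url=env.get("DATABASE_URL", "postgres://localhost:5432/app_dev"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


# The same code runs unchanged in any environment; only the injected values differ.
settings = load_settings({
    "APP_ENV": "test",
    "DATABASE_URL": "postgres://db.test.internal:5432/app",
})
print(settings.environment, settings.log_level)
```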

    Service Discovery


    Applications almost certainly depend on one or more services to provide business benefit. It is the responsibility of the infrastructure to provide a way for services to find each other on a per-environment basis.


    Some service discovery requires applications to make an API call, while others do it transparently with DNS or network proxies. It does not matter what tool is used, but it’s important that services use service discovery.


    While service discovery is one of the oldest network services (e.g., ARP and DNS), it is often overlooked and underutilized. Statically defining service endpoints in a per-instance text file or in code is not scalable and not suitable for a cloud native environment. Endpoint registration should happen automatically when services are created and endpoints become available or go away.


    Cloud native applications work together with infrastructure to discover their dependent services. Discovery mechanisms include, but are not limited to, DNS, cloud metadata services, and standalone service discovery tools (e.g., etcd and Consul).
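    To make per-environment discovery and automatic registration concrete, here is a small in-memory sketch. It is not the API of etcd, Consul, or any real tool; the `ServiceRegistry` class and its method names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ServiceRegistry:
    """Toy in-memory registry keyed by (environment, service name)."""

    def __init__(self) -> None:
        self._endpoints: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def register(self, env: str, service: str, endpoint: str) -> None:
        # Called by the infrastructure when an instance becomes available.
        self._endpoints[(env, service)].add(endpoint)

    def deregister(self, env: str, service: str, endpoint: str) -> None:
        # Called by the infrastructure when an instance goes away.
        self._endpoints[(env, service)].discard(endpoint)

    def lookup(self, env: str, service: str) -> List[str]:
        # Callers ask by name and environment, never by hardcoded address.
        return sorted(self._endpoints.get((env, service), set()))


registry = ServiceRegistry()
registry.register("test", "cart", "10.0.1.5:8080")
registry.register("test", "cart", "10.0.1.6:8080")
registry.register("prod", "cart", "10.1.0.9:8080")
registry.deregister("test", "cart", "10.0.1.5:8080")
print(registry.lookup("test", "cart"))
```

    Because lookups are scoped by environment, a test instance of one service never resolves the production endpoints of another.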

    State Management


    State management is how infrastructure can know what needs to be done, if anything, to an application instance. This is distinctly different from application life cycle because the life cycle applies to applications throughout their development process. States apply to instances as they are started and stopped.


    It is the application’s responsibility to provide an API or hook so that its current state can be checked. The infrastructure’s responsibility is to monitor the instance’s current state and act accordingly.


    The following are some application states:

    • Submitted
    • Scheduled (the instance has been assigned for deployment)
    • Ready
    • Healthy
    • Unhealthy
    • Terminating


    A brief overview of the states and corresponding actions would be as follows:

    1. An application is submitted to be run.
    2. The infrastructure checks the requested resources and schedules the application.
      While the application is starting, it will provide a ready/not ready status.
    3. The infrastructure will wait for a ready state and then allow for consumption of the application’s resources (e.g., adding the instance to a load balancer).
      If the application is not ready before a specified timeout, the infrastructure will terminate it and schedule a new instance of the application.
    4. Once the application is ready, the infrastructure will watch the liveness status and wait for an unhealthy status or until the application is set to no longer run.
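
    The steps above can be sketched as a small state machine. The state names follow the list earlier in this section; the function name, its parameters, the 30-second default timeout, and the rule that an unhealthy instance is always terminated are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    SCHEDULED = "scheduled"
    READY = "ready"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    TERMINATING = "terminating"


def next_state(state, ready=False, healthy=True, waited_s=0.0, ready_timeout_s=30.0):
    """Return the state the infrastructure should move an instance to."""
    if state is State.SUBMITTED:
        return State.SCHEDULED
    if state is State.SCHEDULED:
        if ready:
            return State.READY
        # Not ready in time: terminate so a new instance can be scheduled.
        if waited_s >= ready_timeout_s:
            return State.TERMINATING
        return State.SCHEDULED
    if state in (State.READY, State.HEALTHY):
        return State.HEALTHY if healthy else State.UNHEALTHY
    # Assumption: UNHEALTHY instances are terminated; TERMINATING is final.
    return State.TERMINATING


path = [State.SUBMITTED]
path.append(next_state(path[-1]))                 # scheduled
path.append(next_state(path[-1], ready=True))     # ready
path.append(next_state(path[-1], healthy=True))   # healthy
path.append(next_state(path[-1], healthy=False))  # unhealthy
path.append(next_state(path[-1]))                 # terminating
print(" -> ".join(s.value for s in path))
```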

    There are more states than those listed. States need to be supported by the infrastructure if they are to be correctly checked and acted upon. Kubernetes implements application state management through events, probes, and hooks, but every orchestration platform should have similar application management capabilities.


    A Kubernetes event is triggered when an application is submitted, scheduled, or scaled. Probes are used to check when an application is ready to serve traffic (readiness) and to make sure an application is healthy (liveness). Hooks are used for events that need to happen before or after processes start.
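
    On the application side, readiness and liveness checks are often exposed as simple HTTP endpoints that the platform polls. The following is a sketch using only the Python standard library; the paths `/readyz` and `/healthz`, the `AppStatus` flags, and the port are assumptions for illustration, not something the platform mandates.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppStatus:
    """Flags the application flips as it starts up and runs."""
    ready = False  # set to True once dependencies are connected
    alive = True   # set to False if the application detects it is wedged


def probe_response(path, status):
    """Map a probe path to an (HTTP status code, body) pair."""
    if path == "/readyz":
        return (200, "ready") if status.ready else (503, "not ready")
    if path == "/healthz":
        return (200, "ok") if status.alive else (503, "unhealthy")
    return (404, "not found")


STATUS = AppStatus()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code, body = probe_response(self.path, STATUS)
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking helper: serve probe endpoints on localhost."""
    HTTPServer(("127.0.0.1", port), ProbeHandler).serve_forever()
```

    An application would call `serve()` in a background thread at startup and set `STATUS.ready = True` once it can accept traffic; the infrastructure only needs to know the two paths.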


    The state of an application instance is just as important as application life cycle management. Infrastructure plays a key role in making sure instances are available and acting on them accordingly.

    Monitoring and Logging


    Applications should never have to request to be monitored or logged; they are basic assumptions for running on the infrastructure. More importantly, configuration for monitoring and logging, if required, should be declarative as code in the same way that application resource requests are made. If you have all the automation to deploy applications but can’t dynamically monitor services, there is still work to be done.


    State management (i.e., process health checks) and logging deal with individual instances of an application. The logging system should be able to consolidate logs based on the applications, environments, tags, or any other useful metadata.
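
    Consolidating logs by metadata is much easier when every log line carries that metadata in a structured form. A minimal sketch with the standard `logging` module follows; the JSON layout and the metadata keys (`app`, `env`, `instance`) are assumptions chosen for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with fixed metadata attached."""

    def __init__(self, metadata):
        super().__init__()
        self.metadata = dict(metadata)

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update(self.metadata)
        return json.dumps(entry, sort_keys=True)


def build_logger(name, metadata, stream=None):
    """Create a logger whose output a log system can group by metadata."""
    handler = logging.StreamHandler(stream)  # stream=None means stderr
    handler.setFormatter(JsonFormatter(metadata))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(
    "checkout",
    {"app": "checkout", "env": "test", "instance": "checkout-7"},
    stream=buffer,
)
log.info("order placed")
print(buffer.getvalue().strip())
```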


    Applications should, as much as possible, not have single points of failure and should have multiple instances running. If an application has 100 instances running, the monitoring system should not trigger an alert if a single instance becomes unhealthy.
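
    The point about not alerting on a single failed instance can be expressed as a rule over the whole service rather than over instances. In this sketch the function name and the 90% default threshold are assumptions, not recommended values.

```python
def should_alert(healthy_instances, total_instances, min_healthy_ratio=0.9):
    """Alert only when the healthy share of instances drops below a threshold."""
    if total_instances <= 0:
        return True  # nothing is running at all
    return healthy_instances / total_instances < min_healthy_ratio


print(should_alert(99, 100))  # one unhealthy instance out of 100
print(should_alert(80, 100))  # a fifth of the service is unhealthy
```

    The first call does not trigger an alert and the second does; the unhealthy instance in the first case is left to the infrastructure to replace.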


    Monitoring looks holistically at applications and is used for debugging and verifying desired states. Monitoring is different from alerting, because alerting should be triggered based on the metrics and service-level objectives (SLOs) of the applications.

    Metrics Aggregation

    Metrics are required to know how applications behave when they’re in a healthy state. They can also provide insight into what may be broken when applications are unhealthy. Just like monitoring, metrics collection should be requested as code as part of an application definition.


    The infrastructure can automatically gather metrics around resource utilization, but it is the application’s responsibility to present metrics for service-level indicators.


    While monitoring and logging are application health checks, metrics provide the needed telemetry data. Without metrics, there is no way of knowing if the application is meeting service-level objectives to provide business value.


    It may be tempting to pull telemetry and health check data from logs, but be careful, because logging requires post-processing and more overhead than application-specific metric endpoints. When it comes to gathering metrics, you want as close to real-time data as possible. This requires a simple and low-overhead solution that can scale. Logging should be used for debugging, and a delay for data processing should be expected.


    Similarly to logging, metrics are usually gathered at an instance level and then composed together in aggregate to provide a view of a complete service instead of individual instances.
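
    As a rough illustration of composing instance-level metrics into a service-level view, the sketch below sums request and error counts across instances and derives an error rate. The `InstanceMetrics` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceMetrics:
    instance: str
    requests: int
    errors: int


def aggregate(samples: List[InstanceMetrics]) -> dict:
    """Combine per-instance counters into one view of the whole service."""
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)
    error_rate = total_errors / total_requests if total_requests else 0.0
    return {
        "instances": len(samples),
        "requests": total_requests,
        "errors": total_errors,
        "error_rate": error_rate,
    }


service_view = aggregate([
    InstanceMetrics("cart-1", requests=100, errors=1),
    InstanceMetrics("cart-2", requests=200, errors=3),
    InstanceMetrics("cart-3", requests=100, errors=0),
])
print(service_view)
```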


    Once applications present a way to gather metrics, it is the infrastructure’s job to scrape, consolidate, and store the metrics for analysis. Endpoints for gathering metrics should be configurable on a per-application basis, but the data formatting should be standardized so all metrics can be viewed in a single system.
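
    One way to standardize the data format is for every application to render its counters in the same line-oriented text form, which a scraper can then parse uniformly. The sketch below is loosely modeled on the Prometheus text exposition format; the function name and the metric and label names are assumptions for illustration.

```python
def render_metrics(counters, labels):
    """Render counters as text lines: name{label="value",...} value."""
    label_text = ",".join(
        '{}="{}"'.format(key, value) for key, value in sorted(labels.items())
    )
    lines = []
    for name in sorted(counters):
        lines.append("# TYPE {} counter".format(name))
        lines.append("{}{{{}}} {}".format(name, label_text, counters[name]))
    return "\n".join(lines) + "\n"


output = render_metrics(
    {"http_requests_total": 42},
    {"app": "checkout", "env": "test"},
)
print(output, end="")
```

    The application decides where this text is served; the infrastructure only needs the format to stay the same across applications.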

    Debugging and Tracing


    Applications are easy to debug during development. Integrated development environments (IDEs), code breakpoints, and running in debug mode are all tools engineers have at their disposal when writing code.


    Introspection is much more difficult for deployed applications. This problem is more acute when applications are composed of tens or hundreds of microservices or independently deployed functions. It may also be impossible to have tooling built into applications when services are written in multiple languages and by different teams.


    The infrastructure needs to provide a way to debug a whole application and not just the individual services. Debugging can sometimes be done through the logging system, but reproducing bugs requires a shorter feedback loop.


    Debugging is a good use of dynamic configuration, discussed earlier. When issues are found, applications can be switched to verbose logging, without restarting, and traffic can be routed to the instances selectively through application proxies.
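
    Switching to verbose logging without a restart can be as simple as applying a new log level whenever the dynamic configuration changes. This sketch assumes the configuration arrives as a dictionary with a hypothetical `log_level` key; how it is fetched or pushed is left out.

```python
import logging

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)


def apply_dynamic_config(config):
    """Apply freshly received configuration to the running process."""
    level_name = str(config.get("log_level", "INFO")).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back on unknown level names
    logger.setLevel(level)


apply_dynamic_config({"log_level": "debug"})
print(logger.isEnabledFor(logging.DEBUG))
```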


    If the issue cannot be resolved through log output, then distributed tracing provides a different interface to visualize what is happening. Distributed tracing systems such as OpenTracing can complement logs to help humans debug problems.


    Tracing provides shorter feedback loops for debugging distributed systems. If it cannot be built into applications, it can be done transparently by the infrastructure through proxies or traffic analysis. When you are running any coordinated applications at scale, it is a requirement that the infrastructure provides a way to debug applications.
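
    The core idea a proxy relies on is small: attach an identifier to a request once, and forward it unchanged on every downstream call so the hops can be stitched together later. The sketch below shows only that propagation step; the header name `X-Trace-Id` and the helper functions are hypothetical and not taken from OpenTracing or any specific tracing system.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch


def ensure_trace_id(headers):
    """Return a copy of headers with a trace ID, reusing an inbound one."""
    outgoing = dict(headers)
    if not outgoing.get(TRACE_HEADER):
        outgoing[TRACE_HEADER] = uuid.uuid4().hex
    return outgoing


def call_downstream(service, inbound_headers, spans):
    """Record a hop and return the headers to forward to the next service."""
    headers = ensure_trace_id(inbound_headers)
    spans.append({"trace_id": headers[TRACE_HEADER], "service": service})
    return headers


spans = []
headers = call_downstream("frontend", {}, spans)
headers = call_downstream("cart", headers, spans)
headers = call_downstream("payments", headers, spans)
print(len({span["trace_id"] for span in spans}))
```

    All three recorded hops share one trace ID, which is what lets a tracing backend show them as a single request.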


    While there are many benefits and implementation details for setting up tracing in a distributed system, we will not discuss them here. Application tracing has always been important, and is increasingly difficult in a distributed system.



    Application requirements have changed: a server with an operating system and package manager is no longer enough. Applications now require coordination of services and higher levels of abstraction. The abstractions allow resources to be separated from servers and consumed programmatically as needed.


    The requirements laid out in this chapter are not all the services that infrastructure can provide, but they are the basis for what cloud native applications expect. If the infrastructure does not provide these services, then applications will have to implement them, or they will fail to reach the scale and velocity required by modern business.


    Infrastructure won’t evolve on its own; people need to change their behavior and think in a fundamentally different way about what it takes to run an application. Luckily, there are projects that build on experience from companies that have pioneered these solutions.


    Applications depend on the features and services of infrastructure to support agile development. Infrastructure requires applications to expose endpoints and integrations to be managed autonomously. Engineers should use existing tools when possible and build with the goal of designing resilient, simple solutions.

