zoukankan      html  css  js  c++  java
  • 转:Google and Google cluster architecture

    Tue, Aug 15, 2006, by TheRaveN

     

    Unlike common high traffic servers’ architecture, the Google cluster architecture is based on strong software and lots of PC’s. Some say more then 15,000 PC’s are taking part of Google’s phenomenon. Let’s take a look at this wonderful search engine intestine.


    Google’s architecture provides reliability in the Google’s servers and Pc’s environment at the software
    level, by replicating services across many different machines. Google is also proud of its own failures detecting mechanism which handles different threats and malfunctioning in its web .

    When a user enters a query to Google the user’s browser first performs a domain name system (DNS) lookup to map www.google.com to a particular IP address. To provide sufficient capacity to handle query traffic, the Google service is being spread to multiple clusters distributed worldwide.

    Each cluster has around a few thousand machines, and the geographically distributed setup protects Google against disaster at the data centers (like those arising from earthquakes and large scale power failures).

    A DNS-based load-balancing system selects a cluster by the user’s geographic location to each physical cluster. The load-balancing system minimizes back and forward trips for the user’s request.

    The user’s browser then sends HTTP request to one of these clusters, and thereafter, the processing of that query is entirely local to that cluster. A hardware-based load balancer in each cluster monitors the available set of Google Web servers (GWSs) and performs local load balancing of requests across a set of them. After receiving a query, a GWS machine coordinates the query execution and formats the results into HTML response to the user’s browser. Query execution consists of two major phases.
    In the first phase, the index servers consult an inverted index that maps each query word to a matching list of documents (the hit list). The index servers then determine a set of relevant documents by intersecting the hit lists of the individual query words, and they compute a relevance score for each document. This relevance score determines the order of results on the output page. The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data. Fortunately, the search is highly parallelizable by dividing the index into pieces

    (Index shards), each having a randomly chosen subset of documents from the full index. A pool of machines serves requests for each shard, and the overall index cluster contains one pool for each shard. Each request chooses a machine within a pool using an intermediate load balancer which means – each query goes to one machine (or a subset of machines) assigned to each shard. If a shard’s replica goes down, the load balancer will avoid using it for queries, and other components of our cluster management system will try to revive it or eventually replace it with another machine. During the downtime, the system capacity is reduced in proportion to the total fraction of capacity that this machine represented. However, service remains uninterrupted, and all parts of the index remain available. The final result of this first phase of query execution is an ordered list of document identifiers (docids). The second phase involves taking this list of docids and computing the actual title and uniform resource locator of these documents, along with a query-specific document summary.

    Document servers (docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context snippet. As with the index lookup phase, the strategy is to partition the processing of all documents by randomly distributing documents into smaller shards, having multiple server replicas responsible for handling each shard, and routing requests through a load balancer.
  • 相关阅读:
    Eclipse使用之将Git项目转为Maven项目, ( 注意: 最后没有pom.xml文件的, 要转化下 )
    静态类型语言,动态类型语言的区别
    UT, FT ,E2E 测试的意思
    git tag 用法 功能作用
    java 常用异常及作用
    git checkout -b 分支name 分支的新建, 切换, 删除, 查看
    项目上有红色感叹号, 一般就是依赖包有问题, remove依赖,重新加载,maven的也行可认删除,自己也会得新加载
    密码学初级教程(五)消息认证码MAC-Message Authentication Code
    密码学初级教程(六)数字签名 Digital Signature
    密码学初级教程(四)单向散列函数-指纹-
  • 原文地址:https://www.cnblogs.com/Donal/p/1651974.html
Copyright © 2011-2022 走看看