zoukankan html css js c++ java

[译]Cassandra 架构简述

本文翻译主要来自Datastax的cassandra1.2文档。http://www.datastax.com/documentation/cassandra/1.2/index.html。此外还有一些来自于相关官方博客。

该翻译作为ISE实验室大数据组Laud的学习材料的一部分，适合对Cassandra已经有一定了解的读者。

未经本人许可，请勿转载。

一个Cassnadra的简介。（下文有时候又将Cassandra简称C）

Cassandra被设计来通过没有单点故障的多节点模式去处理海量数据工作负载。他的架构是基于理解系统和硬件故障可以而且会发生的基础上的。C通过所有节点都相同并且数据分布在所有节点上的p2p分布式系统来解决故障问题。集群中的每个节点每秒都在交换信息。每个节点上的commit log 捕获写行为来确保数据的持久化。数据也会被写到一个内存结构中，叫做memtable，然后当内存结构满了的时候就写数据到磁盘文件中，叫做SSTable。所有的写入都是自动分区和复制的。

cassandra是一种面向行的数据库。C的架构允许任何授权的用户连接任意数据中心的任意的节点，并使用cql访问数据。为了简化使用，cql使用和sql类似的语法。从cql的视角出发，database是由tables组成的。典型地，一个集群中每个应用拥有一个keyspace。开发者可以通过cqlsh调用cql，也可以使用其他驱动。

客户端的读写请求可以到达集群的任意节点。当一个客户连接到一个节点做了一个请求时，那个节点服务器就作为这个特定的客户操作的一个coordinator了。协调器扮演了客户应用和拥有用户请求的数据的节点之间的代理（proxy）的角色。协调器决定了集群环中的哪些节点应该响应请求。（更多信息，请查阅关于用户请求）

配置C的关键组件列表：

Gossip：一个p2p的交流协议来发现和共享其他节点的位置和状态信息。
gossip信息也被每个节点保存在本地，这样当一个节点重启时，它能够立刻使用这些信息。你可能会想清空某个节点上的gossip历史，比如节点ip地址改变了等原因。（译者注：大概就是system.local表）
Partitioner：一个分区器决定了如何分布数据到各个节点。选择一个分区器决定了哪个节点存储数据的第一个备份。
你必须设置分区器的类型，并且指派给每个节点一个num_tokens值。如果没有使用虚拟节点的话，使用initial_token来代替。（译者注：虚拟节点是1.2中新增的）
副本存放策略：C存储数据的备份到多个节点上去来确保可用性和故障容忍。一个备份策略决定了哪些节点存放备份。it is not unique in any sense.it is not unique in any sense. 当你创建了一个keyspace的时候，你必须指定副本存放策略和你想备份的数量。
Snitch：一个snitch定义了拓扑信息，这些信息是副本备份侧罗和请求路由时经常使用的。当你创建一个集群的时候需要配置一个snitch。snitch is responsible for 知道在你的网络拓扑中节点的位置以及通过聚合机器成为数据中心或者rack时的分配副本。
cassandra.yaml：C的配置文件。在这个文件中，你要设置集群的初始化信息，表的缓存参数，资源的使用参数，超时设置，客户端连接，备份以及安全策略。
C将属性都存到系统keyspace中。你需要对每一个keyspace或者columnfamily进行存储配置（比如使用cql）。
默认的，一个节点被设置为存储他管理的数据到/var/lib/cassandra目录。在一个生产环境中，你需要修改commitlog目录到一个其他硬盘上去（别和data file 在一个硬盘上）。

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

关于内部通信Gossip：

cassandra使用称为gossip的协议来发现加入C集群中的其他节点的位置和状态信息。这是一个p2p的交流协议，每个节点定期的交换他们自己的和他们所知道的其他人的状态信息。gossip进程每秒都在进行，并与至多三个节点交换状态信息。节点交换他们自己和所知道的信息，于是所有的节点很快就能学习到整个集群中的其他节点的信息。gossip信息有一个相关的版本号，于是在一次gossip信息交换中，旧的信息会被新的信息覆盖重写。

要阻止分区进行gossip交流，那么在集群中的所有节点中使用相同的seed list（译者注：指的是cassandra。yaml中的seeds）。默认的，在重新启动时，一个节点记得他曾经gossip过得其他节点。

注意：种子节点的指定除了启动起gossip进程外，没有其他的目的。种子节点不是一个单点故障，他们在集群操作中也没有其他的特殊目的，除了引导节点以外..

设置Gossip设置

任务：

当一个节点第一次启动的时候，他去yaml中读取配置，得到集群的名字，并得到从哪些seeds中获取其他节点的信息，还有其他的一些参数，比如端口，范围等等。。

属性	描述
cluster_name
listen_address	与其他节点连接的ip
seed_provider
storage_port	内部节点交流端口（默认7000），每个节点之间必须相同
initial_token	在1.1以及之前，决定节点的数据的管理范围
num_tokens	在1.2以及之后，决定节点的数据的管理范围

清理gossip状态:

-Dcassandra.load_ring_state= false

关于故障检测和修复

C使用信息来避免路由用户的请求到坏了的节点（C还能避免路由到可用但是性能很差的节点，通过动态snitch技术）

Rather than have a fixed threshold for marking failing nodes, Cassandra uses an accrual detection mechanism to calculate a per-node threshold that takes into account network performance, workload, or other conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster. In Cassandra, configuring the phi_convict_threshold property adjusts the sensitivity of the failure detector. Use default value for most situations, but increase it to 12 for Amazon EC2 (due to the frequently experienced network congestion).

（译者注：这是04年的一篇论文的失效检测算法

1.Hayashibara, N., Defago, X., Yared, R. & Katayama, T. The phi; accrual failure detector. in Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004 66–78 (2004). doi:10.1109/RELDIS.2004.1353004

）

一个节点的当机往往不代表这个节点永远的离开了，因此并不会自动的从环中删除。其他的节点会定期的尝试与失效节点联系看看他们恢复了没。要永久的改变一个节点的关系，administrators must explicitly add or remove nodes from a Cassandra cluster using the nodetool utility.

当一个节点返回的时候，他可能错过了他需要维护的副本数据的写入命令。一旦失效检测标记一个节点当机了，错过的写入就会被存储到其他的副本中一段时间，叫做 hinted handoff。 当一个节点当机超过max_hint_windows_in_ms(默认3小时)，hints就不在存储了。这时候你应该等节点启动后运行修复程序了。

此外，你应该日常地运行nodetool repair 在所有的节点上，以保证他们的数据的一致性。

For more explanation about recovery, see Modern hinted handoff.

 （该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

数据分配和备份

在C中，数据分配和备份是同时的。这是由于C被设计成p2p系统，可以拷贝数据和在节点群组中分配副本。数据被组织成表并且由主键进行标识。主键决定了数据存在哪个节点上。行的副本就叫做备份了，当数据第一次被写入时候，他也被称做一个备份。

When your create a cluster, you must specify the following:

Virtual nodes: assigns data ownership to physical machines.
Partitioner: partitions the data across the cluster.
Replication strategy: determines the replicas for each row of data.
Snitch: defines the topology information that the replication strategy uses to place replicas.

Consistent hashing

比如，有如下数据：

jim	age: 36	car: camaro	gender: M
carol	age: 37	car: bmw	gender: F
johnny	age: 12	gender: M
suzy	age: 10	gender: F

Cassandra assigns a hash value to each primary key:

Primary key	Murmur3 hash value
jim	-2245462676723223822
carol	7723358927203680754
johnny	-6723372854036780875
suzy	1168604627387940318

集群中的每个节点都负责一部分数据，哈希范围是：

Node	Murmur3 start range	Murmur3 end range
A	-9223372036854775808	-4611686018427387903
B	-4611686018427387904	-1
C	0	4611686018427387903
D	4611686018427387904	9223372036854775807

Cassandra places the data on each node according to the value of the primary key and the range that the node is responsible for. For example, in a four node cluster, the data in this example is distributed as follows:

Node	Start range	End range	Primary key	Hash value
A	-9223372036854775808	-4611686018427387903	johnny	-6723372854036780875
B	-4611686018427387904	-1	jim	-2245462676723223822
C	0	4611686018427387903	suzy	1168604627387940318
D	4611686018427387904	9223372036854775807	carol	7723358927203680754

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

virtual nodes

1、当增删节点的时候不需要重新平衡鸡群了。当一个节点加入时，他就假定均匀的对其他节点的一部分数据负责。当一个节点挂了，那么负载也会均匀的传递给其他节点。

2、重建一个死亡节点的时候会更快了。因为他涉及到了其他各个节点，而且数据数据会被增量式的发送到代替节点而不是等待着知道验证结束。

3、异构集群的优化。可以指派成比例的虚拟节点数到小集群和大集群上去。

Vnodes change this paradigm from one token or range per node, to many per node. Within a cluster these can be randomly selected and be non-contiguous, giving us many smaller ranges that belong to each node.

The top portion of the graphic shows a cluster without virtual nodes. In this paradigm, each node is assigned a single token that represents a location in the ring. Each node stores data determined by mapping the row key to a token value within a range from the previous node to its assigned value. Each node also contains copies of each row from other nodes in the cluster. For example, range E replicates to nodes 5, 6, and 1. Notice that a node owns exactly one contiguous range in the ring space.

The bottom portion of the graphic shows a ring with virtual nodes. Within a cluster, virtual nodes are randomly selected and non-contiguous. The placement of a row is determined by the hash of the row key within many smaller ranges belonging to each node.

假设我们有30个节点，备份3. 一个节点完全死了，并且我们需要加入一个替代点。这时候替代节点需要得到3个不同范围的副本来重新生成数据，这不仅仅包括他管理的第一份数据（译者注：应该是指的通过hash计算到的归他负责的那份数据），也包括他负责的第二备份、第三备份的副本（尽管do recall no replica has ‘priority’ over another in Cassandra）。

因为我们的备份数是3，那么每个缺失的副本还有2个。一个节点负责了三份数据，所以我们涉及了至多6个节点。在当前的实现中，每份数据cassanra只会从一个备份中恢复，于是我们只需要和三个节点沟通就能恢复数据。如下图：

我们是想最小化这个操作的时间的。因为如果在这个期间又挂了一个节点，那么我们可能对某些范围的数据而言只剩一个备份了，那么这时候任何一致性级别要求大于1的都将失败。Even if we used all 6 possible replica nodes, we’d only be using 20% of our cluster, however.

而virtual node情况下，拷贝恢复的数据量是一样的，但是速度大大加快了：

Repair is two phases, first a validation compaction that iterates all the data and generates a Merkle tree, and then streaming when the actual data that is needed is sent. The validation phase might take an hour, while the streaming only takes a few minutes, meaning your replaced disk sits empty for at least an hour.

那么在virtual node情况下：with vnodes you’ll gain two distinct advantages in this situation. The first is that since the ranges are smaller, data will be sent to the damaged node in a more incremental fashion instead of waiting until the end of a large validation phase. The second is that the validation phase will be parallelized across more machines, causing it to complete faster.

新老机器升级的时候，一定会想让好机器管理更多的数据。

If you have vnodes it becomes much simpler, you just assign a proportional number of vnodes to the larger machines. If you started your older machines with 64 vnodes per node and the new machines are twice as powerful, simply give them 128 vnodes each and the cluster remains balanced even during transition.

具体设置方法：

Set the number of tokens on each node in your cluster with the num_tokens parameter in the cassandra.yaml file.

Generally when all nodes have equal hardware capability, they should have the same number of virtual nodes. If the hardware capabilities vary among the nodes in your cluster, assign a proportional number of virtual nodes to the larger machines. For example, you could designate your older machines to use 128 virtual nodes and your new machines (that are twice as powerful) with 256 virtual nodes.

Set the number of tokens on each node in your cluster with the num_tokens parameter in the cassandra.yaml file. The recommended value is 256. Do not set the initial_token parameter.

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

数据复制：

The total number of replicas across the cluster is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one node. A replication factor of 2 means two copies of each row, where each copy is on a different node.

Two replication strategies are available:

SimpleStrategy: Use for a single data center only. If you ever intend more than one data center, use the NetworkTopologyStrategy.
NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier to expand to multiple data centers when required by future expansion.

SimpleStrategy:

第一个备份时根据哈希环算的，第二第三备份时顺时针沿环取得，而不考虑拓扑结构

NetworkTopologyStrategy：

Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple data centers. This strategy specify how many replicas you want in each data center.

NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to power, cooling, or network issues.

决定设置有多少备份在每个节点上时，两个主要的考虑是：

1、能本地读，而不要跨数据中心去读

2、故障场景。

1、每个数据中心备份两份数据。这种设置可以容忍单点故障时，仍然从本地读取，（一致性：ONE）
2、每个数据中心3个备份。这种设置可以容忍单点故障时，仍然从本地读取，（一致性：LOCAL_QUORUM）

不对称的备份设置，Asymmetrical replication groupings are also possible. For example, you can have three replicas per data center to serve real-time application requests and use a single replica for running analytics.

To set the replication strategy for a keyspace, see CREATE KEYSPACE.

When you use NetworkToplogyStrategy, during creation of the keyspace strategy_options, you use the data center names defined for the snitch used by the cluster. To place replicas in the correct location, Cassandra requires a keyspace definition that uses the snitch-aware data center names. For example, if the cluster uses the PropertyFileSnitch, create the keyspace using the user-defined data center and rack names in the cassandra-topologies.properties file. If the cluster uses the EC2Snitch, create the keyspace using EC2 data center and rack names.

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

关于分区器：

分区器决定了数据和副本在节点中如何分配. 最基本的,一个分区器就是一个用来计算一个行健的token的哈希函数.

Murmur3Partitioner 和randomPartitioner 都使用token来帮助均匀指派分区。即使表使用了不同的row key，比如用户名、时间戳。此外，读写请求也均匀的分布并且能达到负载均衡，因为哈喜欢的每一部分都接收到了相等数量的行。

Cassandra offers the following partitioners:

Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values.
RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes

关于Mumur3Partition：

Murmur3Partitioner 提供了更快的哈希函数。集群中确定了某个分区器，就不能再变了。（因此旧数据升级到1.2时，要将分区器改成Random才行）Murmur3Partitioner使用Murmur哈希函数，得到的哈希值的范围是：-2⁶³ to +2⁶³.

关于RandomPartitioner：

仍然可用，他是通过MD5计算哈希的。值是0-2¹²⁷ -1

关于ByteOrderedPartitioner:

有序分区器。将rowkey转化为十六进制数值，比如‘A’=x41.这样你就可以进行有序的查找了。比如username是key，你就可以找名字在Jake 和Joe之间的用户了。这种查询在前两种分区器中是做不到的。

尽管能做范围查找让有序分区器听起来不错，但是事实上通过表索引也可以达到这种效果。

不推荐使用这种排序器：

Difficult load balancing: More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges (formerly token ranges) based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.

Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then the writes are not be distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.

Uneven load balancing for multiple tables: If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

About snitches

snitch确定哪个数据中心或者rack被写或者读，并且通知Cassandra有关网络拓扑的信息，使得请求能够有效地路由，并且允许C通过组合数据中心和rack中的机器来分配副本。

Cassandra会尽最大努力达到同一个rack上不超过一个副本。

注意：如果你在数据已经被写入集群后，修改了snitch，你必须运行一次全修复了，因为snitch影响了副本的存放位置。

SimpleSnitch

默认的，不识别数据中心和rack信息。可以用在单数据中心环境下。使用这种snitch时候，keyspace strategy option的参数只需设置 replication factor备份数量。

RackInferringSnitch

他假设节点的ip代表了一定的含义，由此可以确定数据中心/rack中节点的位置. 一般可以以这个snitch为例子，自己定制一个snitch。ip示意图如下：

PropertyFileSnitch

这个snitch通过用户自己描述网络详情来确定节点的位置。描述可以写在cassandra-topology.properties文件中。当你的节点ip并不同意或者你有很复杂的副本组合需求时。当使用这种snitch的时候，你可以定义你的数据中心名字。注意确保你定义的数据中心名字和你的keyspace strategy_options参数中的数据中心名字一致（译者注：这不是废话么？）。

集群中的每个节点都应该在配置文件中进行描述。并且这个文件应该在集群中的每个节点上都相同。注意这个配置文件的位置和发行版有关，see Locations of the configuration files or DataStax Enterprise File Locations.

举例：

If you had non-uniform IPs and two physical data centers with two racks in each, and a third logical data center for replicating analytics data, the cassandra-topology.properties file might look like this:

多数据中心的snitch手工配置举例# Data Center One

175.56.12.105 =DC1:RAC1
175.50.13.200 =DC1:RAC1
175.54.35.197 =DC1:RAC1

120.53.24.101 =DC1:RAC2
120.55.16.200 =DC1:RAC2
120.57.102.103 =DC1:RAC2

# Data Center Two

110.56.12.120 =DC2:RAC1
110.50.13.201 =DC2:RAC1
110.54.35.184 =DC2:RAC1

50.33.23.120 =DC2:RAC2
50.45.14.220 =DC2:RAC2
50.17.10.203 =DC2:RAC2

# Analytics Replication Group

172.106.12.120 =DC3:RAC1
172.106.12.121 =DC3:RAC1
172.106.12.122 =DC3:RAC1

# default for unknown nodes default =DC3:RAC1

GossipingPropertyFileSnitch

The GossipingPropertyFileSnitch defines a local node's data center and rack; it uses gossip for propagating this information to other nodes. Theconf/cassandra-rackdc.properties file defines the default data center and rack used by this snitch:

dc =DC1
 rack =RAC1

The location of the conf directory depends on the type of installation; see Locations of the configuration files or DataStax Enterprise File Locations.

To migrate from the PropertyFileSnitch to the GossipingPropertyFileSnitch, update one node at a time to allow gossip time to propagate. The PropertyFileSnitch is used as a fallback when cassandra-topologies.properties is present.

EC2Snitch

假设集群在Amazon EC2环境下，如果集群都在同一个地理位置（译者注：EC2可以让你选择你的虚拟机的物理位置）。The region is treated as the data center and the availability zones are treated as racks within the data center..

比如你有一台机器在：us-east-1a。那么us-east就是数据中心名字，1a就是rack位置。

When defining your keyspace strategy option, use the EC2 region name (for example,``us-east``) as your data center name.

EC2MultiRegionSnitch

跟上面类似，只是在多个地理位置下。As with the EC2Snitch, regions are treated as data centers and availability zones are treated as racks within a data center. For example, if a node is in us-east-1a, us-east is the data center name and 1a is the rack location.

以下是在EC2上配置的细节：

This snitch uses public IPs as broadcast_address to allow cross-region connectivity. This means that you must configure each Cassandra node so that thelisten_address is set to the private IP address of the node, and the broadcast_address is set to the public IP address of the node. This allows Cassandra nodes in one EC2 region to bind to nodes in another region, thus enabling multiple data center support. (For intra-region traffic, Cassandra switches to the private IP after establishing a connection.)

Additionally, you must set the addresses of the seed nodes in the cassandra.yaml file to that of the public IPs because private IPs are not routable between networks. For example:

seeds: 50.34.16.33, 60.247.70.52

To find the public IP address, run this command from each of the seed nodes in EC2:

curl http://instance-data/latest/meta-data/public-ipv4

Finally, be sure that the storage_port or ssl_storage_port is open on the public IP firewall.

When defining your keyspace strategy option, use the EC2 region name, such as ``us-east``, as your data center names.

Dynamic snitching

监控从副本中读取的性能，并且根据历史去选择最佳副本。

默认的,所有的snitch都使用动态snitch布局来监控读延迟，并在路由时跳过性能很菜的节点。动态路由默认就启动着，而且也推荐用在在大多数生产环境中。

Configure dynamic snitch thresholds for each node in the cassandra.yaml configuration file.

动态snitching是从Cassandra 0.6.5版本就开始有的，但是有很多地方一直捉摸不透。这篇博文尽可能揭开你可能想知道的所有谜团。

首先解答什么事动态snitch。先来回顾下什么是snitch。一个snitch的功能是确定从哪个数据中心和rack上读写。那么，为什么说是“动态”呢？

在cassandra的写过程中，我们发送消息给节点。然后就阻塞知道符合一致性级别的数量被写成功了，因此我们并不做任何统计信息。而读过程中，Cassandra只向一个节点请求数据，并且依赖于一致性级别和读修复，他会用校验和去询问其他的副本。看不懂的看英文：

So, why would that be ‘dynamic?’ This comes into play on the read side only (there’s nothing to be done for writes since we send them all and then block to until the consistency level is achieved.) When doing reads however, Cassandra only asks one node for the actual data, and, depending on consistency level and read repair chance, it asks the remaining replicas for checksums only. This means that it has a choice of however many replicas exist to ask for the actual data, and this is where the dynamic snitch goes to work.

因为只有一个副本会发送我们需要的完整的数据，因此我们需要选择最好的副本去做请求。The dynamic snitch handles this task by monitoring the performance of reads from the various replicas and choosing the best one based on this history.

在现在的Cassandra中，读修复事实上已经很少了，因为hints机制目前很可靠。因此，那当我们的一致性级别是one的时候，能不能最大化我们的缓存能力呢？可！这就是动态snitch中badness 阈值的设计缘由。这个参数是一个百分比，它定义了

how much worse the natural first replica must perform, in order to switch to a different one. 因此给定副本X，Y，Z。 X副本江北首选知道他的性能比Y和Z低badness threshold。这意味着当所有节点都很健康的时候，节点中的缓存能力是最大的。但是如果有事情变坏，特别是比badness_threshold还差，那么Cassandra将继续通过使用其他副本来提供可用服务。

动态snitch并不会确定副本的位置，这是你的snitch要做的。动态snitch简单的包装了snitch，并在读数据时提供了自适应性。

这是如何完成的？最初，动态snitch是想模仿failure detector，因为失效检测也是自适应的。为了节省CPU，他采用了两种方法，一种很简单（接收更新），一种很复杂（计算每个host的得分）。

默认的，设置为每100ms计算一次得分。The updates are capped at a maximum of 10,000 per scoring interval,但是这引入了一些新的问题。首先，如果我们觉得一个节点性能不行，不再去读他了，那么我们如何知道他什么时候恢复了呢？因为没有新的信息来评价他，我们不得不增加一条新的规则：每十分钟重置一次评分表。这样的话，我们又不得不在重置后马上采样一些读，其中这次读取是对每个副本都公正的。

第二，10000是不是一个好的苏子呢，我们不知道，因为这依赖于Cassandra运行的机器的好坏。最后，延迟是唯一并且最好的判定从哪里读取数据的指标吗？

在接下来的版本中，对上述问题做了一些处理：1、基于概率的随机采样；2、还考虑其他因此，比如节点是否正在做compact。原文：

In the next release of Cassandra, these latter two problems have been addressed. Instead of sampling a fixed amount of updates, we now use a statistically significant random sample, and weight more recent information heavier than past information. But wait, there’s more! Instead of relying purely on latency information from reads, we now also consider other factors, like whether or not a node is currently doing a compaction, since that can often penalize reads.

最后一个问题：当缺少信息的时候，动态snitch不能工作。显然我们需要等待超过rpc_timeout的时间才能知道一个节点读取失败了，但是我们不想等这么久。如何破？我们定义了另一个时间，In actuality, we do have a signal we can respond to, and that is time itself. Thus, if a node suddenly becomes a black hole, we’ll only throw reads at it for one scoring interval, and when the next score is calculated we’ll consider latency, the node’s state (called a severity factor) and how long it has been since the node last replied, penalizing it so that we stop trying to read from it (badness_threshold permitting.)

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

About client requests

这章比较简单，可以扫一眼原文，只摘抄重要的话在下面。

写成功指的是数据被写入到了commit log中并且将更改希尔了memtable中。

About multiple data center write requests

多个数据中心下，C在每个远程数据中心下都选择一个coordinator节点来处理副本请求。和客户端连接的coordinator只需要将写请求转发给其他数据中心的coordinator就行了。

About read requests

coordinator发给副本的读请求包括两种：

1、直接的读请求

2、一个后台的读修复请求。

对于直接读请求来说，需要联系的副本个数由客户端指定的一致性级别决定。后台读修复请求则是被发送给任何没收到直接读请求的其他副本。读修复请求确保请求的行在所有的副本上是一致的。

因此，协调器首先根据一致性级别联系副本。对于那些当前最先反应的副本，协调器会发送这些请求到副本。这些被联系的节点就返回请求的数据。如果多个节点都被联系了，那么来自每个副本的行数据（rows）会被在内存中进行比较来看看他们是不是一致的。如果他们不是，那么拥有最新数据（基于时间戳）的副本将被coordinator返回给用户。

为了确保所有的副本都拥有最新版本的频繁读取的数据，协调器也会在后台去联系所有拥有该行数据的剩下的副本去比较数据。如果副本是不一致的，协调器会让旧数据副本更新这一行到最新版本。这就是读修复。

举例：3备份集群，度一致性级别是QUORUM，也就是说3个中的两个需要被联系来满足读请求。假设联系的副本具有不同的版本，那么拥有最新版本数据的副本将返回结果。同时在后台，第三个副本将被检查跟前两个的一致性，如果需要，最新的数据将被写到这些旧数据副本上去。

（该翻译作为实验室大数据组的学习材料的一部分，适合对Cassandra已经有一定了解的读者。未经本人许可，请勿转载。）

Planning a cluster deployment

Selecting hardware for enterprise implementations

内存越大，就能缓存更多的数据，memtable也能放更多的数据。

推荐：

企业环境，16-64GB，至少也要8GB。

虚拟机的话，8-16，最小4GB吧

测试环境，256,mb

CPU：

对于写入负载很重的应用，对于Cassandra来说CPU比内存更容易到达极限。Cassandra是一个高并发的应用，会使用很多CPU核：

专用环境，8核还不错。

虚拟机环境，考虑下using a provider that allows CPU bursting, such as Rackspace Cloud Servers.

硬盘：

先来了解下架构：Cassandra写数据：1，写入到commit log中来durability。2.刷memtable到SSTable中来持久化。3、SSTable定期压缩。通过合并和重写数据以及删除旧数据，压缩能够提升性能。然而，压缩会大量使用磁盘IO以及磁盘目录。因此，你应该留下足够的空闲磁盘空间：50%（至少）for SizeTieredCompactionStrategy and large compactions, and 10% for LeveledCompactionStrategy.

更多压缩内容见下一篇数据模型的文章。

推荐：

Capacity and I/O

When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Some workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).

Solid-state drives

SSDs are the recommended choice for Cassandra. Cassandra's sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs. This means that Cassandra deployments can take advantage of inexpensive consumer-grade SSDs. Enterprise level SSDs are not necessary because Cassandra's SSD access wears out consumer-grade SSDs in the same time frame as more expensive enterprise SSDs.

Number of disks - SATA: Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories. At a minimum the commit log should be on its own partition.

Commit log disk - SATA: The disk not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).

Data disks

Use one or more disks and make sure they are large enough for the data volume and fast enough to both satisfy reads that are not cached in memory and to keep up with compaction.

RAID on data disks

It is generally not necessary to use RAID for the following reasons:

Data is replicated across the cluster based on the replication factor you've chosen.
Starting in version 1.2, Cassandra includes takes care of disk management with the JBOD (Just a bunch of disks) support feature. Because Cassandra properly reacts to a disk failure, based on your availability/consistency requirements, either by stopping the affected node or by blacklisting the failed drive, this allows you to deploy Cassandra nodes with large disk arrays without the overhead of RAID 10.

RAID on the commit log disk

Generally RAID is not needed for the commit log disk. Replication adequately prevents data loss. If you need the extra redundancy, use RAID 1.

Extended file systems

DataStax recommends deploying Cassandra on XFS. On ext2 or ext3, the maximum file size is 2TB even using a 64-bit kernel. On ext4 it is 16TB.

Because Cassandra can use almost half your disk space for a single file, use XFS when using large disks, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit.

简单总结就是：不需要RAID。至少两块磁盘（commitlog和sstable各一块）。SSD 是很好的。推荐使用XFS文件系统，ext4似乎也还行。

节点数量：

1.2之前，推荐一个节点有300-500GB数据，1.2之后，由于JBOD，虚拟节点，off-heap 布隆过滤器，SSD的并发压缩等等，允许你使用更少的机器，每个机器增加更多的数据空间。

网络：

一定要选择可靠的、冗余的网络接口,确保您的网络能处理交通节点之间没有瓶颈。