https://github.com/cloudera/flume/blob/master/flume-docs/src/docs/UserGuide/Introduction
=== Reliability

Reliability, the ability to continue delivering events in the face of
failures without losing data, is a vital feature of Flume. Large
distributed systems can and do suffer partial failures in many ways:
physical hardware can fail, resources such as network bandwidth or
memory can become scarce, or software can crash or run slowly. Flume
emphasizes fault tolerance as a core design principle and keeps
running and collecting data even when many components have failed.

Flume can guarantee that all data received by an agent node will
eventually make it to the collector at the end of its flow as long as
the agent node keeps running. That is, data can be *reliably*
delivered to its eventual destination.

However, reliable delivery can be very resource intensive and is often
a stronger guarantee than some data sources require. Therefore, Flume
allows the user to specify, on a per-flow basis, the level of
reliability required. There are three supported reliability levels:

* End-to-end
* Store on failure
* Best effort

.A Note About Reliability
******************
Although Flume is extremely tolerant of machine, network, and software
failures, there is no such thing as '100% reliability'. If all
the machines in a Flume installation were irrevocably destroyed in
some terrible data center incident, all copies of Flume's data would
be lost and there would be no way to recover them. Therefore, all of
Flume's reliability levels make guarantees about data delivery 'until
some maximum number of failures have occurred'. Flume's failure modes
(what can fail, and what keeps running when it does) are described in
detail later in this guide.
******************

The *end-to-end* reliability level guarantees that once Flume accepts
an event, that event will make it to the endpoint, as long as the
agent that accepted the event remains live long enough. The first
thing the agent does in this setting is write the event to disk in a
''write-ahead log'' (WAL) so that, if the agent crashes and restarts,
knowledge of the event is not lost. After the event has successfully
made its way to the end of its flow, an acknowledgment is sent back to
the originating agent so that it knows it no longer needs to store the
event on disk. This reliability level can withstand any number of
failures downstream of the initial agent.
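
The write-then-acknowledge cycle described above can be illustrated
with a minimal Python sketch. It is not Flume's implementation: the
`WriteAheadLog` class, its JSON line format, and the event IDs are all
hypothetical and exist only to show why an unacknowledged event
survives an agent restart.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log; not Flume's on-disk format."""

    def __init__(self, path):
        self.path = path

    def _write(self, record):
        # Force the record to disk before returning to the caller.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def append(self, event_id, body):
        # Step 1: persist the event before sending it downstream.
        self._write({"op": "event", "id": event_id, "body": body})

    def ack(self, event_id):
        # Step 2: the end of the flow confirmed delivery, so the
        # event no longer needs to be retained.
        self._write({"op": "ack", "id": event_id})

    def pending(self):
        """Replay the log and return events not yet acknowledged."""
        events = {}
        if not os.path.exists(self.path):
            return events
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["op"] == "event":
                    events[record["id"]] = record["body"]
                else:
                    events.pop(record["id"], None)
        return events


def wal_demo():
    path = os.path.join(tempfile.mkdtemp(), "agent.wal")
    wal = WriteAheadLog(path)
    wal.append("e1", "login ok")
    wal.append("e2", "disk full")
    wal.ack("e1")  # only e1 reached the end of its flow
    # Simulate a crash and restart: a fresh object reads the same file.
    recovered = WriteAheadLog(path)
    return recovered.pending()
```

After the simulated restart, `wal_demo()` returns only the event that
was never acknowledged, which is the one a restarted agent would need
to resend.
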

The *store on failure* reliability level requires an acknowledgment
only from the node one hop downstream. If the sending
node detects a failure, it stores data on its local disk until the
downstream node is repaired or an alternate downstream destination
can be selected. While this is effective, data can be lost if a
compound or silent failure occurs.
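
A minimal Python sketch of the one-hop behavior described above
follows. The `NextHop` and `StoreOnFailureSender` classes are
hypothetical, and an in-memory queue stands in for the local disk
store; none of this is Flume's actual code.

```python
from collections import deque


class NextHop:
    """Hypothetical downstream node that can be switched up or down."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event):
        if not self.up:
            raise ConnectionError("next hop unreachable")
        self.received.append(event)  # returning normally is the one-hop ack


class StoreOnFailureSender:
    def __init__(self, next_hop):
        self.next_hop = next_hop
        self.spool = deque()  # stands in for the local disk store

    def emit(self, event):
        # Queue first so ordering is preserved, then try to drain.
        # On the healthy path the queue empties immediately.
        self.spool.append(event)
        self.flush()

    def flush(self):
        """Send stored events in order; stop at the first failure."""
        while self.spool:
            try:
                self.next_hop.send(self.spool[0])
            except ConnectionError:
                return False  # keep the event stored for a later retry
            self.spool.popleft()
        return True


def store_on_failure_demo():
    hop = NextHop()
    sender = StoreOnFailureSender(hop)
    sender.emit("a")
    hop.up = False  # the downstream node fails
    sender.emit("b")
    sender.emit("c")
    stored_during_outage = list(sender.spool)
    hop.up = True  # the downstream node is repaired
    sender.flush()
    return stored_during_outage, hop.received, list(sender.spool)
```

In this sketch the acknowledgment covers a single hop only. If
`NextHop` accepted an event and then lost it before forwarding, the
sender would never find out, which is one way to picture the silent
failure case mentioned above.
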

The *best-effort* reliability level sends data to the next hop with no
attempts to confirm or retry delivery. If nodes fail, any data that
they were in the process of transmitting or receiving can be
lost. This is the weakest reliability level, but also the most
lightweight.
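
For contrast, a best-effort sender can be sketched in a few lines of
Python. The `FireAndForgetLink` class is hypothetical and only
illustrates the absence of confirmation and retry; it does not reflect
Flume's internals.

```python
class FireAndForgetLink:
    """Hypothetical next hop that gives the sender no feedback."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def send(self, event):
        # No acknowledgment and no exception: the sender cannot tell
        # whether the event arrived, so it never stores or retries it.
        if self.up:
            self.delivered.append(event)


def best_effort_demo():
    link = FireAndForgetLink()
    for index, event in enumerate(["a", "b", "c", "d"]):
        link.up = index != 1  # the hop is down while "b" is in flight
        link.send(event)
    return link.delivered
```

Here the event sent during the outage is simply gone, and the sender
keeps no state at all, which is what makes this level lightweight.
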