
    HDFS Architecture Notes

     1、Moving Computation is Cheaper than Moving Data

      A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
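
  A minimal sketch of that interface, using the FileSystem API to ask which DataNodes host each block of a file (the path is hypothetical and error handling is omitted):

```java
// Minimal sketch: asking the NameNode where a file's blocks live, so that
// computation can be scheduled near the data. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/data/input.log"));
        // One BlockLocation per block, listing the DataNodes holding replicas.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

  A locality-aware scheduler (MapReduce does exactly this when computing input splits) uses the returned host lists to run each task on, or near, a machine that already holds the block.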

     2、Safemode

      On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
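
  The threshold percentage and the extra 30-second extension are configurable (dfs.safemode.threshold.pct and dfs.safemode.extension in the r1.2.1 release referenced below). A client can also query the current state; here is a minimal sketch against the Hadoop 2.x API (in 1.x the SafeModeAction enum lives in FSConstants rather than HdfsConstants):

```java
// Minimal sketch: querying Safemode from a client, assuming the Hadoop 2.x
// class names (they differ slightly across Hadoop versions).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafemodeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // SAFEMODE_GET only reads the state; it does not change it.
            boolean inSafemode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
            System.out.println("NameNode in Safemode: " + inSafemode);
        }
        fs.close();
    }
}
```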

     3、The Persistence of File System Metadata

  The NameNode stores the entire file system namespace in a file called the FsImage and records every change to file system metadata in a transaction log called the EditLog. On startup it reads the FsImage and EditLog from disk, applies the logged transactions to the in-memory namespace, and flushes the merged state back out as a new FsImage (a checkpoint), after which the old EditLog can be truncated.

     4、Staging

  The HDFS client first caches file data into a temporary local file, and application writes are transparently redirected to this file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode, which inserts the file name into the file system hierarchy, allocates a data block for it, and replies with the identities of the destination DataNode and block. The client then flushes the block of data from the local temporary file to that DataNode.
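
  From the application's point of view the staging is invisible: it simply writes to an output stream and the HDFS client handles the buffering and the eventual NameNode/DataNode interaction. A minimal sketch of the write path (the path and record contents are hypothetical):

```java
// Minimal sketch of the client-side write path described above: the
// application only sees an output stream; staging is handled internally.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // Writes are staged on the client side; blocks are shipped to
            // DataNodes as they fill, and close() flushes whatever remains.
            for (int i = 0; i < 1000; i++) {
                out.writeBytes("record " + i + "\n");
            }
        }
        fs.close();
    }
}
```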

     5、Replication Pipelining

  When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. In this way, the data is pipelined from one DataNode to the next.
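
  The following is an illustrative sketch of one pipeline stage, not the actual DataNode code: it shows how a node can persist each 4 KB portion locally while forwarding it downstream before the full block has arrived:

```java
// Illustrative sketch only, not DataNode source: one pipeline stage reads
// 4 KB portions, persists each portion locally, and forwards it downstream
// before the whole block has been received.
import java.io.InputStream;
import java.io.OutputStream;

public class PipelineStage {
    static final int PORTION_SIZE = 4 * 1024; // 4 KB portions, as above

    // 'upstream' is the previous node (or the client); 'local' is this
    // node's repository; 'downstream' is the next DataNode in the list,
    // or null for the last replica in the pipeline.
    static void relay(InputStream upstream, OutputStream local,
                      OutputStream downstream) throws Exception {
        byte[] portion = new byte[PORTION_SIZE];
        int n;
        while ((n = upstream.read(portion)) != -1) {
            local.write(portion, 0, n);            // persist locally
            if (downstream != null) {
                downstream.write(portion, 0, n);   // forward immediately
            }
        }
        local.flush();
        if (downstream != null) {
            downstream.flush();
        }
    }
}
```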

     6、File Deletes and Undeletes

  When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.
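
  The retention period is set by fs.trash.interval (trash is disabled when it is 0), and in practice the trash directory lives under each user's home directory as .Trash rather than a literal /trash. A minimal sketch of a soft delete through the Trash helper class (the path is hypothetical):

```java
// Minimal sketch: moving a file to trash instead of deleting it outright,
// assuming trash is enabled (fs.trash.interval > 0). Path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class SoftDelete {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path victim = new Path("/data/old.log");
        Trash trash = new Trash(fs, conf);
        // Renames the file into the trash directory; it can be restored by
        // moving it back, until the configured trash interval expires.
        if (trash.moveToTrash(victim)) {
            System.out.println("Moved to trash: " + victim);
        }
        fs.close();
    }
}
```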

  Reference: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
