  • MapReduce English Interview Questions

    1——What is MapReduce? (How does MapReduce work?)

    MapReduce is a programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.

    Each phase has key-value pairs as input and output, the types of which can be chosen by the programmer (the map input types are determined by the InputFormat in use). To implement a MapReduce job, we need to specify two functions: the map function and the reduce function.
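    As an illustration, here is a minimal sketch of the two functions for the classic word-count example, written against the org.apache.hadoop.mapreduce API; the class names are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: input pairs are (line offset, line text); output pairs are (word, 1).
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE); // emit (word, 1) for the reduce phase
          }
        }
      }
    }

    // Reduce phase: input is (word, all counts for that word); output is (word, total).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }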

     2——......

    Rather than using the built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization; these can be found in the org.apache.hadoop.io package.
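    For example, a small sketch using two of these standard types, IntWritable and Text, in place of Integer and String:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class BasicTypes {
      public static void main(String[] args) {
        IntWritable count = new IntWritable(163); // Writable counterpart of Integer
        Text name = new Text("hadoop");           // Writable counterpart of String
        System.out.println(count.get() + " " + name); // Text.toString() is implicit
      }
    }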

     3——Data Flow

    A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and the configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

    There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

    Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. (For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster, or specified when each file is created.)
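    To make the data flow concrete, here is a minimal driver sketch that submits such a job; it assumes the word-count mapper and reducer sketched earlier, and the input and output paths are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // One map task is created per input split (typically one HDFS block).
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }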

      4——HDFS

     When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems. Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

     HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

      5——Streaming data access

    HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
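    A small sketch of this access pattern using the FileSystem API: open the file once and stream through every record sequentially (the path is hypothetical).

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamingRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dataset = new Path("/user/demo/dataset.txt"); // hypothetical path
        long records = 0;
        // Sequential, front-to-back read: the pattern HDFS is optimized for,
        // as opposed to low-latency random access to individual records.
        try (FSDataInputStream in = fs.open(dataset);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          while (reader.readLine() != null) {
            records++; // process each record here
          }
        }
        System.out.println("records read: " + records);
      }
    }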

       6——NameNode and DataNode

     An HDFS cluster has two types of node: a namenode, which is the master, and a number of datanodes, which act as the workers. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
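    As a small illustration of the namenode's block knowledge, a client can ask where a file's blocks live via FileSystem.getFileBlockLocations (the path below is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/dataset.txt"));
        // The namenode answers from its in-memory block map, which is rebuilt
        // from datanode block reports rather than stored persistently.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("block at offset " + block.getOffset()
              + " on hosts " + String.join(", ", block.getHosts()));
        }
      }
    }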

    Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.

     The secondary namenode's main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large.

        7——Serialization

    Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

    Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
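    A minimal sketch of both directions using Hadoop's Writable mechanism, serializing an IntWritable to a byte stream and back:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
      public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(163);

        // Serialization: structured object -> byte stream.
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));
        byte[] bytes = bytesOut.toByteArray(); // 4 bytes for an int

        // Deserialization: byte stream -> structured object.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(restored.get()); // prints 163
      }
    }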

  • Original post: https://www.cnblogs.com/conie/p/3632429.html