
    Running the official wordcount locally with hadoop 2.7.3

    Environment:
    Host OS: win7
    VM software: virtualBox
    Guest OS: centos 7
    hadoop version: 2.7.3

    This walkthrough runs hadoop in standalone (local) mode.


    1 Install hadoop

    Java environment

    yum install java-1.8.0-openjdk
    

    Download and unpack the hadoop tarball

    mkdir ~/hadoop/
    cd ~/hadoop/
    
    # http://apache.fayea.com/hadoop/common/hadoop-2.7.3/
    curl http://apache.fayea.com/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz -O
    # If the download is interrupted, resume it with the -C option
    
    ls -l
    #-rw-rw-r--. 1 jungle jungle 165297920 Jan  6 13:10 hadoop-2.7.3.tar.gz
    
    curl http://apache.fayea.com/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz -C 165297920 -O
    # ** Resuming transfer from byte position 165297920 ...
    
    # download checksum 
    curl http://apache.fayea.com/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz.mds -O
    
    # check: compare the md5sum/sha256sum output below against the .mds contents
    cat hadoop-2.7.3.tar.gz.mds
    
    md5sum hadoop-2.7.3.tar.gz
    sha256sum hadoop-2.7.3.tar.gz
    
    tar -zxf hadoop-2.7.3.tar.gz
    mv hadoop-2.7.3 hadoop-local
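
The comparison between the `.mds` contents and the computed digests above is done by eye. A small sketch of automating it is shown below, demonstrated on a throwaway file since the real digest has to be copied out of hadoop-2.7.3.tar.gz.mds by hand (the `file` and `expected` values here are illustrative placeholders, not the real tarball or digest):

```shell
# Sketch: compare a computed MD5 against an expected value.
# For the real check, point "file" at hadoop-2.7.3.tar.gz and set
# "expected" to the MD5 digest listed in the .mds file.
file=demo.tar.gz
printf 'stand-in contents\n' > "$file"
expected=$(md5sum "$file" | awk '{print $1}')   # stand-in for the .mds digest
actual=$(md5sum "$file" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    echo "checksum OK"
else
    echo "checksum MISMATCH" >&2
fi
```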
    

    2 Configure the environment

    In local mode very little configuration is needed; only a few environment variables are involved.

    # java path
    whereis java
    java: /usr/bin/java /usr/lib/java /etc/java /usr/share/java 
    
    ls -l /usr/bin/java
    lrwxrwxrwx. 1 root root 22 Dec 30 12:26 /usr/bin/java -> /etc/alternatives/java
    
    ls -l /etc/alternatives/java
    lrwxrwxrwx. 1 root root 73 Dec 30 12:26 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre/bin/java
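
Following the symlink chain one hop at a time works, but GNU `readlink -f` (shipped on CentOS 7) resolves the whole chain at once; on the VM the real command would be `readlink -f /usr/bin/java`. Sketched here on a throwaway chain so it does not depend on an installed JDK:

```shell
# Build a two-hop symlink chain mimicking /usr/bin/java ->
# /etc/alternatives/java -> .../jre/bin/java, then resolve it in one step.
mkdir -p /tmp/demo-jvm/bin
touch /tmp/demo-jvm/bin/java
ln -sf /tmp/demo-jvm/bin/java /tmp/demo-alternatives-java
ln -sf /tmp/demo-alternatives-java /tmp/demo-usr-bin-java
readlink -f /tmp/demo-usr-bin-java   # resolves to .../demo-jvm/bin/java
```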
    

    Add the following three lines to ~/.bashrc:

    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre
    export HADOOP_INSTALL=/home/jungle/hadoop/hadoop-local
    export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
    

    Reload the shell configuration (source ~/.bashrc), then confirm hadoop works:

    hadoop version
    Hadoop 2.7.3
    Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
    Compiled by root on 2016-08-18T01:41Z
    Compiled with protoc 2.5.0
    From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
    This command was run using /home/jungle/hadoop/hadoop-local/share/hadoop/common/hadoop-common-2.7.3.jar
    

    3 Testing with the Linux filesystem

    The test uses the Linux filesystem directly, i.e. without any hadoop fs commands.

    3.1 wordcount

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar 
    An example program must be given as the first argument.
    Valid program names are:
      aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
      aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
      bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
      dbcount: An example job that count the pageview counts from a database.
    # ...
      wordcount: A map/reduce program that counts the words in the input files.
    # ...
    
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar  wordcount
    # Usage: wordcount <in> [<in>...] <out>
    

    3.2 Prepare the data

    mkdir -p dataLocal/input/
    cd dataLocal/input/
    
    echo "hello world, I am jungle. bye world" > file1.txt
    echo "hello hadoop. hello jungle. bye hadoop." > file2.txt
    echo "the great software is hadoop." >> file2.txt
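
For later comparison with the MapReduce output, the same counts can be produced with a plain-shell pipeline. Like the example's tokenizer, this splits on whitespace only, so punctuation stays attached to words; the exact counts depend on the exact input text. The input files are recreated here so the snippet is self-contained:

```shell
# Naive word count over the same input: one token per line, sort,
# count duplicates, show most frequent first.
mkdir -p dataLocal/input
echo "hello world, I am jungle. bye world" > dataLocal/input/file1.txt
echo "hello hadoop. hello jungle. bye hadoop." > dataLocal/input/file2.txt
echo "the great software is hadoop." >> dataLocal/input/file2.txt
cat dataLocal/input/*.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn
```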
    
    

    3.3 Run

    cd /home/jungle/hadoop/hadoop-local/
    
    hadoop jar /home/jungle/hadoop/hadoop-local/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount  dataLocal/input/ dataLocal/outout
    # dataLocal/outout does not exist yet; wordcount creates it
    echo $?
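
One caveat when re-running: in Hadoop 2.x the job refuses to start if the output directory already exists (FileOutputFormat fails with a FileAlreadyExistsException), so the previous run's output has to be removed first. The directory name below keeps the spelling used throughout this walkthrough, matching the logs:

```shell
# Remove the previous run's output before re-running wordcount;
# the job will not overwrite an existing output directory.
rm -rf dataLocal/outout
```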
    
    ls -la dataLocal/outout/
    total 12
    drwxrwxr-x. 2 jungle jungle 84 Jan  6 16:53 .
    drwxrwxr-x. 4 jungle jungle 31 Jan  6 16:53 ..
    -rw-r--r--. 1 jungle jungle 82 Jan  6 16:53 part-r-00000
    -rw-r--r--. 1 jungle jungle 12 Jan  6 16:53 .part-r-00000.crc
    -rw-r--r--. 1 jungle jungle  0 Jan  6 16:53 _SUCCESS
    -rw-r--r--. 1 jungle jungle  8 Jan  6 16:53 ._SUCCESS.crc
    
    # results
    cat dataLocal/outout//part-r-00000
    I	1
    am	1
    bye	2
    great	1
    hadoop.	3
    hello	3
    is	1
    jungle.	2
    software	1
    the	1
    world.	2
    	
    

    3.4 The run log

    The log reveals some of the common runtime parameters and configuration values.

    17/01/06 16:53:26 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    17/01/06 16:53:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    17/01/06 16:53:26 INFO input.FileInputFormat: Total input paths to process : 2
    17/01/06 16:53:26 INFO mapreduce.JobSubmitter: number of splits:2
    17/01/06 16:53:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1147390429_0001
    17/01/06 16:53:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
    17/01/06 16:53:27 INFO mapreduce.Job: Running job: job_local1147390429_0001
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: OutputCommitter set in config null
    17/01/06 16:53:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Waiting for map tasks
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1147390429_0001_m_000000_0
    17/01/06 16:53:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/01/06 16:53:27 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
    17/01/06 16:53:27 INFO mapred.MapTask: Processing split: file:/home/jungle/hadoop/hadoop-local/dataLocal/input/file2.txt:0+70
    17/01/06 16:53:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
    17/01/06 16:53:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
    17/01/06 16:53:27 INFO mapred.MapTask: soft limit at 83886080
    17/01/06 16:53:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
    17/01/06 16:53:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
    17/01/06 16:53:27 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: 
    17/01/06 16:53:27 INFO mapred.MapTask: Starting flush of map output
    17/01/06 16:53:27 INFO mapred.MapTask: Spilling map output
    17/01/06 16:53:27 INFO mapred.MapTask: bufstart = 0; bufend = 114; bufvoid = 104857600
    17/01/06 16:53:27 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214356(104857424); length = 41/6553600
    17/01/06 16:53:27 INFO mapred.MapTask: Finished spill 0
    17/01/06 16:53:27 INFO mapred.Task: Task:attempt_local1147390429_0001_m_000000_0 is done. And is in the process of committing
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: map
    17/01/06 16:53:27 INFO mapred.Task: Task 'attempt_local1147390429_0001_m_000000_0' done.
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local1147390429_0001_m_000000_0
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1147390429_0001_m_000001_0
    17/01/06 16:53:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/01/06 16:53:27 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
    17/01/06 16:53:27 INFO mapred.MapTask: Processing split: file:/home/jungle/hadoop/hadoop-local/dataLocal/input/file1.txt:0+37
    17/01/06 16:53:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
    17/01/06 16:53:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
    17/01/06 16:53:27 INFO mapred.MapTask: soft limit at 83886080
    17/01/06 16:53:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
    17/01/06 16:53:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
    17/01/06 16:53:27 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: 
    17/01/06 16:53:27 INFO mapred.MapTask: Starting flush of map output
    17/01/06 16:53:27 INFO mapred.MapTask: Spilling map output
    17/01/06 16:53:27 INFO mapred.MapTask: bufstart = 0; bufend = 65; bufvoid = 104857600
    17/01/06 16:53:27 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214372(104857488); length = 25/6553600
    17/01/06 16:53:27 INFO mapred.MapTask: Finished spill 0
    17/01/06 16:53:27 INFO mapred.Task: Task:attempt_local1147390429_0001_m_000001_0 is done. And is in the process of committing
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: map
    17/01/06 16:53:27 INFO mapred.Task: Task 'attempt_local1147390429_0001_m_000001_0' done.
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local1147390429_0001_m_000001_0
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: map task executor complete.
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Waiting for reduce tasks
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1147390429_0001_r_000000_0
    17/01/06 16:53:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/01/06 16:53:27 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
    17/01/06 16:53:27 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2aa26fdb
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
    17/01/06 16:53:27 INFO reduce.EventFetcher: attempt_local1147390429_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
    17/01/06 16:53:27 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1147390429_0001_m_000000_0 decomp: 98 len: 102 to MEMORY
    17/01/06 16:53:27 INFO reduce.InMemoryMapOutput: Read 98 bytes from map-output for attempt_local1147390429_0001_m_000000_0
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 98, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->98
    17/01/06 16:53:27 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1147390429_0001_m_000001_0 decomp: 68 len: 72 to MEMORY
    
    17/01/06 16:53:27 WARN io.ReadaheadPool: Failed readahead on ifile
    EBADF: Bad file descriptor
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
    	at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    	
    17/01/06 16:53:27 INFO reduce.InMemoryMapOutput: Read 68 bytes from map-output for attempt_local1147390429_0001_m_000001_0
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 68, inMemoryMapOutputs.size() -> 2, commitMemory -> 98, usedMemory ->166
    17/01/06 16:53:27 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
    
    17/01/06 16:53:27 WARN io.ReadaheadPool: Failed readahead on ifile
    EBADF: Bad file descriptor
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native Method)
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
    	at org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
    	at org.apache.hadoop.io.ReadaheadPool$ReadaheadRequestImpl.run(ReadaheadPool.java:206)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    	
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
    17/01/06 16:53:27 INFO mapred.Merger: Merging 2 sorted segments
    17/01/06 16:53:27 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 156 bytes
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: Merged 2 segments, 166 bytes to disk to satisfy reduce memory limit
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: Merging 1 files, 168 bytes from disk
    17/01/06 16:53:27 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
    17/01/06 16:53:27 INFO mapred.Merger: Merging 1 sorted segments
    17/01/06 16:53:27 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 160 bytes
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
    17/01/06 16:53:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
    17/01/06 16:53:27 INFO mapred.Task: Task:attempt_local1147390429_0001_r_000000_0 is done. And is in the process of committing
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
    17/01/06 16:53:27 INFO mapred.Task: Task attempt_local1147390429_0001_r_000000_0 is allowed to commit now
    17/01/06 16:53:27 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1147390429_0001_r_000000_0' to file:/home/jungle/hadoop/hadoop-local/dataLocal/outout/_temporary/0/task_local1147390429_0001_r_000000
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: reduce > reduce
    17/01/06 16:53:27 INFO mapred.Task: Task 'attempt_local1147390429_0001_r_000000_0' done.
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local1147390429_0001_r_000000_0
    17/01/06 16:53:27 INFO mapred.LocalJobRunner: reduce task executor complete.
    17/01/06 16:53:28 INFO mapreduce.Job: Job job_local1147390429_0001 running in uber mode : false
    17/01/06 16:53:28 INFO mapreduce.Job:  map 100% reduce 100%
    17/01/06 16:53:28 INFO mapreduce.Job: Job job_local1147390429_0001 completed successfully
    17/01/06 16:53:28 INFO mapreduce.Job: Counters: 30
    	File System Counters
    		FILE: Number of bytes read=889648
    		FILE: Number of bytes written=1748828
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    	Map-Reduce Framework
    		Map input records=3
    		Map output records=18
    		Map output bytes=179
    		Map output materialized bytes=174
    		Input split bytes=256
    		Combine input records=18
    		Combine output records=14
    		Reduce input groups=11
    		Reduce shuffle bytes=174
    		Reduce input records=14
    		Reduce output records=11
    		Spilled Records=28
    		Shuffled Maps =2
    		Failed Shuffles=0
    		Merged Map outputs=2
    		GC time elapsed (ms)=43
    		Total committed heap usage (bytes)=457912320
    	Shuffle Errors
    		BAD_ID=0
    		CONNECTION=0
    		IO_ERROR=0
    		WRONG_LENGTH=0
    		WRONG_MAP=0
    		WRONG_REDUCE=0
    	File Input Format Counters 
    		Bytes Read=107
    	File Output Format Counters 
    		Bytes Written=94
    

    According to reports online, the two EBADF: Bad file descriptor warnings above can safely be ignored.

  • Original post: https://www.cnblogs.com/qinqiao/p/local-hadoop-wordcount.html