Hadoop Learning Notes 1: Getting Started with Hadoop

Motivation

I need Hadoop for work, so I'm learning it and writing this post to share how I quickly set up a Hadoop environment and ran a demo.

Setting Up the Environment

There are plenty of Hadoop setup guides online, but the ones I read all looked fairly involved: install Java, install Hadoop, then adjust all kinds of settings, with many parameters and variables whose meaning I didn't understand. My goal is simple: stand up a working environment in the easiest possible way. The individual parameters don't matter much to me at first; I just want to run one small demo of my own. In practice I won't be the one maintaining the environment anyway, so simple is all I need.

As it happens, I've been reading about Docker lately, so I decided to build the environment with Docker and learn Hadoop and Docker at the same time.

First, install Docker. That part is simple and I won't cover it here; the official site provides a one-line install script.

Docker Hub has a ready-made Hadoop image:

    https://hub.docker.com/r/sequenceiq/hadoop-docker/

I modified the suggested command slightly:

I mount one extra directory, because I want to upload the demo jar I wrote into the container and run it with Hadoop.

I also name the container hadoop2. I run a lot of containers, so a name makes it easy to tell apart, and later I may use several Hadoop containers to build a cluster.

    docker run -it -v /dockerVolumes/hadoop2:/dockerVolume --name hadoop2  sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

Once this command runs, the container is up. We can try the official example:

    cd $HADOOP_PREFIX
    # run the mapreduce
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
    
    # check the output
    bin/hdfs dfs -cat output/*

Output (note that the grep example chains two MapReduce jobs, a search pass and a sort pass, which is why two job IDs appear in the log):

    bash-4.1# clear
    bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
    18/06/11 07:35:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    18/06/11 07:35:39 INFO input.FileInputFormat: Total input paths to process : 31
    18/06/11 07:35:39 INFO mapreduce.JobSubmitter: number of splits:31
    18/06/11 07:35:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0007
    18/06/11 07:35:40 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0007
    18/06/11 07:35:40 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0007/
    18/06/11 07:35:40 INFO mapreduce.Job: Running job: job_1528635021541_0007
    18/06/11 07:35:45 INFO mapreduce.Job: Job job_1528635021541_0007 running in uber mode : false
    18/06/11 07:35:45 INFO mapreduce.Job:  map 0% reduce 0%
    18/06/11 07:36:02 INFO mapreduce.Job:  map 10% reduce 0%
    18/06/11 07:36:03 INFO mapreduce.Job:  map 19% reduce 0%
    18/06/11 07:36:19 INFO mapreduce.Job:  map 35% reduce 0%
    18/06/11 07:36:20 INFO mapreduce.Job:  map 39% reduce 0%
    18/06/11 07:36:33 INFO mapreduce.Job:  map 42% reduce 0%
    18/06/11 07:36:35 INFO mapreduce.Job:  map 55% reduce 0%
    18/06/11 07:36:36 INFO mapreduce.Job:  map 55% reduce 15%
    18/06/11 07:36:39 INFO mapreduce.Job:  map 55% reduce 18%
    18/06/11 07:36:45 INFO mapreduce.Job:  map 58% reduce 18%
    18/06/11 07:36:46 INFO mapreduce.Job:  map 61% reduce 18%
    18/06/11 07:36:47 INFO mapreduce.Job:  map 65% reduce 18%
    18/06/11 07:36:48 INFO mapreduce.Job:  map 65% reduce 22%
    18/06/11 07:36:49 INFO mapreduce.Job:  map 71% reduce 22%
    18/06/11 07:36:51 INFO mapreduce.Job:  map 71% reduce 24%
    18/06/11 07:36:57 INFO mapreduce.Job:  map 74% reduce 24%
    18/06/11 07:36:59 INFO mapreduce.Job:  map 77% reduce 24%
    18/06/11 07:37:00 INFO mapreduce.Job:  map 77% reduce 26%
    18/06/11 07:37:01 INFO mapreduce.Job:  map 84% reduce 26%
    18/06/11 07:37:03 INFO mapreduce.Job:  map 87% reduce 28%
    18/06/11 07:37:06 INFO mapreduce.Job:  map 87% reduce 29%
    18/06/11 07:37:08 INFO mapreduce.Job:  map 90% reduce 29%
    18/06/11 07:37:09 INFO mapreduce.Job:  map 94% reduce 29%
    18/06/11 07:37:11 INFO mapreduce.Job:  map 100% reduce 29%
    18/06/11 07:37:12 INFO mapreduce.Job:  map 100% reduce 100%
    18/06/11 07:37:12 INFO mapreduce.Job: Job job_1528635021541_0007 completed successfully
    18/06/11 07:37:12 INFO mapreduce.Job: Counters: 49
    	File System Counters
    		FILE: Number of bytes read=345
    		FILE: Number of bytes written=3697476
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    		HDFS: Number of bytes read=80529
    		HDFS: Number of bytes written=437
    		HDFS: Number of read operations=96
    		HDFS: Number of large read operations=0
    		HDFS: Number of write operations=2
    	Job Counters
    		Launched map tasks=31
    		Launched reduce tasks=1
    		Data-local map tasks=31
    		Total time spent by all maps in occupied slots (ms)=400881
    		Total time spent by all reduces in occupied slots (ms)=52340
    		Total time spent by all map tasks (ms)=400881
    		Total time spent by all reduce tasks (ms)=52340
    		Total vcore-seconds taken by all map tasks=400881
    		Total vcore-seconds taken by all reduce tasks=52340
    		Total megabyte-seconds taken by all map tasks=410502144
    		Total megabyte-seconds taken by all reduce tasks=53596160
    	Map-Reduce Framework
    		Map input records=2060
    		Map output records=24
    		Map output bytes=590
    		Map output materialized bytes=525
    		Input split bytes=3812
    		Combine input records=24
    		Combine output records=13
    		Reduce input groups=11
    		Reduce shuffle bytes=525
    		Reduce input records=13
    		Reduce output records=11
    		Spilled Records=26
    		Shuffled Maps =31
    		Failed Shuffles=0
    		Merged Map outputs=31
    		GC time elapsed (ms)=2299
    		CPU time spent (ms)=11090
    		Physical memory (bytes) snapshot=8178929664
    		Virtual memory (bytes) snapshot=21830377472
    		Total committed heap usage (bytes)=6461849600
    	Shuffle Errors
    		BAD_ID=0
    		CONNECTION=0
    		IO_ERROR=0
    		WRONG_LENGTH=0
    		WRONG_MAP=0
    		WRONG_REDUCE=0
    	File Input Format Counters
    		Bytes Read=76717
    	File Output Format Counters
    		Bytes Written=437
    18/06/11 07:37:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    18/06/11 07:37:12 INFO input.FileInputFormat: Total input paths to process : 1
    18/06/11 07:37:12 INFO mapreduce.JobSubmitter: number of splits:1
    18/06/11 07:37:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0008
    18/06/11 07:37:12 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0008
    18/06/11 07:37:12 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0008/
    18/06/11 07:37:12 INFO mapreduce.Job: Running job: job_1528635021541_0008
    18/06/11 07:37:24 INFO mapreduce.Job: Job job_1528635021541_0008 running in uber mode : false
    18/06/11 07:37:24 INFO mapreduce.Job:  map 0% reduce 0%
    18/06/11 07:37:29 INFO mapreduce.Job:  map 100% reduce 0%
    18/06/11 07:37:35 INFO mapreduce.Job:  map 100% reduce 100%
    18/06/11 07:37:35 INFO mapreduce.Job: Job job_1528635021541_0008 completed successfully
    18/06/11 07:37:35 INFO mapreduce.Job: Counters: 49
    	File System Counters
    		FILE: Number of bytes read=291
    		FILE: Number of bytes written=230541
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    		HDFS: Number of bytes read=569
    		HDFS: Number of bytes written=197
    		HDFS: Number of read operations=7
    		HDFS: Number of large read operations=0
    		HDFS: Number of write operations=2
    	Job Counters
    		Launched map tasks=1
    		Launched reduce tasks=1
    		Data-local map tasks=1
    		Total time spent by all maps in occupied slots (ms)=3210
    		Total time spent by all reduces in occupied slots (ms)=3248
    		Total time spent by all map tasks (ms)=3210
    		Total time spent by all reduce tasks (ms)=3248
    		Total vcore-seconds taken by all map tasks=3210
    		Total vcore-seconds taken by all reduce tasks=3248
    		Total megabyte-seconds taken by all map tasks=3287040
    		Total megabyte-seconds taken by all reduce tasks=3325952
    	Map-Reduce Framework
    		Map input records=11
    		Map output records=11
    		Map output bytes=263
    		Map output materialized bytes=291
    		Input split bytes=132
    		Combine input records=0
    		Combine output records=0
    		Reduce input groups=5
    		Reduce shuffle bytes=291
    		Reduce input records=11
    		Reduce output records=11
    		Spilled Records=22
    		Shuffled Maps =1
    		Failed Shuffles=0
    		Merged Map outputs=1
    		GC time elapsed (ms)=55
    		CPU time spent (ms)=1090
    		Physical memory (bytes) snapshot=415494144
    		Virtual memory (bytes) snapshot=1373601792
    		Total committed heap usage (bytes)=354942976
    	Shuffle Errors
    		BAD_ID=0
    		CONNECTION=0
    		IO_ERROR=0
    		WRONG_LENGTH=0
    		WRONG_MAP=0
    		WRONG_REDUCE=0
    	File Input Format Counters
    		Bytes Read=437
    	File Output Format Counters
    		Bytes Written=197

As you can see, thanks to Docker, installing Hadoop takes a single command, and the official example runs successfully. Extremely simple.

Running My Own Demo

I tried writing a demo of my own: it reads the text in a txt file and counts the total number of characters.

1. First, put a txt file into HDFS:

For a reference on HDFS commands, see https://blog.csdn.net/zhaojw_420/article/details/53161624

    hdfs dfs -put in.txt /myinput/in.txt
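
As a side note, the same upload can be done programmatically with the HDFS FileSystem Java API. Here is a minimal sketch (the class name and the fs.defaultFS address are assumptions for illustration; when run inside the container via bin/hadoop, the cluster's own configuration is picked up automatically):

    package demo;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    
    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address, for illustration only; inside the container
            // the value from core-site.xml applies when launched with bin/hadoop.
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);
            // Equivalent of: hdfs dfs -put in.txt /myinput/in.txt
            fs.copyFromLocalFile(new Path("in.txt"), new Path("/myinput/in.txt"));
            fs.close();
        }
    }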

2. Write my own mapper and reducer.

The code is at https://gitee.com/abcwt112/hadoopDemo

See MyFirstMapper, MyFirstReducer, and MyFirstStarter in that repository.

    package demo;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import java.io.IOException;
    
    public class MyFirstReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the per-line character counts emitted by the mapper.
            int total = 0;
            for (IntWritable value : values) {
                total += value.get();
            }
            context.write(new IntWritable(1), new IntWritable(total));
        }
    }

    package demo;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    import java.io.IOException;
    
    public class MyFirstMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Emit each line's character count under a single constant key so that
            // the reducer receives all counts in one group and can sum them.
            String line = value.toString();
            context.write(new IntWritable(0), new IntWritable(line.length()));
        }
    }

    package demo;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    public class MyFirstStarter {
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            // Job.getInstance() replaces the deprecated new Job() constructor.
            Job job = Job.getInstance();
            job.setJarByClass(MyFirstStarter.class);
            job.setJobName("============ My First Job ==============");
    
            // The input file uploaded earlier; the output directory must not exist yet.
            FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/myout"));
    
            job.setMapperClass(MyFirstMapper.class);
            job.setReducerClass(MyFirstReducer.class);
    
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(IntWritable.class);
    
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }


After running mvn package, drop the resulting jar into /dockerVolumes/hadoop2 on the Linux host. Because that directory is mounted into the container, the jar automatically appears inside the hadoop2 container under /dockerVolume.

One more note: the MANIFEST.MF in the jar that mvn package produced did not specify a main class, so Hadoop kept failing to find the entry point. With a colleague's help I learned this can be fixed with Maven configuration:

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>${mainClass}</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    
    <properties>
        <mainClass>demo.MyFirstStarter</mainClass>
    </properties>

Also, the JDK inside the Docker Hadoop image is 1.7 while my environment is 1.8, so in the pom I additionally set the compiler source/target to 1.7 (the maven-compiler-plugin settings above).

3. Run my demo inside the hadoop2 container.

From the $HADOOP_PREFIX directory, run bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar:

    bash-4.1# bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
    18/06/11 07:54:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    18/06/11 07:54:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    18/06/11 07:54:13 INFO input.FileInputFormat: Total input paths to process : 1
    18/06/11 07:54:13 INFO mapreduce.JobSubmitter: number of splits:1
    18/06/11 07:54:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0009
    18/06/11 07:54:13 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0009
    18/06/11 07:54:13 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0009/
    18/06/11 07:54:13 INFO mapreduce.Job: Running job: job_1528635021541_0009
    18/06/11 07:54:20 INFO mapreduce.Job: Job job_1528635021541_0009 running in uber mode : false
    18/06/11 07:54:20 INFO mapreduce.Job:  map 0% reduce 0%
    18/06/11 07:54:25 INFO mapreduce.Job:  map 100% reduce 0%
    18/06/11 07:54:31 INFO mapreduce.Job:  map 100% reduce 100%
    18/06/11 07:54:31 INFO mapreduce.Job: Job job_1528635021541_0009 completed successfully
    18/06/11 07:54:31 INFO mapreduce.Job: Counters: 49
    	File System Counters
    		FILE: Number of bytes read=1606
    		FILE: Number of bytes written=232725
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    		HDFS: Number of bytes read=6940
    		HDFS: Number of bytes written=7
    		HDFS: Number of read operations=6
    		HDFS: Number of large read operations=0
    		HDFS: Number of write operations=2
    	Job Counters
    		Launched map tasks=1
    		Launched reduce tasks=1
    		Data-local map tasks=1
    		Total time spent by all maps in occupied slots (ms)=3059
    		Total time spent by all reduces in occupied slots (ms)=3265
    		Total time spent by all map tasks (ms)=3059
    		Total time spent by all reduce tasks (ms)=3265
    		Total vcore-seconds taken by all map tasks=3059
    		Total vcore-seconds taken by all reduce tasks=3265
    		Total megabyte-seconds taken by all map tasks=3132416
    		Total megabyte-seconds taken by all reduce tasks=3343360
    	Map-Reduce Framework
    		Map input records=160
    		Map output records=160
    		Map output bytes=1280
    		Map output materialized bytes=1606
    		Input split bytes=104
    		Combine input records=0
    		Combine output records=0
    		Reduce input groups=1
    		Reduce shuffle bytes=1606
    		Reduce input records=160
    		Reduce output records=1
    		Spilled Records=320
    		Shuffled Maps =1
    		Failed Shuffles=0
    		Merged Map outputs=1
    		GC time elapsed (ms)=43
    		CPU time spent (ms)=1140
    		Physical memory (bytes) snapshot=434499584
    		Virtual memory (bytes) snapshot=1367728128
    		Total committed heap usage (bytes)=354942976
    	Shuffle Errors
    		BAD_ID=0
    		CONNECTION=0
    		IO_ERROR=0
    		WRONG_LENGTH=0
    		WRONG_MAP=0
    		WRONG_REDUCE=0
    	File Input Format Counters
    		Bytes Read=6836
    	File Output Format Counters
    		Bytes Written=7

It ran successfully!

Check the output:

    bash-4.1# bin/hdfs dfs -ls  /myout
    Found 2 items
    -rw-r--r--   1 root supergroup          0 2018-06-11 07:54 /myout/_SUCCESS
    -rw-r--r--   1 root supergroup          7 2018-06-11 07:54 /myout/part-r-00000
    bash-4.1# bin/hdfs dfs -cat  /myout/part-r-00000
    1	6676
    bash-4.1#

6676 characters in total.

The mapper counts line.length() for each line, which excludes the newline character, while the File Input Format counter (Bytes Read = 6836) includes it. With 160 input lines: 6836 - 160 newline characters = 6676.
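
As a quick sanity check of that arithmetic, here is a tiny local program (a sketch; it assumes a copy of in.txt is available in the working directory) that counts characters the same way the mapper does:

    package demo;
    
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    
    public class CountCheck {
        public static void main(String[] args) throws IOException {
            // readAllLines strips line terminators, matching the mapper's line.length().
            List<String> lines = Files.readAllLines(Paths.get("in.txt"));
            int total = 0;
            for (String line : lines) {
                total += line.length();
            }
            // For the file above this should print: 160 lines, 6676 characters
            System.out.println(lines.size() + " lines, " + total + " characters");
        }
    }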

My own demo runs successfully!
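
One last aside: the WARN line in the job log ("Hadoop command-line option parsing not performed...") appears because my starter does not implement the Tool interface. A variant using ToolRunner would look roughly like the sketch below (MyFirstToolStarter is a hypothetical name; this is not the code in the repository):

    package demo;
    
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class MyFirstToolStarter extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() carries any -D options that ToolRunner parsed for us.
            Job job = Job.getInstance(getConf());
            job.setJarByClass(MyFirstToolStarter.class);
            job.setMapperClass(MyFirstMapper.class);
            job.setReducerClass(MyFirstReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
            FileOutputFormat.setOutputPath(job, new Path("/myout"));
            return job.waitForCompletion(true) ? 0 : 1;
        }
    
        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyFirstToolStarter(), args));
        }
    }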

     
