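Learning Hadoop on your own is genuinely hard, mainly because the Hadoop versions are a mess and compatibility between them is poor. Worse, many of the MapReduce Java examples online leave out the import statements and show only class names. Hadoop has many classes that share the same name, so without the imports you cannot tell which class is meant. Those examples usually do not state the Hadoop version either, which makes them very hard to follow.
So here I write down everything you need to watch out for, above all which class each import refers to.
Environment: Hadoop 1.2.1 + JDK 1.7 + Eclipse 4.5 + Maven
The Maven pom file is as follows (if you have not used Maven before, read up on what it is first):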
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.howso</groupId>
  <artifactId>hadoopmaven</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>hadoopmaven</name>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>1.2.1</hadoop.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>org.hamcrest</groupId>
      <artifactId>hamcrest-all</artifactId>
      <version>1.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>1.1.0</version>
      <!-- hadoop1 classifier, to match Hadoop 1.2.1 -->
      <classifier>hadoop1</classifier>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-core</artifactId>
      <version>1.8</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <finalName>hadoopx</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <outputDirectory>basedir</outputDirectory>
          <archive>
            <manifest>
              <mainClass>hadoopmaven.Driver</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
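Some of these dependencies are only there for writing Hadoop tests: mrunit and hadoop-test.
There are three classes in total: Driver, MaxMapper, and MaxReducer. Together they find the maximum temperature for each year. All three are in the hadoopmaven package.
Pay close attention to which class you import. Hadoop has quite a few classes with identical names. Mapper and Reducer in particular each exist twice: once in org.apache.hadoop.mapred (the old API) and once in org.apache.hadoop.mapreduce (the new API). All the code here uses the new org.apache.hadoop.mapreduce API.
The Driver class: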
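The name clash can be illustrated without Hadoop at all. The sketch below is only an analogy using two JDK classes that share the simple name Date; the class name ImportAmbiguityDemo and its methods are made up for illustration. The same rule applies to Hadoop's two Mapper types: the import decides which one a simple name refers to, and the other one has to be written with its full package name.

```java
// Analogy only: java.util.Date and java.sql.Date share a simple name, just as
// org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapreduce.Mapper do.
import java.util.Date; // after this import, "Date" always means java.util.Date

class ImportAmbiguityDemo {

    // Resolved through the import statement above.
    static String importedName() {
        return Date.class.getName();
    }

    // The other class with the same simple name must be fully qualified.
    static String otherName() {
        return java.sql.Date.class.getName();
    }

    public static void main(String[] args) {
        System.out.println("Date            -> " + importedName());
        System.out.println("java.sql.Date   -> " + otherName());
    }
}
```

If an example online shows only "extends Mapper" with no import, check whether Mapper is used as a class (new API, extends) or as an interface (old API, implements) to work out which package it came from.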
package hadoopmaven;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    // This version runs successfully.
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Driver(), args);
        System.out.println("exit code: " + exitCode);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "max temperature");
        // Tells Hadoop which jar contains the job classes.
        job.setJarByClass(Driver.class);

        job.setMapperClass(MaxMapper.class);
        // The reducer doubles as the combiner.
        job.setCombinerClass(MaxReducer.class);
        job.setReducerClass(MaxReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS are hard-coded.
        FileInputFormat.addInputPath(job, new Path("/input/temp.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output4"));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
The MaxMapper class:
package hadoopmaven;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Input format, one "year,temperature" record per line:
    // 1991,90
    // 1991,91
    // 1993,98
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] line = value.toString().split(",");
        // Emit (year, temperature).
        context.write(new Text(line[0]), new IntWritable(Integer.parseInt(line[1])));
    }
}
The MaxReducer class:
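The map method above assumes every line is a well-formed "year,temperature" pair. A blank or malformed line would throw ArrayIndexOutOfBoundsException or NumberFormatException and fail the task. The following is a minimal sketch of a more defensive parse step, in plain Java so it can be tried without Hadoop. RecordParser and Record are hypothetical names, not part of the original code, and skipping bad lines is an assumption about the desired behaviour.

```java
// Hypothetical helper (not part of the original job): validates one input line
// before the mapper would emit it.
class RecordParser {

    // Simple holder for one parsed "year,temperature" line.
    static final class Record {
        final String year;
        final int temperature;

        Record(String year, int temperature) {
            this.year = year;
            this.temperature = temperature;
        }
    }

    // Returns the parsed record, or null when the line is malformed.
    static Record parse(String line) {
        if (line == null) {
            return null;
        }
        String[] parts = line.trim().split(",");
        if (parts.length != 2) {
            return null; // wrong number of fields
        }
        String year = parts[0].trim();
        if (year.isEmpty()) {
            return null; // missing year
        }
        try {
            return new Record(year, Integer.parseInt(parts[1].trim()));
        } catch (NumberFormatException e) {
            return null; // temperature is not an integer
        }
    }

    public static void main(String[] args) {
        Record ok = parse("1991,90");
        System.out.println(ok.year + " " + ok.temperature);
        System.out.println(parse("not a record") == null);
    }
}
```

In the mapper, the idea would be to call such a helper first and only call context.write when the result is not null.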
package hadoopmaven;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Find the largest temperature among all values for this year.
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}
This MapReduce job reads /input/temp.txt from HDFS (the file format is shown below), finds the largest value for each year, and writes the result to the /output4 directory. The output directory must not exist before the job starts, otherwise the job fails.
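The Driver registers MaxReducer as the combiner as well as the reducer. That is safe here because taking a maximum gives the same answer whether it is applied to all values at once or to the maxima of separate chunks. The sketch below checks this in plain Java. CombinerCheck is a made-up name, and the way the values are split across two map tasks is an assumption for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: shows why a max reducer can also serve as a combiner.
class CombinerCheck {

    // Same logic as the loop in MaxReducer.reduce, on plain integers.
    static int max(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        // Values for 1992, assumed to be split across two map tasks.
        List<Integer> task1 = Arrays.asList(94, 85);
        List<Integer> task2 = Arrays.asList(5);

        // Combiner runs per task, then the reducer runs on the partial results.
        int viaCombiner = max(Arrays.asList(max(task1), max(task2)));
        // Reducer alone, seeing every value.
        int direct = max(Arrays.asList(94, 85, 5));

        System.out.println(viaCombiner == direct);
    }
}
```

An operation such as a plain average does not have this property, so in that case the reducer could not simply be reused as the combiner.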
1991,33
1991,45
1992,94
1992,85
1992,5
1993,78
1993,75
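To see what the job should produce for this sample before running it on a cluster, the same logic can be simulated in plain Java. This is only a local sketch, not how Hadoop executes the job: LocalMaxTemperature is a hypothetical class, and a TreeMap stands in for the grouping and sorting by key that Hadoop does between map and reduce.

```java
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the MapReduce job: parse each line, keep the max per year.
class LocalMaxTemperature {

    // The sample contents of /input/temp.txt.
    static final String[] SAMPLE = {
        "1991,33", "1991,45", "1992,94", "1992,85", "1992,5", "1993,78", "1993,75"
    };

    static Map<String, Integer> maxPerYear(String[] lines) {
        // TreeMap keeps the years sorted, like the reducer's output keys.
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (String line : lines) {
            String[] parts = line.split(",");          // "map" step
            String year = parts[0];
            int temperature = Integer.parseInt(parts[1]);
            Integer current = result.get(year);        // "reduce" step
            if (current == null || temperature > current) {
                result.put(year, temperature);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : maxPerYear(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

For the sample data this yields 45 for 1991, 94 for 1992, and 78 for 1993, which is what the job's output under /output4 should contain.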
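Finally, build the jar with Maven's clean package goals. Maven writes the main class into the jar's manifest automatically, because the main class name is configured in the pom. The jar ends up in the basedir directory under the project root and is named hadoopx.jar (both of these are also set in the pom).
Put temp.txt into HDFS, copy hadoopx.jar into the Hadoop root directory, change into that directory, and run the job with:
bin/hadoop jar hadoopx.jar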