Getting started with Hadoop:
1. Download the latest release and unpack it to a path of your choice.
2. Configure Hadoop. The configuration files live under the ~/hadoop/conf/ directory. For now I only configured core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/Jack/dfs</value>
  </property>
</configuration>
Above, fs.default.name points clients at the NameNode, and hadoop.tmp.dir sets the local path where Hadoop keeps its DFS data.
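On Hadoop 1.x, a pseudo-distributed setup usually also needs hdfs-site.xml and mapred-site.xml in the same conf/ directory before start-all.sh will bring up HDFS and the JobTracker cleanly. A minimal sketch — the values shown are common single-node defaults, not taken from this article:

```xml
<!-- conf/hdfs-site.xml: one replica is enough on a single-node cluster -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

```xml
<!-- conf/mapred-site.xml: where TaskTrackers find the JobTracker -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```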
3. Format the DFS. Run: > ./hadoop namenode -format
4. Start Hadoop. Run: > ./start-all.sh
At this point Hadoop is up and running; you can check the cluster status through the web UIs on port 50070 (HDFS) and port 50030 (JobTracker).
======================================================================
The first program: HadoopHelloWorld
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class HadoopHelloWorld {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sum the 1s emitted for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HadoopHelloWorld.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
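Stripped of the Hadoop plumbing, the Map/Reduce pair above just tokenizes each line and sums a 1 per token. A plain-JDK sketch of the same logic, handy for sanity-checking the program's output without a cluster (the sample lines here are made up):

```java
import java.util.*;

public class WordCountLocal {
    // Same logic as the Mapper/Reducer pair: map emits (word, 1) per
    // token, reduce sums the ones for each word.
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop");
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(count(lines));
    }
}
```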
Required libraries on the build path:
JRE System Library
Hadoop-core.jar
commons-logging.jar
A note: other tutorials don't mention needing commons-logging.jar, but without it my build kept failing with: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
With all of that in place, compile HadoopHelloWorld.java, put the generated class files into ~/source/java2013/HadoopHelloWorld/, and package them into a jar:
[Jack@win bin]$ jar -cvf HadoopHelloWorld.jar -C ~/source/java2013/HadoopHelloWorld/ .
Upload two input files [ file01, file02 ] as the program's input:
[Jack@win bin]$ ./hadoop fs -mkdir input
[Jack@win bin]$ ./hadoop dfs -put ~/source/java2012/FirstJar/input/file* input
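The article doesn't show the contents of file01/file02, but the job counters below (Map output records=8, Reduce output records=5) are consistent with the classic two-line WordCount sample. A sketch that creates such files locally before uploading them — the contents are an assumption for illustration, not taken from the article:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Arrays;

public class MakeSampleInput {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("input");
        Files.createDirectories(dir);
        // Classic sample: 2 lines, 8 tokens, 5 distinct words --
        // consistent with the job counters reported below (assumed contents).
        Files.write(dir.resolve("file01"), Arrays.asList("Hello World Bye World"));
        Files.write(dir.resolve("file02"), Arrays.asList("Hello Hadoop Goodbye Hadoop"));
    }
}
```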
Run the program:
[Jack@win bin]$ ./hadoop jar HadoopHelloWorld.jar HadoopHelloWorld input output
13/06/20 03:16:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/06/20 03:16:45 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/20 03:16:45 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/20 03:16:45 INFO mapred.FileInputFormat: Total input paths to process : 4
13/06/20 03:16:45 INFO mapred.JobClient: Running job: job_201306200226_0002
13/06/20 03:16:46 INFO mapred.JobClient: map 0% reduce 0%
13/06/20 03:16:59 INFO mapred.JobClient: map 40% reduce 0%
13/06/20 03:17:05 INFO mapred.JobClient: map 80% reduce 0%
13/06/20 03:17:08 INFO mapred.JobClient: map 80% reduce 26%
13/06/20 03:17:11 INFO mapred.JobClient: map 100% reduce 26%
13/06/20 03:17:23 INFO mapred.JobClient: map 100% reduce 100%
13/06/20 03:17:28 INFO mapred.JobClient: Job complete: job_201306200226_0002
13/06/20 03:17:28 INFO mapred.JobClient: Counters: 30
13/06/20 03:17:28 INFO mapred.JobClient:   Job Counters
13/06/20 03:17:28 INFO mapred.JobClient:     Launched reduce tasks=1
13/06/20 03:17:28 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=32074
13/06/20 03:17:28 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/06/20 03:17:28 INFO mapred.JobClient:     Launched map tasks=5
13/06/20 03:17:28 INFO mapred.JobClient:     Data-local map tasks=3
13/06/20 03:17:28 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=23534
13/06/20 03:17:28 INFO mapred.JobClient:   File Input Format Counters
13/06/20 03:17:28 INFO mapred.JobClient:     Bytes Read=54
13/06/20 03:17:28 INFO mapred.JobClient:   File Output Format Counters
13/06/20 03:17:28 INFO mapred.JobClient:     Bytes Written=41
13/06/20 03:17:28 INFO mapred.JobClient:   FileSystemCounters
13/06/20 03:17:28 INFO mapred.JobClient:     FILE_BYTES_READ=104
13/06/20 03:17:28 INFO mapred.JobClient:     HDFS_BYTES_READ=541
13/06/20 03:17:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=128481
13/06/20 03:17:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=41
13/06/20 03:17:28 INFO mapred.JobClient:   Map-Reduce Framework
13/06/20 03:17:28 INFO mapred.JobClient:     Map output materialized bytes=128
13/06/20 03:17:28 INFO mapred.JobClient:     Map input records=2
13/06/20 03:17:28 INFO mapred.JobClient:     Reduce shuffle bytes=122
13/06/20 03:17:28 INFO mapred.JobClient:     Spilled Records=16
13/06/20 03:17:28 INFO mapred.JobClient:     Map output bytes=82
13/06/20 03:17:28 INFO mapred.JobClient:     Total committed heap usage (bytes)=912719872
13/06/20 03:17:28 INFO mapred.JobClient:     CPU time spent (ms)=5190
13/06/20 03:17:28 INFO mapred.JobClient:     Map input bytes=50
13/06/20 03:17:28 INFO mapred.JobClient:     SPLIT_RAW_BYTES=487
13/06/20 03:17:28 INFO mapred.JobClient:     Combine input records=0
13/06/20 03:17:28 INFO mapred.JobClient:     Reduce input records=8
13/06/20 03:17:28 INFO mapred.JobClient:     Reduce input groups=5
13/06/20 03:17:28 INFO mapred.JobClient:     Combine output records=0
13/06/20 03:17:28 INFO mapred.JobClient:     Physical memory (bytes) snapshot=932745216
13/06/20 03:17:28 INFO mapred.JobClient:     Reduce output records=5
13/06/20 03:17:28 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2390478848
13/06/20 03:17:28 INFO mapred.JobClient:     Map output records=8
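The job's result lands under output/ as part files in TextOutputFormat's default key<TAB>value layout, one "word<TAB>count" pair per line (viewable with ./hadoop fs -cat output/part-00000). A small JDK-only sketch of parsing such lines back into a map — the sample contents are illustrative, not taken from the article:

```java
import java.util.*;

public class ReadWordCountOutput {
    // Parse TextOutputFormat's default "key<TAB>value" lines into a map.
    public static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            counts.put(parts[0], Integer.parseInt(parts[1]));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative part-file contents for a two-file sample input.
        List<String> lines = Arrays.asList(
                "Bye\t1", "Goodbye\t1", "Hadoop\t2", "Hello\t2", "World\t2");
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(parse(lines));
    }
}
```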