Daily Study

Today's topic: MapReduce.

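Data cleaning

ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most often used in the context of data warehouses, but it is not limited to them.

Before the core business MapReduce job runs, the data usually has to be cleaned first to remove records that do not meet the requirements. Cleaning usually needs only the Mapper; no Reduce phase is required.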

1) Requirement

Drop every log line that has 11 or fewer fields.

2) Analysis

Filter the input in the Map phase according to this rule.
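Before wiring the rule into Hadoop, it can help to see it in isolation. The following is a minimal sketch in plain Java, with no Hadoop dependency; the class name LogFilterSketch, its methods, and the sample lines are all hypothetical and not part of the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class used only to illustrate the filtering rule.
final class LogFilterSketch {

    // A line is kept only if it has more than 11 space-separated fields.
    static boolean isValid(String line) {
        return line.split(" ").length > 11;
    }

    // Returns the lines that pass the rule, in their original order.
    static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: one line with 12 fields, one with 3.
        List<String> sample = Arrays.asList(
            "f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12",
            "f1 f2 f3"
        );
        System.out.println(clean(sample));
    }
}
```

Each line is judged on its own, with no grouping or aggregation across lines, which is why a Map-only job is enough for this kind of cleaning.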

Implementation:

(1) Write the WebLogMapper class

package com.atguigu.mapreduce.weblog;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1. Get one line of input
        String line = value.toString();

        // 2. Parse the log line
        boolean result = parseLog(line, context);

        // 3. Skip the line if it is invalid
        if (!result) {
            return;
        }

        // 4. Write valid lines out unchanged
        context.write(value, NullWritable.get());
    }

    // Helper that parses a log line and reports whether it is valid
    private boolean parseLog(String line, Context context) {

        // 1. Split the line into fields
        String[] fields = line.split(" ");

        // 2. A line with more than 11 fields is valid
        return fields.length > 11;
    }
}
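One detail of the mapper above is worth checking against real data: it splits on a single space, so consecutive spaces produce empty fields that still count toward the total. The sketch below compares that with trimming and splitting on runs of whitespace. The class SplitFieldCount and the sample string are hypothetical, and whether the stricter count is appropriate depends on the actual log format, which the post does not specify.

```java
// Hypothetical demo class; the sample string is made up for illustration.
final class SplitFieldCount {

    // Counts fields the way the mapper does: split on a single space.
    static int countSingleSpace(String line) {
        return line.split(" ").length;
    }

    // Counts fields after trimming and splitting on runs of whitespace.
    static int countWhitespaceRuns(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String doubled = "a  b c"; // two spaces between "a" and "b"
        System.out.println(countSingleSpace(doubled));
        System.out.println(countWhitespaceRuns(doubled));
    }
}
```

If the logs can contain doubled spaces, the two counts differ, and a line might pass the "more than 11 fields" check only because of empty fields.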

(2) Write the WebLogDriver class

package com.atguigu.mapreduce.weblog;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WebLogDriver {

    public static void main(String[] args) throws Exception {

        // Set the input and output paths to match the actual paths on your machine
        args = new String[] { "D:/input/inputlog", "D:/output1" };

        // 1. Get the job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Locate the jar through the driver class
        job.setJarByClass(WebLogDriver.class);

        // 3. Attach the mapper
        job.setMapperClass(WebLogMapper.class);

        // 4. Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the number of reduce tasks to 0 (Map-only job)
        job.setNumReduceTasks(0);

        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6. Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
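The driver above overwrites args with hard-coded local paths, which is convenient for a local test but means command-line arguments are ignored. One possible alternative, sketched here in plain Java, is to fall back to the defaults only when no paths are passed. The class DriverArgsSketch and the method resolvePaths are hypothetical; the default values are the two paths from the post.

```java
// Hypothetical helper; the defaults are the paths hard-coded in the driver.
final class DriverArgsSketch {

    static final String DEFAULT_INPUT = "D:/input/inputlog";
    static final String DEFAULT_OUTPUT = "D:/output1";

    // Returns {input, output}: the first two command-line arguments when
    // both are given, otherwise the local defaults.
    static String[] resolvePaths(String[] args) {
        if (args != null && args.length >= 2) {
            return new String[] { args[0], args[1] };
        }
        return new String[] { DEFAULT_INPUT, DEFAULT_OUTPUT };
    }

    public static void main(String[] args) {
        String[] paths = resolvePaths(args);
        System.out.println("input=" + paths[0] + ", output=" + paths[1]);
    }
}
```

In the driver, the resolved pair would take the place of the hard-coded args assignment, so the same class could be run locally or submitted with explicit paths.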

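Author: 哦心有

Source: https://www.cnblogs.com/haobox/

Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, provided that the original link is given and this notice is kept; otherwise the right to pursue legal liability is reserved.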

Original article: https://www.cnblogs.com/haobox/p/15630817.html