zoukankan html css js c++ java

Hadoop相关日常操作

1.Hive相关

脚本导数据，并设置运行队列

bin/beeline -u 'url' --outputformat=tsv -e "set mapreduce.job.queuename=queue_1" 
-e "select * from search_log where date <= 20150525 and date >= 20150523" > test.txt

将毫秒转换为日期

select from_unixtime(cast(createTime/1000 as bigint)) from video_information;

对值类型为JSON的数据进行解析，如下就是一个字段data为json类型，其中的type代表日志类型，查询搜索日志。

get_json_object(field, "$.field") 
select * from video where date=20151215 and get_json_object(data, "$.type")="search" limit 1;

JSONArray类型解析

表格有3个字段（asrtext array, asraudiourl string)

asraudiourl	string	https://xxx
asrtext	array	[{"text":"我是业主","confidence":1.0,"queryvendor":"1009","querydebug":"{"recordId":"92e12fe7","applicationId":"","eof":1,"result":{"rec":"我是业主","eof":1}}","isfinal":true}]

select asr, asraudiourl, asrvendor from aiservice.asr_info LATERAL VIEW explode(asrtext) asrTable As asr where date=20170523 and asrvendor='AiSpeech' and asr.isfinal=true and asr.text="我是业主" limit 1;

distinct误区

当distinct要求字段值不是null，当distinct x,y时，如果有null，会造成数据错误。所以我们来人工把null转换成一个值

select count(distinct requestid, CASE WHEN resid is null THEN "1" ELSE resid END)

2.Spark相关

spark任务提交，jar包后面跟的是jar包需要的参数

$SPARK_HOME/bin/spark-submit --class com.test.SimilarQuery 
--master yarn-cluster --num-executors 40 --driver-memory 4g 
--executor-memory 2g --executor-cores 1 
similar-query-0.0.1-SNAPSHOT-jar-with-dependencies.jar 20150819 /user/similar-query

3.Hadoop

执行MapReduce Job，并设置运行队列，后面两个是主类需要的参数

hadoop jar game-query-down-0.0.1-SNAPSHOT.jar QueryDownJob 
-Dmapreduce.job.queuename=sns_default arg1 arg2

4.MapReduce输入输出格式

TextInputFormat：默认格式，读取文件的行，key是行的字节偏移量(LongWritable)，value是行内容(Text)

KeyValueInputFormat:把行解析为键值对，key+ ab+value

SequenceFileInputFormat/SequenceFileOutputFormat:二进制格式，key/value都是用户自定义，input和output要保持一致

TextOutputFormat:输出纯文本，每行为key+ ab+value

NullOutputFormat:没有输出，忽略输出数据

MapFileOutputFormat：将结果写入一个MapFile中。MapFile中的键必须是排序的，所以在reducer中必须保证输出的键有序

DBInputFormat/DBOutputFormat：使用JDBC从关系数据库读文件或写文件

查看全文

相关阅读:
时间选择器UIDatePicker的使用
 在app中屏蔽第三方键盘
 plist文件的相关操作
 查看mac上的隐藏文件
 设置ARC有效或者无效
 Linux 下源代码安装编译 ImageMagick6.8.48 且使其支持 JPEG
Linux Netcat 命令—网络工具中的瑞士军刀
 Linux 好书、经典书籍推荐
 让你拥有超能力：程序员应该掌握的统计学公式
 shell 脚本实现的守护进程

原文地址：https://www.cnblogs.com/whuqin/p/4981953.html