  • 【Todo】Finding Common Friends & Spark & Hadoop Interview Questions

    I found the following article and went through its interview questions: <Spark 和hadoop的一些面试题(准备)> ("Some Spark and Hadoop interview questions (preparation)"):

    http://blog.csdn.net/qiezikuaichuan/article/details/51578743

    One of the questions there is quite good; see:

    http://www.aboutyun.com/thread-18826-1-1.html

    http://www.cnblogs.com/lucius/p/3483494.html

    I think it is worth actually coding it up on Hadoop.

    I also think the following passage from the first article sums things up well:

    Briefly describe the data mining algorithms you know and their use cases

    (1) Cases based on classification models

        (a) Spam filtering: usually identified with a naive Bayes classifier

        (b) Tumor diagnosis in medicine: identified via a classification model

    (2) Cases based on predictive models

        (a) Judging red wine quality: a classification and regression tree (CART) model predicts and judges the quality of the wine

        (b) Search engine query volume and stock price movements

    (3) Cases based on association analysis: Walmart's beer and diapers

    (4) Cases based on cluster analysis: retail customer segmentation

    (5) Cases based on outlier analysis: transaction fraud detection in payments

    (6) Cases based on collaborative filtering: e-commerce "you may also like" and recommendation engines

    (7) Cases based on social network analysis: seed customers in telecom

    (8) Cases based on text analysis

        (a) Character recognition: the 扫描王 scanner app

        (b) Literature and statistics: the authorship of Dream of the Red Chamber (红楼梦)

    Back to the common-friends problem above: I wrote a program and gave it a try.

    It lives in the IntelliJ project HadoopProj, a Maven project with the following dependencies:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.hadoop.my</groupId>
        <artifactId>hadoop-proj</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <dependencies>
    
            <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.7.3</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.7.3</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-hdfs</artifactId>
                <version>2.7.3</version>
            </dependency>
    
        </dependencies>
    
        <repositories>
            <repository>
                <id>aliyunmaven</id>
                <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            </repository>
        </repositories>
    
    </project>

    The code is as follows:

    package com.hadoop.my;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    /**
     * Created by baidu on 16/12/3.
     */
    public class HadoopProj {
        public static class CommonFriendsMapper extends Mapper<LongWritable, Text, Text, Text> {
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                String[] split = line.split(":");
                String person = split[0];
                String[] friends = split[1].split(",");
    
                for (String f: friends) {
                    context.write(new Text(f), new Text(person));
                }
            }
    
        }
    
        public static class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
            // Input:  <B->A> <B->E> <B->F> ...
            // Output: B    A,E,F,J
            @Override
            protected void reduce(Text friend, Iterable<Text> persons, Context context) throws IOException, InterruptedException {
                StringBuilder sb = new StringBuilder();

                for (Text person : persons) {
                    sb.append(person).append(",");
                }

                context.write(friend, new Text(sb.toString()));
            }
        }
    
        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
            // Load and parse all the xxx-site.xml configuration files on the classpath
            Configuration conf = new Configuration();
    
            Job friendJob = Job.getInstance(conf);
    
            // Use the main class to locate the jar that contains all of this job's code
            friendJob.setJarByClass(HadoopProj.class);
            // Specify the mapper class for this job
            friendJob.setMapperClass(CommonFriendsMapper.class);
            // Specify the reducer class for this job
            friendJob.setReducerClass(CommonFriendsReducer.class);
    
            // Specify the reducer's output key/value types
            friendJob.setOutputKeyClass(Text.class);
            friendJob.setOutputValueClass(Text.class);
    
            // Specify the path of the files this job should process
            FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
            // Specify the path where this job's output files go
            FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));
    
            // Submit the job to the Hadoop cluster and wait for it to finish
            boolean res = friendJob.waitForCompletion(true);
    
            System.exit(res?0:1);
    
        }
    
    }
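
    To make the data flow concrete, here is a quick trace of the mapper on the first line of the sample input shown below:

    // line  = "A:B,C,D,F,E,O"
    // split = ["A", "B,C,D,F,E,O"]  ->  person = "A", friends = [B, C, D, F, E, O]
    // the mapper emits (B,A) (C,A) (D,A) (F,A) (E,A) (O,A);
    // after the shuffle, the reducer for key B therefore receives every person who lists B as a friend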

    After packaging it into a jar (e.g. with mvn package), I copied it to the Hadoop machine m42n05.

    On that machine, create the input file with the following contents:

    A:B,C,D,F,E,O
    B:A,C,E,K
    C:F,A,D,I
    D:A,E,F,L
    E:B,C,D,M,L
    F:A,B,C,D,E,O,M
    G:A,C,D,E,F
    H:A,C,D,E,O
    I:A,O
    J:B,O
    K:A,C,D
    L:D,E,F
    M:E,F,G
    O:A,H,I,J

    Commands:

    $ hadoop fs -mkdir /input/frienddata
    
    $ hadoop fs -put text.txt /input/frienddata
    
    $ hadoop fs -ls /input/frienddata
    Found 1 items
    -rw-r--r--   3 work supergroup        142 2016-12-03 17:12 /input/frienddata/text.txt

    Copy hadoop-proj.jar to /home/work/data/installed/hadoop-2.7.3/myjars on m42n05.

    Run the command:

    $ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata

    It failed with an error:

    $ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
    16/12/03 17:19:52 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
    Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://master.Hadoop:8390/input/frienddata already exists

    It looks like the command-line arguments were being read at the wrong indices; note how the code picks them up:

    // Specify the path of the files this job should process
    FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
    // Specify the path where this job's output files go
    FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));

    In Java, unlike C++, the arguments really do start at index 0; the program name does not occupy a slot. And since the jar's manifest presumably already names the main class, the class name I typed on the command line was handed to main as args[0], so args[1] became /input/frienddata and was treated as the output path.
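
    Under that assumption, a quick trace of what main received from the failing command:

    // hadoop jar hadoop-proj.jar com.hadoop.my.HadoopProj /input/frienddata /output/frienddata
    // args[0] = "com.hadoop.my.HadoopProj"  -> used by FileInputFormat as the input path
    // args[1] = "/input/frienddata"         -> used by FileOutputFormat as the output path,
    //                                          which already exists, hence the exception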

    So the class name should probably not be passed at all. Re-running with:

    $ hadoop jar /home/work/data/installed/hadoop-2.7.3/myjars/hadoop-proj.jar /input/frienddata /output/frienddata

    This produced the following output:

    16/12/03 17:24:33 INFO client.RMProxy: Connecting to ResourceManager at master.Hadoop/10.117.146.12:8032
    16/12/03 17:24:33 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    16/12/03 17:24:34 INFO input.FileInputFormat: Total input paths to process : 1
    16/12/03 17:24:34 INFO mapreduce.JobSubmitter: number of splits:1
    16/12/03 17:24:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1478254572601_0002
    16/12/03 17:24:34 INFO impl.YarnClientImpl: Submitted application application_1478254572601_0002
    16/12/03 17:24:34 INFO mapreduce.Job: The url to track the job: http://master.Hadoop:8320/proxy/application_1478254572601_0002/
    16/12/03 17:24:34 INFO mapreduce.Job: Running job: job_1478254572601_0002
    16/12/03 17:24:40 INFO mapreduce.Job: Job job_1478254572601_0002 running in uber mode : false
    16/12/03 17:24:40 INFO mapreduce.Job:  map 0% reduce 0%
    16/12/03 17:24:45 INFO mapreduce.Job:  map 100% reduce 0%
    16/12/03 17:24:49 INFO mapreduce.Job:  map 100% reduce 100%
    16/12/03 17:24:50 INFO mapreduce.Job: Job job_1478254572601_0002 completed successfully
    16/12/03 17:24:50 INFO mapreduce.Job: Counters: 49
            File System Counters
                    FILE: Number of bytes read=348
                    FILE: Number of bytes written=238531
                    FILE: Number of read operations=0
                    FILE: Number of large read operations=0
                    FILE: Number of write operations=0
                    HDFS: Number of bytes read=258
                    HDFS: Number of bytes written=156
                    HDFS: Number of read operations=6
                    HDFS: Number of large read operations=0
                    HDFS: Number of write operations=2
            Job Counters 
                    Launched map tasks=1
                    Launched reduce tasks=1
                    Data-local map tasks=1
                    Total time spent by all maps in occupied slots (ms)=2651
                    Total time spent by all reduces in occupied slots (ms)=2446
                    Total time spent by all map tasks (ms)=2651
                    Total time spent by all reduce tasks (ms)=2446
                    Total vcore-milliseconds taken by all map tasks=2651
                    Total vcore-milliseconds taken by all reduce tasks=2446
                    Total megabyte-milliseconds taken by all map tasks=2714624
                    Total megabyte-milliseconds taken by all reduce tasks=2504704
            Map-Reduce Framework
                    Map input records=14
                    Map output records=57
                    Map output bytes=228
                    Map output materialized bytes=348
                    Input split bytes=116
                    Combine input records=0
                    Combine output records=0
                    Reduce input groups=14
                    Reduce shuffle bytes=348
                    Reduce input records=57
                    Reduce output records=14
                    Spilled Records=114
                    Shuffled Maps =1
                    Failed Shuffles=0
                    Merged Map outputs=1
                    GC time elapsed (ms)=111
                    CPU time spent (ms)=1850
                    Physical memory (bytes) snapshot=455831552
                    Virtual memory (bytes) snapshot=4239388672
                    Total committed heap usage (bytes)=342360064
            Shuffle Errors
                    BAD_ID=0
                    CONNECTION=0
                    IO_ERROR=0
                    WRONG_LENGTH=0
                    WRONG_MAP=0
                    WRONG_REDUCE=0
            File Input Format Counters 
                    Bytes Read=142
            File Output Format Counters 
                    Bytes Written=156
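
    As an aside, the WARN line near the top suggests implementing the Tool interface. A minimal sketch of how the driver could be restructured to do that (untested here; the job setup itself is unchanged):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class HadoopProj extends Configured implements Tool {
        // CommonFriendsMapper and CommonFriendsReducer unchanged ...

        @Override
        public int run(String[] args) throws Exception {
            Job friendJob = Job.getInstance(getConf());
            friendJob.setJarByClass(HadoopProj.class);
            friendJob.setMapperClass(CommonFriendsMapper.class);
            friendJob.setReducerClass(CommonFriendsReducer.class);
            friendJob.setOutputKeyClass(Text.class);
            friendJob.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(friendJob, new Path(args[0]));
            FileOutputFormat.setOutputPath(friendJob, new Path(args[1]));
            return friendJob.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner parses the generic options (-D, -files, ...) and then calls run()
            System.exit(ToolRunner.run(new Configuration(), new HadoopProj(), args));
        }
    }

    With that in place, the two path arguments would reach run() after any generic options are stripped.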

    Check where the output files ended up:

    $ hadoop fs -ls /output/frienddata
    Found 2 items
    -rw-r--r--   3 work supergroup          0 2016-12-03 17:24 /output/frienddata/_SUCCESS
    -rw-r--r--   3 work supergroup        156 2016-12-03 17:24 /output/frienddata/part-r-00000
    
    
    $ hadoop fs -cat /output/frienddata/part-r-00000
    A       I,K,C,B,G,F,H,O,D,
    B       A,F,J,E,
    C       A,E,B,H,F,G,K,
    D       G,C,K,A,L,F,E,H,
    E       G,M,L,H,A,F,B,D,
    F       L,M,D,C,G,A,
    G       M,
    H       O,
    I       O,C,
    J       O,
    K       B,
    L       D,E,
    M       E,F,
    O       A,H,I,J,F,
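
    Reading this output: the line for B, for example, means that A, F, J, and E all list B as a friend; in other words, B is a common friend of every pair among A, F, J, and E.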

    Of course, you can also merge the output into a single local file:

    $ hdfs dfs -getmerge hdfs://master.Hadoop:8390/output/frienddata /home/work/frienddatatmp
    
    $ cat frienddatatmp 
    A       I,K,C,B,G,F,H,O,D,
    B       A,F,J,E,
    C       A,E,B,H,F,G,K,
    D       G,C,K,A,L,F,E,H,
    E       G,M,L,H,A,F,B,D,
    F       L,M,D,C,G,A,
    G       M,
    H       O,
    I       O,C,
    J       O,
    K       B,
    L       D,E,
    M       E,F,
    O       A,H,I,J,F,

    And with that, the problem above is done.
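
    Strictly speaking, this is the first of the two MapReduce passes described in the linked write-ups: it inverts the friend lists. To get the common friends of each pair of people, a second job would read the output above, emit every pair of persons with the shared friend as the value, and concatenate the values per pair. A sketch of that second pass (not run here; it assumes the default tab separator in the phase-one output, and would sit in HadoopProj alongside the classes above, plus an import of java.util.Arrays):

    public static class PairMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // A phase-one line looks like "B\tA,F,J,E,"
            String[] split = value.toString().split("\t");
            Text friend = new Text(split[0]);
            String[] persons = split[1].split(",");
            // Sort so that each unordered pair always produces the same key, e.g. "A-E"
            Arrays.sort(persons);
            for (int i = 0; i < persons.length - 1; i++) {
                for (int j = i + 1; j < persons.length; j++) {
                    context.write(new Text(persons[i] + "-" + persons[j]), friend);
                }
            }
        }
    }

    public static class PairReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text pair, Iterable<Text> friends, Context context) throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text friend : friends) {
                sb.append(friend).append(",");
            }
            // e.g. "A-E" -> "B,C,D,": B, C, and D are friends that A and E share
            context.write(pair, new Text(sb.toString()));
        }
    }

    Chained after the first job, with the first job's output directory as this job's input, this would produce one line per pair, e.g. A-B followed by the friends that A and B have in common.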
