  • Hadoop: HDFS file patterns

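    An example: a MapReduce job has to process one or two months of log files, which can mean a very large number of files. How do you collect all of their paths quickly?
    HDFS solves this with glob patterns (wildcards), which use the same syntax as Linux shell wildcards.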

    For example:

    Glob           Expansion
    2016/*         2016/05 2016/04
    2016/0[45]     2016/05 2016/04
    2016/0[4-5]    2016/05 2016/04
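
    The expansions in the table can be checked without a cluster. The sketch below uses java.nio.file.PathMatcher from the JDK, which treats *, [45] and [4-5] the same way for these simple cases. It is not Hadoop's own matcher, and the class name GlobTableDemo and the candidate directory names are made up for illustration.

    ```java
    import java.nio.file.FileSystems;
    import java.nio.file.PathMatcher;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    class GlobTableDemo {
        // Hypothetical directory names, chosen only for this illustration.
        static final String[] CANDIDATES = {"2016/03", "2016/04", "2016/05", "2015/04"};

        // Returns the candidates that the glob pattern matches, in input order.
        static List<String> expand(String pattern, String... candidates) {
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + pattern);
            List<String> matched = new ArrayList<>();
            for (String candidate : candidates) {
                if (matcher.matches(Paths.get(candidate))) {
                    matched.add(candidate);
                }
            }
            return matched;
        }

        public static void main(String[] args) {
            for (String pattern : new String[] {"2016/*", "2016/0[45]", "2016/0[4-5]"}) {
                System.out.println(pattern + " -> " + expand(pattern, CANDIDATES));
            }
        }
    }
    ```

    With these candidates, 2016/* matches every directory under 2016, while both bracket patterns match only 2016/04 and 2016/05.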

    Code:

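        // Prints every path that matches the glob pattern, e.g. "2016/0[45]".
        public static void globFiles(String pattern) {
            try {
                // `configuration` is assumed to be an org.apache.hadoop.conf.Configuration
                // field defined elsewhere in the enclosing class.
                FileSystem fileSystem = FileSystem.get(configuration);

                // globStatus returns null when the pattern contains no glob
                // and the path does not exist, so guard before iterating.
                FileStatus[] statuses = fileSystem.globStatus(new Path(pattern));
                if (statuses == null) {
                    return;
                }
                Path[] listPaths = FileUtil.stat2Paths(statuses);
                for (Path path : listPaths) {
                    System.out.println(path);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }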
    
    

    HDFS also provides PathFilter for filtering the paths that are returned; it is similar to java.io.FileFilter. globStatus has an overload that accepts one:

      /**
       * Return an array of FileStatus objects whose path names match pathPattern
       * and is accepted by the user-supplied path filter. Results are sorted by
       * their path names.
       * Return null if pathPattern has no glob and the path does not exist.
       * Return an empty array if pathPattern has a glob and no path matches it. 
       * 
       * @param pathPattern
       *          a regular expression specifying the path pattern
       * @param filter
       *          a user-supplied path filter
       * @return an array of FileStatus objects
       * @throws IOException if any I/O error occurs when fetching file status
       */
      public FileStatus[] globStatus(Path pathPattern, PathFilter filter)
          throws IOException {
        return new Globber(this, pathPattern, filter).glob();
      }
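
    For readers who have not used java.io.FileFilter, the sketch below shows the local-filesystem analogue of the same idea: a predicate that decides which directory entries are kept. It uses only the JDK; the class name FileFilterAnalogy, the helper names and the sample directory names are hypothetical.

    ```java
    import java.io.File;
    import java.io.FileFilter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.util.Arrays;

    class FileFilterAnalogy {
        // Lists the entry names in dir whose name does NOT match the regex, sorted.
        static String[] listExcluding(File dir, String regex) {
            FileFilter filter = file -> !file.getName().matches(regex);
            File[] kept = dir.listFiles(filter);
            if (kept == null) {
                return new String[0]; // dir is missing or is not a directory
            }
            String[] names = new String[kept.length];
            for (int i = 0; i < kept.length; i++) {
                names[i] = kept[i].getName();
            }
            Arrays.sort(names);
            return names;
        }

        // Creates a temporary directory holding three sample subdirectories.
        static File sampleDir() {
            try {
                File dir = Files.createTempDirectory("filter-demo").toFile();
                for (String name : new String[] {"04", "05", "tmp"}) {
                    new File(dir, name).mkdir();
                }
                return dir;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(listExcluding(sampleDir(), "tmp")));
        }
    }
    ```

    A Hadoop PathFilter has the same shape, except that accept receives an org.apache.hadoop.fs.Path instead of a java.io.File.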
    
    

    HDFS itself provides many filters, and Hadoop: The Definitive Guide gives an implementation of a regular-expression filter:

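    // Rejects every path whose full string form matches the regex.
    public class RegexExcludePathFilter implements PathFilter {

        private final String regex;

        public RegexExcludePathFilter(String regex) {
            this.regex = regex;
        }

        @Override
        public boolean accept(Path path) {
            // String.matches tests the whole string, not a substring.
            return !path.toString().matches(regex);
        }
    }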
    
    

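    Use a regular expression to refine the results. The filter below drops any path whose full string ends in /2016/0: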

    fileSystem.listStatus(new Path(uri),new RegexExcludePathFilter("^.*/2016/0$"));
    
    

    The output is:

    hdfs://hadoop:9000/hadoop/2016/04
    hdfs://hadoop:9000/hadoop/2016/05
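
    Both directories survive because String.matches tests the whole string: the regex ^.*/2016/0$ rejects only a path that ends in exactly /2016/0, and neither 2016/04 nor 2016/05 does. A minimal sketch of the same predicate on plain strings, runnable without Hadoop (the class name RegexExcludeDemo and the second regex are illustrative, not from the original):

    ```java
    class RegexExcludeDemo {
        // Same logic as RegexExcludePathFilter.accept, applied to a plain string.
        static boolean accept(String path, String regex) {
            return !path.matches(regex);
        }

        public static void main(String[] args) {
            String[] paths = {
                "hdfs://hadoop:9000/hadoop/2016/04",
                "hdfs://hadoop:9000/hadoop/2016/05"
            };
            for (String regex : new String[] {"^.*/2016/0$", "^.*/2016/04$"}) {
                for (String path : paths) {
                    System.out.println(regex + " accepts " + path + ": " + accept(path, regex));
                }
            }
        }
    }
    ```

    Under this sketch, a regex such as ^.*/2016/04$ would exclude the 2016/04 directory and keep 2016/05.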
    
    

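    A filter sees only the Path, so it can act only on the file name and path; it cannot use file properties such as the creation time.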

  • Original article: https://www.cnblogs.com/re-myself/p/5527587.html