  • HBase SingleColumnValueFilter: rows missing the column are not filtered out

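    Problem description

    I needed to filter a log table by time.

    Normally every row has a timestamp column that records the operation time, and filtering on that column's value returns the log entries for a given time range.

    Somehow, though, the log table had picked up some junk rows (not necessarily junk; they simply lack the timestamp column).

    Filtering for the first day returned 5800 rows with no operation time (timestamp),

    filtering for the second day again returned 5800 rows with no operation time,

    and filtering for the first two days together still returned the same 5800 rows.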

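    Analysis

    The cause is clear: when a row does not contain the column being tested, SingleColumnValueFilter by default treats that row as matching the filter condition.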
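
    The default behaviour described above can be modelled in a few lines of plain Java. This is only a sketch of the decision rule, not HBase's actual implementation: the class `MissingColumnModel`, the method `rowPasses`, and the use of a `Map` as a stand-in for a row are all hypothetical names chosen for illustration.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    public class MissingColumnModel {
        /**
         * Toy model of a time-range filter on one column.
         * row: qualifier -> value (a stand-in for an HBase row, not the real API).
         * Returns true if the row would be emitted.
         */
        public static boolean rowPasses(Map<String, Long> row, String qualifier,
                                        long start, long end, boolean filterIfMissing) {
            Long value = row.get(qualifier);
            if (value == null) {
                // Column absent: the row is emitted unless filterIfMissing is set.
                return !filterIfMissing;
            }
            // Column present: emit only if start <= value < end.
            return value >= start && value < end;
        }

        public static void main(String[] args) {
            Map<String, Long> normalRow = new HashMap<>();
            normalRow.put("timestamp", 150L);
            Map<String, Long> rowWithoutTimestamp = new HashMap<>();

            // Default policy: the row without a timestamp slips through every range.
            System.out.println(rowPasses(rowWithoutTimestamp, "timestamp", 100L, 200L, false));
            // With filterIfMissing enabled, the same row is dropped.
            System.out.println(rowPasses(rowWithoutTimestamp, "timestamp", 100L, 200L, true));
            // A normal row is judged on its value either way.
            System.out.println(rowPasses(normalRow, "timestamp", 100L, 200L, true));
        }
    }
    ```

    Under this model a row with no timestamp passes every time range, which matches the symptom of the same 5800 rows appearing in each query.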

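    The next step is to make SingleColumnValueFilter use a different policy when it evaluates such rows.

    Reading the source shows that there is a method for changing this policy.

    Code

    The comment at the top of the SingleColumnValueFilter source describes the method (see the paragraph that mentions setFilterIfMissing):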

    /**
     * This filter is used to filter cells based on value. It takes a {@link CompareFilter.CompareOp}
     * operator (equal, greater, not equal, etc), and either a byte [] value or
     * a ByteArrayComparable.
     * <p>
     * If we have a byte [] value then we just do a lexicographic compare. For
     * example, if passed value is 'b' and cell has 'a' and the compare operator
     * is LESS, then we will filter out this cell (return true).  If this is not
     * sufficient (eg you want to deserialize a long and then compare it to a fixed
     * long value), then you can pass in your own comparator instead.
     * <p>
     * You must also specify a family and qualifier.  Only the value of this column
     * will be tested. When using this filter on a {@link Scan} with specified
     * inputs, the column to be tested should also be added as input (otherwise
     * the filter will regard the column as missing).
     * <p>
     * To prevent the entire row from being emitted if the column is not found
     * on a row, use {@link #setFilterIfMissing}.
     * Otherwise, if the column is found, the entire row will be emitted only if
     * the value passes.  If the value fails, the row will be filtered out.
     * <p>
     * In order to test values of previous versions (timestamps), set
     * {@link #setLatestVersionOnly} to false. The default is true, meaning that
     * only the latest version's value is tested and all previous versions are ignored.
     * <p>
     * To filter based on the value of all scanned columns, use {@link ValueFilter}.
     */
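
    The comment's point about lexicographic comparison matters for a timestamp column, because the result depends on how the value was encoded. The sketch below uses only the JDK; `LexicographicCompareDemo` and its helper methods are hypothetical names, and the two encodings are assumptions for illustration, since the original post does not say how its timestamp values are stored. The 8-byte big-endian encoding is the layout that `Bytes.toBytes(long)` produces.

    ```java
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class LexicographicCompareDemo {
        /** Unsigned byte-by-byte comparison; a shorter array sorts first on a tie. */
        public static int compareBytes(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int x = a[i] & 0xff;
                int y = b[i] & 0xff;
                if (x != y) {
                    return x - y;
                }
            }
            return a.length - b.length;
        }

        /** 8-byte big-endian encoding of a long. */
        public static byte[] longBytes(long v) {
            return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
        }

        /** Decimal-string encoding of a long. */
        public static byte[] textBytes(long v) {
            return Long.toString(v).getBytes(StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            long small = 999L;
            long large = 1000L;
            // Fixed-width big-endian bytes keep numeric order for non-negative values.
            System.out.println(compareBytes(longBytes(small), longBytes(large)) < 0);
            // Decimal strings of different lengths do not: "999" sorts after "1000".
            System.out.println(compareBytes(textBytes(small), textBytes(large)) > 0);
        }
    }
    ```

    If the stored values and the comparison bounds use different encodings, or a variable-width text encoding, a custom comparator as suggested in the comment may be needed.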

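    Updated code

    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Keep rows whose timestamp satisfies starttime <= timestamp < endtime.
    SingleColumnValueFilter f1 = new SingleColumnValueFilter(Bytes.toBytes(FAMILY), Bytes.toBytes("timestamp"), CompareOp.GREATER_OR_EQUAL, Bytes.toBytes(starttime));
    SingleColumnValueFilter f2 = new SingleColumnValueFilter(Bytes.toBytes(FAMILY), Bytes.toBytes("timestamp"), CompareOp.LESS, Bytes.toBytes(endtime));
    // true: skip rows that lack the column; false (the default): let them through
    f1.setFilterIfMissing(true);
    f2.setFilterIfMissing(true);
    filters.add(f1);
    filters.add(f2);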

    Reflection

    At first I planned to derive a new class and override some of its methods, but using the existing setter seems more flexible.

  • Original article: https://www.cnblogs.com/erbin/p/4330734.html