  • [LTR] Introduction to the RankLib.jar package

    I. Introduction

    RankLib.jar is a library of learning-to-rank algorithms. It currently implements the following:

    • MART
    • RankNet
    • RankBoost
    • AdaRank
    • Coordinate Ascent
    • LambdaMART
    • ListNet
    • Random Forests
    • Linear regression

    II. The jar package

    Usage: java -jar RankLib.jar <Params>
    Params:
      [+] Training (+ tuning and evaluation)
            # Training data
            -train <file>           Training data
            # Ranking algorithm to use
            -ranker <type>          Specify which ranking algorithm to use
                                    0: MART (gradient boosted regression tree)
                                    1: RankNet
                                    2: RankBoost
                                    3: AdaRank
                                    4: Coordinate Ascent
                                    6: LambdaMART
                                    7: ListNet
                                    8: Random Forests
                                    9: Linear regression (L2 regularization)
            # Feature description file listing the features to learn from, one per line; all features are used by default
            [ -feature <file> ]     Feature description file: list features to be considered by the learner, each on a separate line
                                    If not specified, all features will be used.
            # Metric to optimize on the training data
            [ -metric2t <metric> ]  Metric to optimize on the training data. Supported: MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k (default=ERR@10)
            [ -gmax <label> ]       Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
            
            [ -silent ]             Do not print progress messages (which are printed by default)
            # Tune the model on a validation set
            [ -validate <file> ]    Specify if you want to tune your system on the validation data (default=unspecified)
                                    If specified, the final model will be the one that performs best on the validation data
            # Train/validation split ratio
            [ -tvs <x in [0..1]> ] If you don't have separate validation data, use this to set train-validation split to be (x)(1.0-x)
            # Save the learned model to the given file
            [ -save <model> ]       Save the model learned (default=not-save)
            # Evaluate the trained model on this data
            [ -test <file> ]        Specify if you want to evaluate the trained model on this data (default=unspecified)
            # Train/test split ratio
            [ -tts <x in [0..1]> ] Set train-test split to be (x)(1.0-x). -tts will override -tvs
            # Defaults to the same metric as -metric2t
            [ -metric2T <metric> ]  Metric to evaluate on the test data (default to the same as specified for -metric2t)
            # Normalize feature vectors by sum, by mean/standard deviation (z-score), or by min/max
            [ -norm <method>]       Normalize all feature vectors (default=no-normalization). Method can be:
                                    sum: normalize each feature by the sum of all its values
                                    zscore: normalize each feature by its mean/standard deviation
                                    linear: normalize each feature by its min/max values
            # Run k-fold cross-validation on the training data
            [ -kcv <k> ]            Specify if you want to perform k-fold cross validation using the specified training data (default=NoCV)
                                    -tvs can be used to further reserve a portion of the training data in each fold for validation
            # Directory in which to store the models trained during cross-validation
            [ -kcvmd <dir> ]        Directory for models trained via cross-validation (default=not-save)
            
            [ -kcvmn <model> ]      Name for model learned in each fold. It will be prefix-ed with the fold-number (default=empty)
    
        [-] RankNet-specific parameters
            # Number of training epochs
            [ -epoch <T> ]          The number of epochs to train (default=100)
            # Number of hidden layers
            [ -layer <layer> ]      The number of hidden layers (default=1)
            # Number of hidden nodes per layer
            [ -node <node> ]        The number of hidden nodes per layer (default=10)
            # Learning rate
            [ -lr <rate> ]          Learning rate (default=0.00005)
    
        [-] RankBoost-specific parameters
            # Number of training rounds
            [ -round <T> ]          The number of rounds to train (default=300)
            # Number of threshold candidates to search
            [ -tc <k> ]             Number of threshold candidates to search. -1 to use all feature values (default=10)
    
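        [-] AdaRank-specific parameters
            # Number of training rounds
            [ -round <T> ]          The number of rounds to train (default=500)
            # Train without enqueuing too-strong features
            [ -noeq ]               Train without enqueuing too-strong features (default=unspecified)
            # Tolerance between two consecutive rounds of learning
            [ -tolerance <t> ]      Tolerance between two consecutive rounds of learning (default=0.002)
            # Maximum number of times a feature can be selected consecutively without changing performance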
            [ -max <times> ]        The maximum number of times can a feature be consecutively selected without changing performance (default=5)
    
        [-] Coordinate Ascent-specific parameters
            [ -r <k> ]              The number of random restarts (default=5)
            [ -i <iteration> ]      The number of iterations to search in each dimension (default=25)
            [ -tolerance <t> ]      Performance tolerance between two solutions (default=0.001)
            [ -reg <slack> ]        Regularization parameter (default=no-regularization)
    
        [-] {MART, LambdaMART}-specific parameters
            # Number of trees
            [ -tree <t> ]           Number of trees (default=1000)
            # Number of leaves per tree
            [ -leaf <l> ]           Number of leaves for each tree (default=10)
            # Learning rate
            [ -shrinkage <factor> ] Shrinkage, or learning rate (default=0.1)
            # Number of threshold candidates for tree splitting
            [ -tc <k> ]             Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
            # Minimum number of samples per leaf
            [ -mls <n> ]            Min leaf support -- minimum #samples each leaf has to contain (default=1)
            [ -estop <e> ]          Stop early when no improvement is observed on validaton data in e consecutive rounds (default=100)
    
        [-] ListNet-specific parameters
            [ -epoch <T> ]          The number of epochs to train (default=1500)
            [ -lr <rate> ]          Learning rate (default=0.00001)
    
        [-] Random Forests-specific parameters
            [ -bag <r> ]            Number of bags (default=300)
            # Sub-sampling rate
            [ -srate <r> ]          Sub-sampling rate (default=1.0)
            # Feature sampling rate
            [ -frate <r> ]          Feature sampling rate (default=0.3)
            [ -rtype <type> ]       Ranker to bag (default=0, i.e. MART)
            # Number of trees per bag
            [ -tree <t> ]           Number of trees in each bag (default=1)
            # Number of leaves per tree
            [ -leaf <l> ]           Number of leaves for each tree (default=100)
            # Learning rate
            [ -shrinkage <factor> ] Shrinkage, or learning rate (default=0.1)
            # Number of threshold candidates for tree splitting
            [ -tc <k> ]             Number of threshold candidates for tree spliting. -1 to use all feature values (default=256)
            [ -mls <n> ]            Min leaf support -- minimum #samples each leaf has to contain (default=1)
    
        [-] Linear Regression-specific parameters
            [ -L2 <reg> ]           L2 regularization parameter (default=1.0E-10)
    
      [+] Testing previously saved models
            # Model to load
            -load <model>           The model to load
                                    Multiple -load can be used to specify models from multiple folds (in increasing order),
                                      in which case the test/rank data will be partitioned accordingly.
            # Test data
            -test <file>            Test data to evaluate the model(s) (specify either this or -rank but not both)
            # Rank the samples in the given file; cannot be combined with -test
            -rank <file>            Rank the samples in the specified file (specify either this or -test but not both)
            [ -metric2T <metric> ]  Metric to evaluate on the test data (default=ERR@10)
            [ -gmax <label> ]       Highest judged relevance label. It affects the calculation of ERR (default=4, i.e. 5-point scale {0,1,2,3,4})
            [ -score <file>]        Store ranker's score for each object being ranked (has to be used with -rank)
            # Save performance on individual ranked lists (must be used with -test)
            [ -idv <file> ]         Save model performance (in test metric) on individual ranked lists (has to be used with -test)
            # Feature normalization
            [ -norm ]               Normalize feature vectors (similar to -norm for training/tuning)
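
    The three -norm methods listed in the usage text can be illustrated with a small Python sketch. It is not RankLib's code: it normalizes a single feature column exactly as the help text describes, and details such as using the population standard deviation, returning 0.0 for a constant column, and the scope over which RankLib gathers the values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def normalize(values, method):
    """Normalize one feature column with the named method (illustrative only)."""
    if method == "sum":
        # Divide each value by the sum of all values of this feature.
        total = sum(values)
        return [v / total if total else 0.0 for v in values]
    if method == "zscore":
        # Subtract the mean, divide by the (population) standard deviation.
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma if sigma else 0.0 for v in values]
    if method == "linear":
        # Rescale to [0, 1] using the min and max of this feature.
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    raise ValueError(f"unknown method: {method}")

column = [1.0, 2.0, 3.0, 4.0]
for method in ("sum", "zscore", "linear"):
    print(method, normalize(column, method))
```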
    

    1. -train <file>

    Specifies the training data file. The training data format is:

    label    qid:$id    $featureid:$featurevalue    $featureid:$featurevalue ... # description
    

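    Each line is one sample. Samples that belong to the same query share the same qid. label is the relevance of the sample to that query, and the description after # is free text that is not used in training.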
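
    To make the line format concrete, here is a minimal Python sketch that parses one such line. The function name parse_line and the sample line are made up for illustration; RankLib has its own reader.

```python
def parse_line(line):
    """Split one training line into (label, qid, features, description)."""
    # Everything after the first '#' is the free-text description.
    body, _, description = line.partition("#")
    tokens = body.split()
    label = float(tokens[0])
    if not tokens[1].startswith("qid:"):
        raise ValueError("second field must be qid:<id>")
    qid = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)
    return label, qid, features, description.strip()

# Hypothetical sample line in the format shown above.
sample = "2 qid:10 1:0.5 2:0.0 3:1.25 # doc-42"
print(parse_line(sample))
```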

    2. -ranker <type>

    Specifies the ranking algorithm:

    • MART (Multiple Additive Regression Trees), also known as GBDT (Gradient Boosting Decision Tree), GBRT (Gradient Boosting Regression Tree), or TreeNet
    • RankNet
    • RankBoost
    • AdaRank
    • Coordinate Ascent
    • LambdaMART
    • ListNet
    • Random Forests
    • Linear regression

    3. -feature <file>

    Specifies the feature description file for the samples, in the following format:

    feature1
    feature2
    ...
    # featureK (a feature commented out with # is ignored)
    

    4. -metric2t <metric>

    Specifies the information-retrieval metric to optimize during training. Supported metrics:
    MAP, NDCG@k, DCG@k, P@k, RR@k, ERR@k
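
    As a rough illustration of two of these metrics, the Python sketch below computes NDCG@k and ERR@k for one ranked list of relevance labels. The gain 2^label - 1 and the log2 discount are a common formulation and are assumed here, so the values need not match RankLib's output in every case. The gmax argument plays the role of -gmax from the usage text.

```python
import math

def dcg_at_k(labels, k):
    """DCG over the top k labels, with gain 2^label - 1 and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

def err_at_k(labels, k, gmax=4):
    """Expected reciprocal rank; gmax is the highest judged label."""
    err, p_continue = 0.0, 1.0
    for rank, g in enumerate(labels[:k], start=1):
        p_satisfied = (2 ** g - 1) / 2 ** gmax
        err += p_continue * p_satisfied / rank
        p_continue *= 1 - p_satisfied
    return err

# Integer labels of the documents, in ranked order (hypothetical data).
ranked_labels = [0, 1, 0, 2]
print(ndcg_at_k(ranked_labels, 10), err_at_k(ranked_labels, 10))
```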

    5. Example

    java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -test MQ2008/Fold1/test.txt -validate MQ2008/Fold1/vali.txt -ranker 6 -metric2t NDCG@10 -metric2T ERR@10 -save mymodel.txt
    

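    What the command does:
    Training data: MQ2008/Fold1/train.txt
    Test data: MQ2008/Fold1/test.txt
    Validation data: MQ2008/Fold1/vali.txt
    Ranking algorithm: 6 (LambdaMART)
    Training metric: NDCG@10, computed over the top 10 results
    Test metric: ERR@10, computed over the top 10 results
    Saved model: mymodel.txt

    • -validate is optional, but it usually produces a better model, because the final model is then the one that performs best on the validation data. This matters most for RankNet, MART, and LambdaMART.
    • -metric2t only applies to the list-wise algorithms AdaRank, Coordinate Ascent, and LambdaMART. The point-wise and pair-wise algorithms (MART, RankNet, RankBoost) optimize their own internal criterion (RMSE or pair-wise loss). ListNet is a list-wise algorithm, but it also optimizes its own criterion and does not use -metric2t.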
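
    If you launch training from a script, the command above can be assembled programmatically. The Python sketch below only builds the argument list from flags that appear in the usage text. The helper name build_train_command and its keyword arguments are hypothetical, and the jar path is assumed to be bin/RankLib.jar as in the example.

```python
import shlex

def build_train_command(train, ranker, metric_train, jar="bin/RankLib.jar",
                        validate=None, test=None, metric_test=None, save=None):
    """Return the argv list for a RankLib training run (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-train", train,
           "-ranker", str(ranker), "-metric2t", metric_train]
    optional = [("-validate", validate), ("-test", test),
                ("-metric2T", metric_test), ("-save", save)]
    for flag, value in optional:
        if value is not None:  # leave out flags that were not requested
            cmd += [flag, value]
    return cmd

cmd = build_train_command(
    "MQ2008/Fold1/train.txt", 6, "NDCG@10",
    validate="MQ2008/Fold1/vali.txt", test="MQ2008/Fold1/test.txt",
    metric_test="ERR@10", save="mymodel.txt")
print(shlex.join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```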

    6. k-fold cross validation

    • Sequential partitioning
    java -jar bin/RankLib.jar -train MQ2008/Fold1/train.txt -ranker 4 -kcv 5 -kcvmd models/ -kcvmn ca -metric2t NDCG@10 -metric2T ERR@10
    

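    The training data is split sequentially into 5 equal parts. Part i serves as the test data for fold i, and the training data for fold i consists of the remaining parts.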
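
    The sequential scheme can be sketched in Python as follows. It works on a list of query IDs and is only an illustration of the idea: how RankLib places the fold boundaries when the data does not divide evenly is not covered here, and giving the earlier folds one extra query each is an assumption.

```python
def sequential_folds(qids, k):
    """Yield (train, test) query-ID lists for k sequential folds."""
    size, extra = divmod(len(qids), k)
    start = 0
    for i in range(k):
        # Assumption: the first `extra` folds get one more query each.
        end = start + size + (1 if i < extra else 0)
        test = qids[start:end]
        train = qids[:start] + qids[end:]
        yield train, test
        start = end

query_ids = list(range(10))  # hypothetical query IDs, in file order
for fold, (train, test) in enumerate(sequential_folds(query_ids, 5), start=1):
    print(fold, "test:", test, "train:", train)
```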

    • Random partitioning
    java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input MQ2008/Fold1/train.txt -output mydata/ -shuffle
    

    This shuffles the training data train.txt and stores the result as train.txt.shuffled in the mydata/ directory.
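
    For intuition, a shuffle of this kind can be sketched in Python as below. The sketch assumes the shuffle happens at the query level, so the samples of one query stay together, and that samples of the same query are adjacent in the input. Both points are assumptions about the behavior, not taken from RankLib's source.

```python
import random
from itertools import groupby

def shuffle_queries(lines, seed=0):
    """Shuffle whole query groups while keeping each group's lines together."""
    def qid_of(line):
        return line.split()[1]  # second field, e.g. "qid:3"
    # groupby only merges adjacent lines, hence the adjacency assumption.
    groups = [list(group) for _, group in groupby(lines, key=qid_of)]
    random.Random(seed).shuffle(groups)
    return [line for group in groups for line in group]

# Hypothetical training lines in the format described earlier.
lines = ["0 qid:1 1:0.1", "1 qid:1 1:0.2", "0 qid:2 1:0.3",
         "1 qid:3 1:0.4", "0 qid:3 1:0.5"]
for line in shuffle_queries(lines, seed=1):
    print(line)
```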

    • Getting the data for each fold
    java -cp bin/RankLib.jar ciir.umass.edu.features.FeatureManager -input mydata/train.txt.shuffled -output mydata/ -k 5
    

    7. Evaluating a trained model

    java -jar bin/RankLib.jar -load mymodel.txt -test MQ2008/Fold1/test.txt -metric2T ERR@10
    

    8. Comparing models

    java -jar bin/RankLib.jar -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/baseline.ndcg.txt
    java -jar bin/RankLib.jar -load ca.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/ca.ndcg.txt
    java -jar bin/RankLib.jar -load lm.model.txt -test MQ2008/Fold1/test.txt -metric2T NDCG@10 -idv output/lm.ndcg.txt
    

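    The first command has no -load, so it scores the order in which the documents already appear in the test file; that result serves as the baseline. Each output file contains the NDCG@10 value for every query, plus the aggregate value over all queries, for example: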

    NDCG@10   170   0.0
    NDCG@10   176   0.6722390270733757
    NDCG@10   177   0.4772656487866462
    NDCG@10   178   0.539003131276382
    NDCG@10   185   0.6131471927654585
    NDCG@10   189   1.0
    NDCG@10   191   0.6309297535714574
    NDCG@10   192   1.0
    NDCG@10   194   0.2532778777010656
    NDCG@10   197   1.0
    NDCG@10   200   0.6131471927654585
    NDCG@10   204   0.4772656487866462
    NDCG@10   207   0.0
    NDCG@10   209   0.123151194370365
    NDCG@10   221   0.39038004999210174
    NDCG@10   all   0.5193204478059303
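
    A file like this is easy to post-process. The Python sketch below (the function name parse_idv is made up) reads the three columns shown above and checks that, for this sample, the "all" row equals the mean of the per-query values.

```python
def parse_idv(text):
    """Return ({qid: value}, overall) from '<metric> <qid> <value>' lines."""
    per_query, overall = {}, None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, qid, value = parts
        if qid == "all":
            overall = float(value)
        else:
            per_query[qid] = float(value)
    return per_query, overall

SAMPLE = """\
NDCG@10   170   0.0
NDCG@10   176   0.6722390270733757
NDCG@10   177   0.4772656487866462
NDCG@10   178   0.539003131276382
NDCG@10   185   0.6131471927654585
NDCG@10   189   1.0
NDCG@10   191   0.6309297535714574
NDCG@10   192   1.0
NDCG@10   194   0.2532778777010656
NDCG@10   197   1.0
NDCG@10   200   0.6131471927654585
NDCG@10   204   0.4772656487866462
NDCG@10   207   0.0
NDCG@10   209   0.123151194370365
NDCG@10   221   0.39038004999210174
NDCG@10   all   0.5193204478059303
"""

per_query, overall = parse_idv(SAMPLE)
mean_value = sum(per_query.values()) / len(per_query)
print(len(per_query), "queries; mean", mean_value, "vs reported", overall)
```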
    

    Then compare them:

    java -cp RankLib.jar ciir.umass.edu.eval.Analyzer -all output/ -base baseline.ndcg.txt > analysis.txt
    

    The comparison result, analysis.txt, looks like this:

      Overall comparison
      ------------------------------------------------------------------------
      System                        Performance  Improvement         Win  Loss  p-value
      baseline_ndcg.txt [baseline]  0.093
      LM_ndcg.txt                   0.2863       +0.1933 (+207.8%)   9    1     0.03
      CA_ndcg.txt                   0.5193       +0.4263 (+458.26%)  12   0     0.0
    
      Detailed break down
      ------------------------------------------------------------------------
                   [ < -100%)  [-100%,-75%)  [-75%,-50%)  [-50%,-25%)  [-25%,0%)  (0%,+25%]  (+25%,+50%]  (+50%,+75%]  (+75%,+100%]  ( > +100%]
      LM_ndcg.txt  0           0             1            0            0          4          2            2            1             0
      CA_ndcg.txt  0           0             0            0            0          1          6            2            3             0
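
    The Win, Loss, and Improvement columns can be approximated by hand from two per-query result sets. The Python sketch below assumes that a win is a query on which the system scores strictly higher than the baseline, and it does not attempt to reproduce the p-value. The function name compare and the numbers are hypothetical.

```python
def compare(baseline, system):
    """Count per-query wins/losses and the change in mean performance."""
    shared = sorted(baseline.keys() & system.keys())
    wins = sum(system[q] > baseline[q] for q in shared)
    losses = sum(system[q] < baseline[q] for q in shared)
    base_mean = sum(baseline[q] for q in shared) / len(shared)
    sys_mean = sum(system[q] for q in shared) / len(shared)
    return {"wins": wins, "losses": losses,
            "improvement": sys_mean - base_mean}

# Hypothetical per-query NDCG@10 values keyed by query ID.
baseline = {"1": 0.2, "2": 0.5, "3": 0.4}
system = {"1": 0.6, "2": 0.5, "3": 0.1}
print(compare(baseline, system))
```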
    

    9. Re-ranking with a trained model

    java -jar RankLib.jar -load mymodel.txt -rank myResultLists.txt -score myScoreFile.txt
    

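    myScoreFile.txt only adds a score column holding the newly computed ranking score; the rows are not reordered. To obtain the new ranking, sort the rows by that score yourself. In the sample below, the columns are the query ID, the document's position in the input list, and the score.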

    1   0   -7.528650760650635
    1   1   2.9022061824798584
    1   2   -0.700125515460968
    1   3   2.376657485961914
    1   4   -0.29666265845298767
    1   5   -2.038628101348877
    1   6   -5.267711162567139
    1   7   -2.022146463394165
    1   8   0.6741248369216919
    ...
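
    Sorting by score can be done with a few lines of Python. The sketch below assumes the three whitespace-separated columns shown above (query ID, document index, score) and returns, for each query, the document indices from highest to lowest score. The function name rerank is made up.

```python
from collections import defaultdict

def rerank(score_text):
    """Return {qid: [doc indices sorted by descending score]}."""
    by_query = defaultdict(list)
    for line in score_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and the elided '...'
        qid, index, score = parts
        by_query[qid].append((float(score), int(index)))
    # sorted() is stable, so ties keep their original input order.
    return {qid: [index for _, index in sorted(docs, key=lambda d: -d[0])]
            for qid, docs in by_query.items()}

SAMPLE = """\
1   0   -7.528650760650635
1   1   2.9022061824798584
1   2   -0.700125515460968
1   3   2.376657485961914
1   4   -0.29666265845298767
1   5   -2.038628101348877
1   6   -5.267711162567139
1   7   -2.022146463394165
1   8   0.6741248369216919
"""

print(rerank(SAMPLE))
```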
    

    References

    RankLib wiki

  • Original article: https://www.cnblogs.com/memento/p/9398047.html