zoukankan      html  css  js  c++  java
  • Mahout推荐算法之ItemBased

    Mahout推荐之ItemBased

    一、   算法原理

    (一)    基本原理

    如下图评分矩阵所示:行为user,列为item.

    图(1)


        该算法的原理:

    1.  计算Item之间的相似度。

    2.  对用户U做推荐


    公式(一)

    Map tmp ;

    Map tmp1 ;

    for(item a  in userRatedItems){

      rate  =userforItemRate(a)

      ListsimItem =getSimItem(a);

      For(Jin simItem){

        Item b =j;

        Simab=sim(a,b);

        Tmp.add(b,Tmp .get(b)+simab*rate)

    tmp1.add(b, tmp1.get(b)+simab)

    }

    }

    Maptmp2=temp/temp1

    Sortbyval(tmp2)

    return topK(tmp2,k)

     

    (二)    相似度计算

    1.  Cos相似度


    公式(二)

    2.  皮尔逊相似度


    公式(三)

    3.  调整的cos相似度


    公式(四)

    (三)    采样

    计算全量的itemPair之间的相似度耗费大量的时间,也是没有必要的,所以需要采样,减小计算量。

    二、   单机模式实现

    (一)    候选Item搜索

    计算所有Item Pair之间的相似度在单机模式下是不现实的,需要在海量的候选集中搜索出一部分最有可能的候选集用于计算。Mahout提供了4中候选Item选择策略。

    1.  AllSimilarItemsCandidateItemsStrategy

    @Override

      FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet candidateItemIDs = new FastIDSet();

        for (long itemID : preferredItemIDs) {

          candidateItemIDs.addAll(similarity.allSimilarItemIDs(itemID));

        }

        candidateItemIDs.removeAll(preferredItemIDs);

        return candidateItemIDs;

      }

    2.  AllUnknownItemsCandidateItemsStrategy

    @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet possibleItemIDs = new FastIDSet(dataModel.getNumItems());

        LongPrimitiveIterator allItemIDs = dataModel.getItemIDs();

        while (allItemIDs.hasNext()) {

          possibleItemIDs.add(allItemIDs.nextLong());

        }

        possibleItemIDs.removeAll(preferredItemIDs);

        return possibleItemIDs;

      }

    3.  PreferredItemsNeighborhoodCandidateItemsStrategy

      @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        FastIDSet possibleItemsIDs = new FastIDSet();

        for (long itemID : preferredItemIDs) {

          PreferenceArray itemPreferences = dataModel.getPreferencesForItem(itemID);

          int numUsersPreferringItem = itemPreferences.length();

          for (int index = 0; index < numUsersPreferringItem; index++) {

            possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(itemPreferences.getUserID(index)));

          }

        }

        possibleItemsIDs.removeAll(preferredItemIDs);

        return possibleItemsIDs;

      }

    4.  SamplingCandidateItemsStrategy

    private static int computeMaxFrom(int factor, int numThings) {

        if (factor == NO_LIMIT_FACTOR) {

          return MAX_LIMIT;

        }

        long max = (long) (factor * (1.0 + Math.log(numThings) / LOG2));

        return max > MAX_LIMIT ? MAX_LIMIT : (int) max;

      }

     

      @Override

      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {

        LongPrimitiveIterator preferredItemIDsIterator = new LongPrimitiveArrayIterator(preferredItemIDs);

        if (preferredItemIDs.length > maxItems) {

          double samplingRate = (double) maxItems / preferredItemIDs.length;

    //      log.info("preferredItemIDs.length {}, samplingRate {}", preferredItemIDs.length, samplingRate);

          preferredItemIDsIterator =

              new SamplingLongPrimitiveIterator(preferredItemIDsIterator, samplingRate);

        }

        FastIDSet possibleItemsIDs = new FastIDSet();

        while (preferredItemIDsIterator.hasNext()) {

          long itemID = preferredItemIDsIterator.nextLong();

          PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);

          int prefsLength = prefs.length();

          if (prefsLength > maxUsersPerItem) {

            Iterator<Preference> sampledPrefs =

                new FixedSizeSamplingIterator<Preference>(maxUsersPerItem, prefs.iterator());

            while (sampledPrefs.hasNext()) {

              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(sampledPrefs.next().getUserID()));

            }

          } else {

            for (int i = 0; i < prefsLength; i++) {

              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(prefs.getUserID(i)));

            }

          }

        }

        possibleItemsIDs.removeAll(preferredItemIDs);

        return possibleItemsIDs;

      }

     

      private void addSomeOf(FastIDSet possibleItemIDs, FastIDSet itemIDs) {

        if (itemIDs.size() > maxItemsPerUser) {

          LongPrimitiveIterator it =

              new SamplingLongPrimitiveIterator(itemIDs.iterator(), (double) maxItemsPerUser / itemIDs.size());

          while (it.hasNext()) {

            possibleItemIDs.add(it.nextLong());

          }

        } else {

          possibleItemIDs.addAll(itemIDs);

        }

      }

    (二)    估值

    protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)

        throws TasteException {

        double preference = 0.0;

        double totalSimilarity = 0.0;

        int count = 0;

        double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());

        for (int i = 0; i < similarities.length; i++) {

          double theSimilarity = similarities[i];

          if (!Double.isNaN(theSimilarity)) {

            // Weights can be negative!

            preference += theSimilarity * preferencesFromUser.getValue(i);

            totalSimilarity += theSimilarity;

            count++;

          }

        }

        // Throw out the estimate if it was based on no data points, of course, but also if based on

        // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.

        // The reason is that in this case the estimate is, simply, the user's rating for one item

        // that happened to have a defined similarity. The similarity score doesn't matter, and that

        // seems like a bad situation.

        if (count <= 1) {

          return Float.NaN;

        }

        float estimate = (float) (preference / totalSimilarity);

        if (capper != null) {

          estimate = capper.capEstimate(estimate);

        }

        return estimate;

      }

    (三)    推荐

    1.  根据历史评分列表推荐

    这种推荐方式根据用户之前产生过评分的item做推荐,推荐结果按照估计值的大小排序。

    @Override

      public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {

        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

        log.debug("Recommending items for user ID '{}'", userID);

        PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);

        if (preferencesFromUser.length() == 0) {

          return Collections.emptyList();

        }

        FastIDSet possibleItemIDs = getAllOtherItems(userID, preferencesFromUser);

        TopItems.Estimator<Long> estimator = new Estimator(userID, preferencesFromUser);

        List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer,

          estimator);

        log.debug("Recommendations are: {}", topItems);

        return topItems;

      }

    2.  Because推荐

    这种推荐方式用于实时推荐。

    @Override

      public List<RecommendedItem> recommendedBecause(long userID, long itemID, int howMany) throws TasteException {

        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");

        DataModel model = getDataModel();

        TopItems.Estimator<Long> estimator = new RecommendedBecauseEstimator(userID, itemID);

        PreferenceArray prefs = model.getPreferencesFromUser(userID);

        int size = prefs.length();

        FastIDSet allUserItems = new FastIDSet(size);

        for (int i = 0; i < size; i++) {

          allUserItems.add(prefs.getItemID(i));

        }

        allUserItems.remove(itemID);

        return TopItems.getTopItems(howMany, allUserItems.iterator(), null, estimator);

      }

     

    //估值方法

    @Override

    public double estimate(Long itemID) throws TasteException {

          Float pref = getDataModel().getPreferenceValue(userID, itemID);

          if (pref == null) {

            return Float.NaN;

          }

          double similarityValue = similarity.itemSimilarity(recommendedItemID, itemID);

          return (1.0 + similarityValue) * pref;

        }

    三、   MapReduce模式实现

    (一)    将偏好文件转换成评分矩阵(PreparePreferenceMatrixJob)

    (二)    计算共现矩阵相似度(RowSimilarityJob)

    (三)    挑选最相似的K个Item

    (四)    用户偏好向量和相似降维后的共现矩阵做乘法

    (五)    过滤制定的user item

    (六)    生成最终的推荐结果

    四、   实例演示

    1.  单机模式

    1)  批量推荐

    DataModel  dataModel = new FileDataModel(new File("p/pereference"));

     

    ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

     

    ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

     

    System.out.println(recommender.recommend(10, 10));

    2)  Because推荐

    DataModel  dataModel = new FileDataModel(new File("p/pereference"));

     

    ItemSimilarity  similarity  = new PearsonCorrelationSimilarity(dataModel);

     

    ItemBasedRecommender  recommender = new GenericItemBasedRecommender(dataModel,similarity );

     

    System.out.println(recommender.recommendedBecause(10, 10328, 100));

    2.  MapReduce模式

    API

    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args)

    --input

    偏好数据路径,文本文件。格式 userid itemid preference

    --output

    推荐结果路径

    -- numRecommendations

    推荐个数

    --usersFile

    需要做出推荐的user,默认全部做推荐

    --itemsFile

    需要做出推荐的item,默认全部做推荐

    --filterFile

    文件格式文本,useriditemid 。目的是给userid的用户不要推荐itemid的item

    --booleanData

    是否是布尔数据

    --maxPrefsPerUser

    最大偏好值

    --minPrefsPerUser

    最小偏好值

    --maxSimilaritiesPerItem

    给每一个Item计算最多的相似item数目

    --maxPrefsPerUserInItemSimilarity

    ItemSimilarity估计item相似度时,对每一个user最多偏好数目

    --similarityClassname

    SIMILARITY_PEARSON_CORRELATION、SIMILARITY_COOCCURRENCE、SIMILARITY_LOGLIKELIHOOD、SIMILARITY_TANIMOTO_COEFFICIENT、SIMILARITY_CITY_BLOCK、SIMILARITY_COSINE、SIMILARITY_EUCLIDEAN_DISTANCE

    --threshold

    删除低于该阈值的item对

    --outputPathForSimilarityMatrix

    指定生成的item相似矩阵路径,文本文件,格式为 itemA itemB 相似值

        实例

    String  [] args ={"--input","p",

    "--output","recommender",

    "--numRecommendations","10",

    "--outputPathForSimilarityMatrix","simMatrix",

    "--similarityClassname","SIMILARITY_PEARSON_CORRELATION"}

    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args);

    五、   参考文献

    1.  M.Deshpandeand G. Karypis. Item-based top-n recommendation algorithms.

    2.  B.M.Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborativefiltering recommendation algorithms.

    3.  Item-based collaborative filtering

    4.  Accuratelycomputing running variance

  • 相关阅读:
    nginx申请并配置免费https
    linux安装git
    linux安装openssl
    nginx配置支持http2
    linux服务器升级nginx
    linux 增加虚拟内存swap(使用文件)
    使用shell安装lnmp
    mysql 数据库主从同步
    Android四大组件之Service
    Android四大组件之Activity
  • 原文地址:https://www.cnblogs.com/cl1024cl/p/6205080.html
Copyright © 2011-2022 走看看