  • Mahout Recommendation Algorithms: ItemBased

    Mahout Recommendation: ItemBased

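    I. Algorithm Principles

    (I) Basic Principle

    In the rating matrix shown in the figure below, rows are users and columns are items.

    Figure (1)


        The algorithm works in two steps:

    1.  Compute the similarity between items.

    2.  Make recommendations for user U.


    Formula (1)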

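    // Pseudocode in Java style; userRatingFor, getSimilarItems, sim, divide,
    // sortByValueDescending and topK are placeholder helpers.
    Map<Long, Double> weightedSum = new HashMap<>();    // sum of sim(a, b) * rating(a)
    Map<Long, Double> similaritySum = new HashMap<>();  // sum of sim(a, b)
    for (long a : userRatedItems) {
      double rate = userRatingFor(a);
      for (long b : getSimilarItems(a)) {
        double simAB = sim(a, b);
        weightedSum.merge(b, simAB * rate, Double::sum);
        similaritySum.merge(b, simAB, Double::sum);
      }
    }
    // Predicted score of b = weightedSum(b) / similaritySum(b)
    Map<Long, Double> predicted = divide(weightedSum, similaritySum);
    return topK(sortByValueDescending(predicted), k);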
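
The pseudocode above can be turned into a small self-contained program. The following is only a sketch under stated assumptions: the class `ItemBasedSketch` and its method are hypothetical names (not part of Mahout), the item similarities are assumed to be precomputed and passed in as a map, and items the user has already rated are skipped as candidates.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not Mahout code: weighted-sum item-based prediction. */
final class ItemBasedSketch {

  /**
   * @param userRatings  item ID -> rating given by the target user
   * @param similarities item ID -> (similar item ID -> similarity), assumed precomputed
   * @param k            maximum number of recommendations
   * @return up to k unrated item IDs, highest predicted rating first
   */
  static List<Long> recommend(Map<Long, Double> userRatings,
                              Map<Long, Map<Long, Double>> similarities,
                              int k) {
    Map<Long, Double> weightedSum = new HashMap<>();   // sum of sim * rating
    Map<Long, Double> similaritySum = new HashMap<>(); // sum of sim
    for (Map.Entry<Long, Double> rated : userRatings.entrySet()) {
      Map<Long, Double> neighbors =
          similarities.getOrDefault(rated.getKey(), Collections.<Long, Double>emptyMap());
      for (Map.Entry<Long, Double> neighbor : neighbors.entrySet()) {
        long candidate = neighbor.getKey();
        if (userRatings.containsKey(candidate)) {
          continue; // assumption: never recommend an item the user already rated
        }
        double sim = neighbor.getValue();
        weightedSum.merge(candidate, sim * rated.getValue(), Double::sum);
        similaritySum.merge(candidate, sim, Double::sum);
      }
    }
    // Predicted rating = weighted sum / similarity sum (skipped when the sum is zero).
    Map<Long, Double> predicted = new HashMap<>();
    for (Map.Entry<Long, Double> e : weightedSum.entrySet()) {
      double denominator = similaritySum.get(e.getKey());
      if (denominator != 0.0) {
        predicted.put(e.getKey(), e.getValue() / denominator);
      }
    }
    List<Long> ranked = new ArrayList<>(predicted.keySet());
    ranked.sort(Comparator.comparingDouble((Long id) -> predicted.get(id)).reversed());
    return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
  }
}
```

For example, if the user rated item 1 with 5.0 and item 2 with 1.0, and item 3 is much closer to item 1 while item 4 is much closer to item 2, item 3 is ranked ahead of item 4.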

     

    (II) Similarity Computation

    1.  Cosine similarity


    Formula (2)

    2.  Pearson similarity


    Formula (3)

    3.  Adjusted cosine similarity


    Formula (4)
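
Since the formula images are not reproduced here, the three measures can be illustrated in code. This is a sketch using the usual textbook definitions, which may differ in detail from the lost formulas and from Mahout's own classes: all three are computed over users who rated both items, Pearson subtracts each item's mean over those users, and adjusted cosine subtracts each user's own mean rating. The class name `ItemSimilaritySketch` and the dense `ratings[user][item]` layout (NaN meaning "unrated") are assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical helper, not Mahout code. r[u][i] is user u's rating of item i; NaN = unrated. */
final class ItemSimilaritySketch {

  /** Cosine: no centering. */
  static double cosine(double[][] r, int i, int j) {
    double[] zero = new double[r.length];
    return correlate(r, i, j, zero, zero);
  }

  /** Pearson: each item's ratings are centered on that item's mean over co-rating users. */
  static double pearson(double[][] r, int i, int j) {
    double[] meanI = new double[r.length];
    double[] meanJ = new double[r.length];
    Arrays.fill(meanI, coRatedMean(r, i, j));
    Arrays.fill(meanJ, coRatedMean(r, j, i));
    return correlate(r, i, j, meanI, meanJ);
  }

  /** Adjusted cosine: both ratings are centered on the rating user's own mean. */
  static double adjustedCosine(double[][] r, int i, int j) {
    double[] userMean = new double[r.length];
    for (int u = 0; u < r.length; u++) {
      double sum = 0.0;
      int n = 0;
      for (double v : r[u]) {
        if (!Double.isNaN(v)) {
          sum += v;
          n++;
        }
      }
      userMean[u] = n == 0 ? 0.0 : sum / n;
    }
    return correlate(r, i, j, userMean, userMean);
  }

  /** Mean rating of `item` over users who rated both `item` and `other`. */
  private static double coRatedMean(double[][] r, int item, int other) {
    double sum = 0.0;
    int n = 0;
    for (double[] row : r) {
      if (!Double.isNaN(row[item]) && !Double.isNaN(row[other])) {
        sum += row[item];
        n++;
      }
    }
    return n == 0 ? 0.0 : sum / n;
  }

  /** Normalized dot product of the offset-corrected rating columns over co-rating users. */
  private static double correlate(double[][] r, int i, int j, double[] offI, double[] offJ) {
    double dot = 0.0;
    double normI = 0.0;
    double normJ = 0.0;
    for (int u = 0; u < r.length; u++) {
      if (Double.isNaN(r[u][i]) || Double.isNaN(r[u][j])) {
        continue;
      }
      double a = r[u][i] - offI[u];
      double b = r[u][j] - offJ[u];
      dot += a * b;
      normI += a * a;
      normJ += b * b;
    }
    double denominator = Math.sqrt(normI) * Math.sqrt(normJ);
    return denominator == 0.0 ? Double.NaN : dot / denominator;
  }
}
```

The measures can disagree sharply. If three users rate two items (1, 2), (2, 4) and (3, 6), cosine and Pearson are both close to 1, while adjusted cosine is close to -1 because every user rated the first item below their own average.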

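    (III) Sampling

    Computing the similarity of every item pair takes a great deal of time and is also unnecessary, so sampling is used to reduce the amount of computation.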
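
One simple way to sample is to keep each element independently with probability max / size whenever a collection is larger than a chosen cap, which is the same rate that SamplingCandidateItemsStrategy computes further below. The sketch here is a plain-Java illustration of that idea, not Mahout's iterator classes; `DownSamplingSketch` and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not Mahout code: rate-based down-sampling of an ID array. */
final class DownSamplingSketch {

  /**
   * Returns all IDs when there are at most max of them; otherwise keeps each ID
   * independently with probability max / ids.length, so about max survive on average.
   */
  static List<Long> sample(long[] ids, int max, Random random) {
    double rate = ids.length > max ? (double) max / ids.length : 1.0;
    List<Long> kept = new ArrayList<>();
    for (long id : ids) {
      if (rate >= 1.0 || random.nextDouble() < rate) {
        kept.add(id);
      }
    }
    return kept;
  }
}
```

Note that the result size is only approximately max, since each element is decided independently; a fixed-size sample needs a different technique such as reservoir sampling.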

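    II. Single-Machine Implementation

    (I) Candidate Item Search

    Computing the similarity of all item pairs is impractical on a single machine, so a subset of the most promising candidates has to be picked out of the huge candidate set before any similarity is computed. Mahout provides four candidate-item selection strategies.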

    1.  AllSimilarItemsCandidateItemsStrategy

      // Candidates: every item similar to at least one item the user already prefers.
      @Override
      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
        FastIDSet candidateItemIDs = new FastIDSet();
        for (long itemID : preferredItemIDs) {
          candidateItemIDs.addAll(similarity.allSimilarItemIDs(itemID));
        }
        candidateItemIDs.removeAll(preferredItemIDs);
        return candidateItemIDs;
      }

    2.  AllUnknownItemsCandidateItemsStrategy

      // Candidates: every item in the data model that the user has not rated yet.
      @Override
      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
        FastIDSet possibleItemIDs = new FastIDSet(dataModel.getNumItems());
        LongPrimitiveIterator allItemIDs = dataModel.getItemIDs();
        while (allItemIDs.hasNext()) {
          possibleItemIDs.add(allItemIDs.nextLong());
        }
        possibleItemIDs.removeAll(preferredItemIDs);
        return possibleItemIDs;
      }

    3.  PreferredItemsNeighborhoodCandidateItemsStrategy

      // Candidates: all items rated by any user who also rated one of the preferred items.
      @Override
      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
        FastIDSet possibleItemsIDs = new FastIDSet();
        for (long itemID : preferredItemIDs) {
          PreferenceArray itemPreferences = dataModel.getPreferencesForItem(itemID);
          int numUsersPreferringItem = itemPreferences.length();
          for (int index = 0; index < numUsersPreferringItem; index++) {
            possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(itemPreferences.getUserID(index)));
          }
        }
        possibleItemsIDs.removeAll(preferredItemIDs);
        return possibleItemsIDs;
      }

    4.  SamplingCandidateItemsStrategy

      // The cap grows logarithmically: factor * (1 + log2(numThings)), bounded by MAX_LIMIT.
      private static int computeMaxFrom(int factor, int numThings) {
        if (factor == NO_LIMIT_FACTOR) {
          return MAX_LIMIT;
        }
        long max = (long) (factor * (1.0 + Math.log(numThings) / LOG2));
        return max > MAX_LIMIT ? MAX_LIMIT : (int) max;
      }

      // Same neighborhood search as strategy 3, but it down-samples the preferred items,
      // the users per item, and the items per user.
      @Override
      protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
        LongPrimitiveIterator preferredItemIDsIterator = new LongPrimitiveArrayIterator(preferredItemIDs);
        if (preferredItemIDs.length > maxItems) {
          double samplingRate = (double) maxItems / preferredItemIDs.length;
          // log.info("preferredItemIDs.length {}, samplingRate {}", preferredItemIDs.length, samplingRate);
          preferredItemIDsIterator =
              new SamplingLongPrimitiveIterator(preferredItemIDsIterator, samplingRate);
        }
        FastIDSet possibleItemsIDs = new FastIDSet();
        while (preferredItemIDsIterator.hasNext()) {
          long itemID = preferredItemIDsIterator.nextLong();
          PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
          int prefsLength = prefs.length();
          if (prefsLength > maxUsersPerItem) {
            Iterator<Preference> sampledPrefs =
                new FixedSizeSamplingIterator<Preference>(maxUsersPerItem, prefs.iterator());
            while (sampledPrefs.hasNext()) {
              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(sampledPrefs.next().getUserID()));
            }
          } else {
            for (int i = 0; i < prefsLength; i++) {
              addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(prefs.getUserID(i)));
            }
          }
        }
        possibleItemsIDs.removeAll(preferredItemIDs);
        return possibleItemsIDs;
      }

      // Adds all of a user's items, or a random sample when the user has more than maxItemsPerUser.
      private void addSomeOf(FastIDSet possibleItemIDs, FastIDSet itemIDs) {
        if (itemIDs.size() > maxItemsPerUser) {
          LongPrimitiveIterator it =
              new SamplingLongPrimitiveIterator(itemIDs.iterator(), (double) maxItemsPerUser / itemIDs.size());
          while (it.hasNext()) {
            possibleItemIDs.add(it.nextLong());
          }
        } else {
          possibleItemIDs.addAll(itemIDs);
        }
      }
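
To make the neighborhood idea behind strategies 3 and 4 concrete without the Mahout types, here is a minimal sketch on plain Java collections. The class `CandidateSearchSketch` and the `itemsByUser` map are assumptions made for illustration; the logic mirrors the code above in that every item rated by a user who shares at least one item with the target user becomes a candidate, and the target user's own items are removed at the end.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, Mahout-free illustration of the preferred-items-neighborhood search. */
final class CandidateSearchSketch {

  /**
   * @param preferredItems items the target user has rated
   * @param itemsByUser    user ID -> items that user has rated
   * @return items co-rated with any preferred item, minus the preferred items themselves
   */
  static Set<Long> candidates(Set<Long> preferredItems, Map<Long, Set<Long>> itemsByUser) {
    Set<Long> result = new HashSet<>();
    for (Set<Long> items : itemsByUser.values()) {
      // A user contributes candidates only if they rated at least one preferred item.
      boolean overlaps = false;
      for (Long item : items) {
        if (preferredItems.contains(item)) {
          overlaps = true;
          break;
        }
      }
      if (overlaps) {
        result.addAll(items);
      }
    }
    result.removeAll(preferredItems);
    return result;
  }
}
```

With users rating {1, 2}, {2, 3} and {4, 5}, a target user who prefers items 1 and 2 gets the single candidate 3; items 4 and 5 are never considered because their only rater shares nothing with the target user.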

    (II) Preference Estimation

      protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
        throws TasteException {
        double preference = 0.0;
        double totalSimilarity = 0.0;
        int count = 0;
        double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
        for (int i = 0; i < similarities.length; i++) {
          double theSimilarity = similarities[i];
          if (!Double.isNaN(theSimilarity)) {
            // Weights can be negative!
            preference += theSimilarity * preferencesFromUser.getValue(i);
            totalSimilarity += theSimilarity;
            count++;
          }
        }
        // Throw out the estimate if it was based on no data points, of course, but also if based on
        // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
        // The reason is that in this case the estimate is, simply, the user's rating for one item
        // that happened to have a defined similarity. The similarity score doesn't matter, and that
        // seems like a bad situation.
        if (count <= 1) {
          return Float.NaN;
        }
        float estimate = (float) (preference / totalSimilarity);
        if (capper != null) {
          estimate = capper.capEstimate(estimate);
        }
        return estimate;
      }
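
A small numeric example may help with the `count <= 1` rule. With a single defined similarity s and rating r, the estimate is s * r / s = r whatever s is, which is why the method returns NaN in that case. The sketch below repeats the weighted average on plain arrays; `EstimateSketch` is a hypothetical name, and the capping step is left out.

```java
/** Hypothetical helper, not Mahout code: similarity-weighted average of a user's ratings. */
final class EstimateSketch {

  /**
   * @param similarities similarities[i] = sim(target item, i-th rated item); NaN when undefined
   * @param ratings      ratings[i] = the user's rating of the i-th rated item
   * @return the weighted average, or NaN when fewer than two similarities are defined
   */
  static float estimate(double[] similarities, float[] ratings) {
    double preference = 0.0;
    double totalSimilarity = 0.0;
    int count = 0;
    for (int i = 0; i < similarities.length; i++) {
      if (!Double.isNaN(similarities[i])) {
        preference += similarities[i] * ratings[i];
        totalSimilarity += similarities[i];
        count++;
      }
    }
    if (count <= 1) {
      return Float.NaN; // one data point would just echo the user's own rating
    }
    return (float) (preference / totalSimilarity);
  }
}
```

For similarities {0.8, 0.2, NaN} and ratings {5, 1, 3}, the third item is ignored and the estimate is (0.8 * 5 + 0.2 * 1) / (0.8 + 0.2), which is about 4.2.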

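    (III) Recommendation

    1.  Recommendation from the rating history

    This mode recommends based on the items the user has already rated; the results are sorted by estimated preference.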

      @Override
      public List<RecommendedItem> recommend(long userID, int howMany, IDRescorer rescorer) throws TasteException {
        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
        log.debug("Recommending items for user ID '{}'", userID);
        PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);
        if (preferencesFromUser.length() == 0) {
          return Collections.emptyList();
        }
        FastIDSet possibleItemIDs = getAllOtherItems(userID, preferencesFromUser);
        TopItems.Estimator<Long> estimator = new Estimator(userID, preferencesFromUser);
        List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer,
          estimator);
        log.debug("Recommendations are: {}", topItems);
        return topItems;
      }

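    2.  Because recommendation

    This mode is used for real-time recommendation. For a given user and item, it ranks the items the user has already rated by how strongly they support recommending that item.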

      @Override
      public List<RecommendedItem> recommendedBecause(long userID, long itemID, int howMany) throws TasteException {
        Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
        DataModel model = getDataModel();
        TopItems.Estimator<Long> estimator = new RecommendedBecauseEstimator(userID, itemID);
        PreferenceArray prefs = model.getPreferencesFromUser(userID);
        int size = prefs.length();
        FastIDSet allUserItems = new FastIDSet(size);
        for (int i = 0; i < size; i++) {
          allUserItems.add(prefs.getItemID(i));
        }
        allUserItems.remove(itemID);
        return TopItems.getTopItems(howMany, allUserItems.iterator(), null, estimator);
      }

      // Estimation method used by RecommendedBecauseEstimator
      @Override
      public double estimate(Long itemID) throws TasteException {
        Float pref = getDataModel().getPreferenceValue(userID, itemID);
        if (pref == null) {
          return Float.NaN;
        }
        double similarityValue = similarity.itemSimilarity(recommendedItemID, itemID);
        return (1.0 + similarityValue) * pref;
      }
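
The scoring rule `(1.0 + similarityValue) * pref` can be tried out on plain maps. The following is a rough sketch rather than Mahout's implementation: `BecauseSketch` is a hypothetical class, the similarities to the recommended item are assumed to be given, and items without a defined similarity are simply skipped.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not Mahout code: ranks a user's items by (1 + similarity) * rating. */
final class BecauseSketch {

  /**
   * @param userRatings        item ID -> the user's rating
   * @param similarityToTarget item ID -> similarity to the recommended item (assumed given)
   * @param howMany            maximum number of items to return
   * @return the user's items that best explain the recommendation, strongest first
   */
  static List<Long> because(Map<Long, Float> userRatings,
                            Map<Long, Double> similarityToTarget,
                            int howMany) {
    Map<Long, Double> scores = new HashMap<>();
    for (Map.Entry<Long, Float> rated : userRatings.entrySet()) {
      Double sim = similarityToTarget.get(rated.getKey());
      if (sim == null || sim.isNaN()) {
        continue; // assumption: undefined similarities are ignored
      }
      scores.put(rated.getKey(), (1.0 + sim) * rated.getValue());
    }
    List<Long> ranked = new ArrayList<>(scores.keySet());
    ranked.sort(Comparator.comparingDouble((Long id) -> scores.get(id)).reversed());
    return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
  }
}
```

Because the rating is a multiplier, a highly similar item with a low rating can rank below a weakly similar item with a high rating: a rating of 1 at similarity 0.95 scores 1.95, while a rating of 5 at similarity 0.2 scores 6.0.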

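    III. MapReduce Implementation

    (I) Convert the preference file into a rating matrix (PreparePreferenceMatrixJob)

    (II) Compute the similarities of the co-occurrence matrix (RowSimilarityJob)

    (III) Select the K most similar items for each item

    (IV) Multiply the user preference vector by the reduced (top-K) similarity matrix

    (V) Filter out the specified user-item pairs

    (VI) Generate the final recommendations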
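
Steps (III) to (V) can be pictured on a single machine as pruning each similarity row to its K largest entries, multiplying by the user's preference vector, and dropping excluded items. The sketch below only illustrates that data flow on in-memory maps; `CooccurrenceSketch` is a hypothetical class, the scores are left as raw sums, and the real Hadoop jobs are distributed and may normalize scores differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-machine illustration of steps (III) to (V); not Mahout code. */
final class CooccurrenceSketch {

  /** Step (III): keeps the k entries of a similarity row with the largest values. */
  static Map<Long, Double> topK(Map<Long, Double> row, int k) {
    List<Map.Entry<Long, Double>> entries = new ArrayList<>(row.entrySet());
    entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
    Map<Long, Double> kept = new HashMap<>();
    for (Map.Entry<Long, Double> e : entries.subList(0, Math.min(k, entries.size()))) {
      kept.put(e.getKey(), e.getValue());
    }
    return kept;
  }

  /**
   * Steps (IV) and (V): score(j) = sum over rated items i of sim(i, j) * pref(i),
   * skipping items the user already rated and items in the excluded set.
   */
  static Map<Long, Double> multiply(Map<Long, Map<Long, Double>> similarity,
                                    Map<Long, Double> prefs,
                                    Set<Long> excluded) {
    Map<Long, Double> scores = new HashMap<>();
    for (Map.Entry<Long, Double> pref : prefs.entrySet()) {
      Map<Long, Double> row = similarity.get(pref.getKey());
      if (row == null) {
        continue; // no similarity row for this rated item
      }
      for (Map.Entry<Long, Double> cell : row.entrySet()) {
        long candidate = cell.getKey();
        if (prefs.containsKey(candidate) || excluded.contains(candidate)) {
          continue;
        }
        scores.merge(candidate, cell.getValue() * pref.getValue(), Double::sum);
      }
    }
    return scores;
  }
}
```

Sorting the returned scores in descending order and keeping the first few entries would correspond to step (VI).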

    IV. Worked Examples

    1.  Single-machine mode

    1)  Batch recommendation

    DataModel dataModel = new FileDataModel(new File("p/pereference"));
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
    // recommend(userID, howMany): the top 10 recommendations for user 10
    System.out.println(recommender.recommend(10, 10));

    2)  Because recommendation

    DataModel dataModel = new FileDataModel(new File("p/pereference"));
    ItemSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, similarity);
    // recommendedBecause(userID, itemID, howMany): up to 100 of user 10's items
    // that best explain recommending item 10328
    System.out.println(recommender.recommendedBecause(10, 10328, 100));

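    2.  MapReduce mode

    API

    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args)

    --input

    Path to the preference data, a text file with one record per line containing userid, itemid, and preference.

    --output

    Path for the recommendation results.

    --numRecommendations

    Number of recommendations per user.

    --usersFile

    File listing the users to recommend for; by default recommendations are made for all users.

    --itemsFile

    File listing the items to recommend; by default all items are considered.

    --filterFile

    Text file with one userid and itemid per line; the listed item is never recommended to the listed user.

    --booleanData

    Whether the input is boolean data (no preference values).

    --maxPrefsPerUser

    Maximum number of preferences considered per user.

    --minPrefsPerUser

    Minimum number of preferences per user; users with fewer are ignored in the similarity computation.

    --maxSimilaritiesPerItem

    Maximum number of similar items computed for each item.

    --maxPrefsPerUserInItemSimilarity

    Maximum number of preferences per user considered when estimating item similarity.

    --similarityClassname

    One of SIMILARITY_PEARSON_CORRELATION, SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_EUCLIDEAN_DISTANCE

    --threshold

    Item pairs whose similarity is below this threshold are discarded.

    --outputPathForSimilarityMatrix

    Path for the generated item similarity matrix, a text file with one itemA, itemB, similarity record per line.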

        Example

    String[] args = {
        "--input", "p",
        "--output", "recommender",
        "--numRecommendations", "10",
        "--outputPathForSimilarityMatrix", "simMatrix",
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION"};
    org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args);


  • Original article: https://www.cnblogs.com/cl1024cl/p/6205080.html