  • prefixspan python

    Source: https://github.com/chuanconggao/PrefixSpan-py

    API Usage

    Alternatively, you can use the algorithms via API.

    from prefixspan import PrefixSpan
    
    db = [
        [0, 1, 2, 3, 4],
        [1, 1, 1, 3, 4],
        [2, 1, 2, 2, 0],
        [1, 1, 1, 2, 2],
    ]
    
    ps = PrefixSpan(db)

    For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

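    Setting length limits:

    ps = PrefixSpan(db)
    ps.minlen = 3
    ps.maxlen = 5

    # The outputs in the comments below match the upstream README, which uses the
    # default length limits; with minlen = 3, patterns shorter than 3 items
    # should be filtered out of the results.
    print(ps.frequent(2))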
    # [(2, [0]),
    #  (4, [1]),
    #  (3, [1, 2]),
    #  (2, [1, 2, 2]),
    #  (2, [1, 3]),
    #  (2, [1, 3, 4]),
    #  (2, [1, 4]),
    #  (2, [1, 1]),
    #  (2, [1, 1, 1]),
    #  (3, [2]),
    #  (2, [2, 2]),
    #  (2, [3]),
    #  (2, [3, 4]),
    #  (2, [4])]
    
    print(ps.topk(5))
    # [(4, [1]),
    #  (3, [2]),
    #  (3, [1, 2]),
    #  (2, [1, 3]),
    #  (2, [1, 3, 4])]
    
    
    print(ps.frequent(2, closed=True))
    
    print(ps.topk(5, closed=True))
    
    
    print(ps.frequent(2, generator=True))
    
    print(ps.topk(5, generator=True))

    Closed Patterns and Generator Patterns

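    A frequent sequential pattern is a pattern that appears in at least "minsup" sequences of a sequence database, where minsup (the minimum support) is a parameter set by the user.

    A frequent closed sequential pattern is a frequent sequential pattern that is not included in another sequential pattern having exactly the same support.

    Algorithms such as PrefixSpan find frequent sequential patterns. Algorithms such as BIDE+ find frequent closed sequential patterns. BIDE+ is usually much faster than PrefixSpan because it uses pruning techniques to avoid generating all sequential patterns. In addition, the set of closed patterns is usually much smaller than the set of all sequential patterns, so BIDE+ is also more memory efficient.

    Another important point is that closed sequential patterns are a compact and lossless representation of all sequential patterns. The set of closed sequential patterns is usually much smaller, yet the entire set of sequential patterns can be recovered from it without any loss of information, which is very convenient.

    Here is a simple example.

    Consider four sequences: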

    a  b  c  d  e
    a  b  d
    b  e  a  
    b  c  d  e

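    Let minsup = 2.

    b c is a frequent sequential pattern because it appears in two sequences (its support is 2). b c is not a closed sequential pattern because it is contained in a larger sequential pattern, b c d, which has the same support.

    b c d also has a support of 2. It is not a closed sequential pattern either, because it is contained in a larger sequential pattern, b c d e, which has the same support. b c d e is a closed sequential pattern because it is not contained in any other sequential pattern with the same support.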
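
    The example above can be checked with a few lines of plain Python. This is only a brute-force sketch, not PrefixSpan or BIDE+; the helper names is_subsequence, support, and is_closed are illustrative and do not come from any library. It relies on the fact that if some super-pattern has the same support, then some one-item extension has the same support as well.

```python
from itertools import product

# The four example sequences from the text.
db = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "d"],
    ["b", "e", "a"],
    ["b", "c", "d", "e"],
]

def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, db):
    """Number of sequences in db that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)

def is_closed(pattern, db):
    """True if no one-item extension of pattern has the same support."""
    items = sorted({item for seq in db for item in seq})
    sup = support(pattern, db)
    for pos, item in product(range(len(pattern) + 1), items):
        extended = pattern[:pos] + [item] + pattern[pos:]
        if support(extended, db) == sup:
            return False
    return True

for pattern in (["b", "c"], ["b", "c", "d"], ["b", "c", "d", "e"]):
    print(pattern, support(pattern, db), is_closed(pattern, db))
```

    All three patterns have support 2, and only b c d e is reported as closed, which agrees with the explanation above.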

    The closed patterns are much more compact due to the smaller number.

    • A pattern is closed if there is no super-pattern with the same frequency.
    prefixspan-cli frequent 2 --closed test.dat
    
    0 : 2
    1 : 4
    1 2 : 3
    1 2 2 : 2
    1 3 4 : 2
    1 1 1 : 2
    

    The generator patterns are even more compact due to both the smaller number and the shorter lengths.

    • A pattern is a generator if there is no sub-pattern with the same frequency.

    • Due to the high compactness, generator patterns are useful as features for classification, etc.

    prefixspan-cli frequent 2 --generator test.dat
    
    0 : 2
    1 1 : 2
    2 : 3
    2 2 : 2
    3 : 2
    4 : 2
    

    There are patterns that are both closed and generator.

    prefixspan-cli frequent 2 --closed --generator test.dat
    
    0 : 2
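
    The two definitions can also be applied as a post-filter to the list of frequent patterns shown earlier for ps.frequent(2). The sketch below is an illustration only and is not how the library computes these sets; the helper names are hypothetical, and treating the empty pattern as a sub-pattern with frequency 4 (the number of sequences) is an assumption made here so that the result lines up with the listed CLI output.

```python
# Frequent patterns (frequency, pattern) as listed for ps.frequent(2) above.
frequent = [
    (2, [0]), (4, [1]), (3, [1, 2]), (2, [1, 2, 2]), (2, [1, 3]),
    (2, [1, 3, 4]), (2, [1, 4]), (2, [1, 1]), (2, [1, 1, 1]), (3, [2]),
    (2, [2, 2]), (2, [3]), (2, [3, 4]), (2, [4]),
]
# Assumption: the empty pattern is contained in all 4 sequences.
with_empty = frequent + [(4, [])]

def is_subsequence(small, big):
    """True if the items of small occur in big in order (gaps allowed)."""
    it = iter(big)
    return all(item in it for item in small)

def is_closed(freq, pattern):
    """No proper super-pattern with the same frequency."""
    return not any(
        f == freq and q != pattern and is_subsequence(pattern, q)
        for f, q in frequent
    )

def is_generator(freq, pattern):
    """No proper sub-pattern (including the empty one) with the same frequency."""
    return not any(
        f == freq and q != pattern and is_subsequence(q, pattern)
        for f, q in with_empty
    )

closed = [p for f, p in frequent if is_closed(f, p)]
generators = [p for f, p in frequent if is_generator(f, p)]
both = [p for p in closed if p in generators]

print(closed)
print(generators)
print(both)
```

    Filtering against the frequent list alone is enough here: a super-pattern with the same frequency is itself frequent, and every non-empty sub-pattern of a frequent pattern is frequent too.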

    Note: there are many algorithms for pattern mining.

    SPMF offers implementations of the following data mining algorithms.

    Sequential Pattern Mining

    These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

    Sequential Rule Mining

    These algorithms discover sequential rules in a set of sequences.

    Sequence Prediction

    These algorithms predict the next symbol(s) of a sequence based on a set of training sequences

    Itemset Mining

    These algorithms discover interesting itemsets (sets of values) that appear in a transaction database (database records containing symbolic data). For a good overview of itemset mining, please read this survey paper.

    • algorithms for discovering frequent itemsets in a transaction database.
    • algorithms for discovering frequent closed itemsets in a transaction database.
    • algorithms for recovering all frequent itemsets from frequent closed itemsets:
      • the LevelWise algorithm (Pasquier et al., 1999)
      • the DFI-Growth algorithm (___ et al., 2018)
    • algorithms for discovering frequent maximal itemsets in a transaction database.
      • the FPMax algorithm (Grahne and Zhu, 2003)
      • the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
    • algorithms for mining frequent itemsets with multiple minimum supports
    • algorithms for mining generator itemsets in a transaction database
      • the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
      • the Pascal algorithm for mining frequent itemsets and identifying at the same time which ones are generators (Bastide et al., 2002)
      • the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
    • algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
      • the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Rountree, 2005)
      • the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
      • the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
      • the RP-Growth algorithm for mining rare itemsets (Tsang et al., 2011)
    • algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
      • the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
      • the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it (Fournier-Viger, 2013)
    • algorithms to discover frequent itemsets in a stream
      • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
      • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
      • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
    • the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
    • the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
    • algorithms to discover fuzzy frequent itemsets in a quantitative transaction database
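
    To make the terms frequent, closed, and maximal itemset concrete, here is a brute-force sketch on a toy transaction database. It is not one of the SPMF algorithms listed above (those are far more efficient); the data, the minsup value, and the function names are made up for illustration. The last function shows the idea behind recovering frequent itemsets from closed ones: the support of a frequent itemset equals the largest support among its closed supersets.

```python
from itertools import combinations

# Toy transaction database (illustrative data).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c"},
    {"a"},
]
minsup = 2

def support(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions)

# Enumerate every candidate itemset and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= minsup:
            frequent[frozenset(combo)] = s

# Closed: no proper superset with the same support.
closed = {x: s for x, s in frequent.items()
          if not any(x < y and s == frequent[y] for y in frequent)}
# Maximal: no proper superset that is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

def recovered_support(itemset):
    """Support of a frequent itemset, derived from the closed itemsets only."""
    return max(s for y, s in closed.items() if itemset <= y)

print({"".join(sorted(x)): s for x, s in frequent.items()})
print(sorted("".join(sorted(x)) for x in closed))
print(sorted("".join(sorted(x)) for x in maximal))
print(recovered_support(frozenset("b")))
```

    Here {b} is frequent but not closed, because {a, b} has the same support, and its support is still recoverable from the closed itemset {a, b}.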

    Periodic Pattern Mining

    These algorithms discover patterns that periodically appear in a sequence of complex events (also called a transaction database)

    • the PFPM algorithm (Fournier-Viger et al., 2016a) for mining frequent periodic patterns in a sequence of transactions (a transaction database)
    • the PHM algorithm (Fournier-Viger et al., 2016b) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information

    Episode Mining

    These algorithms discover episodes that appear in a single sequence of complex events.

    • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
    • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information

    High-Utility Pattern Mining

    These algorithms discover patterns having a high utility (importance) in different kinds of data. For a good overview of high utility itemset mining, you may read this survey paper, and the high utility-pattern mining book.

    • algorithms for mining high-utility itemsets in a transaction database having profit information
    • algorithm for efficiently mining high-utility itemsets with length constraints in a transaction database
    • algorithm for mining correlated high-utility itemsets in a transaction database
    • algorithm for mining high-utility itemsets in a transaction database containing negative unit profit values
    • algorithm for mining frequent high-utility itemsets in a transaction database
    • algorithm for mining on-shelf high-utility itemsets in a transaction database containing information about time periods of items
    • algorithm for incremental high-utility itemset mining in a transaction database
    • algorithm for mining concise representations of high-utility  itemsets in a transaction database
    • algorithm for mining the skyline high-utility itemsets in a transaction database
    • algorithm for mining the top-k high-utility itemsets in a transaction database
    • algorithms for mining the top-k high utility itemsets from a data stream with a window
    • algorithm for mining frequent skyline utility patterns in a transaction database
    • algorithm for mining quantitative high utility itemsets in a transaction database
    • algorithm for mining high-utility sequential rules in a sequence database 
    • algorithm for mining high-utility sequential patterns in a sequence database 
      • the USPAN algorithm (Yin et al. 2012)
    • algorithm for mining high-utility probability sequential patterns in a sequence database 
    • algorithm for mining high-utility itemsets in a transaction database using evolutionary algorithms
    • algorithm for mining high average-utility itemsets in a transaction database
      • the HAUI-Miner algorithm for mining high average-utility itemsets (Lin et al, 2016)
      • the EHAUPM algorithm for mining high average-utility itemsets (Lin et al, 2017)
      • the HAUI-MMAU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2016)
      • the MEMU algorithm for mining high average-utility itemsets with multiple thresholds (Lin et al, 2018)
    • algorithms for mining high utility episodes in a sequence of complex events (a transaction database)
      • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information
      • the UP-SPAN algorithm (Wu et al., 2013) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information
    • algorithms for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information
    • algorithms for discovering irregular high utility itemsets (non periodic patterns) in a transaction database with utility information
      • the PHM_irregular algorithm, which is a simple variation of the PHM algorithm
    • algorithm for discovering local high utility itemsets in a database with utility information and timestamps
    • algorithm for discovering peak high utility itemsets in a database with utility information and timestamps

    Association Rule Mining

    These algorithms discover interesting associations between symbols (values) in a transaction database (database records with binary attributes).

    • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
    • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
    • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
    • an algorithm for mining perfectly sporadic association rules (Koh & Rountree, 2005)
    • an algorithm for mining closed association rules (Szathmary et al. 2006).
    • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
    • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et al. 2006)
    • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
    • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b)
    • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules, which is described in Fournier-Viger, 2012b)
    • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d)
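
    As a reminder of what the measures behind these rule miners mean, the sketch below computes the support, confidence, and lift of a single rule X → Y by counting transactions. It uses the common textbook definitions, not code from SPMF; the transactions and the function names are illustrative.

```python
# Toy transaction database (illustrative data).
transactions = [
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b"},
]

def support(itemset):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_measures(antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    sup = support(antecedent | consequent)
    conf = sup / support(antecedent)   # estimate of P(Y | X)
    lift = conf / support(consequent)  # P(Y | X) / P(Y)
    return sup, conf, lift

sup, conf, lift = rule_measures({"a"}, {"b"})
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

    A lift below 1, as in this toy example, means the antecedent and the consequent occur together less often than would be expected if they were independent.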

    Stream pattern mining

    These algorithms discover various kinds of patterns in a stream (an infinite sequence of database records, i.e. transactions).

    • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
    • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
    • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
    • algorithms for mining the top-k high utility itemsets from a data stream with a window

    Clustering

    These algorithms automatically find clusters in different kinds of data

    • the original K-Means algorithm (MacQueen, 1967)
    • the Bisecting K-Means algorithm (Steinbach et al, 2000)
    • algorithms for density-based clustering
      • the DBScan algorithm (Ester et al., 1996)
      • the Optics algorithm to extract a cluster ordering of points, which can then be used to generate DBScan-style clusters and more (Ankerst et al, 1999)
    • hierarchical clustering algorithm
    • a tool called Cluster Viewer for visualizing clusters
    • a tool called Instance Viewer for visualizing the input of clustering algorithms

    Time series mining

    These algorithms perform various tasks to analyze time series data

      • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series (SAX, 2007). Note that converting a set of time series with SAX yields a sequence database, which makes it possible to apply traditional algorithms for sequential rule mining and sequential pattern mining to time series.
      • algorithms for calculating the prior moving average of a time series (to remove noise)
      • algorithms for calculating the cumulative moving average of a time series (to remove noise)
      • algorithms for calculating the central moving average of a time series (to remove noise)
      • an algorithm for calculating the median smoothing of a time series (to remove noise)
      • an algorithm for calculating the exponential smoothing of a time series (to remove noise)
      • an algorithm for calculating the min max normalization of a time series
      • an algorithm for calculating the autocorrelation function of a time series
      • an algorithm for calculating the standardization of a time series
      • an algorithm for calculating the first and second order differencing of a time series
      • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
      • an algorithm for calculating the linear regression of a time series (using the least squares method)
      • an algorithm for splitting a time series into segments of a given length
      • an algorithm for splitting a time series into a given number of segments
      • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
      • a tool called Time Series Viewer for visualizing time series
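
    Three of the transformations in the list above are simple enough to sketch directly. The functions below follow the common textbook definitions of min-max normalization, first-order differencing, and piecewise aggregate approximation; they are not SPMF's implementations, the names are illustrative, and the PAA version assumes the series length is a multiple of the number of segments.

```python
def min_max_normalize(series):
    """Rescale values linearly to the range [0, 1]; assumes max > min."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def first_order_difference(series):
    """Differences between consecutive points; the result is one point shorter."""
    return [b - a for a, b in zip(series, series[1:])]

def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    width = len(series) // segments
    return [sum(series[i:i + width]) / width
            for i in range(0, len(series), width)]

series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(min_max_normalize(series))
print(first_order_difference(series))
print(paa(series, 3))
```

    With three segments, the six points are reduced to the three segment means 3.0, 7.0, and 11.0, which is the kind of data reduction PAA is used for.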
     
  • Original article: https://www.cnblogs.com/bonelee/p/10696521.html