  • CS100.1x-lab3_text_analysis_and_entity_resolution_student

    This assignment is called Text Analysis and Entity Resolution, and it is considerably harder than the previous ones. The corresponding ipynb file can be found on my github.

    Entity resolution is an important and difficult problem in data cleaning and data integration. In this assignment we apply Apache Spark and text-analysis techniques to entity resolution, which means identifying records from different data sources that refer to the same real-world entity; this step is essential whenever data sets are merged.

    The data for this assignment comes from the metric-learning project. The main files are:

    • Google.csv, the Google Products dataset
    • Amazon.csv, the Amazon dataset
    • Google_small.csv, 200 records sampled from the Google data
    • Amazon_small.csv, 200 records sampled from the Amazon data
    • Amazon_Google_perfectMapping.csv, the "gold standard" mapping
    • stopwords.txt, a list of common English words

    Besides the full data sets, the assignment provides small samples used in Part 1 and a "gold standard" table that maps every matching Google/Amazon entity pair; this table is used to evaluate the performance of the algorithm.

    Part 0 Preliminaries

    First we read the Google and Amazon data and turn them into RDDs. The two data sets have the following line formats.

    The file format of an Amazon line is:                                 
    "id","title","description","manufacturer","price"                                
    The file format of a Google line is:                             
    "id","name","description","manufacturer","price"               
    

    In this step we extract the ID column. In the Google data set the ID is a URL, while in the Amazon data set it is an alphanumeric string. We convert each record into a pair RDD where the ID is the key and the concatenation of name/title, description, and manufacturer is the value.

    import re
    DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'
    
    def removeQuotes(s):
        """ Remove quotation marks from an input string
        Args:
            s (str): input string that might have the quote "" characters
        Returns:
            str: a string without the quote characters
        """
        return ''.join(i for i in s if i!='"')
    
    
    def parseDatafileLine(datafileLine):
        """ Parse a line of the data file using the specified regular expression pattern
        Args:
            datafileLine (str): input string that is a line from the data file
        Returns:
            str: a string parsed using the given regular expression and without the quote characters
        """
        match = re.search(DATAFILE_PATTERN, datafileLine)
        if match is None:
            print 'Invalid datafile line: %s' % datafileLine
            return (datafileLine, -1)
        elif match.group(1) == '"id"':
            print 'Header datafile line: %s' % datafileLine
            return (datafileLine, 0)
        else:
            product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
            return ((removeQuotes(match.group(1)), product), 1)
    
    import sys
    import os
    from test_helper import Test
    
    baseDir = os.path.join('data')
    inputPath = os.path.join('cs100', 'lab3')
    
    GOOGLE_PATH = 'Google.csv'
    GOOGLE_SMALL_PATH = 'Google_small.csv'
    AMAZON_PATH = 'Amazon.csv'
    AMAZON_SMALL_PATH = 'Amazon_small.csv'
    GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
    STOPWORDS_PATH = 'stopwords.txt'
    
    def parseData(filename):
        """ Parse a data file
        Args:
            filename (str): input file name of the data file
        Returns:
            RDD: a RDD of parsed lines
        """
        return (sc
                .textFile(filename, 4, 0)
                .map(parseDatafileLine)
                .cache())
    
    def loadData(path):
        """ Load a data file
        Args:
            path (str): input file name of the data file
        Returns:
            RDD: a RDD of parsed valid lines
        """
        filename = os.path.join(baseDir, inputPath, path)
        raw = parseData(filename).cache()
        failed = (raw
                  .filter(lambda s: s[1] == -1)
                  .map(lambda s: s[0]))
        for line in failed.take(10):
            print '%s - Invalid datafile line: %s' % (path, line)
        valid = (raw
                 .filter(lambda s: s[1] == 1)
                 .map(lambda s: s[0])
                 .cache())
        print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
                                                                                            raw.count(),
                                                                                            valid.count(),
                                                                                            failed.count())
        assert failed.count() == 0
        assert raw.count() == (valid.count() + 1)
        return valid
    
    googleSmall = loadData(GOOGLE_SMALL_PATH)
    google = loadData(GOOGLE_PATH)
    amazonSmall = loadData(AMAZON_SMALL_PATH)
    amazon = loadData(AMAZON_PATH)
    

    Running this code produces the following output.

    Google_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
    Google.csv - Read 3227 lines, successfully parsed 3226 lines, failed to parse 0 lines
    Amazon_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
    Amazon.csv - Read 1364 lines, successfully parsed 1363 lines, failed to parse 0 lines
    

    Let's run the following code to see what the data looks like.

    for line in googleSmall.take(3):
        print 'google: %s: %s\n' % (line[0], line[1])
    
    for line in amazonSmall.take(3):
        print 'amazon: %s: %s\n' % (line[0], line[1])
    
    google: http://www.google.com/base/feeds/snippets/11448761432933644608: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!" 
    
    google: http://www.google.com/base/feeds/snippets/8175198959985911471: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..." 
    
    google: http://www.google.com/base/feeds/snippets/18445827127704822533: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"
    
    amazon: b000jz4hqo: clickart 950 000 - premier image pack (dvd-rom)  "broderbund"
    
    amazon: b0006zf55o: ca international - arcserve lap/desktop oem 30pk "oem arcserve backup v11.1 win 30u for laptops and desktops" "computer associates"
    
    amazon: b00004tkvy: noah's ark activity center (jewel case ages 3-8)  "victory multimedia"
    
    

    Part 1 ER as Text Similarity - Bags of Words

    When resolving entities, we often treat every record as a plain string and then compute the similarity between strings. Here we use the bag-of-words approach, a simple but effective technique in text analysis. The main idea is to treat a document as an unordered collection of words, or tokens; a token is the smallest unit we obtain after splitting the document, and it may be a word, a number, an abbreviation, and so on.

    To compare the similarity of two documents, we look at how many tokens they have in common. Likewise, when searching documents by keyword, we can simply check whether the tokenized document contains that key. One advantage of this approach is that it is fairly robust to word order and punctuation. A tiny illustration of the idea is sketched below.
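
    As a minimal illustration (not part of the assignment; the two strings and the naive whitespace tokenizer are made up for this sketch), comparing two documents as bags of words just means counting the tokens they share:

    # Toy bag-of-words comparison; illustrative only, not the lab's tokenize().
    doc1 = 'Adobe Photoshop CS3 for Windows'
    doc2 = 'photoshop cs3 upgrade windows dvd'

    bag1 = set(doc1.lower().split())
    bag2 = set(doc2.lower().split())

    shared = bag1 & bag2            # tokens common to both documents
    print 'shared tokens: %s' % sorted(shared)
    # shared tokens: ['cs3', 'photoshop', 'windows']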

    Tokenize a String

    The graded part of the assignment starts here: wherever a comment contains TODO, we have to implement that piece ourselves. The function to implement converts a string into a list of tokens; note that all tokens must be converted to lowercase.

    # TODO: Replace <FILL IN> with appropriate code
    quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
    split_regex = r'\W+'
    
    def simpleTokenize(string):
        """ A simple implementation of input string tokenization
        Args:
            string (str): input string
        Returns:
            list: a list of tokens
        """
        return [x for x in filter(lambda x:len(x) > 0, re.split(split_regex,string.lower()))]
    
    print simpleTokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]
    

    This one is a bit tricky, so let me explain. filter(function, sequence) applies function(item) to each item in sequence and returns the items for which the result is True, as a list/string/tuple depending on the type of sequence. Here the function is lambda x: len(x) > 0 and the sequence is re.split(split_regex, string.lower()). Note that re.split behaves differently from str.split:

    >>> 'hello, world'.split(',')
    ['hello', ' world']
    >>> re.split(r'\W+', 'hello, world')
    ['hello', 'world']
    

    Removing stopwords

    In English, stopwords are words that contribute little to the meaning of a sentence, such as "the", "a", "is", "to"; in the bag-of-words approach they are pure noise. Because these words are so common, two unrelated sentences may be judged similar simply because they share many stopwords. The lab environment provides a stopwords file; we read it, convert it to a set, and then use the in operator to filter them out.

    # TODO: Replace <FILL IN> with appropriate code
    stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
    stopwords = set(sc.textFile(stopfile).collect())
    print 'These are the stopwords: %s' % stopwords
    
    def tokenize(string):
        """ An implementation of input string tokenization that excludes stopwords
        Args:
            string (str): input string
        Returns:
            list: a list of tokens without stopwords
        """
        return [x for x in simpleTokenize(string) if x not in stopwords]
    
    print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]
    

    Tokenizing the small datasets

    This step combines the previous pieces. To count all the tokens in a data set, we simply add up the number of tokens in each record.

    # TODO: Replace <FILL IN> with appropriate code
    amazonRecToToken = amazonSmall.map(lambda (a,b): (a,tokenize(b)))
    googleRecToToken = googleSmall.map(lambda (a,b): (a,tokenize(b)))
    
    def countTokens(vendorRDD):
        """ Count and return the number of tokens
        Args:
            vendorRDD (RDD of (recordId, tokenizedValue)): Pair tuple of record ID to tokenized output
        Returns:
            count: count of all tokens
        """
        return vendorRDD.map(lambda x: len(x[1])).reduce(lambda x,y: x+y)
    
    totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
    print 'There are %s tokens in the combined datasets' % totalTokens
    

    Amazon record with the most tokens

    Sorting again, except this time we sort by the length of the value, from largest to smallest.

    # TODO: Replace <FILL IN> with appropriate code
    def findBiggestRecord(vendorRDD):
        """ Find and return the record with the largest number of tokens
        Args:
            vendorRDD (RDD of (recordId, tokens)): input Pair Tuple of record ID and tokens
        Returns:
            list: a list of 1 Pair Tuple of record ID and tokens
        """
        return vendorRDD.takeOrdered(1, key=lambda x: -1*len(x[1]))
    
    biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
    print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0], len(biggestRecordAmazon[0][1]))
    

    Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF

    In practice, plain bag of words does not work very well, because different words carry different amounts of meaning within a document; mathematically, they should have different weights. Measuring a word's weight by raw frequency alone is not sound, which is why the TF-IDF scheme exists. For a very accessible introduction (in Chinese), see Ruan Yifeng's article "The application of TF-IDF and cosine similarity". A small illustration of the weighting idea follows.
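
    As a rough sketch of the weighting idea (the toy corpus and helper names below are invented for illustration; they are not the lab's tf/idfs functions), TF rewards tokens that are frequent within one record, while IDF penalizes tokens that occur in many records:

    # Toy corpus of three tokenized "documents" (made up for this example).
    toy_corpus = [['adobe', 'photoshop', 'windows'],
                  ['adobe', 'acrobat', 'windows'],
                  ['norton', 'antivirus', 'windows']]

    def toy_tf(tokens):
        # term frequency: token count divided by the total number of tokens
        return {t: float(tokens.count(t)) / len(tokens) for t in set(tokens)}

    def toy_idf(corpus):
        # inverse document frequency: (number of documents) / (documents containing the token),
        # the same plain ratio (no logarithm) that this lab uses
        N = float(len(corpus))
        all_tokens = set(t for doc in corpus for t in doc)
        return {t: N / sum(1 for doc in corpus if t in doc) for t in all_tokens}

    idf = toy_idf(toy_corpus)
    tf0 = toy_tf(toy_corpus[0])
    # 'windows' appears in every document, so its weight ends up lower than 'photoshop'
    print {t: tf0[t] * idf[t] for t in tf0}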

    Implement a TF function

    Here we implement the TF function. The input is a list of strings. We count how many times each word occurs and the total number of words, then divide each word's count by the total; that is the TF part of TF-IDF.

    # TODO: Replace <FILL IN> with appropriate code
    def tf(tokens):
        """ Compute TF
        Args:
            tokens (list of str): input list of tokens from tokenize
        Returns:
            dictionary: a dictionary of tokens to its TF values
        """
        dic = {}
        count = 0
        for word in tokens:
            if(word in dic):
                dic[word] += 1
            else:
                dic[word] = 1
            count += 1
        for key in dic:
            dic[key] = float(dic[key])/count
        return dic
    
    print tf(tokenize(quickbrownfox)) # Should give { 'quick': 0.1666 ... }
    

    Create a corpus

    Here we combine the two RDDs into one corpus; union() is all we need.

    # TODO: Replace <FILL IN> with appropriate code
    corpusRDD = amazonRecToToken.union(googleRecToToken)
    

    Implement an IDFs function

    The IDF computation needs the whole corpus, which is why the previous step combined the two data sets. To implement it correctly you need to understand IDF: here IDF(t) is defined as N / n(t), where N is the total number of documents and n(t) is the number of documents containing the token t (this lab uses the plain ratio, without the usual logarithm).

    # TODO: Replace <FILL IN> with appropriate code
    def idfs(corpus):
        """ Compute IDF
        Args:
            corpus (RDD): input corpus
        Returns:
            RDD: a RDD of (token, IDF value)
        """
        N = corpus.count()
        uniqueTokens = corpus.map(lambda x:(x[0],list(set(x[1]))))
        tokenCountPairTuple = uniqueTokens.flatMap(lambda x:x[1]).map(lambda x: (x,1))
        tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a,b : a+b)
        return (tokenSumPairTuple.map(lambda x:(x[0],N/float(x[1]))))
    
    idfsSmall = idfs(amazonRecToToken.union(googleRecToToken))
    uniqueTokenCount = idfsSmall.count()
    
    print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount
    

    Tokens with the smallest IDF

    smallIDFTokens = idfsSmall.takeOrdered(11, lambda s: s[1])
    print smallIDFTokens
    

    IDF Histogram

    import matplotlib.pyplot as plt
    
    small_idf_values = idfsSmall.map(lambda s: s[1]).collect()
    fig = plt.figure(figsize=(8,3))
    plt.hist(small_idf_values, 50, log=True)
    pass
    

    Implement a TF-IDF function

    This step ties the previous pieces together.

    # TODO: Replace <FILL IN> with appropriate code
    def tfidf(tokens, idfs):
        """ Compute TF-IDF
        Args:
            tokens (list of str): input list of tokens from tokenize
            idfs (dictionary): record to IDF value
        Returns:
            dictionary: a dictionary of records to TF-IDF values
        """
        tfs = tf(tokens)
        tfIdfDict = {t:tfs[t]*idfs[t] for t in tfs}
        return tfIdfDict
    
    recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
    idfsSmallWeights = idfsSmall.collectAsMap()
    rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)
    
    print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights
    

    Part 3 ER as Text Similarity - Cosine Similarity

    For background on cosine similarity, again see Ruan Yifeng's article "The application of TF-IDF and cosine similarity".

    Implement the components of a cosineSimilarity function

    Implementing cosine similarity takes three steps: compute the dot product of two vectors, compute the norm (length) of a vector, and then combine the two, i.e. cossim(a, b) = dotprod(a, b) / (norm(a) * norm(b)).

    # TODO: Replace <FILL IN> with appropriate code
    import math
    
    def dotprod(a, b):
        """ Compute dot product
        Args:
            a (dictionary): first dictionary of record to value
            b (dictionary): second dictionary of record to value
        Returns:
            dotProd: result of the dot product with the two input dictionaries
        """  
        return sum(a[k] * b[k] for k in a if k in b)
    
    def norm(a):
        """ Compute square root of the dot product
        Args:
            a (dictionary): a dictionary of record to value
        Returns:
            norm: the square root of the dot product of a with itself
        """
        count=0
        for key in a:
            count += a[key]*a[key]
        return math.sqrt(count)
    
    def cossim(a, b):
        """ Compute cosine similarity
        Args:
            a (dictionary): first dictionary of record to value
            b (dictionary): second dictionary of record to value
        Returns:
            cossim: dot product of two dictionaries divided by the norm of the first dictionary and
                    then by the norm of the second dictionary
        """
        return dotprod(a,b)/(norm(a)*norm(b))
    
    testVec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
    testVec2 = {'foo': 1, 'bar': 0, 'baz': 20 }
    dp = dotprod(testVec1, testVec2)
    nm = norm(testVec1)
    print dp, nm
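
    As a quick sanity check on the test vectors above, the dot product is 2*1 + 3*0 + 5*20 = 102 and norm(testVec1) is sqrt(2**2 + 3**2 + 5**2) = sqrt(38), roughly 6.16, which is what the print statement should show.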
    

    Implement a cosineSimilarity function

    # TODO: Replace <FILL IN> with appropriate code
    def cosineSimilarity(string1, string2, idfsDictionary):
        """ Compute cosine similarity between two strings
        Args:
            string1 (str): first string
            string2 (str): second string
            idfsDictionary (dictionary): a dictionary of IDF values
        Returns:
            cossim: cosine similarity value
        """
        w1 = tfidf(tokenize(string1),idfsDictionary) 
        w2 = tfidf(tokenize(string2),idfsDictionary) 
        return cossim(w1, w2)
    
    cossimAdobe = cosineSimilarity('Adobe Photoshop',
                                   'Adobe Illustrator',
                                   idfsSmallWeights)
    
    print cossimAdobe
    

    Perform Entity Resolution

    Here we compute the similarity between every record in the Google data and every record in the Amazon data, storing the result with (Google URL, Amazon ID) as the key and the cosine similarity as the value. We will do this in two ways; the first does not use a broadcast variable.

    This takes three steps: 1. build all the pairs, in the format [ ((Google URL1, Google String1), (Amazon ID1, Amazon String1)), ((Google URL1, Google String1), (Amazon ID2, Amazon String2)), ((Google URL2, Google String2), (Amazon ID1, Amazon String1)), ... ]; 2. write a function that computes the cosine similarity for each pair; 3. apply that function to the RDD.

    # TODO: Replace <FILL IN> with appropriate code
    crossSmall = (googleSmall
                  .cartesian(amazonSmall)
                  .cache())
    
    def computeSimilarity(record):
        """ Compute similarity on a combination record
        Args:
            record: a pair, (google record, amazon record)
        Returns:
            pair: a pair, (google URL, amazon ID, cosine similarity value)
        """
        googleRec = record[0]
        amazonRec = record[1]
        googleURL = googleRec[0]
        amazonID = amazonRec[0]
        googleValue = googleRec[1]
        amazonValue = amazonRec[1]
        cs = cosineSimilarity(googleValue,amazonValue,idfsSmallWeights)
        return (googleURL, amazonID, cs)
    
    similarities = (crossSmall
                    .map(lambda line:computeSimilarity(line))
                    .cache())
    
    
    def similar(amazonID, googleURL):
        """ Return similarity value
        Args:
            amazonID: amazon ID
            googleURL: google URL
        Returns:
            similar: cosine similarity value
        """
        return (similarities
                .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
                .collect()[0][2])
    
    similarityAmazonGoogle = similar('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
    print 'Requested similarity is %s.' % similarityAmazonGoogle
    

    Perform Entity Resolution with Broadcast Variables

    The previous approach is fine for the small data sets, but once the data gets large, Spark has to ship the precomputed IDF weights to the workers as part of each task. If the results are not cached, Spark may recompute the similarities, which means shipping the IDF weights again and again.

    We solve this with a broadcast variable, which is sent to each worker only once. The code is almost identical to the previous step.

    # TODO: Replace <FILL IN> with appropriate code
    def computeSimilarityBroadcast(record):
        """ Compute similarity on a combination record, using Broadcast variable
        Args:
            record: a pair, (google record, amazon record)
        Returns:
            pair: a pair, (google URL, amazon ID, cosine similarity value)
        """
        googleRec = record[0]
        amazonRec = record[1]
        googleURL = googleRec[0]
        amazonID = amazonRec[0]
        googleValue = googleRec[1]
        amazonValue = amazonRec[1]
        cs = cosineSimilarity(googleValue,amazonValue,idfsSmallBroadcast.value)
        return (googleURL, amazonID, cs)
    
    idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)
    similaritiesBroadcast = (crossSmall
                             .map(lambda record:computeSimilarityBroadcast(record))
                             .cache())
    
    def similarBroadcast(amazonID, googleURL):
        """ Return similarity value, computed using Broadcast variable
        Args:
            amazonID: amazon ID
            googleURL: google URL
        Returns:
            similar: cosine similarity value
        """
        return (similaritiesBroadcast
                .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
                .collect()[0][2])
    
    similarityAmazonGoogleBroadcast = similarBroadcast('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
    print 'Requested similarity is %s.' % similarityAmazonGoogleBroadcast
    

    Perform a Gold Standard evaluation

    Next we use the gold standard data to answer some questions; first we read and parse it.

    GOLDFILE_PATTERN = '^(.+),(.+)'
    
    # Parse each line of a data file using the specified regular expression pattern
    def parse_goldfile_line(goldfile_line):
        """ Parse a line from the 'golden standard' data file
        Args:
            goldfile_line: a line of data
        Returns:
            pair: ((key, 'gold', 1 if successful or else 0))
        """
        match = re.search(GOLDFILE_PATTERN, goldfile_line)
        if match is None:
            print 'Invalid goldfile line: %s' % goldfile_line
            return (goldfile_line, -1)
        elif match.group(1) == '"idAmazon"':
            print 'Header datafile line: %s' % goldfile_line
            return (goldfile_line, 0)
        else:
            key = '%s %s' % (removeQuotes(match.group(1)), removeQuotes(match.group(2)))
            return ((key, 'gold'), 1)
    
    goldfile = os.path.join(baseDir, inputPath, GOLD_STANDARD_PATH)
    gsRaw = (sc
             .textFile(goldfile)
             .map(parse_goldfile_line)
             .cache())
    
    gsFailed = (gsRaw
                .filter(lambda s: s[1] == -1)
                .map(lambda s: s[0]))
    for line in gsFailed.take(10):
        print 'Invalid goldfile line: %s' % line
    
    goldStandard = (gsRaw
                    .filter(lambda s: s[1] == 1)
                    .map(lambda s: s[0])
                    .cache())
    
    print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (gsRaw.count(),
                                                                                     goldStandard.count(),
                                                                                     gsFailed.count())
    assert (gsFailed.count() == 0)
    assert (gsRaw.count() == (goldStandard.count() + 1))
    

    Next we join() the similarity RDD computed earlier with the gold standard RDD, count how many pairs match, and compute the average similarity of the matches. Then we compute the average similarity of the pairs that are not in the gold standard.

    # TODO: Replace <FILL IN> with appropriate code
    sims = similaritiesBroadcast.map(lambda line:('%s %s' %(line[1],line[0]),line[2]))
    
    trueDupsRDD = (sims.join(goldStandard)) 
    trueDupsCount = trueDupsRDD.count() 
    avgSimDups = trueDupsRDD.map(lambda (k,v):v[0]).mean()
    
    nonDupsRDD = (sims 
    .leftOuterJoin(goldStandard).map(lambda (k,v):v[0] if v[1] is None else -1)).filter(lambda v:v!=-1) 
    avgSimNon = nonDupsRDD.mean()
    
    print 'There are %s true duplicates.' % trueDupsCount
    print 'The average similarity of true duplicates is %s.' % avgSimDups
    print 'And for non duplicates, it is %s.' % avgSimNon
    

    Part 4 Scalable ER

    The approach above is not a fully distribution-friendly implementation, so its running time is high. In this part we introduce an algorithm that is better suited to distributed execution.

    When computing tokens and weights, most of the work is spent comparing tokens between every pair of records. Here we use a data structure called an inverted index to avoid this quadratic number of token comparisons. It maps the data set from documents to tokens: the key is a token and the value is the set of documents that contain that token. A tiny sketch of the idea is shown below.
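
    As a minimal illustration (the toy records below are made up, and this plain-Python sketch ignores the TF-IDF weights), an inverted index lets us enumerate, for each token, exactly the record pairs that share it, instead of comparing every record with every other record:

    # Toy tokenized records for each source (illustrative only).
    amazon_toy = {'b001': ['adobe', 'photoshop'], 'b002': ['norton', 'antivirus']}
    google_toy = {'g001': ['adobe', 'acrobat'], 'g002': ['photoshop', 'elements']}

    def build_index(records):
        # token -> list of record IDs containing that token
        index = {}
        for rec_id, tokens in records.items():
            for t in set(tokens):
                index.setdefault(t, []).append(rec_id)
        return index

    amazon_index = build_index(amazon_toy)
    google_index = build_index(google_toy)

    # Only record pairs that share at least one token are ever compared.
    candidate_pairs = set()
    for token in set(amazon_index) & set(google_index):
        for a_id in amazon_index[token]:
            for g_id in google_index[token]:
                candidate_pairs.add((a_id, g_id))
    print candidate_pairs   # should give set([('b001', 'g001'), ('b001', 'g002')]) in some order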

    Tokenize the full dataset

    # TODO: Replace <FILL IN> with appropriate code
    amazonFullRecToToken = amazon.map(lambda (k,v):(k,tokenize(v)))
    googleFullRecToToken = google.map(lambda (k,v):(k,tokenize(v)))
    print 'Amazon full dataset is %s products, Google full dataset is %s products' % (amazonFullRecToToken.count(),
                                                                                        googleFullRecToToken.count())
    

    Compute IDFs and TF-IDFs for the full datasets

    This reuses the earlier code. We combine the new RDDs into one corpus, compute the IDFs, and store them in a broadcast variable.

    # TODO: Replace <FILL IN> with appropriate code
    fullCorpusRDD = amazonFullRecToToken.union(googleFullRecToToken) 
    idfsFull = idfs(fullCorpusRDD) 
    idfsFullCount = idfsFull.count() 
    print 'There are %s unique tokens in the full datasets.' % idfsFullCount
    
    # Recompute IDFs for full dataset
    idfsFullWeights = idfsFull.collectAsMap() 
    idfsFullBroadcast = sc.broadcast(idfsFullWeights)
    
    # Pre-compute TF-IDF weights.  Build mappings from record ID weight vector.
    amazonWeightsRDD = amazonFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
    googleWeightsRDD = googleFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
    print 'There are %s Amazon weights and %s Google weights.' % (amazonWeightsRDD.count(),
                                                                  googleWeightsRDD.count())
    

    Compute Norms for the weights from the full datasets

    # TODO: Replace <FILL IN> with appropriate code
    amazonNorms = amazonWeightsRDD.map(lambda (k,d):(k,norm(d))) 
    amazonNormsBroadcast = sc.broadcast(amazonNorms.collectAsMap()) 
    googleNorms = googleWeightsRDD.map(lambda (k,d):(k,norm(d))) 
    googleNormsBroadcast = sc.broadcast(googleNorms.collectAsMap())
    

    Create inverted indices from the full datasets

    There are two steps here: implement an invert function that takes an (ID, token vector) pair and returns a list of (token, ID) pairs; then apply it to the weight RDDs above to get the mapping from each token to the documents that contain it.

    # TODO: Replace <FILL IN> with appropriate code
    def invert(record):
        """ Invert (ID, tokens) to a list of (token, ID)
        Args:
            record: a pair, (ID, token vector)
        Returns:
            pairs: a list of pairs of token to ID
        """
        ID, tokenvector = record
        pairs = [(k,ID) for k in tokenvector]
        return (pairs)
    
    amazonInvPairsRDD = (amazonWeightsRDD 
                            .flatMap(invert) 
                            .cache())
    
    googleInvPairsRDD = (googleWeightsRDD 
                            .flatMap(invert) 
                            .cache())
    
    print 'There are %s Amazon inverted pairs and %s Google inverted pairs.' % (amazonInvPairsRDD.count(),
                                                                                googleInvPairsRDD.count())
    

    Identify common tokens from the full dataset

    Here we join the Amazon and Google inverted-index RDDs and regroup the result into ((Amazon ID, Google URL), token) pairs.

    # TODO: Replace <FILL IN> with appropriate code
    def swap(record):
        """ Swap (token, (ID, URL)) to ((ID, URL), token)
        Args:
            record: a pair, (token, (ID, URL))
        Returns:
            pair: ((ID, URL), token)
        """
        token = record[0]
        keys = record[1]
        return (keys, token)
    
    commonTokens = (amazonInvPairsRDD 
                        .join(googleInvPairsRDD) 
                        .map(swap) 
                        .groupByKey() 
                        .cache())
    
    print 'Found %d common tokens' % commonTokens.count()
    

    Compute cosine similarities from the common tokens

    Last step: combine the two previously computed weight RDDs, amazonWeightsRDD and googleWeightsRDD, with the result above to compute the cosine similarities.

    # TODO: Replace <FILL IN> with appropriate code
    amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collectAsMap()) 
    googleWeightsBroadcast = sc.broadcast(googleWeightsRDD.collectAsMap())
    
    def fastCosineSimilarity(record):
        """ Compute Cosine Similarity using Broadcast variables
        Args:
            record: ((ID, URL), token)
        Returns:
            pair: ((ID, URL), cosine similarity value)
        """
        amazonRec = record[0][0] 
        googleRec = record[0][1] 
        tokens = record[1]
        value = (sum(amazonWeightsBroadcast.value[amazonRec][t] * googleWeightsBroadcast.value[googleRec][t]
                     for t in tokens
                     if t in amazonWeightsBroadcast.value[amazonRec] and t in googleWeightsBroadcast.value[googleRec])
                 / (amazonNormsBroadcast.value[amazonRec] * googleNormsBroadcast.value[googleRec]))
        key = (amazonRec, googleRec)
        return (key, value)
    
    similaritiesFullRDD = (commonTokens
                           .map(fastCosineSimilarity) 
                           .cache())
    
    print similaritiesFullRDD.count()
    

    Part 5 Analysis

    That concludes the computation; now we validate the results. We need to choose a similarity threshold to decide whether a pair of records from the two data sets refers to the same entity. We judge that choice using precision and recall, and the F-score is the usual single measure of model quality.
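
    For example, with hypothetical counts of 300 true positives, 100 false positives, and 1000 false negatives at some threshold, precision = 300 / (300 + 100) = 0.75, recall = 300 / (300 + 1000) ≈ 0.23, and F-measure = 2 * 0.75 * 0.23 / (0.75 + 0.23) ≈ 0.35 (the numbers are made up purely to show the formulas).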

    Counting True Positives, False Positives, and False Negatives

    # Create an RDD of ((Amazon ID, Google URL), similarity score)
    simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
    assert (simsFullRDD.count() == 2441100)
    
    # Create an RDD of just the similarity scores
    simsFullValuesRDD = (simsFullRDD
                         .map(lambda x: x[1])
                         .cache())
    assert (simsFullValuesRDD.count() == 2441100)
    
    # Look up all similarity scores for true duplicates
    
    # This helper function will return the similarity score for records that are in the gold standard and the simsFullRDD (True positives), and will return 0 for records that are in the gold standard but not in simsFullRDD (False Negatives).
    def gs_value(record):
        if (record[1][1] is None):
            return 0
        else:
            return record[1][1]
    
    # Join the gold standard and simsFullRDD, and then extract the similarities scores using the helper function
    trueDupSimsRDD = (goldStandard
                      .leftOuterJoin(simsFullRDD)
                      .map(gs_value)
                      .cache())
    print 'There are %s true duplicates.' % trueDupSimsRDD.count()
    assert(trueDupSimsRDD.count() == 1300)
    

    To pick a suitable threshold, we implement the counting functions with Spark accumulators; this is the first time accumulators appear in these labs. A minimal usage sketch follows before the lab code.
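
    As a minimal sketch of the accumulator pattern (the scores RDD and the variable names are invented; this is not part of the assignment), workers may only add to an accumulator, and only the driver can read its value:

    # Count how many toy scores exceed 0.5 with a simple scalar accumulator.
    scores = sc.parallelize([0.1, 0.7, 0.4, 0.9])
    aboveHalf = sc.accumulator(0)

    def countAboveHalf(x):
        global aboveHalf
        if x > 0.5:
            aboveHalf += 1      # workers can only add

    scores.foreach(countAboveHalf)
    print aboveHalf.value       # read on the driver; should print 2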

    from pyspark.accumulators import AccumulatorParam
    class VectorAccumulatorParam(AccumulatorParam):
        # Initialize the VectorAccumulator to 0
        def zero(self, value):
            return [0] * len(value)
    
        # Add two VectorAccumulator variables
        def addInPlace(self, val1, val2):
            for i in xrange(len(val1)):
                val1[i] += val2[i]
            return val1
    
    # Return a list with entry x set to value and all other entries set to 0
    def set_bit(x, value, length):
        bits = []
        for y in xrange(length):
            if (x == y):
              bits.append(value)
            else:
              bits.append(0)
        return bits
    
    # Pre-bin counts of false positives for different threshold ranges
    BINS = 101
    nthresholds = 100
    def bin(similarity):
        return int(similarity * nthresholds)
    
    # fpCounts[i] = number of entries (possible false positives) where bin(similarity) == i
    zeros = [0] * BINS
    fpCounts = sc.accumulator(zeros, VectorAccumulatorParam())
    
    def add_element(score):
        global fpCounts
        b = bin(score)
        fpCounts += set_bit(b, 1, BINS)
    
    simsFullValuesRDD.foreach(add_element)
    
    # Remove true positives from FP counts
    def sub_element(score):
        global fpCounts
        b = bin(score)
        fpCounts += set_bit(b, -1, BINS)
    
    trueDupSimsRDD.foreach(sub_element)
    
    def falsepos(threshold):
        fpList = fpCounts.value
        return sum([fpList[b] for b in range(0, BINS) if float(b) / nthresholds >= threshold])
    
    def falseneg(threshold):
        return trueDupSimsRDD.filter(lambda x: x < threshold).count()
    
    def truepos(threshold):
        return trueDupSimsRDD.count() - falsenegDict[threshold]
    

    Precision, Recall, and F-measures

    # Precision = true-positives / (true-positives + false-positives)
    # Recall = true-positives / (true-positives + false-negatives)
    # F-measure = 2 x Recall x Precision / (Recall + Precision)
    
    def precision(threshold):
        tp = trueposDict[threshold]
        return float(tp) / (tp + falseposDict[threshold])
    
    def recall(threshold):
        tp = trueposDict[threshold]
        return float(tp) / (tp + falsenegDict[threshold])
    
    def fmeasure(threshold):
        r = recall(threshold)
        p = precision(threshold)
        return 2 * r * p / (r + p)
    

    Line Plots

    thresholds = [float(n) / nthresholds for n in range(0, nthresholds)]
    falseposDict = dict([(t, falsepos(t)) for t in thresholds])
    falsenegDict = dict([(t, falseneg(t)) for t in thresholds])
    trueposDict = dict([(t, truepos(t)) for t in thresholds])
    
    precisions = [precision(t) for t in thresholds]
    recalls = [recall(t) for t in thresholds]
    fmeasures = [fmeasure(t) for t in thresholds]
    
    print precisions[0], fmeasures[0]
    assert (abs(precisions[0] - 0.000532546802671) < 0.0000001)
    assert (abs(fmeasures[0] - 0.00106452669505) < 0.0000001)
    
    
    fig = plt.figure()
    plt.plot(thresholds, precisions)
    plt.plot(thresholds, recalls)
    plt.plot(thresholds, fmeasures)
    plt.legend(['Precision', 'Recall', 'F-measure'])
    pass
    

    State-of-the-art methods reach an F-score of about 60%, while ours only reaches about 40%. There are three directions for improvement: 1. use additional features; 2. process the features with better models, e.g. stemming or n-grams; 3. use a different similarity function. A small sketch of the n-gram idea is given below.
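
    As a rough sketch of the n-gram suggestion (the bigrams helper below is illustrative, not part of the lab), token bigrams retain a little of the word-order information that plain bag of words throws away:

    def bigrams(tokens):
        # join each pair of adjacent tokens into a single feature
        return ['%s %s' % (tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

    print bigrams(tokenize('adobe photoshop cs3 windows dvd'))
    # should give ['adobe photoshop', 'photoshop cs3', 'cs3 windows', 'windows dvd']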

  • Original post: https://www.cnblogs.com/-Sai-/p/6714582.html