zoukankan      html  css  js  c++  java
  • 【知识图谱】实体对齐(1)

      实体对齐、实体消歧作为知识图谱中的重要一环,最近感兴趣找了些资料记录分享一下,本篇将实现一种简单的实体对齐的方法,主要思路是利用tfidf计算文本相似度,本文准备了两个测试数据集,entity_list.csv是50个实体,valid_data.csv是需要消歧的语句。

    知识库entity_list的格式如下

    需要进行实体消歧的语句valid_data如下:

     具体的实现步骤如下

    1.导入数据以及相关库

    import jieba
    import collections
    import pandas as pd
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.feature_extraction.text import TfidfVectorizer
    entity_data = pd.read_csv('data/entity_disambiguation/entity_list.csv', encoding = 'utf-8')
    valid_data = pd.read_csv('data/entity_disambiguation/valid_data.csv', encoding = 'gb18030')

    2.建立待消歧词列表

    s = ''
    keywords = []
    for i in entity_data['entity_name'].values.tolist():
        s +=i + '|'
    for k,v in collections.Counter(s.split('|')).items():
        if v>1:#频次大于1的词才会有歧义,此处进行筛选
            keywords.append(k)

    3.生成tfidf矩阵

    rain_sentence = []
    for i in entity_data['desc'].values:
        train_sentence.append(' '.join(jieba.cut(i)))
    vec = TfidfVectorizer()
    X = vec.fit_transform(train_sentence)

    4.获取关键词所属entity_id

    def get_entityid(sentence):
        id_start = 1001
        a_list = [' '.join(jieba.cut(sentence))]
        res = cosine_similarity(vec.transform(a_list),X)[0]#[[x,x,x,x]]格式
        top_idx = np.argsort(res)[-1]#获取相似度最大文本所在位置索引
        return id_start + top_idx

    5.循环计算各词索引并返回

    result_data = []
    neighbor_sentence = ''
    for sen in valid_data['sentence']:
        res = []
        for keyword in keywords:
            if keyword in sen:
                len_key = len(keyword)
                ss = ''
                for i in range(len(sen)-len_key+1):
                    if sen[i:i+len_key] == keyword:
                        s = str(i) + '-' +str(i+len_key) + ':'
                        if i > 10 and i + len_key < len(sen) - 9:
                            neighbor_sentence = sen[i - 10:i + len_key + 9]
                        elif i < 10:
                            neighbor_sentence = sen[:20]
                        elif i + len_key > len(sen) - 9:
                            neighbor_sentence = sen[-20:]
                        s += str(get_entityid(neighbor_sentence))  # 拿到 x-x:id
                        ss += s + '|'  # 拿到 x-x:id|x-x:id
                res.append(ss[:-1])  #去除最后一个'|'
        result_data.append(res)

    输出格式:“实体起始位坐标-实体结束位坐标:实体序号”以“|”分隔的字符串

    参考引用:

    https://blog.csdn.net/lt326030434/article/details/88668764?utm_medium=distribute.pc_relevant.none-task-blog-baidujs-1

    https://github.com/Maginaaa/NamedEntityDisambiguation

  • 相关阅读:
    C# 填充客户端提交的值到T对象
    mvc中hangfire全局简单配置
    mvc企业微信开发全局配置
    js获取简单表单对象(1)
    MVC伪静态路由简单搭配
    [转]一些实用的图表Chart制作工具
    【转】SQL Server 数据库内部版本号
    SVN的搭建和使用总结
    解决ext时间插件在谷歌下变宽的BUG
    Hibernate中Session.get()/load()之区别
  • 原文地址:https://www.cnblogs.com/pobby/p/13131753.html
Copyright © 2011-2022 走看看