实体对齐、实体消歧作为知识图谱中的重要一环,最近感兴趣找了些资料记录分享一下,本篇将实现一种简单的实体对齐的方法,主要思路是利用tfidf计算文本相似度,本文准备了两个测试数据集,entity_list.csv是50个实体,valid_data.csv是需要消歧的语句。
知识库entity_list的格式如下
需要进行实体消歧的语句valid_data如下:
具体的实现步骤如下
1.导入数据以及相关库
import jieba import collections import pandas as pd import numpy as np from sklearn.metrics.pairwise import cosine_similarity from sklearn.feature_extraction.text import TfidfVectorizer entity_data = pd.read_csv('data/entity_disambiguation/entity_list.csv', encoding = 'utf-8') valid_data = pd.read_csv('data/entity_disambiguation/valid_data.csv', encoding = 'gb18030')
2.建立待消歧词列表
s = '' keywords = [] for i in entity_data['entity_name'].values.tolist(): s +=i + '|' for k,v in collections.Counter(s.split('|')).items(): if v>1:#频次大于1的词才会有歧义,此处进行筛选 keywords.append(k)
3.生成tfidf矩阵
rain_sentence = [] for i in entity_data['desc'].values: train_sentence.append(' '.join(jieba.cut(i))) vec = TfidfVectorizer() X = vec.fit_transform(train_sentence)
4.获取关键词所属entity_id
def get_entityid(sentence): id_start = 1001 a_list = [' '.join(jieba.cut(sentence))] res = cosine_similarity(vec.transform(a_list),X)[0]#[[x,x,x,x]]格式 top_idx = np.argsort(res)[-1]#获取相似度最大文本所在位置索引 return id_start + top_idx
5.循环计算各词索引并返回
result_data = [] neighbor_sentence = '' for sen in valid_data['sentence']: res = [] for keyword in keywords: if keyword in sen: len_key = len(keyword) ss = '' for i in range(len(sen)-len_key+1): if sen[i:i+len_key] == keyword: s = str(i) + '-' +str(i+len_key) + ':' if i > 10 and i + len_key < len(sen) - 9: neighbor_sentence = sen[i - 10:i + len_key + 9] elif i < 10: neighbor_sentence = sen[:20] elif i + len_key > len(sen) - 9: neighbor_sentence = sen[-20:] s += str(get_entityid(neighbor_sentence)) # 拿到 x-x:id ss += s + '|' # 拿到 x-x:id|x-x:id res.append(ss[:-1]) #去除最后一个'|' result_data.append(res)
输出格式:“实体起始位坐标-实体结束位坐标:实体序号”以“|”分隔的字符串