Relation Extraction -- A Brief Overview of Evaluation Datasets

Commonly used datasets
- ACE 2005: 599 documents, 7 types;
- SemEval 2010 Task 8 dataset:
  
  19 types (9 directed relations counted in both directions, plus Other)
  
  train data: 8000
  
  test data: 2717
- NYT+Freebase, built with distant supervision, so it contains noisy labels:
  
  53 types
  
  train data: 522611 sentences; note that nearly 80% of these sentences are labeled NA
  
  test data: 172448 sentences;
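The NA share quoted above is worth checking on whichever copy of the data is actually used, since it determines how skewed the label distribution is. A minimal sketch, assuming the relation labels have already been read into a Python list (the `toy_labels` values are illustrative stand-ins, not taken from the dataset):

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of instances carrying each relation label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Illustrative toy labels: 8 of 10 instances are NA.
toy_labels = ["NA"] * 8 + [
    "/location/location/contains",
    "/people/person/nationality",
]
shares = label_shares(toy_labels)
print(f"NA share: {shares['NA']:.0%}")
```

Taking the figures above at face value, roughly 0.8 × 522611 ≈ 418,000 training sentences would carry the NA label.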
The papers are grouped below by learning method:
- Fully Supervised Learning
- Distant Supervised Learning
- Joint Learning with entity and relation
- Tree Based Methods
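Specifically:

Fully supervised methods are usually evaluated on the SemEval 2010 Task 8 dataset, whose labels are fully accurate.

Format:
    The <e1>microphone</e1> converts sound into an electrical <e2>signal</e2>.
    Cause-Effect(e1,e2)
    Comment:
The first line is the sentence, the second line is the relation between the two entities, and the third line is a comment.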
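The three-line record described above can be parsed with a couple of regular expressions. This is a minimal sketch rather than an official loader: `SemEvalExample` and `parse_semeval_record` are hypothetical names, and tolerating a leading id plus tab and surrounding quotes on the sentence line is an assumption about how the distributed files may differ from the sample shown here.

```python
import re
from typing import NamedTuple, Optional


class SemEvalExample(NamedTuple):
    sentence: str             # sentence with the <e1>/<e2> tags left in place
    e1: str
    e2: str
    relation: str             # e.g. "Cause-Effect"
    direction: Optional[str]  # "e1,e2" or "e2,e1"; None for an undirected label
    comment: str


E1_RE = re.compile(r"<e1>(.*?)</e1>")
E2_RE = re.compile(r"<e2>(.*?)</e2>")
LABEL_RE = re.compile(r"^([^()]+)(?:\((e[12],e[12])\))?$")


def parse_semeval_record(lines):
    """Parse one record: sentence line, relation line, comment line."""
    sentence, label, comment = (line.strip() for line in lines[:3])
    # Assumption: tolerate an optional "<id>\t" prefix and surrounding quotes.
    if "\t" in sentence:
        sentence = sentence.split("\t", 1)[1]
    sentence = sentence.strip('"')
    e1, e2 = E1_RE.search(sentence), E2_RE.search(sentence)
    label_match = LABEL_RE.match(label)
    if not (e1 and e2 and label_match):
        raise ValueError("record does not match the three-line format")
    if comment.startswith("Comment:"):
        comment = comment[len("Comment:"):].strip()
    return SemEvalExample(
        sentence, e1.group(1), e2.group(1),
        label_match.group(1), label_match.group(2), comment,
    )


example = parse_semeval_record([
    "The <e1>microphone</e1> converts sound into an electrical <e2>signal</e2>.",
    "Cause-Effect(e1,e2)",
    "Comment:",
])
print(example.e1, example.relation, example.direction, example.e2)
```

Keeping the direction separate from the relation name makes it easy to evaluate with or without directionality.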

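Distantly supervised methods use the NYT+Freebase dataset. A sample line from the NYT training data:

    m.0ccvx    m.05gf08    queens    belle_harbor    /location/location/contains    .....officials yesterday to reopen their investigation into the fatal crash of a passenger jet in belle_harbor , queens......    ###END###

Each line has 6 columns: the first two are the Freebase MIDs of the two entities, the third and fourth are the two entities' strings as they appear in the sentence, the fifth is the relation, and the last is the sentence (abridged here), terminated by ###END###.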
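The six-column layout described above can be split with a bounded whitespace split, because only the final sentence column contains spaces. The sketch below is illustrative: `NytInstance` and `parse_nyt_line` are hypothetical names, and it assumes the first five columns never contain whitespace (multi-word entity strings use underscores, as in the sample).

```python
from typing import NamedTuple

END_MARKER = "###END###"


class NytInstance(NamedTuple):
    e1_mid: str    # Freebase MID of the first entity
    e2_mid: str    # Freebase MID of the second entity
    e1_text: str   # first entity as it appears in the sentence
    e2_text: str   # second entity as it appears in the sentence
    relation: str
    sentence: str  # sentence text with the end marker removed


def parse_nyt_line(line: str) -> NytInstance:
    """Split one line into its six columns."""
    # Split on any whitespace at most 5 times; the remainder is the sentence.
    fields = line.split(None, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    sentence = fields[5].strip()
    if sentence.endswith(END_MARKER):
        sentence = sentence[: -len(END_MARKER)].rstrip()
    return NytInstance(*fields[:5], sentence)


# Shortened stand-in for the sample line; tab separators are an assumption.
instance = parse_nyt_line(
    "m.0ccvx\tm.05gf08\tqueens\tbelle_harbor\t/location/location/contains\t"
    "the fatal crash of a passenger jet in belle_harbor , queens .\t###END###"
)
print(instance.e1_text, instance.relation, instance.e2_text)
```

Splitting on generic whitespace rather than a fixed delimiter keeps the sketch usable whether a given copy of the data separates columns with tabs or spaces.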

These two datasets, SemEval 2010 Task 8 and NYT+Freebase, are the most widely used.

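Two versions of the NYT dataset are in common use:

    27 relation types: the filtered dataset used by Zeng2015, Ji2017, and others; it is comparatively small and is denoted SMALL.

    53 relation types: the dataset released by Lin2016; it is comparatively large, with roughly 4 times as much training data as the small version, and is denoted LARGE.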

Original article: https://www.cnblogs.com/dhName/p/11778016.html
