  • Study notes (13): training on WikiSQL with decaNLP

    Convert natural language into SQL statements, so that reports can be queried conversationally.

    References

    Reference 1

    https://mp.weixin.qq.com/s/i7WAFjQHK1NGVACR8x3v0A

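    Semantic parsing. SQL query generation is related to semantic parsing. Models trained on the WikiSQL dataset translate natural-language questions into structured SQL queries, so that users can interact with a database in natural language. WikiSQL is evaluated by logical form exact match (lfEM), which ensures that a model does not obtain the correct answer from an incorrectly generated query.

    Reference 2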

    http://decanlp.com/

    Semantic Parsing
    Semantic parsing requires models to translate unstructured information into structured formats so that users can interact with structured information (e.g. a database) in natural language. decaNLP includes the WikiSQL dataset, which maps natural language questions into structured SQL queries.

    Reference 3

    https://github.com/salesforce/WikiSQL

    Installation

    Create a Python virtual environment.
    Download the source code:
    git clone https://github.com/salesforce/WikiSQL
    cd WikiSQL
    pip install -r requirements.txt
    tar xvjf data.tar.bz2
    

    Data

    After extraction, the data files are in the data directory.

    Each line of a .jsonl file is one JSON object.

    The .db files are SQLite3 databases.
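Because every line of a .jsonl file is an independent JSON object, the files can be read line by line. The following is a minimal sketch, not code from the WikiSQL repository: the helper name read_jsonl and the throwaway demo file are assumptions made for illustration, so that the example runs without the real data.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a temporary file so the sketch runs without the WikiSQL data.
sample = [
    {"table_id": "1-10015132-11", "phase": 1},
    {"table_id": "1-10015132-11", "phase": 1},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    records = list(read_jsonl(demo_path))

print(len(records), records[0]["table_id"])
```

With the real data, the same helper could be pointed at data/dev.jsonl or data/dev.tables.jsonl.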

    To inspect a .db file, you can download a tool from here: https://github.com/pawelsalawa/sqlitestudio/releases/tag/3.2.1

    Questions, queries, and table IDs

    File /Users/huihui/git/WikiSQL/data/dev.jsonl

    {
    	"phase": 1,
    	"table_id": "1-10015132-11",  
    	"question": "What position does the player who played for butler cc (ks) play?", 
    	"sql": {
    		"sel": 3, 
    		"conds": [
    			[5, 0, "Butler CC (KS)"]
    		],
    		"agg": 0
    	}
    }
    
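    • phase: the phase in which the example was collected; WikiSQL was collected in two phases.
    • table_id: the ID of the table the question refers to.
    • question: the natural-language question written by the worker.
    • sql: the SQL query corresponding to the question, with the following subfields:
      • sel: index of the selected column.
      • agg: index of the aggregation operator in agg_ops = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
      • conds: a list of triples:
        • column_index: index of the column
        • operator_index: index of the operator in ['=', '>', '<', 'OP']
        • condition: the value to compare against, a float or a string

    The supported queries are therefore MAX, MIN, COUNT, SUM, AVG, and the comparisons greater than, less than, and equal to.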
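To make the index encoding concrete, here is a small sketch that renders the sql field as a readable SQL string. It is not the repository's own implementation: the function name render_sql, the quoting style, and joining multiple conditions with AND are assumptions made for illustration. The header list is the header field of the matching entry in the table file.

```python
AGG_OPS = ['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']
COND_OPS = ['=', '>', '<', 'OP']


def render_sql(sql, header, table_name):
    """Render a WikiSQL-style sql dict as a human-readable SQL string.

    Hypothetical helper for display only; it does not validate its input.
    """
    sel_col = header[sql["sel"]]
    agg = AGG_OPS[sql["agg"]]
    select = f'{agg}("{sel_col}")' if agg else f'"{sel_col}"'
    query = f"SELECT {select} FROM {table_name}"

    conds = []
    for col_idx, op_idx, value in sql["conds"]:
        if isinstance(value, str):
            # Double any single quote so the literal stays well-formed.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        conds.append(f'"{header[col_idx]}" {COND_OPS[op_idx]} {literal}')
    if conds:
        # Assumption: multiple conditions are combined with AND.
        query += " WHERE " + " AND ".join(conds)
    return query


header = ["Player", "No.", "Nationality", "Position",
          "Years in Toronto", "School/Club Team"]
sql = {"sel": 3, "conds": [[5, 0, "Butler CC (KS)"]], "agg": 0}

rendered = render_sql(sql, header, "table_10015132_11")
print(rendered)
```

For the example above, sel 3 resolves to the Position column and the single condition resolves to School/Club Team = 'Butler CC (KS)'.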

    Table file

    /Users/huihui/git/WikiSQL/data/dev.tables.jsonl

    {
    	"header": ["Player", "No.", "Nationality", "Position", "Years in Toronto", "School/Club Team"],
    	"page_title": "Toronto Raptors all-time roster",
    	"types": ["text", "text", "text", "text", "text", "text"],
    	"id": "1-10015132-11",
    	"section_title": "L",
    	"caption": "L",
    	"rows": [
    		["Antonio Lang", "21", "United States", "Guard-Forward", "1999-2000", "Duke"],
    		["Voshon Lenard", "2", "United States", "Guard", "2002-03", "Minnesota"],
    		["Martin Lewis", "32, 44", "United States", "Guard-Forward", "1996-97", "Butler CC (KS)"],
    		["Brad Lohaus", "33", "United States", "Forward-Center", "1996", "Iowa"],
    		["Art Long", "42", "United States", "Forward-Center", "2002-03", "Cincinnati"],
    		["John Long", "25", "United States", "Guard", "1996-97", "Detroit"],
    		["Kyle Lowry", "3", "United States", "Guard", "2012-Present", "Villanova"]
    	],
    	"name": "table_10015132_11"
    }
    

    Database .db file

    In the database, column names are replaced by col0, col1, and so on, in order to save space.
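The effect of the col0, col1 naming can be tried out with an in-memory SQLite database. This is only a sketch under assumptions: the table name demo_table and the all-TEXT schema are made up for illustration and are not taken from the shipped .db files. The rows are a subset of the table example above.

```python
import sqlite3

header = ["Player", "No.", "Nationality", "Position",
          "Years in Toronto", "School/Club Team"]
rows = [
    ("Antonio Lang", "21", "United States", "Guard-Forward", "1999-2000", "Duke"),
    ("Martin Lewis", "32, 44", "United States", "Guard-Forward", "1996-97", "Butler CC (KS)"),
    ("Kyle Lowry", "3", "United States", "Guard", "2012-Present", "Villanova"),
]

# Columns are named by position (col0, col1, ...) instead of by header text.
cols = [f"col{i}" for i in range(len(header))]

conn = sqlite3.connect(":memory:")
# demo_table and the TEXT column types are assumptions for this sketch.
conn.execute(f"CREATE TABLE demo_table ({', '.join(c + ' TEXT' for c in cols)})")
placeholders = ", ".join("?" for _ in cols)
conn.executemany(f"INSERT INTO demo_table VALUES ({placeholders})", rows)

# sel = 3 -> col3 (Position); condition [5, 0, "Butler CC (KS)"] -> col5 = ...
result = conn.execute(
    "SELECT col3 FROM demo_table WHERE col5 = ?", ("Butler CC (KS)",)
).fetchall()
conn.close()

print(result)
```

The column indices in the sql field map directly onto these positional names, so no header lookup is needed to execute a query.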

    Testing

    A sample prediction can be found in the file test/example.pred.dev.jsonl:

    {
    	"query": {
    		"sel": 3,
    		"agg": 0,
    		"conds": [
    			[5, 0, "butler cc (ks)"]
    		]
    	},
    	"seq": {
    		"words": ["symselect", "symagg", "symcol", "position", "symwhere", "symcol", "school/club", "team", "symop", "=", "symcond", "butler", "cc", "-lrb-", "ks", "-rrb-"],
    		"after": [" ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", "", "", " "],
    		"num": [1, 12, 4, 28, 2, 4, 32, 33, 9, 20, 10, 40, 41, 42, 43, 44],
    		"gloss": ["SYMSELECT", "SYMAGG", "SYMCOL", "Position", "SYMWHERE", "SYMCOL", "School/Club", "Team", "SYMOP", "=", "SYMCOND", "butler", "cc", "(", "ks", ")"]
    	},
    	"error": ""
    }
    

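    A sample prediction file is provided as test/example.pred.dev.jsonl.bz2. Decompress it with the command bunzip2 test/example.pred.dev.jsonl.bz2 -k (the -k flag keeps the compressed original).

    A Dockerfile is provided that packages the dependencies needed to run the evaluation script.

    First build the image from the repository root:
    docker build -t wikisqltest -f test/Dockerfile .
    Then run the image:
    docker run --rm --name wikisqltest wikisqltest
    If everything works, the output is as follows: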
    {
      "ex_accuracy": 0.5380596128725804,
      "lf_accuracy": 0.35375846099038116
    }
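The lf_accuracy value presumably corresponds to the logical form exact match described in Reference 1, and ex_accuracy to comparing the results of executing the queries. As a rough sketch of the logical form idea only: the functions below are hypothetical, and the case-insensitive, order-insensitive comparison of conditions is an assumption, not necessarily what the repository's evaluation script does.

```python
def normalize(query):
    """Reduce a query dict to a comparable form (hypothetical helper)."""
    # Assumption: condition values are compared as lowercase strings and
    # the order of conditions does not matter.
    conds = sorted((col, op, str(val).lower()) for col, op, val in query["conds"])
    return query["sel"], query["agg"], conds


def lf_match(pred, gold):
    """True if the predicted query has the same logical form as the gold one."""
    return normalize(pred) == normalize(gold)


def lf_accuracy(preds, golds):
    """Fraction of predictions whose logical form matches the gold query."""
    matches = [lf_match(p, g) for p, g in zip(preds, golds)]
    return sum(matches) / len(matches)


gold = {"sel": 3, "conds": [[5, 0, "Butler CC (KS)"]], "agg": 0}
pred = {"sel": 3, "agg": 0, "conds": [[5, 0, "butler cc (ks)"]]}
wrong = {"sel": 3, "agg": 3, "conds": [[5, 0, "butler cc (ks)"]]}

print(lf_match(pred, gold))
accuracy = lf_accuracy([pred, wrong], [gold, gold])
print(accuracy)
```

Under these assumptions the sample prediction shown earlier matches the gold query from dev.jsonl, while a prediction with a different aggregation index does not.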
    
    I used sudo:
    xuehp@haomeiya002:~/git/WikiSQL$ sudo docker build -t wikisqltest -f test/Dockerfile .
    xuehp@haomeiya002:~/git/WikiSQL$ sudo docker run --rm --name wikisqltest wikisqltest
    

  • Original post: https://www.cnblogs.com/xuehuiping/p/12258404.html