zoukankan      html  css  js  c++  java
  • 电商商品概念与销售知识图谱

    GoodsKG

      GoodsKG, 基于京东网站的1300种商品上下级概念,约10万商品品牌,约65万品牌销售关系,商品描述维度等知识库,基于该知识库可以支持商品属性库构建,商品销售问答,品牌物品生产等知识查询服务,也可用于情感分析等下游应用.

    项目介绍

      概念层级知识是整个常识知识体系中的重要组成部分。概念层级目前包括百科性概念层级和专业性概念层级两类,百度百科概念体系以及互动百科分类体系是其中的一个代表。就后者而言,目前出现了许多垂直行业概念层级,如医疗领域概念体系,垫上领域概念体系,其中以淘宝、京东等为代表的电商网站在以商品为中心上,构建起了一种商品概念目录层级以及商品与品牌,品牌与品牌之间的关联关系。
      本项目认为,电商网站中的商品分类目录能够供我们构建起一个商品概念体系,基于商品首页,我们可以得到商品与商品品牌之间的关系,商品的属性以及属性的取值信息。基于这类信息,又可以进一步得到商品的画像以及商品品牌的画像。基于该画像。可以对自然语言处理处理的几个下游应用带来帮助,如商品品牌识别,商品对象及属性级别情感分析,商品评价短语库构建,商品品牌竞争关系梳理等。
      因此,本项目以京东电商为实验数据来源,采集京东商品目录树,并获取其对应的底层商品概念信息,组织形成商品知识图谱。目前,该图谱包括有概念的上下位is a关系以及商品品牌与商品之间的销售sale关系共两类关系,涉及商品概念数目1300+,商品品牌数目约10万+,属性数目几千种,关系数目65万规模。该项目可以进一步增强商品领域概念体系的应用,对自然语言处理处理的几个下游应用带来帮助,如商品品牌识别,商品对象及属性级别情感分析,商品评价短语库构建,商品品牌竞争关系梳理等提供基础性的概念服务。

    数据介绍

    1, 基本数据内容 

     

    2, is-a概念上下位关系 

    MATCH p=()-[r:is_a]->() RETURN p LIMIT 25

    3, sale销售关系

    MATCH p=()-[r:sales]->() RETURN p LIMIT 25

    4, 混合关联关系 

    MATCH (n:Product) where n.name=’手机’ RETURN n LIMIT 25

    GitHub源码

    https://github.com/chenjj9527/ProductKnowledgeGraph-master

      1 #!/usr/bin/env python3
      2 # coding: utf-8
      3 # File: build_kg.py
      4 # Author: cjj
      5 # Date: 19-12-23
      6 
      7 import urllib.request
      8 from urllib.parse import quote_plus
      9 from lxml import etree
     10 import gzip
     11 import chardet
     12 import json
     13 import pymongo
     14 
     15 class GoodSchema:
     16     def __init__(self):
     17         self.conn = pymongo.MongoClient()
     18         return
     19 
     20     '''获取搜索页'''
     21     def get_html(self, url):
     22         headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17"}
     23         try:
     24             req = urllib.request.Request(url, headers=headers)
     25             data = urllib.request.urlopen(req).read()
     26             coding = chardet.detect(data)
     27             html = data.decode(coding['encoding'])
     28         except:
     29             req = urllib.request.Request(url, headers=headers)
     30             data = urllib.request.urlopen(req).read()
     31             html = data.decode('gbk')
     32 
     33 
     34         return html
     35 
     36     '''获取详情页'''
     37     def get_detail_html(self, url):
     38         headers = {
     39             "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
     40             "accept-encoding": "gzip, deflate, br",
     41             "accept-language": "en-US,en;q=0.9",
     42             "cache-control": "max-age=0",
     43             "referer": "https://www.jd.com/allSort.aspx",
     44             "upgrade-insecure-requests": 1,
     45             "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/66.0.3359.181 Chrome/66.0.3359.181 Safari/537.36"
     46         }
     47         try:
     48             req = urllib.request.Request(url, headers=headers)
     49             data = urllib.request.urlopen(req).read()
     50             html = gzip.decompress(data)
     51             coding = chardet.detect(html)
     52             html = html.decode(coding['encoding'])
     53         except Exception as e:
     54             req = urllib.request.Request(url, headers=headers)
     55             data = urllib.request.urlopen(req).read()
     56             html = gzip.decompress(data)
     57             html = html.decode('gbk')
     58         return html
     59 
     60 
     61     '''根据主页获取数据'''
     62     def home_list(self):
     63         url = 'https://www.jd.com/allSort.aspx'
     64         html = self.get_html(url)
     65         selector = etree.HTML(html)
     66         divs = selector.xpath('//div[@class= "category-item m"]')
     67         for indx, div in enumerate(divs):
     68             first_name = div.xpath('./div[@class="mt"]/h2/span/text()')[0]
     69             second_classes = div.xpath('./div[@class="mc"]/div[@class="items"]/dl')
     70             for dl in second_classes:
     71                 second_name = dl.xpath('./dt/a/text()')[0]
     72                 third_classes = ['https:' + i for i in dl.xpath('./dd/a/@href')]
     73                 third_names = dl.xpath('./dd/a/text()')
     74                 for third_name, url in zip(third_names, third_classes):
     75                     try:
     76                         attr_dict = self.parser_goods(url)
     77                         attr_brand = self.collect_brands(url)
     78                         attr_dict.update(attr_brand)
     79                         data = {}
     80                         data['fisrt_class'] = first_name
     81                         data['second_class'] = second_name
     82                         data['third_class'] = third_name
     83                         data['attrs'] = attr_dict
     84                         self.conn['goodskg']['data'].insert(data)
     85                         print(indx, len(divs), first_name, second_name, third_name)
     86                     except Exception as e:
     87                         print(e)
     88         return
     89 
     90     '''解析商品数据'''
     91     def parser_goods(self, url):
     92         html = self.get_detail_html(url)
     93         selector = etree.HTML(html)
     94         title = selector.xpath('//title/text()')
     95         attr_dict = {}
     96         other_attrs = ''.join([i for i in html.split('
    ') if 'other_exts' in i])
     97         other_attr = other_attrs.split('other_exts =[')[-1].split('];')[0]
     98         if other_attr and 'var other_exts ={};' not in other_attr:
     99             for attr in other_attr.split('},'):
    100                 if '}' not in attr:
    101                     attr = attr + '}'
    102                 data = json.loads(attr)
    103                 key = data['name']
    104                 value = data['value_name']
    105                 attr_dict[key] = value
    106         attr_divs = selector.xpath('//div[@class="sl-wrap"]')
    107         for div in attr_divs:
    108             attr_name = div.xpath('./div[@class="sl-key"]/span/text()')[0].replace('','')
    109             attr_value = ';'.join([i.replace('  ','') for i in div.xpath('./div[@class="sl-value"]/div/ul/li/a/text()')])
    110             attr_dict[attr_name] = attr_value
    111 
    112         return attr_dict
    113 
    114     '''解析品牌数据'''
    115     def collect_brands(self, url):
    116         attr_dict = {}
    117         brand_url = url + '&sort=sort_rank_asc&trans=1&md=1&my=list_brand'
    118         html = self.get_html(brand_url)
    119         if 'html' in html:
    120             return attr_dict
    121         data = json.loads(html)
    122         brands = []
    123 
    124         if 'brands' in data and data['brands'] is not None:
    125             brands = [i['name'] for i in data['brands']]
    126         attr_dict['品牌'] = ';'.join(brands)
    127 
    128         return attr_dict
    129 
    130 
    131 
    132 if __name__ == '__main__':
    133     handler = GoodSchema()
    134     handler.home_list()
      1 #!/usr/bin/env python3
      2 # coding: utf-8
      3 # File: build_kg.py
      4 # Author: cjj
      5 # Date: 19-12-23
      6 
      7 import json
      8 import os
      9 from py2neo import Graph, Node, Relationship
     10 
     11 
     12 class GoodsKg:
     13     def __init__(self):
     14         cur = '/'.join(os.path.abspath(__file__).split('/')[:-1])
     15         self.data_path = os.path.join(cur, 'data/goods_info.json')
     16         self.g = Graph(
     17             host="127.0.0.1",  # neo4j 搭载服务器的ip地址,ifconfig可获取到
     18             http_port=7474,  # neo4j 服务器监听的端口号
     19             user="neo4j",  # 数据库user name,如果没有更改过,应该是neo4j
     20             password="111111")
     21         return
     22 
     23     '''读取数据'''
     24     def read_data(self):
     25         rels_goods = []
     26         rels_brand = []
     27         goods_attrdict = {}
     28         concept_goods = set()
     29         concept_brand = set()
     30         count = 0
     31         for line in open(self.data_path,encoding='UTF-8'):
     32             count += 1
     33             print(count)
     34             line = line.strip()
     35             data = json.loads(line)
     36             first_class = data['fisrt_class'].replace("'",'')
     37             second_class = data['second_class'].replace("'",'')
     38             third_class = data['third_class'].replace("'",'')
     39             attr = data['attrs']
     40             concept_goods.add(first_class)
     41             concept_goods.add(second_class)
     42             concept_goods.add(third_class)
     43             rels_goods.append('@'.join([second_class, 'is_a', '属于', first_class]))
     44             rels_goods.append('@'.join([third_class, 'is_a', '属于', second_class]))
     45 
     46             if attr and '品牌' in attr:
     47                 brands = attr['品牌'].split(';')
     48                 for brand in brands:
     49                     brand = brand.replace("'",'')
     50                     concept_brand.add(brand)
     51                     rels_brand.append('@'.join([brand, 'sales', '销售', third_class]))
     52 
     53             goods_attrdict[third_class] = {name:value for name,value in attr.items() if name != '品牌'}
     54 
     55         return concept_brand, concept_goods, rels_goods, rels_brand
     56 
     57     '''构建图谱'''
     58     def create_graph(self):
     59         concept_brand, concept_goods, rels_goods, rels_brand = self.read_data()
     60         print('creating nodes....')
     61         self.create_node('Product', concept_goods)
     62         self.create_node('Brand', concept_brand)
     63         print('creating edges....')
     64         self.create_edges(rels_goods, 'Product', 'Product')
     65         self.create_edges(rels_brand, 'Brand', 'Product')
     66         return
     67 
     68     '''批量建立节点'''
     69     def create_node(self, label, nodes):
     70         pairs = []
     71         bulk_size = 1000
     72         batch = 0
     73         bulk = 0
     74         batch_all = len(nodes)//bulk_size
     75         print(batch_all)
     76         for node_name in nodes:
     77             sql = """CREATE(:%s {name:'%s'})""" % (label, node_name)
     78             pairs.append(sql)
     79             bulk += 1
     80             if bulk % bulk_size == 0 or bulk == batch_all+1:
     81                 sqls = '
    '.join(pairs)
     82                 self.g.run(sqls)
     83                 batch += 1
     84                 print(batch*bulk_size,'/', len(nodes), 'finished')
     85                 pairs = []
     86         return
     87 
     88 
     89     '''构造图谱关系边'''
     90     def create_edges(self, rels, start_type, end_type):
     91         batch = 0
     92         count = 0
     93         for rel in set(rels):
     94             count += 1
     95             rel = rel.split('@')
     96             start_name = rel[0]
     97             end_name = rel[3]
     98             rel_type = rel[1]
     99             rel_name = rel[2]
    100             sql = 'match (m:%s), (n:%s) where m.name = "%s" and n.name = "%s" create (m)-[:%s{name:"%s"}]->(n)' %(start_type, end_type, start_name, end_name,rel_type,rel_name)
    101             try:
    102                 self.g.run(sql)
    103             except Exception as e:
    104                 print(e)
    105             if count%10 == 0:
    106                 print(count)
    107 
    108         return
    109 
    110 
    111 if __name__ =='__main__':
    112     handler = GoodsKg()
    113     handler.create_graph()
  • 相关阅读:
    使用WebAPI流式传输大文件(在IIS上大于2GB)
    网页扫描仪图像上传
    网页高拍仪图像上传
    C++
    Tomcat Connector三种执行模式(BIO, NIO, APR)的比較和优化
    编程精粹--编写高质量C语言代码(1):假想编译程序
    一个软件项目的总纲性的測试计划叫什么?
    Java字符串的格式化与输出
    Servlet入门(第一个Servlet的Web程序)
    求职小技巧,赢得大机会
  • 原文地址:https://www.cnblogs.com/chen8023miss/p/12083522.html
Copyright © 2011-2022 走看看