  • Python Study Notes (2): Extracting Character Relationships from "Train to Busan" with Python

    Reference: http://www.jianshu.com/p/3bd06f8816d7

     
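    Project principle:
      This experiment relies on simple co-occurrence relationships: Python code extracts a character relationship network from plain text, and Gephi visualizes the resulting network. The basic principle of co-occurrence networks is described next. (A brief English introduction to co-occurrence networks is listed in the references at the end.)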
     
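    Basic principle of co-occurrence networks:
       Entity co-occurrence is a form of extraction based on statistical information. Closely related characters tend to appear together in several consecutive passages of a text. Using the entities (person names) already identified in the text, we count how often, and at what rate, different entities appear together, then set a threshold: when the count exceeds the threshold, the entities are considered to be related in some way.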
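The counting-and-threshold idea can be sketched on toy data. This is only an illustration under stated assumptions: the function name `cooccurrence_edges`, the placeholder names "A", "B", "C", and the threshold value are all hypothetical, and each pair is counted at most once per paragraph, which is a simplification of the full script further down.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraphs, threshold):
    """Count how often each pair of names shares a paragraph and keep
    only the pairs whose count is greater than the threshold."""
    counts = Counter()
    for names in paragraphs:
        # Deduplicate and sort so each unordered pair is counted once per paragraph.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > threshold}


# Hypothetical toy input: the names already found in each paragraph.
paragraphs = [
    ["A", "B"],
    ["A", "B", "C"],
    ["B", "C"],
    ["A", "B"],
]
edges = cooccurrence_edges(paragraphs, threshold=2)
print(edges)
```

Here "A" and "B" share three paragraphs, so only that pair survives a threshold of 2.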
     
    Preparation:
    1. Environment: Windows, Python 3.6
    2. Module: jieba  https://github.com/fxsjy/jieba
    3. Gephi software

     Name dictionary  http://labfile.oss.aliyuncs.com/courses/677/dict.txt 

    Chinese script of "Train to Busan"  http://labfile.oss.aliyuncs.com/courses/677/busan.txt
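For reference, a jieba user dictionary is a plain text file with one entry per line: the word, an optional frequency, and an optional part-of-speech tag, separated by spaces and in that order. The sketch below builds such lines for a few names tagged `nr` (person name). The helper `userdict_lines` and the sample names are hypothetical and are not taken from the dict.txt linked above.

```python
def userdict_lines(names, freq=None, tag="nr"):
    """Build jieba user-dictionary lines of the form 'word [freq] tag'."""
    lines = []
    for name in names:
        parts = [name]
        if freq is not None:
            parts.append(str(freq))  # the frequency is optional in jieba's format
        parts.append(tag)
        lines.append(" ".join(parts))
    return lines


# Hypothetical sample names, used only to show the line format.
sample = userdict_lines(["石宇", "秀安", "盛京"], freq=10)
print("\n".join(sample))
```

Saving these lines to a UTF-8 text file gives something that `jieba.load_userdict` can read.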

      

    Code:

    # -*- coding: utf-8 -*-
    import codecs

    import jieba
    import jieba.posseg as pseg

    names = {}          # name -> number of occurrences
    relationships = {}  # name -> {other name -> co-occurrence count}
    lineNames = []      # names found in each paragraph

    # Load the name dictionary.
    # Raw strings keep the backslashes in the Windows paths literal;
    # without the r prefix, "\f" in "\fushan.txt" is read as a form feed.
    jieba.load_userdict(r"D:\ResearchContent\Exercise_Programm\PythonExercise\Python\dict.txt")

    # count names
    with codecs.open(r"D:\ResearchContent\Exercise_Programm\PythonExercise\Python\fushan.txt", "r", "utf8") as f:
        for line in f.readlines():
            poss = pseg.cut(line)   # segment the line and tag each word's part of speech
            lineNames.append([])    # name list for the paragraph just read
            for w in poss:
                # A word shorter than 2 characters, or not tagged nr, is not treated as a name.
                if w.flag != "nr" or len(w.word) < 2:
                    continue
                lineNames[-1].append(w.word)    # add this character to the current paragraph
                if names.get(w.word) is None:
                    names[w.word] = 0
                    relationships[w.word] = {}
                names[w.word] += 1              # one more occurrence of this character

    # explore relationships
    for line in lineNames:              # for each paragraph
        for name1 in line:
            for name2 in line:          # every pair of characters in the paragraph
                if name1 == name2:
                    continue
                if relationships[name1].get(name2) is None:
                    # first time the two appear together: create the entry
                    relationships[name1][name2] = 1
                else:
                    # one more co-occurrence of the two
                    relationships[name1][name2] = relationships[name1][name2] + 1

    # output
    with codecs.open("busan_node.txt", "w", "gbk") as f:
        f.write("Id Label Weight\r\n")
        for name, times in names.items():
            f.write(name + " " + name + " " + str(times) + "\r\n")

    with codecs.open("busan_edge.txt", "w", "gbk") as f:
        f.write("Source Target Weight\r\n")
        for name, edges in relationships.items():
            for v, w in edges.items():
                if w > 3:   # keep only pairs that co-occur more than 3 times
                    f.write(name + " " + v + " " + str(w) + "\r\n")
    
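One thing to note about the script above: because the inner loops visit every ordered pair, `relationships` stores each pair in both directions, so the edge file lists every relationship twice. If one row per pair is preferred, a minimal sketch could look like this. The helper `undirected_edges` and the sample counts are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def undirected_edges(relationships, min_weight=3):
    """Return one (source, target, weight) row per unordered pair
    whose weight is greater than min_weight."""
    rows = []
    seen = set()
    for name, edges in relationships.items():
        for other, weight in edges.items():
            pair = tuple(sorted((name, other)))
            if weight > min_weight and pair not in seen:
                seen.add(pair)
                rows.append((pair[0], pair[1], weight))
    return rows


# Hypothetical counts in the same nested-dict shape as `relationships`.
relationships = {
    "A": {"B": 5, "C": 1},
    "B": {"A": 5},
    "C": {"A": 1},
}
rows = undirected_edges(relationships)

# Write space-separated rows, matching the layout of busan_edge.txt.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=" ", lineterminator="\n")
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(rows)
print(buffer.getvalue())
```

Only the A/B pair exceeds the weight of 3 here, and it is written once instead of twice.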


     

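    References:

    A brief English introduction to co-occurrence networks: https://forec.github.io/2016/10/03/co-occurrence-structure-capture/

    Chinese word segmentation in Python with jieba: http://www.cnblogs.com/kaituorensheng/p/3595879.html

    Explanation of import ... as: https://www.zhihu.com/question/20871904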


  • Original post: https://www.cnblogs.com/jiawang/p/6155186.html