  • Robot Poetry Writing Project: Data Preprocessing

    First, let's look at the complete code.

    import collections
    
    start_token = 'G'
    end_token = 'E'
    
    def process_poems(file_name):
        # collection of poems
        poems = []
        with open(file_name, "r", encoding='utf-8', ) as f:
            for line in f.readlines():
                try:
                    title, content = line.strip().split(':')
                    content = content.replace(' ', '')
                    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                            start_token in content or end_token in content:
                        continue
                    if len(content) < 5 or len(content) > 79:
                        continue
                    content = start_token + content + end_token
                    poems.append(content)
                except ValueError as e:
                    pass
        # sort the poems by length
        poems = sorted(poems, key=lambda l: len(l))
    
        # count how many times each character appears
        all_words = []
        for poem in poems:
            all_words += [word for word in poem]
        # counter now maps each character to its frequency
        counter = collections.Counter(all_words)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        words, _ = zip(*count_pairs)
    
        # keep the most common characters (here the slice keeps all of them) and append a space
        words = words[:len(words)] + (' ',)
        # map each character to a numeric ID
        word_int_map = dict(zip(words, range(len(words))))
        poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
    
        return poems_vector, word_int_map, words
    

    Next, let's take a look at the dataset.

    (screenshot of the dataset omitted)
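
    The screenshot itself is not reproduced here. Judging from the parsing code (each line is split on ':'), the corpus is a plain-text file with one poem per line in the form title:content. A minimal sketch that writes such a file (the file name poems.txt and the two sample poems are only for illustration, not taken from the actual dataset):

        # Hypothetical sample in the assumed "title:content" one-poem-per-line format
        sample = (
            "静夜思:床前明月光，疑是地上霜。举头望明月，低头思故乡。\n"
            "春晓:春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。\n"
        )
        with open("poems.txt", "w", encoding="utf-8") as f:
            f.write(sample)
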

    Finally, a bit of analysis.

    Define a data preprocessing function:

    def process_poems(file_name):
    

    First, set up a list to hold the processed results:

        poems = []
    

    Open the file for processing: pass in the path, open it in read mode, and, since the poems are in Chinese, use 'utf-8' as the encoding:

        with open(file_name, "r", encoding='utf-8', ) as f:
    

    Read it line by line:

            for line in f.readlines():
    

    Use the colon to split each line into the poem's title and its content:

                    title, content = line.strip().split(':')
    

    If a poem in the training set is problematic (contains special characters, or is too short or too long), it should be discarded:

                    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                            start_token in content or end_token in content:
                        continue
                    if len(content) < 5 or len(content) > 79:
                        continue
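
    For example (made-up strings, for illustration only), content containing a parenthesis or shorter than five characters is dropped, while an ordinary verse passes both checks:

        # Illustrative check of which contents survive the filters above
        for c in ['床前明月光，疑是地上霜。', '残句(缺)', '一二三']:
            has_bad = any(ch in c for ch in '_(（《[' + start_token + end_token)
            print(c, 'skipped' if has_bad or not 5 <= len(c) <= 79 else 'kept')
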
    

    Process the poem's content by adding the start and end tokens; only then is it appended to the results list:

                    content = start_token + content + end_token
                    poems.append(content)
    

    Sort the resulting list of poems by length:

        poems = sorted(poems, key=lambda l: len(l))
    

    Count how many times each character appears, with two loops: the outer loop goes over every poem, the inner over every character in the poem:

        all_words = []
        for poem in poems:
            all_words += [word for word in poem]
    

    Compute the character frequencies:

        counter = collections.Counter(all_words)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        words, _ = zip(*count_pairs)
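
    To see what these three lines produce, here is a toy example (the input string is made up; the real input is all_words):

        toy = collections.Counter('GAABBBCE')            # counts: 'B': 3, 'A': 2, 'G'/'C'/'E': 1 each
        pairs = sorted(toy.items(), key=lambda x: -x[1])  # most frequent first: [('B', 3), ('A', 2), ...]
        chars, counts = zip(*pairs)                       # chars = ('B', 'A', ...) ordered by frequency
        print(chars)
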
    

    Keep the most common characters (the slice here keeps all of them) and append a space character:

        words = words[:len(words)] + (' ',)
    

    Map each character to a numeric ID:

        word_int_map = dict(zip(words, range(len(words))))
        poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
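
    On a toy vocabulary this looks as follows (illustrative values only; the real code builds the map from words):

        toy_words = ('G', '月', 'E', ' ')
        toy_map = dict(zip(toy_words, range(len(toy_words))))        # {'G': 0, '月': 1, 'E': 2, ' ': 3}
        print([toy_map.get(ch, len(toy_words)) for ch in 'G月月E'])   # [0, 1, 1, 2]
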
    

    Return the values we need:

        return poems_vector, word_int_map, words
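
    Putting it all together, a minimal usage sketch (the file name poems.txt is an assumption, not part of the original code):

        poems_vector, word_int_map, vocabularies = process_poems('poems.txt')
        print('number of poems:', len(poems_vector))
        print('vocabulary size:', len(vocabularies))
        print('first poem as IDs:', poems_vector[0])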
    
  • Original article: https://www.cnblogs.com/AlexKing007/p/12338188.html