  • Poem-Writing Robot Project: Data Preprocessing

    First, the complete code:

    import collections
    
    start_token = 'G'
    end_token = 'E'
    
    def process_poems(file_name):
        # Collected poems
        poems = []
        with open(file_name, "r", encoding='utf-8') as f:
            for line in f.readlines():
                try:
                    title, content = line.strip().split(':')
                    content = content.replace(' ', '')
                    # Skip poems containing annotation marks or the reserved start/end tokens
                    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                            start_token in content or end_token in content:
                        continue
                    if len(content) < 5 or len(content) > 79:
                        continue
                    content = start_token + content + end_token
                    poems.append(content)
                except ValueError:
                    # The line did not split into exactly a title and a content part; skip it
                    pass
        # Sort the poems by their length in characters
        poems = sorted(poems, key=lambda poem: len(poem))
    
        # Collect every character so that occurrences can be counted
        all_words = []
        for poem in poems:
            all_words += [word for word in poem]
        # counter maps each character to its frequency
        counter = collections.Counter(all_words)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        words, _ = zip(*count_pairs)
    
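        # Keep the most frequent characters (this slice keeps all of them), then append a space
        words = words[:len(words)] + (' ',)
        # Map each character to an integer ID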
        word_int_map = dict(zip(words, range(len(words))))
        poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
    
        return poems_vector, word_int_map, words
    

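    Next, take a look at the dataset. As the parsing code implies, each line of the file holds one poem in the form title:content.

    [Image: screenshot of the dataset file]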

    Finally, a short walkthrough.

    Define the data preprocessing function:

    def process_poems(file_name):
    

    First, initialize the list that will hold the processed poems:

        poems = []
    

    Open the file: pass in the path, open it in read mode, and, because the poems are in Chinese, set the encoding to 'utf-8':

        with open(file_name, "r", encoding='utf-8') as f:
    

    Read it line by line:

            for line in f.readlines():
    

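    Split each line at the colon into the poem's title and content. A line that does not yield exactly two parts raises a ValueError on unpacking, which the surrounding try/except catches, so the line is skipped: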

                    title, content = line.strip().split(':')
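
    The behaviour of this split on well-formed and malformed lines can be sketched as follows. The helper name split_title_content and the sample lines are made up for illustration; they are not part of the original code or dataset.

```python
def split_title_content(line):
    """Return (title, content), or None if the line is not 'title:content'."""
    try:
        title, content = line.strip().split(':')
    except ValueError:
        # Unpacking fails when the split yields anything other than two parts.
        return None
    return title, content


# Hypothetical sample lines
print(split_title_content("静夜思:床前明月光，疑是地上霜。\n"))  # well-formed
print(split_title_content("a line without a colon"))            # one part
print(split_title_content("title:part one:part two"))           # three parts
```

    Only the first call returns a pair; the other two return None, which corresponds to the lines that the original loop silently drops.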
    

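    Discard poems that look problematic: those containing annotation marks such as '_', brackets, '《' or '[', those that already contain the start or end token, and those shorter than 5 or longer than 79 characters:

                    if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content or \
                            start_token in content or end_token in content: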
                        continue
                    if len(content) < 5 or len(content) > 79:
                        continue
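
    The same filtering rules can be written as a small predicate, which makes them easier to test in isolation. This is only a sketch: the names is_usable and BAD_CHARS are hypothetical, and including the full-width parenthesis in BAD_CHARS is an assumption about which characters the filter is meant to catch.

```python
START_TOKEN = 'G'
END_TOKEN = 'E'
# Characters treated as signs of annotations or damaged text.
# The full-width '（' is an assumption, not confirmed by the source.
BAD_CHARS = ('_', '(', '（', '《', '[')


def is_usable(content, min_len=5, max_len=79):
    """Return True if the poem content passes the filtering rules."""
    if any(ch in content for ch in BAD_CHARS):
        return False
    # The reserved tokens must not already occur inside a poem.
    if START_TOKEN in content or END_TOKEN in content:
        return False
    return min_len <= len(content) <= max_len


print(is_usable("床前明月光，疑是地上霜。"))
print(is_usable("太短"))
```

    A poem is kept only when the predicate returns True, so the first sample would be kept and the second (too short) dropped.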
    

    Wrap the content in the start and end tokens, then append it to the result list:

                    content = start_token + content + end_token
                    poems.append(content)
    

    Sort the resulting list by poem length:

        poems = sorted(poems, key=lambda poem: len(poem))
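
    One detail worth checking when sorting by length: the key function has to measure its own argument. If it refers to some outer variable instead, every element gets the same key and, because Python's sort is stable, the order does not change. A minimal sketch with made-up sample data:

```python
poems = [
    "G床前明月光，疑是地上霜。E",
    "G春眠不觉晓E",
    "G白日依山尽，黄河入海流。欲穷千里目，更上一层楼。E",
]

# The key uses the lambda's own parameter, so poems are ordered by length.
by_length = sorted(poems, key=lambda poem: len(poem))
print([len(p) for p in by_length])

# Pitfall: the key ignores its parameter and measures an unrelated variable.
line = "left over from an earlier loop"
unchanged = sorted(poems, key=lambda l: len(line))
print(unchanged == poems)
```

    The first print shows ascending lengths; the second prints True because a constant key leaves the list in its original order.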
    

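    Collect every character of every poem into one list so that occurrences can be counted. Two nested loops do this: the outer one iterates over the poems, the inner one over the characters of each poem: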

        all_words = []
        for poem in poems:
            all_words += [word for word in poem]
    

    Count the frequency of each character and sort the characters from most to least frequent:

        counter = collections.Counter(all_words)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        words, _ = zip(*count_pairs)
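
    To see what these three lines produce, here is a small sketch on two toy strings (made-up data, Latin letters standing in for Chinese characters). It also shows that Counter.most_common() gives the same ordering as the explicit sort.

```python
import collections

poems = ["GaabE", "GaE"]  # toy stand-ins for real poems

all_chars = [ch for poem in poems for ch in poem]
counter = collections.Counter(all_chars)

# Sort (character, count) pairs by descending count.
count_pairs = sorted(counter.items(), key=lambda pair: -pair[1])

# zip(*pairs) transposes the list of pairs into two tuples.
words, counts = zip(*count_pairs)
print(words)
print(counts)
print(counter.most_common() == count_pairs)
```

    Characters with equal counts keep the order in which they were first seen, since the sort is stable.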
    

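    Select the most frequent characters. As written, the slice words[:len(words)] keeps every character; a space is then appended as one extra vocabulary entry: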

        words = words[:len(words)] + (' ',)
    

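    Map each character to an integer ID (its rank in the frequency ordering) and convert every poem into a list of IDs. A character missing from the map would get the ID len(words):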

        word_int_map = dict(zip(words, range(len(words))))
        poems_vector = [list(map(lambda word: word_int_map.get(word, len(words)), poem)) for poem in poems]
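
    The mapping step can be tried out on a toy vocabulary. In the sketch below, the vocabulary and the helpers encode and decode are hypothetical; the original code only builds the forward mapping.

```python
# Toy vocabulary in frequency order, with the space appended as the last entry.
words = ('a', 'G', 'E', 'b') + (' ',)
word_int_map = dict(zip(words, range(len(words))))
unknown_id = len(words)  # ID given to characters not in the vocabulary


def encode(poem):
    """Turn a string into a list of integer IDs."""
    return [word_int_map.get(ch, unknown_id) for ch in poem]


def decode(ids):
    """Turn IDs back into a string, dropping unknown IDs."""
    return ''.join(words[i] for i in ids if i < len(words))


print(encode("GabE"))
print(encode("GzE"))
print(decode(encode("GabE")))
```

    Because the vocabulary here is built from the poems themselves, the fallback ID is never used on the training data; it only matters for text that contains unseen characters.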
    

    Return the encoded poems, the character-to-ID map, and the vocabulary:

        return poems_vector, word_int_map, words
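
    To check the three return values end to end, the function can be run on a small temporary file. The block below is a condensed variant of the function written for this check, not the post's exact code: it uses any(), Counter.most_common() and a key that measures each poem, and the sample lines are made up.

```python
import collections
import os
import tempfile

START, END = 'G', 'E'


def process_poems(file_name):
    poems = []
    with open(file_name, "r", encoding="utf-8") as f:
        for line in f:
            try:
                _title, content = line.strip().split(':')
            except ValueError:
                continue  # not exactly 'title:content'
            content = content.replace(' ', '')
            if any(ch in content for ch in ('_', '(', '（', '《', '[', START, END)):
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            poems.append(START + content + END)
    poems.sort(key=len)
    counter = collections.Counter(ch for poem in poems for ch in poem)
    words = tuple(ch for ch, _ in counter.most_common()) + (' ',)
    word_int_map = {ch: i for i, ch in enumerate(words)}
    poems_vector = [[word_int_map.get(ch, len(words)) for ch in poem] for poem in poems]
    return poems_vector, word_int_map, words


# Hypothetical input: two valid poems, one line without a colon, one too short.
sample = (
    "春晓:春眠不觉晓，处处闻啼鸟。夜来风雨声，花落知多少。\n"
    "静夜思:床前明月光，疑是地上霜。\n"
    "broken line\n"
    "短:太短\n"
)
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name
try:
    poems_vector, word_int_map, words = process_poems(path)
finally:
    os.remove(path)

print(len(poems_vector))               # number of poems that survived filtering
print([len(v) for v in poems_vector])  # encoded lengths, shortest first
print(word_int_map[' '])               # the space is the last vocabulary entry
```

    Each inner list in poems_vector starts with the ID of the start token and ends with the ID of the end token, which is what a sequence model downstream would rely on.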
    
  • Original post: https://www.cnblogs.com/AlexKing007/p/12338187.html