
    *Deep Learning with Python* notes --- 6.1 One-hot encoding

    I. Summary

    One-sentence summary:

    Use the Tokenizer's texts_to_matrix method: one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
    The "one-hot" result here is a binary presence vector over the word index: [0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

    [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
    [0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

    1. What is the one-hot hashing trick?

    [When the vocabulary is too large to handle explicitly]: A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the number of unique tokens in your vocabulary is too large to handle directly.
    [No explicit index per word, no dictionary of indices]: Instead of explicitly assigning an index to each word and keeping a dictionary of these indices, this method hashes each word into a vector of fixed size, typically via a very lightweight hashing function.
    [Saves memory and allows online encoding]: The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors on the fly, before all of the data has been read).
    [Hash collisions]: The one drawback is that it is susceptible to hash collisions: two different words may end up with the same hash value, and any machine learning model looking at these hash values cannot tell the words apart. The likelihood of collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed (see the short sketch below).
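    A minimal sketch of the collision risk (the word list and the deliberately tiny 16-dimensional hashing space below are made up for illustration; Python's built-in hash() is salted per process, so the exact buckets vary between runs):

    # Hash a handful of words into a very small hashing space to make collisions visible.
    words = ["cat", "dog", "mat", "homework", "sat", "ate", "on", "my"]
    dim = 16  # intentionally far too small

    buckets = {}
    for w in words:
        idx = abs(hash(w)) % dim      # the same hashing rule used in section 4 below
        buckets.setdefault(idx, []).append(w)

    # Any bucket holding more than one word is a collision: those words become indistinguishable.
    print({i: ws for i, ws in buckets.items() if len(ws) > 1})

    With a 1000-dimensional hashing space, as in section 4 below, such collisions are far less likely for a vocabulary of this size.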

    II. 6.1 One-hot encoding


    1. Word-level one-hot encoding (toy example):

    In [1]:
    import numpy as np
    
    # This is our initial data; one entry per "sample"
    # (in this toy example, a "sample" is just a sentence, but
    # it could be an entire document).
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    
    # First, build an index of all tokens in the data.
    token_index = {}
    for sample in samples:
        # We simply tokenize the samples via the `split` method.
        # in real life, we would also strip punctuation and special characters
        # from the samples.
        for word in sample.split():
            if word not in token_index:
                # Assign a unique index to each unique word
                token_index[word] = len(token_index) + 1
                # Note that we don't attribute index 0 to anything.
    
    # Next, we vectorize our samples.
    # We will only consider the first `max_length` words in each sample.
    max_length = 10
    
    # This is where we store our results:
    results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
    for i, sample in enumerate(samples):
        for j, word in list(enumerate(sample.split()))[:max_length]:
            index = token_index.get(word)
            results[i, j, index] = 1.
    
    In [2]:
    print(results)
    
    [[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
    
     [[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
      [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
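    As a quick sanity check, each one-hot row can be mapped back to its word by inverting token_index (a minimal sketch reusing the variables from the cell above; reverse_index is just a name introduced here):

    # Invert the word -> index mapping so each one-hot position can be named again.
    reverse_index = {index: word for word, index in token_index.items()}

    # Decode the first sample: skip the all-zero padding rows, take the argmax of the rest.
    decoded = [reverse_index[int(row.argmax())] for row in results[0] if row.any()]
    print(decoded)  # ['The', 'cat', 'sat', 'on', 'the', 'mat.']

    Note that the punctuation stays attached ('mat.') because the cell above tokenizes with a plain split(), exactly as its comment warns.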
    

    2. Character-level one-hot encoding (toy example)

    In [4]:
    import string
    
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    characters = string.printable  # All printable ASCII characters.
    token_index = dict(zip(characters, range(1, len(characters) + 1)))
    
    max_length = 50
    results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
    for i, sample in enumerate(samples):
        for j, character in enumerate(sample[:max_length]):
            index = token_index.get(character)
            results[i, j, index] = 1.
    
    In [6]:
    print(results.shape)
    print(results)
    # 101 because: the 100 printable ASCII characters plus the reserved index 0
    # 50 because: the specified max_length = 50
    # 2 because: there are 2 samples (sentences)
    
    (2, 50, 101)
    [[[0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      ...
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]]
    
     [[0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      ...
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]]]
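    The same check works here, and because every character (spaces and the final period included) has its own index, joining the decoded characters reproduces the sentence exactly (a small sketch reusing token_index and results from the cell above):

    # Invert the character -> index mapping and rebuild the first sample from its one-hot rows.
    reverse_index = {index: char for char, index in token_index.items()}
    decoded = ''.join(reverse_index[int(row.argmax())] for row in results[0] if row.any())
    print(decoded)  # The cat sat on the mat.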
    

    3. Using Keras for word-level one-hot encoding:

    In [12]:
    from tensorflow.keras.preprocessing.text import Tokenizer
    
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    
    # We create a tokenizer, configured to only take
    # into account the top-1000 most common words
    tokenizer = Tokenizer(num_words=1000)
    # This builds the word index
    tokenizer.fit_on_texts(samples)
    
    # This turns strings into lists of integer indices.
    sequences = tokenizer.texts_to_sequences(samples)
    print(sequences)
    
    # You could also directly get the one-hot binary representations.
    # Note that other vectorization modes than one-hot encoding are supported!
    one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
    print(one_hot_results[0][0:20])
    print(one_hot_results[1][0:20])
    
    # This is how you can recover the word index that was computed
    word_index = tokenizer.word_index
    print('Found %s unique tokens.' % len(word_index))
    
    [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
    [0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    Found 9 unique tokens.
    
    In [8]:
    print(one_hot_results)
    
    [[0. 1. 1. ... 0. 0. 0.]
     [0. 1. 0. ... 0. 0. 0.]]
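    Note that the Keras Tokenizer lowercases the text and strips punctuation by default, which is why it reports 9 unique tokens here (both "The" and "the" map to index 1), versus 10 distinct strings in the manual split()-based example above. To see which word each integer stands for, you can ask the tokenizer directly (a small sketch reusing the tokenizer and sequences from the cell above):

    # word_index maps each (lowercased) word to its integer index;
    # sequences_to_texts maps integer sequences back to space-joined text.
    print(tokenizer.word_index)
    # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}
    print(tokenizer.sequences_to_texts(sequences))
    # ['the cat sat on the mat', 'the dog ate my homework']

    Besides mode='binary', texts_to_matrix also supports 'count', 'freq' and 'tfidf' vectorization modes.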
    

    4. Word-level one-hot encoding with the hashing trick (toy example):

    In [13]:
    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    
    # We will store our words as vectors of size 1000.
    # Note that if you have close to 1000 words (or more)
    # you will start seeing many hash collisions, which
    # will decrease the accuracy of this encoding method.
    dimensionality = 1000
    max_length = 10
    
    results = np.zeros((len(samples), max_length, dimensionality))
    for i, sample in enumerate(samples):
        for j, word in list(enumerate(sample.split()))[:max_length]:
            # Hash the word into a "random" integer index between 0 and
            # dimensionality - 1. Python's built-in hash() is salted per process
            # for strings, so these indices change from run to run unless
            # PYTHONHASHSEED is fixed.
            index = abs(hash(word)) % dimensionality
            results[i, j, index] = 1.
    
    In [16]:
    print(results.shape)
    print(results[0][0])
    print(results)
    
    (2, 10, 1000)
    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
     ...
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
     ...
     0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
    [[[0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      ...
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]]
    
     [[0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      ...
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]
      [0. 0. 0. ... 0. 0. 0.]]]
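    Keras also ships small text utilities for the hashing trick (a sketch, not code from the book: hashing_trick returns hashed integer word indices rather than the full binary tensor built above, and hash_function='md5' makes the indices reproducible across runs, unlike Python's salted built-in hash()):

    from tensorflow.keras.preprocessing.text import hashing_trick

    samples = ['The cat sat on the mat.', 'The dog ate my homework.']
    dimensionality = 1000

    # One list of hashed word indices per sample, each index falling in [1, dimensionality).
    hashed = [hashing_trick(s, n=dimensionality, hash_function='md5') for s in samples]
    print(hashed)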
    

  • Original article: https://www.cnblogs.com/Renyi-Fan/p/13806418.html