  • Handling text attributes with one-hot encoding

    Learning source: click here

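    When a dataset contains text attributes, machine learning algorithms cannot easily work with them directly, so the text values need to be converted to numbers. If the values have a natural order, for example (cold, warm, hot), integer encoding can be used directly. If the values have no order, for example (red, green, blue), one-hot encoding can be used instead.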
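As a minimal sketch of the ordered case, the mapping below assigns integers that follow the order cold < warm < hot. The names `ORDER`, `level_of`, and `readings` are hypothetical and chosen only for this illustration.

```python
# Hypothetical ordering for illustration: cold < warm < hot.
ORDER = ["cold", "warm", "hot"]
level_of = {name: rank for rank, name in enumerate(ORDER)}

readings = ["warm", "cold", "hot", "warm"]
encoded = [level_of[r] for r in readings]
print(encoded)  # [1, 0, 2, 1]
```

Because the integers preserve the order, comparisons such as `level_of["cold"] < level_of["hot"]` stay meaningful, which would not hold for unordered values like colors.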

    One-hot encoding: the position that corresponds to the encoded value is set to 1, and all other positions are set to 0.
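The definition can be sketched with a small helper for the unordered (red, green, blue) example. The function `one_hot` is a hypothetical name introduced here, not part of any library.

```python
def one_hot(value, categories):
    """Return a list with a single 1 at the position of `value` in `categories`."""
    vector = [0] * len(categories)
    vector[categories.index(value)] = 1
    return vector


colors = ["red", "green", "blue"]  # no natural order among these values
for color in colors:
    print(color, one_hot(color, colors))
```

Each vector has exactly one 1, so no color appears "larger" than another, which is the point of using this encoding for unordered values.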

    1. Manual one-hot encoding

    from numpy import argmax
    import numpy as np

    data = 'hello world'
    # The 26 lowercase letters plus the space character
    alphabet = 'abcdefghijklmnopqrstuvwxyz '
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))

    # Integer encoding
    integer_encoded = [char_to_int[char] for char in data]
    print(integer_encoded)

    # One-hot encoding
    OneHot_Encoder = list()
    for i in integer_encoded:
        letter = [0 for _ in range(len(alphabet))]
        letter[i] = 1
        OneHot_Encoder.append(letter)
    print(np.array(OneHot_Encoder))

    # Recover the data from the one-hot encoding (argmax returns the index of the maximum value)
    inverted = int_to_char[argmax(OneHot_Encoder[0])]
    print(inverted)
    # output:
    # [7, 4, 11, 11, 14, 26, 22, 14, 17, 11, 3]
    # (an 11 x 27 one-hot matrix, one row per character)
    # h
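The loop above can also be written in vectorized form. The sketch below is one possible alternative, not the original author's method: it indexes an identity matrix built with `np.eye` to encode all characters at once, then decodes the whole string with `argmax` along each row.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "
char_to_int = {c: i for i, c in enumerate(alphabet)}

data = "hello world"
indices = np.array([char_to_int[c] for c in data])

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing with the whole index array encodes every character at once.
one_hot = np.eye(len(alphabet), dtype=int)[indices]
print(one_hot.shape)  # (11, 27)

# argmax along each row recovers the integer codes, and from them the text.
decoded = "".join(alphabet[i] for i in one_hot.argmax(axis=1))
print(decoded)  # hello world
```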

    2. One-hot encoding with Scikit-Learn

    from numpy import argmax
    from numpy import array
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    
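    # Integer encoding (LabelEncoder sorts the labels: cold=0, hot=1, warm=2)
    data = array(['cold', 'cold', 'warm', 'hot', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold'])
    label_encoder = LabelEncoder()
    label_encoded = label_encoder.fit_transform(data)
    print(label_encoded)
    # One-hot encoding (the encoder expects a 2-D input, hence the reshape)
    onehot_encoder = OneHotEncoder(categories='auto')
    onehot_encoded = onehot_encoder.fit_transform(label_encoded.reshape(-1, 1))
    onehot = onehot_encoded.toarray()
    print(onehot)
    # Recover the original label of the first sample
    state = label_encoder.inverse_transform([argmax(onehot[0, :])])
    print(state)

    # output:
    # [0 0 2 1 2 0 1 1 2 0]
    # (a 10 x 3 one-hot matrix; the columns correspond to cold, hot, warm)
    # ['cold']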
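As a hedged alternative to the two-step approach above, `OneHotEncoder` in reasonably recent scikit-learn versions also accepts string categories directly, so the `LabelEncoder` step can be skipped. The sketch below assumes such a version is installed, and uses the encoder's own `inverse_transform` to recover the strings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array(['cold', 'cold', 'warm', 'hot', 'warm',
                 'cold', 'hot', 'hot', 'warm', 'cold'])

# OneHotEncoder expects a 2-D array of shape (n_samples, n_features).
encoder = OneHotEncoder(categories='auto')
onehot = encoder.fit_transform(data.reshape(-1, 1)).toarray()

# The learned categories are stored in sorted order: cold, hot, warm.
print(encoder.categories_[0])

# inverse_transform maps one-hot rows back to the original strings.
restored = encoder.inverse_transform(onehot)
print(restored[:3].ravel())
```

Note that the one-hot columns follow the alphabetical order of the categories, not the cold < warm < hot order; that is harmless for one-hot vectors but worth keeping in mind when reading the matrix.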
    
    
  • Original article: https://www.cnblogs.com/pineapple-chicken/p/12402273.html