  • Neural Network SMS Text Classifier

    https://www.freecodecamp.org/learn/machine-learning-with-python/machine-learning-with-python-projects/neural-network-sms-text-classifier

    In this challenge, you need to create a machine learning model that will classify SMS messages as either "ham" or "spam". A "ham" message is a normal message sent by a friend. A "spam" message is an advertisement or a message sent by a company.

    You can access the full project instructions and starter code on Google Colaboratory.

    Reference

    https://www.kaggle.com/akhatova/sms-spam-classification-by-keras#3.-Keras-Model

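    There are two approaches to this problem:

    (1) Word-vector (TF-IDF) features + a regression model

    (2) Sequence input + word embedding + a Keras model / CNN model

    In testing, the word-vector features proved better suited to spam detection, so the final solution uses word vectors + a Keras DNN model.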
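As a rough illustration of approach (1), the sketch below chains scikit-learn's TfidfVectorizer with a logistic regression classifier. The choice of LogisticRegression as the "regression model" is an assumption (the post does not name one), and the six training messages are copied from the dataset excerpt further down purely as toy data, far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set taken from the dataset excerpt (illustration only).
messages = [
    "Ok lar... Joking wif u oni...",
    "dun say so early hor... U c already then say...",
    "Siva is in hostel aha:-.",
    "FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop",
    "Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B",
    "URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU",
]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

# TF-IDF turns each message into a fixed-length weighted word vector;
# the classifier (an assumed stand-in) then learns a linear decision rule.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(messages, labels)

# Hypothetical unseen message, not from the dataset.
predictions = model.predict(["Call now to claim your bonus prize"])
print(predictions)
```

Swapping the final pipeline step for a small dense network gives the "word vectors + DNN" variant the post settles on; the feature extraction step stays the same.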

    Data

    https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

    The table below lists the provided dataset in different file formats, the amount of samples in each class and the total number of samples.

    Application   File format   # Spam   # Ham   Total   Link
    General       Plain text    747      4,827   5,574   Link 1
    Weka          ARFF          747      4,827   5,574   Link 2

    The collection consists of a single file, where each line has the correct class (ham or spam) followed by the raw message.


    ham   What you doing?how are you?
    ham   Ok lar... Joking wif u oni...
    ham   dun say so early hor... U c already then say...
    ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
    ham   Siva is in hostel aha:-.
    ham   Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
    spam  FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
    spam  Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
    spam  URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
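A minimal sketch of reading this layout: each line is split once, at the first run of whitespace, into a label and the raw message. The helper name `parse_line` is hypothetical, and splitting on any whitespace (rather than strictly on a tab) is an assumption chosen so the code works on the excerpt as displayed above.

```python
from collections import Counter


def parse_line(line):
    """Split one dataset line into (label, message)."""
    # maxsplit=1 keeps any whitespace inside the message itself intact.
    label, message = line.rstrip("\n").split(None, 1)
    if label not in ("ham", "spam"):
        raise ValueError(f"unexpected label: {label!r}")
    return label, message


lines = [
    "ham   What you doing?how are you?\n",
    "ham\tOk lar... Joking wif u oni...\n",
    "spam  Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B\n",
]

records = [parse_line(line) for line in lines]
label_counts = Counter(label for label, _ in records)
print(label_counts)
```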

    Word-vector feature extraction: TfidfVectorizer

    https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This document is the second document.',
    ...     'And this is the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> vectorizer = TfidfVectorizer()
    >>> X = vectorizer.fit_transform(corpus)
    >>> print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in newer scikit-learn
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    >>> print(X.shape)
    (4, 9)
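To make the weighting concrete, the sketch below recomputes the TF-IDF matrix by hand for the same corpus and compares it with TfidfVectorizer. It assumes the vectorizer's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); with other settings the formula differs.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Raw term counts, one row per document and one column per vocabulary term.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the scikit-learn default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale every row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in every document (such as "this" and "is" here) receive the lowest IDF, which is why distinctive spam vocabulary tends to stand out in these features.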

    Alternatively, use TensorFlow's preprocessing API

    https://www.tensorflow.org/guide/keras/preprocessing_layers#encoding_text_as_a_dense_matrix_of_ngrams_with_tf-idf_weighting
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    # Define some text data to adapt the layer
    data = tf.constant(
        [
            "The Brain is wider than the Sky",
            "For put them side by side",
            "The one the other will contain",
            "With ease and You beside",
        ]
    )
    # Instantiate TextVectorization with "tf_idf" output_mode
    # (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams).
    # Older TF releases spelled the mode "tf-idf" and exposed the layer
    # under layers.experimental.preprocessing.
    text_vectorizer = layers.TextVectorization(output_mode="tf_idf", ngrams=2)
    # Index the bigrams and learn the TF-IDF weights via `adapt()`
    text_vectorizer.adapt(data)

    print(
        "Encoded text:\n",
        text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
        "\n",
    )
    
    # Create a Dense model
    inputs = keras.Input(shape=(1,), dtype="string")
    x = text_vectorizer(inputs)
    outputs = layers.Dense(1)(x)
    model = keras.Model(inputs, outputs)
    
    # Call the model on test data (which includes unknown tokens)
    test_data = tf.constant(["The Brain is deeper than the sea"])
    test_output = model(test_data)
    print("Model output:", test_output)

    Class imbalance

    https://keras.io/examples/structured_data/imbalanced_classification/

    Class weighting

    Compute a weight for each class, giving the class with fewer samples a higher weight.

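    import numpy as np

    # train_targets: integer 0/1 labels of shape (n_samples, 1), as in the Keras example
    counts = np.bincount(train_targets[:, 0])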
    print(
        "Number of positive samples in training data: {} ({:.2f}% of total)".format(
            counts[1], 100 * float(counts[1]) / len(train_targets)
        )
    )
    
    weight_for_0 = 1.0 / counts[0]
    weight_for_1 = 1.0 / counts[1]
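Applied to the SMS data, the same inverse-frequency rule can be sketched as follows. The label array is rebuilt from the whole-dataset counts in the table above (4,827 ham, 747 spam) purely for illustration; a real run would count the training split instead, and the 0 = ham / 1 = spam encoding is an assumption.

```python
import numpy as np

# Illustration only: 0 = ham, 1 = spam, sized from the dataset table.
labels = np.concatenate([np.zeros(4827, dtype=int), np.ones(747, dtype=int)])

counts = np.bincount(labels)

# Inverse-frequency weights: the rarer class gets the larger weight.
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
class_weight = {0: weight_for_0, 1: weight_for_1}

# How much more one spam example contributes to the loss than one ham example.
ratio = weight_for_1 / weight_for_0
print(f"each spam example weighs {ratio:.2f}x a ham example")
```

With these weights both classes contribute the same total weight to the loss, since count times weight equals 1 for each class.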

    Pass the class weights to the training call

    metrics = [
        keras.metrics.FalseNegatives(name="fn"),
        keras.metrics.FalsePositives(name="fp"),
        keras.metrics.TrueNegatives(name="tn"),
        keras.metrics.TruePositives(name="tp"),
        keras.metrics.Precision(name="precision"),
        keras.metrics.Recall(name="recall"),
    ]
    
    model.compile(
        optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
    )
    
    callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
    class_weight = {0: weight_for_0, 1: weight_for_1}
    
    model.fit(
        train_features,
        train_targets,
        batch_size=2048,
        epochs=30,
        verbose=2,
        callbacks=callbacks,
        validation_data=(val_features, val_targets),
        class_weight=class_weight,
    )

    Choosing metrics

    https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#train_the_model

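    With imbalanced classes, accuracy alone is not a sufficient metric: a model trained against it may well learn to handle only the majority class and ignore the minority class.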

    METRICS = [
          keras.metrics.TruePositives(name='tp'),
          keras.metrics.FalsePositives(name='fp'),
          keras.metrics.TrueNegatives(name='tn'),
          keras.metrics.FalseNegatives(name='fn'), 
          keras.metrics.BinaryAccuracy(name='accuracy'),
          keras.metrics.Precision(name='precision'),
          keras.metrics.Recall(name='recall'),
          keras.metrics.AUC(name='auc'),
          keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
    ]
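A small worked example of why accuracy misleads here: a baseline that labels every message as ham, evaluated on labels rebuilt from the whole-dataset counts (4,827 ham, 747 spam). The label encoding and the baseline itself are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Illustration only: 0 = ham, 1 = spam, sized from the dataset table.
y_true = np.concatenate([np.zeros(4827, dtype=int), np.ones(747, dtype=int)])

# A useless baseline that never predicts spam.
y_pred = np.zeros_like(y_true)

# Confusion-matrix counts, treating spam as the positive class.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is undefined with no positive predictions; report 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

The baseline scores roughly 87% accuracy while catching no spam at all, which is why recall, precision, and the PR curve are tracked alongside accuracy.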

    Oversampling

    https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversample_the_minority_class

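    For the minority class, resample (with replacement) until it has as many examples as the majority class.

    In my view, this approach only fixes the count. Data quality does not improve and the lack of diversity remains, which ultimately limits how well the model generalizes on the minority class.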

    Using NumPy

    You can balance the dataset manually by choosing the right number of random indices from the positive examples:

    ids = np.arange(len(pos_features))
    choices = np.random.choice(ids, len(neg_features))

    res_pos_features = pos_features[choices]
    res_pos_labels = pos_labels[choices]

    res_pos_features.shape  # output: (181966, 29)
    
    resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
    resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

    order = np.arange(len(resampled_labels))
    np.random.shuffle(order)
    resampled_features = resampled_features[order]
    resampled_labels = resampled_labels[order]

    resampled_features.shape  # output: (363932, 29)
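The snippet above depends on arrays from the TensorFlow tutorial. A self-contained toy version, with made-up random features (20 negatives, 4 positives) standing in for real data, could look like this. It also counts the distinct positive rows after resampling, which illustrates the concern that oversampling balances the classes without adding new examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: 20 majority (negative) rows and 4 minority (positive) rows.
neg_features = rng.normal(size=(20, 3))
pos_features = rng.normal(size=(4, 3))
neg_labels = np.zeros(len(neg_features), dtype=int)
pos_labels = np.ones(len(pos_features), dtype=int)

# Draw positive indices with replacement until both classes are the same size.
choices = rng.choice(len(pos_features), size=len(neg_features), replace=True)
res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]

resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

# Shuffle features and labels with the same permutation.
order = rng.permutation(len(resampled_labels))
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]

# The resampled positives are only copies of the original four rows.
n_distinct_pos = len(np.unique(res_pos_features, axis=0))
print(resampled_features.shape, np.bincount(resampled_labels), n_distinct_pos)
```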
    

    TSV

    https://en.wikipedia.org/wiki/Tab-separated_values

    A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

    https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe

    Use pandas.read_table(filepath). The default separator is tab.
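A minimal sketch of loading label/message pairs this way. The in-memory text stands in for a real file path, and the column names `label` and `message` are assumptions. Passing `header=None` stops pandas from treating the first record as a header, and `quoting=csv.QUOTE_NONE` is a precaution so that stray quote characters inside messages are not interpreted as field delimiters.

```python
import csv
import io

import pandas as pd

# Stand-in for a TSV file on disk: label, tab, raw message.
tsv_text = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tSunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B\n"
)

df = pd.read_table(
    io.StringIO(tsv_text),      # a file path works here as well
    header=None,                # the data has no header row
    names=["label", "message"], # assumed column names
    quoting=csv.QUOTE_NONE,     # treat quote characters as plain text
)

print(df["label"].value_counts())
```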

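    Source: http://www.cnblogs.com/lightsong/ The copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be given in a prominent place on the article page.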
  • Original article: https://www.cnblogs.com/lightsong/p/14759251.html