  • NLP (38): tf-idf with CountVectorizer and TfidfTransformer, saving and testing

    When an NLP task needs tf-idf, sklearn provides two classes for it: CountVectorizer and TfidfTransformer. This post explains how to use both.

    1. Training and testing

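    Both CountVectorizer and TfidfTransformer use the fit_transform method on the training data and the transform method on the test set. fit means learning from the data: CountVectorizer learns the vocabulary and TfidfTransformer learns the idf weights. The test set should only be transformed with what was learned during training. If fit_transform is also called on the test set, the vocabulary and idf weights are re-learned from the test data, the feature columns no longer match those of the training set, and the results are wrong.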

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Variables: content_train is the training set, content_test is the test set
    vectorizer = CountVectorizer()
    tfidftransformer = TfidfTransformer()

    # Training: use fit_transform
    count_train = vectorizer.fit_transform(content_train)
    tfidf = tfidftransformer.fit_transform(count_train)

    # Testing: use transform only
    count_test = vectorizer.transform(content_test)
    test_tfidf = tfidftransformer.transform(count_test)

    The tf-idf weights of the test set, as a dense array:
    test_weight = test_tfidf.toarray()
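
    To make the difference between transform and fit_transform on the test set concrete, here is a minimal sketch. The two toy corpora are hypothetical and not from the original post; the variable names follow the snippet above. It compares the number of feature columns produced by the correct approach with the number produced by refitting on the test set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data, used only for illustration.
content_train = ["the cat sat", "the dog barked", "the cat ran"]
content_test = ["a bird sang", "the cat sang"]

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()

# Training: learn the vocabulary and the idf weights.
count_train = vectorizer.fit_transform(content_train)
tfidf_train = tfidftransformer.fit_transform(count_train)

# Correct: reuse the fitted objects, so the test matrix has the training columns.
count_test = vectorizer.transform(content_test)
tfidf_test = tfidftransformer.transform(count_test)

# Incorrect: a vectorizer refitted on the test set learns a different vocabulary.
wrong_test = CountVectorizer().fit_transform(content_test)

print("train columns:", tfidf_train.shape[1])
print("test columns (transform):", tfidf_test.shape[1])
print("test columns (refit):", wrong_test.shape[1])
```

    With the refitted vectorizer the column count and column meaning both differ from training, so a model trained on tfidf_train could not use that matrix.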

    2. Saving the tf-idf vocabulary

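    We usually need to save the tf-idf vocabulary so that tf-idf can later be computed for the test set. There are two common ways to save fitted sklearn objects, pickle and joblib; pickle is used here.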

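    import pickle
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # segmentWord is a helper that the post does not show (presumably word segmentation)
    train_content = segmentWord(X_train)
    test_content = segmentWord(X_test)
    # decode_error="replace": undecodable bytes in the input are replaced instead of raising an error
    vectorizer = CountVectorizer(decode_error="replace")
    tfidftransformer = TfidfTransformer()
    # Training must use vectorizer.fit_transform and tfidftransformer.fit_transform;
    # prediction must use vectorizer.transform and tfidftransformer.transform
    vec_train = vectorizer.fit_transform(train_content)
    tfidf = tfidftransformer.fit_transform(vec_train)

    # Save the fitted vectorizer's vocabulary and the fitted tfidftransformer for use at prediction time
    # (the models/ directory must already exist)
    feature_path = 'models/feature.pkl'
    with open(feature_path, 'wb') as fw:
        pickle.dump(vectorizer.vocabulary_, fw)

    tfidftransformer_path = 'models/tfidftransformer.pkl'
    with open(tfidftransformer_path, 'wb') as fw:
        pickle.dump(tfidftransformer, fw)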
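
    The post uses pickle, but joblib is mentioned as the other option. As a rough sketch of that alternative, the example below saves with joblib.dump and reloads with joblib.load. Note that it is a variation on the post's approach: it stores the whole fitted vectorizer rather than only its vocabulary_. The toy corpus, the temporary directory, and the .joblib file names are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data, used only for illustration.
train_content = ["the cat sat", "the dog barked", "the cat ran"]

vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidftransformer.fit(vectorizer.fit_transform(train_content))

# A temporary directory stands in for the post's models/ directory.
with tempfile.TemporaryDirectory() as model_dir:
    vec_path = os.path.join(model_dir, "vectorizer.joblib")
    tfidf_path = os.path.join(model_dir, "tfidftransformer.joblib")

    # Save both fitted objects, then load them back.
    joblib.dump(vectorizer, vec_path)
    joblib.dump(tfidftransformer, tfidf_path)
    loaded_vec = joblib.load(vec_path)
    loaded_tfidf = joblib.load(tfidf_path)

# The reloaded objects carry the same learned state as the originals.
same_vocab = loaded_vec.vocabulary_ == vectorizer.vocabulary_
same_idf = bool((loaded_tfidf.idf_ == tfidftransformer.idf_).all())
print(same_vocab, same_idf)
```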

    Note: both the vectorizer (here, its vocabulary_) and the tfidftransformer must be saved, and only after fit_transform has been called, because only then have they been fitted on the training set.
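
    The requirement to fit before saving can be checked directly. In this small sketch (hypothetical toy data, not from the original post), an unfitted CountVectorizer and an unfitted TfidfTransformer both refuse to transform, raising sklearn's NotFittedError. Pickling such objects would therefore store nothing useful for prediction.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data, used only for illustration.
train_content = ["the cat sat", "the dog barked"]
new_docs = ["the cat barked"]

# An unfitted vectorizer has no vocabulary, so transform fails.
try:
    CountVectorizer().transform(new_docs)
    vectorizer_failed = False
except NotFittedError:
    vectorizer_failed = True

# An unfitted transformer has no idf weights, so transform fails as well.
counts = CountVectorizer().fit(train_content).transform(new_docs)
try:
    TfidfTransformer().transform(counts)
    transformer_failed = False
except NotFittedError:
    transformer_failed = True

print(vectorizer_failed, transformer_failed)
```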

    3. Loading tf-idf and testing new data

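    import pickle
    from sklearn.feature_extraction.text import CountVectorizer

    # Load the saved vocabulary and rebuild the vectorizer from it
    feature_path = 'models/feature.pkl'
    with open(feature_path, 'rb') as fr:
        loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(fr))
    # Load the fitted TfidfTransformer
    tfidftransformer_path = 'models/tfidftransformer.pkl'
    with open(tfidftransformer_path, 'rb') as fr:
        tfidftransformer = pickle.load(fr)
    # Use transform at test time; test_content is the test data, a list of documents
    test_tfidf = tfidftransformer.transform(loaded_vec.transform(test_content))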
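
    As a sanity check on the save and load steps, the following sketch runs the whole round trip and compares the tf-idf produced by the reloaded objects with the tf-idf produced by the originals. The toy corpora and the temporary directory (used in place of models/) are assumptions for illustration; the file names and the vocabulary= pattern follow the post.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data, used only for illustration.
train_content = ["the cat sat", "the dog barked", "the cat ran"]
test_content = ["the cat barked"]

vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidftransformer.fit(vectorizer.fit_transform(train_content))
expected = tfidftransformer.transform(vectorizer.transform(test_content)).toarray()

with tempfile.TemporaryDirectory() as model_dir:
    feature_path = os.path.join(model_dir, "feature.pkl")
    tfidftransformer_path = os.path.join(model_dir, "tfidftransformer.pkl")

    # Save the learned vocabulary and the fitted transformer.
    with open(feature_path, "wb") as fw:
        pickle.dump(vectorizer.vocabulary_, fw)
    with open(tfidftransformer_path, "wb") as fw:
        pickle.dump(tfidftransformer, fw)

    # Load them back, rebuilding the vectorizer from the saved vocabulary.
    with open(feature_path, "rb") as fr:
        loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(fr))
    with open(tfidftransformer_path, "rb") as fr:
        loaded_tfidf = pickle.load(fr)

restored = loaded_tfidf.transform(loaded_vec.transform(test_content)).toarray()
print(np.allclose(expected, restored))
```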

    Reposted from: https://my.oschina.net/u/2293326/blog/1838918

  • Original article: https://www.cnblogs.com/zhangxianrong/p/15538868.html