  • Column Transformer with Heterogeneous Data Sources -- of sklearn

    Column Transformer with Heterogeneous Data Sources

    https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py

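        Datasets often contain different components that need different feature extraction and processing pipelines. For example:

        1. The dataset contains heterogeneous data, such as text and images.

        2. The dataset is stored in pandas, and different columns need different processing pipelines.

        ColumnTransformer can handle the preprocessing of these different feature types.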

    Datasets can often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

    1. your dataset consists of heterogeneous data types (e.g. raster images and text captions),

    2. your dataset is stored in a pandas.DataFrame and different columns require different processing pipelines.

    This example demonstrates how to use ColumnTransformer on a dataset containing different types of features. The choice of features is not particularly helpful, but serves to illustrate the technique.
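
    As a minimal sketch of the second scenario (a pandas.DataFrame whose columns need different processing), the snippet below scales a numeric column and one-hot encodes a categorical one. The column names and values are made up for illustration and are not part of the original example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({
    'age': [20.0, 30.0, 40.0],
    'city': ['Paris', 'Rome', 'Paris'],
})

ct = ColumnTransformer([
    # standardize the numeric column
    ('num', StandardScaler(), ['age']),
    # one-hot encode the categorical column (two distinct values)
    ('cat', OneHotEncoder(), ['city']),
])

# One scaled column plus two one-hot columns, stacked side by side.
features = ct.fit_transform(df)
print(features.shape)
```

    Each entry is a (name, transformer, columns) triple; the outputs of all transformers are concatenated column-wise.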

    20 newsgroups dataset

        A dataset of news posts on 20 topics; we fetch its training and test sets.

    We will use the 20 newsgroups dataset, which comprises posts from newsgroups on 20 topics. This dataset is split into train and test subsets based on messages posted before and after a specific date. We will only use posts from 2 categories to speed up running time.

    from sklearn.datasets import fetch_20newsgroups
    
    categories = ['sci.med', 'sci.space']
    X_train, y_train = fetch_20newsgroups(random_state=1,
                                          subset='train',
                                          categories=categories,
                                          remove=('footers', 'quotes'),
                                          return_X_y=True)
    X_test, y_test = fetch_20newsgroups(random_state=1,
                                        subset='test',
                                        categories=categories,
                                        remove=('footers', 'quotes'),
                                        return_X_y=True)
    

    Each sample is the raw text of one post: header metadata, such as the subject, followed by the body of the news post.

    print(X_train[0])
    

    Out:

    From: mccall@mksol.dseg.ti.com (fred j mccall 575-3539)
    Subject: Re: Metric vs English
    Article-I.D.: mksol.1993Apr6.131900.8407
    Organization: Texas Instruments Inc
    Lines: 31
    
    
    
    
    American, perhaps, but nothing military about it.  I learned (mostly)
    slugs when we talked English units in high school physics and while
    the teacher was an ex-Navy fighter jock the book certainly wasn't
    produced by the military.
    
    [Poundals were just too flinking small and made the math come out
    funny; sort of the same reason proponents of SI give for using that.]
    
    --
    "Insisting on perfect safety is for people who don't have the balls to live
     in the real world."   -- Mary Shafer, NASA Ames Dryden
    

     

    Creating transformers

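        For unstructured or semi-structured data, you have to define the parsing rules yourself, which means writing the data transformation by hand.

        In this example, the subject and the body must be extracted from each post.

        FunctionTransformer is used to define such a transformer.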

    First, we would like a transformer that extracts the subject and body of each post. Since this is a stateless transformation (it does not require state information from the training data), we can define a function that performs the data transformation, then use FunctionTransformer to create a scikit-learn transformer.

    Subject and body extraction transformer: subject_body_transformer

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer
    
    
    def subject_body_extractor(posts):
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
        for i, text in enumerate(posts):
            # headers and body are separated by the first blank line;
            # temporary variable `_` stores '\n\n'
            headers, _, body = text.partition('\n\n')
            # store body text in second column
            features[i, 1] = body
    
            prefix = 'Subject:'
            sub = ''
            # save text after 'Subject:' in first column
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features[i, 0] = sub
    
        return features
    
    
    subject_body_transformer = FunctionTransformer(subject_body_extractor)
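
    To see what the extractor produces, here is a small self-contained check on a made-up post (the post text is an assumption for illustration, not taken from the dataset). The function body repeats the logic of subject_body_extractor in compact form.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def subject_body_extractor(posts):
    # column 0 = subject, column 1 = body
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
        # headers end at the first blank line
        headers, _, body = text.partition('\n\n')
        features[i, 1] = body
        sub = ''
        for line in headers.split('\n'):
            if line.startswith('Subject:'):
                sub = line[len('Subject:'):]
                break
        features[i, 0] = sub
    return features


# Made-up post, for illustration only.
toy_posts = [
    'From: someone@example.com\nSubject: Re: Units\n\nFirst line.\nSecond line.'
]
out = FunctionTransformer(subject_body_extractor).fit_transform(toy_posts)
print(out.shape)
print(repr(out[0, 0]))  # the space after 'Subject:' is kept
print(repr(out[0, 1]))
```

    Note that the subject keeps its leading space, because only the 'Subject:' prefix is stripped.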

    Extracting text length and number of sentences: text_stats_transformer

    We will also create a transformer that extracts the length of the text and the number of sentences.

    def text_stats(posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]
    
    
    text_stats_transformer = FunctionTransformer(text_stats)
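
    The list of dicts returned by text_stats is not yet a feature matrix. A rough sketch of how DictVectorizer turns it into one, using two made-up strings as input:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer


def text_stats(posts):
    # counting '.' is only a crude proxy for the number of sentences
    return [{'length': len(text), 'num_sentences': text.count('.')}
            for text in posts]


toy_bodies = ['One. Two.', 'No period here']  # made-up examples
stats = FunctionTransformer(text_stats).fit_transform(toy_bodies)

vec = DictVectorizer()
matrix = vec.fit_transform(stats)  # sparse matrix, one row per post
print(vec.feature_names_)
print(matrix.toarray())
```

    Each dict key becomes one column, so every post contributes one row with its length and period count.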

    FunctionTransformer

    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer

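        Creates a transformer from an arbitrary callable.

        A function transformer passes the X argument to a user-defined function and returns that function's result.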

    Constructs a transformer from an arbitrary callable.

    A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

    Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.
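
    The pickling caveat can be checked with the standard pickle module. This is a small sketch; the exact exception type raised for the lambda may vary between Python versions, so both likely candidates are caught.

```python
import pickle

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A named, importable function such as np.log1p pickles fine.
named = FunctionTransformer(np.log1p)
restored = pickle.loads(pickle.dumps(named))
result = restored.transform(np.array([[0.0, 1.0]]))
print(result)

# A lambda has no importable name, so pickling the transformer fails.
anonymous = FunctionTransformer(lambda X: X + 1)
try:
    pickle.dumps(anonymous)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print('pickling failed:', type(exc).__name__)
```

    Defining the function with def at module level avoids the problem when the fitted pipeline has to be saved.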

    An inverse function can also be supplied, to support:

    inverse_transform(X)

    Transform X using the inverse function.

    func : callable, default=None

    The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.

    inverse_func : callable, default=None

    The callable to use for the inverse transformation. This will be passed the same arguments as inverse transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.

    >>> import numpy as np
    >>> from sklearn.preprocessing import FunctionTransformer
    >>> transformer = FunctionTransformer(np.log1p)
    >>> X = np.array([[0, 1], [2, 3]])
    >>> transformer.transform(X)
    array([[0.       , 0.6931...],
           [1.0986..., 1.3862...]])
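
    The inverse_func parameter described above can be illustrated with a round trip. This sketch pairs np.log1p with np.expm1 as its inverse, and also shows the identity behaviour when func is left as None.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.0, 1.0], [2.0, 3.0]])

# np.expm1 undoes np.log1p, so inverse_transform recovers X
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_tf.fit_transform(X)
X_back = log_tf.inverse_transform(X_log)
print(np.allclose(X_back, X))

# With func=None the transformer is the identity
identity = FunctionTransformer()
X_same = identity.fit_transform(X)
print(np.array_equal(X_same, X))
```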

    Classification pipeline

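    The first pipeline step, subject_body_transformer, extracts data from the semi-structured posts.

    ColumnTransformer then assembles the features.

    The features come from sub-pipelines:

    1. TF-IDF vector of the subject

    2. TF-IDF vector of the body, reduced in dimensionality by singular value decomposition

    3. Statistics of the body

    The three groups of features are then weighted.

    Finally, the features are fed to the model.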

    The pipeline below extracts the subject and body from each post using subject_body_transformer, producing a (n_samples, 2) array.

    This array is then used to compute standard bag-of-words features for the subject and body as well as text length and number of sentences on the body, using ColumnTransformer.

    We combine them, with weights, then train a classifier on the combined set of features.

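    from sklearn.compose import ColumnTransformer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC
    
    pipeline = Pipeline([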
        # Extract subject & body
        ('subjectbody', subject_body_transformer),
        # Use ColumnTransformer to combine the subject and body features
        ('union', ColumnTransformer(
            [
                # bag-of-words for subject (col 0)
                ('subject', TfidfVectorizer(min_df=50), 0),
                # bag-of-words with decomposition for body (col 1)
                ('body_bow', Pipeline([
                    ('tfidf', TfidfVectorizer()),
                    ('best', TruncatedSVD(n_components=50)),
                ]), 1),
                # Pipeline for pulling text stats from post's body
                ('body_stats', Pipeline([
                    ('stats', text_stats_transformer),  # returns a list of dicts
                    ('vect', DictVectorizer()),  # list of dicts -> feature matrix
                ]), 1),
            ],
            # weight above ColumnTransformer features
            transformer_weights={
                'subject': 0.8,
                'body_bow': 0.5,
                'body_stats': 1.0,
            }
        )),
        # Use a SVC classifier on the combined features
        ('svc', LinearSVC(dual=False)),
    ], verbose=True)
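
    To make the effect of transformer_weights concrete, here is a minimal sketch on a made-up numeric array. Both transformers are identity FunctionTransformers, so the output shows only the weighting: each transformer's output block is multiplied by its weight before the blocks are stacked.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0],
              [2.0, 20.0]])

ct = ColumnTransformer(
    [
        ('first', FunctionTransformer(), [0]),   # identity on column 0
        ('second', FunctionTransformer(), [1]),  # identity on column 1
    ],
    # multiply each transformer's output by its weight
    transformer_weights={'first': 0.5, 'second': 2.0},
)
weighted = ct.fit_transform(X)
print(weighted)
```

    In the pipeline above, the same mechanism scales the subject, body_bow and body_stats blocks by 0.8, 0.5 and 1.0 respectively.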

    Finally, train the model and check its performance on the test set.

    Finally, we fit our pipeline on the training data and use it to predict topics for X_test. Performance metrics of our pipeline are then printed.

    from sklearn.metrics import classification_report
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print('Classification report:\n\n{}'.format(
        classification_report(y_test, y_pred))
    )

    Classification performance report

    Out:

    [Pipeline] ....... (step 1 of 3) Processing subjectbody, total=   0.0s
    [Pipeline] ............. (step 2 of 3) Processing union, total=   0.4s
    [Pipeline] ............... (step 3 of 3) Processing svc, total=   0.0s
    Classification report:
    
                  precision    recall  f1-score   support
    
               0       0.84      0.87      0.86       396
               1       0.87      0.83      0.85       394
    
        accuracy                           0.85       790
       macro avg       0.85      0.85      0.85       790
    weighted avg       0.85      0.85      0.85       790
    

    Total running time of the script: ( 0 minutes 2.529 seconds)
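
    When the numbers in such a report are needed programmatically, classification_report can return a dict instead of a string. The labels below are made up for illustration and are unrelated to the newsgroups results above.

```python
from sklearn.metrics import classification_report

# Made-up labels, for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)

# Class 0 was predicted once and correctly: precision 1.0.
# Only one of the two true class-0 samples was found: recall 0.5.
print(report['0']['precision'], report['0']['recall'])
print(report['accuracy'])  # 3 of 4 predictions are correct
```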

    TruncatedSVD

    https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD

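    Linear dimensionality reduction using truncated singular value decomposition.

    Unlike PCA, this algorithm does not need the data to be centered, so it can handle sparse matrices. It is often used for latent semantic analysis in text feature extraction.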

    Dimensionality reduction using truncated SVD (aka LSA).

    This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.

    In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

    This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on X * X.T or X.T * X, whichever is more efficient.
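
    As a rough sketch of LSA with both solvers, the snippet below applies TruncatedSVD directly to a sparse tf-idf matrix. The mini corpus is made up for illustration; the algorithm values 'randomized' and 'arpack' select the two solvers mentioned above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus, for illustration only.
docs = [
    'the rocket reached orbit',
    'the satellite orbit decayed',
    'the doctor prescribed medicine',
    'the patient took the medicine',
    'rocket launch delayed by weather',
    'patient recovery after surgery',
]

tfidf = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix

shapes = {}
for algorithm in ('randomized', 'arpack'):
    svd = TruncatedSVD(n_components=2, algorithm=algorithm, random_state=0)
    reduced = svd.fit_transform(tfidf)  # sparse input, dense output
    shapes[algorithm] = reduced.shape
    print(algorithm, reduced.shape)
```

    No densifying or centering step is needed before the decomposition, which is what makes this practical for large vocabularies.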

    >>> from sklearn.decomposition import TruncatedSVD
    >>> from scipy.sparse import random as sparse_random
    >>> X = sparse_random(100, 100, density=0.01, format='csr',
    ...                   random_state=42)
    >>> svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
    >>> svd.fit(X)
    TruncatedSVD(n_components=5, n_iter=7, random_state=42)
    >>> print(svd.explained_variance_ratio_)
    [0.0646... 0.0633... 0.0639... 0.0535... 0.0406...]
    >>> print(svd.explained_variance_ratio_.sum())
    0.286...
    >>> print(svd.singular_values_)
    [1.553... 1.512...  1.510... 1.370... 1.199...]
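
    Source: http://www.cnblogs.com/lightsong/ Copyright of this article is shared by the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be given in a prominent place on the page.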
  • Original article: https://www.cnblogs.com/lightsong/p/14297934.html