  • t-SNE and PCA

    1. t-SNE

    Reference: Zhihu

    • t-distributed stochastic neighbor embedding
    • Although its headline use is nonlinear dimensionality reduction of high-dimensional data, it is rarely used that way, because
    • it is better suited to visualization and to checking how well a model works;
    • it keeps the distribution of the data in the low-dimensional space highly similar to its distribution in the original feature space.

    It is therefore especially good for visually inspecting how well a classifier separates the classes.

    1.1 Reproducing the demo

    # Import TSNE
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    # `samples` and `variety_numbers` are assumed to be predefined
    # (grain measurements and grain-variety labels from the exercise)
    
    # Create a TSNE instance: model
    model = TSNE(learning_rate=200)
    
    # Apply fit_transform to samples: tsne_features
    tsne_features = model.fit_transform(samples)
    
    # Select the 0th feature: xs
    xs = tsne_features[:,0]
    
    # Select the 1st feature: ys
    ys = tsne_features[:,1]
    
    # Scatter plot, coloring by variety_numbers
    plt.scatter(xs,ys,c=variety_numbers)
    plt.show()
    
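    Note that sklearn's TSNE has no separate transform method, so it cannot project new samples into an existing embedding; it has to be refit each time. A minimal self-contained sketch (the random data here is only a stand-in for the exercise's samples):

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))        # stand-in data: 100 samples, 4 features

    model = TSNE(learning_rate=200, random_state=0)
    embedding = model.fit_transform(X)   # only fit_transform exists, no .transform()
    print(embedding.shape)               # (100, 2)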

    2. PCA

    Principal component analysis performs feature extraction: it derives new features from the existing ones, each a linear combination of the original features, and in this way achieves dimensionality reduction. PCA is not the only way to reduce dimensionality, however.

    • When there are many feature variables, multicollinearity often exists among them.
    • Principal component analysis reduces the dimensionality of high-dimensional data by extracting its main feature components.
    • PCA "kills two birds with one stone":
      • it selects representative features,
      • and each resulting feature is linearly uncorrelated with the others.
      • In short, the principal components are the best linear combinations of the original feature space.
        There is a very accessible example of this on Zhihu; a small sketch also follows this list.
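
    To make the "best linear combination" point concrete, here is a minimal sketch on random data (all names are illustrative): each row of pca.components_ holds the weights that mix the original features into one principal component, and the transformed features come out uncorrelated.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)    # inject multicollinearity

    pca = PCA()
    Z = pca.fit_transform(X)

    print(pca.components_)                            # rows = linear-combination weights
    print(np.round(np.corrcoef(Z, rowvar=False), 2))  # off-diagonals ~ 0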

    2.1 Mathematical derivation

    For the full derivation, see 【机器学习】降维——PCA(非常详细) and
    Making sense of principal component analysis, eigenvectors & eigenvalues.
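
    The heart of the derivation: center the data, form the covariance matrix C = XᵀX / (n − 1), and take its leading eigenvectors as the principal directions. A minimal numpy sketch of that recipe (illustrative only; sklearn's PCA actually uses the SVD):

    import numpy as np

    def pca_eig(X, k):
        """PCA via eigendecomposition of the covariance matrix."""
        Xc = X - X.mean(axis=0)               # center each feature
        C = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
        top = np.argsort(eigvals)[::-1][:k]   # indices of the k largest
        return Xc @ eigvecs[:, top]           # project onto those directions

    X = np.random.default_rng(0).normal(size=(100, 5))
    print(pca_eig(X, 2).shape)                # (100, 2)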

    2.2 Example

    sklearn ships a ready-made implementation that can be used directly. The example below first measures the Pearson correlation of the two raw grain features, and then shows that the PCA-transformed features are uncorrelated (Pearson correlation ≈ 0).

    # Perform the necessary imports
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr

    # `grains` is assumed: a 2-column array of grain measurements
    # (column 0 = width, column 1 = length), as in the original exercise
    
    # Assign the 0th column of grains: width
    width = grains[:,0]
    
    # Assign the 1st column of grains: length
    length = grains[:,1]
    
    # Scatter plot width vs length
    plt.scatter(width, length)
    plt.axis('equal')
    plt.show()
    
    # Calculate the Pearson correlation
    correlation, pvalue = pearsonr(width, length)
    
    # Display the correlation
    print(correlation)
    

    # Import PCA
    from sklearn.decomposition import PCA
    
    # Create PCA instance: model
    model = PCA()
    
    # Apply the fit_transform method of model to grains: pca_features
    pca_features = model.fit_transform(grains)
    
    # Assign 0th column of pca_features: xs
    xs = pca_features[:,0]
    
    # Assign 1st column of pca_features: ys
    ys = pca_features[:,1]
    
    # Scatter plot xs vs ys
    plt.scatter(xs, ys)
    plt.axis('equal')
    plt.show()
    
    # Calculate the Pearson correlation of xs and ys
    correlation, pvalue = pearsonr(xs, ys)
    
    # Display the correlation
    print(correlation)
    
    <script.py> output:
        2.5478751053409354e-17
    

    2.3 Intrinsic dimension

    The intrinsic dimension is the number of features actually needed to approximate the dataset. In PCA terms, finding it amounts to extracting the principal components, i.e. the best linear combinations, and counting how many carry significant variance.

    2.3.1 Extracting the principal components

    The fitted model exposes the number of components as .n_components_.
    A common rule of thumb is to keep enough components to cover at least 80% of the total variance, though the right cutoff depends on the problem; a sketch of this selection rule follows the code below.

    # Perform the necessary imports
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    import matplotlib.pyplot as plt
    
    # Create scaler: scaler
    scaler = StandardScaler()
    
    # Create a PCA instance: pca
    pca = PCA()
    
    # Create pipeline: pipeline
    pipeline = make_pipeline(scaler,pca)
    
    # Fit the pipeline to 'samples'
    pipeline.fit(samples)
    
    # Plot the explained variances
    features = range(pca.n_components_)
    plt.bar(features, pca.explained_variance_)
    plt.xlabel('PCA feature')
    plt.ylabel('variance')
    plt.xticks(features)
    plt.show()
    
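    A minimal sketch of the 80% rule of thumb mentioned above, reusing the pca fitted in the previous block:

    import numpy as np

    # Cumulative share of variance explained by the first k components
    cum_var = np.cumsum(pca.explained_variance_ratio_)

    # Smallest k whose components cover at least 80% of the variance
    k = int(np.searchsorted(cum_var, 0.80)) + 1
    print(k, cum_var[k - 1])

    # Equivalently, sklearn can pick k automatically:
    # pca80 = PCA(n_components=0.80)   # keeps enough components for >= 80%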

    2.3.2 Dimension reduction with PCA

    Dimensionality reduction with principal components.

    Here is a small example of text feature extraction with TF-IDF; its output is the kind of high-dimensional matrix one would then reduce (see the sketch after the output below).

    (I don't yet fully understand this part; I'll come back and expand it after I've studied it.)

    # Import TfidfVectorizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Create a TfidfVectorizer: tfidf
    tfidf = TfidfVectorizer() 
    
    # Apply fit_transform to document: csr_mat
    csr_mat = tfidf.fit_transform(documents)
    
    # Print result of toarray() method
    print(csr_mat.toarray())
    
    # Get the words: words
    # (in scikit-learn >= 1.0 this is named get_feature_names_out)
    words = tfidf.get_feature_names()
    
    # Print words
    print(words)
    
    # `documents` used above:
    documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
    
    <script.py> output:
        [[0.51785612 0.         0.         0.68091856 0.51785612 0.        ]
         [0.         0.         0.51785612 0.         0.51785612 0.68091856]
         [0.51785612 0.68091856 0.51785612 0.         0.         0.        ]]
        ['cats', 'chase', 'dogs', 'meow', 'say', 'woof']
    
    
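    The TF-IDF matrix above is sparse, and PCA centers its input (which would destroy the sparsity), so dimensionality reduction on such matrices is usually done with TruncatedSVD instead. A small sketch continuing from the block above:

    from sklearn.decomposition import TruncatedSVD

    # TruncatedSVD works directly on the sparse csr_mat
    svd = TruncatedSVD(n_components=2)
    reduced = svd.fit_transform(csr_mat)
    print(reduced.shape)   # (3, 2): one 2-d vector per document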