Chapter 5 - Dimensionality Reduction Methods
Segment 1 - Exploratory factor analysis
Factor Analysis
A method that explores a dataset to find the underlying causes that explain why the data behaves the way it does
Factors (or latent variables): variables that are meaningful but are inferred rather than directly observed
Factor Analysis Assumptions
- Features are metric
- Features are continuous or ordinal
- Features in your dataset are correlated with one another at r > 0.3
- You have > 100 observations and > 5 observations per feature
- Sample is homogeneous
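The numeric assumptions above can be checked before fitting anything. The following is a minimal sketch on the Iris data, not a formal test. It assumes that the r > 0.3 rule is applied to the absolute value of the correlation, so that strong negative correlations also count; the notes do not say this explicitly.

```python
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data
n_obs, n_features = X.shape
obs_per_feature = n_obs / n_features

# Pairwise Pearson correlations between the features (columns of X)
corr = np.corrcoef(X, rowvar=False)

# Keep only the off-diagonal entries; the diagonal is always 1
off_diag = np.abs(corr[~np.eye(n_features, dtype=bool)])

# Assumption: compare |r| with 0.3, not the signed value
share_above = np.mean(off_diag > 0.3)

print(f"observations: {n_obs}, observations per feature: {obs_per_feature:.1f}")
print(f"share of feature pairs with |r| > 0.3: {share_above:.2f}")
```

Homogeneity of the sample and the metric nature of the features cannot be read off from these numbers and still need to be judged from how the data was collected.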
The Iris Dataset
Iris flowers (labels):
- Setosa
- Versicolor
- Virginica
Attributes (predictive features):
- Sepal length
- Sepal width
- Petal length
- Petal width
Factor Loading
- ~ -1 or 1 = Factor has a strong influence on the variable
- ~ 0 = Factor has a weak influence on the variable
- > 1 = The factors are highly correlated with one another
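The -1 to 1 reading of a loading is easiest to apply when every feature is on the same scale. The sketch below standardizes the Iris features before fitting, so that each loading is roughly comparable to a correlation between a factor and a feature. The use of `StandardScaler`, the choice of `n_components=2`, and the name `loadings` are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Put every feature on zero mean and unit variance (an added step, see lead-in)
X_std = StandardScaler().fit_transform(iris.data)

# Assumption: two factors, chosen only to keep the output small
fa = FactorAnalysis(n_components=2, random_state=0).fit(X_std)

# components_ has shape (n_components, n_features); transpose to features x factors
loadings = fa.components_.T

for name, row in zip(iris.feature_names, loadings):
    print(f"{name:20s}", np.round(row, 2))
```

Reading across a row shows which factor, if any, has a strong influence on that feature.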
import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import FactorAnalysis
from sklearn import datasets
Factor analysis on the Iris dataset
iris = datasets.load_iris()
X = iris.data
variable_names = iris.feature_names
X[0:10,]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
factor = FactorAnalysis().fit(X)
DF = pd.DataFrame(factor.components_, columns=variable_names)
print(DF)
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           0.706989         -0.158005           1.654236           0.70085
1           0.115161          0.159635          -0.044321          -0.01403
2          -0.000000          0.000000           0.000000           0.00000
3          -0.000000          0.000000           0.000000          -0.00000
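The loadings in this table come from a fit on the raw, unstandardized measurements, which is why a value such as 1.654236 can exceed 1. One way to see what the loadings represent is to rebuild the covariance matrix implied by the model: the loadings multiplied by their transpose, plus the per-feature noise variance on the diagonal. The sketch below does this with the fitted `components_` and `noise_variance_` attributes and prints the result next to the sample covariance; the variable names `W`, `psi`, `model_cov` and `sample_cov` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import FactorAnalysis

X = datasets.load_iris().data
factor = FactorAnalysis().fit(X)

W = factor.components_        # loadings, shape (n_components, n_features)
psi = factor.noise_variance_  # per-feature noise variance, shape (n_features,)

# Covariance implied by the factor model: shared part plus feature-specific noise
model_cov = W.T @ W + np.diag(psi)

# Covariance measured directly from the data, for comparison
sample_cov = np.cov(X, rowvar=False)

print(np.round(model_cov, 3))
print(np.round(sample_cov, 3))
```

If the two matrices are close, the retained factors account for most of the shared variation between the features.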