
Stratified splitting

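A stratified split keeps the proportions of the label column the same in each resulting subset, as closely as whole row counts allow. For example, if the ratio of positive to negative samples in the original dataset is 2:1, then after the split the positive and negative labels in both the test set and the training set are also at about 2:1.
It can be used to fix the problem of a random split leaving the test set and the training set with different label proportions.
If positive samples are very rare and the test set is also small, a random split may put no positive sample in the test set at all; stratified sampling avoids this.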
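
To make the rare-positive case concrete, here is a minimal sketch that repeats the split over many seeds and counts how often the test set ends up with no positive sample. The dataset (95 negatives, 5 positives), the 20% test size, the 200 seeds, and the helper name count_splits_without_positives are all assumptions made up for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column


def count_splits_without_positives(stratified, n_seeds=200):
    """Count how many seeds give a test set containing no positive sample."""
    missing = 0
    for seed in range(n_seeds):
        _, _, _, y_test = train_test_split(
            X, y,
            test_size=0.2,
            random_state=seed,
            stratify=y if stratified else None,
        )
        if y_test.sum() == 0:
            missing += 1
    return missing


random_missing = count_splits_without_positives(stratified=False)
stratified_missing = count_splits_without_positives(stratified=True)
print('random splits with no positive in the test set:', random_missing)
print('stratified splits with no positive in the test set:', stratified_missing)
```

With these numbers the stratified split places one of the five positives in every test set, while a plain random split should miss them in a noticeable share of the runs.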

Testing with sklearn, without stratification:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    data={
        'c1': ['a', 'b', 'c', 'd', 'e', 'f'],
        'label': [1, 1, 1, 1, 0, 0]
    }
)
# X_train, X_test = train_test_split(df, test_size=0.333, random_state=100, stratify=df['label'])  # stratified sampling; stratify names the column to stratify on
X_train, X_test = train_test_split(df, test_size=0.333, random_state=100)

print('X_train: ')
print(X_train)
print('X_test: ')
print(X_test)

X_train: 
  c1  label
4  e      0
3  d      1
5  f      0
0  a      1
X_test: 
  c1  label
1  b      1
2  c      1

Output with stratified sampling (the call that passes stratify=df['label']):

X_train: 
  c1  label
1  b      1
3  d      1
5  f      0
2  c      1
X_test: 
  c1  label
4  e      0
0  a      1
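
One way to check the label proportions directly, rather than reading them off the printed rows, is to count the labels in each subset. This is a minimal sketch that reuses the DataFrame and the stratified call from the example; the helper name label_counts is an assumption introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    data={
        'c1': ['a', 'b', 'c', 'd', 'e', 'f'],
        'label': [1, 1, 1, 1, 0, 0]
    }
)

# Stratified split on the label column, as in the example above.
train, test = train_test_split(
    df, test_size=0.333, random_state=100, stratify=df['label']
)


def label_counts(frame):
    """Return a {label: number of rows} dict for one subset."""
    return frame['label'].value_counts().to_dict()


print('full: ', label_counts(df))
print('train:', label_counts(train))
print('test: ', label_counts(test))
```

With only six rows the 2:1 ratio cannot be reproduced exactly in a 4-row training set and a 2-row test set: the training set gets 3 positives and 1 negative and the test set gets 1 of each, which is what the stratified output shows. What stratification does ensure here is that both labels appear in both subsets.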

Original article: https://www.cnblogs.com/oaks/p/15224321.html