Shuffling a dataset randomly with the pandas sample function


1. Summary

In one sentence:

[A] Setting frac=0.5 randomly samples 50% of the data.

[B] df = df.sample(frac=1.0)  # shuffle all rows
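As a minimal sketch of the shuffle in [B]: the one-column DataFrame, the seed 42, and the reset_index step are illustrative choices made here, not part of the original post. Because sample keeps each row's original index label, resetting the index is a common follow-up when a clean 0..n-1 index is wanted.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

# frac=1.0 returns every row exactly once, in random order.
# random_state=42 is an arbitrary seed so the shuffle can be repeated.
shuffled = df.sample(frac=1.0, random_state=42)

# The original index labels travel with the rows; drop them to get 0..n-1.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```

Leaving out random_state gives a different order on each run.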

2. pandas: the sample function explained

Reposted from or based on: pandas: the sample function explained
http://blog.csdn.net/Flag_ing/article/details/106979895
Function signature:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

Purpose:

Returns a random sample of items from the specified axis of the object, similar to the random.sample() function.

An example (each parameter is explained at the end):

1. First, define a dataset with the following structure:

import pandas as pd

# Define a small dataset
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df)

""" ----------------- Output ----------------- """
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

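2. Draw 3 random elements from the Series df['num_legs']. Note that random_state (which plays the same role as the seed in the random library) makes the example reproducible. The result is three values drawn at random from the num_legs column above.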

extract = df['num_legs'].sample(n=3, random_state=1)
print(extract)

""" ------------ Output ------------ """
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64

3. replace=True means sampling with replacement, and frac=0.5 randomly samples 50% of the data. Rows are sampled by default. For example:

extract2 = df.sample(frac=0.5, replace=True, random_state=1)
print(extract2)

""" ------------ Output ------------ """
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8

4. An upsampling example with frac=2. Note that replace=True is required when frac > 1. Rows are sampled by default.

extract3 = df.sample(frac=2, replace=True, random_state=1)
print(extract3)

""" ------------ Output ------------ """
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2
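To illustrate why replace=True is needed for upsampling, the sketch below first tries frac=2 without replacement and catches the resulting error, then repeats the call with replacement. The exact error message varies between pandas versions, so the code only assumes that a ValueError is raised. The one-column DataFrame is a trimmed copy of the example data.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Without replacement there are only 4 distinct rows to draw from,
# so asking for 2 * 4 = 8 rows cannot be satisfied.
try:
    df.sample(frac=2, random_state=1)  # replace defaults to False
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# With replacement, rows may repeat and the request succeeds.
upsampled = df.sample(frac=2, replace=True, random_state=1)
print(len(upsampled))
```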

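5. An example that uses the values of a column as sampling weights. Here the num_specimen_seen column supplies the weights, so rows with larger values in that column are more likely to be sampled. The column holds [10, 2, 1, 8], so the two rows with weights 10 and 8 (falcon and fish) are the most likely to be drawn, and the result bears this out.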

extract4 = df.sample(n=2, weights='num_specimen_seen', random_state=1)
print(extract4)

""" ------------ Output ------------ """
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
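The weighting can be made more concrete with a short sketch. It computes the selection probabilities implied by the weight column (each weight divided by the column sum), then uses num_wings as a second, more extreme weight column: only falcon has a non-zero value there, so falcon is the only row that can be drawn. The loop over 20 seeds is an illustrative check added here, not something from the original post.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Weights are normalised to sum to 1, giving 10/21, 2/21, 1/21 and 8/21
# as the probabilities for a single draw.
probs = df['num_specimen_seen'] / df['num_specimen_seen'].sum()
print(probs)

# num_wings is [2, 0, 0, 0]: rows with weight zero can never be selected,
# so every single-row draw returns falcon regardless of the seed.
picks = [df.sample(n=1, weights='num_wings', random_state=seed).index[0]
         for seed in range(20)]
print(set(picks))
```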

Parameters:

n : int, optional

Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.

frac : float, optional

Fraction of axis items to return. Cannot be used with n.

Note: If frac > 1, replace must be set to True.
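The relationship between n and frac can be checked with a small sketch. It relies only on the documented behaviour above: n defaults to 1 when neither argument is given, and the two cannot be combined. The code assumes that combining them raises a ValueError; the message text itself may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Neither n nor frac given: a single row is returned.
one_row = df.sample(random_state=0)

# frac=0.5 of 4 rows gives 2 rows.
half = df.sample(frac=0.5, random_state=0)

# Passing both n and frac is rejected.
try:
    df.sample(n=2, frac=0.5)
    both_rejected = False
except ValueError as err:
    both_rejected = True
    print('ValueError:', err)

print(len(one_row), len(half))
```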

replace : bool, default False

Whether to sample with replacement, that is, whether the same row may be drawn more than once.

weights : str or ndarray-like, optional

The default, None, gives every item equal probability. If a Series is passed, it is aligned with the target object on the index: index values in weights that are not found in the sampled object are ignored, and index values in the sampled object that are missing from weights are assigned a weight of zero. When called on a DataFrame, the name of a column is accepted if axis = 0. Unless weights is a Series, it must have the same length as the axis being sampled. If the weights do not sum to 1, they are normalized to sum to 1. Missing values in the weights column are treated as zero. Infinite values are not allowed.
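The index alignment described for Series weights can be seen in a minimal sketch. The weights Series below is hypothetical: it names one label that exists in the data (falcon) and one that does not (unicorn). Under the rules above, unicorn should be ignored and the three unlisted rows should get weight zero, which leaves falcon as the only possible draw.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Hypothetical weights: 'unicorn' is not a row of df and is ignored;
# dog, spider and fish are absent from the Series and get weight 0.
weights = pd.Series({'falcon': 3.0, 'unicorn': 7.0})

# With falcon as the only non-zero weight, every draw returns falcon.
picks = [df.sample(n=1, weights=weights, random_state=seed).index[0]
         for seed in range(20)]
print(set(picks))
```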

random_state : int or numpy.random.RandomState, optional

Seed for the random number generator (if int), or a numpy RandomState object.
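A short sketch of what random_state buys in practice: two calls with the same integer seed return identical samples. The seed value 7 is arbitrary. The last comparison assumes that an integer seed is turned into a numpy RandomState with that seed internally, which matches the pandas implementation as far as I know but is not stated in the text above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Same integer seed, same sample.
first = df.sample(n=2, random_state=7)
second = df.sample(n=2, random_state=7)
print(first.equals(second))

# A RandomState object seeded with the same value is expected to match too
# (assumption: pandas builds RandomState(7) from the integer seed 7).
third = df.sample(n=2, random_state=np.random.RandomState(7))
print(first.equals(third))
```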

axis : {0 or 'index', 1 or 'columns', None}, default None

Axis to sample. Accepts axis number or name. Default is the stat axis for the given data type (0 for Series and DataFrames).
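All the examples in this post sample rows. As a minimal sketch of the other direction, axis=1 (or axis='columns') picks columns at random and keeps every row. The seed is arbitrary and the data is the same example DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Sample 2 of the 3 columns; all 4 rows are kept.
cols = df.sample(n=2, axis=1, random_state=1)
print(cols.shape)

# The axis can also be given by name.
cols_by_name = df.sample(n=2, axis='columns', random_state=1)
print(list(cols_by_name.columns))
```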

Returns : Series or DataFrame

A new object of the same type as the caller, containing n items randomly sampled from the caller object.

Official reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

Original post: https://www.cnblogs.com/Renyi-Fan/p/13669897.html
