Generate a random multilabel classification problem.
For each sample, the generative process is:
- pick the number of labels: n ~ Poisson(n_labels)
- n times, choose a class c: c ~ Multinomial(theta)
- pick the document length: k ~ Poisson(length)
- k times, choose a word: w ~ Multinomial(theta_c)
In the above process, rejection sampling is used to make sure that n is never zero or more than n_classes, and that the document length is never zero. Likewise, classes that have already been chosen are rejected.
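The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `sample_document` and its arguments are hypothetical, and combining the chosen classes' word distributions by summing and renormalising them is an assumption made here to obtain a single word distribution per sample.

```python
import numpy as np

def sample_document(rng, p_c, p_w_c, n_labels=2, length=50):
    """Draw one (word_counts, labels) pair (hypothetical helper).

    p_c   : class probabilities, shape (n_classes,)
    p_w_c : per-class word probabilities, shape (n_features, n_classes)
    """
    n_features, n_classes = p_w_c.shape

    # Number of labels: reject 0 and anything larger than n_classes.
    n = 0
    while n == 0 or n > n_classes:
        n = rng.poisson(n_labels)

    # Choose n distinct classes; a class that was already chosen is rejected.
    labels = set()
    while len(labels) < n:
        labels.add(int(rng.choice(n_classes, p=p_c)))
    labels = sorted(labels)

    # Document length: reject 0.
    k = 0
    while k == 0:
        k = rng.poisson(length)

    # Assumption: mix the chosen classes' word distributions and renormalise.
    p_w = p_w_c[:, labels].sum(axis=1)
    p_w /= p_w.sum()

    # Draw k words and turn them into per-feature counts.
    words = rng.choice(n_features, size=k, p=p_w)
    counts = np.bincount(words, minlength=n_features)
    return counts, labels

rng = np.random.default_rng(0)
p_c = rng.random(3)
p_c /= p_c.sum()                # class probabilities sum to 1
p_w_c = rng.random((2, 3))
p_w_c /= p_w_c.sum(axis=0)      # each class column sums to 1

counts, labels = sample_document(rng, p_c, p_w_c)
print(counts, labels)
```

Each rejection step is written as a plain `while` loop so that it maps one-to-one onto the description.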
Parameters:
n_samples : int, optional (default=100)
n_features : int, optional (default=20)
n_classes : int, optional (default=5)
n_labels : int, optional (default=2)
length : int, optional (default=50)
allow_unlabeled : bool, optional (default=True)
sparse : bool, optional (default=False)
return_indicator : 'dense' (default) | 'sparse' | False
return_distributions : bool, optional (default=False)
random_state : int, RandomState instance or None, optional (default=None)
Returns:
X : array of shape [n_samples, n_features]
Y : array or sparse CSR matrix of shape [n_samples, n_classes]
p_c : array, shape [n_classes]
p_w_c : array, shape [n_features, n_classes]
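As a quick check of the parameters and return values listed above, a call might look like the sketch below. It assumes a scikit-learn version that accepts these keyword arguments; the specific values are only illustrative. `p_c` and `p_w_c` are returned only because `return_distributions=True` is set.

```python
from sklearn.datasets import make_multilabel_classification

# return_distributions=True adds p_c and p_w_c to the returned tuple.
X, Y, p_c, p_w_c = make_multilabel_classification(
    n_samples=100,
    n_features=20,
    n_classes=5,
    n_labels=2,
    return_distributions=True,
    random_state=0,
)

print(X.shape)      # (100, 20)  -> [n_samples, n_features]
print(Y.shape)      # (100, 5)   -> [n_samples, n_classes]
print(p_c.shape)    # (5,)       -> [n_classes]
print(p_w_c.shape)  # (20, 5)    -> [n_features, n_classes]
```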
Official tutorial:
"""
==============================================
Plot randomly generated multilabel dataset
==============================================
This illustrates the `datasets.make_multilabel_classification` dataset
generator. Each sample consists of counts of two features (up to 50 in
total), which are differently distributed in each of three classes.
Points are labeled as follows, where Y means the class is present:
=====  =====  =====  ======
  1      2      3    Color
=====  =====  =====  ======
  Y      N      N    Red
  N      Y      N    Blue
  N      N      Y    Yellow
  Y      Y      N    Purple
  Y      N      Y    Orange
  N      Y      Y    Green
  Y      Y      Y    Brown
=====  =====  =====  ======
A star marks the expected sample for each class; its size reflects the
probability of selecting that class label.
The left and right examples highlight the ``n_labels`` parameter: more of
the samples in the right plot have 2 or 3 labels.
Note that this two-dimensional example is very degenerate: generally the
number of features would be much greater than the "document length", while
here we have much larger documents than vocabulary. Similarly, with
``n_classes > n_features`` (here 3 > 2), it is much less likely that a
feature distinguishes a particular class.
"""
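The colour table in the docstring can be read as a lookup on a three-bit label code. The sketch below is one way to express that mapping. It assumes each combined colour is the mix of its classes' primary colours, so classes 2 and 3 together give Green. The names `COLORS` and `label_colors` are hypothetical, as is the `"Unlabeled"` placeholder, which does not appear in the table.

```python
import numpy as np

# Colour names indexed by the code 1*class1 + 2*class2 + 4*class3.
# Index 0 (no class present) is a placeholder that is not in the table.
COLORS = np.array(
    ["Unlabeled", "Red", "Blue", "Purple", "Yellow", "Orange", "Green", "Brown"]
)

def label_colors(Y):
    """Map an indicator matrix of shape (n_samples, 3) to colour names."""
    Y = np.asarray(Y)
    codes = (Y * np.array([1, 2, 4])).sum(axis=1)
    return COLORS[codes]

# One row per non-empty label combination (Y = 1, N = 0).
Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])
colors = label_colors(Y)
print(colors)
```

Encoding the label row as a single integer keeps the lookup to one array index, which is convenient when passing a colour per point to a scatter plot.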