Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
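By default, this method removes duplicate rows, treating two rows as duplicates only when they match in every column. To compare only certain columns, pass their labels through the subset parameter.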
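A minimal sketch of how subset changes what counts as a duplicate. The frame and its column names (name, city, score) are made up for illustration and are not from the pandas docs.

```python
import pandas as pd

# Hypothetical data: no two rows are identical across all three columns.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Bob"],
    "city": ["Paris", "Rome", "Paris", "Paris"],
    "score": [1, 2, 3, 4],
})

# Default: all columns are compared, so nothing is dropped here.
all_columns = df.drop_duplicates()

# Only 'name' is compared: the first row for each name is kept.
by_name = df.drop_duplicates(subset=["name"])

# 'name' and 'city' are compared together.
by_name_city = df.drop_duplicates(subset=["name", "city"])

print(by_name)
print(by_name_city)
```

Note that the columns outside subset (score here) are ignored when comparing rows but still appear in the result.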
keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
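keep lets you choose which of the duplicated rows survives: 'first' keeps the first occurrence (and its index label), 'last' keeps the last occurrence, and False discards every row that has a duplicate.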
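To make the three options concrete, here is a small sketch on a single-column frame. The column name key and its values are assumptions chosen so the surviving index labels are easy to follow.

```python
import pandas as pd

# Hypothetical data: 'a' appears at 0 and 2, 'b' at 1 and 4, 'c' only at 3.
df = pd.DataFrame({"key": ["a", "b", "a", "c", "b"]})

keep_first = df.drop_duplicates(keep="first")  # first occurrence of each value
keep_last = df.drop_duplicates(keep="last")    # last occurrence of each value
keep_none = df.drop_duplicates(keep=False)     # only values that never repeat

print(keep_first.index.tolist())  # [0, 1, 3]
print(keep_last.index.tolist())   # [2, 3, 4]
print(keep_none.index.tolist())   # [3]
```

In every case the surviving rows stay in their original order and keep their original index labels.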
inplace is not covered here.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
This controls whether the index of the result is renumbered from 0 after the duplicates are dropped.
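A short sketch of the difference, assuming pandas 1.0.0 or later. The frame is hypothetical; the comparison with reset_index(drop=True) is included only to show one way of thinking about what the flag does.

```python
import pandas as pd

# Hypothetical data with duplicates at positions 1 and 3.
df = pd.DataFrame({"key": [1, 1, 2, 2, 3]})

# Original labels of the surviving rows are preserved, leaving gaps.
kept = df.drop_duplicates()

# The surviving rows are relabeled 0, 1, ..., n - 1.
renumbered = df.drop_duplicates(ignore_index=True)

# Same result as dropping duplicates and then resetting the index.
same = renumbered.equals(df.drop_duplicates().reset_index(drop=True))

print(kept.index.tolist())        # [0, 2, 4]
print(renumbered.index.tolist())  # [0, 1, 2]
print(same)                       # True
```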
Here is the official demo:
In [8]: df
Out[8]:
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

In [9]: df.drop_duplicates()
Out[9]:
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

In [10]: df.drop_duplicates(ignore_index=True)
Out[10]:
     brand style  rating
0  Yum Yum   cup     4.0
1  Indomie   cup     3.5
2  Indomie  pack    15.0
3  Indomie  pack     5.0

In [11]: df.drop_duplicates(keep='last')
Out[11]:
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

In [12]: df.drop_duplicates(keep=False)
Out[12]:
     brand style  rating
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
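The session above does not show how df was created. The following sketch rebuilds a frame with the same values so the calls can be rerun; the construction itself is my assumption based on the printed output.

```python
import pandas as pd

# Reconstructed from the printed table above (assumed construction).
df = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

default_result = df.drop_duplicates()
reindexed_result = df.drop_duplicates(ignore_index=True)
last_result = df.drop_duplicates(keep="last")
unique_only = df.drop_duplicates(keep=False)

for result in (default_result, reindexed_result, last_result, unique_only):
    print(result, end="\n\n")
```

Rows 3 and 4 both survive in every call because their ratings differ (15.0 and 5.0), so they are not duplicates when all columns are compared.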