Linux: Comparing Two Files with the comm Command: Intersection and Difference

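The comm command compares two sorted files line by line. Its output has three columns: the first holds lines unique to file1, the second holds lines unique to file2, and the third holds lines found in both. The basic syntax is: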

NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines
unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
--check-order
check that the input is correctly sorted, even if all input lines are pairable
--nocheck-order
do not check that the input is correctly sorted
--output-delimiter=STR
separate columns with STR
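
As a small illustration of the options listed above, the sketch below combines two suppress flags into a single argument and replaces the default tab separator with a visible one. The file names `left.txt` and `right.txt` and their contents are made up for this example, and GNU coreutils `comm` is assumed.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# Two small, already sorted sample files (hypothetical contents).
printf '%s\n' a b c > left.txt
printf '%s\n' b c d > right.txt

# -1 and -2 can be written together as -12: only the common lines remain.
common=$(comm -12 left.txt right.txt)
echo "$common"

# Separate the three columns with a comma instead of a tab.
delimited=$(comm --output-delimiter=, left.txt right.txt)
echo "$delimited"
```

With a comma as the delimiter, a line from column two is prefixed by one comma and a line from column three by two, which makes the column of each line easy to read off.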
Example: first pull a handful of lines out of the dictionary, in order, into two files, which saves having to sort them:
aliyunzixun@xxx.com:/tmp$ sed -n '5p;1001p;3000p;4000p;5000p;7000p;8800p;9900p;10000p' /usr/share/dict/american-english > file1
aliyunzixun@xxx.com:/tmp$ sed -n '2p;4000p;5000p;8888p;10000p;30000p;40000p' /usr/share/dict/american-english > file2
aliyunzixun@xxx.com:/tmp$ cat file1
ABM's
Ashikaga's
Charybdis's
Decker
Eurasia
Idaho's
Lipizzaner
Meghan's
Merck's
aliyunzixun@xxx.com:/tmp$ cat file2
A's
Decker
Eurasia
Lombard's
Merck's
collaborated
elms

Compare the two files:
aliyunzixun@xxx.com:/tmp$ comm file1 file2
    A's
ABM's
Ashikaga's
Charybdis's
        Decker
        Eurasia
Idaho's
Lipizzaner
    Lombard's
Meghan's
        Merck's
    collaborated
    elms

Show only the lines unique to file1:
This requires suppressing columns 2 and 3:
aliyunzixun@xxx.com:/tmp$ comm -2 -3 file1 file2
ABM's
Ashikaga's
Charybdis's
Idaho's
Lipizzaner
Meghan's

Show only the lines unique to file2:
aliyunzixun@xxx.com:/tmp$ comm -1 -3 file1 file2
A's
Lombard's
collaborated
elms

Show only the lines common to both files:
aliyunzixun@xxx.com:/tmp$ comm -1 -2 file1 file2
Decker
Eurasia
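Merck's

Show only the lines that are not common to both files:
The trailing sed strips the leading \t (tab) from every line that starts with one:
aliyunzixun@xxx.com:/tmp$ comm -3 file1 file2 | sed 's/^\t//'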
A's
ABM's
Ashikaga's
Charybdis's
Idaho's
Lipizzaner
Lombard's
Meghan's
collaborated
elms

The comm command

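The comm command compares two files. By adjusting its output with options, it can perform intersection, difference, and set difference operations.

- Intersection: prints the lines that both files have in common.

- Difference: prints the lines that the given files do not share, that is, lines that appear in one file but not the other (the symmetric difference).

- Set difference: prints the lines that are in file A but not in the other given file.

Note that comm requires sorted files as input. On Linux, the sort command can be used to sort them.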
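
When the inputs are not sorted yet, they can be sorted on the way into comm. The following is a minimal sketch, assuming two hypothetical files `x.txt` and `y.txt` with made-up contents; it relies on comm treating a file operand of `-` as standard input.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# Two unsorted sample files (hypothetical names and contents).
printf '%s\n' pear apple fig > x.txt
printf '%s\n' fig kiwi apple > y.txt

# Sort one file to disk, then stream the other through sort into comm.
# The operand "-" tells comm to read that input from standard input.
sort y.txt > y.sorted
common=$(sort x.txt | comm -12 - y.sorted)
echo "$common"
```

In bash, process substitution (`comm -12 <(sort x.txt) <(sort y.txt)`) should give the same result without the intermediate file.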

comm in practice

Create two text files with the following contents:


$ cat A.txt
apple
orange
gold
silver
steel
iron

$ cat B.txt
orange
gold
cookies
carrot

At this point the lines in both files are unsorted, so sort them with sort.

Usage: sort [option] [file]. The -o option names the file to write the output to.

sort -o A.txt A.txt
sort -o B.txt B.txt
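
Before handing a file to comm, it can help to confirm that it really is sorted. The sketch below uses `sort -c`, which only checks the order and reports the result through its exit status; the variable names `before` and `after` are made up for this example.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# Recreate the unsorted A.txt from the walkthrough.
printf '%s\n' apple orange gold silver steel iron > A.txt

# sort -c exits non-zero (and complains on stderr) if the file is out of order.
if sort -c A.txt 2>/dev/null; then before=sorted; else before=unsorted; fi

# Sort in place, then check again.
sort -o A.txt A.txt
if sort -c A.txt 2>/dev/null; then after=sorted; else after=unsorted; fi

echo "before: $before, after: $after"
```

Both sort and comm follow the collation order of the current locale, so it is safest to run them under the same locale settings, for example by prefixing both commands with `LC_ALL=C`.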

(1) First, run comm without any options:

$ comm A.txt B.txt
apple
    carrot
    cookies
        gold
iron
        orange
silver
steel

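The first column of the output contains the lines that appear only in A.txt, the second column contains the lines that appear only in B.txt, and the third column contains the lines common to both files. The columns are separated by a tab character (\t).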
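
Because the column separator is a tab, it is hard to tell in a terminal which column a line belongs to. A minimal sketch that makes the separators visible, assuming GNU `cat` (whose `-A` option shows tabs as `^I` and line ends as `$`) and the sorted sample files from this walkthrough:

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# The sorted sample files from the walkthrough.
printf '%s\n' apple gold iron orange silver steel > A.txt
printf '%s\n' carrot cookies gold orange > B.txt

# Show every tab as ^I so the column of each line is unambiguous.
visible=$(comm A.txt B.txt | cat -A)
echo "$visible"
```

Lines from column two show up with one marker, such as `^Icarrot$`, and lines from column three with two, such as `^I^Igold$`.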

(2) To print the intersection of the two files, suppress the first and second columns so that only the third is printed:

$ comm -1 -2 A.txt B.txt
gold
orange
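
The set difference described earlier follows the same pattern: keep one of the "unique" columns and suppress the other two. A short sketch using the same sample data; the variable names `only_a` and `only_b` are made up for this example.

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# The sorted sample files from the walkthrough.
printf '%s\n' apple gold iron orange silver steel > A.txt
printf '%s\n' carrot cookies gold orange > B.txt

# A minus B: suppress columns 2 and 3, keeping lines unique to A.txt.
only_a=$(comm -2 -3 A.txt B.txt)
echo "$only_a"

# B minus A: suppress columns 1 and 3, keeping lines unique to B.txt.
only_b=$(comm -1 -3 A.txt B.txt)
echo "$only_b"
```

Since only one column remains in each case, the output carries no leading tabs and needs no further cleanup.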

(3) Print the lines that differ between the two files:

$ comm -3 A.txt B.txt
apple
    carrot
    cookies
iron
silver
steel

(4) To make the output easier to use, remove the blank fields and merge the two columns into one:

Use the sed command to format the output.

sed - stream editor for filtering and transforming text

$ comm -3 A.txt B.txt | sed 's/^\t//'
apple
carrot
cookies
iron
silver
steel

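Explanation of the sed command: sed receives the output of comm through the pipe and deletes the \t character at the start of each line. The s in sed stands for substitute. /^\t/ matches a \t at the start of a line (^ is the start-of-line anchor). // is the empty string that replaces the leading \t. As a result, every leading \t is removed.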
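
Treating \t as a tab inside a sed pattern is a GNU extension, so other sed implementations may not handle it the same way. A sketch of one possible alternative using tr, on the assumption that the lines themselves contain no tab characters:

```shell
# Work in a scratch directory so no existing file is overwritten.
cd "$(mktemp -d)"

# The sorted sample files from the walkthrough.
printf '%s\n' apple gold iron orange silver steel > A.txt
printf '%s\n' carrot cookies gold orange > B.txt

# comm -3 prefixes lines unique to B.txt with a single tab; tr deletes it.
merged=$(comm -3 A.txt B.txt | tr -d '\t')
echo "$merged"
```

Note that tr -d removes every tab on the line, not just the leading one, which is why this only suits input without embedded tabs.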