Notes on Linux Text-Processing Tools

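Shuffle lines of multiple files

I have 1000 text files (0.txt to 999.txt), each about 11 MB, roughly 11 GB in total, and I want to merge their lines in random order into a single file.

I tried this:

cat *.txt | shuf > random

Memory usage reached 96% at around the 8-second mark and then stopped rising. The whole run finished in about 55 seconds, and the output was exactly what I needed.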
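The same pipeline can be tried on a small scale before running it on the full data set. The sketch below is only an illustration: the three generated files and the temporary directory are stand-ins, not the original data. It shuffles them and then checks that the output holds the same lines as the input, only reordered. GNU shuf generally reads its whole input into memory before writing anything, which would be consistent with the memory plateau described above.

```shell
# Small stand-in for the 1000 input files (hypothetical names and contents).
workdir=$(mktemp -d)
for i in 0 1 2; do
    seq $((i * 100 + 1)) $((i * 100 + 100)) > "$workdir/$i.txt"
done

# Concatenate all inputs and shuffle their lines into one file.
cat "$workdir"/*.txt | shuf > "$workdir/random"

# Sanity check: after sorting, input and output must be identical.
cat "$workdir"/*.txt | sort -n > "$workdir/input.sorted"
sort -n "$workdir/random" > "$workdir/random.sorted"

line_count=$(wc -l < "$workdir/random")
if cmp -s "$workdir/input.sorted" "$workdir/random.sorted"; then
    same_lines=yes
else
    same_lines=no
fi
echo "lines: $line_count, same content: $same_lines"
```

If the input is larger than the available memory, this approach may no longer work as is, and a different strategy would be needed.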

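Output the n-th through m-th words

Given a text file that may have multiple lines, each containing several words separated by spaces, output the 100th through 500th words (the closed interval [100, 500]).

tr '\n' ' ' < inputfile | cut -d' ' -f 100-500 > outputfile

Here tr first turns every newline into a space, so the whole file becomes one line and cut can count fields across the original line boundaries.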
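A small worked example may make the behavior easier to check. The sketch below uses a hypothetical sample file and the range [3, 5] instead of [100, 500]. It also shows an awk variant, which is an alternative and not the original command, for inputs that contain repeated spaces or blank lines. In that case cut counts empty fields, whereas awk's default field splitting skips runs of whitespace.

```shell
workdir=$(mktemp -d)

# Clean input: single spaces, no blank lines.
printf 'a b c\nd e f\ng h\n' > "$workdir/inputfile"
picked=$(tr '\n' ' ' < "$workdir/inputfile" | cut -d' ' -f 3-5)
echo "$picked"

# Messy input: a double space and an empty line.
printf 'a  b c\n\nd e f\ng h\n' > "$workdir/messy"
# Count words across lines and keep those whose index is in [lo, hi].
picked_awk=$(awk -v lo=3 -v hi=5 '
    {
        for (i = 1; i <= NF; i++) {
            n++
            if (n >= lo && n <= hi)
                out = out (out == "" ? "" : " ") $i
        }
    }
    END { print out }' "$workdir/messy")
echo "$picked_awk"
```

Both commands print the words c, d and e for their respective inputs.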

Convert an edgelist to CSV

Convert an edgelist file to a CSV file: add a "source,target" header line at the top and replace the spaces with commas.

sed -e '1i source,target' -e 's/ /,/g' test.edgelist > test.csv
or
awk 'BEGIN{print "source,target"}{print $1","$2}' test.edgelist > test.csv
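The two commands give the same result for a plain two-column edgelist, but they differ when a line has extra columns. The sketch below assumes GNU sed (the one-line form of the i command is a GNU extension) and uses a hypothetical weighted edgelist to show the difference: sed converts every space, so the third column is kept, while awk prints only the first two fields.

```shell
workdir=$(mktemp -d)

# Hypothetical edgelist with a third (weight) column.
printf '1 2 0.5\n2 3 0.7\n' > "$workdir/weighted.edgelist"

# sed: insert the header before line 1, then replace every space.
sed_out=$(sed -e '1i source,target' -e 's/ /,/g' "$workdir/weighted.edgelist")

# awk: print the header once, then only the first two fields of each line.
awk_out=$(awk 'BEGIN{print "source,target"}{print $1","$2}' "$workdir/weighted.edgelist")

echo "sed result:"; echo "$sed_out"
echo "awk result:"; echo "$awk_out"
```

Another difference is that on an empty input file the sed command prints nothing at all, because there is no line 1 to insert before, while the awk command still prints the header.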

Tabular output

Print a CSV file as aligned columns; -t builds the table and -s sets the input delimiter:

column -t -s ',' result.csv

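Set operations

comm takes two sorted files and prints three columns: lines found only in the first file, lines found only in the second file, and lines found in both.
The options -1, -2 and -3 suppress the corresponding columns.

Intersection (lines in both):
comm -12 <(sort test|uniq) <(sort test1|uniq)
Example: comm -12 <(ls) <(ls|head)

Difference 1 (lines in the second input but not in the first):
comm -13 <(sort test|uniq) <(sort test1|uniq)
Example: comm -13 <(ls) <(ls|head)

Difference 2 (lines in the first input but not in the second):
comm -23 <(sort test|uniq) <(sort test1|uniq)
Example: comm -23 <(ls) <(ls|head)

Union:
cat test test1 |sort|uniq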
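The four operations can be checked on two small files. This is a minimal sketch with hypothetical sample contents. It writes the sorted, de-duplicated inputs to temporary files instead of using process substitution, and uses sort -u, which gives the same result as sort | uniq.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: unsorted, with one duplicate line.
printf 'b\na\nc\na\n' > "$workdir/test"
printf 'c\nd\nb\n' > "$workdir/test1"

# comm requires sorted input.
sort -u "$workdir/test" > "$workdir/test.sorted"     # a b c
sort -u "$workdir/test1" > "$workdir/test1.sorted"   # b c d

# Keep only column 3: lines present in both files.
intersection=$(comm -12 "$workdir/test.sorted" "$workdir/test1.sorted")
# Keep only column 2: lines present only in the second file.
only_second=$(comm -13 "$workdir/test.sorted" "$workdir/test1.sorted")
# Keep only column 1: lines present only in the first file.
only_first=$(comm -23 "$workdir/test.sorted" "$workdir/test1.sorted")
# Union: all distinct lines from both files.
union=$(sort -u "$workdir/test" "$workdir/test1")

echo "intersection:" $intersection
echo "only in test1:" $only_second
echo "only in test:" $only_first
echo "union:" $union
```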

Original article: https://www.cnblogs.com/maxuewei2/p/10234648.html
