Background: a batch of records arrives from Kafka; after they are received they need to be processed and written to the database. We consume them on a single thread, 30,000 records per poll. In object terms the batch can be modeled as a List<Person>. For business reasons, the records must first be deduplicated in memory by the primary key personId (duplicates are overwritten).
Before the Java 8 features existed, the natural approach was to convert the list into a map keyed by personId; when putting entries, a later record with the same personId overwrites the earlier one.
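For reference, a minimal sketch of that classic approach (not the original author's code), assuming Person exposes a getPersonId() getter as in the test case further below:

public static List<Person> coverDuplicateClassic(List<Person> sourceList) {
    // Key the map by personId: a later put() overwrites the earlier entry,
    // and LinkedHashMap preserves the first-seen key order.
    Map<String, Person> map = new LinkedHashMap<>();
    for (Person p : sourceList) {
        map.put(p.getPersonId(), p);
    }
    return new ArrayList<>(map.values());
}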
The new features in Java 8 handle this situation more elegantly. Two cases are covered:
(1) The overwrite order does not matter; only one record per personId is kept
public static List<Person> coverDuplicate(List<Person> sourceList) {
    // CollectionUtils is the Apache Commons / Spring utility; a plain null-or-empty check works too.
    if (CollectionUtils.isEmpty(sourceList)) {
        return new ArrayList<>();
    }
    // Collect into a TreeSet ordered by personId so duplicate keys are dropped,
    // then copy the surviving elements back into an ArrayList.
    return sourceList.stream().collect(
            Collectors.collectingAndThen(
                    Collectors.toCollection(
                            () -> new TreeSet<>(Comparator.comparing(Person::getPersonId))),
                    ArrayList::new));
}
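Note that with this TreeSet-based collector the first record seen for a given personId wins: TreeSet.add rejects an element the comparator considers equal to one already present, so later duplicates are silently dropped. That is acceptable here because case (1) does not care which record is kept.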
(2) For the same personId, later records must overwrite earlier ones
public static List<Person> coverDuplicate1(List<Person> sourceList) {
    if (CollectionUtils.isEmpty(sourceList)) {
        return new ArrayList<>();
    }
    // toMap keys by personId; the merge function (e1, e2) -> e2 keeps the later record.
    return sourceList.stream()
            .collect(Collectors.toMap(Person::getPersonId, Function.identity(), (e1, e2) -> e2))
            .values().stream()
            .collect(Collectors.toList());
}
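One caveat: Collectors.toMap builds a HashMap, so the deduplicated list does not keep the original arrival order. If the Kafka arrival order matters, the four-argument toMap overload that accepts a map supplier can be used instead; a minimal sketch, not part of the original code:

List<Person> orderedDistinct = new ArrayList<>(
        sourceList.stream()
                .collect(Collectors.toMap(
                        Person::getPersonId,
                        Function.identity(),
                        (e1, e2) -> e2,       // later record still wins
                        LinkedHashMap::new))  // keep first-seen key order
                .values());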
Test case:
public class Person {
    private String personId;
    private String name;
    private Integer operateTag;
    // all-args constructor and getters assumed (e.g. generated by Lombok or written by hand)
}

public static void main(String[] args) {
    Person p1 = new Person("1", "111", 1);
    Person p2 = new Person("1", "222", 0);
    Person p3 = new Person("3", "333", 1);
    Person p4 = new Person("4", "444", 0);
    Person p5 = new Person("4", "555", 1);

    List<Person> sourceList = new ArrayList<>();
    sourceList.add(p1);
    sourceList.add(p2);
    sourceList.add(p3);
    sourceList.add(p4);
    sourceList.add(p5);

    List<Person> unique = coverDuplicate(sourceList);
    unique.forEach(e -> System.out.println(e.getPersonId() + "," + e.getName() + "," + e.getOperateTag()));
}
Both approaches print the expected results.
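For the data above, the TreeSet variant keeps the first record per personId, so it prints 1,111,1 / 3,333,1 / 4,444,0; swapping in coverDuplicate1 keeps the last record instead, yielding 1,222,0 / 3,333,1 / 4,555,1 (the iteration order of a HashMap's values() view is not guaranteed, so the print order may differ).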