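simhash project: https://github.com/CreekLou/simhash.git
This project cannot be used as-is. It has to be modified first, usually in two places:
1. The version of the Lucene core jar does not match the version used by the other jars and by the IKAnalyzer tokenizer, so check this carefully.
In this project the core jar version was changed from 3.6.1 to 4.7.2.
------ After this change the test class TEST can be used for debugging. Before debugging, add throws IOException declarations to the code. This is needed in several places, which are not listed one by one here.
2. Once the exception declarations are in place, running the code fails with java.lang.IllegalStateException: TokenStream contract violation: reset()/close(). The cause is that the object returned by analyzer.tokenStream has to be reset after its attributes are added and before it is consumed.
Follow the order shown in the figure below to locate the spot:
------ until you reach point G
After that the project runs normally.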
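The figure with the exact location is not reproduced here, so the following is only a sketch of the call order behind the fix. In Lucene 4.x the consumer is expected to call analyzer.tokenStream(...), then addAttribute(...), then reset(), then loop over incrementToken(), and finally call end() and close(). To keep the example runnable without Lucene on the classpath, ModelTokenStream below is a hypothetical stand-in that enforces the same rule. It is not a Lucene class, and its error message is only modelled on the one quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class TokenStreamContractSketch {

    // Hypothetical stand-in for a Lucene 4.x TokenStream (NOT a Lucene class).
    // It refuses to hand out tokens until reset() has been called.
    static final class ModelTokenStream {
        private final List<String> tokens;
        private Iterator<String> cursor; // stays null until reset() is called
        private String current;

        ModelTokenStream(String text) {
            // Assumption: whitespace splitting stands in for real analysis.
            this.tokens = Arrays.asList(text.trim().split("\\s+"));
        }

        void reset() {
            cursor = tokens.iterator();
        }

        boolean incrementToken() {
            if (cursor == null) {
                throw new IllegalStateException(
                        "TokenStream contract violation: reset()/close() call missing");
            }
            if (!cursor.hasNext()) {
                return false;
            }
            current = cursor.next();
            return true;
        }

        String term() {
            return current;
        }

        void end() {
            // Nothing to finalise in this model.
        }

        void close() {
            cursor = null; // a new reset() is required before reuse
        }
    }

    // The consuming order that avoids the exception:
    // reset() -> incrementToken() loop -> end() -> close().
    static List<String> consume(ModelTokenStream stream) {
        List<String> terms = new ArrayList<>();
        stream.reset(); // the call that is missing when the exception appears
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        stream.end();
        stream.close();
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new ModelTokenStream("near duplicate detection")));
    }
}
```

In the real project the equivalent change is a single reset() call on the stream returned by analyzer.tokenStream, placed after the addAttribute call and before the first incrementToken().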
Testing in a Spring Boot project:
1. Set up the Spring Boot environment
POM.XML
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.2.RELEASE</version>
    </parent>

    <groupId>com.mwq</groupId>
    <artifactId>com-mwq-crawler-webmagic</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-core -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-extension -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <!-- Guava, the dependency for the Bloom filter -->
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>
        <!-- The locally installed simhash project -->
        <dependency>
            <groupId>com.lou</groupId>
            <artifactId>simhasher</artifactId>
            <version>0.0.1-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>
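simhash is a separate project, so we can import it like this:
Then click Next, and Next again, and the import is done. After importing the module, set its SDK. Once the package has been fixed and debugged as described above, install it:
Then declare the dependency in POM.XML.
Then test: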
package com.mwq.job.task;

import com.lou.simhasher.SimHasher;
import org.apache.commons.io.IOUtils;
import org.springframework.stereotype.Component;

import java.io.FileInputStream;
import java.io.IOException;

@Component
public class Test {

    // @Scheduled(cron = "0/5 * * * * *")
    public void testDistance() throws IOException {
        String str1 = readAllFile("D:/test/testin2.txt");
        SimHasher hash1 = new SimHasher(str1);
        System.out.println(hash1.getSignature());
        System.out.println("============================");

        String str2 = readAllFile("D:/test/testin.txt");
        SimHasher hash2 = new SimHasher(str2);
        System.out.println(hash2.getSignature());
        System.out.println("============================");

        // Hamming distance between the two signatures: smaller means more similar
        System.out.println(hash1.getHammingDistance(hash2.getSignature()));
    }

    /**
     * For testing only.
     *
     * @param filename path of the file to read
     * @return the file contents, or an empty string if the file cannot be read
     */
    public static String readAllFile(String filename) {
        String everything = "";
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream inputStream = new FileInputStream(filename)) {
            // Decodes with the platform default charset, as in the original call
            everything = IOUtils.toString(inputStream);
        } catch (IOException e) {
            // Report the failure instead of silently returning an empty string
            e.printStackTrace();
        }
        return everything;
    }
}
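To make the printed values easier to interpret, here is a minimal, self-contained sketch of what a SimHash signature and a Hamming distance are. It does not reproduce the internals of the simhasher library. The class name SimHashSketch, the whitespace tokenisation (the project itself segments text with IKAnalyzer), the FNV-1a token hash and the 64-bit signature width are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class SimHashSketch {

    // 64-bit FNV-1a hash of a token's UTF-8 bytes (an assumed choice of hash).
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Every token occurrence votes +1 or -1 on each of the 64 bit positions;
    // a signature bit is set when the positive votes win for that position.
    static long signature(String text) {
        int[] votes = new int[64];
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                sig |= 1L << bit;
            }
        }
        return sig;
    }

    // Number of bit positions in which the two signatures differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long s1 = signature("the quick brown fox jumps over the lazy dog");
        long s2 = signature("the quick brown fox jumped over the lazy dog");
        System.out.println(Long.toBinaryString(s1));
        System.out.println(Long.toBinaryString(s2));
        System.out.println(hammingDistance(s1, s2));
    }
}
```

Identical texts produce identical signatures and a distance of 0. The more two texts differ, the more bits tend to differ, which is why the distance printed by the test above can serve as a near-duplicate score.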