  • Simhash web page deduplication for crawlers: adapting the project for use and fixing java.lang.IllegalStateException: TokenStream contract violation: reset()/close()

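    simhash project: https://github.com/CreekLou/simhash.git

    The project cannot be used as-is; it has to be adapted first. Two places usually need changes:

    1. The version of the Lucene core jar does not match the version used by the other jars and by the IK Analyzer tokenizer (ikanalyzer), so check that they are consistent.

        Here the core jar version was changed from 3.6.1 to 4.7.2.

    After this change the project can be debugged with the test class TEST. Before debugging, the affected methods must declare throws IOException. There are several such places; they are not listed one by one here.

    2. Once the exceptions are declared, running the code fails with java.lang.IllegalStateException: TokenStream contract violation: reset()/close(). The cause is that the object returned by analyzer.tokenStream must be reset, by calling reset() after its attributes have been added and before the first incrementToken() call.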

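    Follow the screenshots in order to find the location:

    (screenshots omitted)

    The place to change is the point marked G.

    After that, the program runs normally.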
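
    For reference, the consuming workflow that Lucene 4.x expects from a TokenStream can be sketched as below. This is a minimal illustration, not the actual code of the simhash project: the class name TokenStreamDemo, the method tokenize and the field name "content" are made up for this example, and it assumes lucene-core 4.7.2 is on the classpath. The analyzer is passed in, so IK Analyzer or any other Analyzer can be used.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, written only to illustrate the TokenStream contract.
public class TokenStreamDemo {

    // Returns the terms produced by the given analyzer for the given text.
    public static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // "content" is an arbitrary field name chosen for this sketch.
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        try {
            // 1. Register the attributes you want to read.
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            // 2. reset() must be called before the first incrementToken();
            //    skipping it triggers the IllegalStateException in Lucene 4.x.
            stream.reset();
            // 3. Consume the tokens.
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            // 4. Signal end of stream, then release resources in finally.
            stream.end();
        } finally {
            stream.close();
        }
        return terms;
    }
}
```

    In the simhash project, the missing step is most likely number 2: a reset() call between addAttribute and the incrementToken() loop.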

    Testing in a Spring Boot project:

    1. Set up the Spring Boot environment

    pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <parent>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-parent</artifactId>
            <version>2.0.2.RELEASE</version>
        </parent>
        <groupId>com.mwq</groupId>
        <artifactId>com-mwq-crawler-webmagic</artifactId>
        <version>1.0-SNAPSHOT</version>
        <properties>
            <java.version>1.8</java.version>
        </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-core -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.7.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- https://mvnrepository.com/artifact/us.codecraft/webmagic-extension -->
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.7.3</version>
        </dependency>
        <dependency>
            <!-- Guava, the dependency used for the Bloom filter -->
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>
        <dependency>
            <groupId>com.lou</groupId>
            <artifactId>simhasher</artifactId>
            <version>0.0.1-SNAPSHOT</version>
        </dependency>
    </dependencies>
    </project>

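    simhash is a separate project, so it can be imported as a module like this:

    (screenshot omitted)

    Then click Next through the remaining steps. After importing the module, set its SDK, fix the project as described above, and install it into the local Maven repository:

    Then declare the dependency in pom.xml.

    Then test: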

    package com.mwq.job.task;
    
    import com.lou.simhasher.SimHasher;
    import org.apache.commons.io.IOUtils;
    import org.springframework.stereotype.Component;
    
    import java.io.FileInputStream;
    import java.io.IOException;
    
    @Component
    public class Test {
     // @Scheduled(cron = "0/5 * * * * *")
        public void testDistance() throws IOException {
            String str1 = readAllFile("D:/test/testin2.txt");
            SimHasher hash1 = new SimHasher(str1);
            System.out.println(hash1.getSignature());
            System.out.println("============================");
    
            String str2 = readAllFile("D:/test/testin.txt");
            SimHasher hash2 = new SimHasher(str2);
            System.out.println(hash2.getSignature());
            System.out.println("============================");
    
            System.out.println(hash1.getHammingDistance(hash2.getSignature()));
    
        }
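        /**
         * Test helper: reads a whole file into a string.
         * @param filename path of the file to read
         * @return the file content, or an empty string if the file cannot be read
         */
        public static String readAllFile(String filename) {
            String everything = "";
            // try-with-resources closes the stream even if reading fails
            try (FileInputStream inputStream = new FileInputStream(filename)) {
                // Assumes the test files are UTF-8 encoded; change the charset if they are not
                everything = IOUtils.toString(inputStream, "UTF-8");
            } catch (IOException e) {
                // Report the failure instead of silently returning an empty string
                e.printStackTrace();
            }
    
            return everything;
        }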
    }
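
    To make the printed signatures and the Hamming distance easier to interpret, here is a minimal, self-contained sketch of how a 64-bit simhash is typically computed and compared. It is not the implementation inside com.lou.simhasher.SimHasher: the class SimHashSketch and its methods are hypothetical, tokens are split on whitespace instead of going through IK Analyzer, each token is weighted by its frequency, and FNV-1a is used as the per-token hash purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the simhash idea; not the SimHasher class from the project.
public class SimHashSketch {

    // 64-bit FNV-1a hash of a token (an arbitrary choice of hash for this sketch).
    public static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Builds a 64-bit signature: every token votes +weight or -weight on each bit.
    public static long signature(String text) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                frequency.merge(token, 1, Integer::sum);
            }
        }
        int[] votes = new int[64];
        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            long hash = fnv1a64(entry.getKey());
            for (int bit = 0; bit < 64; bit++) {
                if (((hash >>> bit) & 1L) == 1L) {
                    votes[bit] += entry.getValue();
                } else {
                    votes[bit] -= entry.getValue();
                }
            }
        }
        long signature = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                signature |= (1L << bit);
            }
        }
        return signature;
    }

    // Number of bit positions in which the two signatures differ.
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Treats two texts as near-duplicates when their signatures differ in at most maxDistance bits.
    public static boolean isNearDuplicate(String a, String b, int maxDistance) {
        return hammingDistance(signature(a), signature(b)) <= maxDistance;
    }
}
```

    A small Hamming distance means the two pages are probably near-duplicates. A threshold of 3 is a commonly cited choice for 64-bit fingerprints, but it should be treated as a tunable parameter and checked against your own crawl data.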
  • Original article: https://www.cnblogs.com/mwq1992/p/14218900.html