  • Crawler for scraping the 5 major portal sites and e-commerce data, day 1: basic environment setup

    Recently I have wanted to use a crawler to scrape the five major portal sites (Sohu, Sina, NetEase, Tencent, Phoenix) and e-commerce data (Tmall, JD, Jumei, etc.). Today, day one, is about setting up the environment and running a quick test.

    The stack: Maven + XPath + HttpClient + regular expressions.

    Maven pom.xml configuration (dependency list):

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-graphx_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
    <!-- httpclient 4.4 -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.4</version>
    </dependency>
    <!-- htmlcleaner -->
    <dependency>
        <groupId>net.sourceforge.htmlcleaner</groupId>
        <artifactId>htmlcleaner</artifactId>
        <version>2.10</version>
    </dependency>
    <!-- json -->
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20140107</version>
    </dependency>
    <!-- hbase -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>0.96.1.1-hadoop2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>0.96.1.1-hadoop2</version>
    </dependency>
    <!-- redis 2.7.0 -->
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.7.0</version>
    </dependency>
    <!-- slf4j -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.10</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.10</version>
    </dependency>
    <!-- quartz 1.8.4 -->
    <dependency>
        <groupId>org.quartz-scheduler</groupId>
        <artifactId>quartz</artifactId>
        <version>1.8.4</version>
    </dependency>
    <!-- curator -->
    <dependency>
        <groupId>org.apache.curator</groupId>
        <artifactId>curator-framework</artifactId>
        <version>2.7.1</version>
    </dependency>
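
    Note that the block above only lists the <dependency> entries. In a full pom.xml they have to sit inside a <dependencies> element; a minimal enclosing skeleton might look like the following (the com.example coordinates are placeholders, not part of the original project):

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                                 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <!-- placeholder coordinates; replace with your own -->
        <groupId>com.example</groupId>
        <artifactId>spider</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <!-- the <dependency> entries listed above go here -->
        </dependencies>
    </project>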

    Create a test class: SpiderTest

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.apache.http.util.EntityUtils;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class SpiderTest {

        private static final Logger logger = LoggerFactory.getLogger(SpiderTest.class);

        /**
         * Crawl entry point: download the page at the given URL.
         * @param url the page URL
         * @return the page content, or null if the download failed
         */
        public static String downLoadCrawlurl(String url) {
            String context = null;
            CloseableHttpClient client = HttpClientBuilder.create().build();
            HttpGet httpGet = new HttpGet(url);
            // try-with-resources so the response is always closed
            try (CloseableHttpResponse response = client.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                context = EntityUtils.toString(entity);
                System.out.println("context:" + context);
            }
            catch (ClientProtocolException e) {
                logger.error("protocol error while downloading " + url, e);
            }
            catch (IOException e) {
                logger.error("I/O error while downloading " + url, e);
            }
            return context;
        }

        public static void main(String[] args) {
            String url = "http://money.163.com/";
            downLoadCrawlurl(url);
        }
    }
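
    The stack mentioned at the top also includes XPath and regular expressions, which the test above does not exercise yet. As a minimal sketch of where they fit, the hypothetical ParseTest class below feeds the downloaded HTML into HtmlCleaner (already in the pom) and runs an XPath query plus a rough link regex; the //title expression and the href pattern are illustrative assumptions, not the final extraction rules.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import org.htmlcleaner.XPatherException;

    public class ParseTest {

        public static void main(String[] args) throws XPatherException {
            // Reuse the downloader from SpiderTest to fetch the page.
            String html = SpiderTest.downLoadCrawlurl("http://money.163.com/");
            if (html == null) {
                return; // download failed, nothing to parse
            }

            // XPath via HtmlCleaner: grab the page title (illustrative expression).
            HtmlCleaner cleaner = new HtmlCleaner();
            TagNode root = cleaner.clean(html);
            Object[] titleNodes = root.evaluateXPath("//title");
            if (titleNodes.length > 0) {
                System.out.println("title: " + ((TagNode) titleNodes[0]).getText().toString().trim());
            }

            // Regular expression: collect absolute href values (rough, illustrative pattern).
            Pattern linkPattern = Pattern.compile("<a[^>]+href=\"(http[^\"]+)\"");
            Matcher matcher = linkPattern.matcher(html);
            while (matcher.find()) {
                System.out.println("link: " + matcher.group(1));
            }
        }
    }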