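  • Python Crawler Example (6): Downloading Novels from Jinyongwang (金庸网) with Multiple Processes

    Goal: use multiple processes to download every edition (old, revised, and newly revised) of the novels on Jinyongwang.

    The code is as follows: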

    # -*- coding: utf-8 -*-
    # Python 3 version. Requires the third-party packages requests and lxml.
    import os
    from multiprocessing import Pool

    import requests
    from lxml import etree

    BASE_URL = 'http://www.jinyongwang.com'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}


    def fetch_html(url):
        """Download a page and parse it into an lxml element tree."""
        response = requests.get(url, headers=headers, timeout=30)
        return etree.HTML(response.text)


    def download(title, url, filename):
        """Append one chapter (title, then paragraphs) to the novel's text file."""
        html = fetch_html(url)
        # Skip the first two matches (presumably not part of the chapter text).
        pages = html.xpath('//div//p/text()')[2:]
        # Open the file once instead of reopening it for every paragraph.
        with open(filename, 'a', encoding='utf-8') as f:
            f.write(title + '\n')
            for page in pages:
                f.write(page + '\n')


    def main(url):
        """Download every chapter of one novel; url is a site-relative path."""
        start_url = BASE_URL + url
        # The second-to-last path segment is the novel's short name.
        sname = start_url.split('/')[-2]
        # Choose the output folder from the first letter of the short name.
        if sname.startswith('o'):
            folder = 'old/'
        elif sname.startswith('n'):
            folder = 'new/'
        else:
            folder = 'now/'
        # exist_ok avoids an error if another worker creates the folder first.
        os.makedirs(folder, exist_ok=True)
        filename = folder + sname + '.txt'
        html = fetch_html(start_url)
        # Read the href and the title from the same <a> element so that
        # the two cannot get out of step.
        for link in html.xpath('//ul[@class="mlist"]/li/a'):
            title = ''.join(link.itertext())
            download(title, BASE_URL + link.get('href'), filename)


    if __name__ == '__main__':
        html = fetch_html(BASE_URL + '/')
        # Collect the site-relative URL of every novel listed on the home page.
        urls = html.xpath('//li[@class="book_li"]/p[3]//a/@href')
        # Pool() starts one worker per CPU core; each worker handles whole novels.
        pool = Pool()
        pool.map(main, urls)
        pool.close()
        pool.join()
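
    If main raises inside a worker (for example on a network error), pool.map re-raises that exception in the parent process and the caller gets no per-URL report. The following is a minimal sketch of one way to collect failures instead, not the author's method: fake_download is a hypothetical stand-in for main that needs no network access, and safe_call and the sample URLs are likewise illustrative assumptions.

```python
from multiprocessing import Pool


def fake_download(url):
    """Hypothetical stand-in for main(url): fails for one URL, succeeds otherwise."""
    if 'bad' in url:
        raise ValueError('simulated network error for ' + url)
    return len(url)


def safe_call(url):
    """Run the worker and return (url, error message or None) instead of raising."""
    try:
        fake_download(url)
        return url, None
    except Exception as exc:  # report any failure back to the parent process
        return url, str(exc)


if __name__ == '__main__':
    urls = ['/fei/', '/bad/', '/nfei/']  # hypothetical sample URLs
    with Pool(2) as pool:
        results = pool.map(safe_call, urls)
    # Keep only the URLs whose download failed, with the reason.
    failed = [(u, err) for u, err in results if err is not None]
    print(failed)
```

    In the real script, safe_call would wrap main, and the failed list could be passed to pool.map a second time to retry.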

    Results: each novel is saved as one .txt file under old/, new/, or now/.
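
    To make the resulting file layout concrete, the sketch below isolates the path rule used in main as a pure function. The name output_path and the sample URLs are hypothetical and only serve the illustration; the folder rule itself is copied from the script.

```python
def output_path(url):
    """Return the output file path that main() would use for a site-relative URL."""
    # Same rule as main(): the second-to-last path segment is the short name.
    sname = url.split('/')[-2]
    if sname.startswith('o'):
        folder = 'old/'
    elif sname.startswith('n'):
        folder = 'new/'
    else:
        folder = 'now/'
    return folder + sname + '.txt'


if __name__ == '__main__':
    for sample in ['/ofei/', '/nfei/', '/fei/']:  # hypothetical sample URLs
        print(sample, '->', output_path(sample))
```

    The rule looks only at the first letter, so any short name that happens to begin with o or n is filed under old/ or new/ regardless of its edition.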

  • Original article: https://www.cnblogs.com/xinyangsdut/p/7766123.html