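  • Scraping issued documents so nobody has to download them by hand

    This script scrapes the documents issued on a government website, so nobody has to download them one by one.

    Someone on 博问 (the cnblogs Q&A board) could not work out how to do this, so I wrote it up.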

    Do not add multithreading to this under any circumstances.
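
The post does not say why threads are off limits. One plausible reading, which is an assumption here, is that parallel requests would put needless load on a small government site. Under that assumption, a minimal sketch of the opposite approach is to fetch strictly one URL at a time and pause between requests. The helper name `fetch_sequentially` and its parameters are hypothetical and not part of the original script.

```python
import time
from typing import Callable, Iterable, List


def fetch_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in order, pausing between consecutive requests.

    `fetch` is any callable that downloads one URL and returns its bytes.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

With `requests`, the `fetch` argument could be something like `lambda u: requests.get(u, timeout=10).content`. The one-second default is an arbitrary choice, not a value taken from the post.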

    For learning purposes only; do not use it commercially.

    import re

    import requests
    from lxml import etree

    # Listing page of issued documents.
    url = 'http://www.liyang.gov.cn/default.php?mod=article&fid=163250&s99679207_start=0'
    rp = requests.get(url)
    list_html = etree.HTML(rp.text)

    # Each listing row holds a link to a detail page and the document title.
    url_xpath = '//*[@id="s99679207_content"]/table/tbody/tr/td/span[1]/span/a/@href'
    title_xpath = '//*[@id="s99679207_content"]/table/tbody/tr/td/span[1]/span/a/text()'
    url_list = list_html.xpath(url_xpath)
    title_list = list_html.xpath(title_xpath)

    for url_end, title in zip(url_list, title_list):
        new_url = f'http://www.liyang.gov.cn/{url_end}'
        print(new_url)
        rp_1 = requests.get(new_url)
        try:
            # First choice: the href of the attachment link in the first table row.
            detail_html = etree.HTML(rp_1.text)
            data_url_xpath = '//tbody/tr[1]/td[2]/a/@href'
            data_url = detail_html.xpath(data_url_xpath)[0]
        except Exception:
            # Fallback: the first link in the raw HTML that opens in a new tab.
            data_list = re.findall('<a href="(.*?)" target="_blank">', rp_1.text)
            data_url = data_list[0]
        print(data_url)
        data_url = f'http://www.liyang.gov.cn/{data_url}'
        # Keep the name `re` for the regex module; the response gets its own name.
        rp_2 = requests.get(data_url)
        with open(f'{title}.pdf', 'wb') as fw:
            fw.write(rp_2.content)
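
The script builds each URL by pasting the `href` onto the site root and uses the link text directly as a file name. Two cases could trip that up: an `href` that is already absolute or starts with `/`, and a title containing characters that are not valid in file names. A minimal sketch of how both might be handled follows; `absolute_url`, `safe_filename`, and the `"untitled"` fallback are hypothetical names introduced for illustration, and whether this site actually produces such hrefs or titles is not something the post confirms.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.liyang.gov.cn/'


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a relative, root-relative, or absolute href against the site root."""
    return urljoin(base, href)


def safe_filename(title: str, suffix: str = '.pdf') -> str:
    """Replace runs of whitespace and characters that are illegal in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title.strip()).strip('_')
    return f'{cleaned or "untitled"}{suffix}'
```

In the loop, `absolute_url(url_end)` would stand in for the f-string and `safe_filename(title)` for `f'{title}.pdf'`.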
    
  • Original article: https://www.cnblogs.com/pythonywy/p/11279269.html