  • Python Crawler Series 06: Classical Chinese Poetry

    Read ten thousand books, and your pen will move as if guided by the gods.

    import requests
    import re
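    def parse_page(url):
        # Send a browser-like User-Agent header with the request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
        }
        # headers must be passed by keyword; the second positional argument of requests.get is params.
        response = requests.get(url, headers=headers)
        text = response.text
        # re.DOTALL lets "." match newlines, since each poem's markup spans several lines.
        titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
        dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
        authors = re.findall(r'<p class="source".*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
        contents = re.findall(r'<div class="contson" .*?>(.*?)</div>', text, re.DOTALL)
        # Strip leftover HTML tags (such as <br />) and surrounding whitespace from each poem body.
        poems = []
        for content in contents:
            x = re.sub(r'<.*?>', "", content)
            poems.append(x.strip())
        # Combine the parallel lists into one dict per poem, using the cleaned text.
        results = []
        for title, dynasty, author, body in zip(titles, dynasties, authors, poems):
            poem = {
                'title': title,
                'dynasty': dynasty,
                'author': author,
                'content': body
            }
            results.append(poem)
        for poem in results:
            print(poem)
            print('*'*40)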
    
    def main():
        # Example URL for a single page: https://www.gushiwen.org/default_1.aspx
        # Crawl pages 1 through 9.
        for x in range(1,10):
            url = "https://www.gushiwen.org/default_%s.aspx" % x
            parse_page(url)
    
    if __name__ == "__main__":
        main()
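
    The regex extraction can be tried offline before hitting the site. The following is a minimal sketch, not the author's code: SAMPLE_HTML and extract_poems are hypothetical names, and the sample markup is made up to fit the class names the patterns look for, so the live page may be structured differently.

```python
import re

# Hypothetical sample markup, shaped only after the class names the
# patterns search for ("cont", "source", "contson"); the real site's
# HTML may differ.
SAMPLE_HTML = """
<div class="cont">
  <p><a href="#"><b>静夜思</b></a></p>
  <p class="source"><a href="#">唐代</a><span>:</span><a href="#">李白</a></p>
  <div class="contson" id="contson1">
    床前明月光<br />疑是地上霜
  </div>
</div>
"""


def extract_poems(text):
    """Return one dict per poem found in the given HTML text."""
    # re.DOTALL lets "." match newlines, so a pattern can span lines.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    bodies = re.findall(r'<div class="contson".*?>(.*?)</div>', text, re.DOTALL)
    # Remove leftover tags such as <br /> and trim surrounding whitespace.
    contents = [re.sub(r'<.*?>', '', body).strip() for body in bodies]
    return [
        {'title': t, 'dynasty': d, 'author': a, 'content': c}
        for t, d, a, c in zip(titles, dynasties, authors, contents)
    ]


poems = extract_poems(SAMPLE_HTML)
for poem in poems:
    print(poem)
```

    Because zip stops at the shortest list, a pattern that misses on one poem silently drops or misaligns entries, so comparing the lengths of the four lists is a cheap sanity check on a real page.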
  • Original article: https://www.cnblogs.com/kingle-study/p/9916192.html