  • Python Web Scraping Series 06: Classical Chinese Poetry

    Read ten thousand volumes, and your pen will move as if guided by the gods.

    import requests
    import re
    def parse_page(url):
        # Send a browser-like User-Agent so the site does not reject the request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
        }
        # headers must be passed by keyword; the second positional argument is params.
        response = requests.get(url, headers=headers)
        text = response.text
        # re.DOTALL lets "." match newlines, since each poem spans several lines.
        titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
        dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
        authors = re.findall(r'<p class="source".*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
        contents = re.findall(r'<div class="contson" .*?>(.*?)</div>', text, re.DOTALL)
        poems = []
        for title, dynasty, author, content in zip(titles, dynasties, authors, contents):
            poems.append({
                'title': title,
                'dynasty': dynasty,
                'author': author,
                # Strip the remaining HTML tags (e.g. <br />) from the poem body.
                'content': re.sub(r'<.*?>', '', content).strip()
            })
        for poem in poems:
            print(poem)
            print('*' * 40)
    
    def main():
        # Example page: https://www.gushiwen.org/default_1.aspx
        for page in range(1, 10):  # pages 1 through 9
            url = "https://www.gushiwen.org/default_%s.aspx" % page
            parse_page(url)
    
    if __name__ == "__main__":
        main()
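
The extraction step can be tried offline, without requesting the site. The sketch below runs the same kind of non-greedy, `re.DOTALL` patterns against a small hand-written HTML fragment. The fragment `SAMPLE_HTML` and the helper name `extract_poems` are made up for illustration and only approximate the markup the patterns expect; the real page may differ.

```python
import re

# Hypothetical fragment that mimics the structure the patterns expect;
# the real page's markup may differ.
SAMPLE_HTML = """
<div class="cont">
  <p><a href="#"><b>静夜思</b></a></p>
  <p class="source"><a href="#">唐代</a><span>:</span><a href="#">李白</a></p>
  <div class="contson" id="c1">
    床前明月光，<br />疑是地上霜。
  </div>
</div>
"""


def extract_poems(text):
    """Return a list of dicts with title, dynasty, author and content."""
    # re.DOTALL makes "." match newlines, so one pattern can span several lines.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    bodies = re.findall(r'<div class="contson".*?>(.*?)</div>', text, re.DOTALL)

    poems = []
    for title, dynasty, author, body in zip(titles, dynasties, authors, bodies):
        # Remove leftover tags such as <br /> and trim surrounding whitespace.
        content = re.sub(r'<.*?>', '', body).strip()
        poems.append({
            'title': title,
            'dynasty': dynasty,
            'author': author,
            'content': content,
        })
    return poems


poems = extract_poems(SAMPLE_HTML)
print(poems)
```

Because `zip` stops at the shortest list, a page where one pattern matches fewer times than the others silently drops poems, so comparing the lengths of the four lists is a cheap sanity check when the site's markup changes.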
  • Original article: https://www.cnblogs.com/kingle-study/p/9916192.html