  • Learning Progress 10: Python Web Crawler

    My first case study in learning web crawling is a novel crawler.

    A novel crawler starts by parsing the source of the novel's index page, which contains the link to each chapter's content.
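
    As a minimal sketch of this step, the snippet below pulls chapter links out of a small hand-written HTML fragment. The fragment `SAMPLE_HTML`, the function name `extract_chapter_links`, and the `example.com` base URL are assumptions for illustration only; the real page source may be structured differently.

```python
import re
from urllib.parse import urljoin

# Hypothetical fragment imitating a chapter list; the real page may differ.
SAMPLE_HTML = (
    '<dl><dt>Contents</dt>'
    '<dd><a href="/69509/1.html">Chapter 1</a></dd>'
    '<dd><a href="/69509/2.html">Chapter 2</a></dd>'
    '</dl>'
)


def extract_chapter_links(html, base_url):
    """Return (absolute_url, title) pairs for every <dd><a> entry in html."""
    pairs = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', html)
    # urljoin turns a site-relative href into an absolute URL.
    return [(urljoin(base_url, href), title) for href, title in pairs]


for link, title in extract_chapter_links(SAMPLE_HTML, 'http://example.com/'):
    print(title, link)
```

    Using `urljoin` rather than string concatenation is a design choice here: it also copes with hrefs that are already absolute.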

    The crawler code:

    import re

    import requests

    url = 'http://www.92kshu.cc/69509/'
    response = requests.get(url)
    response.encoding = 'gbk'  # decode the page as GBK
    html = response.text
    # The book title comes from the og:novel:book_name meta tag
    title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
    # Extract the chapter list; '正文' ("main text") is the literal heading in the page source
    dl = re.findall(r'<dl><dt><i class="icon"></i>正文</dt>(.*?)</dl>', html)[0]
    print(dl)
    # Each entry is a (relative URL, chapter title) pair
    chapter_info_list = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', dl)
    print(chapter_info_list)
    # The with block closes the output file even if a request fails midway
    with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
        for chapter_url, chapter_title in chapter_info_list:
            chapter_url = "http://www.92kshu.cc%s" % chapter_url
            chapter_response = requests.get(chapter_url)
            chapter_response.encoding = 'gbk'
            chapter_html = chapter_response.text
            # Get the content of each chapter
            chapter_content = re.findall(r'<div class="chapter">(.*?)><br>', chapter_html)[0]
            # Strip the paragraph tags
            chapter_content = chapter_content.replace('<p>', '')
            chapter_content = chapter_content.replace('</p>', '')
            fb.write(chapter_title)
            fb.write(chapter_content)
            fb.write('\n')
            print(chapter_url)  # progress output
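
    The tag stripping inside the loop can also be written as a small helper. This is a sketch, assuming chapter text is wrapped in `<p>` tags as the `replace` calls above suggest; `clean_chapter` and the sample string are hypothetical and not part of the original script.

```python
import html
import re


def clean_chapter(raw):
    """Turn an HTML chapter fragment into plain text, one paragraph per line."""
    # Paragraph ends and <br> tags become line breaks.
    text = re.sub(r'</p>|<br\s*/?>', '\n', raw)
    # Drop any remaining tags, such as the opening <p>.
    text = re.sub(r'<[^>]+>', '', text)
    # Decode HTML entities such as &amp;.
    text = html.unescape(text)
    # Trim each line and discard empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


sample = '<p>First paragraph.</p><p>Second &amp; last.</p>'
print(clean_chapter(sample))
```

    Unlike the plain `replace` calls, this keeps a line break between paragraphs, so the saved text file stays readable.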

    Crawler output: [screenshots missing from the source]

  • Original article: https://www.cnblogs.com/zhaoxinhui/p/12291944.html