  • Python practice: scraping Neihan duanzi (jokes)

    # -*- coding:utf-8 -*-
    from urllib import request as urllib2
    import re
    # Scrape Neihan duanzi with a regular expression
    url = r'http://www.neihanpa.com/article/list_5_{}.html'

    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0',
    }
    file_name = '第二天内涵段子爬取练习.txt'
    # 2 is the number of pages to fetch; adjust as needed
    for page in range(2):
        fullurl = url.format(str(page+1))
        request = urllib2.Request(url=fullurl, headers=headers)
        response = urllib2.urlopen(request)
        html = response.read().decode('gbk')
        # Without re.S, "." does not match a newline, so a match cannot span several lines
        # With re.S, "." also matches newlines, so the whole string is matched as one unit
        pattern = re.compile(r'<div\sclass="f18 mb20">(.*?)</div>', re.S)
        duanzis = pattern.findall(html)
        for duanzi in duanzis:
            duanzi = duanzi.replace('<p>','').replace('</p>','').replace('<br />',' ').replace('&ldquo;','').replace('&rdquo;','').replace('&hellip;','')
            try:
                # Append the scraped joke to the file, one joke per line
                with open(file_name, 'a', encoding='utf-8') as file:
                    file.write(' '.join(duanzi.split()) + '\n')
            except OSError as e:
                print(e)
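
The effect of re.S on the pattern above can be checked offline, without hitting the site. The snippet below is only a sketch: SAMPLE_HTML is made-up markup, not real page content, and extract() is a hypothetical helper. It also cleans the text a little differently from the script: it strips every tag with a second regex and decodes entities with html.unescape instead of deleting them.

```python
import re
from html import unescape

# Hypothetical sample markup, invented for this illustration only.
SAMPLE_HTML = (
    '<div class="f18 mb20">\n'
    '<p>first line<br />second line&hellip;</p>\n'
    '</div>\n'
    '<div class="f18 mb20"><p>&ldquo;one-liner&rdquo;</p></div>'
)

# Same pattern as in the script; (.*?) is non-greedy, so it stops at the first </div>.
PATTERN = r'<div\sclass="f18 mb20">(.*?)</div>'


def extract(html, flags=0):
    """Return the cleaned text of every matching <div> block (hypothetical helper)."""
    results = []
    for raw in re.findall(PATTERN, html, flags):
        text = re.sub(r'<[^>]+>', ' ', raw)      # replace every tag with a space
        text = unescape(text)                    # decode entities such as &hellip;
        results.append(' '.join(text.split()))   # collapse runs of whitespace
    return results


without_flag = extract(SAMPLE_HTML)        # "." stops at newlines
with_flag = extract(SAMPLE_HTML, re.S)     # "." also matches newlines

print(len(without_flag), len(with_flag))   # prints: 1 2
```

Without re.S only the single-line block is found; with re.S the block that spans several lines is matched as well, which is why the script passes the flag.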

  • Original post: https://www.cnblogs.com/jackzz/p/9125802.html