  • 2019.1.7

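    import re
    import urllib.error
    import urllib.request
    from urllib.parse import urljoin

    # Fetch the forum index page and decode it, skipping undecodable bytes.
    data = urllib.request.urlopen("http://bbs.hupu.com/").read()
    data = data.decode("utf-8", "ignore")

    # Collect the links to posts; the dot before "html" is escaped.
    pat = r'<a href="(.*?\.html)" target="_blank" title='
    allurl = re.compile(pat).findall(data)
    # urljoin avoids a doubled slash when an href already starts with "/".
    allurl = [urljoin('https://bbs.hupu.com/', url) for url in allurl]

    # \s* tolerates the line breaks around the title text.
    title_pat = re.compile(r'<title>\s*(.*?)\s*</title>', re.S)

    # Append each post title to result.txt; the file is closed automatically.
    with open('./result.txt', 'a', encoding='utf8') as fh:
        for i, nowurl in enumerate(allurl):
            try:
                print('Crawling post ' + str(i + 1))
                print(nowurl)
                data = urllib.request.urlopen(nowurl).read()
                data = data.decode("utf-8", "ignore")
                result = title_pat.findall(data)
                if not result:
                    print('No title found, skipping')
                    continue
                fh.write(result[0] + '\n')
                print('----Write succeeded----')
            except urllib.error.URLError as e:
                print('Failed to crawl post ' + str(i + 1))
                if hasattr(e, "code"):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)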
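
The two regex steps in the script (collecting post links, then pulling each page's title) can be tried without touching the network. The sketch below is one possible way to factor them into functions. The names `extract_post_urls` and `extract_title` and the sample HTML are hypothetical, made up for illustration. Matching the whitespace around the title with `\s*` is an assumption about the page markup, not something confirmed against the live site.

```python
import re
from urllib.parse import urljoin

# Same link pattern as the script, with the dot before "html" escaped.
LINK_PAT = re.compile(r'<a href="(.*?\.html)" target="_blank" title=')
# re.S lets "." match newlines; \s* absorbs whitespace around the title text.
TITLE_PAT = re.compile(r'<title>\s*(.*?)\s*</title>', re.S)


def extract_post_urls(html, base="https://bbs.hupu.com/"):
    """Return absolute URLs for every post link found in html."""
    return [urljoin(base, href) for href in LINK_PAT.findall(html)]


def extract_title(html):
    """Return the page title, or None when no <title> element is found."""
    match = TITLE_PAT.search(html)
    return match.group(1) if match else None


# Hypothetical sample markup, not taken from the real site.
sample_index = (
    '<a href="/123.html" target="_blank" title="first">first</a>\n'
    '<a href="456.html" target="_blank" title="second">second</a>\n'
)
sample_post = "<html><head><title>\r\nHello Hupu\r\n</title></head></html>"

print(extract_post_urls(sample_index))
print(extract_title(sample_post))
print(extract_title("<html></html>"))
```

Returning `None` for a page without a title lets the caller skip that post instead of failing on an empty match list.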
  • Original post: https://www.cnblogs.com/hesse/p/10235434.html