  • Crawling page content with Python 3 and filtering it

    from urllib import request
    import re

    def getResponse(url):
        # Send the request and return the response object
        url_request = request.Request(url)
        url_response = request.urlopen(url_request)
        return url_response

    def getData(data):
        # Match alt="..." attributes whose value consists of Chinese characters or whitespace
        html = re.findall(r'alt="[\u4E00-\u9FA5\s]+"', data)
        return html

    # Crawl list pages 1 through 122 and append every match to c.txt
    for aid in range(1, 123):
        url = "http://www.zhijiaow.com/ShopMallList_%s_0.html" % aid
        http_response = getResponse(url)
        data = http_response.read().decode('utf8')
        matches = getData(data)
        with open('c.txt', 'a', encoding='utf8') as f:
            for info in matches:
                f.write(info)

    # Filter: turn each '"alt="' boundary between adjacent matches into a newline
    with open('c.txt', 'r', encoding='utf8') as f:
        lines = f.readlines()
    with open('a.txt', 'a', encoding='utf8') as w:
        for line in lines:
            w.write(line.replace('"alt="', '\n'))
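
The filtering step can also be done inside the regular expression itself. The following is a minimal sketch, not the original author's method: it assumes a capture group is acceptable in place of the intermediate file, and the helper name `extract_alt_texts` and the sample markup are made up for illustration.

```python
import re

# Capture only the attribute value; the class matches CJK ideographs and whitespace.
ALT_PATTERN = re.compile(r'alt="([\u4E00-\u9FA5\s]+)"')

def extract_alt_texts(html):
    """Return the Chinese alt values found in html, with outer whitespace stripped."""
    return [value.strip() for value in ALT_PATTERN.findall(html)]

# Hypothetical sample markup, not taken from the crawled site
sample = '<img src="a.jpg" alt="红色 书包"><img src="b.jpg" alt="logo"><img alt="铅笔盒">'
names = extract_alt_texts(sample)
print("\n".join(names))
```

Because `findall` returns only the captured group, each name can be written to the output file on its own line without a second pass over `c.txt`. The `alt="logo"` attribute is skipped because it contains no characters from the Chinese range.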
  • Original article: https://www.cnblogs.com/isule/p/8926754.html