  • A first try at a Python crawler

    Python 3.5: scrape the news on the NetEase News ranking page, mainly using the built-in urllib.request module and lxml.

    import re
    from urllib import request
    from lxml import etree
    import threadpool
    import threading

    htmlcode='gbk'
    threadlock=threading.Lock()
    testurl="http://news.163.com/rank/"

    with request.urlopen(testurl) as f:
        print('Status:', f.status, f.reason)
        # read the charset from the Content-Type header once; assume every page uses the same encoding
        decode=(f.headers['Content-Type'].split(';')[1]).split('=')[1]
        data = f.read().decode(decode.lower())
        # each ranking category is a titleBar block: capture its title and its "more" link
        infos = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', data, re.S)
        for i in range(len(infos)):
            print('%s-%s'%(i,infos[i][0]))
        print('Select a news category')
        k=input()
        if k.isdigit() and int(k)<len(infos):
            newpage=(request.urlopen(infos[int(k)][1]).read()).decode(decode.lower())
            dom=etree.HTML(newpage)
            items=dom.xpath('//tr/td/a/text()')
            urls=dom.xpath('//tr/td/a/@href')
            assert (len(items)==len(urls))
            print(len(items))
            for i in range(len(urls)):
                print(items[i])
                new=(request.urlopen(urls[i]).read()).decode(decode.lower())
                newdom=etree.HTML(new)
                # article body paragraphs live in <div id="endText" class="post_text">
                newitems=newdom.xpath("//div[@id='endText' and @class='post_text']/p/text()")
                for n in newitems:
                    print(n)
                print('======================= enter y to continue')
                if 'y'==input():
                    continue
                else:
                    break

    Multi-threaded version using threadpool (installed straight from pip). Reading the same 50 pages, the multi-threaded version measured a bit faster than the one above, but that depends on the network speed at the moment, so it is hard to time precisely.

    def test2():
        global htmlcode  # let GetNewpage see the charset detected below instead of the 'gbk' default
        with request.urlopen(testurl) as f:
            # take the charset from the Content-Type header once and reuse it for every page
            htmlcode=(f.headers['Content-Type'].split(';')[1]).split('=')[1]
            data = f.read().decode(htmlcode.lower())
            infos = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', data, re.S)
            newpage=(request.urlopen(infos[0][1]).read()).decode(htmlcode.lower())
            dom=etree.HTML(newpage)
            items=dom.xpath('//tr/td/a/text()')
            urls=dom.xpath('//tr/td/a/@href')
            assert (len(items)==len(urls))
            urlss=urls[:50]
            print(len(items))
            news=[]
            args=[]
            # threadpool argument format: each request is ([positional args], kwargs-or-None)
            [args.append(([i,news],None)) for i in urlss]
            pool=threadpool.ThreadPool(8)
            ress=threadpool.makeRequests(GetNewpage,args)
            [pool.putRequest(req) for req in ress]
            print("start=====%s"%len(urlss))
            pool.wait()
            print("end==========")
            print(len(news))
            print(news[0])
            while(True):
                # enter an index to print that article; anything else exits
                k=input()
                if not k.isdigit() or int(k)>=len(news):break
                print(news[int(k)])
    
    
    
                
    def GetNewpage(url,news):
        try:
            new=(request.urlopen(url).read()).decode(htmlcode.lower())
            newdom=etree.HTML(new)
            # article body paragraphs live in <div id="endText" class="post_text">
            newitems=newdom.xpath("//div[@id='endText' and @class='post_text']/p/text()")
            newcontent=""
            for n in newitems:
                newcontent=newcontent+n
            # the result list is shared across worker threads, so guard the append
            threadlock.acquire()
            news.append(newcontent)
            threadlock.release()
        except Exception:
            print('%s------------------error'%url)
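
    For comparison, the same fan-out could be written with the standard-library concurrent.futures instead of the external threadpool package; a minimal sketch, assuming the same endText/post_text article layout (fetch_text and fetch_all are hypothetical helper names):

    from concurrent.futures import ThreadPoolExecutor
    from urllib import request
    from lxml import etree

    def fetch_text(url, encoding='gbk'):
        # download one article and join its paragraph texts
        page = request.urlopen(url).read().decode(encoding)
        dom = etree.HTML(page)
        return ''.join(dom.xpath("//div[@id='endText' and @class='post_text']/p/text()"))

    def fetch_all(urls, workers=8):
        # map() keeps results in input order, so no explicit lock is needed
        with ThreadPoolExecutor(max_workers=workers) as ex:
            return list(ex.map(fetch_text, urls))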

    Example calls for threadpool multi-threading and for profile; the thread worker function can take more than one parameter.

    threadpool usage example: the worker function here takes two parameters, and the argument list you pass looks a bit odd; with only one parameter you can simply pass a plain list (see the sketch after pooltest below).
    
    
    def pooltest():
        a=[1,2,3,4,5,6,7,8,9,10,111]
        b=[]
        args=[]
        # each request gets ([value, shared list], None): two positional args, no kwargs
        [args.append(([i,b],None)) for i in a]
        pool=threadpool.ThreadPool(5)
        # worker appends its value to the shared list; the callback prints each request's id
        ress=threadpool.makeRequests(lambda x,y:y.append(x),args,lambda x,y:print(x.requestID))
        [pool.putRequest(req) for req in ress]
        pool.wait()
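
    For the single-parameter case mentioned above, the argument list can just be a plain list; a minimal sketch (pooltest_single is a made-up name):

    def pooltest_single():
        results=[]
        pool=threadpool.ThreadPool(5)
        # each plain list item becomes the worker's single positional argument
        reqs=threadpool.makeRequests(lambda x:results.append(x*x),[1,2,3,4,5])
        [pool.putRequest(r) for r in reqs]
        pool.wait()
        print(results)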
    
    

    Example call for profile, which shows where a function's execution time goes.

        import profile
        profile.run('test2()','prores')
        import pstats
        p=pstats.Stats('prores')
        p.strip_dirs().sort_stats("cumulative").print_stats(0)  # print_stats(n) shows the top n entries; 0 prints only the summary totals
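
    cProfile from the standard library is a lower-overhead drop-in for profile; a minimal equivalent sketch:

        import cProfile
        import pstats
        cProfile.run('test2()','prores')
        p=pstats.Stats('prores')
        p.strip_dirs().sort_stats("cumulative").print_stats(0)  # 0 keeps only the summary totals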
     

    Using a cookie to access a site that requires a logged-in user:

    First, pull the site's cookie out of your browser and save it in a local file or somewhere similar, then read it back when you need it.
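
    A minimal sketch of that step, assuming the cookie string was pasted into a local text file (the file name and load_cookie are made up here):

    def load_cookie(path='douban_cookie.txt'):
        # return the raw Cookie header value copied from the browser's dev tools
        with open(path, 'r', encoding='utf-8') as f:
            return f.read().strip()

    cookiedouban = load_cookie()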

    I tested Baidu and Douban locally: Baidu works and returns the expected information, Douban does not. Douban has been acting up lately and is often unreachable; presumably summer has arrived and Douban is too poor to afford air conditioning.

    The site returns gzip-encoded data; it is compressed and cannot be decoded directly, so decompress it first and then decode.

    import io
    import gzip

    def cookietest():
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0',
                 'Accept':'*/*',
                 'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
                 'Accept-Encoding':'gzip, deflate, br',
                 'Referer':'https://www.douban.com/',
                 'Cookie':cookiedouban,  # cookie string copied out of the browser
                 'Connection':'keep-alive'}

        req=request.Request('https://www.douban.com',headers=headers)
        with request.urlopen(req) as f:
            print('Status:', f.status, f.reason)
            for k, v in f.getheaders():
                print('%s: %s' % (k, v))
            bs=f.read()
            # the body is gzip-compressed, so decompress before decoding
            bi = io.BytesIO(bs)
            gf = gzip.GzipFile(fileobj=bi, mode="rb")
            print(gf.read().decode('utf-8'))
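
    gzip.decompress is a slightly shorter way to handle that last step; a minimal sketch that also checks the Content-Encoding header first, reusing the headers dict from cookietest above:

    import gzip

    req = request.Request('https://www.douban.com', headers=headers)
    with request.urlopen(req) as f:
        raw = f.read()
        # only decompress when the server actually gzipped the body
        if f.getheader('Content-Encoding') == 'gzip':
            raw = gzip.decompress(raw)
        print(raw.decode('utf-8'))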