Web crawler

http://www.cnblogs.com/fnng/p/3576154.html

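The article at the link above manages to grab the images from the page it uses as its example. I want to grab the images from

http://tieba.baidu.com/p/2923384495

but I can't get them.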

I changed the code to

reg=r'src="(.*?.jpg)" style'

and it still grabs nothing, which is frustrating.

Yet Inspect Element in Google Chrome clearly shows

<img src="http://imgsrc.baidu.com/forum/w%3D580/sign=a194f05af503918fd7d13dc2613c264b/d42a2834349b033b99065b6017ce36d3d439bdfa.jpg" style="position: absolute; display: block; top: 32px; left: 0px;  560px; height: 448px; cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;" height="1" width="1" origin-src="http://imgsrc.baidu.com/forum/w%3D580/sign=a194f05af503918fd7d13dc2613c264b/d42a2834349b033b99065b6017ce36d3d439bdfa.jpg">

Later I took

src="http://imgsrc.baidu.com/forum/w%3D580/sign=a194f05af503918fd7d13dc2613c264b/d42a2834349b033b99065b6017ce36d3d439bdfa.jpg" style

and searched for it in the page source, but found nothing. What is going on?
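
A likely explanation, offered as an assumption rather than something this post confirms: Inspect Element shows the live DOM after the page's scripts have run, whereas the page source (which is also what `urllib` downloads) is the raw HTML, and the attribute that follows `src` can differ between the two. The sketch below uses two made-up tags, `raw_tag` and `dom_tag`, to show how a pattern tied to the next attribute matches one form but not the other. The markup and the `matches` helper are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical markup for illustration only; the real Tieba HTML may differ.
raw_tag = '<img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="jpeg">'
dom_tag = '<img class="BDE_Image" src="http://example.com/a.jpg" style="display: block;">'

# Each pattern requires a specific attribute to follow src.
style_pattern = re.compile(r'src="(.*?\.jpg)" style')
pic_pattern = re.compile(r'src="(.*?\.jpg)" pic')

def matches(pattern, text):
    """Return the list of URLs captured by pattern in text."""
    return pattern.findall(text)

print(matches(style_pattern, dom_tag))  # the DOM form has a style attribute after src
print(matches(style_pattern, raw_tag))  # the raw form does not, so nothing matches
print(matches(pic_pattern, raw_tag))    # a pattern written against the raw form matches
```

If this is indeed the cause, the pattern has to be written against the downloaded source, not against what the inspector displays.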

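http://tieba.baidu.com/p/3034536041
I pasted the page source into RegEx Tester and looked for a regular expression that matches. It turns out the original one (reg = r'src="(.+?.jpg)" pic_ext') does capture the images.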

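# Python 2
import urllib
import re

def getHtml(url):
    # Download the raw page source
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

def getImg(html):
    # Match the src of each image whose next attribute starts with "pic";
    # the dot before "jpg" is escaped so it only matches a literal dot
    reg = r'src="(.*?\.jpg)" pic'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    # Save the images as 0.jpg, 1.jpg, ...
    for x, imgurl in enumerate(imglist):
        urllib.urlretrieve(imgurl, '%s.jpg' % x)
    return len(imglist)

print 'Please input the url:',
url = raw_input()
html = getHtml(url)
print 'Start working'

print 'Downloaded %d images' % getImg(html)
print 'Done'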



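Then I tried http://tieba.baidu.com/p/2923384495 again. This time it worked, but not completely: only 4 images were downloaded. What is going on?

Besides, a regular expression like this is quite limited. Is there a general-purpose library for grabbing images? I have not found one so far.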
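
One option that avoids depending on attribute order is an HTML parser. The following is a minimal sketch using the standard library's `HTMLParser`, not a method taken from the linked articles; the names `ImgCollector` and `extract_img_urls` and the sample markup are made up for illustration. The import is written so the block runs on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, whatever the attribute order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.urls.append(value)

def extract_img_urls(html):
    """Return the src of every <img> found in the given HTML text."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.urls

# Hypothetical sample: different attribute order and different extensions.
sample = ('<div><img src="http://example.com/a.jpg" pic_ext="jpeg">'
          '<img style="x" src="http://example.com/b.png"></div>')
print(extract_img_urls(sample))
```

On Python 3 the downloaded bytes would need decoding to text before being passed to `feed`. Third-party parsers such as Beautiful Soup serve the same purpose with a richer interface.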


__________________________________________________________________________________________________________________________________________________________

http://www.douban.com/group/topic/44698923/

http://www.zhihu.com/question/21358581

http://www.zhihu.com/question/20899988

http://pyiner.com/2013/06/10/Python%E7%88%AC%E8%99%AB%E6%95%99%E7%A8%8B-%E7%AE%80%E5%8D%95%E7%9A%84%E6%8A%93%E5%8F%96.html

http://www.lovelucy.info/python-crawl-pages.html

http://blog.csdn.net/column/details/why-bug.html The first few posts in this blog series are translated from https://docs.python.org/2/howto/urllib2.html

# Python 2
import urllib2

def badutieba(url, begin_page, end_page):
    # Save each page of the thread as 00001.html, 00002.html, ...
    for i in range(begin_page, end_page + 1):
        sname = str(i).zfill(5) + '.html'
        print 'downloading page ' + str(i) + ' to ' + sname + '......'
        m = urllib2.urlopen(url + str(i)).read()
        # Write the raw bytes; the with block closes the file
        with open(sname, 'wb') as f:
            f.write(m)

url = raw_input('input the url:')
begin_page = int(raw_input('begin_page:'))
end_page = int(raw_input('end_page:'))

badutieba(url, begin_page, end_page)

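Note: url: http://tieba.baidu.com/p/2923384495?pn=

The comment on floor 13 of the original article, which says there are 50 items per page so pn has to be converted, refers to: http://tieba.baidu.com/f/good?kw=%CF%C9%BD%A3%CE%E5%CD%E2%B4%AB&cid=0&pn=

The case where no conversion is needed refers to: http://tieba.baidu.com/p/2923384495?pn=

These are the thread list and an individual thread, respectively.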
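
The difference between the two `pn` parameters can be sketched as below. This is an illustration under assumptions: the post only says the list shows 50 items per page and that `pn` must be converted, so the formula `(page - 1) * 50` is a guess at that conversion, and the function names are made up.

```python
POSTS_PER_LIST_PAGE = 50  # from the comment cited above; treat as an assumption

def thread_page_url(base_url, page):
    """Individual thread: pn is the page number itself, no conversion."""
    return base_url + str(page)

def list_page_url(base_url, page):
    """Thread list: pn is assumed to be the offset of the first item on the page."""
    return base_url + str((page - 1) * POSTS_PER_LIST_PAGE)

print(thread_page_url('http://tieba.baidu.com/p/2923384495?pn=', 3))
print(list_page_url('http://tieba.baidu.com/f/good?kw=%CF%C9%BD%A3%CE%E5%CD%E2%B4%AB&cid=0&pn=', 3))
```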


Original article: https://www.cnblogs.com/crane-practice/p/3780858.html