
A first try at a Python crawler

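Without further ado, here is the code. The comments in it explain each step in detail.

Reference: http://www.poluoluo.com/jzxy/201210/183913.html

This version does not use BeautifulSoup.

We will switch to BeautifulSoup shortly.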

#coding:utf8
import urllib2
import re

url = "http://www.baidu.com/s?wd=steve+jobs"    # search Baidu for the keywords "steve jobs"
content = urllib2.urlopen(url).read()
# content now holds the page source of the URL as a string
urls_pat = re.compile(r'<span class="g">(.*?)</span>')
# The parentheses mark the part to capture; (.*?) lazily matches everything inside the tag.
# re.compile turns the string into a compiled Python regular-expression pattern.
# The r prefix makes it a raw string, so metacharacters do not need to be escaped twice.
siteUrls = re.findall(urls_pat, content)
#############################################
for (i, v) in enumerate(siteUrls):    # i is the index into siteUrls, v is the value at that index
    print v
#############################################
# The loop above and the one below both print every result, because siteUrls is a list
#for i in range(0, len(siteUrls)):
#    print siteUrls[i]

Running it gives:

alex@universe ~/python/OOP $ python spider.py 
  www.apple.com/<b>stevejobs</b>/ 2013-3-13  
  cn.engadget.com/tag/<b>SteveJobs</b>/ 2013-3-13  
  book.douban.com/subject/65121... 2013-3-13  
wenku.baidu.com/view/dd72f7eeb8f67c1... 2012-7-10 
  www.forbes.com/profile/<b>steve</b>...<b>jobs</b>/ 2013-3-24  
  www.yeeyan.org/articles/view/Alexhon... 2013-3-13  
  www.ifanr.com/6619 2013-3-13  
  www.businessinsider.com/blackboard/<b>s</b>... 2013-3-13

This shows that the urllib2 and re modules alone are enough to scrape some data.
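To see why the pattern uses the lazy (.*?) rather than (.*), here is a small offline sketch. It is written for Python 3 (the listing above is Python 2) and runs against a hand-made string that imitates two of the result lines, so the `html` sample and the `lazy`/`greedy` names are assumptions made for illustration, not part of the original script.

```python
import re

# Hand-made sample imitating two result lines from the output above (an assumption).
html = ('<span class="g">  www.ifanr.com/6619 2013-3-13  </span>'
        '<span class="g">  www.apple.com/<b>stevejobs</b>/ 2013-3-13  </span>')

lazy = re.compile(r'<span class="g">(.*?)</span>')    # stops at the first </span> it meets
greedy = re.compile(r'<span class="g">(.*)</span>')   # runs on to the last </span>

lazy_hits = lazy.findall(html)
greedy_hits = greedy.findall(html)

print(len(lazy_hits))     # one capture per span
print(len(greedy_hits))   # a single capture that swallows both spans
```

With the greedy pattern, the closing tag of the first span and the opening tag of the second end up inside the capture, which is why the lazy form is the one that yields one entry per search result.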

Now let's do the same thing with BeautifulSoup.

Download: http://www.crummy.com/software/BeautifulSoup/#Download

The current release of Beautiful Soup 3 is 3.2.1 (February 16, 2012). You can install Beautiful Soup 3 with pip install BeautifulSoup or easy_install BeautifulSoup. It's also available as python-beautifulsoup in Debian and Ubuntu, and as python-BeautifulSoup on Red Hat.

I use Linux Mint, which is an Ubuntu derivative, so it can be installed like this: sudo easy_install BeautifulSoup

alex@universe ~/python/OOP $ sudo easy_install BeautifulSoup
[sudo] password for alex: 
Searching for BeautifulSoup
Best match: BeautifulSoup 3.2.0
Adding BeautifulSoup 3.2.0 to easy-install.pth file

Using /usr/lib/python2.7/dist-packages
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup

#coding:utf8
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.baidu.com/s?wd=steve+jobs"    # search Baidu for the keywords "steve jobs"
content = urllib2.urlopen(url).read()
# content now holds the page source of the URL as a string
soup = BeautifulSoup(content)
siteUrls = soup.findAll('span', attrs={'class': 'g'})
############################################
for (i, v) in enumerate(siteUrls):    # i is the index into siteUrls, v is the value at that index
    print v
#############################################
# The loop above and the one below both print every result, because siteUrls is a list
#for i in range(0, len(siteUrls)):
#    print siteUrls[i]

The output is:

alex@universe ~/python/OOP $ python spider_BeautifulSoup.py 
<span class="g">  www.apple.com/<b>stevejobs</b>/ 2013-3-13  </span>
<span class="g">  cn.engadget.com/tag/<b>SteveJobs</b>/ 2013-3-13  </span>
<span class="g">  book.douban.com/subject/65121... 2013-3-13  </span>
<span class="g">wenku.baidu.com/view/dd72f7eeb8f67c1... 2012-7-10 </span>
<span class="g">  www.forbes.com/profile/<b>steve</b>...<b>jobs</b>/ 2013-3-24  </span>
<span class="g">  www.yeeyan.org/articles/view/Alexhon... 2013-3-13  </span>
<span class="g">  www.ifanr.com/6619 2013-3-13  </span>
<span class="g">  www.businessinsider.com/blackboard/<b>s</b>... 2013-3-13  </span>

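Note that the HTML tags have not been stripped here.

BeautifulSoup() converts the string we just fetched into a BeautifulSoup object, so the methods BeautifulSoup provides can be used to process the HTML. For example, findAll('a') returns a list of all the a tags on the page; I find this quite similar to getElementsByTagName in JS. You can also pass the attrs argument, which acts as a filter and takes a dictionary. findAll('span',attrs={'class':'g'}) returns every span tag whose class is 'g', contents included (together with the span tag itself).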
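To make the tag-plus-attrs filter concrete without needing BeautifulSoup installed, here is a rough stand-in built on Python 3's standard html.parser module. It is not BeautifulSoup and not how BeautifulSoup is implemented; `find_all_text` and `_Collector` are hypothetical names, and unlike findAll the sketch returns only the text inside each matching element rather than the tag itself.

```python
from html.parser import HTMLParser

class _Collector(HTMLParser):
    """Gather the text inside every element matching a tag name and attribute filter."""

    def __init__(self, want_tag, want_attrs):
        super().__init__()
        self.want_tag = want_tag
        self.want_attrs = want_attrs
        self.open_depth = 0    # greater than 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        if self.open_depth:
            if tag == self.want_tag:
                self.open_depth += 1    # nested element with the same tag name
        elif tag == self.want_tag and all(
                dict(attrs).get(k) == v for k, v in self.want_attrs.items()):
            self.open_depth = 1
            self.found.append('')

    def handle_endtag(self, tag):
        if self.open_depth and tag == self.want_tag:
            self.open_depth -= 1

    def handle_data(self, data):
        if self.open_depth:
            self.found[-1] += data    # text of inner tags such as <b> is kept

def find_all_text(html, tag, attrs):
    """Hypothetical helper: text of every <tag> whose attributes match attrs."""
    parser = _Collector(tag, attrs)
    parser.feed(html)
    parser.close()
    return parser.found

# Hand-made sample; the class="x" span is there to show the filter rejecting it.
sample = ('<span class="g">  www.apple.com/<b>stevejobs</b>/ 2013-3-13  </span>'
          '<span class="x">skipped</span>'
          '<span class="g">www.ifanr.com/6619 2013-3-13</span>')
texts = find_all_text(sample, 'span', {'class': 'g'})
print(texts)
```

The dictionary plays the same role as attrs in the listing above: an element is kept only if every key in it matches the element's attribute of the same name.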

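What we get from either the regular expression or BeautifulSoup still needs further processing, because it contains HTML tags. An entry such as hi.baidu.com/cloga 2010-8-29 comes back with markup mixed into it, and in both cases the tags can be removed with the regular-expression sub method.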

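import re

strip_tag_pat = re.compile(r'<.*?>')
results = siteUrls    # assumption: the list of strings produced by the regex version above
rank = 0              # position of the result on the page, counted from 1
out = open('results000.csv', 'w')
for i in results:
    i0 = re.sub(strip_tag_pat, '', i)    # remove every HTML tag
    i0 = i0.strip()
    i1 = i0.split(' ')
    date = i1[-1]                 # the last field is the date
    siteUrl = ''.join(i1[:-1])    # everything before it is the URL
    rank += 1
    out.write(date + ',' + siteUrl + ',' + str(rank) + '\n')
out.close()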
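The loop above can also be tried offline. The following is a minimal Python 3 sketch under a few assumptions: the `sample` list is copied by hand from two of the result lines shown earlier, `to_rows` is a hypothetical helper, the rows go to an in-memory buffer instead of results000.csv, and the standard csv module replaces the manual string concatenation. It also splits on any run of whitespace with split() rather than on single spaces.

```python
import csv
import io
import re

TAG_PAT = re.compile(r'<.*?>')

def to_rows(results):
    """Hypothetical helper: turn raw captures into (date, url, rank) tuples."""
    rows = []
    for rank, raw in enumerate(results, start=1):
        text = TAG_PAT.sub('', raw).strip()   # drop the tags, then the outer whitespace
        fields = text.split()                 # split on any run of whitespace
        date = fields[-1]                     # the last field is the date
        url = ''.join(fields[:-1])            # everything before it is the URL
        rows.append((date, url, rank))
    return rows

# Two captures copied from the output shown earlier (an assumption for the demo).
sample = [
    '  www.apple.com/<b>stevejobs</b>/ 2013-3-13  ',
    'wenku.baidu.com/view/dd72f7eeb8f67c1... 2012-7-10 ',
]
rows = to_rows(sample)

buf = io.StringIO()    # in-memory stand-in for results000.csv
csv.writer(buf, lineterminator='\n').writerows(rows)
print(buf.getvalue())
```

Using csv.writer means a URL that happens to contain a comma would be quoted instead of silently shifting the columns, which plain concatenation does not guard against.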

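The last step writes the results to a file, one row per result holding the indexed date, the URL, and the rank. OK, with that a simple crawler has been implemented in Python. Here is the output of the code above.

[Figure: Result]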

Source: Cloga与网站|数字分析. Please credit the source when reposting.

Original post: https://www.cnblogs.com/spaceship9/p/2988036.html