First, create a new Scrapy project. If you are not sure how to set one up, see the earlier article on scraping the Douban top movies.
The directory structure looks like this:
Since I am only scraping the questions themselves, the item contains just a single title field.
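The post does not show items.py, but with only that one field it would look roughly like this (a minimal sketch following Scrapy's usual conventions):

# -*- coding: utf-8 -*-
# zhihu/items.py -- minimal sketch; only the title field is needed
import scrapy

class ZhihuItem(scrapy.Item):
    # the question title extracted by the spider
    title = scrapy.Field()

With the item in place, here is the zhihu_spider.py code: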
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector
from zhihu.items import ZhihuItem
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # make sure the default encoding is utf-8


class ZhihuSpider(BaseSpider):
    """Crawl question titles under a Zhihu topic, page by page."""
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    # one start URL per page of the topic's question list
    start_urls = ["http://www.zhihu.com/topic/19550517/questions?page=" + str(page)
                  for page in range(1, 21500)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # each question title is the <a> text inside a <div><h2>
        for seq in hxs.xpath('//div/h2'):
            item = ZhihuItem()
            item['title'] = seq.xpath('a/text()').extract()
            yield item
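The post does not show how the scraped titles end up in apart.txt. One possibility, purely as a sketch (the pipeline class name here is made up, and it would also need to be registered under ITEM_PIPELINES in settings.py), is a small item pipeline that writes each title to the file:

# -*- coding: utf-8 -*-
# zhihu/pipelines.py -- a sketch of one way to dump every title into apart.txt;
# the original post does not show how that file was actually produced
import codecs

class ZhihuPipeline(object):

    def open_spider(self, spider):
        # apart.txt is the file the word-counting script reads in the next step
        self.f = codecs.open('apart.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # item['title'] is a list of extracted strings; write one per line
        for title in item['title']:
            self.f.write(title + '\n')
        return item

    def close_spider(self, spider):
        self.f.close()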
Next, read the questions back out of the apart.txt file, segment them into words, and count each word.
Two libraries are used here: redis and the jieba word segmenter. The redis library needs a Redis database installed locally or on a server; for how to use Redis itself, check the official documentation.
Redis official site: http://redis.io/
The redis-py library for Python: https://github.com/andymccurdy/redis-py
Neither library ships with Python, so both need to be installed separately.
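Once both are installed, a quick sanity check that redis-py can talk to the local server (a minimal sketch; it assumes Redis is running on the default port 6379):

# -*- coding: utf-8 -*-
# minimal check, assuming a local Redis listening on the default port
import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=3)
r.set('hello', 1)      # store a counter
r.incr('hello')        # bump it to 2
print r.get('hello')   # prints 2 (values come back as strings)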
# -*- coding: utf-8 -*-

import jieba
import redis

def getwords(doc):
    # segment every line of the file with jieba and bump a per-word counter in Redis
    f = open(doc, 'r')
    for line in f.readlines():
        words = jieba.cut(line)
        for word in words:
            if r.exists(word):
                r.incr(word)
            else:
                r.set(word, 1)
    f.close()

r = redis.Redis(host='127.0.0.1', port=6379, db=3)
getwords('apart.txt')

Finally, read the word counts back out of Redis and sort them:

# -*- coding: utf-8 -*-

import redis

r = redis.Redis('127.0.0.1', 6379, db=3)
print "size:", r.dbsize()

keys = r.keys()
fc = {}
for key in keys:
    value = int(r.get(key))
    # keep only words that appear between 100 and 3000 times
    if 100 <= value <= 3000:
        fc[key] = value

# sort by count, descending
dic = sorted(fc.iteritems(), key=lambda pair: pair[1], reverse=True)
for item in dic:
    print item[0], item[1]
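As a side note on the design: storing every word as a plain string key means the filtering and sorting have to happen in Python. A Redis sorted set could do the counting and the ranking in one place. A rough sketch of that alternative (the key name wordcount is made up for illustration, and newer redis-py versions use zincrby(key, amount, member) rather than the older argument order shown here):

# -*- coding: utf-8 -*-
# alternative sketch: count words into a Redis sorted set so Redis keeps the ranking;
# the key name 'wordcount' is made up for illustration
import jieba
import redis

r = redis.Redis(host='127.0.0.1', port=6379, db=3)

for line in open('apart.txt'):
    for word in jieba.cut(line):
        # older redis-py: zincrby(key, member, amount); newer versions swap the last two arguments
        r.zincrby('wordcount', word, 1)

# the 20 most frequent words, highest count first
for word, count in r.zrevrange('wordcount', 0, 19, withscores=True):
    print word, int(count)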
Let's take a look at the final results (partial):