Installing Scrapy:
1. First, activate the virtual environment.
2. Install from the domestic Douban mirror, which is fast:
pip install -i https://pypi.douban.com/simple/ scrapy
3. Special case: the install can fail because a C++ compiler is missing; I solved it by installing Visual Studio 2015.
Basic commands:
scrapy --help
Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

We'll get to the rest of these as they come up.
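As a quick illustration of runspider, a self-contained spider can live in a single file and be run without creating a project. This is only a minimal sketch; the file name, spider name, and the yielded field are my own placeholders:

# standalone_spider.py (hypothetical file), run without a project via:
#   scrapy runspider standalone_spider.py
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title"
    start_urls = ["http://blog.jobbole.com"]

    def parse(self, response):
        # yield one item holding the page <title> text
        yield {"title": response.xpath("//title/text()").extract_first()}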
Creating a project:
This has to be done from the command line; PyCharm has no built-in Scrapy integration the way it does for Django.
Command:
# Note: cd into the directory where the project should be created first
scrapy startproject projectname
No spider is generated by default; you still have to create one yourself with a separate command.
Directory tree: (main is a file I added later myself)
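The post does not show what main contains; a common pattern, and my assumption here, is a small entry script built on scrapy.cmdline so the spider can be started and debugged from inside PyCharm:

# main.py: debug entry point (my assumption about its contents)
import os
import sys

from scrapy.cmdline import execute

# make sure the project root is importable
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# equivalent to running "scrapy crawl jobbole" in the project directory
execute(["scrapy", "crawl", "jobbole"])

Running main.py from PyCharm then behaves like the scrapy crawl command, and breakpoints inside parse() work normally.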
Creating a spider from a template:
Much like creating an app in Django, here we create a spider inside the project.
Command:
# Note: must be run inside the project directory
# Create a spider named jobbole whose root crawl address is blog.jobbole.com; it crawls Jobbole (伯乐在线)
scrapy genspider jobbole blog.jobbole.com
The file it creates (with the parse body filled in):
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    # spider name
    name = "jobbole"
    # domains the spider is allowed to crawl
    allowed_domains = ["blog.jobbole.com"]
    # URLs crawling starts from
    start_urls = ['http://blog.jobbole.com']

    # parse callback, called with each downloaded response
    def parse(self, response):
        # parse the response with XPath and extract data;
        # three ways of selecting the article title <h1>:
        # //*[@id="post-110769"]/div[1]/h1
        re_selector = response.xpath('//*[@id="post-110769"]/div[1]/h1/text()')
        re2_selector = response.xpath('/html/body/div[3]/div[1]/h1/text()')
        re3_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

        pass
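The three XPath expressions are easiest to verify in the interactive shell before running the spider. A rough sketch; the article URL below is only an illustration guessed from the post-110769 id, any article page works:

# started with:  scrapy shell http://blog.jobbole.com/110769/
# then, inside the console:
response.xpath('//div[@class="entry-header"]/h1/text()').extract_first()
# the class-based selector is the most portable of the three: the id-based
# one is tied to this specific article, and the absolute path breaks easily
# if the page layout changes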
With that, a spider project has been set up.