Windows 10
Here I'm using Python 3.7.
Create a virtual environment
pip install virtualenv
pip install virtualenvwrapper-win
# Create the virtual environment first_pro
mkvirtualenv first_pro
If this reports "'mkvirtualenv' is not recognized as an internal or external command", add the directory containing mkvirtualenv.bat to your PATH environment variable.
# Delete the virtual environment
rmvirtualenv first_pro
# Exit the virtual environment
deactivate
# Enter the virtual environment first_pro
workon first_pro
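To confirm the environment is really active, here is a minimal check from inside Python (nothing project-specific is assumed):

import sys
# Both paths should point into the first_pro environment
# (typically under ...\Envs\first_pro with virtualenvwrapper-win),
# not into the system-wide Python install.
print(sys.prefix)
print(sys.executable)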
Install the scrapy package
pip install scrapy
If this fails, follow these steps:
1. Download Twisted from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
Change into the directory containing the downloaded wheel and run: pip install Twisted-19.2.1-cp37-cp37m-win_amd64.whl
2. Install scrapy
pip install scrapy
If this reports "Consider using the `--user` option or check the permissions":
pip install --user scrapy
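Once the install succeeds, a quick sanity check from a Python shell inside the first_pro environment:

import scrapy
import twisted
print(scrapy.__version__)   # whichever Scrapy release pip resolved
print(twisted.__version__)  # should match the wheel, e.g. 19.2.1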
Create a scrapy project
Step 1:
scrapy startproject mySpider
Step 2:
cd mySpider
Step 3:
scrapy genspider lvbo s.hc360.com
(the two arguments are the spider's name and the domain it is allowed to crawl)
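genspider writes a skeleton spiders/lvbo.py; it looks roughly like this (the exact template varies a little between Scrapy versions):

import scrapy

class LvboSpider(scrapy.Spider):
    name = 'lvbo'
    allowed_domains = ['s.hc360.com']
    start_urls = ['http://s.hc360.com/']

    def parse(self, response):
        # genspider leaves the parsing logic to you
        pass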
Scrapy project layout
mySpider/
    scrapy.cfg
    mySpider/
        spiders/
            __init__.py
            lvbo.py
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
Example: crawling aluminum-foil (铝箔) listings from hc360.com (慧聪网)
lvbo.py
import scrapy
from mySpider.items import MyspiderItem
import time


class LvboSpider(scrapy.Spider):
    name = 'lvbo'
    allowed_domains = []
    start_urls = ['https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94']

    def parse(self, response):
        li_list = response.xpath("//div[@class='s-layout']//div[@class='wrap-grid']//li")
        for li in li_list:
            item = MyspiderItem()
            url = li.xpath(".//div[@class='NewItem']//a/@href").extract_first()
            if url:
                url = 'https:' + url
                yield scrapy.Request(
                    url,
                    callback=self.parse_detail,
                    meta={"item": item}
                )
            time.sleep(0.2)

        # Pagination: follow the "下一页" (next page) link
        next_url = response.xpath("//a[text()='下一页']/@href").extract_first()
        if next_url:
            next_url = 'https:' + next_url
            print(next_url)
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):
        item = response.meta["item"]
        company_name = response.xpath("//div[@class='word-box']/div/div[@class='p sate']/em/text()").extract_first()
        name = response.xpath("//div[@class='word-box']/div/div[@class='p name']/em/text()").extract_first()
        phone = response.xpath("//div[@class='word-box']/div/div[@class='p tel2']/em/text()").extract_first()
        if company_name and name and phone:
            item["company_name"] = company_name.lstrip(":")
            # strip non-breaking spaces from the contact name
            item["name"] = name.replace(u'\xa0', u' ').strip()
            item["phone"] = phone.lstrip(":")
            print(item)
            yield item
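XPath expressions like the ones above are easiest to fine-tune interactively before they go into the spider. scrapy shell fetches a page and drops you into a Python prompt with response already built (assuming the site still serves this markup):

scrapy shell "https://s.hc360.com/seller/search.html?kwd=%E9%93%9D%E7%AE%94"
>>> response.xpath("//div[@class='s-layout']//div[@class='wrap-grid']//li")
>>> response.xpath("//a[text()='下一页']/@href").extract_first()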
settings.py
# Log level
LOG_LEVEL = "WARNING"
# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
# Wait 3 seconds between downloads
DOWNLOAD_DELAY = 3
# Enable the item pipeline
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
}
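The 300 in ITEM_PIPELINES is a priority (0-1000, lower runs first); it only matters once several pipelines are enabled. A hypothetical second pipeline would slot in like this:

ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
    # 'mySpider.pipelines.CleanPhonePipeline': 200,  # hypothetical: would run before MyspiderPipeline
}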
pipelines.py
import json


class MyspiderPipeline(object):
    # Open the output file when the spider starts
    def open_spider(self, spider):
        self.file = open('lvbo.txt', 'w', encoding='utf-8')

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()

    # Write each item to the file
    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
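After a crawl, lvbo.txt holds one JSON object per line, so reading the results back is straightforward (a minimal sketch, assuming the pipeline above produced the file):

import json

with open('lvbo.txt', encoding='utf-8') as f:
    items = [json.loads(line) for line in f if line.strip()]
print(len(items), 'items scraped')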
items.py
import scrapy


class MyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    company_name = scrapy.Field()
    name = scrapy.Field()
    position = scrapy.Field()
    phone = scrapy.Field()
    site = scrapy.Field()
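A scrapy.Item works like a dict, but only the declared fields are accepted, which catches typos early:

from mySpider.items import MyspiderItem

item = MyspiderItem()
item['company_name'] = 'test'   # fine: field declared above
# item['company'] = 'test'      # would raise KeyError: field not declared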
Run
In a command-prompt (DOS) window, change into the mySpider directory: cd mySpider
...mySpider> scrapy crawl lvbo
To debug with PyCharm instead, create a start.py file at the same level as settings.py:
start.py
# -*- coding:utf-8 -*-
from scrapy import cmdline

# Equivalent to running "scrapy crawl lvbo" on the command line
cmdline.execute("scrapy crawl lvbo".split())
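If you prefer not to go through cmdline, Scrapy's CrawlerProcess API does the same thing (a minimal sketch; run it from the project directory so the project settings are found):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from mySpider.spiders.lvbo import LvboSpider

process = CrawlerProcess(get_project_settings())
process.crawl(LvboSpider)
process.start()  # blocks until the crawl finishes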
In PyCharm's Run/Debug Configuration, set Script path to: C:\code\mySpider\mySpider\start.py