Scraping CSV-format data works in essentially the same way as scraping XML and other structured formats.
We will use the following table:
name  | sex  | addr        | email
Alex  | Boy  | Los Angeles | alex@hotstone.com
Coy   | Girl | Los Angeles | coy@hotstone.com
Couch | Boy  | California  | couch@hotstone.com
Tom   | Girl | New York    | tom@hotstone.com
Create a project:
$ scrapy startproject mycsv
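This creates the standard Scrapy project skeleton, roughly as follows (the exact files vary slightly between Scrapy versions):

mycsv/
    scrapy.cfg
    mycsv/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py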
Generate a spider from the csvfeed template:
$ cd mycsv
$ scrapy genspider -t csvfeed mycsvspider localhost
Write the items code (mycsv/items.py):
import scrapy


class MycsvItem(scrapy.Item):
    name = scrapy.Field()
    sex = scrapy.Field()
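Only name and sex get Field definitions here, because the spider below stores only those two columns. addr and email still need to appear in the spider's headers so that each row parses correctly, but their values are discarded.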
Write the spider file (mycsv/spiders/mycsvspider.py):
# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider
from mycsv.items import MycsvItem


class MycsvspiderSpider(CSVFeedSpider):
    name = 'mycsvspider'  # must match the name passed to `scrapy crawl`
    allowed_domains = ['localhost']
    # The csv file served by the local HTTP server set up below
    start_urls = ['http://localhost/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Define the headers
    headers = ['name', 'sex', 'addr', 'email']
    # Define the delimiter
    delimiter = ','

    # Do any adaptations you need here
    # def adapt_response(self, response):
    #     return response

    def parse_row(self, response, row):
        # row maps each header to that row's value; in Python 3 the
        # values are already str, so no .encode() is needed
        i = MycsvItem()
        # i['url'] = row['url']
        # i['name'] = row['name']
        # i['description'] = row['description']
        i['name'] = row['name']
        i['sex'] = row['sex']
        print("Name:")
        print(i['name'])
        print("Sex:")
        print(i['sex'])
        print("---------------------------")
        return i
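For each URL in start_urls, CSVFeedSpider splits the response body into rows using delimiter, zips each row with headers, and calls parse_row once per row with the values as a dict; whatever parse_row returns is collected as a scraped item.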
In the project directory, save a csv file named feed.csv whose contents are comma-separated.
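Based on the table above, feed.csv would contain just the four data rows. No header line is included, because the spider declares the headers itself; a header line in the file would be parsed as one more data row:

Alex,Boy,Los Angeles,alex@hotstone.com
Coy,Girl,Los Angeles,coy@hotstone.com
Couch,Boy,California,couch@hotstone.com
Tom,Girl,New York,tom@hotstone.com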
Start a local HTTP server with Docker; its only purpose is to serve the csv file:
$ cd mycsv
$ docker run -d -w /data -p 80:8080 -v ${PWD}:/data slzcc/java-webserver:jenkins-java-webserver-14 java -jar /usr/src/app/app.jar 8080
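Any static file server will do here. If Docker is not available, Python's built-in server is a minimal alternative; it must listen on port 80 because start_urls uses the default HTTP port (root privileges are typically required for ports below 1024):

$ sudo python3 -m http.server 80

Run it from the directory containing feed.csv.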
Once the server is up, check that the file is reachable:
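For example, with curl:

$ curl http://localhost/feed.csv

This should print the four rows of feed.csv back.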
Create a main.py file:
from scrapy import cmdline

cmdline.execute("scrapy crawl mycsvspider".split())
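Run it from the project root (the directory containing scrapy.cfg) so that Scrapy can find the project settings:

$ python main.py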
The result is as follows:
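Interleaved with Scrapy's own log output, the spider's print statements should produce something roughly like this for each of the four rows (shown here for the first row):

Name:
Alex
Sex:
Boy
---------------------------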