1. Decoding Chinese from a JSON file:
```python
#!/usr/bin/python
# coding=utf-8
# author=dahu
import json

with open('huxiu.json', 'r') as f:
    data = json.load(f)
print data[0]['title']
for key in data[0]:
    print '"%s":"%s",' % (key, data[0][key])
```
Writing Chinese to a JSON file:
```python
#!/usr/bin/python
# coding=utf-8
# author=dahu
import json

data = {
    "desc": "女友不是你想租想租就能租",
    "link": "/article/214877.html",
    "title": "押金8000元,共享女友门槛不低啊"
}
with open('tmp.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False)  # specify ensure_ascii=False
```
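To make the effect of `ensure_ascii` concrete, here is a minimal sketch (not from the original post; Python 2, assuming a UTF-8 terminal) comparing the two modes:

```python
#!/usr/bin/python
# coding=utf-8
# Minimal demo of what ensure_ascii changes (assumes a UTF-8 terminal):
import json

data = {"title": "共享女友"}
print json.dumps(data)                      # {"title": "\u5171\u4eab\u5973\u53cb"}
print json.dumps(data, ensure_ascii=False)  # {"title": "共享女友"}
```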
2. When Scrapy saves a JSON file, the Chinese easily comes out garbled. For example:
```
$ scrapy crawl huxiu --nolog -o huxiu.json
$ head huxiu.json
[
{"title": "\u62bc\u91d18000\u5143\uff0c\u5171\u4eab\u5973\u53cb\u95e8\u69db\u4e0d\u4f4e\u554a", "link": "/article/214877.html", "desc": "\u5973\u53cb\u4e0d\u662f\u4f60\u60f3\u79df\u60f3\u79df\u5c31\u80fd\u79df"},
{"title": "\u5f20\u5634\uff0c\u817e\u8baf\u8981\u5582\u4f60\u5403\u836f\u4e86", "link": "/article/214879.html", "desc": "\u201c\u8033\u65c1\u56de\u8361\u7740Pony\u9a6c\u7684\u6559\u8bf2\uff1a\u597d\u597d\u7528\u8111\u5b50\u60f3\u60f3\uff0c\u4e0d\u5145\u94b1\uff0c\u4f60\u4eec\u4f1a\u53d8\u5f3a\u5417\uff1f\u201d"},
```
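Side note (not in the original post): if you are on Scrapy 1.2 or later, the feed exporter itself can be told to emit UTF-8, which fixes the `-o` output directly. The pipeline route below still works on older versions.

```python
# settings.py -- assumption: this setting requires Scrapy >= 1.2, where it was added
FEED_EXPORT_ENCODING = 'utf-8'
```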
Combining this with the trick above for saving JSON with the Chinese intact:
Change in settings.py:
```python
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
}
```
That is, just uncomment this block (it ships commented out in the default settings.py).
Change pipelines.py to the following:
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs


class CoolscrapyPipeline(object):
    # def __init__(self):
    #     self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item
```
The commented-out lines are an alternative way to write this. The key point is that once the pipeline is enabled in settings, Scrapy calls process_item automatically for every item, so we can save the data in any format we like.
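Building on that, here is a slightly fuller sketch (a hypothetical `JsonLinesPipeline`, not from the original post) that uses Scrapy's standard `open_spider`/`close_spider` hooks and writes one object per line, avoiding both the repeated `open()` calls and the trailing comma of the version above:

```python
# -*- coding: utf-8 -*-
# Sketch of a JSON Lines pipeline. Assumptions: the class name and output file
# are made up for illustration; open_spider/close_spider are the standard
# Scrapy pipeline hooks, called once when the spider starts and stops.
import codecs
import json


class JsonLinesPipeline(object):
    def open_spider(self, spider):
        # open once, as a unicode-aware UTF-8 stream
        self.file = codecs.open('data_cn.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese readable; one object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```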
Now run the crawl from the terminal:

```
scrapy crawl huxiu --nolog
```
If you still add `-o file.json`, both file.json and the file defined in the pipeline are generated, but the JSON in file.json is still escaped gibberish.
3. Going further
From the analysis above we can draw another conclusion: ITEM_PIPELINES in settings controls which pipelines run. What if we enable a few more?
```python
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,
    'coolscrapy.pipelines.CoolscrapyPipeline1': 300,
}
```
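A note on the numbers (standard Scrapy behavior, not spelled out in the original post): the values are priorities from 0 to 1000, and items pass through the enabled pipelines in ascending order. So to force `CoolscrapyPipeline` to run first, you could write:

```python
ITEM_PIPELINES = {
    'coolscrapy.pipelines.CoolscrapyPipeline': 300,   # lower number: runs first
    'coolscrapy.pipelines.CoolscrapyPipeline1': 400,  # runs second
}
```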
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
# import codecs


class CoolscrapyPipeline(object):
    # def __init__(self):
    #     self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # self.file.write(line)
        with open('data_cn1.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',\n')
        return item


class CoolscrapyPipeline1(object):
    def process_item(self, item, spider):
        with open('data_cn2.json', 'a') as f:
            json.dump(dict(item), f, ensure_ascii=False)
            f.write(',hehe\n')
        return item
```
Run:
```
$ scrapy crawl huxiu --nolog
$ head -n 2 data_cn*
==> data_cn1.json <==
{"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},
{"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},

==> data_cn2.json <==
{"title": "押金8000元,共享女友门槛不低啊", "link": "/article/214877.html", "desc": "女友不是你想租想租就能租"},hehe
{"title": "张嘴,腾讯要喂你吃药了", "link": "/article/214879.html", "desc": "“耳旁回荡着Pony马的教诲:好好用脑子想想,不充钱,你们会变强吗?”"},hehe
```
As you can see, both files were generated, and each in exactly the format we wanted!
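One caveat: neither file is valid JSON as a whole (trailing commas, and the `,hehe` separators in data_cn2.json). If you later want to load data_cn1.json back, here is a hypothetical reader (not from the original post) that assumes exactly the `',\n'` separators written by the pipeline above:

```python
#!/usr/bin/python
# coding=utf-8
# Hypothetical reader for data_cn1.json as produced above. Assumption: the file
# is a series of JSON objects separated by ',\n', so it only becomes valid JSON
# after stripping the trailing comma and wrapping it in brackets.
import json

with open('data_cn1.json') as f:
    body = f.read().rstrip().rstrip(',')  # drop trailing whitespace, then the last comma
items = json.loads('[%s]' % body)
print len(items), items[0]['title']
```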