Scrapy: set FEED_EXPORT_ENCODING to fix Chinese unicode being written to JSON files as `\uXXXX`

0. The problem

The scraped item:

2017-10-16 18:17:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.huxiu.com/v2_action/article_list>
{'author': u'\u5546\u4e1a\u8bc4\u8bba\u7cbe\u9009\xa9',
 'cmt': 5,
 'fav': 194,
 'time': u'4\u5929\u524d',
 'title': u'\u96f7\u519b\u8c08\u5c0f\u7c73\u201c\u65b0\u96f6\u552e\u201d\uff1a\u50cfZara\u4e00\u6837\u5f00\u5e97\uff0c\u8981\u505a\u5f97\u6bd4Costco\u66f4\u597d',
 'url': u'/article/217755.html'}

Written to a JSON Lines (.jl) file:

    {"title": "u8fd9u4e00u5468uff1au8d2bu7a77u66b4u51fb", "url": "/article/217997.html", "author": "u864eu55c5", "fav": 8, "time": "2u5929u524d", "cmt": 5}
    {"title": "u502au840du8001u516cu7684u65b0u620fu6251u8857u4e86uff0cu9ec4u6e24u6301u80a1u7684u516cu53f8u8981u8d54u60e8u4e86", "url": "/article/217977.html", "author": "u5a31u4e50u8d44u672cu8bba", "fav": 5, "time": "2u5929u524d", "cmt": 3}

Each item is serialized to a str with the default ensure_ascii=True, so every non-ASCII character is escaped as `\uXXXX`, and each `{...}` record is written to the file as one line.
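What that means in plain json, outside Scrapy (a minimal sketch; the sample strings come from the output above, and the file name demo.jl is arbitrary):

# -*- coding: utf-8 -*-
import io
import json

item = {u'title': u'这一周:贫穷暴击', u'author': u'虎嗅'}

# Default ensure_ascii=True: every non-ASCII character becomes a \uXXXX escape
print(json.dumps(item))
# {"title": "\u8fd9\u4e00\u5468\uff1a\u8d2b\u7a77\u66b4\u51fb", "author": "\u864e\u55c5"}

# ensure_ascii=False keeps the Chinese text as-is
line = json.dumps(item, ensure_ascii=False)
print(line)
# {"title": "这一周:贫穷暴击", "author": "虎嗅"}

# The stream must then accept unicode itself (works on Python 2 and 3):
with io.open('demo.jl', 'w', encoding='utf-8') as f:
    f.write(line + u'\n')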

Goal (check the final file with Chrome or Notepad++; Firefox may show garbled Chinese when opening the .jl file unless you specify the encoding manually):

    {"title": "这一周:贫穷暴击", "url": "/article/217997.html", "author": "虎嗅", "fav": 8, "time": "2天前", "cmt": 5}
    {"title": "倪萍老公的新戏扑街了,黄渤持股的公司要赔惨了", "url": "/article/217977.html", "author": "娱乐资本论", "fav": 5, "time": "2天前", "cmt": 3}

1. References

Scrapy scrapes Chinese but saves it to the JSON file as unicode escapes; how to solve it:

import json
import codecs

class JsonWithEncodingPipeline(object):

    def __init__(self):
        # codecs.open returns a stream whose write() accepts unicode
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
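To actually run this pipeline it has to be registered in settings.py; a minimal sketch (the module path myproject.pipelines is hypothetical):

# settings.py -- hypothetical project layout
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWithEncodingPipeline': 300,
}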

Outputting and saving Chinese in Scrapy

The Scrapy framework scrapes Chinese results as Unicode escapes; how to convert them to UTF-8

    lidashuang / imax-spider

The references above are essentially the pipeline example from the official documentation, with ensure_ascii=False additionally specified:

    Write items to a JSON file

    The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"  # additionally pass ensure_ascii=False here
        self.file.write(line)
        return item

    Note

    The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

     

2. A better solution

When using Scrapy's item export to write Chinese to a JSON file, the content comes out as unicode escapes; how to output the Chinese itself?

http://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence mentions that setting the JSONEncoder argument ensure_ascii to False is all that is needed.

And the Scrapy item export documentation notes:

    The additional constructor arguments are passed to the
    BaseItemExporter constructor, and the leftover arguments to the
    JSONEncoder constructor, so you can use any JSONEncoder constructor
    argument to customize this exporter.

So it is enough to pass ensure_ascii=False when instantiating scrapy.contrib.exporter.JsonItemExporter.
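A sketch of that approach as an item pipeline (the class name JsonExportPipeline and the file name items.json are my own; the import path scrapy.exporters is the modern alias for scrapy.contrib.exporter):

from scrapy.exporters import JsonItemExporter  # scrapy.contrib.exporter in old Scrapy

class JsonExportPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.json', 'wb')  # the exporter writes bytes
        # leftover kwargs fall through to JSONEncoder, so ensure_ascii reaches it
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()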

3. Based on the answers above, the official docs, and the source code, the direct fix:

1. Add FEED_EXPORT_ENCODING = 'utf-8' to the project's settings.py (sketched below the list), or

2. Pass it on the command line: G:\pydata\pycode\scrapy\huxiu_com>scrapy crawl -o new.jl -s FEED_EXPORT_ENCODING='utf-8' huxiu
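The settings.py route is a single line:

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'  # feed exports (e.g. -o new.jl) now write UTF-8 instead of \uXXXX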

    https://doc.scrapy.org/en/latest/topics/feed-exports.html#feed-export-encoding

    FEED_EXPORT_ENCODING

    Default: None

    The encoding to be used for the feed.

If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

    Use utf-8 if you want UTF-8 for JSON too.

    In [615]: json.dump?
    Signature: json.dump(obj, fp, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, encoding='utf-8', default=None, sort_keys=False, **kw)
    Docstring:
    Serialize ``obj`` as a JSON formatted stream to ``fp`` (a
    ``.write()``-supporting file-like object).
    
    
    
If ``ensure_ascii`` is true (the default), all non-ASCII characters in the
output are escaped with ``\uXXXX`` sequences, and the result is a ``str``
instance consisting of ASCII characters only. If ``ensure_ascii`` is
``False``, some chunks written to ``fp`` may be ``unicode`` instances.
    This usually happens because the input contains unicode strings or the
    ``encoding`` parameter is used. Unless ``fp.write()`` explicitly
    understands ``unicode`` (as in ``codecs.getwriter``) this is likely to
    cause an error.
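The warning in the last sentence is the Python 2 pitfall the codecs-based pipeline above avoids: with ensure_ascii=False, json may hand unicode chunks to fp.write(). A minimal sketch of the safe pattern the docstring mentions (out.json is an arbitrary file name):

# -*- coding: utf-8 -*-
import codecs
import json

# A plain open() file on Python 2 would choke on unicode chunks;
# codecs.getwriter wraps it so write() encodes unicode to UTF-8.
with open('out.json', 'wb') as raw:
    fp = codecs.getwriter('utf-8')(raw)
    json.dump({u'title': u'这一周'}, fp, ensure_ascii=False)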

     

C:\Program Files\Anaconda2\Lib\site-packages\scrapy\exporters.py (excerpted; only the encoding-related lines are shown):

class JsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # ensure_ascii defaults to False whenever an encoding is configured
        kwargs.setdefault('ensure_ascii', not self.encoding)
        ...

class JsonItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        kwargs.setdefault('ensure_ascii', not self.encoding)
        ...

class XmlItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # XML always falls back to utf-8
        if not self.encoding:
            self.encoding = 'utf-8'
        ...
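So with FEED_EXPORT_ENCODING = 'utf-8', self.encoding is truthy and ensure_ascii defaults to False, which is exactly the desired behavior. A quick check of that logic outside a crawl (assuming the exporter accepts the encoding keyword shown in the excerpt and a plain dict as the item):

# -*- coding: utf-8 -*-
from io import BytesIO
from scrapy.exporters import JsonLinesItemExporter

buf = BytesIO()
exporter = JsonLinesItemExporter(buf, encoding='utf-8')  # like FEED_EXPORT_ENCODING='utf-8'
exporter.export_item({u'title': u'这一周:贫穷暴击'})
print(buf.getvalue().decode('utf-8'))
# {"title": "这一周:贫穷暴击"} -- UTF-8 text, no \uXXXX escapes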
Original post: https://www.cnblogs.com/my8100/p/feed_export_encoding.html