  • Node.js Cheerio parser breaks UTF-8 encoding

    From: https://stackoverflow.com/questions/31574127/node-js-cheerio-parser-breaks-utf-8-encoding

    [Question]

    I parse the response body with Cheerio like this:

    var request = require("request");
    var cheerio = require("cheerio");

    // The URL must be a quoted string.
    var url = "http://shop.nag.ru/catalog/16939.IP-videonablyudenie-OMNY/16944.IP-kamery-OMNY-c-vario-obektivom/16704.OMNY-1000-PRO";
    request.get(url, function (err, response, body) {
      if (err) throw err;
      console.log(body);
      var $ = cheerio.load(body);
      console.log($(".description").html());
    });

    The output contains the right content, but in an unreadable encoding:

    // Plain body from console.log(body) (Russian characters):
    <h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1><p style
    
    // Cheerio output from console.log($(".description").html()):
    <h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY

    The target page is encoded in UTF-8, so why does Cheerio break my encoding?

    I tried using iconv to decode the response body:

    var body1 = iconv.decode(body, "utf-8");

    but console.log($(".description").html()); still prints the same entity-encoded text.

    [Answer]

    Cheerio hasn't broken anything. The HTML it outputs will be rendered by any browser exactly the same as the HTML input. Take a look at this snippet:

    <h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>
    
    <h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>

    The two render identically because &#x423; is simply the HTML numeric character reference for the Unicode character У (U+0423), in the same way that the entity &gt; represents >.
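
    To make the equivalence concrete, here is a minimal sketch in plain Node.js that turns numeric character references back into the characters they stand for. The helper name decodeNumericRefs is hypothetical and is not part of Cheerio; it handles only numeric references (not named entities such as &gt;) and is meant as an illustration, not as a replacement for a real HTML decoder.

    ```javascript
    // Hypothetical helper (not a Cheerio API): decode numeric character
    // references such as &#x423; (hexadecimal) or &#1059; (decimal).
    function decodeNumericRefs(html) {
      return html
        // Hexadecimal form: &#x423; -> code point 0x423
        .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) =>
          String.fromCodePoint(parseInt(hex, 16)))
        // Decimal form: &#1059; -> code point 1059
        .replace(/&#([0-9]+);/g, (_, dec) =>
          String.fromCodePoint(parseInt(dec, 10)));
    }

    // The start of the string Cheerio printed in the question.
    const encoded =
      "&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD " +
      "&#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY";

    const decoded = decodeNumericRefs(encoded);
    console.log(decoded); // Уличная 3Мп IP HD камера OMNY
    ```

    The decoded string matches the plain body character for character, which is why a browser shows both forms identically.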

    However, if you want the characters emitted as-is rather than as entities, set the decodeEntities option to false when loading the document:

    const $ = cheerio.load(
      `<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>`,
      { decodeEntities: false }
    );
    
    
    console.log($('span').html())
    // => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше
  • Original post: https://www.cnblogs.com/time-is-life/p/9209454.html