  • What's the difference between UTF-8 and UTF-8 without BOM?

    https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom

    Answer1

    The UTF-8 BOM is a sequence of bytes at the start of a text stream (EF BB BF) that allows the reader to more reliably guess that a file is encoded in UTF-8.

    Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
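
    To make the endianness point concrete, here is a minimal sketch that encodes the BOM code point (U+FEFF) in UTF-16 and in UTF-8. The variable names are illustrative only and are not part of the original answer.

```python
import codecs

# U+FEFF is the code point used as the byte order mark.
BOM = "\ufeff"

# In UTF-16 the order of the two BOM bytes reveals the endianness.
utf16_le = BOM.encode("utf-16-le")  # little-endian
utf16_be = BOM.encode("utf-16-be")  # big-endian

# UTF-8 has only one byte order, so the BOM always encodes to the same bytes.
utf8 = BOM.encode("utf-8")

print(utf16_le.hex())  # fffe
print(utf16_be.hex())  # feff
print(utf8.hex())      # efbbbf
print(utf8 == codecs.BOM_UTF8)  # True
```

    The UTF-16 results differ by byte order, while the UTF-8 result is always EF BB BF, so in UTF-8 the BOM can only act as a signature.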

    According to the Unicode standard, the BOM for UTF-8 files is not recommended:

    2.6 Encoding Schemes

    ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.

    Answer2

    The other excellent answers already answered that:

    • There is no official difference between UTF-8 and BOM-ed UTF-8
    • A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF
    • Those bytes, if present, must be ignored when extracting the string from the file/stream.
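
    The points above can be sketched in Python. The helper `strip_utf8_bom` is a hypothetical name introduced for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = b"\xef\xbb\xbfABC"

# The plain "utf-8" codec keeps the BOM as a U+FEFF character in the text.
kept = raw.decode("utf-8")
# The "utf-8-sig" codec drops a leading BOM and otherwise behaves like "utf-8".
dropped = raw.decode("utf-8-sig")

print(repr(kept))           # '\ufeffABC'
print(repr(dropped))        # 'ABC'
print(strip_utf8_bom(raw))  # b'ABC'
```

    Data without a BOM passes through both `strip_utf8_bom` and `utf-8-sig` unchanged, so either can be applied when it is unknown whether a BOM is present.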

    But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...

    For example, the data [EF BB BF 41 42 43] could either be:

    • The legitimate ISO-8859-1 string "ï»¿ABC"
    • The legitimate UTF-8 string "ABC"
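
    A short sketch of the two readings of the same six bytes. It assumes the UTF-8 reader ignores the BOM, which is done here with Python's `utf-8-sig` codec.

```python
data = bytes([0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43])

# ISO-8859-1 maps every byte to one character, so all six bytes become text:
# 0xEF, 0xBB, 0xBF are the characters U+00EF, U+00BB, U+00BF, followed by "ABC".
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8 with the signature ignored, only "ABC" remains.
as_utf8 = data.decode("utf-8-sig")

print(ascii(as_latin1), len(as_latin1))  # '\xef\xbb\xbfABC' 6
print(ascii(as_utf8), len(as_utf8))      # 'ABC' 3
```

    Both decodes succeed without error, which is why the bytes alone cannot settle the question.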

    So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on this, as shown by the example above.

    Encodings should be known, not divined.

    Comment

    You understood correctly. The string [EF BB BF 41 42 43] is just a bunch of bytes. You need external information to choose how to interpret it. If you believe those bytes were encoded using ISO-8859-1, then the string is "ï»¿ABC". If you believe those bytes were encoded using UTF-8, then it is "ABC". If you don't know, then you must try to find out. The BOM could be a clue. The absence of invalid characters when decoded as UTF-8 could be another... In the end, unless you can memorize/find the encoding somehow, an array of bytes is just an array of bytes.
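
    The second clue mentioned in the comment, the absence of invalid sequences when decoding as UTF-8, can be tested with a strict decode. This is only a heuristic sketch, and `looks_like_utf8` is a hypothetical helper name, not an established API.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (a clue, not a proof)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8(b"\xef\xbb\xbfABC"))  # True
print(looks_like_utf8(b"caf\xe9"))          # False: a lone 0xE9 is not valid UTF-8
print(looks_like_utf8(b"ABC"))              # True, yet also valid ASCII and ISO-8859-1
```

    A True result only says the bytes are consistent with UTF-8. As the last case shows, the same bytes can be valid in other encodings too.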

    Convert a file to UTF-8, but still shown as ASCII

    https://stackoverflow.com/questions/11303405/force-encode-from-us-ascii-to-utf-8-iconv

    ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
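
    A quick check of the subset claim, as a minimal sketch with an arbitrary sample string:

```python
text = "plain ASCII text"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text, both encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)  # True

# Only characters outside ASCII need more than one byte in UTF-8.
print("é".encode("utf-8"))  # b'\xc3\xa9'
```

    This is why a tool that inspects such a file still reports it as ASCII after a "conversion" to UTF-8: the bytes did not change.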

    It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
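
    Once the real source encoding is known, transcoding means decoding with that encoding and re-encoding as UTF-8. The sketch below assumes ISO-8859-1 as the source purely for illustration, and `transcode_to_utf8` is a hypothetical helper name.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode data using the known source encoding, then encode it as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

# "café" as ISO-8859-1 bytes (the source encoding is an assumption here).
legacy = b"caf\xe9"

converted = transcode_to_utf8(legacy, "iso-8859-1")
print(converted)  # b'caf\xc3\xa9'
```

    Passing the wrong `source_encoding` either raises an error or silently produces wrong characters, so the encoding has to be established first.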

  • Original article: https://www.cnblogs.com/chucklu/p/10298239.html