  • Should UTF-8 use a BOM? Problems caused by DirectVobSub not supporting UTF-8 without a BOM

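    I use DirectVobSub as the subtitle plugin for my media player.

    When I convert a subtitle file to UTF-8 without a BOM, the subtitles are garbled during playback.

    When I convert the same file to UTF-8 with a BOM, the subtitles display correctly.

    So it appears that DirectVobSub does not support UTF-8 without a BOM.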
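A minimal Python sketch of the conversion described above: it prepends the UTF-8 BOM to a subtitle file that lacks one. The helper names `with_utf8_bom` and `add_bom_to_file` are made up for illustration and are not part of DirectVobSub or any other tool.

```python
import codecs
from pathlib import Path


def with_utf8_bom(data: bytes) -> bytes:
    """Return data prefixed with the UTF-8 BOM (EF BB BF) unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    # Validate first: raises UnicodeDecodeError if the input is not UTF-8,
    # so a file in another encoding (e.g. GBK) is not silently mislabeled.
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data


def add_bom_to_file(path: str) -> None:
    """Rewrite the file at `path` in place so that it starts with a BOM."""
    p = Path(path)
    p.write_bytes(with_utf8_bom(p.read_bytes()))
```

For example, `add_bom_to_file("movie.srt")` (a hypothetical file name) would rewrite that subtitle file in place. The function is idempotent, so running it twice does not add a second BOM.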

    Official site of DirectVobSub (VSFilter): http://sourceforge.net/projects/guliverkli2/files/DirectShow%20Filters/

    Should UTF-8 use a BOM or not? What does the Unicode standard say?

    I looked it up; the results are below for reference:

    http://zh.wikipedia.org/zh-cn/UTF-8#UTF-8.E7.9A.84.E8.A1.8D.E7.94.9F.E7.89.A9

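    Wikipedia says:

    Although it is not standard, many Windows programs (including Windows Notepad) add the byte sequence EF BB BF to the beginning of UTF-8 encoded files. This is the UTF-8 encoding of the byte order mark U+FEFF. Text editors and browsers that do not expect UTF-8 will display it as the ISO-8859-1 string "ï»¿".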
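The byte values in the quote can be checked directly. The short Python sketch below encodes U+FEFF as UTF-8 and then misreads the resulting bytes as ISO-8859-1; the variable names are only for illustration.

```python
# Encode the byte order mark U+FEFF as UTF-8.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes.hex(" "))  # prints: ef bb bf

# A reader that assumes ISO-8859-1 maps each byte to one character:
# EF -> ï, BB -> », BF -> ¿
mojibake = bom_bytes.decode("iso-8859-1")
assert mojibake == "\u00ef\u00bb\u00bf"  # the three characters "ï»¿"
```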

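    Judging from Wikipedia's wording, it seems a BOM should not be used.

    Given my prejudice that "if Microsoft can be trusted, pigs can climb trees", and since Windows Notepad writes a BOM when saving as UTF-8 while gedit writes UTF-8 without one, I assumed UTF-8 should not use a BOM.

    Then I found http://unicode.org/faq/utf_bom.html#bom1

    unicode.org says: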

    Q: How I should deal with BOMs?

    A: Here are some guidelines to follow:

    1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

    2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

      • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

      • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

    3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

    4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. See also [ Q: What is the difference between UCS-2 and UTF-16?] [AF] & [MD]
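Guideline 2 (BOM as a signature for untagged text) can be sketched in Python as below. This is only an illustration: `sniff_bom` and `decode_untagged` are hypothetical helper names, and the `fallback` encoding is an assumption, because without a BOM the encoding could be anything.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_untagged(data: bytes, fallback: str = "utf-8") -> str:
    """Decode text of unknown encoding, using the BOM as a signature if present."""
    encoding, skip = sniff_bom(data)
    # No BOM: the fallback is a guess supplied by the caller, not a detection.
    return data[skip:].decode(encoding or fallback)
```

One known limitation of this ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32 LE. That is rare in real text files, but a stricter detector would need extra checks.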

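    Now it makes sense. My understanding is:

    When a plain-text file does not declare its encoding, use a BOM. Without a BOM, the encoding is hard to determine.

    If the data declares its encoding, for example data stored in a database (where the encoding is declared in the database), XML (declared with encoding="utf-8"), or HTML (declared with charset=utf-8), a BOM should not be used ("the BOM should not be used").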
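When the encoding is already known from a declaration, a reader can tolerate a stray BOM and a writer can leave it out. The Python sketch below shows the difference between the standard `utf-8` and `utf-8-sig` codecs; the sample payload is made up for illustration.

```python
raw = b"\xef\xbb\xbf" + "title=字幕".encode("utf-8")  # UTF-8 data that carries a BOM

# The plain "utf-8" codec keeps the BOM as an invisible U+FEFF character,
# which can break parsers that expect the text to start with real content.
as_utf8 = raw.decode("utf-8")
assert as_utf8[0] == "\ufeff"

# "utf-8-sig" strips a leading BOM if present and accepts input without one.
as_sig = raw.decode("utf-8-sig")
assert as_sig == "title=字幕"
assert b"plain".decode("utf-8-sig") == "plain"

# When writing, "utf-8" emits no BOM and "utf-8-sig" prepends one.
assert "abc".encode("utf-8") == b"abc"
assert "abc".encode("utf-8-sig") == b"\xef\xbb\xbfabc"
```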

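    It follows that using a UTF-8 BOM in plain-text files is acceptable.

    This reminds me that on Linux I used to get garbled UTF-8 subtitles in several players (mplayer, smplayer). Could that be because Linux text editors (gedit and others) produce UTF-8 without a BOM?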

  • Original article: https://www.cnblogs.com/sink_cup/p/DirectVobSub_utf8_no_bom_vsfilter_should_use_utf8_bom_or_not.html