  • W3C standards in detail: the lang attribute in HTML 4.01 ("actually, it is a razor")

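    Which value should the lang attribute take in HTML and XHTML?

    Should it be zh-CN, zh-Hans, or zh-Hans-CN?

    Should it be zh-CN or zh-cn, that is, is the value case-sensitive?

    Should it be yue-Hans or zh-yue-Hans?

    Why have browsers always used zh-cn?

    Here is a piece of HTML 4.01 code: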

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html lang="zh-CN">
    <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <title>实际上它是一个刮胡刀</title>
    </head>
    <body>
    <p lang="zh-CN">它是一个刮胡刀
    </p>
    <p lang="yue-Hans">佢系一个须刨嚟嘅
    </p>
    </body>
    </html>


    它是一个刮胡刀 (Mandarin: "It is a razor.")

    佢系一个须刨嚟嘅 (Cantonese, same meaning)
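
    To see which lang values a page like the one above actually declares, a small script can walk the markup and list them. This is only a sketch using Python's standard html.parser module; the names LangCollector, collect_langs, and SAMPLE are made up for illustration, and SAMPLE is a shortened copy of the page above.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect (tag, lang) pairs for every start tag that carries a lang attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names arrive lowercased.
        for name, value in attrs:
            if name == "lang":
                self.found.append((tag, value))


def collect_langs(markup):
    """Return the (tag, lang) pairs found in an HTML string, in document order."""
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# Shortened copy of the sample page (hypothetical variable name).
SAMPLE = """<html lang="zh-CN">
<body>
<p lang="zh-CN">它是一个刮胡刀</p>
<p lang="yue-Hans">佢系一个须刨嚟嘅</p>
</body>
</html>"""

for tag, lang in collect_langs(SAMPLE):
    print(tag, lang)
```

    The script only reports what the author wrote; whether zh-CN or yue-Hans is a good choice is the question the rest of this post works through.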

    First, the W3C HTML 4.01 standard:

    HTML 4.01 Specification (W3C Recommendation 24 December 1999): http://www.w3.org/TR/html401/#toc

    Chapter 6 (Basic HTML data types), Section 8 (Language codes): http://www.w3.org/TR/html401/types.html#h-6.8

    The original text reads:

    6.8 Language codes

    The value of attributes whose type is a language code ( %LanguageCode in the DTD) refers to a language code as specified by [RFC1766], section 2. For information on specifying language codes in HTML, please consult the section on language codes. Whitespace is not allowed within the language-code.

    Language codes are case-insensitive.

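    In other words, W3C specifies that the value of lang in HTML 4.01 is a language code as defined in RFC 1766, and that language codes in HTML 4.01 are case-insensitive.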
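
    Since the codes are case-insensitive, any code that compares two lang values should ignore case. A minimal sketch follows; same_language_code is a hypothetical helper name, not something defined by the specification.

```python
def same_language_code(a, b):
    """Return True when two language codes are equal, ignoring letter case."""
    # Language codes are ASCII, so lowercasing both sides is enough.
    return a.lower() == b.lower()


print(same_language_code("zh-CN", "zh-cn"))
print(same_language_code("zh-CN", "zh-HK"))
```

    The first call compares two spellings of the same code; the second compares two different codes.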

    Details: http://www.w3.org/TR/html401/struct/dirlang.html#langcodes

    The original text reads:

    8.1.1 Language codes

    The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes.

    [RFC1766] defines and explains the language codes that must be used in HTML documents.

    Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

     language-code = primary-code ( "-" subcode )*

    Here are some sample language codes:

    • "en": English
    • "en-US": the U.S. version of English.
    • "en-cockney": the Cockney version of English.
    • "i-navajo": the Navajo language spoken by some Native Americans.
    • "x-klingon": The primary tag "x" indicates an experimental language tag

    Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

    Any two-letter subcode is understood to be a [ISO3166] country code.

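    That is:

    The value of lang is a language code that identifies a language people use to communicate with each other by speaking, writing, or other means.

    RFC 1766 defines the language codes that must be used in HTML.

    A language code consists of a primary code and a series of subcodes. The primary code is required; the subcodes are optional.

    The format is: primary, primary-subcode, primary-subcode-subcode, and so on.

    For example, en means English and en-US means U.S. English.

    Two-letter primary codes follow ISO 639 and include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

    Two-letter subcodes are ISO 3166 country codes.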
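
    The structure summarized above (one primary code, then zero or more subcodes joined by hyphens, with no whitespace) can be sketched as a small splitting function. The name split_language_code is invented for this illustration, and the function checks only the shape of the code, not whether the parts are registered anywhere.

```python
def split_language_code(code):
    """Split a language code into (primary_code, [subcodes]).

    Checks shape only: no whitespace and no empty parts.
    """
    if not code or any(ch.isspace() for ch in code):
        raise ValueError("a language code must be non-empty and contain no whitespace")
    parts = code.split("-")
    if any(part == "" for part in parts):
        raise ValueError("empty primary code or subcode in %r" % code)
    return parts[0], parts[1:]


for sample in ("en", "en-US", "en-cockney", "x-klingon"):
    print(sample, split_language_code(sample))
```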

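    Next, RFC 1766: http://www.ietf.org/rfc/rfc1766.txt

    RFC 1766 is published by the Internet Engineering Task Force (IETF), whose website is http://www.ietf.org/

    In summary, RFC 1766 says:

    Primary language tag:

    Two-letter primary tags are interpreted according to ISO standard 639 ("Code for the representation of names of languages" [ISO 639]). Apart from the reserved values i (IANA registrations) and x (private use), no other values may be assigned.

    First subtag:

    Two-letter subtags are interpreted as ISO 3166 alpha-2 codes (the two-letter code table).

    Subtags of 3 to 8 letters may be registered with IANA, following the procedure described in Section 5.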
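
    These rules are mechanical enough to sketch in code. The function below is an illustration only: classify_rfc1766 and its label strings are made up here, the 1-to-8-letter limit on each part is taken from the RFC 1766 grammar, and the "private use" label for subtags under x reflects my reading of the RFC rather than anything quoted in this post.

```python
import re

# RFC 1766 shape: parts of 1 to 8 ASCII letters, joined by hyphens.
_TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def classify_rfc1766(tag):
    """Return (primary_kind, first_subtag_kind) for an RFC 1766 style tag."""
    if not _TAG_RE.fullmatch(tag):
        raise ValueError("not a well-formed RFC 1766 tag: %r" % tag)
    parts = tag.lower().split("-")  # tags are case-insensitive
    primary = parts[0]
    if len(primary) == 2:
        primary_kind = "ISO 639"
    elif primary == "i":
        primary_kind = "IANA (i)"
    elif primary == "x":
        primary_kind = "private use (x)"
    else:
        primary_kind = "unassigned"

    if len(parts) == 1:
        subtag_kind = None
    elif primary == "x":
        subtag_kind = "private use"       # assumption, see the lead-in
    elif len(parts[1]) == 2:
        subtag_kind = "ISO 3166 country"
    elif len(parts[1]) >= 3:
        subtag_kind = "IANA registration"
    else:
        subtag_kind = "unspecified"       # a one-letter first subtag
    return primary_kind, subtag_kind


for sample in ("zh", "zh-CN", "zh-Hans", "i-navajo"):
    print(sample, classify_rfc1766(sample))
```

    Note that under this reading the Hans in zh-Hans falls into the "3 to 8 letters, registered with IANA" bucket, which is why the search moves on to the IANA registry later in this post.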

     NOTE: The ISO 639/ISO 3166 convention is that language names are
    written in lower case, while country codes are written in upper case.
    This convention is recommended, but not enforced; the tags are case
    insensitive.

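    In short: writing language names in lowercase and country codes in uppercase is the ISO 639 / ISO 3166 convention. It is recommended but not mandatory; language codes are case-insensitive.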
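
    If you want your pages to follow the recommended convention consistently, the casing can be applied automatically. The sketch below uses a hypothetical helper, conventional_case. Lowercasing the language and uppercasing two-letter subtags comes from the note above; title-casing four-letter subtags such as Hans is an extra assumption borrowed from the later RFC 5646 convention and is not part of RFC 1766.

```python
def conventional_case(tag):
    """Rewrite a language tag in the conventional letter case."""
    parts = tag.split("-")
    result = [parts[0].lower()]          # language: lowercase
    for sub in parts[1:]:
        if len(sub) == 2:
            result.append(sub.upper())   # country code: uppercase
        elif len(sub) == 4:
            result.append(sub.title())   # assumption: script-style subtag, e.g. Hans
        else:
            result.append(sub.lower())
    return "-".join(result)


for sample in ("ZH-cn", "zh-hans-cn", "EN-Cockney"):
    print(sample, "->", conventional_case(sample))
```

    Because the tags are case-insensitive, this changes only how a tag looks, never what it means.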

    Next, ISO 639:

    [ISO639]
    "Codes for the representation of names of languages", ISO 639:1988. For more information, consult http://www.iso.ch/cate/d4766.html. Refer also to http://www.oasis-open.org/cover/iso639a.html.

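    According to Wikipedia (http://zh.wikipedia.org/zh-cn/ISO_639), ISO 639 is a set of standards in which the International Organization for Standardization (ISO) defines codes for languages.

    The standard is still being updated.

    The 1988 ISO 639 standard: http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt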

    Technical contents of ISO 639:1988 (E/F)
    "Code for the representation of names of languages".
    Typed by Keld.Simonsen@dkuug.dk 1990-11-30 <ftp://dkuug.dk/i18n/ISO_639>
    Minor corrections, 1992-09-08 by Keld Simonsen
    Sundanese corrected, 1992-11-11 by Keld Simonsen
    Telugu corrected, 1995-08-24 by Keld Simonsen
    Hebrew, Indonesian, Yiddish corrected 1995-10-10 by Michael Everson
    Inuktitut, Uighur, Zhuang added 1995-10-10 by Michael Everson
    Sinhalese corrected, 1995-10-10 by Michael Everson
    Faeroese corrected to Faroese, 1995-11-18 by Keld Simonsen
    Sangro corrected to Sangho, 1996-07-28 by Keld Simonsen
    Two-letter lower-case symbols are used.
    The Registration Authority for ISO 639 is Infoterm, Osterreichisches
    Normungsinstitut (ON), Postfach 130, A-1021 Vienna, Austria.

    aa Afar
    ab Abkhazian
    af Afrikaans
    am Amharic
    ar Arabic
    as Assamese
    ay Aymara
    az Azerbaijani

    ba Bashkir
    be Byelorussian
    bg Bulgarian
    bh Bihari
    bi Bislama
    bn Bengali; Bangla
    bo Tibetan
    br Breton

    ca Catalan
    co Corsican
    cs Czech
    cy Welsh
    da Danish
    de German
    dz Bhutani
    el Greek
    en English
    eo Esperanto
    es Spanish
    et Estonian
    eu Basque
    fa Persian
    fi Finnish
    fj Fiji
    fo Faroese
    fr French
    fy Frisian
    ga Irish
    gd Scots Gaelic
    gl Galician
    gn Guarani
    gu Gujarati
    ha Hausa
    he Hebrew (formerly iw)
    hi Hindi
    hr Croatian
    hu Hungarian
    hy Armenian
    ia Interlingua
    id Indonesian (formerly in)
    ie Interlingue
    ik Inupiak
    is Icelandic
    it Italian
    iu Inuktitut
    ja Japanese
    jw Javanese

    ka Georgian
    kk Kazakh
    kl Greenlandic
    km Cambodian
    kn Kannada
    ko Korean
    ks Kashmiri
    ku Kurdish
    ky Kirghiz
    la Latin
    ln Lingala
    lo Laothian
    lt Lithuanian
    lv Latvian, Lettish
    mg Malagasy
    mi Maori
    mk Macedonian
    ml Malayalam
    mn Mongolian
    mo Moldavian
    mr Marathi
    ms Malay
    mt Maltese
    my Burmese
    na Nauru
    ne Nepali
    nl Dutch
    no Norwegian
    oc Occitan
    om (Afan) Oromo
    or Oriya
    pa Punjabi
    pl Polish
    ps Pashto, Pushto
    pt Portuguese
    qu Quechua
    rm Rhaeto-Romance
    rn Kirundi
    ro Romanian
    ru Russian
    rw Kinyarwanda
    sa Sanskrit
    sd Sindhi
    sg Sangho
    sh Serbo-Croatian
    si Sinhalese
    sk Slovak
    sl Slovenian
    sm Samoan
    sn Shona
    so Somali
    sq Albanian
    sr Serbian
    ss Siswati
    st Sesotho
    su Sundanese
    sv Swedish
    sw Swahili
    ta Tamil
    te Telugu
    tg Tajik
    th Thai
    ti Tigrinya
    tk Turkmen
    tl Tagalog
    tn Setswana
    to Tonga
    tr Turkish
    ts Tsonga
    tt Tatar
    tw Twi
    ug Uighur
    uk Ukrainian
    ur Urdu
    uz Uzbek
    vi Vietnamese
    vo Volapuk
    wo Wolof
    xh Xhosa
    yi Yiddish (formerly ji)
    yo Yoruba
    za Zhuang
    zh Chinese
    zu Zulu

    At this point we have found the standard behind the primary code, that is, the language name.

    Next, the subcodes.

    [ISO3166]
    "Codes for the representation of names of countries", ISO 3166:1993.

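    According to Wikipedia (http://zh.wikipedia.org/zh-cn/ISO_3166):

    The ISO 3166 international standard assigns codes to countries and regions. It has three parts:

        * ISO 3166-1 contains the ISO standard country codes: two-letter codes, three-letter codes, and three-digit numeric codes. First published in 1974.
        * ISO 3166-2 defines codes for the principal subdivisions of countries or regions.
        * ISO 3166-3 defines codes for ISO 3166-1 codes that have been superseded. First published in 1998.

    Now for ISO 3166 alpha-2 (the two-letter code table):

    At the time, HTML 4.01 referenced ISO 3166:1993, whose contents are listed at http://xml.coverpages.org/country3166.html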

    Country Code List: ISO 3166-1993 (E)

    This international standard provides a two-letter alphabetic code for representing the names of countries, dependencies, and other areas of special geopolitical interest. The source of this code set is the "Codes for the Representation of Names of Countries (ISO 3166-1993 (E))." Note: 2005-04 correction, Nambia --> Namibia. It is available from:

    American National Standards Institute, Inc.
    11 West 42nd Street, 13th floor
    New York, New York 10036
    Code  Definition and Explanation
    AD Andorra
    AE United Arab Emirates
    AF Afghanistan
    AG Antigua & Barbuda
    AI Anguilla
    AL Albania
    AM Armenia
    AN Netherlands Antilles
    AO Angola
    AQ Antarctica
    AR Argentina
    AS American Samoa
    AT Austria
    AU Australia
    AW Aruba
    AZ Azerbaijan
    BA Bosnia and Herzegovina
    BB Barbados
    BD Bangladesh
    BE Belgium
    BF Burkina Faso
    BG Bulgaria
    BH Bahrain
    BI Burundi
    BJ Benin
    BM Bermuda
    BN Brunei Darussalam
    BO Bolivia
    BR Brazil
    BS Bahama
    BT Bhutan
    BU Burma (no longer exists)
    BV Bouvet Island
    BW Botswana
    BY Belarus
    BZ Belize
    CA Canada
    CC Cocos (Keeling) Islands
    CF Central African Republic
    CG Congo
    CH Switzerland
    CI Côte D'ivoire (Ivory Coast)
    CK Cook Islands
    CL Chile
    CM Cameroon
    CN China
    CO Colombia
    CR Costa Rica
    CS Czechoslovakia (no longer exists)
    CU Cuba
    CV Cape Verde
    CX Christmas Island
    CY Cyprus
    CZ Czech Republic
    DD German Democratic Republic (no longer exists)
    DE Germany
    DJ Djibouti
    DK Denmark
    DM Dominica
    DO Dominican Republic
    DZ Algeria
    EC Ecuador
    EE Estonia
    EG Egypt
    EH Western Sahara
    ER Eritrea
    ES Spain
    ET Ethiopia
    FI Finland
    FJ Fiji
    FK Falkland Islands (Malvinas)
    FM Micronesia
    FO Faroe Islands
    FR France
    FX France, Metropolitan
    GA Gabon
    GB United Kingdom (Great Britain)
    GD Grenada
    GE Georgia
    GF French Guiana
    GH Ghana
    GI Gibraltar
    GL Greenland
    GM Gambia
    GN Guinea
    GP Guadeloupe
    GQ Equatorial Guinea
    GR Greece
    GS South Georgia and the South Sandwich Islands
    GT Guatemala
    GU Guam
    GW Guinea-Bissau
    GY Guyana
    HK Hong Kong
    HM Heard & McDonald Islands
    HN Honduras
    HR Croatia
    HT Haiti
    HU Hungary
    ID Indonesia
    IE Ireland
    IL Israel
    IN India
    IO British Indian Ocean Territory
    IQ Iraq
    IR Islamic Republic of Iran
    IS Iceland
    IT Italy
    JM Jamaica
    JO Jordan
    JP Japan
    KE Kenya
    KG Kyrgyzstan
    KH Cambodia
    KI Kiribati
    KM Comoros
    KN St. Kitts and Nevis
    KP Korea, Democratic People's Republic of
    KR Korea, Republic of
    KW Kuwait
    KY Cayman Islands
    KZ Kazakhstan
    LA Lao People's Democratic Republic
    LB Lebanon
    LC Saint Lucia
    LI Liechtenstein
    LK Sri Lanka
    LR Liberia
    LS Lesotho
    LT Lithuania
    LU Luxembourg
    LV Latvia
    LY Libyan Arab Jamahiriya
    MA Morocco
    MC Monaco
    MD Moldova, Republic of
    MG Madagascar
    MH Marshall Islands
    ML Mali
    MN Mongolia
    MM Myanmar
    MO Macau
    MP Northern Mariana Islands
    MQ Martinique
    MR Mauritania
    MS Montserrat
    MT Malta
    MU Mauritius
    MV Maldives
    MW Malawi
    MX Mexico
    MY Malaysia
    MZ Mozambique
    NA Namibia
    NC New Caledonia
    NE Niger
    NF Norfolk Island
    NG Nigeria
    NI Nicaragua
    NL Netherlands
    NO Norway
    NP Nepal
    NR Nauru
    NT Neutral Zone (no longer exists)
    NU Niue
    NZ New Zealand
    OM Oman
    PA Panama
    PE Peru
    PF French Polynesia
    PG Papua New Guinea
    PH Philippines
    PK Pakistan
    PL Poland
    PM St. Pierre & Miquelon
    PN Pitcairn
    PR Puerto Rico
    PT Portugal
    PW Palau
    PY Paraguay
    QA Qatar
    RE Réunion
    RO Romania
    RU Russian Federation
    RW Rwanda
    SA Saudi Arabia
    SB Solomon Islands
    SC Seychelles
    SD Sudan
    SE Sweden
    SG Singapore
    SH St. Helena
    SI Slovenia
    SJ Svalbard & Jan Mayen Islands
    SK Slovakia
    SL Sierra Leone
    SM San Marino
    SN Senegal
    SO Somalia
    SR Suriname
    ST Sao Tome & Principe
    SU Union of Soviet Socialist Republics (no longer exists)
    SV El Salvador
    SY Syrian Arab Republic
    SZ Swaziland
    TC Turks & Caicos Islands
    TD Chad
    TF French Southern Territories
    TG Togo
    TH Thailand
    TJ Tajikistan
    TK Tokelau
    TM Turkmenistan
    TN Tunisia
    TO Tonga
    TP East Timor
    TR Turkey
    TT Trinidad & Tobago
    TV Tuvalu
    TW Taiwan, Province of China
    TZ Tanzania, United Republic of
    UA Ukraine
    UG Uganda
    UM United States Minor Outlying Islands
    US United States of America
    UY Uruguay
    UZ Uzbekistan
    VA Vatican City State (Holy See)
    VC St. Vincent & the Grenadines
    VE Venezuela
    VG British Virgin Islands
    VI United States Virgin Islands
    VN Viet Nam
    VU Vanuatu
    WF Wallis & Futuna Islands
    WS Samoa
    YD Democratic Yemen (no longer exists)
    YE Yemen
    YT Mayotte
    YU Yugoslavia
    ZA South Africa
    ZM Zambia
    ZR Zaire
    ZW Zimbabwe
    ZZ Unknown or unspecified country

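    According to ISO (http://www.iso.org/iso/catalogue_detail.htm?csnumber=22748), ISO 3166:1993 has been superseded by ISO 3166-1.

    The ISO 3166-1 country and region code table can be seen at http://zh.wikipedia.org/zh-cn/ISO_3166-1, where CN denotes China and HK denotes Hong Kong.

    That settles the two-letter subcodes as well.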
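
    With both tables in hand, a code such as zh-CN can be read by looking up each part. The sketch below does this with tiny excerpts of the two lists quoted above; ISO_639, ISO_3166, and describe are hypothetical names, and a real implementation would need the complete, current tables.

```python
# Small excerpts of the two tables quoted above, NOT the full standards.
ISO_639 = {"en": "English", "fr": "French", "ja": "Japanese", "zh": "Chinese"}
ISO_3166 = {
    "CN": "China",
    "HK": "Hong Kong",
    "GB": "United Kingdom (Great Britain)",
    "US": "United States of America",
}


def describe(code):
    """Describe a language code using the excerpts, or return None if unknown."""
    primary, _, rest = code.partition("-")
    language = ISO_639.get(primary.lower())       # convention: lowercase
    if language is None:
        return None
    first_sub = rest.split("-")[0] if rest else ""
    if len(first_sub) == 2:
        country = ISO_3166.get(first_sub.upper())  # convention: uppercase
        if country is None:
            return None
        return "%s (%s)" % (language, country)
    # Longer subtags (for example Hans) are not covered by these two tables.
    return language


for sample in ("zh-CN", "zh-hk", "en", "zh-Hans"):
    print(sample, "->", describe(sample))
```

    The lookup normalizes case before consulting each table, so zh-hk and zh-HK give the same answer.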

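    Now for the statement that subtags of 3 to 8 letters may be registered with IANA, following the procedure described in Section 5.

    Section 5 reads as follows: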

    5. IANA registration procedure for language tags
    Any language tag must start with an existing tag, and extend it.
    This registration form should be used by anyone who wants to use a
    language tag not defined by ISO or IANA.
    ----------------------------------------------------------------------
    LANGUAGE TAG REGISTRATION FORM
    Name of requester :
    E-mail address of requester:
    Tag to be registered :
    English name of language :
    Native name of language (transcribed into ASCII):
    Reference to published description of the language (book or article):
    ----------------------------------------------------------------------
    The language form must be sent to <ietf-types@uninett.no> for a 2-
    week review period before submitting it to IANA. (This is an open
    list. Requests to be added should be sent to <ietf-types-
    request@uninett.no>.)
    When the two week period has passed, the language tag reviewer, who
    is appointed by the IETF Applications Area Director, either forwards
    the request to IANA@ISI.EDU, or rejects it because of significant
    objections raised on the list.
    Decisions made by the reviewer may be appealed to the IESG.
    All registered forms are available online in the directory
    ftp://ftp.isi.edu/in-notes/iana/assignments/languages/

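    Section 5 does not say where the table of 3-to-8-letter codes can be found.

    A search turns up IANA (Internet Assigned Numbers Authority), whose website is http://www.iana.org/

    According to Wikipedia (http://zh.wikipedia.org/wiki/IANA):

    IANA stands for Internet Assigned Numbers Authority. It is the body that manages IP addresses, domain names, and many other parameters used on the Internet. Day-to-day responsibility for allocating IP addresses, autonomous system numbers, and many top-level and second-level domain names lies with the Internet Registry (IR) and the regional registries.

    The IANA language subtag registry is here: http://www.iana.org/assignments/language-subtag-registry

    An excerpt: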

     
    %%
    Type: redundant
    Tag: zh-Hans
    Description: simplified Chinese
    Added: 2003-05-30
    %%
    Type: redundant
    Tag: zh-Hans-CN
    Description: PRC Mainland Chinese in simplified script
    Added: 2005-04-13
    %%
    Type: redundant
    Tag: zh-Hans-HK
    Description: Hong Kong Chinese in simplified script
    Added: 2005-04-11

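    The tag zh-Hans was added on 2003-05-30. It denotes "simplified Chinese" (the Western term), also known as "standard Chinese characters" (规范汉字, the term used in China).

    The tag zh-Hans-CN was added on 2005-04-13 and denotes "PRC Mainland Chinese in simplified script".

    The tag zh-Hans-HK denotes "Hong Kong Chinese in simplified script".

    That settles the 3-to-8-letter subcodes as well.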
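
    The registry excerpt above is plain text: records are separated by lines containing %% and each field is a "Name: value" line. A rough sketch of reading it follows; parse_registry and EXCERPT are invented names, and the sketch ignores two features of the real registry file (values folded onto continuation lines and fields that repeat within a record), so treat it as a starting point only.

```python
def parse_registry(text):
    """Parse '%%'-separated records of 'Name: value' lines into a list of dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "%%":
            # A separator line closes the record collected so far.
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


# The three records quoted above (hypothetical variable name).
EXCERPT = """%%
Type: redundant
Tag: zh-Hans
Description: simplified Chinese
Added: 2003-05-30
%%
Type: redundant
Tag: zh-Hans-CN
Description: PRC Mainland Chinese in simplified script
Added: 2005-04-13
%%
Type: redundant
Tag: zh-Hans-HK
Description: Hong Kong Chinese in simplified script
Added: 2005-04-11
"""

for record in parse_registry(EXCERPT):
    print(record["Tag"], "|", record["Description"], "|", record["Added"])
```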

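    The conclusion:

    Following the ISO international standards, use zh-CN and zh-HK. By convention the language is written in lowercase (zh) and the country or region code in uppercase (CN), but this is not mandatory.

    ISO is the international standard; follow ISO and you will not go wrong. The browsers all do the same.

    IANA's language codes are updated faster, but IANA carries less weight than ISO, and the browsers do not adopt its codes. Under the W3C HTML 4.01 specification, IANA language codes are allowed. So if you accept IANA's newer codes, go ahead and use them: they also conform to W3C, and zh-Hans is correct too.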


    [Screenshots: the language codes supported by each browser]

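    That resolves the question of lang attribute values in HTML 4.01.

    Which values XHTML 1.0 should use is a topic for next time. XHTML 2 has been discontinued, but lang in HTML5 may well extend what XHTML defined.

    Some related questions follow; I will come back to them when time permits.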

    %%
    Type: redundant
    Tag: zh-yue
    Description: Cantonese
    Added: 1999-12-18
    Deprecated: 2009-07-29
    Preferred-Value: yue

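    IANA has deprecated zh-yue; the preferred value is yue.

    However, yue is not a separate language in ISO 639-2; it appears only in ISO 639-3 (as Yue Chinese), so browsers seem unlikely to support it.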

    http://zh.wikipedia.org/zh-cn/ISO_639-2%E4%BB%A3%E7%A0%81%E8%A1%A8

    http://zh.wikipedia.org/zh-cn/ISO_639-3 

    For the two views, "Cantonese is a language within the Chinese language family" and "Cantonese is a dialect of Chinese", see: http://zh.wikipedia.org/zh-cn/%E6%B1%89%E8%AF%AD

    %%
    Type: language
    Subtag: cmn
    Description: Mandarin Chinese
    Added: 2009-07-29
    Macrolanguage: zh
    %%
    Type: grandfathered
    Tag: zh-guoyu
    Description: Mandarin or Standard Chinese
    Added: 1999-12-18
    Deprecated: 2005-07-15
    Preferred-Value: cmn

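    ISO 639 defines language codes, so zh should denote Chinese, and Chinese in turn divides into "pronunciation" (spoken forms) and "script" (written forms). I will discuss pronunciation versus script when time permits.

    The IANA records show that zh-guoyu denotes Modern Standard Chinese but is deprecated; cmn is the preferred value for Modern Standard Chinese.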
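
    The Deprecated and Preferred-Value fields in the records above suggest a simple replacement step before a tag is written into a page. A minimal sketch, with the mapping copied by hand from the two deprecated records quoted in this post (PREFERRED and preferred_tag are hypothetical names, and the real registry contains many more such entries):

```python
# Deprecated tag -> Preferred-Value, copied from the records quoted above.
PREFERRED = {
    "zh-yue": "yue",
    "zh-guoyu": "cmn",
}


def preferred_tag(tag):
    """Return the preferred replacement for a deprecated tag, else the tag unchanged."""
    # Tags are case-insensitive, so look them up in lowercase.
    return PREFERRED.get(tag.lower(), tag)


for sample in ("zh-yue", "ZH-Guoyu", "zh-CN"):
    print(sample, "->", preferred_tag(sample))
```

    Tags that are not in the mapping pass through untouched, so the helper is safe to apply to every lang value.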

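    The United Nations has six official languages: Arabic, Chinese, English, French, Russian, and Spanish.

    Modern Standard Chinese includes Putonghua (普通话), Guoyu (国语), and Huayu (华语). For details see: http://zh.wikipedia.org/zh-cn/%E7%8F%BE%E4%BB%A3%E6%A8%99%E6%BA%96%E6%BC%A2%E8%AA%9E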

  • Original article: https://www.cnblogs.com/sink_cup/p/html401_lang_iso639_iso3166_iana_language_subtag.html