  • UTF-8 and Unicode

    Is it true that Unicode = UTF-16?

    UPDATE

    Many people say that Unicode is a standard, not an encoding, but most editors actually offer a "Save as Unicode" encoding option.

    As Rasmus states in his article "The difference between UTF-8 and Unicode?":

    If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

    Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

    UTF-8 is an encoding - Unicode is a character set

    A character set is a list of characters with unique numbers (these numbers are usually referred to as "code points"). For example, in the Unicode character set, the number for A is 65, which is 41 in hexadecimal and is conventionally written U+0041.
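
    As a small illustration of the character-to-number mapping, here is a sketch in Python (chosen only for illustration; the variable names are made up for this example). It uses the built-in `ord` and `chr` functions, which convert between a character and its Unicode code point:

```python
# Character -> code point (a plain integer).
code_point = ord("A")

# The same number in the notations people commonly use.
decimal_form = str(code_point)          # decimal
hex_form = hex(code_point)              # hexadecimal
unicode_form = f"U+{code_point:04X}"    # conventional Unicode notation

print(decimal_form, hex_form, unicode_form)

# Code point -> character.
character = chr(0x41)
print(character)
```

    Note that nothing here involves bytes yet: a code point is just a number, and how it is stored is a separate question.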

    An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this (each number below 128 becomes a single byte):

    00000001 00000010 00000011 00000100 

    Our data is now translated into binary and can be saved to disk.
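
    The encoding step can be sketched in Python as follows. The helper name `to_bits` is hypothetical, and the euro sign is an extra example that is not in the article; it is included only to show that UTF-8 uses more than one byte for code points above 127:

```python
def to_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{byte:08b}" for byte in data)

# The number sequence from the article, treated as code points.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)

# Encode the code points to bytes with UTF-8.
encoded = text.encode("utf-8")
small_bits = to_bits(encoded)
print(small_bits)

# A code point above 127 (U+20AC, the euro sign) needs several bytes.
euro_bits = to_bits("\u20ac".encode("utf-8"))
print(euro_bits)
```

    For small numbers the output is one byte per number, which is why the article's example looks like a direct copy of the values; for larger code points the byte sequence no longer resembles the number itself.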

    All together now

    Say an application reads the following from the disk:

    01101000 01100101 01101100 01101100 01101111 

    The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

    104 101 108 108 111 

    Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".
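
    The two steps described here (bytes to numbers, then numbers to characters) can be sketched in Python. The variable names are illustrative only, and in practice `bytes.decode` performs both steps at once; they are separated here just to mirror the explanation:

```python
# The bits read from disk, written out as text for readability.
raw_bits = "01101000 01100101 01101100 01101100 01101111"

# Rebuild the actual bytes from the bit strings.
data = bytes(int(bits, 2) for bits in raw_bits.split())

# Step 1: UTF-8 decoding turns the bytes into code points.
decoded = data.decode("utf-8")
code_points = [ord(ch) for ch in decoded]
print(code_points)

# Step 2: the Unicode character set maps each code point to a character.
message = "".join(chr(cp) for cp in code_points)
print(message)
```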

    Conclusion

    So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently give a short and precise answer:

    UTF-8 and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.

    @vikas...I wish I could upvote you 100 times...but thanks for explaining it very very clearly! – user547453 Dec 28 '12 at 19:04
        
    LOVELY! Thankyou... – OceanBlue Mar 31 '13 at 1:36
        
    Smashing indeed! – MalsR May 1 '13 at 22:56
    This is totally correct, and answers the question posed in the title. It does not however answer the actual question, which is based on a misrepresentation of Microsoft using Unicode to refer to UTF-16. – Mark Ransom Feb 13 '14 at 14:07
    Feel relaxed after finding this. Thanks vikas – Ramyavjr Mar 2 '14 at 14:56
              

    most editors support save as ‘Unicode’ encoding actually.

    This is an unfortunate misnaming perpetrated by Windows.

    Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).

    This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.
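
    To see why a fixed 16-bit unit per character is not enough, here is a small Python sketch. The emoji U+1F600 is an arbitrary example chosen for this illustration; any code point above U+FFFF behaves the same way. UTF-16 has to represent it with two 16-bit units (a surrogate pair):

```python
# A code point beyond the 16-bit range (above U+FFFF).
ch = "\U0001F600"
code_point = ord(ch)
fits_in_16_bits = code_point <= 0xFFFF
print(hex(code_point), fits_in_16_bits)

# UTF-16 needs two 16-bit units (four bytes) for it.
utf16_hex = ch.encode("utf-16-be").hex()
print(utf16_hex)

# UTF-8 also uses four bytes for this code point.
utf8_hex = ch.encode("utf-8").hex()
print(utf8_hex)
```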

    This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

    (Other editors that do encodings themselves, like Notepad++, don't have this problem.)
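
    A rough Python sketch of what these editor labels correspond to on disk. It assumes the editor writes a byte order mark (BOM) at the start of UTF-16 files, as Windows Notepad typically does; the helper name `hex_bytes` and the sample text are made up for this example:

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)

text = "hi"

# What a Windows editor typically calls "Unicode": UTF-16LE with a BOM.
unicode_le = hex_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

# What it calls "Unicode big-endian": UTF-16BE with a BOM.
unicode_be = hex_bytes(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))

# The same text in UTF-8 (no BOM assumed here).
utf8 = hex_bytes(text.encode("utf-8"))

print(unicode_le)
print(unicode_be)
print(utf8)
```

    All three byte sequences represent the same two Unicode code points; only the encoding differs, which is why labelling one of them "Unicode" is misleading.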

    If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
  • Original article: https://www.cnblogs.com/sddai/p/5934650.html