zoukankan      html  css  js  c++  java
  • Western Subregional of NEERC, Minsk, Wednesday, November 4, 2015 Problem K. UTF-8 Decoder 模拟题

    Problem K. UTF-8 Decoder

    题目连接:

    http://opentrains.snarknews.info/~ejudge/team.cgi?SID=c75360ed7f2c7022&all_runs=1&action=140

    Description

    UTF-8 is a character encoding capable of encoding all possible characters, or code points, in Unicode.
    Nowadays UTF-8 is the dominant character encoding for the World Wide Web, accounting for 85.1% of
    all Web pages in September 2015.
    Peter works in a large company as a software engineer and develops a new Internet search engine.
    Its crawler needs a UTF-8 decoder to parse Web pages and put them into index. Peter has already
    checked if there are any ready-made solutions available. He used his own search engine to look for opensource
    implementations on the Web and found nothing that satisfied him. Several huge libraries following
    ‘batteries included’ philosophy were rejected because they are too heavy and contain tons of code. Several
    small but relevant libraries didn’t get to the top of search results page because Peter’s search engine is
    not perfect at present. . . So Peter decided to invent the wheel and write his custom lightweight UTF-8
    decoder.
    Let’s define a code point as an integer from range [0, 2
    31). One code point is encoded into variable-length
    sequence of 8-bit units (bytes).
    The design of UTF-8 can be seen in this table (the x characters are replaced by the bits of the code point):
    One-byte codes are used only for the ASCII code point values 0 through 127. In this case the UTF-8 code
    has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that
    ASCII text is valid UTF-8.
    Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and
    one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while
    continuation bytes all have 10 in the high-order position. UTF-8 offers clear distinction between multibyte
    and single-byte characters. The high order bits of every byte determine the type of byte; single bytes
    (0xxxxxxx), leading bytes (11xxxxxx), and continuation bytes (10xxxxxx) do not share values.
    The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes
    in the sequence. The remaining bits of the encoding (the x bits in the above patterns) are used for the
    bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in
    the lead byte, lower-order bits in succeeding continuation bytes.
    The standard specifies that the correct encoding of a code point use only the minimum number of bytes
    required to hold the significant bits of the code point. Longer encodings are called overlong and are not
    valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between
    code points and their valid encodings, so that there is a unique valid encoding for each code point. This
    ensures that string comparisons and searches are well-defined.
    Modern real-life UTF-8 encoding contains more restrictions. For instance, RFC 3629 removed all 5-, 6-
    byte sequences and some 4-byte sequences in order to match the constraints of the UTF-16 character
    encoding. Peter wants his decoder to be flexible and to be able to decode as much texts as possible, that’s
    why Peter does not implement these additional restrictions.

    Input

    The first line of input contains an integer N (1 ≤ N ≤ 100 000). The second line contains the values of
    N bytes (in range between 0 and 255 each, inclusive) given in hexadecimal. A value consists of two hex
    digits. The symbols 0–9 represent digits zero to nine, and A, B, C, D, E, F represent digits ten to fifteen.
    Values of bytes are separated by single spaces.

    Output

    If the input sequence of N bytes can be decoded successfully into sequence of L code points, then in the
    first line print the number L and in the second line print L code point values (31-bit integers in usual
    decimal notation with no leading zeros) separated by spaces.
    If the input cannot be decoded, output a single line Epic Fail.

    Sample Input

    1
    24

    Sample Output

    1
    36

    Hint

    题意

    其实就是让你写一个UTF8译码器。

    大体上来说,0xxxxx表示这是一个单bit的,10xxxx表示这个是填补的,11xxx0xxxx这个表示后面有多少个bit

    然后把所有的数转成二进制就好了。

    你还得判断是否非法。

    然后这个数必须得使用最简单的表示方法才行。

    题解:

    模拟题,把题目讲的东西全部模拟一遍就好了……

    代码

     #include<bits/stdc++.h>
    using namespace std;
    
    string s[100005];
    string tmp;
    vector<long long>ans;
    string get(char c){
        if(c=='0')return "0000";
        if(c=='1')return "0001";
        if(c=='2')return "0010";
        if(c=='3')return "0011";
        if(c=='4')return "0100";
        if(c=='5')return "0101";
        if(c=='6')return "0110";
        if(c=='7')return "0111";
        if(c=='8')return "1000";
        if(c=='9')return "1001";
        if(c=='A')return "1010";
        if(c=='B')return "1011";
        if(c=='C')return "1100";
        if(c=='D')return "1101";
        if(c=='E')return "1110";
        if(c=='F')return "1111";
    }
    int main(){
        int n;
        scanf("%d",&n);
        for(int i=0;i<n;i++){
            cin>>tmp;
            s[i]+=get(tmp[0]);
            s[i]+=get(tmp[1]);
        }
        string now;
    
        for(int i=0;i<n;i++){
            if(s[i][0]=='1'&&s[i][1]=='0'){
                printf("Epic Fail");
                return 0;
            }
            int j;
            for(j=0;j<s[i].size();j++)
                if(s[i][j]=='0')break;
            if(j==8||j==7){
                printf("Epic Fail");
                return 0;
            }
            for(int t=j+1;t<s[i].size();t++)
                now+=s[i][t];
            if(n<i+j){
                printf("Epic Fail");
                return 0;
            }
            for(int t=i+1;t<i+j;t++){
                if(s[t][0]!='1'||s[t][1]!='0'){
                    printf("Epic Fail");
                    return 0;
                }
                for(int k=2;k<s[t].size();k++)
                    now+=s[t][k];
            }
            if(j!=0)i=i+j-1;
            reverse(now.begin(),now.end());
            int k = 0;
            for(int t=0;t<now.size();t++)
                if(now[t]=='1')k=t;
            if(now.size()==11&&k<7){
                printf("Epic Fail");
                return 0;
            }
            if(now.size()==16&&k<11){
                printf("Epic Fail");
                return 0;
            }
            if(now.size()==21&&k<16){
                printf("Epic Fail");
                return 0;
            }
            if(now.size()==26&&k<21){
                printf("Epic Fail");
                return 0;
            }
            if(now.size()==31&&k<26){
                printf("Epic Fail");
                return 0;
            }
            long long tmp = 1;
            long long Ans = 0;
            for(int t=0;t<now.size();t++){
                if(now[t]=='1')Ans+=tmp;
                tmp*=2;
            }
            ans.push_back(Ans);
            now="";
        }
        cout<<ans.size()<<endl;
        for(int i=0;i<ans.size();i++)
            cout<<ans[i]<<" ";
        cout<<endl;
    }
  • 相关阅读:
    STM32的DMA
    STM32串口接收不定长数据原理与源程序(转)
    推挽与开漏
    开关量输入检测与输出的电路设计(转)
    理解一下单片机的I2C和SPI通信
    电阻桥的作用(转)
    为什么工业上用4到20毫安电流传输数据(转)
    Keil的标题“礦ision3" 的改变(转)
    epplus动态合并列数据
    npm脚本编译代码
  • 原文地址:https://www.cnblogs.com/qscqesze/p/5767085.html
Copyright © 2011-2022 走看看