  • UTF-8 Invalid Byte Sequences

    Chances are, some of you have run into the “invalid byte sequence in UTF-8” error when dealing with user-submitted data. A Google search shows that my hunch isn’t off.

    Among the search results are plenty of answers—some using the deprecated iconv library—that might lead you to a sufficient fix. However, among the slew of queries are few answers on how to reliably replicate and test the issue.

    In developing the Griddler gem we ran into some cases where the data being posted back to our controller had invalid UTF-8 bytes. For Griddler, our failing case needs to simulate an email body that is encoded as UTF-8 but contains an invalid byte.

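    What are valid and invalid bytes? This table on Wikipedia tells us bytes 192, 193, and 245-255 are off limits. In Ruby’s string literals we can produce an invalid byte with an escape. Note that \255 is an octal escape, so it yields byte 173 (0xAD) rather than byte 255; that is a continuation byte, which is just as invalid when no lead byte precedes it:

    > "hi \255"
     => "hi \xAD"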
    

    There’s our string with the invalid byte! How do we know for sure? In that IRB session we can simulate a comparable issue by sending the string a message it won’t like, such as split or gsub.

    > "hi \255".split(' ')
    ArgumentError: invalid byte sequence in UTF-8
      from (irb):9:in `split'
      from (irb):9
      from /Users/joel/.rvm/rubies/ruby-1.9.3-p125/bin/irb:16:in `<main>'
    

    Yup. It certainly does not like that.
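
    As a rough check of the claims so far, the sketch below inspects the byte that the escape produces and confirms that a regex-based method refuses the string. It assumes Ruby 2.0 or later, where source files default to UTF-8; the constant and variable names are made up for illustration.

```ruby
# Bytes that may never appear in well-formed UTF-8: 192, 193 and 245-255.
NEVER_VALID = [192, 193, *(245..255)].freeze

# "\255" is an octal escape, so the last byte is 0o255 == 173 == 0xAD.
suspect = "hi \255".dup.force_encoding('UTF-8')
last_byte = suspect.bytes.last

# 0xAD is a continuation byte; with no lead byte before it the string is invalid.
valid = suspect.valid_encoding?

# Each byte from the table is rejected on its own as well.
all_rejected = NEVER_VALID.none? do |byte|
  [byte].pack('C').force_encoding('UTF-8').valid_encoding?
end

# A regex-based method raises instead of silently working on bad data.
raised = begin
  suspect.gsub(/hi/, 'bye')
  false
rescue ArgumentError
  true
end

puts last_byte, valid, all_rejected, raised
```

    Calling valid_encoding? first is a cheap way to detect the problem without relying on an exception.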

    Let’s create a very real-world, enterprise-level, business-critical test case:

    invalid_byte_spec.rb

    require 'rspec'
    
    def replace_name(body, name)
      body.gsub(/joel/, name)
    end
    
    describe 'replace_name' do
      it 'removes my name' do
        body = "hello joel"
    
        replace_name(body, 'hank').should eq "hello hank"
      end
    
      it 'clears out invalid UTF-8 bytes' do
        body = "hello joel\255"
    
        replace_name(body, 'hank').should eq "hello hank"
      end
    end
    

    The first test passes as expected, and the second will fail as expected but not with the error we want. By adding that extra byte we should see an exception raised similar to what we simulated in IRB. Instead it’s failing in the comparison with the expected value.

    1) replace_name clears out invalid UTF-8 bytes
       Failure/Error: replace_name(body, 'hank').should eq "hello hank"
    
         expected: "hello hank"
              got: "hello hank\xAD"
    
         (compared using ==)
       # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'
    

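    Why isn’t it failing properly? If we pry into our running test we find out that inside our file the strings being passed around are encoded as ASCII-8BIT instead of UTF-8. On Ruby 1.9 a source file without a magic encoding comment is read as US-ASCII, and a literal containing a non-ASCII byte escape then becomes ASCII-8BIT, whereas IRB takes its encoding from the (here UTF-8) locale.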

    [2] pry(#<RSpec::Core::ExampleGroup::Nested_1>)> body.encoding
    => #<Encoding:ASCII-8BIT>
    

    As a result we’ll have to force that string’s encoding to UTF-8:

    it 'clears out invalid UTF-8 bytes' do
      body = "hello joel\255".force_encoding('UTF-8')
    
      replace_name(body, 'hank').should_not raise_error(ArgumentError)
      replace_name(body, 'hank').should eq "hello hank"
    end
    

    By running the test now we will see our desired exception, raised inside replace_name before any matcher gets to run:

    1) replace_name clears out invalid UTF-8 bytes
       Failure/Error: body.gsub(/joel/, name)
       ArgumentError:
         invalid byte sequence in UTF-8
       # ./invalid_byte_spec.rb:4:in `gsub'
       # ./invalid_byte_spec.rb:4:in `replace_name'
       # ./invalid_byte_spec.rb:17:in `block (2 levels) in <top (required)>'
    
    Finished in 0.00426 seconds
    2 examples, 1 failure
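
    The role the encoding tag plays can also be seen outside RSpec. Here is a small sketch that runs the same bytes through gsub under both encodings. It assumes Ruby 2.0 or later, where literals in a file default to UTF-8, so String#b is used to get the ASCII-8BIT copy that the Ruby 1.9 spec file produced; the variable names are illustrative.

```ruby
raw = "hello joel\255"

# Tagged as ASCII-8BIT (binary), every byte is acceptable, so gsub succeeds
# and the stray byte simply rides along in the result.
binary_result = raw.b.gsub(/joel/, 'hank')

# Tagged as UTF-8, the same bytes are malformed and gsub refuses them.
utf8_error = begin
  raw.dup.force_encoding('UTF-8').gsub(/joel/, 'hank')
  nil
rescue ArgumentError => e
  e.class
end

puts binary_result.inspect
puts utf8_error.inspect
```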
    

    Now that we’re comfortably in the red part of red/green/refactor we can move on to getting this passing by updating our replace_name method.

    def replace_name(body, name)
      body
        .encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
        .gsub(/joel/, name)
    end
    

    And the test?

    Finished in 0.04252 seconds
    2 examples, 0 failures
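
    One caveat worth checking: declaring the source as 'binary' treats every byte above 127 as undefined, so legitimate multibyte characters are dropped along with the bad byte. The sketch below compares that call with String#scrub, which is available from Ruby 2.1. Swapping in scrub is a suggestion here, not something the fix above uses, and the sample string is made up.

```ruby
# "caf\u00E9 joel" is valid UTF-8; append the same stray 0xAD byte to it.
body = ("caf\u00E9 joel".b << "\xAD".b).force_encoding('UTF-8')

# The approach from replace_name: reinterpret as binary, drop anything non-ASCII.
via_binary = body.encode('UTF-8', 'binary',
                         invalid: :replace, undef: :replace, replace: '')

# String#scrub removes only the malformed bytes and keeps valid characters.
via_scrub = body.scrub('')

puts via_binary   # the accented character is gone too
puts via_scrub    # only the invalid byte was removed
```

    For ASCII-only input such as the spec's "hello joel" both calls give the same result, so the spec alone would not reveal the difference.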
    

    For such a small piece of code we admittedly had to jump through some hoops. Through that process, however, we learned a bit about character encoding and how to put ourselves in the right position—through the red/green/refactor cycle—to fix bugs we will undoubtedly run into while writing software.

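    # encoding: utf-8
    require 'json'

    f = "dsp-cpi"
    # Round-trip each line through UTF-16BE so that invalid bytes become "?".
    File.foreach(f) do |line|
      line = line.encode("UTF-16BE", invalid: :replace, replace: "?").encode("UTF-8")
      # line is now valid UTF-8 and safe to process further
    end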
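
    The snippet above depends on a local file named dsp-cpi. A self-contained variant of the same UTF-16 round trip, using an in-memory StringIO in place of the file (the sample data and names are made up for illustration), might look like this:

```ruby
require 'stringio'

# Two lines of sample input; the second carries a stray 0xAD byte.
input = StringIO.new("first line\nsecond\xAD line\n")

# Encoding to UTF-16BE replaces the invalid byte with "?"; encoding back
# yields a clean UTF-8 string for each line.
cleaned = input.each_line.map do |line|
  line.encode('UTF-16BE', invalid: :replace, replace: '?').encode('UTF-8')
end

puts cleaned
```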
  • Original article: https://www.cnblogs.com/lavin/p/8150106.html