  • [Reference] IBM: sun.io.MalformedInputException and text encoding conversions transform numerals to their word equivalents


    Problem(Abstract)

    When WebSphere Application Server converts the contents of a file or string, numbers may be converted to their word equivalents, especially when PDFBox is used to extract text. sun.io.MalformedInputException errors may also be reported.

    Symptom

    Text extracted from UTF-8 sources, such as PDFs, is displayed incorrectly.

    For example, when "123 Hello Motto" is extracted from a PDF, the text "onetwothreespaceHellospaceMottospace" is output.



    Diagnosing the problem

    When extracting text from a source that uses UTF-8, you may find that numbers and non-alphabetic characters are transformed into their word equivalents. The transformation has been seen on Linux; however, MalformedInputExceptions are likely to be seen on other operating systems as well.

     

    A stand-alone test case is available to confirm whether the text transformations are occurring. Unzip the contents of PDF_Test_Case.zip into a temporary location and run the following command with the Java™ executable that is bundled with your WebSphere Application Server.

    [JAVA_EXECUTABLE] -jar pdfproblem.jar "123_Hello_Motto.pdf"


    PDF_Test_Case.zip

    If the test fails, you will see output similar to the following:
    onetwothreespaceHellospaceMottospace

    Resolving the problem

    These transformation issues occur because the IBM SDK uses Java IO for text and font conversion. The solution is to force the Java Virtual Machine (JVM) to use the Java NIO libraries for extracting text. Add this JVM argument to resolve the problem:


    -Dibm.stream.nio=true
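
    To confirm that the argument actually reached the JVM, a small check like the one below can help. It is only a sketch: the class name NioFlagCheck and the helper isNioForced are hypothetical, and the code merely reads the system property back. How the IBM SDK itself parses the value is not documented here, so treating only "true" (case-insensitive) as enabled is an assumption.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of WebSphere or the IBM SDK.
final class NioFlagCheck {

    // Assumption: only the literal "true" (any case) counts as enabled.
    // Boolean.parseBoolean returns false for null and for any other text.
    static boolean isNioForced(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        // The property is null when -Dibm.stream.nio was not passed.
        String value = System.getProperty("ibm.stream.nio");
        System.out.println("ibm.stream.nio  = " + value);
        System.out.println("NIO forced      = " + isNioForced(value));
        // Related settings that influence text conversion.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}
```

    Run it with the same Java executable and the same JVM arguments as the application server, so that the reported values match what the server sees.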


    I am getting a MalformedInputException. How can I resolve this?

    This exception does not alter the resulting string, which is output after the exception. Java IO is designed to throw exceptions when errors are encountered. After switching to NIO, these exceptions are caught and no longer reported to the log.

    You can resolve these errors by forcing NIO, but there is an alternative. Check the environment variable LANG to see if it is set to a UTF-8 locale. It may read something like this:

    # echo $LANG
    en_US.UTF-8

    Alter the variable by removing the .UTF-8 suffix from the end of the string. From the command prompt on UNIX and Linux, you can type the following:

    # export LANG=en_US

    Alternatively, you can add this environment variable from the administration console. 
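
    The same check can be made from inside the JVM, which shows the value the server process actually inherited rather than the value in an interactive shell. The following is a minimal sketch; the class LangCheck and the method stripCodeset are hypothetical names, and the handling of an "@modifier" suffix is an assumption that goes beyond what the technote describes.

```java
// Hypothetical helper class for inspecting the LANG environment variable.
final class LangCheck {

    // Removes the codeset part (for example ".UTF-8") from a locale string.
    // A trailing "@modifier", if present, is kept. Returns the input
    // unchanged when it is null or contains no codeset.
    static String stripCodeset(String lang) {
        if (lang == null) {
            return null;
        }
        int dot = lang.indexOf('.');
        if (dot < 0) {
            return lang;
        }
        int at = lang.indexOf('@', dot);
        return at < 0 ? lang.substring(0, dot)
                      : lang.substring(0, dot) + lang.substring(at);
    }

    public static void main(String[] args) {
        // LANG as seen by this JVM process; null when it is not set.
        String lang = System.getenv("LANG");
        System.out.println("LANG           = " + lang);
        System.out.println("without codeset = " + stripCodeset(lang));
    }
}
```

    If the two printed values differ, the process is running with a codeset suffix and the export command above applies.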

    MalformedInputException may also occur when running your application on WebSphere Application Server; in that case it is written to standard error.


     


    Why is Java IO used for converting text?
    The IBM SDK retains Java IO, rather than NIO (New IO), for performance reasons. By design, Java IO throws exceptions when errors are encountered, such as the MalformedInputException error, while NIO does not.

    The JVM can be forced to use NIO by adding the JVM argument given above (-Dibm.stream.nio=true).
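
    The difference between reporting bad input and silently tolerating it can be illustrated with the public java.nio.charset API. This is only an analogy under stated assumptions: it does not reproduce the internal sun.io converters or the IBM SDK's switch between them, and the class MalformedInputDemo and its method names are hypothetical. The sample bytes are deliberately invalid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not IBM SDK code.
final class MalformedInputDemo {

    // 0xC3 opens a two-byte UTF-8 sequence, but 0x28 ('(') is not a
    // valid continuation byte, so the input is malformed.
    static final byte[] BAD_UTF8 = {(byte) 0xC3, (byte) 0x28};

    // Lenient decoding: the String constructor substitutes the
    // replacement character U+FFFD for malformed input and never throws.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: a decoder configured with REPORT throws a
    // CharacterCodingException when it meets malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        System.out.println("lenient length: " + decodeLenient(BAD_UTF8).length());
        try {
            decodeStrict(BAD_UTF8);
            System.out.println("strict: decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict: " + e.getClass().getName());
        }
    }
}
```

    In both cases the rest of the text is still decoded; the policies differ only in whether the caller hears about the bad bytes, which matches the observation above that the exception does not alter the resulting string.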



    Does the Oracle JDK suffer similar problems?
    Since the Oracle JDK uses NIO by default, this issue does not occur when running WebSphere Application Server on Solaris and HP-UX.
  • Original article: https://www.cnblogs.com/zhangxsh/p/3494510.html