
流星永恒的博客

Notes on JSF, Facelets, Rich(Prime)Faces, and Java

 
 
 


 
 

Java: processing UTF-8 with BOM (text files)

2010-03-11 11:37:51 | Category: java

    
Java bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6378911
Replies on the bug:
*** (#1 of 1): xxxxx@xxxxx

The UTF8 charset has been updated to recognize the sequence EF BB BF as a "BOM" as specified at http://www.unicode.org/faq/utf_bom.html#BOM, so this UTF-8 signature is skipped during decoding if it appears at the beginning of the input stream. See #4508058. So the assumption of "parsed back into 0xFEFF" no longer stands; suggest the test case get updated accordingly.
Posted Date: 2006-01-31 04:55:02.0

I am reopening this bug because it is breaking backwards compatibility. The JSP container in the Java EE 5 RI and SJSAS 9.0 has been relying on detecting a BOM, setting the appropriate encoding, and discarding the BOM bytes before reading the input. The purpose of the test program I provided with the bug report was to demonstrate the issue; it is not just a matter of changing the test program to make things work.

This has worked up until JDK 1.6, and I expect it to continue to work with that JDK release.

If you want to support the new functionality of automatically detecting and discarding a BOM, this should be enabled with a flag, but not by default. We cannot have our container implement one behaviour when running on JDK 1.5, and a different behaviour when running on JDK 1.6.

Just curious, have the following encodings been updated as well: UTF-32 BE, UTF-32 LE, UTF-16 BE, UTF-16 LE?
Posted Date: 2006-01-31 17:25:41.0

It sounds like the customer code is doing "encoding detection". But specifying UTF-8 is already choosing an encoding...

If you really want to examine the input to auto-detect encodings (in general, an impossible task), then read the first few bytes from the input stream directly *as bytes*, and compare them to the various encodings of the BOM.

If a BOM is detected, deduce the encoding, discard the BOM, and read the rest of the input using the detected encoding. If you don't find something that looks like a BOM, guess and pray. Auto-detecting encodings is never reliable; it is only heuristics.

The change to UTF-8 is incompatible, but a strong case can be made that UTF-8 is specified by a standard, and so the change was simply a bug fix. What do other implementations of UTF-8 do?
Posted Date: 2006-01-31 17:50:12.0

The "encoding autodetection" approach mentioned above is exactly what we have been doing.

However, after having deduced an encoding from the BOM bytes (if present), we need to reset the input stream so we can later pass it to the javax.xml.parsers.SAXParser (for JSP pages in XML syntax) or our "hand-written" parser (for JSP pages in classic syntax).

This means that in the classic-syntax case, we must discard the BOM manually (in the XML-syntax case the javax.xml.parsers.SAXParser takes care of this) when running against a JRE with a version < 1.6, but must rely on the decoder to do this for us as of JRE 1.6. The problem is that we cannot implement different behaviour depending on the JRE version we're running against.

Also, do these encodings also discard a BOM as of JDK 1.6: UTF-32 BE, UTF-32 LE, UTF-16 BE, UTF-16 LE, or was the change made to UTF-8 only?
Posted Date: 2006-01-31 18:17:15.0

I don't understand why the technique that works with 1.6, namely "discarding the BOM manually", doesn't also work with 1.5. If you consistently pass a BOM-free stream to the decoder, the behavior will be unchanged.
Posted Date: 2006-01-31 18:35:44.0

Sorry if I wasn't clear: we currently detect the encoding of a JSP file from its BOM, reset the input stream, and pass the input stream to the appropriate parser, based on the JSP file's syntax. Notice that the SAXParser (invoked for XML syntax) would choke on a BOM-free stream. If the JSP page is in classic syntax, we remember whether a BOM was present by setting a flag, set the stream's encoding to that derived from the BOM, and have our parser read and parse the JSP page from the stream. If the BOM flag is set, the parser knows to discard the first char. With JDK 1.6, this approach no longer works, because the decoder will already have discarded the BOM, so our parser will look at the wrong char.

Also, I never got an answer whether the automatic BOM stripping is now also done in the case of UTF-32 and UTF-16, or just UTF-8.
Posted Date: 2006-01-31 20:51:47.0

The update we made to recognize the BOM in UTF-8 (4508058) is correct according to the Unicode Standard. Our assumption was that our change should rarely break any real-world applications, because that would imply they were not following the Unicode Standard.

Unfortunately, we have found a common application where our assumption was incorrect. We will back out the changes associated with 4508058, thus reverting to our previous behaviour of ignoring the BOM for UTF-8.

No changes were ever made to BOM handling for UTF-16 or UTF-32, as these multi-byte encodings require BOM processing.
Posted Date: 2006-02-06 21:01:05.0
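The raw-byte approach the Sun engineer recommends above (read the first few bytes *as bytes* and compare them to the known BOM sequences) can be sketched as follows. This is an illustrative sketch, not code from the bug report: `BomSniffer` and `detectBom` are made-up names, and UTF-32 is omitted for brevity (if you add it, test the longer FF FE 00 00 sequence before the two-byte FF FE).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Returns the charset name implied by a leading BOM, or null if none.
    // On return the stream is positioned just past the BOM (or at the very
    // start if no BOM was found), so the caller always reads BOM-free bytes.
    static String detectBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[4];
        int n = in.read(head, 0, head.length);
        String enc = null;
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            enc = "UTF-8";    bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            enc = "UTF-16BE"; bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            enc = "UTF-16LE"; bomLen = 2;
        }
        // Push back everything that was not part of a BOM.
        if (n > bomLen) in.unread(head, bomLen, n - bomLen);
        return enc;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(data), 4);
        System.out.println(detectBom(in));                            // UTF-8
        System.out.println("" + (char) in.read() + (char) in.read()); // hi
    }
}
```

Because the comparison happens before any `Reader` or decoder is involved, this behaves identically on JDK 1.5 and 1.6, which is exactly the version-independence the JSP container maintainer was asking for.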
Workaround:
The application has to handle this itself: check whether the file starts with a BOM and, if it does, strip it before reading.
/***
 * version: 1.1 / 2007-01-25
 * - changed BOM recognition ordering (longer BOMs first)
 *
 * Original pseudocode   : Thomas Weidenfeller
 * Implementation tweaked: Aki Nieminen
 *
 * http://www.unicode.org/unicode/faq/utf_bom.html
 * BOMs in byte-length ordering:
 *   00 00 FE FF = UTF-32, big-endian
 *   FF FE 00 00 = UTF-32, little-endian
 *   EF BB BF    = UTF-8
 *   FE FF       = UTF-16, big-endian
 *   FF FE       = UTF-16, little-endian
 *
 * Win2k Notepad:
 *   Unicode format = UTF-16LE
 ***/

import java.io.*;

/**
 * This InputStream recognizes Unicode BOM marks and skips the BOM bytes
 * if the getEncoding() method is called before any of the read(...) methods.
 *
 * Usage pattern:
 *   String enc = "ISO-8859-1"; // or null to use the system default
 *   FileInputStream fis = new FileInputStream(file);
 *   UnicodeInputStream uin = new UnicodeInputStream(fis, enc);
 *   enc = uin.getEncoding(); // check and skip possible BOM bytes
 *   InputStreamReader in;
 *   if (enc == null) in = new InputStreamReader(uin);
 *   else in = new InputStreamReader(uin, enc);
 */
public class UnicodeInputStream extends InputStream {
   PushbackInputStream internalIn;
   boolean             isInited = false;
   String              defaultEnc;
   String              encoding;

   private static final int BOM_SIZE = 4;

   UnicodeInputStream(InputStream in, String defaultEnc) {
      internalIn = new PushbackInputStream(in, BOM_SIZE);
      this.defaultEnc = defaultEnc;
   }

   public String getDefaultEncoding() {
      return defaultEnc;
   }

   public String getEncoding() {
      if (!isInited) {
         try {
            init();
         } catch (IOException ex) {
            IllegalStateException ise = new IllegalStateException("Init method failed.");
            ise.initCause(ex); // fixed: the original passed ise to itself
            throw ise;
         }
      }
      return encoding;
   }

   /**
    * Read ahead four bytes and check for BOM marks. Extra bytes are
    * unread back to the stream; only BOM bytes are skipped.
    */
   protected void init() throws IOException {
      if (isInited) return;

      byte bom[] = new byte[BOM_SIZE];
      int n, unread;
      n = internalIn.read(bom, 0, bom.length);

      if ( (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
           (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
         encoding = "UTF-32BE";
         unread = n - 4;
      } else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                  (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
         encoding = "UTF-32LE";
         unread = n - 4;
      } else if ( (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                  (bom[2] == (byte)0xBF) ) {
         encoding = "UTF-8";
         unread = n - 3;
      } else if ( (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
         encoding = "UTF-16BE";
         unread = n - 2;
      } else if ( (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
         encoding = "UTF-16LE";
         unread = n - 2;
      } else {
         // Unicode BOM mark not found, unread all bytes
         encoding = defaultEnc;
         unread = n;
      }
      //System.out.println("read=" + n + ", unread=" + unread);
      if (unread > 0) internalIn.unread(bom, (n - unread), unread);

      isInited = true;
   }

   public void close() throws IOException {
      //init();
      isInited = true;
      internalIn.close();
   }

   public int read() throws IOException {
      //init();
      isInited = true;
      return internalIn.read();
   }
}
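Since the JDK's UTF-8 decoder was reverted to leave the BOM alone, an alternative to wrapping the byte stream is to decode first and then drop a leading U+FEFF character. This is a sketch under that assumption, not part of the original post; `utf8Reader` is an illustrative name.

```java
import java.io.*;

public class BomStrippingReader {
    // Wrap a UTF-8 byte stream in a Reader, consuming a leading U+FEFF
    // (the decoded form of the EF BB BF signature) if one is present.
    static Reader utf8Reader(InputStream in) throws IOException {
        PushbackReader r =
            new PushbackReader(new InputStreamReader(in, "UTF-8"), 1);
        int first = r.read();
        if (first != -1 && first != 0xFEFF) r.unread(first); // not a BOM: keep it
        return r;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        BufferedReader br = new BufferedReader(
            utf8Reader(new ByteArrayInputStream(withBom)));
        System.out.println(br.readLine()); // prints "ok"
    }
}
```

The trade-off versus UnicodeInputStream above: this variant is much shorter, but it only handles UTF-8 (the encoding is fixed before decoding), whereas the stream-level approach can also deduce UTF-16/UTF-32 from the BOM.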

 
 
 
 
 
 
 
 
 
 
 
 
 
 
