When parsing XML files, you may run into exceptions like the following:

    org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x{2}) was found in the value of attribute "{1}" and element is "1f".

Or:

    An invalid XML character (Unicode: 0x1d) was found in the CDATA section.
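These errors are caused by invisible special characters that are illegal in an XML document, so the XML parser throws an exception when it meets them. Among the control characters below 0x20, the XML 1.0 specification allows only tab (0x09), line feed (0x0a) and carriage return (0x0d), which leaves three invalid ranges: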
- 0x00 – 0x08
- 0x0b – 0x0c
- 0x0e – 0x1f
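The fix is to filter these illegal characters out of the string before parsing. Note that replaceAll returns a new string and does not modify the original:

    string.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "")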
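The regular expression above only covers the control characters below 0x20. As a rough sketch of a stricter alternative, the filter below walks the string by code point and keeps only what the XML 1.0 `Char` production allows, so it also drops 0xFFFE, 0xFFFF and unpaired surrogates. The class and method names (`XmlCharFilter`, `isValidXmlChar`, `stripInvalidXmlChars`) are made up for this illustration and are not part of any library.

```java
final class XmlCharFilter {

    // XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns a copy of the input without invalid characters.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which isValidXmlChar rejects.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same test document as in this post, with an invisible 0x1F in attribute n.
        String dirty = "<r><c d=\"s\" n=\"j\u001f\"></c></r>";
        String clean = stripInvalidXmlChars(dirty);
        System.out.println(dirty.length() + " -> " + clean.length());
    }
}
```

For input that can only contain ASCII control characters, the one-line replaceAll is enough; the code-point loop is worth considering when the data comes from sources that may also contain broken surrogate pairs.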
Test code: TestXmlInvalidChar.java

    package michael.xml;

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    /**
     * @author michael
     */
    public class TestXmlInvalidChar {

        public static void main(String[] args) {
            // The test string should be: <r><c d="s" n="j"></c></r>
            // The byte array of the well-formed document
            byte[] ba1 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                    34, 32, 110, 61, 34, 106, 34, 62, 60, 47, 99, 62, 60, 47,
                    114, 62 };
            System.out.println("ba1 length=" + ba1.length);
            // Decode with an explicit charset instead of the platform default
            String ba1str = new String(ba1, StandardCharsets.UTF_8);
            System.out.println(ba1str);
            System.out.println("ba1str length=" + ba1str.length());
            System.out.println("-----------------------------------------");

            // Same as ba1, plus one extra invisible byte: 31 (0x1F)
            byte[] ba2 = new byte[] { 60, 114, 62, 60, 99, 32, 100, 61, 34, 115,
                    34, 32, 110, 61, 34, 106, 31, 34, 62, 60, 47, 99, 62, 60,
                    47, 114, 62 };
            System.out.println("ba2 length=" + ba2.length);
            String ba2str = new String(ba2, StandardCharsets.UTF_8);
            System.out.println(ba2str);
            System.out.println("ba2str length=" + ba2str.length());
            System.out.println("-----------------------------------------");

            try {
                DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
                dbfactory.setIgnoringComments(true);
                DocumentBuilder docBuilder = dbfactory.newDocumentBuilder();

                // Strip the illegal invisible characters; without this step
                // the XML parser throws an exception
                String filter = ba2str.replaceAll(
                        "[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
                System.out.println("length after filtering=" + filter.length());

                ByteArrayInputStream bais = new ByteArrayInputStream(
                        filter.getBytes(StandardCharsets.UTF_8));
                Document doc = docBuilder.parse(bais);
                Element rootEl = doc.getDocumentElement();
                System.out.println("parsed OK after filtering, root child length="
                        + rootEl.getChildNodes().getLength());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Running the test code produces the following output:

    ba1 length=26
    <r><c d="s" n="j"></c></r>
    ba1str length=26
    -----------------------------------------
    ba2 length=27
    <r><c d="s" n="j"></c></r>
    ba2str length=27
    -----------------------------------------
    length after filtering=26
    parsed OK after filtering, root child length=1
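Comparing the two cases, the byte arrays and the strings differ in length (26 versus 27), yet they look identical when printed to the console, because the extra 0x1f character is invisible. After filtering, the string length drops from 27 back to 26.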
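Because the offending character does not show up on the console, it can help to print the string with control characters escaped when tracking down such a problem. The following is a minimal sketch of that idea; `ControlCharEscaper` and `escapeControlChars` are hypothetical names chosen for this example.

```java
final class ControlCharEscaper {

    // Replaces each control character below 0x20 (except tab, LF and CR)
    // with a visible backslash-u escape so it shows up in console output.
    static String escapeControlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The attribute value from the test document, with the hidden 0x1F.
        String attr = "n=\"j\u001f\"";
        System.out.println(escapeControlChars(attr));
    }
}
```

Run on the attribute from the test document, this prints the value with a visible `\u001f` after the `j`, which makes the difference between the 26-character and 27-character strings obvious.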
Original article. When reposting, please credit: micmiu – Software Development + Bits of Life [ http://www.micmiu.com/ ]
Permalink: http://www.micmiu.com/exception/invalid-xml-character/