  • Handling Large XML Files and Large Nodes (System.Xml.XmlTextReader)

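    Recently I had a task that required processing a large XML file containing one very large node that stores Base64 data (over 90 MB, at a known path).

    A job like this really calls for XmlReader, and even so the large node gave me a headache for a while.

    When I first searched MSDN I found ReadChars(), which can be used to deal with large nodes.

    Method documentation: https://msdn.microsoft.com/zh-cn/library/system.xml.xmltextreader.readchars(v=vs.110).aspx

    The sample there uses it like this: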

    while(0 != reader.ReadChars(buffer, 0, 1))
    {
        // Do something.
        // Attribute values are not available at this point.
    }

    This works fine for conventionally formatted XML, for example:

    <Root>
      <LeafNode>Value</LeafNode>
      <ParentNode>
        <LeafNode>Value</LeafNode>
      </ParentNode>
    </Root>

    However (nobody likes that word, but there it is), things go wrong with oddly formatted XML:

    <Root><LeafNode>Value</LeafNode><ParentNode>
    <LeafNode>Value</LeafNode></ParentNode>
    </Root>

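    With a layout like this, using the sample code to read the content of the first LeafNode will probably give you "ValueValue".

    And of course the input XML I had to handle was written in exactly this style (*sigh*).

    After single-stepping for a while, I found that in this situation XmlTextReader.Name changes to the name of the next node (and so does XmlTextReader.LocalName), so it can be used to tell whether the end of the node has been reached.

    The improved version: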

    string currentName = reader.LocalName;
    while(currentName == reader.LocalName && 0 != reader.ReadChars(buffer, 0, 1))
    {
        // Do something.
        // Attribute values are not available at this point.
    }
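
    The same check can be wrapped in a small helper that reads with a larger buffer instead of one character at a time. This is only a minimal sketch: ReadCharsSketch, CopyElementText and ReadFirstLeaf are hypothetical names made up for illustration, and it assumes the reader is positioned on the start tag of the element whose text should be copied.

```csharp
using System;
using System.IO;
using System.Xml;

static class ReadCharsSketch
{
    // Hypothetical helper: copies the text of the element the reader is
    // currently positioned on into 'output', one buffer at a time.
    public static void CopyElementText(XmlTextReader reader, TextWriter output)
    {
        string currentName = reader.LocalName;
        var buffer = new char[1024];
        int len;
        // Stop when the reader reports a different node name
        // or when ReadChars has nothing more to return.
        while (currentName == reader.LocalName &&
               (len = reader.ReadChars(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, len);
        }
    }

    // Hypothetical driver: returns the text of the first LeafNode element.
    public static string ReadFirstLeaf(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            var output = new StringWriter();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == "LeafNode")
                {
                    CopyElementText(reader, output);
                    break;
                }
            }
            return output.ToString();
        }
    }
}
```

    Note that a name comparison alone cannot tell the current element apart from an immediately following sibling with the same name, so this sketch shares that limitation.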

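    While I'm at it, here is some code that copies the file through to a new one while giving specific nodes special handling: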

    List<string> processNodePathList = new List<string> {"/Root/Path/to/Target"};
    List<string> bigNodePathList = new List<string> { "/Root/Path/to/Big/Node" }; 
    
    private static void ProcessBigXmlFile(string sourcePath, string targetPath, IList<string> processNodePathList, IList<string> bigNodePathList)
    {
        var processNodeNameList =
            processNodePathList.Select(
                processNodePath => processNodePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
                .Select(nodePathParts => nodePathParts[nodePathParts.Length - 1])
                .ToList();
        var bigNodeNameList = bigNodePathList.Select(
                bigNodePath => bigNodePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
                .Select(nodePathParts => nodePathParts[nodePathParts.Length - 1])
                .ToList();
    
        var sourceStream = new FileStream(sourcePath, FileMode.Open, FileAccess.Read);
        var reader = new XmlTextReader(sourceStream);
    
        var targetStream = new FileStream(targetPath, FileMode.Create, FileAccess.Write);
        var writer = new XmlTextWriter(targetStream, Encoding.UTF8);
    
        try
        {
            var pathStack = new Stack<string>();
            var readResult = reader.Read();
            while (readResult)
            {
                int skipMode = 0;
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        pathStack.Push(reader.Name);
                        writer.WriteStartElement(reader.LocalName);
                        if (reader.HasAttributes)
                        {
                            while (reader.MoveToNextAttribute())
                            {
                                writer.WriteAttributeString(reader.LocalName,
                                    reader.Value);
                            }
                            reader.MoveToElement();
                        }
    
                        if (processNodeNameList.Contains(reader.LocalName))
                        {
                            var index = processNodeNameList.IndexOf(reader.LocalName);
                            if (CompareNodePath(pathStack, processNodePathList[index]))
                            {
                                        
                                // Replace node content
    
                                writer.WriteFullEndElement();
                                skipMode = 1;
                            }
                        }
                        else if (bigNodeNameList.Contains(reader.LocalName))
                        {
                            var index = bigNodeNameList.IndexOf(reader.LocalName);
                            if (CompareNodePath(pathStack, bigNodePathList[index]))
                            {
                                reader.MoveToContent();
                                var buffer = new char[1024];
                                int len;
                                // Compare against the node name, not the full path string.
                                while (reader.LocalName == bigNodeNameList[index] &&
                                        (len = reader.ReadChars(buffer, 0, buffer.Length)) > 0)
                                {
                                    writer.WriteRaw(buffer, 0, len);
                                }
                                writer.WriteFullEndElement();
                                skipMode = 2;
                            }
                        }
                        // Only close empty elements that were not already closed above.
                        if (skipMode == 0 && reader.IsEmptyElement)
                        {
                            pathStack.Pop();
                            writer.WriteEndElement();
                        }
                        break;
                    }
                    //case XmlNodeType.Attribute:
                    //{
                    //    newPackageWriter.WriteAttributeString(oldPackageReader.LocalName, oldPackageReader.Value);
                    //    break;
                    //}
                    case XmlNodeType.Text:
                    {
                        writer.WriteValue(reader.Value);
                        break;
                    }
                    case XmlNodeType.CDATA:
                    {
                        writer.WriteCData(reader.Value);
                        break;
                    }
                    //case XmlNodeType.EntityReference:
                    //{
                    //    newPackageWriter.WriteEntityRef(oldPackageReader.Name);
                    //    break;
                    //}
                    //case XmlNodeType.Entity:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.ProcessingInstruction:
                    {
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    }
                    case XmlNodeType.Comment:
                    {
                        writer.WriteComment(reader.Value);
                        break;
                    }
                    //case XmlNodeType.Document:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.DocumentType:
                    {
                        writer.WriteRaw(string.Format("<!DOCTYPE {0} [{1}]>", reader.Name,
                            reader.Value));
                        break;
                    }
                    //case XmlNodeType.DocumentFragment:
                    //{
                    //    break;
                    //}
                    //case XmlNodeType.Notation:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.Whitespace:
                    {
                        writer.WriteWhitespace(reader.Value);
                        break;
                    }
                    //case XmlNodeType.SignificantWhitespace:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.EndElement:
                    {
                        pathStack.Pop();
                        writer.WriteFullEndElement();
                        break;
                    }
                    case XmlNodeType.XmlDeclaration:
                    {
                        writer.WriteStartDocument();
                        break;
                    }
                }
    
                switch (skipMode)
                {
                    case 1:
                    {
                        reader.Skip();
                        pathStack.Pop();
                        readResult = !reader.EOF;
                        break;
                    }
                    case 2:
                    {
                        pathStack.Pop();
                        readResult = !reader.EOF;
                        break;
                    }
                    default:
                    {
                        readResult = reader.Read();
                        break;
                    }
                }
            }
        }
        finally
        {
            writer.Close();
            targetStream.Close();
            targetStream.Dispose();
            reader.Close();
            sourceStream.Close();
            sourceStream.Dispose();
        }
    }
    
    private static bool CompareNodePath(Stack<string> currentNodePathStack, string compareNodePathString)
    {
        var currentArray = currentNodePathStack.Reverse().ToArray();
        var compareArray = compareNodePathString.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (compareArray.Length != currentArray.Length)
        {
            return false;
        }
        bool isDifferent = false;
        for (int i = 0; i < currentArray.Length; i++)
        {
            if (compareArray[i] != currentArray[i])
            {
                isDifferent = true;
                break;
            }
        }
        return !isDifferent;
    }
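
    The path comparison above depends on the fact that Stack<T> enumerates from the top (the most recently pushed item) down, which is why the stack is reversed before comparing. As a rough sketch of the same idea in a shorter form, assuming LINQ is available, the loop could be replaced with SequenceEqual. PathMatchSketch and MatchesPath are hypothetical names and not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PathMatchSketch
{
    // Hypothetical equivalent of CompareNodePath: true when the element
    // names on the stack, read root to leaf, equal the segments of 'path'.
    public static bool MatchesPath(Stack<string> pathStack, string path)
    {
        var expected = path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        // Stack<T> enumerates top-first, so reverse it to get
        // root-to-leaf order before comparing element by element.
        return pathStack.Reverse().SequenceEqual(expected);
    }
}
```

    SequenceEqual also returns false when the two sequences differ in length, so no separate length check is needed.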
  • Original post: https://www.cnblogs.com/Rabbitism/p/7161926.html