  • Handling large XML files and large nodes (System.Xml.XmlTextReader)

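    I recently had a task that required processing a large XML file containing one huge node of Base64 data (>90 MB, at a known path).

    A job like this pretty much requires XmlReader, and even so, handling the large node gave me a headache for a while...

    When I first searched MSDN, I found ReadChars(), which can be used to deal with large nodes.

    Method documentation: https://msdn.microsoft.com/zh-cn/library/system.xml.xmltextreader.readchars(v=vs.110).aspx

    The example there shows this usage: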

    while(0 != reader.ReadChars(buffer, 0, 1))
    {
        // Do something.
        // Attribute values are not available at this point.
    }

    This works fine for conventionally formatted XML, such as:

    <Root>
      <LeafNode>Value</LeafNode>
      <ParentNode>
        <LeafNode>Value</LeafNode>
      </ParentNode>
    </Root>

    But (nobody likes that word, yet here we are...), with oddly formatted XML things go wrong:

    <Root><LeafNode>Value</LeafNode><ParentNode>
    <LeafNode>Value</LeafNode></ParentNode>
    </Root>

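    With XML in this style, using the sample code to read the content of the first LeafNode would probably yield "ValueValue"...

    And of course the input XML I had was in exactly this style... (*sigh*)

    After single-stepping for a while, I found that in this situation XmlTextReader.Name changes to the name of the next node (as does XmlTextReader.LocalName), which can be used to tell whether the end of the node has been reached.

    The improved version: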

    string currentName = reader.LocalName;
    while(currentName == reader.LocalName && 0 != reader.ReadChars(buffer, 0, 1))
    {
        // Do something.
        // Attribute values are not available at this point.
    }
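
    Since the large node in this task holds Base64 data, a different route is worth noting if what you want is the decoded bytes rather than the raw characters: XmlReader.ReadElementContentAsBase64 decodes an element's content in chunks. The following is only a minimal sketch of that technique, not the approach used in this post; the class name Base64NodeSketch, the method CopyBase64Element, and the 3 KB buffer size are made up for illustration, and it uses XmlReader.Create instead of XmlTextReader.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class Base64NodeSketch
{
    // Streams the Base64 content of the first element named elementName
    // from xmlInput into output as decoded bytes.
    // Returns the number of bytes written, or -1 if the element is not found.
    public static long CopyBase64Element(Stream xmlInput, string elementName, Stream output)
    {
        using (var reader = XmlReader.Create(xmlInput))
        {
            // Move forward to the first element with the given name.
            if (!reader.ReadToFollowing(elementName))
            {
                return -1;
            }

            var buffer = new byte[3 * 1024]; // arbitrary chunk size (assumption)
            long total = 0;
            int read;

            // Each call decodes the next chunk; 0 means the element content is exhausted.
            while ((read = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, read);
                total += read;
            }
            return total;
        }
    }
}
```

    Called with a FileStream for the source file, the element name, and a FileStream for the output, this should keep memory use at roughly the buffer size regardless of how large the node is. Note that ReadToFollowing matches by name only, not by full path, so a real implementation would still need a path check.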

    While I'm at it, here is code that copies a file through while applying special handling to specific nodes:

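    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Xml;

    // Example arguments for ProcessBigXmlFile:
    List<string> processNodePathList = new List<string> { "/Root/Path/to/Target" };
    List<string> bigNodePathList = new List<string> { "/Root/Path/to/Big/Node" };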
    
    private static void ProcessBigXmlFile(string sourcePath, string targetPath, IList<string> processNodePathList, IList<string> bigNodePathList)
    {
        var processNodeNameList =
            processNodePathList.Select(
                processNodePath => processNodePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
                .Select(nodePathParts => nodePathParts[nodePathParts.Length - 1])
                .ToList();
        var bigNodeNameList = bigNodePathList.Select(
                bigNodePath => bigNodePath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries))
                .Select(nodePathParts => nodePathParts[nodePathParts.Length - 1])
                .ToList();
    
        var sourceStream = new FileStream(sourcePath, FileMode.Open, FileAccess.Read);
        var reader = new XmlTextReader(sourceStream);
    
        var targetStream = new FileStream(targetPath, FileMode.Create, FileAccess.Write);
        var writer = new XmlTextWriter(targetStream, Encoding.UTF8);
    
        try
        {
            var pathStack = new Stack<string>();
            var readResult = reader.Read();
            while (readResult)
            {
                int skipMode = 0;
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        pathStack.Push(reader.Name);
                        writer.WriteStartElement(reader.LocalName);
                        if (reader.HasAttributes)
                        {
                            while (reader.MoveToNextAttribute())
                            {
                                writer.WriteAttributeString(reader.LocalName,
                                    reader.Value);
                            }
                            reader.MoveToElement();
                        }
    
                        if (processNodeNameList.Contains(reader.LocalName))
                        {
                            var index = processNodeNameList.IndexOf(reader.LocalName);
                            if (CompareNodePath(pathStack, processNodePathList[index]))
                            {
                                        
                                // Replace node content
    
                                writer.WriteFullEndElement();
                                skipMode = 1;
                            }
                        }
                        else if (bigNodeNameList.Contains(reader.LocalName))
                        {
                            var index = bigNodeNameList.IndexOf(reader.LocalName);
                            if (CompareNodePath(pathStack, bigNodePathList[index]))
                            {
                                reader.MoveToContent();
                                var buffer = new char[1024];
                                int len;
                                // Compare against the node name, not the full path, or the loop never runs.
                                while (reader.LocalName == bigNodeNameList[index] &&
                                        (len = reader.ReadChars(buffer, 0, buffer.Length)) > 0)
                                {
                                    writer.WriteRaw(buffer, 0, len);
                                }
                                writer.WriteFullEndElement();
                                skipMode = 2;
                            }
                        }
                        // Skipped nodes are already closed above, and the reader may have moved on.
                        if (skipMode == 0 && reader.IsEmptyElement)
                        {
                            pathStack.Pop();
                            writer.WriteEndElement();
                        }
                        break;
                    }
                    //case XmlNodeType.Attribute:
                    //{
                    //    newPackageWriter.WriteAttributeString(oldPackageReader.LocalName, oldPackageReader.Value);
                    //    break;
                    //}
                    case XmlNodeType.Text:
                    {
                        writer.WriteValue(reader.Value);
                        break;
                    }
                    case XmlNodeType.CDATA:
                    {
                        writer.WriteCData(reader.Value);
                        break;
                    }
                    //case XmlNodeType.EntityReference:
                    //{
                    //    newPackageWriter.WriteEntityRef(oldPackageReader.Name);
                    //    break;
                    //}
                    //case XmlNodeType.Entity:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.ProcessingInstruction:
                    {
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                    }
                    case XmlNodeType.Comment:
                    {
                        writer.WriteComment(reader.Value);
                        break;
                    }
                    //case XmlNodeType.Document:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.DocumentType:
                    {
                        writer.WriteRaw(string.Format("<!DOCTYPE {0} [{1}]>", reader.Name,
                            reader.Value));
                        break;
                    }
                    //case XmlNodeType.DocumentFragment:
                    //{
                    //    break;
                    //}
                    //case XmlNodeType.Notation:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.Whitespace:
                    {
                        writer.WriteWhitespace(reader.Value);
                        break;
                    }
                    //case XmlNodeType.SignificantWhitespace:
                    //{
                    //    break;
                    //}
                    case XmlNodeType.EndElement:
                    {
                        pathStack.Pop();
                        writer.WriteFullEndElement();
                        break;
                    }
                    case XmlNodeType.XmlDeclaration:
                    {
                        writer.WriteStartDocument();
                        break;
                    }
                }
    
                switch (skipMode)
                {
                    case 1:
                    {
                        reader.Skip();
                        pathStack.Pop();
                        readResult = !reader.EOF;
                        break;
                    }
                    case 2:
                    {
                        pathStack.Pop();
                        readResult = !reader.EOF;
                        break;
                    }
                    default:
                    {
                        readResult = reader.Read();
                        break;
                    }
                }
            }
        }
        finally
        {
            writer.Close();
            targetStream.Close();
            targetStream.Dispose();
            reader.Close();
            sourceStream.Close();
            sourceStream.Dispose();
        }
    }
    
    private static bool CompareNodePath(Stack<string> currentNodePathStack, string compareNodePathString)
    {
        var currentArray = currentNodePathStack.Reverse().ToArray();
        var compareArray = compareNodePathString.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (compareArray.Length != currentArray.Length)
        {
            return false;
        }
        bool isDifferent = false;
        for (int i = 0; i < currentArray.Length; i++)
        {
            if (compareArray[i] != currentArray[i])
            {
                isDifferent = true;
                break;
            }
        }
        return !isDifferent;
    }
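
    The path comparison in CompareNodePath relies on the fact that Stack<T> enumerates from the top of the stack down, which is why it calls Reverse() before comparing. A minimal sketch of the same idea, joining the stack into a path string instead of comparing arrays element by element (NodePathSketch, ToPath, and Matches are hypothetical names, not part of the code above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class NodePathSketch
{
    // Builds "/Root/Child/..." from a stack whose top is the current element.
    public static string ToPath(Stack<string> pathStack)
    {
        // Stack<T> enumerates top-first, so reverse it to get root-first order.
        return "/" + string.Join("/", pathStack.Reverse());
    }

    // True when the current element path equals expectedPath exactly.
    public static bool Matches(Stack<string> pathStack, string expectedPath)
    {
        return string.Equals(ToPath(pathStack), expectedPath, StringComparison.Ordinal);
    }
}
```

    Unlike CompareNodePath, this variant compares strings exactly, so a stray trailing or doubled '/' in the configured path would make the match fail.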
  • Original article: https://www.cnblogs.com/Rabbitism/p/7161926.html