  A single 20 MB Excel file can bring down your POI import. Do you believe it?

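      After the previous post, "An ordinary class can bring down your Spring Boot. Do you believe it?", I happened to run into another problem, hence this post. I will stick to the same "running log" style of tracking down and fixing the problem. This post exists because importing a 20 MB Excel file with POI threw an OOM (OutOfMemoryError). Embarrassingly, in all my years of work I had never imported a file of this size, let alone hit an out-of-memory error doing so. If you search Baidu, nearly every result tells you the same thing: POI can import Excel files in two ways. The first is the user model, which is simple and intuitive and comparable to DOM parsing of XML (only Excel 2007 is considered here; Excel 2003 has a hard row limit and is rarely used any more, so it is ignored). The second is the event model, the "cure-all" usually recommended online for importing large Excel files with POI, comparable to SAX parsing of XML. Well, the first thing this post does is knock that approach down (JVM run with -Xms512m -Xmx512m). If you don't believe it, write any import endpoint and try importing the following 20 MB Excel file: link: https://pan.baidu.com/s/1DUrS8ctLPp7Z6imOc1aIUQ extraction code: hd79 .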

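      First, since we want to import a fairly large Excel 2007 file, we should understand a little about how this format stores its data. A quick search shows that a 2007 file is really just a zip archive: rename the extension to .zip, extract it, and look at the files inside, as follows: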

     

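      The screenshot above shows that the two largest files are sharedStrings.xml and sheet1.xml. sheet2.xml can be ignored; it could be deleted from the workbook without harm. It was left in only because it makes no difference, and the file was handed to me as-is by our tester. sheet2 is small and has nothing to do with the out-of-memory error discussed here, so don't let it distract you.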

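      Opening the two large files in an editor that can handle large text files shows that sharedStrings.xml holds the string content of every cell in the workbook (numeric cells excepted), while each sheetN.xml is the structural XML of one sheet. That is all the background needed for the problem in this post, so on to the main topic.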
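
      To see this layout without unzipping by hand, here is a minimal sketch that lists the parts of an .xlsx file with their uncompressed sizes, largest first. It uses only java.util.zip; the class name XlsxPartLister and its method are illustrative names, not part of POI.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Illustrative helper (not part of POI): lists the parts inside an .xlsx zip archive. */
final class XlsxPartLister {

    /** Returns part name -> uncompressed size in bytes, ordered by size, largest first. */
    static Map<String, Long> listParts(Path xlsx) throws IOException {
        List<ZipEntry> entries = new ArrayList<>();
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            Enumeration<? extends ZipEntry> e = zip.entries();
            while (e.hasMoreElements()) {
                entries.add(e.nextElement());
            }
        }
        entries.sort((a, b) -> Long.compare(b.getSize(), a.getSize()));
        Map<String, Long> result = new LinkedHashMap<>();
        for (ZipEntry entry : entries) {
            result.put(entry.getName(), entry.getSize());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the workbook as the first argument.
        for (Map.Entry<String, Long> part : listParts(Paths.get(args[0])).entrySet()) {
            System.out.printf("%12d  %s%n", part.getValue(), part.getKey());
        }
    }
}
```

      For a workbook like the one in this post, xl/sharedStrings.xml and xl/worksheets/sheet1.xml would be expected at the top of the list.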

      First, import the Excel file with the event-model code found via Baidu:

    package com.example.utils;
    
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.apache.poi.xssf.usermodel.XSSFRichTextString;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;
    
    import java.io.File;
    import java.io.InputStream;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    
    /**
     * Copied straight from a Baidu search result
     * XSSF and SAX (Event API)
     */
    public abstract class BigDataParseExcelUtil extends DefaultHandler {
        private ReadOnlySharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;
        private int sheetIndex = -1;
        private List<String> rowlist = new ArrayList<String>();
        private int curRow = 0; // current row
        private int curCol = 0; // current column index
        private int preCol = 0; // previous column index
        private int titleRow = 0; // header row, normally 0
        private int rowsize = 0; // number of columns
        private List excelList = new ArrayList();  // the whole workbook converted to a list
    
        // Per-row callback: receives the sheet index, the row index and the row's cell values (all Strings)
    
        public abstract void optRows(int sheetIndex, int curRow,
                                     List<String> rowlist, List excelList) throws SQLException, Exception;
    
        // Processes a single sheet; sheetId is the 1-based index of the sheet to read (1-3)
    
        /**
         * @param filename
         * @param sheetId  1-based index of the sheet to read (1-3)
         * @throws Exception
         */
        public void processOneSheet(String filename, int sheetId) throws Exception {
            OPCPackage pkg = OPCPackage.open(filename);
            XSSFReader r = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XMLReader parser = fetchSheetParser(strings);
            // rId2 found by processing the Workbook
            // look up the sheet by rId# or rSheet#
            InputStream sheet2 = r.getSheet("rId" + sheetId);
            sheetIndex++;
            InputSource sheetSource = new InputSource(sheet2);
            parser.parse(sheetSource);
            sheet2.close();
        }
    
        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            // collect the text content of the cell
            lastContents += new String(ch, start, length);
        }
    
        public void process(InputStream inputStream) throws Exception {
            OPCPackage pkg = OPCPackage.open(inputStream);
            XSSFReader r = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XMLReader parser = fetchSheetParser(strings);
            Iterator<InputStream> sheets = r.getSheetsData();
            while (sheets.hasNext()) {
                curRow = 0;
                sheetIndex++;
                InputStream sheet = sheets.next();
                InputSource sheetSource = new InputSource(sheet);
                parser.parse(sheetSource);
                sheet.close();
            }
        }
    
        /**
         * Iterates over the sheets of the Excel file
         */
        public void process(File file) throws Exception {
            OPCPackage pkg = OPCPackage.open(file);
            XSSFReader r = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XMLReader parser = fetchSheetParser(strings);
            Iterator<InputStream> sheets = r.getSheetsData();
            while (sheets.hasNext()) {
                curRow = 0;
                sheetIndex++;
                InputStream sheet = sheets.next();
                InputSource sheetSource = new InputSource(sheet);
                parser.parse(sheetSource);
                sheet.close();
            }
        }
    
        public XMLReader fetchSheetParser(ReadOnlySharedStringsTable sst)
            throws SAXException {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            // .createXMLReader("org.apache.xerces.parsers.SAXParser");
            this.sst = sst;
            parser.setContentHandler(this);
            return parser;
        }
    
        @Override
        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if (name.equals("c")) {
                // if the next element is an index into the SST, set nextIsString to true
                String cellType = attributes.getValue("t");
                String rowStr = attributes.getValue("r");
                curCol = this.getRowIndex(rowStr);
                if (cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // reset the buffer
            lastContents = "";
        }
    
        @Override
        public void endElement(String uri, String localName, String name)
            throws SAXException {
            // use the SST index to fetch the string the cell actually stores
            // characters() may have been called several times by this point
            if (nextIsString) {
                try {
                    int idx = Integer.parseInt(lastContents);
                    lastContents = new XSSFRichTextString(sst.getEntryAt(idx))
                        .toString();
                } catch (Exception e) {
                }
            }
            // v => the cell's value; for a string cell the v element holds the string's index in the SST
            // add the cell content to rowlist, trimming leading and trailing whitespace first
            if (name.equals("v")) {
                String value = lastContents.trim();
                value = value.equals("") ? " " : value;
                int cols = curCol - preCol;
                if (cols > 1) {
                    for (int i = 0; i < cols - 1; i++) {
                        rowlist.add(preCol, "");
                    }
                }
                preCol = curCol;
                rowlist.add(curCol - 1, value);
            } else {
                // a closing row element means the end of the row has been reached, so call optRows()
                if (name.equals("row")) {
                    int tmpCols = rowlist.size();
                    if (curRow > this.titleRow && tmpCols < this.rowsize) {
                        for (int i = 0; i < this.rowsize - tmpCols; i++) {
                            rowlist.add(rowlist.size(), "");
                        }
                    }
                    try {
                        optRows(sheetIndex, curRow, rowlist, excelList);
                    } catch (SQLException e) {
                        e.printStackTrace();
                    } catch (Exception e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                    }
                    if (curRow == this.titleRow) {
                        this.rowsize = rowlist.size();
                    }
                    rowlist.clear();
                    curRow++;
                    curCol = 0;
                    preCol = 0;
                }
            }
        }
    
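        // Returns the column index. The r attribute of each c element is letters followed by digits:
        // the letters give the column and the digits the row, e.g. AB45 is column (A-A+1)*26+(B-A+1) = 28, row 45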
        public int getRowIndex(String rowStr) {
            rowStr = rowStr.replaceAll("[^A-Z]", "");
            byte[] rowAbc = rowStr.getBytes();
            int len = rowAbc.length;
            float num = 0;
            for (int i = 0; i < len; i++) {
                num += (rowAbc[i] - 'A' + 1) * Math.pow(26, len - i - 1);
            }
            return (int) num;
        }
    
    
    }
    
    package com.example.service;
    
    import com.example.utils.BigDataParseExcelUtil;
    import org.springframework.stereotype.Service;
    
    import java.io.InputStream;
    import java.sql.SQLException;
    import java.util.List;
    
    /**
     * @author: rongdi
     * @date:
     */
    @Service
    public class ExcelService {
    
        public void import1(InputStream inputStream) throws Exception {
    
            BigDataParseExcelUtil xlx = new BigDataParseExcelUtil() {
                @Override
                public void optRows(int sheetIndex, int curRow, List<String> rowlist, List excelList)
                    throws SQLException {
                    System.out.println(rowlist);
                }
            };
            xlx.process(inputStream);
        }
    
    
    }
    
    package com.example.controller;
    
    import com.example.service.ExcelService;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.ResponseBody;
    import org.springframework.web.multipart.MultipartFile;
    
    /**
     * @author: rongdi
     * @date:
     */
    @Controller
    public class ExcelController {
    
        @Autowired
        private ExcelService excelService;
    
        @RequestMapping("/excel/import1")
        @ResponseBody
        public String import1(@RequestParam("file") MultipartFile multipartFile) throws Exception {
            excelService.import1(multipartFile.getInputStream());
            return "ok";
        }
    
    }
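
      One detail of the listing above is easy to misread: despite its name, getRowIndex returns a column number, treating the letters of a cell reference as a base-26 number in which A is 1. The sketch below shows the same conversion with plain integer arithmetic instead of float and Math.pow. CellRefs and columnNumber are illustrative names, not part of the original utility.

```java
/** Illustrative helper: converts the letter part of a cell reference to a column number. */
final class CellRefs {

    /** "A1" -> 1, "Z3" -> 26, "AA1" -> 27, "AB45" -> 28. Expects upper-case letters first. */
    static int columnNumber(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (c < 'A' || c > 'Z') {
                break; // the letters come first; stop at the row digits
            }
            col = col * 26 + (c - 'A' + 1);
        }
        return col;
    }

    public static void main(String[] args) {
        System.out.println(columnNumber("AB45")); // prints 28
    }
}
```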

      Using Postman or a similar tool, import the 20 MB file 22.xlsx mentioned above. It fails with the following error:

       So let's optimize: instead of an InputStream, pass in a File directly and see what happens:

        public void import2(File file) throws Exception {
            BigDataParseExcelUtil xlx = new BigDataParseExcelUtil() {
                @Override
                public void optRows(int sheetIndex, int curRow, List<String> rowlist, List excelList)
                    throws SQLException {
                    System.out.println(rowlist);
                }
            };
            xlx.process(file);
        }
    
        @RequestMapping("/excel/import2")
        @ResponseBody
        public String import2(@RequestParam("file") MultipartFile multipartFile) throws Exception {
            // relax POI's zip-bomb check (minimum inflate ratio) so the highly compressed test file is accepted
            ZipSecureFile.setMinInflateRatio(-1.0d);
            File tmp = Files.createTempFile("tmp-", ".xlsx").toFile();
            Files.copy(multipartFile.getInputStream(), Paths.get(tmp.getPath()), StandardCopyOption.REPLACE_EXISTING);
            excelService.import2(tmp);
            return "ok";
        }

      Uploading the file with Postman gives the following result:

     

       This is awkward. Could POI's own code have a bug? We can set a breakpoint and check whether this strings object really holds the entire content of sharedStrings.xml.

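       This confirms that strings simply holds too much, and that is what causes the out-of-memory error. It also shows that the event-model solutions recommended online for parsing Excel basically do not work for a file like this one. I don't understand why Baidu is full of such answers; have those authors never had to import a large file? Then again, I may be doing them an injustice. sharedStrings.xml stores the string content of the cells, and it is deduplicated. If a workbook has many cells but most of them are integers, or most of the strings are repeated, this step really is harmless and everything after it goes smoothly, because SAX parsing of the XML does save a lot of memory.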
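
      A toy model may help show why the number of distinct strings, rather than the number of cells, drives memory use here. It only illustrates the idea behind sharedStrings.xml and is not POI's implementation; SharedStringsDemo and its methods are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a shared string table (not POI code). */
final class SharedStringsDemo {
    private final List<String> table = new ArrayList<>();          // unique strings, in first-seen order
    private final Map<String, Integer> indexOf = new HashMap<>();  // string -> position in the table

    /** Returns the index a string cell would store in its v element. */
    int intern(String value) {
        Integer idx = indexOf.get(value);
        if (idx == null) {
            idx = table.size();
            table.add(value);
            indexOf.put(value, idx);
        }
        return idx;
    }

    /** Resolves an index back to the text, as the sheet parser has to do for every string cell. */
    String resolve(int idx) {
        return table.get(idx);
    }

    int uniqueCount() {
        return table.size();
    }

    public static void main(String[] args) {
        SharedStringsDemo repeated = new SharedStringsDemo();
        SharedStringsDemo distinct = new SharedStringsDemo();
        for (int row = 0; row < 250_000; row++) {
            repeated.intern("OK");            // the same text in every cell
            distinct.intern("order-" + row);  // different text in every cell
        }
        System.out.println(repeated.uniqueCount()); // prints 1
        System.out.println(distinct.uniqueCount()); // prints 250000
    }
}
```

      Both workbooks in this example have the same number of string cells, but only the second forces a reader that loads the whole table to keep 250,000 strings on the heap.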

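      From the analysis above, POI offers just two ways to import. The user model is simple to code against, basically counting cells in order, but like DOM parsing of XML it is very memory-hungry. The event model, like SAX parsing of XML, really is frugal with memory, but the class POI provides puts the large number of parsed strings into a collection, which still causes an out-of-memory error. So how do we solve this? The natural question is what strings is for and how it is used. If we can change the logic of ReadOnlySharedStringsTable while keeping the behaviour of strings the same, the out-of-memory error goes away. So we search ReadOnlySharedStringsTable for every method that uses strings and set breakpoints on them, especially on the methods that read from strings, then raise the JVM heap so that the OOM does not occur and step through in the debugger, as follows: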

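       Can we simply replace the places where strings are added to and fetched from strings, so that the collection no longer stores every string? Excel is designed to keep shared strings in a single sharedStrings.xml, unlike CSV, where reading one line gives you one row of data. The data in sharedStrings.xml has to be parsed, and it has to be stored somewhere; otherwise how would we combine it with the layout in the sheet XML to obtain each row's data? That is the awkward part. We cannot skip storing the strings while parsing sharedStrings.xml and then parse the XML again every time a string is needed: for the file in this post that would mean re-parsing it 250,000 times, which is hopelessly inefficient. The problem therefore reduces to this: all strings parsed from sharedStrings.xml must be put somewhere they can be looked up easily, and, because we fear running out of memory, that place cannot be the heap. There are several options: store the parsed strings, in the order they would have been added to the strings collection, in a database, a file, external storage, or a cache (with a memory limit, spilling the excess to a file), and then fetch them by index position idx when needed. This post starts with a temporary file, because I don't want to make things complicated: however complex the system, an import task ultimately runs on a single node, and on a single node the machine's own local resources are the most convenient to use. Below, the POI source is copied as-is and the two places mentioned above are modified.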

    package com.example.utils;
    
    import org.apache.poi.ooxml.util.SAXHelper;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.openxml4j.opc.PackagePart;
    import org.apache.poi.ss.usermodel.RichTextString;
    import org.apache.poi.util.Removal;
    import org.apache.poi.xssf.model.SharedStrings;
    import org.apache.poi.xssf.usermodel.XSSFRelation;
    import org.apache.poi.xssf.usermodel.XSSFRichTextString;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    
    import javax.xml.parsers.ParserConfigurationException;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.LineNumberReader;
    import java.io.PushbackInputStream;
    import java.nio.file.Files;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    
    import static org.apache.poi.xssf.usermodel.XSSFRelation.NS_SPREADSHEETML;
    
    public class ReadOnlySharedStringsTable extends DefaultHandler implements SharedStrings {
    
        protected final boolean includePhoneticRuns;
    
        /**
         * An integer representing the total count of strings in the workbook. This count does not
         * include any numbers, it counts only the total of text strings in the workbook.
         */
        protected int count;
    
        /**
         * An integer representing the total count of unique strings in the Shared String Table.
         * A string is unique even if it is a copy of another string, but has different formatting applied
         * at the character level.
         */
        protected int uniqueCount;
    
        /**
         * The shared strings table.
         */
        private List<String> strings;
    
        private File tmp = null;
    
        FileOutputStream fos = null;
    
        private int counts;
    
        private Map<Integer,String> map = new LinkedHashMap<Integer,String>();
    
        public ReadOnlySharedStringsTable(OPCPackage pkg)
                throws IOException, SAXException {
            this(pkg, true);
        }
    
        public ReadOnlySharedStringsTable(OPCPackage pkg, boolean includePhoneticRuns)
                throws IOException, SAXException {
            this.includePhoneticRuns = includePhoneticRuns;
            ArrayList<PackagePart> parts =
                    pkg.getPartsByContentType(XSSFRelation.SHARED_STRINGS.getContentType());
    
            // Some workbooks have no shared strings table.
            if (parts.size() > 0) {
                PackagePart sstPart = parts.get(0);
                readFrom(sstPart.getInputStream());
            }
        }
    
        /**
         * Like POIXMLDocumentPart constructor
         *
         * Calls {@link #ReadOnlySharedStringsTable(PackagePart, boolean)}, with a
         * value of <code>true</code> to include phonetic runs.
         *
         * @since POI 3.14-Beta1
         */
        public ReadOnlySharedStringsTable(PackagePart part) throws IOException, SAXException {
            this(part, true);
        }
    
        /**
         * Like POIXMLDocumentPart constructor
         *
         * @since POI 3.14-Beta3
         */
        public ReadOnlySharedStringsTable(PackagePart part, boolean includePhoneticRuns)
            throws IOException, SAXException {
            this.includePhoneticRuns = includePhoneticRuns;
            readFrom(part.getInputStream());
        }
        
        /**
         * Read this shared strings table from an XML file.
         *
         * @param is The input stream containing the XML document.
         * @throws IOException if an error occurs while reading.
         * @throws SAXException if parsing the XML data fails.
         */
        public void readFrom(InputStream is) throws IOException, SAXException {
            // test if the file is empty, otherwise parse it
            PushbackInputStream pis = new PushbackInputStream(is, 1);
            int emptyTest = pis.read();
            if (emptyTest > -1) {
                pis.unread(emptyTest);
                InputSource sheetSource = new InputSource(pis);
                try {
                    XMLReader sheetParser = SAXHelper.newXMLReader();
                    sheetParser.setContentHandler(this);
                    sheetParser.parse(sheetSource);
                } catch(ParserConfigurationException e) {
                    throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
                }
            }
        }
    
        /**
         * Return an integer representing the total count of strings in the workbook. This count does not
         * include any numbers, it counts only the total of text strings in the workbook.
         *
         * @return the total count of strings in the workbook
         */
        @Override
        public int getCount() {
            return this.count;
        }
    
        /**
         * Returns an integer representing the total count of unique strings in the Shared String Table.
         * A string is unique even if it is a copy of another string, but has different formatting applied
         * at the character level.
         *
         * @return the total count of unique strings in the workbook
         */
        @Override
        public int getUniqueCount() {
            return this.uniqueCount;
        }
    
        /**
         * Return the string at a given index.
         * Formatting is ignored.
         *
         * @param idx index of item to return.
         * @return the item at the specified position in this Shared String table.
         * @deprecated use <code>getItemAt</code> instead
         */
        @Removal(version = "4.2")
        @Deprecated
        public String getEntryAt(int idx) {
            /**
             * This is the modified part: read the requested string straight from the line-per-string temp file
             */
            String value = map.get(idx + 1);
            if(value == null) {
    
                return readString(idx,1000,this.uniqueCount);
            } else {
                return value;
            }
    
        }
    
        /**
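         * Reads size strings starting at the given position. This relies on locality of reference:
         * loading size strings at a time avoids reading the file on every lookup, which would be extremely slow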
         * @return
         */
        private String readString(int idx,int size,int numbers) {
            map.clear();
            int currNumber = idx + 1;
            if (currNumber < 0 || currNumber > numbers) {
                return null;
            }
            try {
                FileReader in = new FileReader(tmp);
                LineNumberReader reader = new LineNumberReader(in);
                try {
                    String line = "";
                    for(int i = 1;i <= numbers;i ++) {
                        line = reader.readLine();
                        if(i >= currNumber && i < currNumber + size) {
                            map.put(i, line);
                        }
                    }
                } finally {
                    reader.close();
                    in.close();
                }
            } catch (Exception e) {
                System.out.println(e.getMessage());
            }
            return map.get(idx + 1);
        }
    
    
        /**
         * Returns all the strings.
         * Formatting is ignored.
         *
         * @return a list with all the strings
         * @deprecated use <code>getItemAt</code> instead
         */
        @Removal(version = "4.2")
        @Deprecated
        public List<String> getItems() {
            return strings;
        }
    
        @Override
        public RichTextString getItemAt(int idx) {
            return new XSSFRichTextString(getEntryAt(idx));
        }
    
        //// ContentHandler methods ////
    
        private StringBuilder characters;
        private boolean tIsOpen;
        private boolean inRPh;
    
        @Override
        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {
                return;
            }
    
            if ("sst".equals(localName)) {
                String count = attributes.getValue("count");
                if(count != null) this.count = Integer.parseInt(count);
                String uniqueCount = attributes.getValue("uniqueCount");
                if(uniqueCount != null) this.uniqueCount = Integer.parseInt(uniqueCount);
                try {
                    tmp = Files.createTempFile("tmp-", ".xlsx").toFile();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                //    this.strings = new ArrayList<>(this.uniqueCount);
                characters = new StringBuilder(64);
                try {
                    fos = new FileOutputStream(tmp,true);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            } else if ("si".equals(localName)) {
                characters.setLength(0);
            } else if ("t".equals(localName)) {
                tIsOpen = true;
            } else if ("rPh".equals(localName)) {
                inRPh = true;
                //append space...this assumes that rPh always comes after regular <t>
                if (includePhoneticRuns && characters.length() > 0) {
                    characters.append(" ");
                }
            }
        }
    
        @Override
        public void endElement(String uri, String localName, String name) throws SAXException {
            if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {
                return;
            }
    
            if ("si".equals(localName)) {
             //   strings.add(characters.toString().intern());
                try {
                    /**
                     * This is the other modified part: write each string to the temp file, one per line
                     */
                    counts ++;
                    fos.write((characters.toString() + "\n").getBytes());
                    if(counts == this.uniqueCount) {
                        fos.close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            } else if ("t".equals(localName)) {
                tIsOpen = false;
            } else if ("rPh".equals(localName)) {
                inRPh = false;
            }
        }
    
        /**
         * Captures characters only if a t(ext) element is open.
         */
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (tIsOpen) {
                if (inRPh && includePhoneticRuns) {
                    characters.append(ch, start, length);
                } else if (! inRPh){
                    characters.append(ch, start, length);
                }
            }
        }
    
    }
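
      The modified class above rescans the temporary file from the start on every cache miss, and a cell string that itself contains a line break would shift every later line. As a hedged variation (not the author's code and not a POI API), the sketch below keeps one long offset per string in memory, 8 bytes each, so about 2 MB for 250,000 entries, and seeks straight to the requested string. DiskStringStore and its methods are hypothetical names.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical spill-to-disk string store: text lives in a temp file, only offsets stay on the heap. */
final class DiskStringStore implements Closeable {
    private final Path file;
    private OutputStream out;
    private RandomAccessFile reader;
    private long[] offsets = new long[1024]; // offsets[i] = start of string i, offsets[size] = end of data
    private int size;
    private long written;

    DiskStringStore() throws IOException {
        file = Files.createTempFile("sst-", ".bin");
        out = new BufferedOutputStream(Files.newOutputStream(file));
    }

    /** Appends one string; intended to be called once per si element, in document order. */
    void add(String s) throws IOException {
        if (out == null) {
            throw new IllegalStateException("store is already finished");
        }
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        if (size + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        offsets[size++] = written;
        out.write(bytes);
        written += bytes.length;
        offsets[size] = written;
    }

    /** Switches the store from writing to reading. */
    void finish() throws IOException {
        out.close();
        out = null;
        reader = new RandomAccessFile(file.toFile(), "r");
    }

    /** Returns the string at idx, or null when idx is out of range. */
    String get(int idx) throws IOException {
        if (reader == null) {
            throw new IllegalStateException("call finish() before get()");
        }
        if (idx < 0 || idx >= size) {
            return null;
        }
        byte[] bytes = new byte[(int) (offsets[idx + 1] - offsets[idx])];
        reader.seek(offsets[idx]);
        reader.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
        if (reader != null) {
            reader.close();
        }
        Files.deleteIfExists(file);
    }
}
```

      In a copy of ReadOnlySharedStringsTable, add would go where the original called strings.add, finish would run once parsing ends, and getEntryAt would delegate to get. Because entries are delimited by offsets rather than by line breaks, strings containing newlines survive the round trip.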

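      Then, in your own code, change the import so that your own package's class is used in place of POI's. The JVM heap at run time looks like this, under no pressure at all: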

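      With that, the out-of-memory problem is solved! The problems encountered when importing a large Excel file with POI can be summarized as follows: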

      1) The solutions given online, whether user model or event model, often cannot handle importing a large Excel file.

      2) An Excel file is essentially a zip archive of XML files (only 2007 is considered here, 2003 is ignored) whose extension has been changed to .xlsx.

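      3) When importing with the event model, first save the upload to a file and then call OPCPackage.open(file). If you pass the input stream in directly, the internal logic first copies the entire stream into a ByteArrayOutputStream, which is an in-memory byte stream, so this too can cause an out-of-memory error.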

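      4) The user model is out of the question. The event model first parses the large sharedStrings.xml and puts the result into a List. There is no way around parsing this file, because the string of every cell lives in it, and the most conventional way to parse it is to keep the result in memory in a container such as a List or a Map; I doubt anyone would think of writing freshly parsed XML back out to a file. So you basically cannot avoid ReadOnlySharedStringsTable or SharedStringsTable, and even if you manage to bypass them and parse the file yourself, you may well repeat the tragedy of those two classes. This is the other root cause of out-of-memory errors.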
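
      Point 3 can be sketched with the JDK alone: spool the upload to a temporary file, hand the file to the parser, and always delete it afterwards (the controller shown earlier never removes its temp file). TempFileImports, FileTask and withTempCopy are illustrative names; the callback is where a call such as OPCPackage.open(file) would go.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative helper: spools an upload to disk so the parser can work from a file, not a stream. */
final class TempFileImports {

    /** Work to run against the spooled file, for example opening it with POI. */
    interface FileTask<T> {
        T run(Path file) throws Exception;
    }

    /** Copies the stream to a temp file, runs the task on it, and deletes the file in all cases. */
    static <T> T withTempCopy(InputStream upload, FileTask<T> task) throws Exception {
        Path tmp = Files.createTempFile("upload-", ".xlsx");
        try {
            Files.copy(upload, tmp, StandardCopyOption.REPLACE_EXISTING);
            return task.run(tmp);
        } finally {
            // The task is expected to have closed anything it opened on the file by now.
            Files.deleteIfExists(tmp);
        }
    }
}
```

      A controller could then call something like TempFileImports.withTempCopy(multipartFile.getInputStream(), path -> { excelService.import2(path.toFile()); return "ok"; }), assuming the service method shown earlier.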

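      Looking back, the implementation above, which crudely dumps the contents of sharedStrings.xml to a file and then reads them back from it, is a poor one: it avoids the out-of-memory error, but its performance is worrying. Below, the ReadCache used in the easyexcel source is borrowed to store the contents of sharedStrings.xml: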

    package com.example.advanceevent;
    
    import com.example.utils.FileUtils;
    import org.ehcache.Cache;
    import org.ehcache.CacheManager;
    import org.ehcache.config.CacheConfiguration;
    import org.ehcache.config.builders.CacheConfigurationBuilder;
    import org.ehcache.config.builders.CacheManagerBuilder;
    import org.ehcache.config.builders.ResourcePoolsBuilder;
    import org.ehcache.config.units.MemoryUnit;
    import org.ehcache.core.Ehcache;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    
    import java.io.File;
    import java.util.HashMap;
    import java.util.UUID;
    
    /**
     * @author: rongdi
     * @date:
     */
    public class ReadCache {
    
        private static final Logger LOGGER = LoggerFactory.getLogger(Ehcache.class);
        private int index = 0;
        private HashMap<Integer, String> dataMap = new HashMap(1334);
        private static CacheManager fileCacheManager;
        private static CacheConfiguration<Integer, HashMap> fileCacheConfiguration;
        private static CacheManager activeCacheManager;
        private CacheConfiguration<Integer, HashMap> activeCacheConfiguration;
        private Cache<Integer, HashMap> fileCache;
        private Cache<Integer, HashMap> activeCache;
        private String cacheAlias;
        private int cacheMiss = 0;
    
        public ReadCache(int maxCacheActivateSize) {
            this.activeCacheConfiguration = CacheConfigurationBuilder.newCacheConfigurationBuilder(Integer.class, HashMap.class, ResourcePoolsBuilder.newResourcePoolsBuilder().heap((long)maxCacheActivateSize, MemoryUnit.MB)).withSizeOfMaxObjectGraph(1000000L).withSizeOfMaxObjectSize((long)maxCacheActivateSize, MemoryUnit.MB).build();
            init();
        }
    
        private void init() {
            this.cacheAlias = UUID.randomUUID().toString();
            this.fileCache = fileCacheManager.createCache(this.cacheAlias, fileCacheConfiguration);
            this.activeCache = activeCacheManager.createCache(this.cacheAlias, this.activeCacheConfiguration);
        }
    
        public void put(String value) {
            this.dataMap.put(this.index, value);
            if ((this.index + 1) % 1000 == 0) {
                this.fileCache.put(this.index / 1000, this.dataMap);
                this.dataMap = new HashMap(1334);
            }
    
            ++this.index;
            if (LOGGER.isDebugEnabled() && this.index % 1000000 == 0) {
                LOGGER.debug("Already put :{}", this.index);
            }
    
        }
    
        public String get(Integer key) {
            if (key != null && key >= 0) {
                int route = key / 1000;
                HashMap<Integer, String> dataMap = (HashMap)this.activeCache.get(route);
                if (dataMap == null) {
                    dataMap = (HashMap)this.fileCache.get(route);
                    this.activeCache.put(route, dataMap);
                    if (LOGGER.isDebugEnabled() && this.cacheMiss++ % 1000 == 0) {
                        LOGGER.debug("Cache misses count:{}", this.cacheMiss);
                    }
                }
    
                return (String)dataMap.get(key);
            } else {
                return null;
            }
        }
    
        public void putFinished() {
            if (this.dataMap != null) {
                this.fileCache.put(this.index / 1000, this.dataMap);
            }
        }
    
        public void destroy() {
            fileCacheManager.removeCache(this.cacheAlias);
            activeCacheManager.removeCache(this.cacheAlias);
        }
    
        static {
            File cacheFile = FileUtils.createCacheTmpFile();
            fileCacheManager = CacheManagerBuilder.newCacheManagerBuilder().with(CacheManagerBuilder.persistence(cacheFile)).build(true);
            activeCacheManager = CacheManagerBuilder.newCacheManagerBuilder().build(true);
            fileCacheConfiguration = CacheConfigurationBuilder.newCacheConfigurationBuilder(Integer.class, HashMap.class, ResourcePoolsBuilder.newResourcePoolsBuilder().disk(10L, MemoryUnit.GB)).withSizeOfMaxObjectGraph(1000000L).withSizeOfMaxObjectSize(10L, MemoryUnit.GB).build();
        }
    
    }
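
      The paging idea in ReadCache can be seen in isolation in the following sketch. It is a simplified stand-in, not easyexcel or Ehcache code: strings are grouped into pages of 1000, a plain HashMap plays the role of the disk tier, and an access-ordered LinkedHashMap plays the role of the size-limited active cache. PagedStringCache and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified two-tier cache: pages of strings, with only a few pages kept "active". */
final class PagedStringCache {
    private static final int PAGE_SIZE = 1000;

    // Stand-in for the disk tier; a real implementation would persist these pages outside the heap.
    private final Map<Integer, List<String>> backingStore = new HashMap<>();
    private final Map<Integer, List<String>> activePages;
    private List<String> currentPage = new ArrayList<>(PAGE_SIZE);
    private int index;
    int misses;

    PagedStringCache(final int maxActivePages) {
        // Access-ordered map that evicts the least recently used page once the limit is exceeded.
        this.activePages = new LinkedHashMap<Integer, List<String>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, List<String>> eldest) {
                return size() > maxActivePages;
            }
        };
    }

    void put(String value) {
        currentPage.add(value);
        index++;
        if (index % PAGE_SIZE == 0) {
            flush();
        }
    }

    /** Flushes the last, partially filled page; call once after the final put. */
    void putFinished() {
        if (!currentPage.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        backingStore.put((index - 1) / PAGE_SIZE, currentPage);
        currentPage = new ArrayList<>(PAGE_SIZE);
    }

    String get(int key) {
        if (key < 0 || key >= index) {
            return null;
        }
        int route = key / PAGE_SIZE;
        List<String> page = activePages.get(route);
        if (page == null) {
            misses++;
            page = backingStore.get(route);
            if (page == null) {
                return null; // page not flushed yet (putFinished was not called)
            }
            activePages.put(route, page);
        }
        return page.get(key % PAGE_SIZE);
    }
}
```

      Since cells in a sheet tend to reference neighbouring shared-string indexes, most lookups land on a page that is already active, which is the locality the original readString relied on as well.
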
    package com.example.advanceevent;
    
    import org.apache.poi.ooxml.util.SAXHelper;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.openxml4j.opc.PackagePart;
    import org.apache.poi.ss.usermodel.RichTextString;
    import org.apache.poi.util.Removal;
    import org.apache.poi.xssf.model.SharedStrings;
    import org.apache.poi.xssf.usermodel.XSSFRelation;
    import org.apache.poi.xssf.usermodel.XSSFRichTextString;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    
    import javax.xml.parsers.ParserConfigurationException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;
    import java.util.ArrayList;
    import java.util.List;
    
    import static org.apache.poi.xssf.usermodel.XSSFRelation.NS_SPREADSHEETML;
    
    public class ReadOnlySharedStringsTable extends DefaultHandler implements SharedStrings {
    
        protected final boolean includePhoneticRuns;
    
        /**
         * An integer representing the total count of strings in the workbook. This count does not
         * include any numbers, it counts only the total of text strings in the workbook.
         */
        protected int count;
    
        /**
         * An integer representing the total count of unique strings in the Shared String Table.
         * A string is unique even if it is a copy of another string, but has different formatting applied
         * at the character level.
         */
        protected int uniqueCount;
    
        /**
         * Cache that holds the shared strings
         */
        ReadCache readCache = new ReadCache(100);
    
        private int counts;
    
    
        public ReadOnlySharedStringsTable(OPCPackage pkg)
                throws IOException, SAXException {
            this(pkg, true);
        }
    
        public ReadOnlySharedStringsTable(OPCPackage pkg, boolean includePhoneticRuns)
                throws IOException, SAXException {
            this.includePhoneticRuns = includePhoneticRuns;
            ArrayList<PackagePart> parts =
                    pkg.getPartsByContentType(XSSFRelation.SHARED_STRINGS.getContentType());
    
            // Some workbooks have no shared strings table.
            if (parts.size() > 0) {
                PackagePart sstPart = parts.get(0);
                readFrom(sstPart.getInputStream());
            }
        }
    
        /**
         * Like POIXMLDocumentPart constructor
         *
         * Calls {@link #ReadOnlySharedStringsTable(PackagePart, boolean)}, with a
         * value of <code>true</code> to include phonetic runs.
         *
         * @since POI 3.14-Beta1
         */
        public ReadOnlySharedStringsTable(PackagePart part) throws IOException, SAXException {
            this(part, true);
        }
    
        /**
         * Like POIXMLDocumentPart constructor
         *
         * @since POI 3.14-Beta3
         */
        public ReadOnlySharedStringsTable(PackagePart part, boolean includePhoneticRuns)
            throws IOException, SAXException {
            this.includePhoneticRuns = includePhoneticRuns;
            readFrom(part.getInputStream());
        }
        
        /**
         * Read this shared strings table from an XML file.
         *
         * @param is The input stream containing the XML document.
         * @throws IOException if an error occurs while reading.
         * @throws SAXException if parsing the XML data fails.
         */
        public void readFrom(InputStream is) throws IOException, SAXException {
            // test if the file is empty, otherwise parse it
            PushbackInputStream pis = new PushbackInputStream(is, 1);
            int emptyTest = pis.read();
            if (emptyTest > -1) {
                pis.unread(emptyTest);
                InputSource sheetSource = new InputSource(pis);
                try {
                    XMLReader sheetParser = SAXHelper.newXMLReader();
                    sheetParser.setContentHandler(this);
                    sheetParser.parse(sheetSource);
                } catch(ParserConfigurationException e) {
                    throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
                }
            }
        }
    
        /**
         * Return an integer representing the total count of strings in the workbook. This count does not
         * include any numbers, it counts only the total of text strings in the workbook.
         *
         * @return the total count of strings in the workbook
         */
        @Override
        public int getCount() {
            return this.count;
        }
    
        /**
         * Returns an integer representing the total count of unique strings in the Shared String Table.
         * A string is unique even if it is a copy of another string, but has different formatting applied
         * at the character level.
         *
         * @return the total count of unique strings in the workbook
         */
        @Override
        public int getUniqueCount() {
            return this.uniqueCount;
        }
    
        /**
         * Return the string at a given index.
         * Formatting is ignored.
         *
         * @param idx index of item to return.
         * @return the item at the specified position in this Shared String table.
         * @deprecated use <code>getItemAt</code> instead
         */
        @Removal(version = "4.2")
        @Deprecated
        public String getEntryAt(int idx) {
            /**
             * This is the modified part: the requested string is read directly from
             * the temporary file, which stores one string per line.
             */
            return readCache.get(idx);
    
        }
    
        /**
         * Returns all the strings.
         * Formatting is ignored.
         *
         * @return a list with all the strings
         * @deprecated use <code>getItemAt</code> instead
         */
        @Removal(version = "4.2")
        @Deprecated
        public List<String> getItems() {
            // Not supported in this modified version: the strings are kept in the cache, not in a list.
            return null;
        }
    
        @Override
        public RichTextString getItemAt(int idx) {
            return new XSSFRichTextString(getEntryAt(idx));
        }
    
        //// ContentHandler methods ////
    
        private StringBuilder characters;
        private boolean tIsOpen;
        private boolean inRPh;
    
        @Override
        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {
                return;
            }
    
            if ("sst".equals(localName)) {
                String count = attributes.getValue("count");
                if(count != null) this.count = Integer.parseInt(count);
                String uniqueCount = attributes.getValue("uniqueCount");
                if(uniqueCount != null) this.uniqueCount = Integer.parseInt(uniqueCount);
                //    this.strings = new ArrayList<>(this.uniqueCount);
                characters = new StringBuilder(64);
            } else if ("si".equals(localName)) {
                characters.setLength(0);
            } else if ("t".equals(localName)) {
                tIsOpen = true;
            } else if ("rPh".equals(localName)) {
                inRPh = true;
                //append space...this assumes that rPh always comes after regular <t>
                if (includePhoneticRuns && characters.length() > 0) {
                    characters.append(" ");
                }
            }
        }
    
        @Override
        public void endElement(String uri, String localName, String name) throws SAXException {
            if (uri != null && ! uri.equals(NS_SPREADSHEETML)) {
                return;
            }
    
            if ("si".equals(localName)) {
             //   strings.add(characters.toString().intern());
                readCache.put(characters.toString());
                /**
                 * This is part of the modification: the string is written straight to
                 * the temporary file, one string per line.
                 */
                counts ++;
                if(counts == this.uniqueCount) {
                    readCache.putFinished();
                }
            } else if ("t".equals(localName)) {
                tIsOpen = false;
            } else if ("rPh".equals(localName)) {
                inRPh = false;
            }
        }
    
        /**
         * Captures characters only if a t(ext) element is open.
         */
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (tIsOpen) {
                if (inRPh && includePhoneticRuns) {
                    characters.append(ch, start, length);
                } else if (! inRPh){
                    characters.append(ch, start, length);
                }
            }
        }
    
    }

      At this point the code runs considerably faster, and the out-of-memory problem is solved as well. Full test code: https://github.com/rongdi/poi-example.git
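
      To make the idea behind the modification concrete, here is a minimal, self-contained sketch of a store that writes strings to a temporary file one per line and reads them back by index, keeping only one small block in memory. It is not the ReadCache class used above: the class name SpillStringStore, the block-offset index and the escaping of line breaks are all assumptions made for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; NOT the ReadCache class used in the article. */
final class SpillStringStore implements AutoCloseable {
    private final int blockSize;                                // lines held in memory at once
    private final Path file;                                    // temp file, one string per line
    private final List<Long> blockOffsets = new ArrayList<>();  // byte offset of each block start
    private final List<String> cachedLines = new ArrayList<>(); // the single block kept in memory
    private final OutputStream out;
    private FileChannel in;                                     // opened by finish()
    private long bytesWritten;
    private int size;
    private int cachedBlock = -1;

    SpillStringStore(int blockSize) throws IOException {
        this.blockSize = blockSize;
        this.file = Files.createTempFile("shared-strings-", ".txt");
        this.out = new BufferedOutputStream(Files.newOutputStream(file));
    }

    /** Appends one string as one escaped line; call before finish(). */
    void put(String s) throws IOException {
        if (size % blockSize == 0) blockOffsets.add(bytesWritten);
        byte[] line = (escape(s) + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(line);
        bytesWritten += line.length;
        size++;
    }

    /** Flushes the file and switches the store to read mode. */
    void finish() throws IOException {
        out.close();
        in = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Returns the string at idx, loading its block from disk if it is not cached. */
    String get(int idx) throws IOException {
        if (in == null) throw new IllegalStateException("call finish() first");
        if (idx < 0 || idx >= size) throw new IndexOutOfBoundsException("index " + idx);
        int block = idx / blockSize;
        if (block != cachedBlock) {
            cachedBlock = -1;
            cachedLines.clear();
            in.position(blockOffsets.get(block));
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(Channels.newInputStream(in), StandardCharsets.UTF_8));
            String line;
            while (cachedLines.size() < blockSize && (line = r.readLine()) != null) {
                cachedLines.add(unescape(line));
            }
            cachedBlock = block;
        }
        return cachedLines.get(idx % blockSize);
    }

    // Cell text may contain line breaks, so they are escaped to keep one string per line.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r");
    }

    private static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                c = s.charAt(++i);
                c = (c == 'n') ? '\n' : (c == 'r') ? '\r' : c;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    @Override
    public void close() throws IOException {
        out.close();
        if (in != null) in.close();
        Files.deleteIfExists(file);
    }

    public static void main(String[] args) throws IOException {
        try (SpillStringStore store = new SpillStringStore(100)) {
            for (int i = 0; i < 1000; i++) store.put("cell-" + i);
            store.finish();
            System.out.println(store.get(742));
        }
    }
}
```

      The trade-off in this sketch is that the heap holds only one block of strings plus one offset per block, while every lookup outside the cached block costs a seek and a re-read. That suits SAX-style parsing, where a sheet tends to reference shared strings in roughly ascending order. Escaping matters because a cell can contain line breaks, which would otherwise shift every later index by one line.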

      

      

  • Original article: https://www.cnblogs.com/rongdi/p/11872810.html