  • Reading PDF text and converting PDF to HTML in Java

    Note: the code below is built with Maven; the dependency JARs have also been exported separately.

    Download: pdf jar

    Full source code (only two files)

     

    To read the plain text of a PDF in Java, this post uses the PDFBox toolkit.

    Add the following dependencies to the Maven configuration:

        <dependency>
            <groupId>net.sf.cssbox</groupId>
            <artifactId>pdf2dom</artifactId>
            <version>1.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox-tools</artifactId>
            <version>2.0.12</version>
        </dependency>

    Reading the text directly with PDFBox

    Code example

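        /*
         * Read the text of a PDF.
         */
        @Test
        public void readPdfTextTest() throws IOException {
            // Backslashes in a Java string literal must be escaped
            byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
            // Load the PDF document; try-with-resources closes it afterwards
            try (PDDocument document = PDDocument.load(bytes)) {
                readText(document);
            }
        }

        public void readText(PDDocument document) throws IOException {
            // PDFTextStripper extracts the text of all pages as one String
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        }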

    Converting a PDF to HTML

    Result screenshot

    Code example

        /*
         * Convert a PDF to HTML.
         */
        @Test
        public void pdfToHtmlTest() {
            // Backslashes in a Java string literal must be escaped
            String outputPath = "D:\\code\\pdf\\HashMap.html";
            byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
            // Resources declared inside try (...) are closed automatically
            try (BufferedWriter out = new BufferedWriter(
                         new OutputStreamWriter(new FileOutputStream(new File(outputPath)), "UTF-8"));
                 // Load the PDF document
                 PDDocument document = PDDocument.load(bytes)) {
                PDFDomTree pdfDomTree = new PDFDomTree();
                // Render the document as HTML into the writer
                pdfDomTree.writeText(document, out);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        /*
         * Read a file into a byte array.
         */
        private byte[] getBytes(String filePath) {
            byte[] buffer = null;
            // try-with-resources closes both streams even if reading fails
            try (FileInputStream fis = new FileInputStream(new File(filePath));
                 ByteArrayOutputStream bos = new ByteArrayOutputStream(1000)) {
                byte[] b = new byte[1000];
                int n;
                while ((n = fis.read(b)) != -1) {
                    bos.write(b, 0, n);
                }
                buffer = bos.toByteArray();
            } catch (IOException e) {
                // Also covers FileNotFoundException, a subclass of IOException
                e.printStackTrace();
            }
            // Still null if the file could not be read
            return buffer;
        }
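
    The file handling above can also be written with java.nio.file, which ships with the JDK. The following is only a sketch of that alternative, not the code from the original project: the class name NioFileSketch and the method names readAll and writeUtf8 are made up for illustration. It reads a whole file into a byte array and opens a UTF-8 writer that try-with-resources closes automatically.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for illustration; not part of the original post.
final class NioFileSketch {

    // Reads the whole file into memory. Unlike getBytes above, a failure
    // surfaces as an IOException instead of a null return value.
    static byte[] readAll(String filePath) throws IOException {
        return Files.readAllBytes(Paths.get(filePath));
    }

    // Writes text as UTF-8; try-with-resources closes the writer automatically.
    static void writeUtf8(String filePath, String content) throws IOException {
        Path target = Paths.get(filePath);
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a temporary file
        Path tmp = Files.createTempFile("sketch", ".html");
        writeUtf8(tmp.toString(), "<p>caf\u00e9</p>");
        byte[] bytes = readAll(tmp.toString());
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

    Letting the IOException propagate means the caller never passes a null byte array on to PDDocument.load.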

    A complete feature that uploads a PDF and converts it to HTML (no more hunting for third-party tools to convert PDFs, haha)

    @RequestMapping("ud")
    @Controller
    public class UpAndDownController {
        @RequestMapping("upload.do")
        @ResponseBody
        public Map<String,Object> upload(@RequestParam("file") MultipartFile file, HttpServletRequest request){
            Map<String, Object> map = new HashMap<>();
            map.put("code","200");
            try {
                PdfConvertUtil pdfConvertUtil = new PdfConvertUtil();
                // Derive the HTML file name from the uploaded PDF's name
                String pdfName = file.getOriginalFilename();
                int lastIndex = pdfName.lastIndexOf(".pdf");
                String fileName = pdfName.substring(0, lastIndex);
                String htmlName = fileName + ".html";
                // Write the result under templates/file on the classpath
                String realPath = ResourceUtils.getURL("classpath:").getPath() + "/templates/file";
                File f = new File(realPath);
                if(!f.exists()){
                    f.mkdirs();
                }
                // File.separator replaces the unescaped "\" that did not compile
                String htmlPath = realPath + File.separator + htmlName;
                pdfConvertUtil.pdftohtml(file.getBytes(), htmlPath);
            } catch (Exception e) {
                // Any failure, including a name without ".pdf", is reported as 500
                map.put("code","500");
                e.printStackTrace();
            }
            return map;
        }

    }
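
    The controller derives the HTML file name with lastIndexOf(".pdf"), which fails for names such as REPORT.PDF or names without the extension. A more defensive version of that step might look like the sketch below. It is an assumption about how one could harden the logic, not the author's code; the class HtmlNameSketch and its methods toHtmlName and toHtmlPath are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the original post.
final class HtmlNameSketch {

    // Maps an uploaded file name such as "HashMap.pdf" to "HashMap.html".
    static String toHtmlName(String originalName) {
        if (originalName == null || originalName.trim().isEmpty()) {
            throw new IllegalArgumentException("missing file name");
        }
        // Some clients send a full path; keep only the last segment.
        String name = originalName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        // Strip a trailing ".pdf" regardless of case; otherwise keep the name.
        if (name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            name = name.substring(0, name.length() - ".pdf".length());
        }
        if (name.isEmpty()) {
            throw new IllegalArgumentException("empty base name: " + originalName);
        }
        return name + ".html";
    }

    // Joins the target directory and the derived name with the platform separator.
    static Path toHtmlPath(String directory, String originalName) {
        return Paths.get(directory).resolve(toHtmlName(originalName));
    }

    public static void main(String[] args) {
        System.out.println(toHtmlName("HashMap.pdf"));
        System.out.println(toHtmlName("C:\\tmp\\Report.PDF"));
        System.out.println(toHtmlPath("templates", "notes"));
    }
}
```

    Keeping only the last path segment also prevents an uploaded name from steering the output outside the target directory.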

    You can test the endpoint with Postman.

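    The request has to be sent as multipart/form-data rather than application/x-www-form-urlencoded, because the controller reads the upload as a MultipartFile.

    In Postman, open Body, choose form-data, add a key named file of type File, and let Postman generate the Content-Type header (including its boundary). That is all.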
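
    For readers who want to see what such a form-data request contains, the sketch below builds a multipart/form-data body with a single file part by hand, using only the JDK. It is an illustration under assumptions: the class MultipartSketch, its methods, and the boundary value are hypothetical, and the field name file is taken from @RequestParam("file") in the controller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the original post.
final class MultipartSketch {

    // Value of the Content-Type request header; the boundary must match the body.
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    // Builds a request body that contains exactly one file part.
    static byte[] buildBody(String boundary, String fieldName,
                            String fileName, byte[] fileBytes) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        // Part header: opening boundary, field name, file name, part content type
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        body.write(partHeader.getBytes(StandardCharsets.UTF_8));
        // Raw file content
        body.write(fileBytes);
        // Closing boundary ends the multipart body
        body.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return body.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String boundary = "sketch-boundary";
        byte[] body = buildBody(boundary, "file", "HashMap.pdf", new byte[] {1, 2, 3});
        System.out.println(contentType(boundary));
        System.out.println(body.length + " bytes");
    }
}
```

    An HTTP client would POST these bytes to ud/upload.do together with the header returned by contentType.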

     

    To load a PDF directly in an HTML page without any plugin, see the following:

    https://www.cnblogs.com/jacksoft/p/5302587.html

    https://github.com/mozilla/pdf.js

     

  • Original article: https://www.cnblogs.com/chywx/p/10849749.html