  • Extracting specific pages from a PDF file

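    A while ago I bought a Kindle e-reader, and I wanted to use it to read PDF documents, mainly the Python standard library reference and the official MySQL documentation.

    That is where the problem started. Both are very large documents. Reading them on a Mac was fine, but on the Kindle it was really inconvenient, mainly because the Kindle's PDF support is limited and it cannot navigate by table of contents. So I decided to split each large PDF file into smaller PDF files, one per chapter.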

    1. Install the PyPDF2 Python package

    pip3 install PyPDF2

    2. Extract pages from the source PDF file

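    #!/usr/local/python/bin/python3
    """
    Extract pages from a PDF.
    """

    # PdfFileReader/PdfFileWriter are the PyPDF2 1.x/2.x names;
    # PyPDF2 3.x replaced them with PdfReader/PdfWriter.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    if __name__ == "__main__":
        reader = PdfFileReader('/Users/jianglexing/Documents/linux/python/python-3.6/library.pdf')
        writer = PdfFileWriter()
        # Index of the first page to extract (zero-based)
        start = 108
        # Stop index (exclusive): the last page copied is stop - 1
        stop = 126
        with open('/Users/jianglexing/Documents/python-std-re.pdf', 'wb') as wstream:
            for page in range(start, stop):
                temp = reader.getPage(page)
                writer.addPage(temp)
            writer.write(wstream)
        print("Extraction complete")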
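
    The page numbers in this script are zero-based indices passed to `range(start, stop)`, so `start=108` and `stop=126` copy the 109th through 126th pages of the file. Below is a minimal sketch of a helper that turns the page numbers a PDF viewer shows (1-based, inclusive) into those indices. The name `page_indices` and its parameters are hypothetical and not part of PyPDF2.

```python
def page_indices(first_page, last_page, total_pages):
    """Convert a 1-based, inclusive page range into zero-based page indices.

    first_page and last_page are the page numbers as counted by a PDF viewer;
    total_pages is the number of pages in the source file.
    """
    if first_page < 1:
        raise ValueError("first_page must be at least 1")
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    if last_page > total_pages:
        raise ValueError("last_page is beyond the end of the document")
    # range() excludes its upper bound, so last_page itself is the stop value.
    return range(first_page - 1, last_page)
```

    With this helper, `page_indices(109, 126, total_pages)` yields the same indices as `range(108, 126)` in the script, where `total_pages` would come from the reader object.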

    3. The feature works, but it is not very user-friendly yet, so let's improve the code

    #!/usr/local/python/bin/python3
    """
    Extract pages from a PDF.
    """

    import argparse

    from PyPDF2 import PdfFileReader, PdfFileWriter

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('--source-file', default=r'/Users/jianglexing/Documents/linux/python/python-3.6/library.pdf', help='full path of the source file')
        parser.add_argument('--target-file', default=r'/tmp/target.pdf', help='full path of the target file')
        parser.add_argument('--start-page', default=0, type=int, help='index of the first page to extract (zero-based)')
        parser.add_argument('--stop-page', default=0, type=int, help='index at which to stop (exclusive)')
        args = parser.parse_args()
        reader = PdfFileReader(args.source_file)
        writer = PdfFileWriter()
        with open(args.target_file, 'wb') as wstream:
            # With the default start and stop of 0 the range is empty and no pages are copied.
            for page in range(args.start_page, args.stop_page):
                temp = reader.getPage(page)
                writer.addPage(temp)
            writer.write(wstream)
        print("Extraction complete")
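
    One thing the command-line version does not do is check its page arguments: a stop page that is not greater than the start page silently produces an empty PDF. The following is a rough sketch of how the same options could be validated with `argparse` alone. The function name `parse_page_args` is hypothetical, and making `--source-file` required is an assumption that differs from the script above.

```python
import argparse


def parse_page_args(argv=None):
    """Parse the extraction options and reject page ranges that select nothing."""
    parser = argparse.ArgumentParser(description='Extract a page range from a PDF')
    # Assumption: the source file is required here instead of having a default.
    parser.add_argument('--source-file', required=True, help='full path of the source file')
    parser.add_argument('--target-file', default='/tmp/target.pdf', help='full path of the target file')
    parser.add_argument('--start-page', default=0, type=int, help='index of the first page (zero-based)')
    parser.add_argument('--stop-page', default=0, type=int, help='index at which to stop (exclusive)')
    args = parser.parse_args(argv)
    if args.start_page < 0:
        parser.error('--start-page must not be negative')
    if args.stop_page <= args.start_page:
        # parser.error() prints the usage message and exits with status 2.
        parser.error('--stop-page must be greater than --start-page')
    return args
```

    Calling `parse_page_args()` with no argument reads `sys.argv`, so it could stand in for the `parser.parse_args()` call in the script.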

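    4. Some problems remain unsolved. If the source file is too large, the script fails with an error. I have not read the PyPDF2 source code yet, so for now I do not know how to fix it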

    JianglexingdeMacBook-Pro:Desktop jianglexing$ python3 splitpdf.py --source-file='/Users/jianglexing/Desktop/refman-5.7.18-en.a4.pdf' --target-file=/Users/jianglexing/Desktop/temp.pdf --start-page=1 --stop-page=6
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/generic.py", line 229, in __new__
        return decimal.Decimal.__new__(cls, utils.str_(value), context)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/utils.py", line 252, in str_
        if sys.version_info[0] < 3:
    RecursionError: maximum recursion depth exceeded in comparison
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "splitpdf.py", line 23, in <module>
        writer.write(wstream)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 482, in write
        self._sweepIndirectReferences(externalReferenceMap, self._root)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 556, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, data[i])
      File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
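
    The traceback ends in a `RecursionError` raised while `_sweepIndirectReferences` calls itself during `writer.write`. A workaround that is often tried for this kind of error is to raise Python's recursion limit around the failing call. The sketch below is untested against this particular PDF and may not be enough for it; the helper name `call_with_recursion_limit` is hypothetical, and `nesting_depth` is only a stand-in for a deeply recursive call.

```python
import sys


def call_with_recursion_limit(limit, func, *args, **kwargs):
    """Call func(*args, **kwargs) with the recursion limit raised to at least `limit`.

    The previous limit is restored afterwards, even if func raises.
    """
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        return func(*args, **kwargs)
    finally:
        sys.setrecursionlimit(previous)


def nesting_depth(n):
    """Stand-in for a deeply recursive call: recurses n levels and returns n."""
    return 0 if n == 0 else 1 + nesting_depth(n - 1)
```

    In the script this would wrap the failing call as `call_with_recursion_limit(10000, writer.write, wstream)`. The value 10000 is a guess, and a limit that is too high can crash the interpreter with a C stack overflow instead of raising `RecursionError`.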

  • Original post: https://www.cnblogs.com/JiangLe/p/6925791.html