zoukankan      html  css  js  c++  java
  • How to extract images from PDF in Python? 通过python从pdf文件中提取图片

    In this tutorial, we will write a Python code to extract images from PDF files and save them in the local disk using PyMuPDF and Pillow libraries.

    With PyMuPDF, you are able to access PDF, XPS, OpenXPS, epub and many other extensions. It should run on all platforms including Windows, Mac OSX and Linux.

    Let's install it along with Pillow:

    pip3 install PyMuPDF Pillow

    Open up a new Python file and let's get started. First, let's import the libraries:

    import fitz # PyMuPDF
    import io
    from PIL import Image

    I'm gonna test this with this PDF file, but you're free to bring and PDF file and put it in your current working directory, let's load it to the library:

    # file path you want to extract images from
    file = "1710.05006.pdf"
    # open the file
    pdf_file = fitz.open(file)

    Since we want to extract images from all pages, we need to iterate over all the pages available, and get all image objects on each page, the following code does that:

    # iterate over PDF pages
    for page_index in range(len(pdf_file)):
        # get the page itself
        page = pdf_file[page_index]
        image_list = page.getImageList()
        # printing number of images found in this page
        if image_list:
            print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
        else:
            print("[!] No images found on page", page_index)
        for image_index, img in enumerate(page.getImageList(), start=1):
            # get the XREF of the image
            xref = img[0]
            # extract the image bytes
            base_image = pdf_file.extractImage(xref)
            image_bytes = base_image["image"]
            # get the image extension
            image_ext = base_image["ext"]
            # load it to PIL
            image = Image.open(io.BytesIO(image_bytes))
            # save it to local disk
            image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

    We're using getImageList() method to list all available image objects as a list of tuples in that particular page. To get the image object index, we simply get the first element of the tuple returned.

    After that, we use the extractImage() method that returns the image in bytes along with additional information such as the image extension.

    Finally, we convert the image bytes to a PIL image instance and save it to the local disk using the save() method, which accepts a file pointer as an argument, we're simply naming the images with their corresponding page and image indices.

    After I ran the script, I get the following output:

    [!] No images found on page 0
    [+] Found a total of 3 images in page 1
    [+] Found a total of 3 images in page 2
    [!] No images found on page 3
    [!] No images found on page 4

    The images are saved as well, in the current directory:

    Extracted images using PythonConclusion

    Alright, we have successfully extracted images from that PDF file without loosing image quality. For more information on how the library works, I suggest you take a look at the documentation.

    Please check the full code here.

    实例 :

     cat pdf04.py 
    import fitz # PyMuPDF
    import io
    from PIL import Image
    
    # file path you want to extract images from
    file = "/root/1.pdf"
    # open the file
    pdf_file = fitz.open(file)
    
    # iterate over PDF pages
    for page_index in range(len(pdf_file)):
        # get the page itself
        page = pdf_file[page_index]
        image_list = page.getImageList()
        # printing number of images found in this page
        if image_list:
            print(f"[+] Found a total of {len(image_list)} images in page {page_index}")
        else:
            print("[!] No images found on page", page_index)
        for image_index, img in enumerate(page.getImageList(), start=1):
            # get the XREF of the image
            xref = img[0]
            # extract the image bytes
            base_image = pdf_file.extractImage(xref)
            image_bytes = base_image["image"]
            # get the image extension
            image_ext = base_image["ext"]
            # load it to PIL
            image = Image.open(io.BytesIO(image_bytes))
            # save it to local disk
            image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
    

     执行过程和结果: 

     python3 pdf04.py 
    [+] Found a total of 3 images in page 0
    [+] Found a total of 3 images in page 1
    [+] Found a total of 5 images in page 2
    [+] Found a total of 6 images in page 3
    

      

    原文: https://www.thepythoncode.com/article/extract-pdf-images-in-python

  • 相关阅读:
    连接AI与用户,京东云推出视音频通信技术方案
    我身边的高T,问了Java面试者这样的问题......
    解密协议层的攻击——HTTP请求走私
    产业实践推动科技创新,京东科技集团3篇论文入选ICASSP 2021
    2021年人工智能数据采集标注行业四大趋势预测;清华提出深度对齐聚类用于新意图发现
    京东科技集团21篇论文高票入选国际顶会AAAI 2021
    别困惑,不是你的错!90%的开发者把Clubhouse看成了Clickhouse
    京东App Swift 混编及组件化落地
    对话京东科技算法科学家吴友政:回望2020,NLP技术发展速度强劲
    关于京东技术,你想了解的都在这里丨征文活动获奖及优秀专栏推荐
  • 原文地址:https://www.cnblogs.com/weifeng1463/p/14964013.html
Copyright © 2011-2022 走看看