

    Leveraging the Hardware JPEG Decoder and NVIDIA nvJPEG Library on NVIDIA A100 GPUs


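    JPEG stands for the Joint Photographic Experts Group, which celebrated its 25th birthday in 2017. The JPEG standard specifies a codec, which defines how an image is compressed into a bitstream of bytes and decompressed back into an image.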









    Figure 1a. Original butterfly image (no compression, Size 512×512, 24-bit), 769 KB.


    Figure 1b. Compressed butterfly image (quality compression coefficient 50%, subsampling 4:2:0, 24-bit), 33 KB.

    How JPEG works



    Figure 2. Diagram of the JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.


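    The second step converts the image to the YCbCr color space, where Y represents luma (brightness) and the Cb and Cr channels represent chroma (the blue and red projections). The Cb and Cr channels are then downsampled by a predetermined factor, usually two or three. This downsampling gives you the first stage of compression.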
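    The color conversion and chroma downsampling described above can be sketched on the CPU as follows. This is only an illustration, not how nvJPEG implements the step: it assumes the full-range BT.601 coefficients used by JFIF, a fixed downsampling factor of two in each direction, and simple box averaging. The names rgbToYCbCr and downsample2x2 are hypothetical helpers introduced for this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct YCbCr { std::uint8_t y, cb, cr; };

// Round to nearest integer and clamp into the 8-bit range.
inline std::uint8_t clampToByte(double v) {
    long r = std::lround(v);
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, r)));
}

// Assumption: full-range BT.601 coefficients as used by JFIF.
inline YCbCr rgbToYCbCr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    double y  =  0.299    * r + 0.587    * g + 0.114    * b;
    double cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0;
    double cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0;
    return {clampToByte(y), clampToByte(cb), clampToByte(cr)};
}

// Downsample one chroma plane by two in both directions (as in 4:2:0)
// by averaging each 2x2 block. Edge pixels are replicated when the
// width or height is odd.
inline std::vector<std::uint8_t> downsample2x2(const std::vector<std::uint8_t>& plane,
                                               int width, int height) {
    const int outW = (width + 1) / 2;
    const int outH = (height + 1) / 2;
    std::vector<std::uint8_t> out(static_cast<std::size_t>(outW) * outH);
    for (int oy = 0; oy < outH; ++oy) {
        for (int ox = 0; ox < outW; ++ox) {
            int sum = 0;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const int x = std::min(2 * ox + dx, width - 1);
                    const int y = std::min(2 * oy + dy, height - 1);
                    sum += plane[static_cast<std::size_t>(y) * width + x];
                }
            }
            out[static_cast<std::size_t>(oy) * outW + ox] =
                static_cast<std::uint8_t>((sum + 2) / 4);  // rounded average
        }
    }
    return out;
}
```

    With a factor of two in each direction, each chroma plane keeps one quarter of its samples, so the three planes together shrink to half of the original sample count before any quantization or entropy coding takes place.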





    Figure 3 shows the JPEG decoding process on an NVIDIA GPU.


    Figure 3. The JPEG decoding process employs a parallel utilization of GPU CUDA and software. A hybrid (CPU/GPU) approach for Huffman decoding overcomes the serial process stall.






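    New JPEG hardware decoder

    We recently introduced the NVIDIA A100 GPU, which has a dedicated hardware JPEG decoder. Previously, data center GPUs had no such hardware unit, and JPEG decoding was a pure software CUDA solution that used both the CPU and the GPU.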

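    Now, the hardware decoder runs concurrently with the rest of the GPU, which remains free for compute tasks such as image classification, object detection, and image segmentation. Compared to the NVIDIA Tesla V100, it delivers a substantial throughput improvement, with a 4-8x JPEG decode speedup.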


    nvJPEG library overview

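    nvJPEG is a GPU-accelerated library for JPEG encoding and decoding. Together with NVIDIA DALI, a data augmentation and image loading library, it can speed up deep learning training on image classification models by accelerating the decoding and augmentation of data. The A100 includes a 5-core hardware JPEG decode engine. nvJPEG uses this hardware backend for batched processing of JPEG images.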


    Figure 4. The JPEG hardware decoding process employs a parallel utilization of hardware decoder and GPU CUDA software. The HW decoder is independent of the CUDA SMs so that software GPU decoders can be used simultaneously.

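    When the hardware decoder is selected through the nvjpegCreateEx init function, nvJPEG accelerates baseline JPEG decoding and various color conversion formats (for example, YUV 420, 422, and 444). As shown in Figure 4, this results in up to 20x faster image decoding compared to CPU-only processing. DALI users benefit from this hardware acceleration directly, because nvJPEG is abstracted away behind DALI.

    The nvJPEG library supports the following operations: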

    • nvJPEG Encoding
    • nvJPEG Transcoding
    • nvJPEG Decoding (includes HW (A100) support)








    • 4:4:4
    • 4:2:2
    • 4:2:0
    • 4:4:0
    • 4:1:1
    • 4:1:0
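
    The ratios listed above are chroma subsampling modes. As a rough aid for reading them, the sketch below maps each mode to the size of its chroma planes. It assumes the conventional J:a:b interpretation of the notation and ceiling division for dimensions that are not a multiple of the factor; the exact rounding nvJPEG applies is not stated in this article. The names chromaFactors and chromaPlaneSize are hypothetical.

```cpp
#include <stdexcept>
#include <string>

struct PlaneSize { int width; int height; };

// Horizontal and vertical chroma decimation factors for a mode,
// following the conventional J:a:b reading (an assumption of this sketch).
inline void chromaFactors(const std::string& mode, int& hFactor, int& vFactor) {
    if      (mode == "4:4:4") { hFactor = 1; vFactor = 1; }
    else if (mode == "4:2:2") { hFactor = 2; vFactor = 1; }
    else if (mode == "4:2:0") { hFactor = 2; vFactor = 2; }
    else if (mode == "4:4:0") { hFactor = 1; vFactor = 2; }
    else if (mode == "4:1:1") { hFactor = 4; vFactor = 1; }
    else if (mode == "4:1:0") { hFactor = 4; vFactor = 2; }
    else throw std::invalid_argument("unsupported subsampling mode: " + mode);
}

// Size of each chroma plane (Cb or Cr) for a given luma size.
// Ceiling division keeps partial blocks at the right and bottom edges.
inline PlaneSize chromaPlaneSize(int width, int height, const std::string& mode) {
    int h = 1, v = 1;
    chromaFactors(mode, h, v);
    return {(width + h - 1) / h, (height + v - 1) / v};
}
```

    For example, chromaPlaneSize(480, 640, "4:2:0") gives a 240 x 320 plane, and "4:4:4" leaves the chroma planes at full resolution.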






    Performance numbers


    • NVIDIA V100 GPU: CPU – E5-2698 v4 @ 2 GHz, 3.6 GHz Turbo (Broadwell), HT on; GPU – Tesla V100-SXM2-16GB (GV100), 1 × 16160 MiB, 1 × 80 SM; GPU video clock 1312; batch 128 and single thread
    • NVIDIA A100 GPU: CPU – Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; GPU – A100-SXM4-40GB (GA100), 1 × 40557 MiB, 1 × 108 SM; GPU video clock 1095; batch 128 and single thread
    • CPU: CPU – Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; TurboJPEG decode for CPU testing
    • Image dataset: 2K FHD = 1920 × 1080; 4K UHD = 3840 × 2160
    • Software: CUDA Toolkit 11.0; CUDA driver r450.24



    Figure 5. Graph showing the speed up achieved by hardware decode on A100 over the CUDA hybrid decode on V100. 


    Figure 6. The number of CPU threads required by the hybrid decoder on V100 to keep up with hardware decoder throughput on A100.




    Figure 7a. JPEG baseline encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).


    Figure 7b. JPEG progressive encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).

    Image decoding example

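    The following is an image decoding example that uses the nvJPEG library. It shows how to select the hardware decoder on the A100 GPU and how to fall back to the default backend on other NVIDIA GPUs.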


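    // The following code example shows how to use the nvJPEG library for JPEG image decoding.
    // Libraries used
    // nvJPEG decoding
    // Excerpt only: params, img_data, img_len, the batched_* vectors, and the
    // allocator callbacks (dev_malloc, dev_free, host_malloc, host_free) are set up elsewhere.
    int main()
    {
        // create the device and pinned-memory allocators for the nvJPEG handle
        nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
        nvjpegPinnedAllocator_t pinned_allocator = {&host_malloc, &host_free};

        // Selecting A100 Hardware decoder
        nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE, &dev_allocator,
                                               &pinned_allocator, NVJPEG_FLAGS_DEFAULT,
                                               &params.nvjpeg_handle);
        params.hw_decode_available = true;
        if (status == NVJPEG_STATUS_ARCH_MISMATCH) {
            std::cout << "Hardware Decoder not supported. Falling back to default backend" << std::endl;
            // GPU SW decoder selected
            nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, &dev_allocator,
                           &pinned_allocator, NVJPEG_FLAGS_DEFAULT,
                           &params.nvjpeg_handle);
            params.hw_decode_available = false;
        }

        // create JPEG decoder state
        nvjpegJpegStateCreate(params.nvjpeg_handle, &params.nvjpeg_state);

        // extract bitstream metadata to figure out whether a bitstream can be decoded
        // (i is the index of the current image in the surrounding loop)
        nvjpegJpegStreamParseHeader(params.nvjpeg_handle, (const unsigned char *)img_data[i].data(), img_len[i], params.jpeg_streams[0]);

        // decode Batch images
        // The two middle arguments were lost from the source text; the names
        // batched_bitstreams and batched_bitstreams_size are assumed here and stand
        // for the per-image bitstream pointers and lengths that the API expects.
        nvjpegDecodeBatched(params.nvjpeg_handle, params.nvjpeg_state,
                            batched_bitstreams.data(), batched_bitstreams_size.data(),
                            batched_output.data(), params.stream);
    }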




    $ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
    $ cd CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/
    $ mkdir build
    $ cd build
    $ cmake ..
    $ make

    // Running nvJPEG decoder
    $ ./nvjpegDecoder -i ../input_images/ -o ~/tmp


    Decoding images in directory: ../input_images/, total 12, batchsize 1
    Processing: ../input_images/cat_baseline.jpg
    Image is 3 channels.
    Channel #0 size: 64 x 64
    Channel #1 size: 64 x 64
    Channel #2 size: 64 x 64
    YUV 4:4:4 chroma subsampling
    Done writing decoded image to file:/tmp/cat_baseline.bmp
    Processing: ../input_images/img8.jpg
    Image is 3 channels.
    Channel #0 size: 480 x 640
    Channel #1 size: 240 x 320
    Channel #2 size: 240 x 320
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img8.bmp
    Processing: ../input_images/img5.jpg
    Image is 3 channels.
    Channel #0 size: 640 x 480
    Channel #1 size: 320 x 240
    Channel #2 size: 320 x 240
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img5.bmp
    Processing: ../input_images/img7.jpg
    Image is 3 channels.
    Channel #0 size: 480 x 640
    Channel #1 size: 240 x 320
    Channel #2 size: 240 x 320
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img7.bmp
    Processing: ../input_images/img2.jpg
    Image is 3 channels.
    Channel #0 size: 480 x 640
    Channel #1 size: 240 x 320
    Channel #2 size: 240 x 320
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file: /tmp/img2.bmp
    Processing: ../input_images/img4.jpg
    Image is 3 channels.
    Channel #0 size: 640 x 426
    Channel #1 size: 320 x 213
    Channel #2 size: 320 x 213
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img4.bmp
    Processing: ../input_images/cat.jpg
    Image is 3 channels.
    Channel #0 size: 64 x 64
    Channel #1 size: 64 x 64
    Channel #2 size: 64 x 64
    YUV 4:4:4 chroma subsampling
    Done writing decoded image to file:/tmp/cat.bmp
    Processing: ../input_images/cat_grayscale.jpg
    Image is 1 channels.
    Channel #0 size: 64 x 64
    Grayscale JPEG
    Done writing decoded image to file:/tmp/cat_grayscale.bmp
    Processing: ../input_images/img1.jpg
    Image is 3 channels.
    Channel #0 size: 480 x 640
    Channel #1 size: 240 x 320
    Channel #2 size: 240 x 320
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file: /tmp/img1.bmp
    Processing: ../input_images/img3.jpg
    Image is 3 channels.
    Channel #0 size: 640 x 426
    Channel #1 size: 320 x 213
    Channel #2 size: 320 x 213
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img3.bmp
    Processing: ../input_images/img9.jpg
    Image is 3 channels.
    Channel #0 size: 640 x 480
    Channel #1 size: 320 x 240
    Channel #2 size: 320 x 240
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img9.bmp
    Processing: ../input_images/img6.jpg
    Image is 3 channels.
    Channel #0 size: 640 x 480
    Channel #1 size: 320 x 240
    Channel #2 size: 320 x 240
    YUV 4:2:0 chroma subsampling
    Done writing decoded image to file:/tmp/img6.bmp
    Total decoding time: 14.8286
    Avg decoding time per image: 1.23571
    Avg images per sec: 0.809248
    Avg decoding time per batch: 1.23571

    Image resizing example



    Figure 8. Image resizing and watermarking pipeline employing a parallel utilization of GPU software and CUDA.


    // The following code example shows how to resize images and watermark them with a logo image.
    // Libraries used
    // nvJPEG decoding, NPP Resize, NPP watermarking, nvJPEG encoding
    // Excerpt only: handles, image descriptors, sizes, and ROIs are set up elsewhere.
    int main()
    {
        // nvJPEG decoder
        nReturnCode = nvjpegDecode(nvjpeg_handle, nvjpeg_decoder_state, dpImage, nSize, oformat, &imgDesc, NULL);
        // NPP image resize (input image)
        st = nppiResize_8u_C3R_Ctx(imgDesc.channel[0], imgDesc.pitch[0], srcSize,
             srcRoi, imgResize.channel[0], imgResize.pitch[0], dstSize, dstRoi,
             NPPI_INTER_LANCZOS, nppStreamCtx);
        // NPP image resize (watermark logo, scaled to the same destination size)
        st = nppiResize_8u_C3R_Ctx(imgDescW.channel[0], imgDescW.pitch[0], srcSizeW,
             srcRoiW, imgResizeW.channel[0], imgResizeW.pitch[0], dstSize, dstRoi,
             NPPI_INTER_LANCZOS, nppStreamCtx);
        // Alpha Blending watermarking
        st = nppiAlphaCompC_8u_C3R_Ctx(imgResize.channel[0], imgResize.pitch[0],
             255, imgResizeW.channel[0], imgResizeW.pitch[0], ALPHA_BLEND,
             imgResize.channel[0], imgResize.pitch[0], dstSize, NPPI_OP_ALPHA_PLUS,
             nppStreamCtx);
        // nvJPEG encoding
        nvjpegEncodeImage(nvjpeg_handle, nvjpeg_encoder_state, nvjpeg_encode_params,
             &imgResize, iformat, dstSize.width, dstSize.height, NULL);
    }
    $ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
    $ cd CUDALibrarySamples/nvJPEG/Image-Resize-WaterMark/
    $ mkdir build
    $ cd build
    $ cmake ..
    $ make
    // Running Image resizer and watermarking
    $ ./imageResizeWatermark -i ../input_images/ -o resize_images -q 85 -rw 512 -rh 512
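
    To make the watermarking step easier to follow, here is a CPU sketch of constant-alpha "plus" compositing, the Porter-Duff operation that NPPI_OP_ALPHA_PLUS is named after. It assumes that each source is scaled by its own 8-bit alpha (with 255 meaning 1.0), that the two results are added, and that the sum saturates at 255. Whether NPP rounds in exactly this way is not confirmed by this article, so treat the output as approximate. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Constant-alpha "plus" compositing of one 8-bit sample:
// result = alphaA * a + alphaB * b, with alphas normalized by 255,
// rounded to nearest and saturated to the 8-bit range.
inline std::uint8_t alphaPlus(std::uint8_t a, std::uint8_t alphaA,
                              std::uint8_t b, std::uint8_t alphaB) {
    const int scaled = (a * alphaA + b * alphaB + 127) / 255;
    return static_cast<std::uint8_t>(std::min(scaled, 255));
}

// Blend a watermark into an image of the same size, sample by sample.
// The image keeps full weight (alpha 255); the watermark is weighted
// by watermarkAlpha, mirroring the argument order used in the excerpt.
inline void blendWatermark(std::vector<std::uint8_t>& image,
                           const std::vector<std::uint8_t>& watermark,
                           std::uint8_t watermarkAlpha) {
    const std::size_t n = std::min(image.size(), watermark.size());
    for (std::size_t i = 0; i < n; ++i) {
        image[i] = alphaPlus(image[i], 255, watermark[i], watermarkAlpha);
    }
}
```

    Under these assumptions, a watermark alpha of 0 leaves the image unchanged, while larger values brighten the image wherever the logo is bright, which is why both inputs are first resized to the same destination size.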


    Download the latest version of prebuilt DALI binaries with NVIDIA Ampere architecture support. For a detailed list of new features and enhancements, see the nvJPEG Library documentation and the latest release notes.

    To learn more about how DALI uses nvJPEG for accelerating a deep learning data pipeline, see Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs.

  • Original article: https://www.cnblogs.com/wujianming-110117/p/13161517.html