  • CUDA Fortran unified memory (managed)

    Here is a good official post, with examples, that explains how to use managed memory:

    http://www.pgroup.com/lit/articles/insider/v6n1a2.htm

    You can read the official document directly, since its formatting is better there. To keep this document from being lost in the future, I have copied it below; I will tidy it up when I get the chance.

    CUDA Fortran Managed Memory

    Note: This update to an article originally published in August 2014 reflects changes effective with PGI version 15.9. If you're using any earlier PGI version, please refer to the original article.

    In PGI 14.7 we are introducing the biggest productivity improvement to CUDA Fortran since we rolled out CUF kernels roughly four years ago. Using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.

    A good explanation of Unified Memory can be found on the NVIDIA Parallel Forall blog, and Appendix J of the CUDA C Programming Guide gives a detailed explanation. Unified Memory provides a way to obtain a single pointer to user allocated data that can be used in both host and device code. Unlike zero-copy memory, managed memory is not pinned and static, but migrates between the host and device on access. The system (driver and OS) controls the physical page movement. Specifically, in CUDA C/C++, managed memory usage consists of replacing calls to cudaMalloc() with cudaMallocManaged(), and removing explicit uses of cudaMemcpy() to transfer data between host and device.

    In CUDA Fortran, we've added the managed keyword to the language, which can be used in host code similarly to the device keyword. Here's an example of how we've used it to simplify the host code in our sgemm test, a CUDA Fortran example we've used frequently here at PGI:

    Here's the code before CUDA managed data:

      ! matrix data
      real, allocatable, dimension(:,:) :: A, B, C, gold
      real, allocatable, device, dimension(:,:) :: dA, dB, dC
      . . .
    
      allocate(A(N,N))
      allocate(B(N,N))
      allocate(C(N,N))
      allocate(gold(N,N))
    
      call random_number(A)
      call random_number(B)
    
      allocate(dA(N,N))
      allocate(dB(N,N))
      allocate(dC(N,N))
    
      dA = A
      dB = B
      dC = 0.0
    
      alpha = 1
      beta = 0
      m = N
      k = N
      blocks = dim3(N/256, N/16, 1)
      threads = dim3(16, 16, 1)
    
      call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)
    
      ! timing experiment
      time = 0.0
      istat = cudaEventRecord(start, 0)
      do j = 1, NREPS
        call sgemmNN_16x16<<<blocks, threads>>>(dA, dB, dC, m, N, k, alpha, beta)
      end do
      istat = cudaEventRecord(stop, 0)
      istat = cudaDeviceSynchronize()
      istat = cudaEventElapsedTime(time, start, stop)
      time = time / (NREPS*1.0e3)
    
      C = dC
    

    Now, here's the same code using managed data:

      ! matrix data
      real, managed, allocatable, dimension(:,:) :: A, B, C
      real, allocatable, dimension(:,:) :: gold
      . . .
    
      allocate(A(N,N))
      allocate(B(N,N))
      allocate(C(N,N))
      allocate(gold(N,N))
    
      call random_number(A)
      call random_number(B)
      C = 0.0
    
      alpha = 1
      beta = 0
      m = N
      k = N
      blocks = dim3(N/256, N/16, 1)
      threads = dim3(16, 16, 1)
    
      call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)
    
      ! timing experiment
      time = 0.0
      istat = cudaEventRecord(start, 0)
      do j = 1, NREPS
        call sgemmNN_16x16<<<blocks, threads>>>(A, B, C, m, N, k, alpha, beta)
      end do
      istat = cudaEventRecord(stop, 0)
      istat = cudaDeviceSynchronize()
      istat = cudaEventElapsedTime(time, start, stop)
      time = time / (NREPS*1.0e3)
    

    As you can see, we no longer need two copies of the arrays A, B, and C. The data movement still happens, but rather than being explicit, now it is controlled by the unified memory management system behind the scenes, much like the way an OS manages virtual memory.

    If your CUDA Fortran program is bogged down with two copies of variables, like variable_name for the host, and variable_name_d for the device, then managed memory might be the answer for you.

    In this sgemm case, the performance impact of using managed memory is negligible. In other cases, the use of managed memory may result in unwanted transfers between host and device without any of the look-ahead, pre-fetching, or overlap operations that you might find in a tuned CUDA code. Be sure to check wall-clock time; any extra overhead will be attributed to system time by the Linux time command. Remember, you can still use all the tools CUDA provides to tune performance including the use of streams and asynchronous copies where needed to optimize the overlap of computation and data movement.

    In our limited testing, we've seen 0–20% performance degradation when doing a quick replacement of the device attribute with the managed attribute followed by removing the data transfer assignment statements. In one purposely poorly done effort, we saw code run about twice as slow, but we don't think that's a representative example. I would expect most users to see the former behavior on current hardware and CUDA 6.0.

    Managed Data Argument Matching

    You can pass managed arrays to kernels which declare their dummy arguments as device arrays so your global subroutines do not require a rewrite. In fact, we've implemented argument matching so that managed actual arguments can match managed, device, and host dummy arguments, in that order. If you declare your dummy argument to be managed, it is only valid to pass a managed actual argument to it; anything else will result in a compilation error as long as the interface is explicit.

    module minit
      interface init
        module procedure inith, initm, initd
      end interface
      contains
        subroutine initd(a)
          integer, device :: a(:)
          print *,"initd"
          a(:) = z'deadbeef'
        end subroutine
        subroutine inith(a)
          integer :: a(:)
          print *,"inith"
          a(:) = z'deadbeef'
        end subroutine
        subroutine initm(a)
          integer, managed :: a(:)
          print *,"initm"
          if (size(a).le.1000) then
            call inith(a)
          else
            call initd(a)
          end if
        end subroutine
    end module
    
    program t
    use minit
    integer, parameter :: n = 10
    integer, managed :: m(n)
    integer, device :: d(n)
    integer :: h(n)
    call init(m)
    call init(d)
    call init(h)
    end program
    

    Here, we've overloaded the init function to handle managed, device, or host data. The managed init function, in turn, can call either the device or host version depending on some condition. The same rules apply for calling global subroutines.

    Managed Data Runtime Visibility and Access

    There are no limitations on the availability of managed memory from the device side. For example, you can initiate managed data operations on the host and then launch a kernel; when that kernel executes, it is guaranteed to see the updated data on the device.

    When a kernel is executing, however, no host accesses of globally visible managed data are permitted. There is just no way currently for the system (driver and host OS) to query, interrupt, and/or synchronize the device sufficiently without some help from the programmer. Using CUDA 6.0, you should put explicit calls to cudaDeviceSynchronize() after your kernel launches and before you try to access the managed data areas from the host. Not doing this may lead to segmentation faults in your application that can be very difficult to track down.
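
    As an illustration, here is a minimal sketch of that pattern (not from the original article; the kernel name inc, the module name, and the launch configuration are assumptions for this example). The kernel launch returns immediately, so the host must synchronize before reading the managed array again:

    module msync
      use cudafor
      contains
        attributes(global) subroutine inc(a, n)
          integer, value :: n
          real :: a(n)     ! device dummy; a managed actual argument matches it
          integer :: i
          i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
          if (i <= n) a(i) = a(i) + 1.0
        end subroutine
    end module

    program syncdemo
    use cudafor
    use msync
    real, managed :: a(1024)
    integer :: istat
    a = 1.0                            ! host access; no kernel is running yet
    call inc<<<4, 256>>>(a, 1024)      ! asynchronous kernel launch
    istat = cudaDeviceSynchronize()    ! required before the next host access
    print *, a(1)                      ! safe: the kernel has completed; prints 2.0
    end program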

    If you suspect that seg faults are due to illegal managed data accesses, one way to make sure is to run under cuda-memcheck. You will see an error like this:

    brentl@sb-leback:~/tmp> ./a.out
    Segmentation fault
    
    brentl@sb-leback:~/tmp> cuda-memcheck ./a.out
    ========= CUDA-MEMCHECK
    ========= Error: process didn't terminate successfully
    ========= The application may have hit an error when dereferencing Unified
    Memory from the host. Please rerun the application under cuda-gdb or Nsight
    Eclipse Edition to catch host side errors.
    ========= Internal error (20)
    ========= No CUDA-MEMCHECK results found
    

    Advanced users will want to understand whether high-level constructs like assignment are accessing the managed data from the host or from the device. We will discuss that in detail in the Assignment section below.

    Managed CUDA Fortran API Routines

    Just like we have CUDA Fortran wrappers for the CUDA C API for things like cudaMalloc, cudaMemset, and cudaMemcpy, we have added interfaces and wrappers to the new managed memory API routines.

    cudaMallocManaged() can be called with an allocatable managed array just as you can call cudaMalloc with an allocatable device array. There are two additional arguments besides the array to allocate: an integer count, which specifies the number of elements to allocate, and flags, which specify the visibility of the managed data. By default, the managed data is visible to all kernels running on all streams. This behavior can be changed with the next runtime routine we'll introduce.
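
    For example, an explicit call might look like the following minimal sketch. It is my own example, not from the article; it assumes the cudaMemAttachGlobal flag constant is available from the cudafor module and that the array can be released with an ordinary deallocate:

    program mallocdemo
    use cudafor
    real, managed, allocatable :: a(:)
    integer :: istat
    ! Allocate 10000 elements, visible to kernels on all streams (the default).
    istat = cudaMallocManaged(a, 10000, cudaMemAttachGlobal)
    ! Per the Assignment section below, this array assignment becomes a
    ! device-side memset followed by an implicit stream synchronize.
    a = 0.0
    print *, size(a)
    deallocate(a)
    end program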

    cudaStreamAttachMemAsync() allows stream-level control of managed data, and can allow host code to access managed data not currently involved in a running kernel.
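
    A hedged sketch of stream-level attachment follows. It assumes the cudafor wrapper takes the same arguments in the same order as the CUDA C routine (stream, array, length, flags), and that the cudaMemAttach* constants and the cuda_count_kind parameter are provided by the cudafor module:

    program attachdemo
    use cudafor
    real, managed :: a(1000)
    integer(kind=cuda_stream_kind) :: str
    integer(kind=cuda_count_kind) :: length
    integer :: istat
    istat = cudaStreamCreate(str)
    length = 0     ! as in CUDA C, 0 is taken to mean the entire allocation
    ! Associate a with str only (cudaMemAttachSingle), so the host can touch a
    ! whenever no work is pending in str, even while other streams stay busy.
    istat = cudaStreamAttachMemAsync(str, a, length, cudaMemAttachSingle)
    istat = cudaStreamSynchronize(str)   ! the attachment takes effect after a sync
    a = 1.0
    print *, a(1)
    end program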

    In addition to these two CUDA calls, in CUDA Fortran we've added some additional functions to make life easier if you mix OpenMP and CUDA Fortran:

    cudaforSetDefaultStream() allows you to set the default stream for high-level language constructs on a per-thread basis. Inputs are the stream number, or alternatively, a device or managed object and a stream number. The default stream, when set, applies to allocation/stream attach operations on managed data, data movement operations on both managed and device data, and the new intrinsic reduction functions. This stream must still be set explicitly on CUDA kernel launches by the programmer.

    Managed Data with Derived Types

    One area where CUDA Fortran managed data really shines is with derived types. Before, using device data, derived types were awkward because a derived type containing allocatable device arrays could not itself easily have the device attribute. So, in effect, some fields of the derived type were resident on the device while other fields were resident on the host. Managing that programmatically was cumbersome as the types became larger and more deeply nested.

    Now, if your program makes use of derived types in its computational core, managed data seems like a great way to go. Only the data that is required will be moved on demand. If large areas of the types are not accessed, they will not incur the overhead of being moved back and forth. Furthermore, putting these types in modules, both the definitions and instantiated entities, means they can be used in both host and device code and don't need to be passed as arguments.

    Here is an example, submitted to our User Forum several years ago, that we have not been able to adequately address until now:

    module mCuda
      integer :: num_cm
      type tCM
        integer, allocatable, device :: fine(:)
        real, allocatable, device :: mat_matrix(:,:,:)
        real, allocatable, device :: src_matrix(:,:,:)
      end type
      type(tCM), allocatable, device :: cm_list(:)
    end module
    
    program test
    use mCuda
    integer i
    num_cm=10
    allocate (cm_list(num_cm))
    do i=1, num_cm
      allocate(cm_list(i)%fine(3))
      cm_list(i)%fine=10
    enddo
    end 
    

    In this example the problem has been that the array of derived types, cm_list, has to be allocated and set up on the host, but we want it to be accessed primarily on the device. The Fortran device code that accesses the derived type, like mat_matrix, is most flexible and most clearly expressed when you use the entire structure, i.e. something like cm_list(i)%mat_matrix(j,k,12). It is up to the compiler (our job) to optimize away common sub-expressions in the generated code.

    So, first off, let's just change the outer declaration to use managed data:

    type(tCM), allocatable, managed :: cm_list(:)
    

    When we do that, what we see when we compile and run this code is:

    # Make sure to specify CUDA 6.0 on the compile line:
    % pgfortran -Mcuda=cuda6.0 userforum.cuf
    % ./a.out
    Segmentation fault
    

    This falls under the illegal managed data access rules described above. When we execute this loop:

    do i=1, num_cm
      allocate(cm_list(i)%fine(3))
      cm_list(i)%fine=10
    enddo
    

    the first iteration executes successfully. We access the managed data in cm_list to set up the first fine array, cm_list(1)%fine. Next we run a kernel (a memset operation contained in the CUDA Fortran runtime library) to initialize the device array cm_list(i)%fine to the value 10. Kernel launches are asynchronous with respect to the host. So, the host circles around to iteration 2, where we again access the managed data in cm_list to set up for cm_list(2)%fine. This causes the seg fault as we are accessing global managed data while a kernel is executing.

    There are at least 3 ways to work around this problem:

    1. Add synchronization to the loop to ensure the memset kernel is done before accessing managed data.
      do i=1, num_cm
        allocate(cm_list(i)%fine(3))
        cm_list(i)%fine=10
        istat = cudaDeviceSynchronize()
      enddo
      
    2. Make all the arrays managed. Use cudaforSetDefaultStream to put all the managed arrays in the same stream:
      type tCM
        integer, allocatable, managed :: fine(:)
        real, allocatable, managed :: mat_matrix(:,:,:)
        real, allocatable, managed :: src_matrix(:,:,:)
      end type
      
      program test
      use mCUDA
      use cudafor
      integer (kind=cuda_stream_kind) :: str_mng
      integer i
      num_cm=10
      istat = cudaStreamCreate(str_mng)
      istat = cudaforSetDefaultStream(str_mng)
      allocate (cm_list(num_cm))
      do i=1, num_cm
        allocate(cm_list(i)%fine(3))
        cm_list(i)%fine=10
      enddo
      end
      
    3. Explicitly perform the device array operations on one stream and the managed array operations on another by mapping each allocated entity to the specific stream. This is probably the best performing option:
      program test
      use mCuda
      use cudafor
      integer(kind=cuda_stream_kind) :: str_mng, str_dev
      integer i
      num_cm=10
      istat = cudaStreamCreate(str_mng)
      istat = cudaStreamCreate(str_dev)
      allocate(cm_list(num_cm))
      istat = cudaforSetDefaultStream(cm_list, str_mng)
      do i=1, num_cm
        allocate(cm_list(i)%fine(3))
        istat = cudaforSetDefaultStream(cm_list(i)%fine, str_dev)
        cm_list(i)%fine=10
      enddo
      

    Which method to use is up to you, the programmer, based on ease of use, readability, and performance requirements.

    Assignment

    CUDA Fortran uses high-level array syntax for moving data between host and device. This feature is carried over to managed data as well. For instance:

    real, managed :: a(10000)
    real :: b(10000)
    a = 0.0   ! This results in a cudaMemset type operation which launches
    		  ! a kernel on the device.  Because a is managed, by default 
    		  ! our runtime inserts a cudaStreamSynchronize call after the
    		  ! operation.
    b = 1.0   ! This, of course, uses host data and runs on the host only
    		  !
    a = b     ! This results in a cudaMemcpy API call, host to device.
    		  ! cudaMemcpy is a synchronous operation.
    b = a     ! This results in a cudaMemcpy API call, device to host
    		  ! This also is synchronous
    a(1:10000:2) = b(1:5000)  ! Using slice notation of any kind with managed
    		  ! data results in the copy happening on the host.
    istat = cudaforSetDefaultStream(a, mystream)
    a = b     ! This results in an asynchronous cudaMemcpy API call, host to
    		  ! device, on stream mystream.
    

    Managed Data in CUF Kernels

    Here's an example of managed data used in an OpenMP program with CUF kernels. It also takes advantage of the new default stream API:

    program ompcuf
    use cudafor
    use omp_lib
    integer(kind=cuda_stream_kind) :: mystream
    
    !$omp parallel private(istat,mystream)
    istat = cudaStreamCreate(mystream)
    istat = cudaforSetDefaultStream(mystream)
    call foo()
    !$omp end parallel
    end
    
    subroutine foo()
    use cudafor
    use omp_lib
    real, managed :: a(10000)
    j = omp_get_thread_num()
    a = real(j)
    
    !$cuf kernel do <<< *, *, stream=cudaforGetDefaultStream() >>>
    do i = 1, 10000
      a(i) = a(i) + 1.0
    end do
    istat = cudaStreamSynchronize(cudaforGetDefaultStream())
    
    if (any(a.ne.real(j+1))) then
      print *,"Found error on ",j
    else
      print *,"Looks good on ",j
    endif
    end
    

    In the main program, we create a stream for each OMP thread. We add a call to cudaforSetDefaultStream to set the default stream for all subsequent high-level language constructs. So, for instance, in subroutine foo, the array a gets allocated as managed data and bound to the default stream behind the scenes using cudaStreamAttachMemAsync(). In the assignment a = real(j), we also use the thread's default stream in what is basically a cudaMemset operation. We have to explicitly add the default stream in the launch configuration of the CUF kernel, and also use the thread's default stream to synchronize. Once the cudaStreamSynchronize has occurred, this thread can safely access the managed data on the host, in this case in the any() function, even while other threads may be in the middle of their kernel launch.

    Conclusion

    I've shown a number of ways in which the CUDA 6.0 managed memory feature can make CUDA Fortran programming easier. At this point, managed memory should be viewed as a productivity feature, and not targeted at improved performance. As NVIDIA GPU architectures progress over the next several years, we expect to see better and better performance from the unified memory system. You can start today, and I think you'll find managed memory is a good tool for learning, coding up prototypes, and simplifying complicated data structures which need to be used on both the host and device. Stay tuned; we have a few more ideas on how unified memory can make additional high-level Fortran language features available to CUDA Fortran programmers in the near future.

  • Original post: https://www.cnblogs.com/cofludy/p/8474287.html