  • [Repost] Profiling application LLC cache misses under Linux using Perf Events

    Reposted from: http://ariasprado.name/2011/11/30/profiling-application-llc-cache-misses-under-linux-using-perf-events.html

    In this post we will see how to do some profiling under Ubuntu Linux using Perf Events, present in the kernel since version 2.6.31 [1, 2]. In particular, we will estimate the rate of Last Level Cache (LLC) misses that a Java application has.

    Some GIS applications are hungry for computing power; applications that process LiDAR data are one example, because the volume of input data is usually huge. Efficient use of the processor caches can shorten execution time. Given the high penalty of processor cache misses, identifying the areas of an application that cause too many of them is very important.

    1. Installation of Perf Events

    Fortunately, Ubuntu Linux offers Perf Events (PE) in the form of binary packages. With the command apt-get, installation is straightforward:

    $ sudo apt-get install linux-tools-common linux-tools-2.6.38-13

    Two notes about installation. First, before attempting it, check that the kernel you are using is recent enough: Perf Events [note 1] has been available since Linux version 2.6.31. Second, install the version of the package linux-tools that matches your kernel version.
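
The first check can also be done from Java itself. The following is a minimal sketch, assuming that on Linux the system property `os.version` holds the kernel release string (for example `2.6.38-13-generic-pae`); the class name `KernelVersionCheck` and its methods are hypothetical and not part of the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KernelVersionCheck {
    // Leading "major[.minor[.patch]]" part of a kernel release string.
    private static final Pattern RELEASE =
            Pattern.compile("^(\\d+)(?:\\.(\\d+))?(?:\\.(\\d+))?");

    // Parses e.g. "2.6.38-13-generic-pae" into {2, 6, 38}; missing parts count as 0.
    static int[] parse(String release) {
        Matcher m = RELEASE.matcher(release);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized release: " + release);
        }
        int[] parts = new int[3];
        for (int i = 0; i < 3; i++) {
            String group = m.group(i + 1);
            parts[i] = (group == null) ? 0 : Integer.parseInt(group);
        }
        return parts;
    }

    // True if the release is 2.6.31 or newer, the version named in the text.
    static boolean supportsPerfEvents(String release) {
        int[] version = parse(release);
        int[] minimum = {2, 6, 31};
        for (int i = 0; i < 3; i++) {
            if (version[i] != minimum[i]) {
                return version[i] > minimum[i];
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumption: running on Linux, where os.version is the kernel release.
        String release = System.getProperty("os.version");
        System.out.println(release + " -> recent enough for Perf Events: "
                + supportsPerfEvents(release));
    }
}
```

This only checks the kernel version; it does not verify that a matching linux-tools package is installed.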

    2. The Java test application

    Below is a simple Java application that can cause many LLC cache misses.

    The constructor creates a square matrix and populates it with random double values.

    The method calculateSum() calculates the sum of all the numbers stored in the matrix; this method is called fifty times. Since addition is commutative, traversing the matrix by rows or by columns yields the same result; the boolean parameter traverseByRows sets the traversal mode.

     1 public class LLCMissesTest {
     2     public static final int DEFAULT_MATRIX_SIZE = 7500;
     3   
     4     protected double[][] matrix;
     5   
     6     public LLCMissesTest(int n) {
     7         matrix = new double[n][n];
     8         for (int i = 0; i < n; i = i + 1) {
     9             for (int j = 0; j < n; j = j + 1) {
    10                 matrix[i][j] = Math.random();
    11             }
    12         }
    13     }
    14   
    15     public double calculateSum(boolean traverseByRows) {
    16         double sum = (double) 0;
    17   
    18         int n = matrix.length;
    19         for (int i = 0; i < n; i = i + 1) {
    20             for (int j = 0; j < n; j = j + 1) {
    21                 if (traverseByRows == true) {
    22                     sum = sum + matrix[i][j];
    23                 } else {
    24                     sum = sum + matrix[j][i];
    25                 }
    26             }
    27         }
    28   
    29         return sum;
    30     }
    31  
    32     public static void main(String[] args) {
    33         final int NUM_ITERATIONS = 50;
    34  
    35         LLCMissesTest lmt = new LLCMissesTest(DEFAULT_MATRIX_SIZE);
    36         boolean traverseByRows = true;
    37         for (int i = 0; i < NUM_ITERATIONS; i = i + 1) {
    38             System.out.printf("i = %d, traverseByRows = %b: total = %f\n", i, traverseByRows, lmt.calculateSum(traverseByRows));
    39         }
    40     }
    41 }

    What can we expect from this class? When the method calculateSum() traverses the matrix by columns (that is, with the parameter traverseByRows set to false), both the number of LLC load events and the number of LLC load misses should be far higher than when it traverses by rows. The reason is that a Java matrix is stored by rows (row-major order): each row is an array of doubles whose elements are contiguous, but there is no guarantee that two consecutive rows are contiguous in memory [note 2]. A column traversal therefore touches a different row array on every access.
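
To make the layout concrete, the sketch below (not part of the original test) stores a square matrix in a single `double[]` in row-major order, so that the end of one row is directly followed by the start of the next within the same array. The class name `FlatMatrix` and its methods are hypothetical. Note that a column traversal still jumps `n` elements between consecutive accesses, so it remains the cache-unfriendly direction; the flat layout only removes the uncertainty about where the rows are placed.

```java
final class FlatMatrix {
    private final int n;
    private final double[] data; // row-major: element (i, j) lives at index i * n + j

    FlatMatrix(int n) {
        this.n = n;
        this.data = new double[n * n];
    }

    void set(int i, int j, double value) {
        data[i * n + j] = value;
    }

    double get(int i, int j) {
        return data[i * n + j];
    }

    // Walks the backing array sequentially (stride of 1 element).
    double sumByRows() {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                sum += data[i * n + j];
            }
        }
        return sum;
    }

    // Jumps n elements between consecutive accesses (stride of n elements).
    double sumByColumns() {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < n; i++) {
                sum += data[i * n + j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        FlatMatrix m = new FlatMatrix(3);
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                m.set(i, j, i * 3 + j); // fill with the values 0..8
            }
        }
        // Both traversal orders add the same nine values.
        System.out.println(m.sumByRows());
        System.out.println(m.sumByColumns());
    }
}
```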

    3. Counting LLC-loads and LLC-load-misses events

    Once Perf Events is installed we can measure, among others, the number of LLC-loads and LLC-load-misses events. The list of available pre-defined events can be obtained by executing

    $ perf list

    According to the man page the returned list items are actually "the symbolic event types which can be selected in the various perf commands with the -e option" [note 3].

    We counted the number of LLC-loads and LLC-load-misses events using perf's stat command:

    $ perf stat -e LLC-loads,LLC-load-misses java LLCMissesTest

    The measurement was run twice: first with the variable traverseByRows (line #36) set to true, then with it set to false. The results are shown in the table below:

    matrix size   traverse by rows   LLC-loads events   LLC-load-misses events   time (seconds)   load-misses / loads ratio
    7,500         true               367,944,413        15,522,099               8.63             4.22%
    7,500         false              10,467,824,326     1,288,872,561            84.13            12.31%

    When the matrix is traversed by columns, the number of LLC-loads events grows roughly 28-fold and the number of LLC-load-misses events roughly 83-fold; as a result, the execution time grows almost tenfold.
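
The figures in the last column, and the growth between the two runs, are plain arithmetic on the counts in the table. As a small sketch, the class below recomputes them; the class name `LlcRatios` and the method `missRatioPercent` are hypothetical, and the numbers are copied from the table.

```java
import java.util.Locale;

final class LlcRatios {
    // Miss ratio in percent: LLC-load-misses / LLC-loads * 100.
    static double missRatioPercent(long loads, long misses) {
        return 100.0 * misses / loads;
    }

    public static void main(String[] args) {
        // Counts and times taken from the results table.
        long rowLoads = 367944413L;
        long rowMisses = 15522099L;
        double rowSeconds = 8.63;
        long colLoads = 10467824326L;
        long colMisses = 1288872561L;
        double colSeconds = 84.13;

        System.out.printf(Locale.ROOT, "miss ratio by rows:    %.2f%%%n",
                missRatioPercent(rowLoads, rowMisses));
        System.out.printf(Locale.ROOT, "miss ratio by columns: %.2f%%%n",
                missRatioPercent(colLoads, colMisses));

        // Growth factors when switching from row to column traversal.
        System.out.printf(Locale.ROOT, "LLC-loads growth:       x%.1f%n",
                (double) colLoads / rowLoads);
        System.out.printf(Locale.ROOT, "LLC-load-misses growth: x%.1f%n",
                (double) colMisses / rowMisses);
        System.out.printf(Locale.ROOT, "time growth:            x%.1f%n",
                colSeconds / rowSeconds);
    }
}
```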

    The main hardware features were:

    • processor: 2 x Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
    • cache size: 3,072 kilobytes
    • bogomips: 5,852.47
    • RAM size: 3,597,972 kilobytes

    Operating system was Ubuntu Linux 11.04, kernel 2.6.38-13-generic-pae. The Java virtual machine was the OpenJDK Runtime Environment (IcedTea6 1.10.4), Java version 1.6.0_22.

    4. Caveats

    The tests are very simple: what we actually measured in the previous section is the number of LLC-loads and LLC-load-misses events for the whole program, not just for the method calculateSum(). To minimize the contribution of the other parts of the program, calculateSum() is called 50 times.

    Another issue is that the LLC is a shared resource. If the test runs in parallel with other memory-intensive applications, the results could be inaccurate.
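
One way to separate the traversal from the rest of the program, sketched here as an assumption rather than as what was done for the measurements above, is to select the traversal mode from the command line (so no recompilation is needed between runs) and to time only the summation loop with `System.nanoTime()`. This isolates wall-clock time only; the counters reported by perf stat still cover the whole process. The class name `TraversalTimer`, its argument handling and the default matrix size of 1000 are hypothetical.

```java
final class TraversalTimer {
    // Builds an n x n matrix of random doubles, as the test class does.
    static double[][] randomMatrix(int n) {
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                matrix[i][j] = Math.random();
            }
        }
        return matrix;
    }

    // Sums all elements; the traversal order is chosen once, outside the loops.
    static double sum(double[][] matrix, boolean byRows) {
        int n = matrix.length;
        double sum = 0.0;
        if (byRows) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    sum += matrix[i][j];
                }
            }
        } else {
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < n; i++) {
                    sum += matrix[i][j];
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical arguments: [true|false] [matrix size]
        boolean byRows = args.length == 0 || Boolean.parseBoolean(args[0]);
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 1000;

        double[][] matrix = randomMatrix(n); // construction is not timed

        long start = System.nanoTime();
        double total = 0.0;
        for (int k = 0; k < 50; k++) {
            total = sum(matrix, byRows);
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("traverseByRows = %b, total = %f, traversal time = %.3f s%n",
                byRows, total, seconds);
    }
}
```

Running it once with `true` and once with `false` gives the two traversal times without editing the source.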

    5. Useful links

    references

    [1] "2.6.31 is out": http://goo.gl/UCfWn

    [2] "Perfcounters added to the mainline": http://lwn.net/Articles/339361/

    notes

    [note 1] The first version was named Performance Counters. In version 2.6.32 it was renamed to Perf Events.

    [note 2] "We can expect elements of an array of primitive elements to be stored contiguously, but we cannot expect the objects of an array of objects to be stored contiguously. For a rectangular array of primitive elements, the elements of a row will be stored contiguously, but the rows may be scattered. A basic observation is that accessing the consecutive elements in a row will be faster than accessing consecutive elements in a column." (http://goo.gl/O8HPf)

    Geir Gundersen, Trond Steihaug; 2004; "Data structures in Java for matrix computations"; Concurrency and Computation: Practice and Experience; vol. 16, issue 8; pp. 799-815

    [note 3] "These events have been specifically implemented by architecture. Preliminary investigations suggest that the events appear correct but we also suggest that the events are compared against the corresponding raw counters and also against oprofile results until this tool is thoroughly investigated (this section will be updated as confirmation is made)."

    Bill Buros, "Using perf on POWER7 systems" (http://goo.gl/f4vS3)

  • Original address: https://www.cnblogs.com/dorothychai/p/3436139.html