  • Revisiting eBPF: communication and data structure analysis

      Message passing to invoke behavior in a program is a widely used technique in software engineering. A program can modify another program’s behavior by sending messages; this also allows the exchange of information between those programs. One of the most fascinating aspects of BPF is that the code running in the kernel and the program that loaded it can communicate with each other at runtime using message passing.

      BPF maps are key/value stores that reside in the kernel. They can be accessed by any BPF program that knows about them. Programs that run in user-space can also access these maps by using file descriptors. You can store any kind of data in a map, as long as you specify the data size correctly beforehand. The kernel treats keys and values as binary blobs, and it doesn’t care about what you keep in a map.
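    Because keys are treated as opaque bytes, a composite key is only as reliable as its byte layout. The sketch below assumes a hash map that compares keys byte for byte, so compiler-inserted padding has to be zeroed before use. The struct flow_key and both helpers are hypothetical names made up for illustration; they are not part of any BPF API.

```c
#include <string.h>

/* Hypothetical composite key. On common ABIs the compiler inserts one
 * padding byte between proto and port, and that byte is part of the
 * "binary blob" the map sees. */
struct flow_key {
    unsigned char  proto;
    unsigned short port;
};

/* Zero the whole key (padding included) before filling in the fields. */
void flow_key_init(struct flow_key *key, unsigned char proto, unsigned short port)
{
    memset(key, 0, sizeof(*key));
    key->proto = proto;
    key->port  = port;
}

/* Byte-wise comparison, mimicking how an opaque key would be matched. */
int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

    With this pattern, key_size would simply be sizeof(struct flow_key), and two keys built from the same field values compare equal regardless of what was in memory before.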

    Creating BPF Maps

      The most direct way to create a BPF map is by using the bpf syscall. When the first argument in the call is BPF_MAP_CREATE, you’re telling the kernel that you want to create a new map. This call will return the file descriptor identifier associated with the map you just created. The second argument in the syscall is the configuration for this map:

    union bpf_attr {
        struct {
            __u32 map_type; /* one of the values from bpf_map_type */
            __u32 key_size; /* size of the keys, in bytes */
            __u32 value_size; /* size of the values, in bytes */
            __u32 max_entries; /* maximum number of entries in the map */
            __u32 map_flags; /* flags to modify how we create the map */
        };
    };

      The third argument in the syscall is the size of this configuration attribute.
    For example, you can create a hash-table map that stores integers as both keys and values as follows:

    union bpf_attr my_map = {
        .map_type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = 100,
        .map_flags = BPF_F_NO_PREALLOC,
    };

    int fd = bpf(BPF_MAP_CREATE, &my_map, sizeof(my_map));

      If the call fails, it returns -1 and the kernel reports the reason through errno. Three causes are common. If one of the attributes is invalid, errno is set to EINVAL. If the user executing the operation doesn’t have enough privileges, errno is set to EPERM. Finally, if there is not enough memory to store the map, errno is set to ENOMEM.
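    The error handling described above can be sketched with the raw syscall. This is a minimal sketch, assuming a Linux system with the kernel UAPI headers installed; create_int_hash_map and describe_map_create_error are hypothetical helper names chosen for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create an int -> int hash map through bpf(2).
 * Returns the map fd on success, or -1 with errno set by the kernel. */
int create_int_hash_map(unsigned int max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));      /* fields we do not use must be zero */
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(int);
    attr.max_entries = max_entries;
    attr.map_flags   = BPF_F_NO_PREALLOC;

    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Hypothetical helper: turn the three errno values discussed in the text
 * into a short message; anything else falls back to strerror(). */
const char *describe_map_create_error(int err)
{
    switch (err) {
    case EINVAL: return "invalid attribute";
    case EPERM:  return "insufficient privileges";
    case ENOMEM: return "not enough memory";
    default:     return strerror(err);
    }
}
```

    A caller would check for a negative return value and pass errno to describe_map_create_error before doing anything else that might overwrite errno.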
    The helper function bpf_create_map wraps the code you just saw to make it easier to initialize maps on demand. We can use it to create the previous map with only one line of code:

    int fd;
    fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int), 100, BPF_F_NO_PREALLOC);

      If you know which kind of map you’re going to need in your program, you can also predefine it. This is helpful to get more visibility in the maps your program is using beforehand:

    struct bpf_map_def SEC("maps") my_map = {
        .type = BPF_MAP_TYPE_HASH,
        .key_size = sizeof(int),
        .value_size = sizeof(int),
        .max_entries = 100,
        .map_flags = BPF_F_NO_PREALLOC,
    };

      When you define a map in this way, you’re using what’s called a section attribute, in this case SEC("maps"). This macro places the structure in an ELF section named maps, which tells the loader that the structure is a BPF map and should be created accordingly.
    You might have noticed that this example has no file descriptor associated with the map. In this case, the loader used by the kernel samples (bpf_load.c) keeps a global variable called map_data with information about the maps in your program. This variable is an array of structures, ordered by how you specified each map in your code. For example, if the previous map was the first one specified in your code, you’d get its file descriptor from the first element in the array:
    fd = map_data[0].fd;
      You can also access the map’s name and its definition from this structure; this information is sometimes useful for debugging and tracing purposes.

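    In essence: the .o file produced by compiling the kernel-side program is an ELF file that gets parsed and loaded into the kernel. For this reason, maps are placed in their own dedicated ELF section.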

    #define SEC(NAME) __attribute__((section(NAME), used)) 

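    The user-space program creates a map through the bpf system call (with cmd set to BPF_MAP_CREATE). The inputs are the map's parameters, and the return value is the fd for the map. In the official samples, the user-space program creates its maps as follows.

    From the Makefile under samples/bpf in the Linux kernel source, we can see: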

    sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
    always += sockex1_kern.o

    One rule builds sockex1_kern.o, and the other builds the executable sockex1.

    So sockex1 pulls in bpf_load.c, libbpf.c, and related files.

    #include <stdio.h>
    #include <assert.h>
    #include <linux/bpf.h>
    #include "libbpf.h"
    #include "bpf_load.h"
    #include <unistd.h>
    #include <arpa/inet.h>
    
    int main(int ac, char **argv)
    {
        char filename[256];
        FILE *f;
        int i, sock;
    
        snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
    
        if (load_bpf_file(filename)) {
            printf("%s", bpf_log_buf);
            return 1;
        }
    
        sock = open_raw_sock("lo");
    
        assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
                  sizeof(prog_fd[0])) == 0);
    
        f = popen("ping -c5 localhost", "r");
        (void) f;
    
        for (i = 0; i < 5; i++) {
            long long tcp_cnt, udp_cnt, icmp_cnt;
            int key;
    
            key = IPPROTO_TCP;
            assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
    
            key = IPPROTO_UDP;
            assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
    
            key = IPPROTO_ICMP;
            assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
    
            printf("TCP %lld UDP %lld ICMP %lld bytes\n",
                   tcp_cnt, udp_cnt, icmp_cnt);
            sleep(1);
        }
    
        return 0;
    }

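      Earlier we saw that load_bpf_file uses the bpf(BPF_PROG_LOAD, ...) system call to load the kernel-side BPF instructions into the kernel and get back an associated fd. But how do both the user-space side and the kernel side come to know about the map used for communication? In other words, what ties them together?

    Through what link can the user-space program access the map defined in the kernel-side .o file?

    The answer: everything is a file!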

    int load_bpf_file(char *path)
    {
        /* ... */

        fd = open(path, O_RDONLY, 0);
        if (fd < 0)
            return 1;

        elf = elf_begin(fd, ELF_C_READ, NULL);

        if (!elf)
            return 1;

        /* parse the ELF header */
        if (gelf_getehdr(elf, &ehdr) != &ehdr)
            return 1;

        /* ... */

        /* scan over all elf sections to get license and map info */
        for (i = 1; i < ehdr.e_shnum; i++) {

            /* ... */
            } else if (strcmp(shname, "maps") == 0) {
                /* on finding the "maps" section, call load_maps to create
                 * each map and associate it with an fd */
                processed_sec[i] = true;
                /* once SEC("maps") has been found, the BPF map work is done
                 * by load_maps; bpf_create_map_node() and
                 * bpf_create_map_in_map_node() are the key functions that
                 * actually create the maps */
                if (load_maps(data->d_buf, data->d_size))
                    return 1;
            } else if (shdr.sh_type == SHT_SYMTAB) {
                symbols = data;
            }
        }

        /* ... */
    }

    static int load_maps(struct bpf_map_def *maps, int len)
    {
        int i;
    
        for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
    
            map_fd[i] = bpf_create_map(maps[i].type,
                           maps[i].key_size,
                           maps[i].value_size,
                           maps[i].max_entries,
                           maps[i].map_flags);
            if (map_fd[i] < 0) {
                printf("failed to create a map: %d %s\n",
                       errno, strerror(errno));
                return 1;
            }
    
            if (maps[i].type == BPF_MAP_TYPE_PROG_ARRAY)
                prog_array_fd = map_fd[i];
        }
        return 0;
    }
    int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
               int max_entries, int map_flags)
    {
        union bpf_attr attr = {
            .map_type = map_type,
            .key_size = key_size,
            .value_size = value_size,
            .max_entries = max_entries,
            .map_flags = map_flags,
        };
    
        return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }

     

    load_bpf_file  
      |
      |-- load_maps 
          |
          |-- bpf_create_map
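
    The sizing logic in load_maps (dividing the section length by the size of one definition) can be tried in isolation, without touching the kernel. The sketch below is an illustration only: struct map_def is an assumed stand-in that mirrors the five fields used in this post, and parse_map_section is a hypothetical helper. The length and capacity checks are additions that the listing above does not perform.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout: the five fields used by the map definitions in this post. */
struct map_def {
    unsigned int type;
    unsigned int key_size;
    unsigned int value_size;
    unsigned int max_entries;
    unsigned int map_flags;
};

/* Hypothetical helper: treat buf as the raw bytes of a "maps" section and
 * copy out the definitions it contains.
 * Returns the number of definitions, or -1 if len is not a whole multiple
 * of sizeof(struct map_def) or the output array is too small. */
int parse_map_section(const void *buf, size_t len,
                      struct map_def *out, size_t out_cap)
{
    size_t count = len / sizeof(struct map_def);

    if (len % sizeof(struct map_def) != 0 || count > out_cap)
        return -1;

    memcpy(out, buf, count * sizeof(struct map_def));
    return (int)count;
}
```

    Each returned definition would then be handed to a map-creation call, and the resulting fd stored at the same index, which is why the order of definitions in the source file determines the index in map_fd.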

    Creating the map in kernel space

    Kernel space handles the BPF_MAP_CREATE system call and allocates memory to serve as the map.

    /kernel/bpf/syscall.c
    
    static int map_create(union bpf_attr *attr)
    {
        struct bpf_map *map;
        int err;
    
        /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
        map = find_and_alloc_map(attr);
    
        // code omitted
        err = bpf_map_new_fd(map);
    
        return err;
    }
     

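    Writing to the map from the kernel program

    What the kernel program typically does is write data into the map: it calls bpf_map_lookup_elem() to find the memory that corresponds to the key (here, index), and then modifies that memory in place.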

    int bpf_prog1(struct __sk_buff *skb)
    {
        int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        long *value;
    
        if (skb->pkt_type != PACKET_OUTGOING)
            return 0;
    
        value = bpf_map_lookup_elem(&my_map, &index);
        if (value)
            __sync_fetch_and_add(value, skb->len);
    
        return 0;
    }
     

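    Reading the map from the user program

    The user program can read the value for a specific key through the BPF_MAP_LOOKUP_ELEM command of the bpf system call. The wrapper's first parameter is the fd returned when the map was created.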

    int bpf_lookup_elem(int fd, void *key, void *value)
    {
        union bpf_attr attr = {
            .map_fd = fd,
            .key = ptr_to_u64(key),
            .value = ptr_to_u64(value),
        };
    
        return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }
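
    Putting the user-space side together, the following is a minimal sketch of a create, update, lookup round trip using only the raw syscall. It assumes a Linux system with the UAPI headers and enough privileges to create maps; without them the first call fails and the function returns -1. map_roundtrip and sys_bpf are hypothetical names, and ptr_to_u64 is re-declared here so the sketch is self-contained.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static __u64 ptr_to_u64(const void *ptr)
{
    return (__u64)(unsigned long)ptr;
}

/* Thin wrapper over bpf(2); returns -1 and sets errno on failure. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return (int)syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Hypothetical helper: create a small hash map, store value under key,
 * then read it back into *out. Returns 0 on success, -1 on any failure. */
int map_roundtrip(int key, long long value, long long *out)
{
    union bpf_attr attr;
    int fd, rc;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(key);
    attr.value_size  = sizeof(value);
    attr.max_entries = 16;
    fd = sys_bpf(BPF_MAP_CREATE, &attr);
    if (fd < 0)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key    = ptr_to_u64(&key);
    attr.value  = ptr_to_u64(&value);
    attr.flags  = BPF_ANY;               /* create or overwrite the entry */
    rc = sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);

    if (rc == 0) {
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = ptr_to_u64(&key);
        attr.value  = ptr_to_u64(out);   /* kernel copies the value here */
        rc = sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
    }

    close(fd);                           /* the map is freed with its last fd */
    return rc == 0 ? 0 : -1;
}
```

    In the sockex1 flow the update step is performed by the kernel-side program instead, and the user program only performs the lookup with the fd that the loader stored in map_fd.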
     

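    BPF community sites

    Learning a technology is best started from the source code. Below are code repositories related to BPF:

    Learning a technology also benefits from discussion. Below are recommended people to follow:

    • Brendan Gregg, Netflix's leading BPF evangelist; his blog is all about Linux systems performance, with original insights, and every post is worth reading;
    • Alexei Starovoitov, creator of eBPF, currently working at Facebook; his name shows up often in kernel commits;
    • Daniel Borkmann, eBPF kernel co-maintainer, currently working at Isovalent, the company behind Cilium; he is one of the people who add new features to eBPF;
    • Thomas Graf, the father of Cilium and CTO of Isovalent; he is also a strong evangelist for eBPF and Cilium;
    • Quentin Monnet, bpftool co-maintainer; Quentin is a prolific answerer of BPF questions on Stack Overflow, and has posted a series of short hands-on eBPF pieces on Twitter;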
  • Original article: https://www.cnblogs.com/codestack/p/14745167.html