  • vfio

    [root@localhost dpdk-19.11]# lsmod | grep vfio
    [root@localhost dpdk-19.11]# modprobe vfio
    [root@localhost dpdk-19.11]# lsmod | grep vfio
    vfio_iommu_type1      262144  0 
    vfio                  262144  1 vfio_iommu_type1
    [root@localhost dpdk-19.11]# ./usertools/dpdk-devbind.py  --bind=vfio 0000:05:00.0
    Error: bind failed for 0000:05:00.0 - Cannot open /sys/bus/pci/drivers/vfio/bind

    The bind fails because the vfio module itself is not a PCI driver; the device has to be bound to the vfio-pci driver instead:

    [root@localhost dpdk-19.11]# modprobe vfio-pci
    [root@localhost dpdk-19.11]# ls /sys/bus/pci/drivers/
    ahci  ata_piix  ehci-pci  exar_serial  hibmc-drm  hinic  hisi_sas_v3_hw  hns3  igb_uio  ipmi_si  megaraid_sas  nvme  ohci-pci  pcieport  pci-stub  serial  uhci_hcd  vfio-pci  xhci_hcd
    [root@localhost dpdk-19.11]# lsmod | grep vfio
    vfio_pci              262144  0 
    vfio_virqfd           262144  1 vfio_pci
    vfio_iommu_type1      262144  0 
    vfio                  262144  2 vfio_iommu_type1,vfio_pci
    [root@localhost dpdk-19.11]# 
    [root@localhost dpdk-19.11]# ls /sys/bus/pci/drivers/
    ahci  ata_piix  ehci-pci  exar_serial  hibmc-drm  hinic  hisi_sas_v3_hw  hns3  igb_uio  ipmi_si  megaraid_sas  nvme  ohci-pci  pcieport  pci-stub  serial  uhci_hcd  xhci_hcd
    [root@localhost dpdk-19.11]#
    [root@localhost lib]# ls /sys/bus/pci/devices/0000:05:00.0/iommu
    device  devices  power  subsystem  uevent
    [root@localhost lib]# ls /sys/kernel/iommu_groups/
    0  1  10  11  12  13  14  15  16  17  18  19  2  20  21  22  23  24  25  26  27  28  29  3  30  31  32  33  34  35  36  4  5  6  7  8  9
    [root@localhost lib]# ls /sys/bus/pci/devices/0000:05:00.0/iommu_group 
    devices  reserved_regions
    [root@localhost lib]# ls /sys/bus/pci/devices/0000:05:00.0/iommu_group -al 
    lrwxrwxrwx. 1 root root 0 Sep 18 02:21 /sys/bus/pci/devices/0000:05:00.0/iommu_group -> ../../../../../../kernel/iommu_groups/27
    [root@localhost lib]# 
    [root@localhost lib]# ls /sys/bus/pci/devices/0000:05:00.0/iommu_group/
    devices  reserved_regions
    [root@localhost lib]# ls /sys/bus/pci/devices/0000:05:00.0/iommu_group/devices
    0000:05:00.0
    [root@localhost lib]# ls /dev/vfio/
    vfio
    [root@localhost lib]# ls /dev/vfio/vfio 
    /dev/vfio/vfio
    [root@localhost lib]# cd ..
    [root@localhost dpdk-19.11]# ./usertools/dpdk-devbind.py  --bind=vfio-pci  0000:05:00.0
    [root@localhost dpdk-19.11]# ls /sys/kernel/i
    iommu_groups/ irq/          
    [root@localhost dpdk-19.11]# ls /sys/kernel/iommu_groups/                   # group 27 already existed before the bind
    0  1  10  11  12  13  14  15  16  17  18  19  2  20  21  22  23  24  25  26  27  28  29  3  30  31  32  33  34  35  36  4  5  6  7  8  9
    [root@localhost dpdk-19.11]# ls /sys/kernel/iommu_groups/27
    devices  reserved_regions
    [root@localhost dpdk-19.11]# ls /sys/kernel/iommu_groups/27/devices/
    0000:05:00.0
    [root@localhost dpdk-19.11]# 

    VFIO is not bound to KVM. The VFIO API deconstructs a device into regions, IRQs, etc. The userspace application (QEMU, cloud-hypervisor, etc.) is responsible for reconstructing it into a device for, e.g., a guest VM to consume.

    Boot with intel_iommu=on.

    IOMMU groups

    Devices are grouped together for isolation, IOMMU capability and platform topology reasons. The grouping is determined by the hardware and is not configurable.

    These groups are the IOMMU groups:

    $ ls /sys/kernel/iommu_groups/
    0  1  10  11  12  13  19  2  3  4  5  6  7  8  9

    VFIO Objects

    Groups

    VFIO groups map one-to-one to IOMMU groups: VFIO group <-> IOMMU group.

    When binding a device to the vfio-pci kernel driver, the kernel creates the corresponding group under /dev/vfio. Let's follow a simple example:

    $ lspci -v
    [...]
    01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
    	Subsystem: Dell Device 07e6
    	Flags: bus master, fast devsel, latency 0, IRQ 133
    	Memory at dc300000 (32-bit, non-prefetchable) [size=4K]
    	Capabilities: [80] Power Management version 3
    	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
    	Capabilities: [b0] Express Endpoint, MSI 00
    	Capabilities: [100] Advanced Error Reporting
    	Capabilities: [148] Device Serial Number 00-00-00-01-00-4c-e0-00
    	Capabilities: [158] Latency Tolerance Reporting
    	Capabilities: [160] L1 PM Substates
    	Kernel driver in use: rtsx_pci
    	Kernel modules: rtsx_pci
    [...]
    

    We have a PCI card reader.

    $ readlink /sys/bus/pci/devices/0000:01:00.0/iommu_group
    ../../../../kernel/iommu_groups/12

    It belongs to the IOMMU group number 12.

    $ ls /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/
    0000:01:00.0

    It is alone in that group: there is a single PCI device in IOMMU group number 12.

    Next we need to unbind this device from its host kernel driver (rtsx_pci) and have vfio-pci drive it instead, so that we can control it from userspace and from our VMM.

    # Add the vfio-pci driver
    $ modprobe vfio_pci
    
    # Get the device VID/PID
    $ lspci -n -s 0000:01:00.0
    01:00.0 ff00: 10ec:525a (rev 01)
    
    # Unbind it from its default driver
    $ echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
    
    # Have vfio-pci drive it
    $ echo 10ec 525a > /sys/bus/pci/drivers/vfio-pci/new_id

    The whole IOMMU group this device belongs to is now driven by the vfio-pci driver. As a consequence, vfio-pci created the corresponding VFIO group:

    $ ls /dev/vfio/12 
    /dev/vfio/12

    Userspace now has full access to the whole IOMMU group and all devices belonging to it.

    Next we need to:

    • Create a VFIO container
    • Add our VFIO group to this container
    • Map and control our device

    VFIO Group API

    Container

    A VFIO container is a collection of VFIO groups logically bound together. Linking VFIO groups together through a VFIO container makes sense when a userspace application is going to access several VFIO groups. It is more efficient to share page tables between groups and avoid TLB thrashing.

    A VFIO container by itself is not very useful, but the VFIO API for a given group is not accessible until it's added to a container.

    Once added to a container, all devices from a given group can be mapped and controlled by userspace.

    VFIO Container API

    • Create a container: container_fd = open("/dev/vfio/vfio", O_RDWR);
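
    A minimal sketch of the container and group setup, following the standard VFIO type1 flow (group number 12 is the one from the example above; error handling is omitted):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Create a container. */
    int container = open("/dev/vfio/vfio", O_RDWR);

    /* Sanity checks: API version and type1 IOMMU support. */
    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
            /* unknown API version */;
    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
            /* type1 IOMMU model not supported */;

    /* Open the VFIO group created when the device was bound to vfio-pci. */
    int group = open("/dev/vfio/12", O_RDWR);

    /* The group is viable only if all of its devices are controlled by VFIO (or unbound). */
    struct vfio_group_status status = { .argsz = sizeof(status) };
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
            /* group not viable */;

    /* Add the group to the container, then enable the type1 IOMMU model on it. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);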

    Device

    A VFIO device is represented by a file descriptor. This file descriptor is returned from the VFIO_GROUP_GET_DEVICE_FD ioctl on the device's group. The ioctl takes the device address as an argument: segment:bus:device.function. In our example, this is 0000:01:00.0.

    Each VFIO device resource is represented by a VFIO region, and the device file descriptor gives access to the VFIO regions.

    VFIO Device API

    • VFIO_DEVICE_GET_INFO: Gets the device flags and number of associated regions and irqs.
    struct vfio_device_info {
    	__u32	argsz;
    	__u32	flags;
    #define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
    #define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
    #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
    #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
    #define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
    #define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
    	__u32	num_regions;	/* Max region index + 1 */
    	__u32	num_irqs;	/* Max IRQ index + 1 */
    };
    • VFIO_DEVICE_GET_REGION_INFO: Gets the size, flags and offset (behind the device file descriptor) of the region at a given index.
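
    A minimal sketch of the two calls above, continuing the container/group sketch from the Container API section (device address from the example; error handling omitted):

    /* Get a file descriptor for the device, identified by its segment:bus:device.function. */
    int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");

    /* Fetch the device flags and the number of regions and IRQs. */
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    /* Fetch the offset, size and flags of each region. */
    for (unsigned int i = 0; i < device_info.num_regions; i++) {
            struct vfio_region_info reg = { .argsz = sizeof(reg), .index = i };
            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
            /* reg.offset is where this region lives behind the device fd
             * (pread()/pwrite()/mmap()); reg.size is its length. */
    }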

    Regions

    Each VFIO region associated with a VFIO device represents a device resource (BARs, configuration space, etc.).
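
    For instance, with vfio-pci the PCI configuration space is exposed as one of those regions and is accessed with pread()/pwrite() at the region offset, while BARs whose region reports VFIO_REGION_INFO_FLAG_MMAP can be mapped directly. A minimal sketch, reusing the device file descriptor from the previous sketch (error handling omitted):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Read the vendor/device IDs (config space offset 0x0). */
    struct vfio_region_info cfg = {
            .argsz = sizeof(cfg),
            .index = VFIO_PCI_CONFIG_REGION_INDEX,
    };
    ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &cfg);

    unsigned int ids;
    pread(device, &ids, sizeof(ids), cfg.offset); /* vendor ID in the low 16 bits */

    /* Map BAR0 when the region supports it; MMIO accesses then reach the physical device. */
    struct vfio_region_info bar0 = {
            .argsz = sizeof(bar0),
            .index = VFIO_PCI_BAR0_REGION_INDEX,
    };
    ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &bar0);
    if (bar0.flags & VFIO_REGION_INFO_FLAG_MMAP) {
            void *bar0_mmio = mmap(NULL, bar0.size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, device, bar0.offset);
            /* bar0_mmio now points at the device's BAR0 registers. */
    }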

    VFIO and KVM

    VFIO is not bound to KVM and can be used outside of the hardware virtualization context.

    However, when using VFIO for assigning host devices into a KVM-based guest, we need to let KVM know about the VFIO groups we're adding or removing. The KVM API for that is the device API. There is a VFIO KVM device type (KVM_DEV_TYPE_VFIO) and there must be one single KVM VFIO device per guest. It is not a representation of the VFIO devices we want to assign to the guest, but rather a VFIO KVM device entry point.

    Any VFIO group is added to or removed from this pseudo-device by setting the VFIO KVM device attributes (KVM_DEV_VFIO_GROUP_ADD and KVM_DEV_VFIO_GROUP_DEL), as sketched below.
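
    A minimal sketch of that group registration, assuming vm_fd is the KVM VM file descriptor and group_fd is the VFIO group file descriptor opened earlier (error handling omitted):

    #include <linux/kvm.h>

    /* Create the single VFIO KVM pseudo-device for this guest. */
    struct kvm_create_device create = { .type = KVM_DEV_TYPE_VFIO };
    ioctl(vm_fd, KVM_CREATE_DEVICE, &create);

    /* Let KVM know about the VFIO group being assigned to the guest. */
    struct kvm_device_attr attr = {
            .group = KVM_DEV_VFIO_GROUP,
            .attr  = KVM_DEV_VFIO_GROUP_ADD,
            .addr  = (__u64)(unsigned long)&group_fd,
    };
    ioctl(create.fd, KVM_SET_DEVICE_ATTR, &attr);

    /* The same call with KVM_DEV_VFIO_GROUP_DEL removes the group. */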

    Interrupts

    write_config_register() -> device.msi_enable() -> add_msi_routing()
    

    When the guest programs the device with an MSI vector, and we have an interrupt event for the device, we do the following:

    • enable_msi() calls the VFIO_DEVICE_SET_IRQS() ioctl to have the kernel write to an eventfd whenever the programmed MSI is triggered. As the eventfd has previously been associated with a guest IRQ (through register_irqfd()), the MSI triggered from the physical device will generate a guest interrupt. VFIO_DEVICE_SET_IRQS() sets an interrupt handler for the device's physical interrupt, in both the MSI and legacy cases. The interrupt handler only writes to the eventfd file descriptor passed through the API (see the sketch after this list). This ioctl also indirectly enables posted interrupts by calling into the irqbypass kernel API.

    • add_msi_routing() sets a GSI routing entry to map the guest IRQ to the programmed MSI vector, so that the guest handles the MSI vector and not the VMM-chosen IRQ.
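
    A minimal sketch of that VFIO_DEVICE_SET_IRQS() call, wiring MSI vector 0 of the device to an eventfd (the same eventfd the VMM registers with KVM as an irqfd; error handling omitted):

    #include <string.h>
    #include <sys/eventfd.h>

    int efd = eventfd(0, EFD_CLOEXEC);

    /* vfio_irq_set carries a variable-length payload: here, a single eventfd. */
    char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
    struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;

    irq_set->argsz = sizeof(buf);
    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
    irq_set->start = 0;
    irq_set->count = 1;
    memcpy(irq_set->data, &efd, sizeof(int));

    /* From now on, the kernel signals efd whenever MSI vector 0 fires on the device. */
    ioctl(device, VFIO_DEVICE_SET_IRQS, irq_set);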

    In summary, when a VFIO/physical device triggers an interrupt, there are two cases:

    1. The guest is running

      • The device writes to the programmed MSI vector.
      • As the guest is running, this triggers a posted, remapped interrupt directly into the guest.
    2. The guest is not running

      • The device writes to the programmed MSI vector.
      • This triggers a host interrupt.
      • VFIO catches that interrupt.
      • VFIO writes to the eventfd the VMM gave it when calling the VFIO_DEVICE_SET_IRQS() ioctl.
      • KVM receives the eventfd write.
      • KVM remaps the IRQ linked to the eventfd to a guest MSI vector. This has been set by the add_msi_routing() call (KVM GSI routing table).
      • KVM injects the MSI interrupt into the guest.
      • The guest handles the interrupt the next time it's scheduled.
      • (Should we handle the eventfd write from the VMM to force a guest run?)
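
    A minimal sketch of the KVM side of that path: routing a VMM-chosen GSI to the guest-programmed MSI, then registering the VFIO eventfd (efd from the previous sketch) as an irqfd for that GSI. gsi, msi_addr_lo, msi_addr_hi and msi_data are hypothetical values taken from what the guest programmed into the device; note that KVM_SET_GSI_ROUTING replaces the whole routing table, so a real VMM submits all of its routes at once (error handling omitted):

    /* One MSI routing entry: guest IRQ (GSI) -> guest-programmed MSI. */
    char rbuf[sizeof(struct kvm_irq_routing) + sizeof(struct kvm_irq_routing_entry)];
    struct kvm_irq_routing *routing = (struct kvm_irq_routing *)rbuf;
    memset(rbuf, 0, sizeof(rbuf));
    routing->nr = 1;
    routing->entries[0].gsi = gsi;                       /* hypothetical VMM-chosen GSI */
    routing->entries[0].type = KVM_IRQ_ROUTING_MSI;
    routing->entries[0].u.msi.address_lo = msi_addr_lo;  /* hypothetical guest-programmed MSI */
    routing->entries[0].u.msi.address_hi = msi_addr_hi;
    routing->entries[0].u.msi.data = msi_data;
    ioctl(vm_fd, KVM_SET_GSI_ROUTING, routing);

    /* Inject an interrupt on that GSI whenever the VFIO eventfd is signaled. */
    struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);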

    Example

    Good set of C examples at https://github.com/awilliam/tests

    Deconstructing and Reconstructing

    When binding a physical device to the vfio-pci driver, we're essentially deconstructing it. In other words, we're splitting it apart into separate resources (VFIO regions). At that point the device is no longer usable and not managed by any driver.

    The idea behind VFIO is to completely or partially reconstruct the device in userspace. In a VM/virtualization context, the VMM reassembles those separate resources to reconstruct a guest device by:

    • Building and emulating the guest device PCI configuration space. It is up to the VMM to decide what it wants to expose from the physical device configuration space.

    • Emulating all BAR MMIO reads and writes. The guest will, for example, set up DMA transfers by writing to specific offsets in special BARs, and the VMM is responsible for trapping those writes and translating them into VFIO API calls, which program the physical device accordingly.

    • Setting the IOMMU interrupt remapping, based on the device interrupt information given by the VFIO API (To be Documented)

    • Setting the DMA remapping, mostly by adding the whole guest physical memory to the IOMMU table. This is again done through the VFIO API. When the driver in the guest programs a DMA transfer, the VMM translates that into physical device programming via VFIO calls. The DMA transfer then starts (The PCI device will become a memory bus master) and will use a guest physical address as either a source or destination IOVA (I/O virtual address). The IOMMU translates that into a host virtual address, as programmed by the VMM through the VFIO DMA mapping APIs.
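
    A minimal sketch of that DMA mapping step, assuming guest_mem is the host virtual address of the guest RAM backing guest physical address 0 and guest_mem_size its length (both names are hypothetical; error handling omitted):

    /* IOVA == guest physical address, so DMA addresses programmed by the guest just work. */
    struct vfio_iommu_type1_dma_map dma_map = {
            .argsz = sizeof(dma_map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (__u64)(unsigned long)guest_mem, /* hypothetical: guest RAM in the VMM */
            .iova  = 0,                               /* guest physical address */
            .size  = guest_mem_size,                  /* hypothetical: guest RAM size */
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);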

    Debugging

    VFIO kernel traces:

    $ trace-cmd record -p function_graph -l vfio_*
    $ trace-cmd report

    KVM events and functions:

    $ trace-cmd record -p function_graph -l kvm_*
    $ trace-cmd report
    $ trace-cmd record -e kvm_*
    $ trace-cmd report