  • gVisor syscall internals

    https://terassyi.net/posts/2020/04/14/gvisor.html

     https://wenboshen.org/posts/2018-12-25-gvisor-inside.html

    System calls

    For the Linux kernel, Anatomy of a system call, part 1 gives a good overview of how a syscall is handled in the kernel. MSR_LSTAR is a model-specific register that holds the “Target RIP for the called procedure when SYSCALL is executed in 64-bit mode”; see Table 2-2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers. As of kernel v4.20, syscall_init sets MSR_LSTAR to entry_SYSCALL_64, which calls do_syscall_64, where the handler is selected by syscall number.

    For gVisor, the discussion “How gvisor trap to syscall handler in kvm platform” explains: “On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler sysenter, which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.”

    SyscallTable is a struct. All the implemented syscalls are listed in var AMD64.

    System calls from the sentry to the host are a bit more involved, as they require the sentry to switch from guest mode back to host mode before calling into the host kernel.
    The sentry is developed as a normal user-space application (see "How is gVisor different from other container isolation mechanisms?" and the following Architecture section of our README). As such, it may make host system calls for many different reasons. For example, external file system access performs read()s and write()s to a 9p server over a Unix domain socket. The Go runtime itself uses clone(), futex(), and mmap() (among others) for host system thread creation, synchronization primitives, and memory allocation, respectively.
     
    The vast majority of sentry code (anything outside of pkg/sentry/platform/kvm or pkg/sentry/platform/ring0) assumes that it is a normal Linux process. Those packages are responsible for ensuring that interactions with the host (syscalls) still work properly.
    So the overall architecture looks like the diagram below?
     
    Ring 3    User App         |     Sentry
    ------------------------------------------------    guest
    Ring 0                Sentry.ring0
     
    ///////////////////////////////////////////////////////////////////////
     
    Ring 3                Sentry.kvm_platform   host
     
    Is it correct that when the user app makes a syscall, it is first intercepted by the sentry at ring 0 in the guest, and then actually handled by the Sentry emulator running at ring 3 in the guest? And if the Sentry emulator hits a syscall or needs some resources, does it switch to the host so that the host Linux handles it?
     
    Almost, except in guest mode, the sentry always executes in ring 0. You can see the core flow here: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/ring0/kernel_amd64.go#L215-L231
     
    The sentry is normally mapped at a normal userspace address which cannot be mapped into application address spaces (since it would conflict with application mappings). So there is a sentry page table with the normal mappings, plus a mirror of relevant sentry mappings in the kernel range (bit 63 set) in all application page tables. This mirrored copy is what executes between jumpToKernel() and jumpToUser().
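
    As a rough illustration of the mirroring described above, the sketch below derives an upper-half alias for a sentry address by setting the high address bits. The constant kernelStart and both helper names are assumptions made for this example (a 48-bit canonical layout is assumed); they are not copied from the gVisor sources.

```go
package main

import "fmt"

// kernelStart is an ASSUMED start of the kernel (upper) half for a 48-bit
// canonical address space. It is illustrative, not gVisor's actual constant.
const kernelStart uint64 = 0xffff800000000000

// mirrorAddr returns the hypothetical upper-half alias of a userspace address.
func mirrorAddr(addr uint64) uint64 {
	return addr | kernelStart
}

// inKernelRange reports whether bit 63 of the address is set.
func inKernelRange(addr uint64) bool {
	return addr>>63 == 1
}

func main() {
	sentryAddr := uint64(0x400000)
	alias := mirrorAddr(sentryAddr)
	fmt.Printf("sentry %#x -> mirror %#x (kernel range: %v)\n",
		sentryAddr, alias, inKernelRange(alias))
}
```

    The point is only that the alias can never collide with an application mapping, because applications cannot map addresses with bit 63 set.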
     
    iret()/sysret() save RSP/RBP so that the syscall handler (sysenter()) can restore them and then "return" to the call site in SwitchToUser.
     
    The full execution path looks like:
    kernel.runApp.execute -> kernel.Task.p.Switch (kvm.context.Switch) -> kvm.vCPU.SwitchToUser -> ring0.CPU.SwitchToUser
     
    kernel.runApp is part of the core task lifecycle state machine which handles application syscalls (eventually calling one of the handlers). The kernel package is independent of the execution platform.
     
     
     

    How does KVM system call redirection work?

    1. During setup, the sentry sets LSTAR to the syscall handler, sysenter (just like any OS).
    2. In SwitchToUser, the sentry calls sysret (or iret), which saves the sentry stack state and executes SYSRET to switch to ring 3 and execute user code.
    3. User code eventually executes SYSCALL, and the core switches to (guest) ring 0 and jumps to sysenter.
    4. sysenter restores the sentry stack state and returns. This effectively makes the sysret() call in SwitchToUser "return", and the sentry runs the remainder of SwitchToUser.
    5. This ultimately makes Platform.Switch return in the core sentry, where we ultimately handle the syscall by selecting an implementation from the syscall table.
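
    The five steps above can be modeled loosely in plain Go. In this sketch a goroutine stands in for ring 3 user code and a channel send stands in for the SYSCALL instruction, so switchToUser only "returns" once the application traps. The names trap, switchToUser, dispatch, runSentry, and userSyscall, as well as the toy table, are all hypothetical; nothing here touches LSTAR or real privilege levels.

```go
package main

import "fmt"

// trap carries what the sentry would read from the registers after SYSCALL.
type trap struct {
	sysno uintptr
	arg   uintptr
	ret   chan uintptr // where the emulated return value is delivered
}

// table is a toy syscall table keyed by syscall number.
var table = map[uintptr]func(arg uintptr) uintptr{
	0: func(arg uintptr) uintptr { return arg + 1 },
	1: func(arg uintptr) uintptr { return arg * 2 },
}

// switchToUser blocks until "user code" executes a syscall, mirroring how the
// sysret() call site only returns after sysenter has run.
func switchToUser(traps <-chan trap) (trap, bool) {
	tr, ok := <-traps
	return tr, ok
}

// dispatch selects an implementation from the table by syscall number.
func dispatch(tr trap) uintptr {
	if fn, ok := table[tr.sysno]; ok {
		return fn(tr.arg)
	}
	return ^uintptr(0) // stand-in for an error return
}

// runSentry loops: enter user mode, wait for a trap, dispatch, resume.
// It returns the number of traps handled once user code has exited.
func runSentry(traps <-chan trap) int {
	handled := 0
	for {
		tr, ok := switchToUser(traps)
		if !ok {
			return handled
		}
		tr.ret <- dispatch(tr)
		handled++
	}
}

// userSyscall is what the application side calls in place of SYSCALL.
func userSyscall(traps chan<- trap, sysno, arg uintptr) uintptr {
	ret := make(chan uintptr)
	traps <- trap{sysno: sysno, arg: arg, ret: ret}
	return <-ret
}

func main() {
	traps := make(chan trap)
	results := make(chan uintptr, 2)
	go func() { // the "application"
		results <- userSyscall(traps, 0, 41)
		results <- userSyscall(traps, 1, 21)
		close(traps)
	}()
	n := runSentry(traps)
	fmt.Println(n, <-results, <-results)
}
```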
       
     
    With the KVM platform, the sentry runs in both host ring 3 (HR3) and guest ring 0 (GR0), depending on the current context. It runs in GR0 before running user code because it must be in ring 0 to execute SYSRET/IRET and switch to GR3 user code. It runs in HR3 when making a syscall to the host kernel because it must be in host mode for the host kernel to intercept the syscall. It never runs in host ring 0, though of course KVM syscalls, VM exits, and other host syscalls are handled by the standard host Linux kernel in host ring 0.
     
    WARNING: DATA RACE
    Write at 0x00c00014f0c8 by goroutine 332:
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).WriteFromBlocks()
          pkg/sentry/fs/tty/queue.go:246 +0x2a6
      gvisor.googlesource.com/gvisor/pkg/sentry/safemem.Writer.WriteFromBlocks-fm()
          pkg/sentry/safemem/io.go:46 +0x75
      gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withInternalMappings()
          pkg/sentry/mm/io.go:503 +0x8ac
      gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).withVecInternalMappings()
          pkg/sentry/mm/io.go:572 +0x964
      gvisor.googlesource.com/gvisor/pkg/sentry/mm.(*MemoryManager).CopyInTo()
          pkg/sentry/mm/io.go:309 +0x1f1
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*queue).write()
          pkg/sentry/usermem/usermem.go:543 +0x164
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).inputQueueWrite()
          pkg/sentry/fs/tty/line_discipline.go:205 +0x147
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Write()
          pkg/sentry/fs/tty/master.go:141 +0x11c
      gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*File).Writev()
          pkg/sentry/fs/file.go:314 +0x1fc
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.writev()
          pkg/sentry/syscalls/linux/sys_write.go:261 +0xe0
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Write()
          pkg/sentry/syscalls/linux/sys_write.go:71 +0x293
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
          pkg/sentry/kernel/task_syscall.go:165 +0x407
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
          pkg/sentry/kernel/task_syscall.go:283 +0xb4
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
          pkg/sentry/kernel/task_syscall.go:244 +0x109
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
          pkg/sentry/kernel/task_syscall.go:219 +0x1b6
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
          pkg/sentry/kernel/task_run.go:215 +0x1852
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
          pkg/sentry/kernel/task_run.go:91 +0x2e5
    
    Previous read at 0x00c00014f0c8 by goroutine 113:
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*lineDiscipline).masterReadiness()
          pkg/sentry/fs/tty/queue.go:121 +0x43
      gvisor.googlesource.com/gvisor/pkg/sentry/fs/tty.(*masterFileOperations).Readiness()
          pkg/sentry/fs/tty/master.go:131 +0x71
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.(*PollFD).initReadiness()
          pkg/sentry/fs/file.go:199 +0x2d0
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls.Poll()
          pkg/sentry/syscalls/polling.go:96 +0x139
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.doPoll()
          pkg/sentry/syscalls/linux/sys_poll.go:70 +0x2ac
      gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Ppoll()
          pkg/sentry/syscalls/linux/sys_poll.go:343 +0x113
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
          pkg/sentry/kernel/task_syscall.go:165 +0x407
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
          pkg/sentry/kernel/task_syscall.go:283 +0xb4
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
          pkg/sentry/kernel/task_syscall.go:244 +0x109
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
          pkg/sentry/kernel/task_syscall.go:219 +0x1b6
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
          pkg/sentry/kernel/task_run.go:215 +0x1852
      gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
          pkg/sentry/kernel/task_run.go:91 +0x2e5
    // SyscallTable is a lookup table of system calls.
    //
    // Note that a SyscallTable is not savable directly. Instead, they are saved as
    // an OS/Arch pair and lookup happens again on restore.
    type SyscallTable struct {
        // OS is the operating system that this syscall table implements.
        OS abi.OS
    
        // Arch is the architecture that this syscall table targets.
        Arch arch.Arch
    
        // The OS version that this syscall table implements.
        Version Version
    
        // AuditNumber is a numeric constant that represents the syscall table. If
        // non-zero, auditNumber must be one of the AUDIT_ARCH_* values defined by
        // linux/audit.h.
        AuditNumber uint32
    
        // Table is the collection of functions.
        Table map[uintptr]Syscall
    
        // lookup is a fixed-size array that holds the syscalls (indexed by
        // their numbers). It is used for fast look ups.
        lookup []SyscallFn
    
        // Emulate is a collection of instruction addresses to emulate. The
        // keys are addresses, and the values are system call numbers.
        Emulate map[usermem.Addr]uintptr
    
        // The function to call in case of a missing system call.
        Missing MissingFn
    
        // Stracer traces this syscall table.
        Stracer Stracer
    
        // External is used to handle an external callback.
        External func(*Kernel)
    
        // ExternalFilterBefore is called before External is called before the syscall is executed.
        // External is not called if it returns false.
        ExternalFilterBefore func(*Task, uintptr, arch.SyscallArguments) bool
    
        // ExternalFilterAfter is called before External is called after the syscall is executed.
        // External is not called if it returns false.
        ExternalFilterAfter func(*Task, uintptr, arch.SyscallArguments) bool
    
        // FeatureEnable stores the strace and one-shot enable bits.
        FeatureEnable SyscallFlagsTable
    }
    // Init initializes the system call table.
    //
    // This should normally be called only during registration.
    func (s *SyscallTable) Init() {
        if s.Table == nil {
            // Ensure non-nil lookup table.
            s.Table = make(map[uintptr]Syscall)
        }
        if s.Emulate == nil {
            // Ensure non-nil emulate table.
            s.Emulate = make(map[usermem.Addr]uintptr)
        }
    
        max := s.MaxSysno() // Checked during RegisterSyscallTable.
    
        // Initialize the fast-lookup table.
        s.lookup = make([]SyscallFn, max+1)
        for num, sc := range s.Table {
            s.lookup[num] = sc.Fn    // build the fast-lookup slice from the syscall table
        }
    
        // Initialize all features.
        s.FeatureEnable.init(s.Table, max)
    }
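
    The idea behind Init can be shown in isolation: a sparse map is convenient for registration, while a dense slice indexed by syscall number gives a bounds check plus one index on the hot path. The following is a minimal sketch; syscallTable, syscallFn, build, and find are simplified stand-ins invented for this example, not the gVisor types.

```go
package main

import "fmt"

// syscallFn is a simplified handler signature (an assumption for this sketch).
type syscallFn func(args [6]uintptr) (uintptr, error)

// syscallTable keeps a sparse map for registration and a dense slice for
// fast dispatch, in the spirit of the Table and lookup fields above.
type syscallTable struct {
	table  map[uintptr]syscallFn
	lookup []syscallFn
}

// maxSysno returns the largest registered syscall number.
func (s *syscallTable) maxSysno() uintptr {
	var highest uintptr
	for num := range s.table {
		if num > highest {
			highest = num
		}
	}
	return highest
}

// build creates the dense slice from the sparse map.
func (s *syscallTable) build() {
	if s.table == nil {
		s.table = make(map[uintptr]syscallFn)
	}
	s.lookup = make([]syscallFn, s.maxSysno()+1)
	for num, fn := range s.table {
		s.lookup[num] = fn
	}
}

// find returns the handler for sysno, or nil if none is registered.
func (s *syscallTable) find(sysno uintptr) syscallFn {
	if sysno < uintptr(len(s.lookup)) {
		return s.lookup[sysno]
	}
	return nil
}

func main() {
	t := &syscallTable{table: map[uintptr]syscallFn{
		0: func(a [6]uintptr) (uintptr, error) { return a[0], nil },
		3: func(a [6]uintptr) (uintptr, error) { return 0, nil },
	}}
	t.build()
	fmt.Println(len(t.lookup), t.find(3) != nil, t.find(2) != nil, t.find(99) != nil)
}
```

    Unregistered numbers inside the range are simply nil entries, which is what lets the caller fall back to a "missing syscall" handler.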

    gvisor/pkg/sentry/syscalls/linux/linux64.go

    // AMD64 is a table of Linux amd64 syscall API with the corresponding syscall
    // numbers from Linux 4.4.
    var AMD64 = &kernel.SyscallTable{
            OS:   abi.Linux,
            Arch: arch.AMD64,
            Version: kernel.Version{
                    // Version 4.4 is chosen as a stable, longterm version of Linux, which
                    // guides the interface provided by this syscall table. The build
                    // version is that for a clean build with default kernel config, at 5
                    // minutes after v4.4 was tagged.
                    Sysname: LinuxSysname,
                    Release: LinuxRelease,
                    Version: LinuxVersion,
            },
            AuditNumber: linux.AUDIT_ARCH_X86_64,
            Table: map[uintptr]kernel.Syscall{
                    0:   syscalls.Supported("read", Read),
                    1:   syscalls.Supported("write", Write),
                    2:   syscalls.PartiallySupported("open", Open, "Options O_DIRECT, O_NOATIME, O_PATH, O_TMPFILE, O_SYNC are not supported.", nil),
                    3:   syscalls.Supported("close", Close),
                    4:   syscalls.Supported("stat", Stat),
                    5:   syscalls.Supported("fstat", Fstat),
                    6:   syscalls.Supported("lstat", Lstat),
                    7:   syscalls.Supported("poll", Poll),
                    8:   syscalls.Supported("lseek", Lseek),
                    9:   syscalls.PartiallySupported("mmap", Mmap, "Generally supported with exceptions. Options MAP_FIXED_NOREPLACE, MAP_SHARED_VALIDATE, MAP_SYNC MAP_GROWSDOWN, MAP_HUGETLB are not supported.", nil),
                    10:  syscalls.Supported("mprotect", Mprotect),
                    11:  syscalls.Supported("munmap", Munmap),
                    12:  syscalls.Supported("brk", Brk),
                    13:  syscalls.Supported("rt_sigaction", RtSigaction),
                    14:  syscalls.Supported("rt_sigprocmask", RtSigprocmask),
                    15:  syscalls.Supported("rt_sigreturn", RtSigreturn),
                    16:  syscalls.PartiallySupported("ioctl", Ioctl, "Only a few ioctls are implemented for backing devices and file systems.", nil),
                    17:  syscalls.Supported("pread64", Pread64),
                    18:  syscalls.Supported("pwrite64", Pwrite64),
                    19:  syscalls.Supported("readv", Readv),
                    20:  syscalls.Supported("writev", Writev),
                    21:  syscalls.Supported("access", Access),
                    22:  syscalls.Supported("pipe", Pipe),
                    23:  syscalls.Supported("select", Select),
                    24:  syscalls.Supported("sched_yield", SchedYield),
                    25:  syscalls.Supported("mremap", Mremap),
                    // ... (remaining entries omitted)
            },
    }
    goroutine 974 [running]:
    panic(0x10a1140, 0xc00043c070)
        GOROOT/src/runtime/panic.go:1064 +0x470 fp=0xc000985508 sp=0xc000985450 pc=0x437030
    gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inodeRefs).IncRef(0xc000d6e008)
        bazel-out/k8-fastbuild-ST-3bfd66f45e612c1a5c797474a25664e227d81bf914f3b08a40e00b2e2692afa4/bin/pkg/sentry/fsimpl/tmpfs/inode_refs.go:88 +0x18c fp=0xc000985580 sp=0xc000985508 pc=0x92828c
    gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*inode).incRef(...)
        pkg/sentry/fsimpl/tmpfs/tmpfs.go:512
    gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).IncRef(0xc0000c4aa0)
        pkg/sentry/fsimpl/tmpfs/tmpfs.go:357 +0x49 fp=0xc000985598 sp=0xc000985580 pc=0x92ef89
    gvisor.dev/gvisor/pkg/sentry/vfs.(*Dentry).IncRef(...)
        pkg/sentry/vfs/dentry.go:150
    gvisor.dev/gvisor/pkg/sentry/vfs.(*FileDescription).Init(0xc000d66500, 0x140d420, 0xc000d66500, 0xc000008241, 0xc000532660, 0xc0000c4aa0, 0xc000985624, 0x47a03f, 0xc000557358)
        pkg/sentry/vfs/file_description.go:151 +0x167 fp=0xc0009855c0 sp=0xc000985598 pc=0x7d3c87
    gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*dentry).open(0xc0000c4aa0, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0xc000985878, 0x1, 0x0, 0x0, 0x0)
        pkg/sentry/fsimpl/tmpfs/filesystem.go:584 +0x1dd fp=0xc000985660 sp=0xc0009855c0 pc=0x923abd
    gvisor.dev/gvisor/pkg/sentry/fsimpl/tmpfs.(*filesystem).OpenAt(0xc000557300, 0x1402d60, 0xc000bdaa80, 0xc000d6a000, 0x8241, 0x0, 0x0, 0x0)
        pkg/sentry/fsimpl/tmpfs/filesystem.go:519 +0xa1e fp=0xc000985858 sp=0xc000985660 pc=0x92309e
    gvisor.dev/gvisor/pkg/sentry/vfs.(*VirtualFilesystem).OpenAt(0xc000228908, 0x1402d60, 0xc000bdaa80, 0xc000cec300, 0xc000985aa0, 0xc000985a88, 0x100, 0xc000532420, 0xc0002ac000)
        pkg/sentry/vfs/vfs.go:515 +0x1ee fp=0xc0009859e8 sp=0xc000985858 pc=0x7ebe6e
    gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.openat(0xc000bdaa80, 0x2b4bffffff9c, 0x20000180, 0x241, 0x0, 0x0, 0x0, 0x0, 0x0)
        pkg/sentry/syscalls/linux/vfs2/filesystem.go:219 +0x2bc fp=0xc000985b38 sp=0xc0009859e8 pc=0xe4d2bc
    gvisor.dev/gvisor/pkg/sentry/syscalls/linux/vfs2.Creat(0xc000bdaa80, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        pkg/sentry/syscalls/linux/vfs2/filesystem.go:200 +0x71 fp=0xc000985b90 sp=0xc000985b38 pc=0xe4cfb1
    gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0xea72d7, 0x1272f60, ...)
        pkg/sentry/kernel/task_syscall.go:116 +0x1b9 fp=0xc000985c50 sp=0xc000985b90 pc=0xa470f9
    gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        pkg/sentry/kernel/task_syscall.go:291 +0x70 fp=0xc000985cd8 sp=0xc000985c50 pc=0xa48410
    gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter(0xc000bdaa80, 0x55, 0x20000180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        pkg/sentry/kernel/task_syscall.go:238 +0xb4 fp=0xc000985d38 sp=0xc000985cd8 pc=0xa47eb4
    gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall(0xc000bdaa80, 0x2, 0xc000bdaa80)
        pkg/sentry/kernel/task_syscall.go:205 +0x198 fp=0xc000985e08 sp=0xc000985d38 pc=0xa47798
    gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute(0x0, 0xc000bdaa80, 0x13d5ba0, 0x0)
        pkg/sentry/kernel/task_run.go:327 +0xd8c fp=0xc000985f60 sp=0xc000985e08 pc=0xa3a10c
    gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run(0xc000bdaa80, 0x2d)
        pkg/sentry/kernel/task_run.go:100 +0x1e2 fp=0xc000985fd0 sp=0xc000985f60 pc=0xa38c02
    runtime.goexit()
        src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000985fd8 sp=0xc000985fd0 pc=0x4705a1
    created by gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).Start
        pkg/sentry/kernel/task_start.go:374 +0x116

    pkg/sentry/syscalls/linux/sys_thread.go

    // Fork implements Linux syscall fork(2).
    func Fork(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) {
        // "A call to fork() is equivalent to a call to clone(2) specifying flags
        // as just SIGCHLD." - fork(2)
        return clone(t, int(syscall.SIGCHLD), 0, 0, 0, 0)
    }

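    The clone helper (same file) fills in kernel.CloneOptions and calls t.Clone, which is implemented in pkg/sentry/kernel/task_clone.go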

    func clone(t *kernel.Task, flags int, stack usermem.Addr, parentTID usermem.Addr, childTID usermem.Addr, tls usermem.Addr) (uintptr, *kernel.SyscallControl, error) {
        opts := kernel.CloneOptions{
            SharingOptions: kernel.SharingOptions{
                NewAddressSpace:     flags&syscall.CLONE_VM == 0,
                NewSignalHandlers:   flags&syscall.CLONE_SIGHAND == 0,
                NewThreadGroup:      flags&syscall.CLONE_THREAD == 0,
                TerminationSignal:   linux.Signal(flags & exitSignalMask),
                NewPIDNamespace:     flags&syscall.CLONE_NEWPID == syscall.CLONE_NEWPID,
                NewUserNamespace:    flags&syscall.CLONE_NEWUSER == syscall.CLONE_NEWUSER,
                NewNetworkNamespace: flags&syscall.CLONE_NEWNET == syscall.CLONE_NEWNET,
                NewFiles:            flags&syscall.CLONE_FILES == 0,
                NewFSContext:        flags&syscall.CLONE_FS == 0,
                NewUTSNamespace:     flags&syscall.CLONE_NEWUTS == syscall.CLONE_NEWUTS,
                NewIPCNamespace:     flags&syscall.CLONE_NEWIPC == syscall.CLONE_NEWIPC,
            },
            Stack:         stack,
            SetTLS:        flags&syscall.CLONE_SETTLS == syscall.CLONE_SETTLS,
            TLS:           tls,
            ChildClearTID: flags&syscall.CLONE_CHILD_CLEARTID == syscall.CLONE_CHILD_CLEARTID,
            ChildSetTID:   flags&syscall.CLONE_CHILD_SETTID == syscall.CLONE_CHILD_SETTID,
            ChildTID:      childTID,
            ParentSetTID:  flags&syscall.CLONE_PARENT_SETTID == syscall.CLONE_PARENT_SETTID,
            ParentTID:     parentTID,
            Vfork:         flags&syscall.CLONE_VFORK == syscall.CLONE_VFORK,
            Untraced:      flags&syscall.CLONE_UNTRACED == syscall.CLONE_UNTRACED,
            InheritTracer: flags&syscall.CLONE_PTRACE == syscall.CLONE_PTRACE,
        }
        ntid, ctrl, err := t.Clone(&opts)
        return uintptr(ntid), ctrl, err
    }
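
    To see why fork(2) yields a fully separate process while a thread-style clone shares almost everything, the flag tests above can be replayed in isolation. The sketch below hard-codes the Linux x86-64 flag values so that it builds on any OS; the sharing struct and the decode function are simplified stand-ins for illustration, not gVisor types.

```go
package main

import "fmt"

// Linux x86-64 values, copied here so the sketch does not depend on the
// platform-specific syscall package.
const (
	cloneVM        = 0x00000100
	cloneFS        = 0x00000200
	cloneFiles     = 0x00000400
	cloneSighand   = 0x00000800
	cloneThread    = 0x00010000
	sigchld        = 17
	exitSignalMask = 0xff
)

// sharing is a reduced version of the sharing options built by clone above.
type sharing struct {
	NewAddressSpace   bool
	NewSignalHandlers bool
	NewThreadGroup    bool
	NewFiles          bool
	NewFSContext      bool
	TerminationSignal int
}

// decode applies the same tests as clone: a CLONE_* bit that is absent means
// the child gets its own copy of that resource.
func decode(flags int) sharing {
	return sharing{
		NewAddressSpace:   flags&cloneVM == 0,
		NewSignalHandlers: flags&cloneSighand == 0,
		NewThreadGroup:    flags&cloneThread == 0,
		NewFiles:          flags&cloneFiles == 0,
		NewFSContext:      flags&cloneFS == 0,
		TerminationSignal: flags & exitSignalMask,
	}
}

func main() {
	// fork(2): flags are just SIGCHLD, so nothing is shared.
	fmt.Printf("fork:   %+v\n", decode(sigchld))

	// A typical thread creation shares everything listed here.
	threadFlags := cloneVM | cloneFS | cloneFiles | cloneSighand | cloneThread
	fmt.Printf("thread: %+v\n", decode(threadFlags))
}
```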
      log.Infof("Process should have started...")
        l.watchdog.Start()
        return l.k.Start()
    type Loader struct {
        // k is the kernel.
        k *kernel.Kernel

    //
    // threadID a dummy value set to the task's TID in the root PID namespace to
    // make it visible in stack dumps. A goroutine for a given task can be identified
    // searching for Task.run()'s argument value.
    func (t *Task) run(threadID uintptr) {
            atomic.StoreInt64(&t.goid, goid.Get())
    
            // Construct t.blockingTimer here. We do this here because we can't
            // reconstruct t.blockingTimer during restore in Task.afterLoad(), because
            // kernel.timekeeper.SetClocks() hasn't been called yet.
            blockingTimerNotifier, blockingTimerChan := ktime.NewChannelNotifier()
            t.blockingTimer = ktime.NewTimer(t.k.MonotonicClock(), blockingTimerNotifier)
            defer t.blockingTimer.Destroy()
            t.blockingTimerChan = blockingTimerChan
    
            // Activate our address space.
            t.Activate()
            // The corresponding t.Deactivate occurs in the exit path
            // (runExitMain.execute) so that when
            // Platform.CooperativelySharesAddressSpace() == true, we give up the
            // AddressSpace before the task goroutine finishes executing.
    
            // If this is a newly-started task, it should check for participation in
            // group stops. If this is a task resuming after restore, it was
            // interrupted by saving. In either case, the task is initially
            // interrupted.
            t.interruptSelf()
    
            for {
                    // Explanation for this ordering:
                    //
                    // - A freshly-started task that is stopped should not do anything
                    // before it enters the stop.
                    //
                    // - If taskRunState.execute returns nil, the task goroutine should
                    // exit without checking for a stop.
                    //
                    // - Task.Start won't start Task.run if t.runState is nil, so this
                    // ordering is safe.
                    t.doStop()
                    t.runState = t.runState.execute(t)
                    if t.runState == nil {
                            t.accountTaskGoroutineEnter(TaskGoroutineNonexistent)
                            t.goroutineStopped.Done()
                            t.tg.liveGoroutines.Done()
                            t.tg.pidns.owner.liveGoroutines.Done()
                            t.tg.pidns.owner.runningGoroutines.Done()
                            t.p.Release()
    
                            // Deferring this store triggers a false positive in the race
                            // detector (https://github.com/golang/go/issues/42599).
                            atomic.StoreInt64(&t.goid, 0)
                            // Keep argument alive because stack trace for dead variables may not be correct.
                            runtime.KeepAlive(threadID)
                            return
                    }
            }
    }
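
    The loop in Task.run is a small state machine: each state's execute method returns the next state, and a nil state ends the goroutine. A minimal sketch of that pattern follows. The state names echo the ones in the excerpts, but the task struct, its fields, and the behavior of each state are invented for this example.

```go
package main

import "fmt"

// task holds the minimal data the toy states need.
type task struct {
	pending []string // queued "syscalls" the application will make
	log     []string // what the state machine did, in order
}

// runState is one state of the machine; execute returns the next state, or
// nil to end the task loop.
type runState interface {
	execute(t *task) runState
}

type (
	runApp     struct{}
	runSyscall struct{ name string }
	runExit    struct{}
)

// runApp "runs the application" until it makes its next syscall or exits.
func (*runApp) execute(t *task) runState {
	if len(t.pending) == 0 {
		return (*runExit)(nil)
	}
	next := t.pending[0]
	t.pending = t.pending[1:]
	return &runSyscall{name: next}
}

// runSyscall handles one syscall and then goes back to the application.
func (s *runSyscall) execute(t *task) runState {
	t.log = append(t.log, s.name)
	return (*runApp)(nil)
}

// runExit is the terminal state; returning nil stops the loop.
func (*runExit) execute(t *task) runState {
	t.log = append(t.log, "exit")
	return nil
}

// run drives the machine the same way the for loop in Task.run does.
func (t *task) run(initial runState) {
	for state := initial; state != nil; {
		state = state.execute(t)
	}
}

func main() {
	t := &task{pending: []string{"read", "write"}}
	t.run((*runApp)(nil))
	fmt.Println(t.log)
}
```

    Stateless states are passed around as typed nil pointers, such as (*runApp)(nil). These are non-nil interface values, so the loop keeps going until a state returns an untyped nil.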
    func (ts *TaskSet) newTask(cfg *TaskConfig) (*Task, error) {
            tg := cfg.ThreadGroup
            image := cfg.TaskImage
            t := &Task{
                    taskNode: taskNode{
                            tg:       tg,
                            parent:   cfg.Parent,
                            children: make(map[*Task]struct{}),
                    },
                    runState:           (*runApp)(nil),
                    interruptChan:      make(chan struct{}, 1),
                    signalMask:         cfg.SignalMask,
                    signalStack:        arch.SignalStack{Flags: arch.SignalStackFlagDisable},
                    image:              *image,
                    fsContext:          cfg.FSContext,
                    fdTable:            cfg.FDTable,
                    p:                  cfg.Kernel.Platform.NewContext(),
                    k:                  cfg.Kernel,
                    ptraceTracees:      make(map[*Task]struct{}),
                    allowedCPUMask:     cfg.AllowedCPUMask.Copy(),
                    ioUsage:            &usage.IO{},
                    niceness:           cfg.Niceness,
                    netns:              cfg.NetworkNamespace,
                    utsns:              cfg.UTSNamespace,
                    ipcns:              cfg.IPCNamespace,
                    abstractSockets:    cfg.AbstractSocketNamespace,
                    mountNamespaceVFS2: cfg.MountNamespaceVFS2,
                    rseqCPU:            -1,
                    rseqAddr:           cfg.RSeqAddr,
                    rseqSignature:      cfg.RSeqSignature,
                    futexWaiter:        futex.NewWaiter(),
                    containerID:        cfg.ContainerID,
            }

     pkg/sentry/kernel/task_run.go

    func (*runApp) execute(t *Task) taskRunState {
        ...
        switch err {
        case nil:
            // Handle application system call.
            return t.doSyscall()
        ...
    }

    doSyscall: pkg/sentry/kernel/task_syscall.go

    // doSyscall is the entry point for an invocation of a system call specified by
    // the current state of t's registers.
    //
    // The syscall path is very hot; avoid defer.
    func (t *Task) doSyscall() taskRunState {
        sysno := t.Arch().SyscallNo()
        args := t.Arch().SyscallArgs()
    
        // Tracers expect to see this between when the task traps into the kernel
        // to perform a syscall and when the syscall is actually invoked.
        // This useless-looking temporary is needed because Go.
        tmp := uintptr(syscall.ENOSYS)
        t.Arch().SetReturn(-tmp)
    
        // Check seccomp filters. The nil check is for performance (as seccomp use
        // is rare), not needed for correctness.
        if t.syscallFilters.Load() != nil {
            switch r := t.checkSeccompSyscall(int32(sysno), args, usermem.Addr(t.Arch().IP())); r {
            case linux.SECCOMP_RET_ERRNO, linux.SECCOMP_RET_TRAP:
                t.Debugf("Syscall %d: denied by seccomp", sysno)
                return (*runSyscallExit)(nil)
            case linux.SECCOMP_RET_ALLOW:
                // ok
            case linux.SECCOMP_RET_KILL_THREAD:
                t.Debugf("Syscall %d: killed by seccomp", sysno)
                t.PrepareExit(ExitStatus{Signo: int(linux.SIGSYS)})
                return (*runExit)(nil)
            case linux.SECCOMP_RET_TRACE:
                t.Debugf("Syscall %d: stopping for PTRACE_EVENT_SECCOMP", sysno)
                return (*runSyscallAfterPtraceEventSeccomp)(nil)
            default:
                panic(fmt.Sprintf("Unknown seccomp result %d", r))
            }
        }
    
        return t.doSyscallEnter(sysno, args)
    }
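
    The "useless-looking temporary" in doSyscall exists because Go rejects a negative constant of unsigned type, whereas negating a variable simply wraps around. The sketch below shows the wrap-around encoding of -ENOSYS and how it reads back as a signed value. The function names are hypothetical, and 38 is the Linux errno value for ENOSYS.

```go
package main

import "fmt"

// enosys is the Linux errno value for ENOSYS.
const enosys = 38

// negErrno encodes an errno the way a syscall return register carries it:
// as the two's-complement negation of the errno in an unsigned word.
func negErrno(errno uintptr) uintptr {
	// Negating a variable wraps around at run time. Writing the same thing
	// as a constant expression, -uintptr(38), is a compile-time overflow.
	return -errno
}

// decodeReturn reinterprets the register value as a signed integer.
func decodeReturn(ret uintptr) int {
	return int(ret)
}

func main() {
	ret := negErrno(enosys)
	fmt.Printf("register=%#x signed=%d\n", ret, decodeReturn(ret))
}
```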
    func (t *Task) executeSyscall(sysno uintptr, args arch.SyscallArguments) (rval uintptr, ctrl *SyscallControl, err error) {
        s := t.SyscallTable()
    
        fe := s.FeatureEnable.Word(sysno)
    
        var straceContext interface{}
        if bits.IsAnyOn32(fe, StraceEnableBits) {
            straceContext = s.Stracer.SyscallEnter(t, sysno, args, fe)
        }
    
        if bits.IsOn32(fe, ExternalBeforeEnable) && (s.ExternalFilterBefore == nil || s.ExternalFilterBefore(t, sysno, args)) {
            t.invokeExternal()
            // Ensure we check for stops, then invoke the syscall again.
            ctrl = ctrlStopAndReinvokeSyscall
        } else {
            fn := s.Lookup(sysno)
            if fn != nil {
                // Call our syscall implementation.
                rval, ctrl, err = fn(t, args)
            } else {
                // Use the missing function if not found.
                rval, err = t.SyscallTable().Missing(t, sysno, args)
            }
        }
    
        if bits.IsOn32(fe, ExternalAfterEnable) && (s.ExternalFilterAfter == nil || s.ExternalFilterAfter(t, sysno, args)) {
            t.invokeExternal()
            // Don't reinvoke the syscall.
        }
    
        if bits.IsAnyOn32(fe, StraceEnableBits) {
            s.Stracer.SyscallExit(straceContext, t, sysno, rval, err)
        }
    
        return
    }
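
    Stripped of the external hooks, executeSyscall does three things: it reads a per-syscall feature word, calls either the registered handler or the Missing fallback, and brackets the call with trace hooks when tracing is enabled. A reduced sketch of that shape is below. The table type, the straceEnable bit, and errNoSys are assumptions for illustration and do not match gVisor's definitions.

```go
package main

import (
	"errors"
	"fmt"
)

// straceEnable is a hypothetical feature bit that turns tracing on.
const straceEnable uint32 = 1 << 0

// errNoSys stands in for the error a missing syscall would produce.
var errNoSys = errors.New("function not implemented")

type handler func(arg uintptr) (uintptr, error)

type table struct {
	handlers map[uintptr]handler
	features map[uintptr]uint32 // per-syscall feature word
	trace    []string           // collected trace events
}

// missing is the fallback used when no handler is registered.
func (tb *table) missing(sysno uintptr) (uintptr, error) {
	return 0, errNoSys
}

// execute gates tracing on the feature word and falls back to missing.
func (tb *table) execute(sysno, arg uintptr) (rval uintptr, err error) {
	fe := tb.features[sysno]
	if fe&straceEnable != 0 {
		tb.trace = append(tb.trace, fmt.Sprintf("enter %d", sysno))
	}
	if fn := tb.handlers[sysno]; fn != nil {
		rval, err = fn(arg)
	} else {
		rval, err = tb.missing(sysno)
	}
	if fe&straceEnable != 0 {
		tb.trace = append(tb.trace, fmt.Sprintf("exit %d", sysno))
	}
	return rval, err
}

func main() {
	tb := &table{
		handlers: map[uintptr]handler{
			1: func(arg uintptr) (uintptr, error) { return arg * 2, nil },
		},
		features: map[uintptr]uint32{1: straceEnable},
	}
	fmt.Println(tb.execute(1, 21))
	fmt.Println(tb.execute(999, 0))
	fmt.Println(tb.trace)
}
```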

    pkg/sentry/syscalls/linux/vfs2/fscontext.go

    // Chdir implements Linux syscall chdir(2).
    func Chdir(t *kernel.Task, args arch.SyscallArguments) (uintptr, *kernel.SyscallControl, error) {
            addr := args[0].Pointer()
    
            path, err := copyInPath(t, addr)
            if err != nil {
                    return 0, nil, err
            }
            tpop, err := getTaskPathOperation(t, linux.AT_FDCWD, path, disallowEmptyPath, followFinalSymlink)
            if err != nil {
                    return 0, nil, err
            }
            defer tpop.Release(t)
    
            vd, err := t.Kernel().VFS().GetDentryAt(t, t.Credentials(), &tpop.pop, &vfs.GetDentryOptions{
                    CheckSearchable: true,
            })
            if err != nil {
                    return 0, nil, err
            }
            t.FSContext().SetWorkingDirectoryVFS2(t, vd)
            vd.DecRef(t)
            return 0, nil, nil
    }
    root@cloud:~/onlyGvisor# ps -elf | grep 947729
    4 S nobody   947729 947703  0  80   0 - 68452015012 sys_po 11:17 ?  00:00:00 runsc-sandbox --root=/var/run/docker/runtime-runsc-kvm/moby --log=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88/log.json --log-format=json --platform=kvm --log-fd=3 boot --bundle=/run/containerd/io.containerd.runtime.v1.linux/moby/b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88 --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --io-fds=9 --io-fds=10 --io-fds=11 --device-fd=12 --stdio-fds=13 --stdio-fds=14 --stdio-fds=15 --pidns=true --cpu-num 64 b0b3e6a9f9c469275fe320d9b2b433902337cd66993b793ac79121e911d5bf88
    0 S root     948093 947631  0  80   0 -  1418 pipe_r 11:33 pts/3    00:00:00 grep --color=auto 947729
    root@cloud:~/onlyGvisor# docker inspect test2 | grep Pid | head -n 1
                "Pid": 947729,
    root@cloud:~/onlyGvisor# 

    Setting a breakpoint on Socket with dlv

    root@cloud:/gvisor# docker run --runtime=runsc-kvm --rm --name=test -d alpine sleep 1000
    1076ade686c4ccea6e8c40e6d6881e4f5e9c403ff21aab9febf4557218a10e17
    root@cloud:/gvisor# docker inspect test | grep Pid | head -n 1
                "Pid": 927424,
    root@cloud:/gvisor# docker exec -it test ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=1 ttl=42 time=57.058 ms
    64 bytes from 8.8.8.8: seq=2 ttl=42 time=56.148 ms
    64 bytes from 8.8.8.8: seq=3 ttl=42 time=56.321 ms
    64 bytes from 8.8.8.8: seq=4 ttl=42 time=69.416 ms
    64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.813 ms
    64 bytes from 8.8.8.8: seq=6 ttl=42 time=68.444 ms
    64 bytes from 8.8.8.8: seq=7 ttl=42 time=56.031 ms
    ^C
    --- 8.8.8.8 ping statistics ---
    8 packets transmitted, 7 packets received, 12% packet loss
    round-trip min/avg/max = 55.813/59.890/69.416 ms
    root@cloud:/gvisor# docker exec -it test ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    
    64 bytes from 8.8.8.8: seq=0 ttl=42 time=111.545 ms
    64 bytes from 8.8.8.8: seq=1 ttl=42 time=55.150 ms
    64 bytes from 8.8.8.8: seq=2 ttl=42 time=55.362 ms
    64 bytes from 8.8.8.8: seq=3 ttl=42 time=58.652 ms
    64 bytes from 8.8.8.8: seq=4 ttl=42 time=56.521 ms
    64 bytes from 8.8.8.8: seq=5 ttl=42 time=55.958 ms
    64 bytes from 8.8.8.8: seq=6 ttl=42 time=55.386 ms
    64 bytes from 8.8.8.8: seq=7 ttl=42 time=54.869 ms
    64 bytes from 8.8.8.8: seq=8 ttl=42 time=54.373 ms
    64 bytes from 8.8.8.8: seq=9 ttl=42 time=74.912 ms
    64 bytes from 8.8.8.8: seq=10 ttl=42 time=55.755 ms
    root@cloud:/mycontainer# dlv attach 927424
    Type 'help' for list of commands.
    (dlv) b Socket
    Command failed: Location "Socket" ambiguous: golang.org/x/sys/unix.Socket, syscall.Socket, type..eq.gvisor.dev/gvisor/pkg/unet.Socket, gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket…
    (dlv) c
    received SIGINT, stopping process (will not forward signal)
    > syscall.Syscall6() src/syscall/asm_linux_arm64.s:43 (PC: 0x8dccc)
    Warning: debugging optimized function
    (dlv) b  linux.Socket
    Breakpoint 1 set at 0x587f30 for gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172
    (dlv) b netstack.Socket
    Command failed: Location "netstack.Socket" ambiguous: gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket, gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket…
    (dlv) b netstack.(*provider).Socket
    Breakpoint 2 set at 0x647270 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*provider).Socket() pkg/sentry/socket/netstack/provider.go:94
    (dlv) b netstack.(*providerVFS2).Socket
    Breakpoint 3 set at 0x647960 for gvisor.dev/gvisor/pkg/sentry/socket/netstack.(*providerVFS2).Socket() pkg/sentry/socket/netstack/provider_vfs2.go:38
    (dlv) c
    > gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket() pkg/sentry/syscalls/linux/sys_socket.go:172 (hits goroutine(306):1 total:1) (PC: 0x587f30)
    Warning: debugging optimized function
    (dlv) bt
    0  0x0000000000587f30 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.Socket
       at pkg/sentry/syscalls/linux/sys_socket.go:172
    1  0x0000000000522ea4 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).executeSyscall
       at pkg/sentry/kernel/task_syscall.go:104
    2  0x0000000000523c5c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke
       at pkg/sentry/kernel/task_syscall.go:239
    3  0x00000000005238dc in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter
       at pkg/sentry/kernel/task_syscall.go:199
    4  0x00000000005233e0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).doSyscall
       at pkg/sentry/kernel/task_syscall.go:174
    5  0x0000000000518e00 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
       at pkg/sentry/kernel/task_run.go:282
    6  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
       at pkg/sentry/kernel/task_run.go:97
    7  0x0000000000077c84 in runtime.goexit
       at src/runtime/asm_arm64.s:1136
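
The backtrace shows `executeSyscall` calling `linux.Socket`, so the handler is selected by syscall number. A minimal sketch of such table-driven dispatch follows. The `SyscallFn` signature, `table`, and the fake file descriptor are assumptions made for illustration; gVisor's real handlers also receive the task and return extra control information. The number 198 is `socket` in the Linux arm64 syscall ABI, matching the arm64 host in the trace.

```go
package main

import (
	"errors"
	"fmt"
)

// SyscallFn is a hypothetical handler signature for this sketch.
type SyscallFn func(args [6]uintptr) (uintptr, error)

// errNoSys stands in for ENOSYS.
var errNoSys = errors.New("function not implemented")

// sysSocket is the socket(2) syscall number on Linux arm64.
const sysSocket = 198

// socket pretends to create a socket and always returns fd 3.
func socket(args [6]uintptr) (uintptr, error) {
	if args[0] != 2 { // this sketch only accepts AF_INET (2)
		return 0, errors.New("address family not supported")
	}
	return 3, nil
}

// table maps syscall numbers to handlers, one entry per implemented syscall.
var table = map[uintptr]SyscallFn{
	sysSocket: socket,
}

// executeSyscall looks up the handler for sysno and invokes it.
func executeSyscall(sysno uintptr, args [6]uintptr) (uintptr, error) {
	fn, ok := table[sysno]
	if !ok {
		return 0, errNoSys
	}
	return fn(args)
}

func main() {
	// socket(AF_INET, SOCK_STREAM, 0)
	fd, err := executeSyscall(sysSocket, [6]uintptr{2, 1, 0})
	fmt.Println(fd, err)

	// An unknown syscall number falls through to errNoSys.
	_, err = executeSyscall(9999, [6]uintptr{})
	fmt.Println(err)
}
```

Keeping the handlers in a table means adding a syscall is one new entry, and unimplemented numbers fail in a single place.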

    entersyscall +  exitsyscall 

            entersyscall()
            bluepill(c)
            vector = c.CPU.SwitchToUser(switchOpts)
            exitsyscall()
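
The four calls above form a bracket: `entersyscall` tells the Go scheduler that this goroutine is about to block, so its P can be handed to other goroutines; the thread then drops into guest mode and runs the application until it traps; `exitsyscall` re-enters the scheduler afterwards. A minimal sketch of that ordering, where every function is a hypothetical stand-in that only logs its name (the real ones come from the Go runtime and the KVM platform package) and the returned vector value is made up:

```go
package main

import "fmt"

// trace records call order so the bracketing is visible.
var trace []string

// Hypothetical stand-ins that only log their name.
func entersyscall() { trace = append(trace, "entersyscall") }
func exitsyscall()  { trace = append(trace, "exitsyscall") }
func bluepill()     { trace = append(trace, "bluepill") }

// switchToUser pretends the application ran until it trapped and
// returns a made-up vector number.
func switchToUser() int {
	trace = append(trace, "SwitchToUser")
	return 1
}

// runGuest brackets the guest-mode switch with the scheduler hooks,
// in the same order as the snippet above.
func runGuest() int {
	entersyscall()
	bluepill()
	vector := switchToUser()
	exitsyscall()
	return vector
}

func main() {
	v := runGuest()
	fmt.Println(v, trace)
}
```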


    //go:linkname entersyscall runtime.entersyscall
    func entersyscall()

    //go:linkname exitsyscall runtime.exitsyscall
    func exitsyscall()

    In gVisor, entersyscall and exitsyscall are the Go runtime's own functions, as the breakpoint in the following dlv session shows:
    root@cloud:~/onlyGvisor# dlv attach 947729
    Type 'help' for list of commands.
    (dlv) b entersyscall
    Breakpoint 1 set at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126
    (dlv) c
    > runtime.entersyscall() GOROOT/src/runtime/proc.go:3126 (hits goroutine(203):1 total:1) (PC: 0x72780)
    Warning: debugging optimized function
    (dlv) bt
    0  0x0000000000072780 in runtime.entersyscall
       at GOROOT/src/runtime/proc.go:3126
    1  0x000000000008dcb0 in syscall.Syscall6
       at src/syscall/asm_linux_arm64.s:35
    2  0x00000000005458e8 in gvisor.dev/gvisor/pkg/fdnotifier.epollWait
       at pkg/fdnotifier/poll_unsafe.go:76
    3  0x0000000000545564 in gvisor.dev/gvisor/pkg/fdnotifier.(*notifier).waitAndNotify
       at pkg/fdnotifier/fdnotifier.go:149
    4  0x0000000000077c84 in runtime.goexit
       at src/runtime/asm_arm64.s:1136
    (dlv) clearall
    Breakpoint 1 cleared at 0x72780 for runtime.entersyscall() GOROOT/src/runtime/proc.go:3126
    (dlv) b pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
    Breakpoint 2 set at 0x87f504 for gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
    (dlv) c
    > gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248 (hits goroutine(94):1 total:1) (PC: 0x87f504)
    Warning: debugging optimized function
    (dlv) bt
    0  0x000000000087f504 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser
       at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:248
    1  0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch
       at pkg/sentry/platform/kvm/context.go:75
    2  0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
       at pkg/sentry/kernel/task_run.go:271
    3  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
       at pkg/sentry/kernel/task_run.go:97
    4  0x0000000000077c84 in runtime.goexit
       at src/runtime/asm_arm64.s:1136
    (dlv) s
    > gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser() pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249 (PC: 0x87f508)
    Warning: debugging optimized function
    (dlv) bt
    0  0x000000000087f508 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).SwitchToUser
       at pkg/sentry/platform/kvm/machine_arm64_unsafe.go:249
    1  0x000000000087bb1c in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*context).Switch
       at pkg/sentry/platform/kvm/context.go:75
    2  0x00000000005186d0 in gvisor.dev/gvisor/pkg/sentry/kernel.(*runApp).execute
       at pkg/sentry/kernel/task_run.go:271
    3  0x0000000000517d9c in gvisor.dev/gvisor/pkg/sentry/kernel.(*Task).run
       at pkg/sentry/kernel/task_run.go:97
    4  0x0000000000077c84 in runtime.goexit
       at src/runtime/asm_arm64.s:1136
    (dlv) s
    Stopped at: 0x881b80
    =>no source available
  • Original article: https://www.cnblogs.com/dream397/p/14251415.html