Interrupts

Interrupt Descriptor Table (IDT)

(Figure: Relationship of the IDTR and IDT)

  • The Interrupt Descriptor Table (IDT) associates a gate descriptor with each interrupt and exception vector
  • Each entry is an 8-byte descriptor (16 bytes in 64-bit long mode); x86 has 256 interrupt vectors, so the 32-bit IDT takes 256 x 8 = 2048 bytes
  • The idtr register allows the IDT to be placed anywhere in memory (see the sidt sketch at the end of this list)
    • It specifies the linear base address of the IDT and its maximum length (limit)
    • It must be initialized with the lidt instruction before interrupts are enabled
  • Intel defines three kinds of gate descriptors, distinguished by bits 40-43
    • Task Gate
    • Interrupt Gate
    • Trap Gate
    (Figure: Gate descriptors' format)
  • The gate descriptors Linux uses are named slightly differently:
    • Interrupt Gate
    • System Gate:
      • an Intel trap gate
      • DPL=3
      • attached to vectors 4, 5 and 128, so the into, bound and int $0x80 instructions can be issued from user mode
    • System Interrupt Gate
      • an Intel interrupt gate
      • DPL=3
      • attached to vector 3, so the int3 instruction can be issued from user mode
    • Trap Gate
    • Task Gate
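  • A quick way to see what idtr currently holds is the sidt instruction, which stores the limit and base. Below is a minimal user-space sketch (my own illustration, not kernel code; on CPUs/kernels with UMIP enabled, sidt is not permitted in ring 3):
    #include <stdio.h>
    #include <stdint.h>

    struct __attribute__((packed)) idtr_reg {
            uint16_t limit;   /* size of the IDT in bytes, minus 1 */
            uint64_t base;    /* linear base address of the IDT */
    };

    int main(void)
    {
            struct idtr_reg idtr;

            /* sidt only reads the register; no privilege needed unless UMIP is enabled */
            asm volatile("sidt %0" : "=m" (idtr));
            printf("IDT base=%#llx, limit=%u\n",
                   (unsigned long long)idtr.base, idtr.limit);
            return 0;
    }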

Hardware handling of interrupts and exceptions on x86

(Figure: Interrupt Procedure Call)

  • Assume the kernel has been initialized and the CPU is running in protected mode

  • After executing an instruction, the cs and eip registers hold the logical address of the next instruction to be executed

  • Before dealing with that instruction, the control unit checks whether an interrupt or exception occurred while the previous instruction was executing; if so, the control unit does the following:

    1. Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or exception
    2. Reads the i-th entry of the IDT referred to by the idtr register
    3. Gets the base address of the GDT from the gdtr register and, using the selector in the IDT entry, reads the Segment Descriptor in the GDT; this descriptor specifies the base address of the segment containing the interrupt or exception handler
      • In ordinary segmented addressing, the upper 16 bits of a logical address serve as the Segment Selector that picks a descriptor from the GDT; here the selector stored in the IDT entry is used instead
      • The segment containing the handler should be a kernel code segment
    4. Makes sure the interrupt or exception was issued by an authorized source (see the pseudocode sketch after this list)
      • It compares the Current Privilege Level (CPL), held in the two least-significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the segment descriptor found in the GDT
      • If CPL is (numerically) lower than DPL, a "General protection" exception is raised, because the interrupt handler cannot run at a lower privilege level than the program that caused the interrupt
      • For example, a hardware interrupt first raises the CPU privilege level (to ring 0) before jumping to the handler code, so this check never fails for hardware interrupts. A system call, on the other hand, is issued from user mode; because the trap gate used for system calls has DPL=3, code at any CPL may issue one
      • For programmed exceptions, a further security check is made: the CPL is compared with the DPL of the gate descriptor in the IDT, and a "General protection" exception is raised if the gate's DPL is (numerically) lower than the CPL
      • The first check exists because the target code (the handler) must run at a privilege level at least as high as the code being interrupted (e.g. user-mode code)
      • The second check prevents user applications from entering the kernel through high-privilege (DPL=0) trap or interrupt gates; they may only use the low-privilege gates (system gates and system interrupt gates, DPL=3)
    5. Checks whether a change of privilege level is taking place, i.e. whether CPL differs from the DPL of the selected segment descriptor; if so, the control unit must start using the stack associated with the new privilege level:
      • It reads the tr register to access the TSS segment of the running process
      • It loads the ss and esp registers with the stack segment and stack pointer associated with the new privilege level, as found in the TSS segment
      • On the new stack it saves the previous values of ss and esp, which define the logical address of the stack associated with the old privilege level
    6. If a fault has occurred, it loads cs and eip with the logical address of the instruction that caused the exception, so that the instruction can be executed again
    7. Saves the contents of eflags, cs and eip on the stack
    8. If the exception carries a hardware error code, saves it on the stack as well
    9. Loads cs and eip with the Segment Selector and Offset fields of the gate descriptor in the i-th IDT entry; together they give the logical address of the first instruction of the interrupt or exception handler
  • Stack usage by a 32-bit x86 CPU when handling exceptions: (Figure: Stack Usage on Transfers to Interrupt and Exception-Handling Routines)
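  • The pseudocode below (my own rendering of step 4 above, not real hardware or kernel code) summarizes the two privilege checks; the comparisons are numeric, and a smaller value means a higher privilege level:
    /* check 1: the handler's code segment must not be less privileged
     * than the interrupted program */
    if (CPL < seg_desc.dpl)
            raise_general_protection();

    /* check 2: only for programmed exceptions (int n, int3, into, bound);
     * a user program (CPL=3) may only go through gates with DPL=3 */
    if (caused_by_int_instruction && gate_desc.dpl < CPL)
            raise_general_protection();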

Interrupt initialization

Call trace

start_kernel()
-> trap_init()
  -> idt_setup_traps()
     -> idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
  -> cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table), PAGE_KERNEL_RO);
  -> idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
  -> cpu_init()
     -> load_current_idt()
        -> load_idt((const struct desc_ptr *)&idt_descr);
           -> native_load_idt()
  -> idt_setup_ist_traps()
     -> idt_setup_from_table(idt_table, ist_idts, ARRAY_SIZE(ist_idts), true)
  -> x86_init.irqs.trap_init()
  -> idt_setup_debugidt_traps()
-> init_IRQ()
  -> x86_init.irqs.intr_init()
  => native_init_IRQ()
     -> idt_setup_apic_and_irq_gates()
        -> idt_setup_from_table(idt_table, apic_idts, ARRAY_SIZE(apic_idts), true);
        -> int i = FIRST_EXTERNAL_VECTOR; // the first external interrupt vector, 0x20
           for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
              entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
              set_intr_gate(i, entry);
           }
  • idt_table is the IDT whose entries get filled in
  • idt_descr is the descriptor pointer holding the IDT's address; the lidt instruction loads it into the idtr register
  • def_idts[] is the array of common exceptions, apic_idts[] the array of x86 APIC interrupts
  • x86 calls idt_setup_apic_and_irq_gates() via native_init_IRQ() to set up the APIC and interrupt gates
    • For external interrupts, irq_entries_start is the first piece of software that runs during interrupt handling
    • irq_entries_start is in arch/x86/include/asm/idtentry.h and consists of a large assembler macro expansion: .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)

The def_idts[] array

  • struct gate_struct corresponds to an Intel IDT entry; the storage format is shown in the figure above. Note that the handler address is not stored contiguously: it is split across offset_low, offset_middle and offset_high (see the reassembly sketch after the structure definition below)
    • arch/x86/include/asm/desc_defs.h
    enum {  // corresponds to bits 40-43, the gate descriptor type field
            GATE_INTERRUPT = 0xE,
            GATE_TRAP = 0xF,
            GATE_CALL = 0xC,
            GATE_TASK = 0x5,
    };
    ...
    struct idt_bits {
            u16             ist     : 3, // bits 32-34: index into the Interrupt Stack Table
                            zero    : 5, // bits 35-39 are 0
                            type    : 5, // bits 40-43: gate descriptor type; bit 44 is 0
                            dpl     : 2, // bits 45-46: the gate descriptor's DPL
                            p       : 1; // bit 47: the Present bit
    } __attribute__((packed));
    
    struct gate_struct {
            u16             offset_low; // bits 0-15: Offset(0-15)
            u16             segment;    // bits 16-31: segment selector
            struct idt_bits bits;       // bits 32-47: gate descriptor bits, see struct idt_bits
            u16             offset_middle; // bits 48-63: Offset(16-31)
    #ifdef CONFIG_X86_64
            u32             offset_high; // upper 32 bits in 64-bit mode: Offset(32-63)
            u32             reserved;
    #endif
    } __attribute__((packed));
    
    typedef struct gate_struct gate_desc;
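  • To get the full handler address back out of a gate_desc, the three offset fields have to be reassembled; the kernel has a helper along these lines (gate_offset() in the same desc_defs.h), sketched here:
    static inline unsigned long gate_offset(const gate_desc *g)
    {
    #ifdef CONFIG_X86_64
            /* 64-bit mode: three pieces of the handler address */
            return g->offset_low | ((unsigned long)g->offset_middle << 16) |
                   ((unsigned long)g->offset_high << 32);
    #else
            /* 32-bit mode: only the low and middle halves exist */
            return g->offset_low | ((unsigned long)g->offset_middle << 16);
    #endif
    }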
  • Enumeration of exception vectors 0-32
    • arch/x86/include/asm/traps.h
    /* Interrupts/Exceptions */
    enum {
            X86_TRAP_DE = 0,        /*  0, Divide-by-zero */
            X86_TRAP_DB,            /*  1, Debug */
            X86_TRAP_NMI,           /*  2, Non-maskable Interrupt */
            X86_TRAP_BP,            /*  3, Breakpoint */
            X86_TRAP_OF,            /*  4, Overflow */
            X86_TRAP_BR,            /*  5, Bound Range Exceeded */
            X86_TRAP_UD,            /*  6, Invalid Opcode */
            X86_TRAP_NM,            /*  7, Device Not Available */
            X86_TRAP_DF,            /*  8, Double Fault */
            X86_TRAP_OLD_MF,        /*  9, Coprocessor Segment Overrun */
            X86_TRAP_TS,            /* 10, Invalid TSS */
            X86_TRAP_NP,            /* 11, Segment Not Present */
            X86_TRAP_SS,            /* 12, Stack Segment Fault */
            X86_TRAP_GP,            /* 13, General Protection Fault */
            X86_TRAP_PF,            /* 14, Page Fault */
            X86_TRAP_SPURIOUS,      /* 15, Spurious Interrupt */
            X86_TRAP_MF,            /* 16, x87 Floating-Point Exception */
            X86_TRAP_AC,            /* 17, Alignment Check */
            X86_TRAP_MC,            /* 18, Machine Check */
            X86_TRAP_XF,            /* 19, SIMD Floating-Point Exception */
            X86_TRAP_IRET = 32,     /* 32, IRET Exception */
    };
  • The def_idts[] array records the mapping from Linux interrupt vectors to handler functions. It is not the Intel IDT idt_table[]: in particular, an Intel IDT entry stores a segment selector plus an offset, so idt_init_desc() has to convert the format
  • arch/x86/kernel/idt.c
    struct idt_data {
            unsigned int    vector;
            unsigned int    segment;
            struct idt_bits bits;
            const void      *addr;
    };
    
    #define DPL0            0x0
    #define DPL3            0x3
    
    #define DEFAULT_STACK   0
    
    #define G(_vector, _addr, _ist, _type, _dpl, _segment)  \
            {                                               \
                    .vector         = _vector,              \
                    .bits.ist       = _ist,                 \
                    .bits.type      = _type,                \
                    .bits.dpl       = _dpl,                 \
                    .bits.p         = 1,                    \
                    .addr           = _addr,                \
                    .segment        = _segment,             \
            }
    
    /* Interrupt gate */
    #define INTG(_vector, _addr)                            \
            G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL0, __KERNEL_CS)
    
    /* System interrupt gate */
    #define SYSG(_vector, _addr)                            \
            G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL3, __KERNEL_CS)
    
    /* Interrupt gate with interrupt stack */
    #define ISTG(_vector, _addr, _ist)                      \
            G(_vector, _addr, _ist, GATE_INTERRUPT, DPL0, __KERNEL_CS)
    
    /* System interrupt gate with interrupt stack */
    #define SISTG(_vector, _addr, _ist)                     \
            G(_vector, _addr, _ist, GATE_INTERRUPT, DPL3, __KERNEL_CS)
    
    /* Task gate */
    #define TSKG(_vector, _gdt)                             \
            G(_vector, NULL, DEFAULT_STACK, GATE_TASK, DPL0, _gdt << 3)
    
    ...
    /*
     * The default IDT entries which are set up in trap_init() before
     * cpu_init() is invoked. Interrupt stacks cannot be used at that point and
     * the traps which use them are reinitialized with IST after cpu_init() has
     * set up TSS.
     */
    static const __initconst struct idt_data def_idts[] = {
            INTG(X86_TRAP_DE,               divide_error),
            INTG(X86_TRAP_NMI,              nmi),
            INTG(X86_TRAP_BR,               bounds),
            INTG(X86_TRAP_UD,               invalid_op),
            INTG(X86_TRAP_NM,               device_not_available),
            INTG(X86_TRAP_OLD_MF,           coprocessor_segment_overrun),
            INTG(X86_TRAP_TS,               invalid_TSS),
            INTG(X86_TRAP_NP,               segment_not_present),
            INTG(X86_TRAP_SS,               stack_segment),
            INTG(X86_TRAP_GP,               general_protection),
            INTG(X86_TRAP_SPURIOUS,         spurious_interrupt_bug),
            INTG(X86_TRAP_MF,               coprocessor_error),
            INTG(X86_TRAP_AC,               alignment_check),
            INTG(X86_TRAP_XF,               simd_coprocessor_error),
    
    #ifdef CONFIG_X86_32
            TSKG(X86_TRAP_DF,               GDT_ENTRY_DOUBLEFAULT_TSS),
    #else
            INTG(X86_TRAP_DF,               double_fault),
    #endif
            INTG(X86_TRAP_DB,               debug),
    
    #ifdef CONFIG_X86_MCE
            INTG(X86_TRAP_MC,               &machine_check),
    #endif
    
            SYSG(X86_TRAP_OF,               overflow),
    #if defined(CONFIG_IA32_EMULATION)
            SYSG(IA32_SYSCALL_VECTOR,       entry_INT80_compat),
    #elif defined(CONFIG_X86_32)
            SYSG(IA32_SYSCALL_VECTOR,       entry_INT80_32),
    #endif
    };
    ...
  • For example, X86_TRAP_NMI is the second element of the def_idts[] array:
    struct idt_data def_idts[1] =
    {
            .vector         = X86_TRAP_NMI,   // interrupt vector 2
            .bits.ist       = DEFAULT_STACK,  // index into the interrupt stack table
            .bits.type      = GATE_INTERRUPT, // bits 40-43 of the gate descriptor
            .bits.dpl       = DPL0,           // bits 45-46 of the gate descriptor
            .bits.p         = 1,              // bit 47 of the gate descriptor
            .addr           = nmi,
            .segment        = __KERNEL_CS,
    }
  • The handler entry point for the X86_TRAP_NMI vector is nmi; on x86-64 it is defined at ENTRY(nmi) in arch/x86/entry/entry_64.S
    • idt_init_desc() splits this entry address into three parts: offset_low (16 bits), offset_middle (16 bits) and offset_high (32 bits)
  • The entry points of some other vectors are generated by assembler macro invocations such as idtentry invalid_op do_invalid_op has_error_code=0, using .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
  • How x86 fills in the Intel IDT and loads it
    • arch/x86/include/asm/desc.h
    #ifdef CONFIG_PARAVIRT
    #include <asm/paravirt.h>
    #else
    #define load_TR_desc()                          native_load_tr_desc()
    ...
    #define load_idt(dtr)                           native_load_idt(dtr)
    #define load_tr(tr)                             asm volatile("ltr %0"::"m" (tr))
    ...
    #define store_tr(tr)                            (tr = native_store_tr())
    
    #define load_TLS(t, cpu)                        native_load_tls(t, cpu)
    ...
    #define write_idt_entry(dt, entry, g)           native_write_idt_entry(dt, entry, g)
    ...
    #endif  /* CONFIG_PARAVIRT */
    ...
    static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
    {
            memcpy(&idt[entry], gate, sizeof(*gate));
    }
  • idt_descr initially holds the start address of idt_table[]; the address is rewritten in trap_init()
  • arch/x86/kernel/idt.c
    /* Must be page-aligned because the real IDT is used in a fixmap. */
    gate_desc idt_table[IDT_ENTRIES] __page_aligned_bss;
    
    struct desc_ptr idt_descr __ro_after_init = {
            .size           = (IDT_ENTRIES * 2 * sizeof(unsigned long)) - 1,
            .address        = (unsigned long) idt_table,
    };
    
    static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
    {
            unsigned long addr = (unsigned long) d->addr;
    
            gate->offset_low        = (u16) addr;
            gate->segment           = (u16) d->segment;
            gate->bits              = d->bits;
            gate->offset_middle     = (u16) (addr >> 16);
    #ifdef CONFIG_X86_64
            gate->offset_high       = (u32) (addr >> 32);
            gate->reserved          = 0;
    #endif
    }
    
    static void
    idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
    {
            gate_desc desc;
    
            for (; size > 0; t++, size--) {
                    idt_init_desc(&desc, t); // convert to the format an Intel IDT entry requires
                    write_idt_entry(idt, t->vector, &desc);
                    if (sys)
                            set_bit(t->vector, system_vectors);
            }
    }
    
    /**
     * idt_setup_traps - Initialize the idt table with default traps
     */
    void __init idt_setup_traps(void)
    {   // idt_table is the Intel IDT to be filled in, def_idts is the input we supply
        idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
    }
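  • For reference, the struct desc_ptr consumed by lidt is just the packed limit/base pair below (arch/x86/include/asm/desc_defs.h); note that on x86-64 the .size initializer above works out to 256 * 2 * 8 - 1 = 4095, matching the 16-byte gate descriptors of long mode:
    struct desc_ptr {
            unsigned short size;    /* limit: size of the IDT in bytes, minus 1 */
            unsigned long address;  /* linear base address of the IDT */
    } __attribute__((packed)) ;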
  • Every vector in the def_idts[] array also gets its bit set in the system_vectors bitmap
  • arch/x86/kernel/traps.c
    DECLARE_BITMAP(system_vectors, NR_VECTORS);
  • __ro_after_init is a section attribute, defined as follows (a usage sketch follows this code)
  • include/linux/cache.h
    /*
     * __ro_after_init is used to mark things that are read-only after init (i.e.
     * after mark_rodata_ro() has been called). These are effectively read-only,
     * but may get written to during init, so can't live in .rodata (via "const").
     */
    #ifndef __ro_after_init
    #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
    #endif
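  • A hypothetical usage sketch (the variable and init function are made up for illustration): the variable may be written while init code runs, and becomes read-only once mark_rodata_ro() has remapped the section:
    static unsigned long boot_feature_mask __ro_after_init;

    static int __init example_driver_init(void)
    {
            boot_feature_mask = 0x3;  /* allowed: mark_rodata_ro() has not run yet */
            return 0;
    }
    /* A write to boot_feature_mask after init would fault, because the
     * .data..ro_after_init section is by then mapped read-only. */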

Loading the IDT

  • arch/x86/include/asm/pgtable_64_types.h
    /*
     * 4th level page in 5-level paging case
     */
    #define P4D_SHIFT       39
    ...
    #define CPU_ENTRY_AREA_PGD      _AC(-4, UL) //0xfffffffffffffffc
    #define CPU_ENTRY_AREA_BASE     (CPU_ENTRY_AREA_PGD << P4D_SHIFT) //0xfffffe0000000000
  • arch/x86/include/asm/cpu_entry_area.h
    #define CPU_ENTRY_AREA_RO_IDT           CPU_ENTRY_AREA_BASE //0xfffffe0000000000
    ...
    #define CPU_ENTRY_AREA_RO_IDT_VADDR     ((void *)CPU_ENTRY_AREA_RO_IDT)  //0xfffffe0000000000
    ...
  • cea_set_pte() fills in the page-table entry for a given virtual address
    • pa >> PAGE_SHIFT gives the Page Frame Number of the physical address
    • pfn_pte(unsigned long page_nr, pgprot_t pgprot) turns a Page Frame Number plus page flags into a pte_t value (whose .val is, naturally, based on the physical address)
    • set_pte_vaddr(unsigned long vaddr, pte_t pteval) walks the page tables for the given virtual address down to the PTE and sets it to pteval
  • arch/x86/mm/cpu_entry_area.c
    void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
    {
            unsigned long va = (unsigned long) cea_vaddr;
    
            set_pte_vaddr(va, pfn_pte(pa >> PAGE_SHIFT, flags));
    }
  • On the non-paravirtualized branch, load_idt() is simply native_load_idt(), defined in the code shown earlier
  • native_load_idt() uses the lidt instruction to load &idt_descr into the idtr register
  • arch/x86/include/asm/desc.h
    static inline void native_load_idt(const struct desc_ptr *dtr)
    {
            asm volatile("lidt %0"::"m" (*dtr));
    }
    ...
    /*
     * The load_current_idt() must be called with interrupts disabled
     * to avoid races. That way the IDT will always be set back to the expected
     * descriptor. It's also called when a CPU is being initialized, and
     * that doesn't need to disable interrupts, as nothing should be
     * bothering the CPU then.
     */
    static inline void load_current_idt(void)
    {
            if (is_debug_idt_enabled())
                    load_debug_idt();
            else
                    load_idt((const struct desc_ptr *)&idt_descr);
    }
    • arch/x86/kernel/traps.c
    void __init trap_init(void)
    {
            /* Init cpu_entry_area before IST entries are set up */
            setup_cpu_entry_areas();
    
            idt_setup_traps();
    
            /*
             * Set the IDT descriptor to a fixed read-only location, so that the
             * "sidt" instruction will not leak the location of the kernel, and
             * to defend the IDT against arbitrary memory write vulnerabilities.
             * It will be reloaded in cpu_init() */
            cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table),
                        PAGE_KERNEL_RO);
            idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
    
            /*
             * Should be a barrier for any external CPU state:
             */
            cpu_init();
    
            idt_setup_ist_traps();
    
            x86_init.irqs.trap_init();
    
            idt_setup_debugidt_traps();
    }

do_IRQ() on x86

Several related arrays

The irq_desc array / radix tree

  • The interrupt request descriptor struct irq_desc records how the events on each interrupt line are handled, as well as any pending events
  • Note: do not confuse it with the Interrupt Descriptor Table (IDT) used in hardware dispatch — that table associates entry addresses with interrupt vectors
  • struct irq_desc is defined in include/linux/irqdesc.h
  • Initial value of the struct irq_desc irq_desc[NR_IRQS] array
    • The index is the IRQ number, the value is a struct irq_desc instance
    • kernel/irq/irqdesc.c
    struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
            [0 ... NR_IRQS-1] = {
                    .handle_irq     = handle_bad_irq,
                    .depth          = 1,
                    .lock           = __RAW_SPIN_LOCK_UNLOCKED(irq_desc->lock),
            }
    };
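  • Whether descriptors live in this flat array or in a radix tree depends on CONFIG_SPARSE_IRQ, and the lookup helper irq_to_desc() hides the difference. A rough sketch of the two variants as found in older kernels (kernel/irq/irqdesc.c; recent kernels replace the radix tree with a maple tree):
    #ifdef CONFIG_SPARSE_IRQ
    struct irq_desc *irq_to_desc(unsigned int irq)
    {
            /* descriptors are allocated on demand and indexed by IRQ number */
            return radix_tree_lookup(&irq_desc_tree, irq);
    }
    #else
    struct irq_desc *irq_to_desc(unsigned int irq)
    {
            /* the static array shown above */
            return (irq < NR_IRQS) ? irq_desc + irq : NULL;
    }
    #endif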

vector_irq

  • vector_irq, in contrast, is a per-CPU array of pointers to struct irq_desc instances, indexed by vector
  • Declaration
    • arch/x86/include/asm/hw_irq.h
    typedef struct irq_desc* vector_irq_t[NR_VECTORS];
    DECLARE_PER_CPU(vector_irq_t, vector_irq);
  • Initial value
    • arch/x86/kernel/irqinit.c
    DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
            [0 ... NR_VECTORS - 1] = VECTOR_UNUSED,
    };
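  • vector_irq entries are filled in when the x86 vector allocation code (arch/x86/kernel/apic/vector.c) assigns a vector to an interrupt on a CPU; conceptually the update is just the assignment below (a simplified illustration, not the literal kernel code):
    /* after a free `vector` has been picked on `cpu` for this interrupt */
    per_cpu(vector_irq, cpu)[vector] = desc;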

Reaching the common interrupt entry common_interrupt

Call Trace

irq_entries_start
-> jmp common_interrupt
   -> interrupt do_IRQ
      -> call \func
      => call do_IRQ()
  • struct pt_regs is defined in arch/x86/include/uapi/asm/ptrace.h
  • The per-CPU struct pt_regs * variable irq_regs points to the register values saved when the CPU was interrupted.
    • Those values are saved by the assembly entry routine before do_IRQ() is called.
  • arch/x86/include/asm/irq_regs.h
    DECLARE_PER_CPU(struct pt_regs *, irq_regs);
    
    static inline struct pt_regs *get_irq_regs(void)
    {
            return this_cpu_read(irq_regs);
    }
    
    static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
    {
            struct pt_regs *old_regs;
            /* store the address of the interrupted context's saved registers in old_regs, then write the new value into irq_regs */
            old_regs = get_irq_regs();
            this_cpu_write(irq_regs, new_regs);
            /* return the address where the old register values live */
            return old_regs;
    }
  • include/linux/irqdesc.h
    /*
     * Architectures call this to let the generic IRQ layer
     * handle an interrupt.
     */
    static inline void generic_handle_irq_desc(struct irq_desc *desc)
    {       // calls a flow handler such as handle_level_irq() or handle_edge_irq()
            desc->handle_irq(desc);
    }
  • arch/x86/kernel/irq.c
    /*
     * do_IRQ handles all normal device IRQ's (the special
     * SMP cross-CPU interrupts have their own specific
     * handlers).
     */
    __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
    {
            struct pt_regs *old_regs = set_irq_regs(regs);
            struct irq_desc * desc;
            /* high bit used in ret_from_ code  */
            unsigned vector = ~regs->orig_ax;
    
            entering_irq();
    
            /* entering_irq() tells RCU that we're not quiescent.  Check it. */
            RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
            /* look up the interrupt descriptor by interrupt vector */
            desc = __this_cpu_read(vector_irq[vector]);
            if (likely(!IS_ERR_OR_NULL(desc))) {
                    if (IS_ENABLED(CONFIG_X86_32))
                            handle_irq(desc, regs);
                    else
                            generic_handle_irq_desc(desc); // invoke the handler attached to this interrupt descriptor
            } else {
                    ack_APIC_irq();
    
                    if (desc == VECTOR_UNUSED) {
                            pr_emerg_ratelimited("%s: %d.%d No irq handler for vector\n",
                                                 __func__, smp_processor_id(),
                                                 vector);
                    } else {
                            __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
                    }
            }
    
            exiting_irq();
    
            set_irq_regs(old_regs);
            return 1;
    }
  • arch/x86/entry/entry_64.S
    /*
     * Build the entry stubs with some assembler magic.
     * We pack 1 stub into every 8-byte block.
     */
            .align 8
    ENTRY(irq_entries_start)
        vector=FIRST_EXTERNAL_VECTOR
        .rept (FIRST_SYSTEM_VECTOR - FIRST_EXTERNAL_VECTOR)
            pushq   $(~vector+0x80)                 /* Note: always in signed byte range */
        vector=vector+1
            jmp     common_interrupt  /* jump to the common x86 assembly interrupt handling */
            .align  8
        .endr
    END(irq_entries_start)
    
    /*
     * Interrupt entry/exit.
     *
     * Interrupt entry points save only callee clobbered registers in fast path.
     *
     * Entry runs with interrupts off.
     */
    
    /* 0(%rsp): ~(interrupt number) */
            .macro interrupt func
            cld
            ALLOC_PT_GPREGS_ON_STACK
            SAVE_C_REGS
            SAVE_EXTRA_REGS
            ENCODE_FRAME_POINTER
    
            testb   $3, CS(%rsp)
            jz      1f
    
            /*
             * IRQ from user mode.  Switch to kernel gsbase and inform context
             * tracking that we're in kernel mode.
             */
            SWAPGS
    
            /*
             * We need to tell lockdep that IRQs are off.  We can't do this until
             * we fix gsbase, and we should do it before enter_from_user_mode
             * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
             * the simplest way to handle it is to just call it twice if
             * we enter from user mode.  There's no reason to optimize this since
             * TRACE_IRQS_OFF is a no-op if lockdep is off.
             */
            TRACE_IRQS_OFF
    
            CALL_enter_from_user_mode
    
    1:
            /*
             * Save previous stack pointer, optionally switch to interrupt stack.
             * irq_count is used to check if a CPU is already on an interrupt stack
             * or not. While this is essentially redundant with preempt_count it is
             * a little cheaper to use a separate counter in the PDA (short of
             * moving irq_enter into assembly, which would be too much work)
             */
            movq    %rsp, %rdi
            incl    PER_CPU_VAR(irq_count)
            cmovzq  PER_CPU_VAR(irq_stack_ptr), %rsp /* switching to the interrupt stack is done by the kernel, not the hardware */
            pushq   %rdi
            /* We entered an interrupt context - irqs are off: */
            TRACE_IRQS_OFF
            /* call the common x86 C interrupt handler, listed above */
            call    \func   /* rdi points to pt_regs */
            .endm
    
            /*
             * The interrupt stubs push (~vector+0x80) onto the stack and
             * then jump to common_interrupt.
             */
            .p2align CONFIG_X86_L1_CACHE_SHIFT
    common_interrupt:
            ASM_CLAC
            addq    $-0x80, (%rsp)                  /* Adjust vector to [-256, -1] range */
            interrupt do_IRQ       /* the .macro interrupt shown above expands here */
            /* 0(%rsp): old RSP */
    ret_from_intr:                 /* note: execution falls straight through, so when do_IRQ returns these instructions run next */
            DISABLE_INTERRUPTS(CLBR_ANY)
            TRACE_IRQS_OFF
            decl    PER_CPU_VAR(irq_count)
    
            /* Restore saved previous stack */
            popq    %rsp          /* pop the previously saved stack pointer back into the stack pointer register */
    
            testb   $3, CS(%rsp)  /* CS is the macro 17*8, an offset from the stack pointer; read the interrupted context's saved %cs from the stack to decide whether to return to user space or kernel space */
            jz      retint_kernel
    
            /* Interrupt came from user space */
    GLOBAL(retint_user)
            mov     %rsp,%rdi
            call    prepare_exit_to_usermode
            TRACE_IRQS_IRETQ
            SWAPGS
            jmp     restore_regs_and_iret
    
    /* Returning to kernel space */
    retint_kernel:
    #ifdef CONFIG_PREEMPT
            /* Interrupts are off */
            /* Check if we need preemption */
            bt      $9, EFLAGS(%rsp)                /* were interrupts off? */ /* test bit 9 of the saved EFLAGS (IF); its value ends up in CF */
            jnc     1f  /* checking CF means checking IF: if CF=0, interrupts were off, so jump forward to 1 and do not preempt; otherwise continue */
    0:      cmpl    $0, PER_CPU_VAR(__preempt_count) /* read the preempt count to see whether kernel preemption is allowed */
            jnz     1f                    /* preempt count non-zero: jump to label 1 and resume the interrupted code */
            call    preempt_schedule_irq  /* preempt count zero: trigger kernel preemption; this is one of the kernel preemption points */
            jmp     0b   /* when preempt_schedule_irq returns, jump back to label 0 and re-check the preempt count */
    1:
    #endif
            /*
             * The iretq could re-enable interrupts:
             */
            TRACE_IRQS_IRETQ
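  • The vector number takes a small detour in this older code path: each stub pushes ~vector + 0x80 so the immediate fits in a signed byte, common_interrupt then subtracts 0x80 again, and do_IRQ() finally undoes the complement with ~regs->orig_ax. The little stand-alone C sketch below just traces that arithmetic for one example vector:
    #include <stdio.h>

    int main(void)
    {
            unsigned long vector = 0x31;           /* an arbitrary external vector */
            unsigned long pushed = ~vector + 0x80; /* what the stub pushes: 0x4e, within signed byte range */
            unsigned long orig_ax = pushed - 0x80; /* after "addq $-0x80, (%rsp)": back to ~vector */
            unsigned int recovered = ~orig_ax;     /* what do_IRQ() computes */

            printf("pushed=%#lx orig_ax=%#lx recovered=%#x\n", pushed, orig_ax, recovered);
            return 0;                              /* prints recovered=0x31 */
    }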

Implementation in newer kernels

  • The implementation of irq_entries_start differs here: it jumps to the common interrupt entry asm_common_interrupt
  • arch/x86/include/asm/idtentry.h
/*
 * ASM code to emit the common vector entry stubs where each stub is
 * packed into IDT_ALIGN bytes.
 *
 * Note, that the 'pushq imm8' is emitted via '.byte 0x6a, vector' because
 * GCC treats the local vector variable as unsigned int and would expand
 * all vectors above 0x7F to a 5 byte push. The original code did an
 * adjustment of the vector number to be in the signed byte range to avoid
 * this. While clever it's mindboggling counterintuitive and requires the
 * odd conversion back to a real vector number in the C entry points. Using
 * .byte achieves the same thing and the only fixup needed in the C entry
 * point is to mask off the bits above bit 7 because the push is sign
 * extending.
 */
    .align IDT_ALIGN
SYM_CODE_START(irq_entries_start)
    vector=FIRST_EXTERNAL_VECTOR
    .rept NR_EXTERNAL_VECTORS
    UNWIND_HINT_IRET_REGS
0 :
    ENDBR
    .byte   0x6a, vector
    jmp asm_common_interrupt
    /* Ensure that the above is IDT_ALIGN bytes max */
    .fill 0b + IDT_ALIGN - ., 1, 0xcc
    vector = vector+1
    .endr
SYM_CODE_END(irq_entries_start)
...
#define DECLARE_IDTENTRY_ERRORCODE(vector, func)            \
    idtentry vector asm_##func func has_error_code=1
...
/**
 * DECLARE_IDTENTRY_IRQ - Declare functions for device interrupt IDT entry
 *            points (common/spurious)
 * @vector: Vector number (ignored for C)
 * @func:   Function name of the entry point
 *
 * Maps to DECLARE_IDTENTRY_ERRORCODE()
 */
#define DECLARE_IDTENTRY_IRQ(vector, func)              \
    DECLARE_IDTENTRY_ERRORCODE(vector, func)
...
/* Entries for common/spurious (device) interrupts */
#define DECLARE_IDTENTRY_IRQ(vector, func)              \
    idtentry_irq vector func
...
/* Device interrupts common/spurious */
DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,    common_interrupt);
  • asm_common_interruptidtentry 汇编宏生成,最终会调用 common_interrupt()
  • arch/x86/kernel/irq.c
/*
 * common_interrupt() handles all normal device IRQ's (the special SMP
 * cross-CPU interrupts have their own entry points).
 */
DEFINE_IDTENTRY_IRQ(common_interrupt)
{
    struct pt_regs *old_regs = set_irq_regs(regs);
    struct irq_desc *desc;

    /* entry code tells RCU that we're not quiescent.  Check it. */
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");

    desc = __this_cpu_read(vector_irq[vector]);
    if (likely(!IS_ERR_OR_NULL(desc))) {
        handle_irq(desc, regs);
    } else {
        apic_eoi();

        if (desc == VECTOR_UNUSED) {
            pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
                         __func__, smp_processor_id(),
                         vector);
        } else {
            __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
        }
    }

    set_irq_regs(old_regs);
}
  • The common_interrupt() shown above is actually __common_interrupt(); the real common_interrupt() is generated by the DEFINE_IDTENTRY_IRQ macro
/**
 * DEFINE_IDTENTRY_IRQ - Emit code for device interrupt IDT entry points
 * @func:   Function name of the entry point
 *
 * The vector number is pushed by the low level entry stub and handed
 * to the function as error_code argument which needs to be truncated
 * to an u8 because the push is sign extending.
 *
 * irq_enter/exit_rcu() are invoked before the function body and the
 * KVM L1D flush request is set. Stack switching to the interrupt stack
 * has to be done in the function body if necessary.
 */
#define DEFINE_IDTENTRY_IRQ(func)                   \
static void __##func(struct pt_regs *regs, u32 vector);         \
                                    \
__visible noinstr void func(struct pt_regs *regs,           \
                unsigned long error_code)           \
{                                   \
    irqentry_state_t state = irqentry_enter(regs);          \
    u32 vector = (u32)(u8)error_code;               \
                                    \
    instrumentation_begin();                    \
    kvm_set_cpu_l1tf_flush_l1d();                   \
    run_irq_on_irqstack_cond(__##func, regs, vector);       \
    instrumentation_end();                      \
    irqentry_exit(regs, state);                 \
}                                   \
                                    \
static noinline void __##func(struct pt_regs *regs, u32 vector)
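  • Note the line u32 vector = (u32)(u8)error_code above: the stub pushed the vector as a sign-extending pushq imm8 (.byte 0x6a, vector), so vectors above 0x7F arrive with the upper bits set and must be truncated back to a u8. A tiny stand-alone sketch of that round trip:
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint8_t vector = 0xec;                          /* some vector above 0x7f */
        long error_code = (int8_t)vector;               /* pushq imm8 sign-extends: 0xff...ffec */
        uint32_t fixed = (uint32_t)(uint8_t)error_code; /* the DEFINE_IDTENTRY_IRQ fixup */

        printf("error_code=%#lx vector=%#x\n", (unsigned long)error_code, fixed);
        return 0;                                       /* prints vector=0xec */
}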
  • At its end, common_interrupt() calls the C function irqentry_exit(), which now does most of what the old assembly code did, including the kernel preemption point taken once the interrupt has been handled
    • kernel/entry/common.c
void raw_irqentry_exit_cond_resched(void)
{
    if (!preempt_count()) {
        /* Sanity check RCU and thread stack */
        rcu_irq_exit_check_preempt();
        if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
            WARN_ON_ONCE(!on_thread_stack());
        if (need_resched())
            preempt_schedule_irq();
    }
}
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
DEFINE_STATIC_CALL(irqentry_exit_cond_resched, raw_irqentry_exit_cond_resched);
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
DEFINE_STATIC_KEY_TRUE(sk_dynamic_irqentry_exit_cond_resched);
void dynamic_irqentry_exit_cond_resched(void)
{
    if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
        return;
    raw_irqentry_exit_cond_resched();
}
#endif
#endif

noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
    /* Check whether this returns to user mode */
    if (user_mode(regs)) {
        irqentry_exit_to_user_mode(regs);
    } else if (!regs_irqs_disabled(regs)) {
...
        instrumentation_begin();
        if (IS_ENABLED(CONFIG_PREEMPTION))
            irqentry_exit_cond_resched();

        /* Covers both tracing and lockdep */
        trace_hardirqs_on();
        instrumentation_end();
...
    }
}
  • There is no need to worry about the interrupt return path being preempted in the kernel. For example:
  1. Process A is interrupted while in kernel mode; its context a1 is pushed onto A's kernel stack by error_entry, and the interrupt is handled
  2. Skipping the switch to the per-CPU interrupt stack: once __common_interrupt() finishes, execution switches back to A's kernel stack (see the implementation of the call_on_stack macro)
  3. Kernel preemption happens at irqentry_exit() before the interrupt returns, and process B preempts A; A's context a2 is saved on A's kernel stack (by __switch_to_asm) and in its PCB (by __switch_to())
  4. Process B is interrupted in turn, and kernel preemption happens again before that interrupt returns; suppose A preempts B back
  5. A's context a2 is restored from A's kernel stack (__switch_to_asm) and PCB (__switch_to()); if no further preemption occurs, A keeps running and returns from irqentry_exit()
  6. Finally, error_return brings process A back to the context a1 in which it was originally interrupted
idtentry
    idtentry_body
        call    \cfunc
        jmp error_return
  • arch/x86/entry/entry_64.S
SYM_CODE_START_LOCAL(error_return)
    UNWIND_HINT_REGS
    DEBUG_ENTRY_ASSERT_IRQS_OFF
    testb   $3, CS(%rsp)
    jz  restore_regs_and_return_to_kernel
    jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(error_return)