How Can IOMMU Event Tracing Help You?

This is an article in a serious on IOMMU Event Tracing:

Input/Output Memory Management Unit (IOMMU) event tracing can be extremely beneficial when debugging IOMMU hardware, BIOS, and firmware problems. In addition, IOMMU event tracing can be used to debug  Linux kernel IOMMU driver, device assignment, and performance problems related to device assignment in virtualized and non-virtualized environments.

If you aren’t familiar with IOMMU Event Tracing, the first article in this series covered the fundamental concepts behind it, in addition to how tracing can be used to track information about devices as they are moved between guest and host environments. This article will focus on how to use IOMMU event tracing effectively, and will provide a few examples of IOMMU event tracing in action.

How to Enable IOMMU Event Tracing at Boot-Time

IOMMU trace events can be enabled using the Kernel boot option trace_event. The following enables all IOMMU trace events at boot-time:

trace_event=iommu

The following enables map and unmap events at boot-time:

trace_event=map,unmap

How to Enable IOMMU Event Tracing at Run-Time

A single IOMMU event or multiple events can be enabled at run-time. The following enables a single event:

cd /sys/kernel/debug/trace/events echo 1 > iommu/event_name_file

The following will enable all events:

for i in $(find /sys/kernel/debug/tracing/events/iommu/ -name enable); do echo 1 > $i; done 

Where are The Traces?

Traces can be found in /sys/kernel/debug/tracing/trace file. The following shows the trace format. For more details on tracing, please refer to Documentation/trace. This directory contains the tracing documentation.


# tracer: nop
#
# entries-in-buffer/entries-written: 18/18   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

Traces provide insight into the state of the CPU on which the task is running on. The individual fields such as the irq-off, need-resched, and preempt-depth delay help debug problems. For example, long runs of a task with need-resched set might indicate problems in the code paths that could result in bad response times for other tasks running on the system. These problems could be solved by fixing the relevant code paths.

What do IOMMU Group Event Traces Look Like?

The following group event traces are from a test system with Intel VT-d support. These events show the device and group mapping.

# tracer: nop
#
# entries-in-buffer/entries-written: 18/18   #P:8
#
#                     _-----=> irqs-off
#                      / _----=> need-resched
#                     | / _---=> hardirq/softirq
#                     || / _--=> preempt-depth
#                     ||| /     delay
#    TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#       | |       |   ||||       |         |
swapper/0-1     [000] ....     1.899609: add_device_to_group: IOMMU: groupID=0 device=0000:00:00.0
swapper/0-1     [000] ....     1.899619: add_device_to_group: IOMMU: groupID=1 device=0000:00:01.0
swapper/0-1     [000] ....     1.899624: add_device_to_group: IOMMU: groupID=2 device=0000:00:02.0
swapper/0-1     [000] ....     1.899629: add_device_to_group: IOMMU: groupID=3 device=0000:00:03.0
swapper/0-1     [000] ....     1.899634: add_device_to_group: IOMMU: groupID=4 device=0000:00:14.0
swapper/0-1     [000] ....     1.899642: add_device_to_group: IOMMU: groupID=5 device=0000:00:16.0
swapper/0-1     [000] ....     1.899647: add_device_to_group: IOMMU: groupID=6 device=0000:00:1a.0
swapper/0-1     [000] ....     1.899651: add_device_to_group: IOMMU: groupID=7 device=0000:00:1b.0
swapper/0-1     [000] ....     1.899656: add_device_to_group: IOMMU: groupID=8 device=0000:00:1c.0
swapper/0-1     [000] ....     1.899661: add_device_to_group: IOMMU: groupID=9 device=0000:00:1c.2
swapper/0-1     [000] ....     1.899668: add_device_to_group: IOMMU: groupID=10 device=0000:00:1c.3
swapper/0-1     [000] ....     1.899674: add_device_to_group: IOMMU: groupID=11 device=0000:00:1d.0
swapper/0-1     [000] ....     1.899682: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.0
swapper/0-1     [000] ....     1.899687: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.2
swapper/0-1     [000] ....     1.899692: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.3
swapper/0-1     [000] ....     1.899696: add_device_to_group: IOMMU: groupID=13 device=0000:02:00.0
swapper/0-1     [000] ....     1.899701: add_device_to_group: IOMMU: groupID=14 device=0000:03:00.0
swapper/0-1     [000] ....     1.899704: add_device_to_group: IOMMU: groupID=10 device=0000:04:00.0

What Does lspci Show?

The following lspci output from the same system gives you information on each of the devices IOMMU found and classified into groups.


00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)

00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)

00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06

00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)

00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)

00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)

00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)

00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)

00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)

00:1c.3 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d5)

00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)

00:1f.0 ISA bridge: Intel Corporation H87 Express LPC Controller (rev 05)

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)

00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)

02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)

04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)

The following images depict this lspci output:

How Can IOMMU Event Tracing Help You - device_topo1

How Can IOMMU Event Tracing Help You - device_topo2

How Can IOMMU Event Tracing Help You - device_topo3

IOMMU Groups and Device Topology

While comparing IOMMU traces with lspci device topology and IOMMU topology, you will notice that some groups contain more than one device. For example, ISA bridge, SATA controller, and SMBus are placed in group 12, and a PCI bridge and a PCIe root port are placed in group 10. You will also notice that the Network Controller (02:00.0) is in a separate Group 13 even though it is also under the PCI bridge (00:1c.0) and PCI Root Port #1 device hierarchy which are in Group 8.

This is a good example of a device isolation under a PCI bridge, and the PCI Root Port hierarchy. The PCI bridge and root port are placed in the same group, whereas the network controller is in its own group and can be assigned to a VM. In the case of the PCI bridge (00:1c.3), PCI root port #3, and PCI bridge (04:00.0) hierarchy, all of them are placed in group 10. Devices that have dependencies on each other are usually placed in a group together so they can be isolated as a group. The following shows the IOMMU topology derived from the trace events generated as devices get added to individual groups during boot-time.

How Can IOMMU Event Tracing Help You - IOMMU-Topology
IOMMU Topology

What do IOMMU Device Event Traces Look Like?

# tracer: nop
#
# entries-in-buffer/entries-written: 5689868/5689868   #P:8
#
#                        _-----=> irqs-off
#                         / _----=> need-resched
#                        | / _---=> hardirq/softirq
#                        || / _--=> preempt-depth
#                        ||| /     delay
#       TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#          | |       |   ||||       |         |
    qemu-kvm-28546 [003] ....  1804.692631: attach_device_to_domain: IOMMU: device=0000:00:1c.0
    qemu-kvm-28546 [003] ....  1804.692635: attach_device_to_domain: IOMMU: device=0000:00:1c.4
    qemu-kvm-28546 [003] ....  1804.692643: attach_device_to_domain: IOMMU: device=0000:05:00.0
    qemu-kvm-28546 [003] ....  1804.692666: detach_device_from_domain: IOMMU: device=0000:00:1c.0
    qemu-kvm-28546 [003] ....  1804.692671: detach_device_from_domain: IOMMU: device=0000:00:1c.4
    qemu-kvm-28546 [003] ....  1804.692676: detach_device_from_domain: IOMMU: device=0000:05:00.0

What do IOMMU map/unmap Event Traces Look Like?

# tracer: nop
#
# entries-in-buffer/entries-written: 54/54   #P:8
#
#                    _-----=> irqs-off
#                     / _----=> need-resched
#                    | / _---=> hardirq/softirq
#                    || / _--=> preempt-depth
#                    ||| /     delay
#   TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#      | |       |   ||||       |         |
qemu-kvm-28546 [002] ....  1804.480679: map: IOMMU: iova=0x00000000000a0000 paddr=0x00000000446a0000 size=4096
qemu-kvm-28547 [006] ....  1809.032767: unmap: IOMMU: iova=0x00000000000c1000  size=4096 unmapped_size=4096

Great, We Have Traces! Using Traces to Solve Problems

Using traces we can get insight into IOMMU device topology to see which devices belong to which groups, and run-time device assignment changes as devices move from host to guests and back to host. In turn, this makes it much easier to debug IOMMU problems, Device assignment problems, Detect and solve performance problems, and BIOS and firmware problems related to IOMMU hardware and firmware implementation.

VFIO Based Device Assignment Use-Case

Alex Williamson, a VFIO maintainer, enabled IOMMU traces for vfio-based device assignment and found the following VFIO problems:

  • A large number of unmap calls were being made on VT-d system without IOMMU super-page support. This is because VFIO unmap path is not optimized on a VT-d system without IOMMU super-page support. As a result, each single page is unmapped individually since the current unmap path optimization relies on IOMMU super-page support.
  • Unnecessary single page mappings for invalid and reserved memory regions, like mappings of MMIO BARs.
  • Too many instances of very long tasks being run with needs-resched set.

Result 1: VFIO Patch Series to Fix Problems!

The problems above were fixed, resulting in a reduction in the number of unmap calls to 2% of the original on Intel VT-d without IOMMU super-page support. Before the fix, traces showed 472,574 maps, and 5,217,244 unmaps. Unmaps were more than 10 times greater than the number of maps! After the fix, traces showed 9509 maps, and 9509 unmaps, an extremely significant reduction. Additionally, more sporadic long task runs were added with needs-resched set

Result 2: Improvements to the IOMMU Tracing Feature

Alex also found a few bugs and improvements that could be made to the IOMMU tracing API. I would like to acknowledge Alex for using IOMMU tracing for VFIO based device assignments and for his feedback on improving the IOMMU Event Tracing API. The following fixes and improvements to IOMMU tracing API went into Linux 4.0

  • trace_iommu_map() should report original iova and size the map routine is called with as opposed to the iova and size at the end of mapping.
  • trace_iommu_unmap() should report original iova, size, and unmapped size.
  • Size field is handled as int and could overflow

Does Run-Time Tracing Add Overhead?

If you are wondering, what kind of overhead IOMMU tracing code introduces. Tracepoint code can be triggered to be included at run-time only when the tracepoint is enabled. In other words, tracepoint code is inactive unless it is enabled. When it is enabled, code is modified to include the tracepoint code. It doesn’t add any conditional logic overhead to determine whether or not to generate a trace message. The tracepoints use jump-labels which is a code modification of a branch.

When it is disabled, the code path looks like:

[ code ]
nop
back:
[ code ]
return;
tracepoint:
[ tracepoint code ]
jmp back;
When it is enabled, the code path looks like: (notice how the tracepoint code appears in the code path below)

[ code ]
jmp tracepoint
back:
[ code ]
return;
tracepoint:
[ tracepoint code ]
jmp back;

Future Enhancements & Closing Thoughts

At the moment, IO Page Faults are the only type of autonomous errors that are traced. In the future, additional errors and faults could be traced, but this would largely depend on the support for such reports funneling into the IOMMU drivers from the IOMMU hardware and firmware layers. One area that has been considered is the addition of error reporting and tracing to ARM IOMMU drivers if the underlying IOMMU hardware and firmware supports autonomous error reporting to the kernel.

I hope this article will help Linux kernel developers and users learn about a feature that can aid during development, maintenance, and support activities on systems with IOMMU hardware support. Reading this article, Linux kernel developers and users how to enable IOMMY event tracing to get reports on IOMMU boot-time and run-time events, and errors. Please see the following references to learn more about the use of IOMMUs in virtualized Linux environments and Linux VFIO PCI device assignment feature.

References

Author: Shuah Khan

Shuah contributes to multiple aspects of the Linux Kernel, and she maintains the Kernel Selftest framework.