Title: Per-Process Memory Bandwidth Management for Heterogeneous Memory Systems
Authors: Lukas Werling, Daniel Habicht, Frank Bellosa
E-Mail: lukas.werling@kit.edu
Affiliation: Karlsruhe Institute of Technology

Abstract:
Contention at memory controllers, whether at the CPU or on memory devices, is a well-known issue that can lead to high tail latencies or unfairness between processes. To mitigate these issues, hardware mechanisms such as Intel Memory Bandwidth Monitoring (MBM) have been introduced to monitor and limit the total memory bandwidth of individual processes.

Heterogeneous memory systems introduce another dimension to this problem, as different memory technologies show different behavior. For example, write bandwidth to Intel Optane persistent memory drops drastically under parallel load. Initial measurements of CXL memory devices show similar behavior. Consequently, it is no longer sufficient to control the total memory bandwidth. Instead, we need a system that can monitor memory accesses per process as well as per technology.

We propose sampling as a way of obtaining the required information. We use Intel PEBS to sample all retired store instructions. PEBS periodically captures additional information for an event and writes it into a memory buffer. For each sampled store, we check the target address to attribute the write to the memory technology mapped at that address. By disassembling the sampled store instruction to obtain the operand size, we avoid overcounting fast store instructions with small operands. We currently monitor load instructions with normal performance counters, since events that distinguish between memory technologies are available for loads.

In order to implement policies for managing memory bandwidth, we need to make the accounting data collected in the kernel available to user space at low latency. Our kernel module provides a shared memory buffer for each core that contains accounting information for the process currently running there. In combination with counters for memory request queueing delay, a policy program can use this data to detect overload and take appropriate action. We currently implement a policy that mitigates Optane write overload by confining write-heavy processes to a small number of cores. We plan to extend our policy to take read bandwidth as well as DRAM accesses into account.

We evaluate the accuracy and overhead of our accounting mechanism with a microbenchmark. We show that our sampling mechanism consistently reports around 2% fewer written bytes. Additionally, we measure the resulting bandwidth and CPU utilization with and without accounting and observe a very low overhead of 0.2% on average. We further show that we can provide measurements to the policy program at low latency, on the order of microseconds. This is critical to allow a fast reaction to changing memory bandwidth demands.

Language of the Presentation: German
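
The following sketch illustrates the store-sampling step described in the abstract: a precise (PEBS) event is configured through perf_event_open() with PERF_SAMPLE_ADDR so that each sampled retired store carries its data address. It is not the authors' kernel module; the raw event encoding (MEM_INST_RETIRED.ALL_STORES, event 0xD0, umask 0x82 on recent Intel cores) and the sampling period are illustrative assumptions, and ring-buffer parsing as well as the per-technology accounting are only outlined in comments.

    /* Minimal sketch: PEBS-precise sampling of retired stores with their
     * data addresses. Event encoding and period are assumptions. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));

        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x82d0;        /* assumed: MEM_INST_RETIRED.ALL_STORES */
        attr.sample_period = 10007;  /* sample roughly every 10,000 stores */
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;
        attr.precise_ip = 2;         /* request PEBS-precise samples */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0 /* calling process */, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* A real implementation would mmap() the perf ring buffer, parse
         * PERF_RECORD_SAMPLE entries, decide for each sampled data address
         * whether it lies in a DRAM or persistent-memory mapping, and add
         * the store's operand size (obtained by disassembling the
         * instruction at the sampled IP) to the per-technology counter. */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        close(fd);
        return 0;
    }

Similarly, a minimal sketch of the confinement step, assuming a hypothetical helper confine_to_cores() that a policy program could call once it detects Optane write overload; pid and the core range are assumed inputs:

    /* Hypothetical helper: restrict a write-heavy process to a few cores,
     * capping its number of parallel write streams. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    static int confine_to_cores(pid_t pid, int first_cpu, int ncpus)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < ncpus; i++)
            CPU_SET(first_cpu + i, &set);
        return sched_setaffinity(pid, sizeof(set), &set);
    }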