[PATCH] common/mlx5: fix high SMMU TLB miss with mempool alignment

yangxingui yangxingui at huawei.com
Tue Jun 23 09:11:00 CEST 2026


Friendly ping...

On 2026/6/12 15:14, Xingui Yang wrote:
> From: Shuaisong Yang <yangshuaisong at h-partners.com>
> 
> On a Kunpeng SoC with a ConnectX-7 (CX7) NIC, dpdk-l3fwd with
> intra-NUMA core pinning shows about 30% lower performance than with
> cross-NUMA pinning when the SMMU runs in strict or non-strict mode.
> With the SMMU disabled or in passthrough mode, intra-NUMA pinning
> performs as expected (slightly better than cross-NUMA).
> 
> The CX7 is attached to NUMA node 1:
> NUMA node0 CPU(s):    0-39
> NUMA node1 CPU(s):    40-79
> 
> intra-NUMA:
> dpdk-l3fwd -l 40-55 -n 4 -a 0000:17:00.1,mprq_en=1 -- -p 0x1 -P \
>    --config='(0,0,40),(0,1,41),(0,2,42),(0,3,43),(0,4,44),\
>              (0,5,45),(0,6,46),(0,7,47),(0,8,48),(0,9,49),\
>              (0,10,50),(0,11,51),(0,12,52),(0,13,53),\
>              (0,14,54),(0,15,55)' \
>    --rx-queue-size=4096 --tx-queue-size=4096 --rx-burst=64
> 
> cross-NUMA:
> dpdk-l3fwd -l 11-26 -n 4 -a 0000:17:00.1,mprq_en=1 -- -p 0x1 -P \
>    --config='(0,0,11),(0,1,12),(0,2,13),(0,3,14),(0,4,15),\
>              (0,5,16),(0,6,17),(0,7,18),(0,8,19),(0,9,20),\
>              (0,10,21),(0,11,22),(0,12,23),(0,13,24),\
>              (0,14,25),(0,15,26)' \
>    --rx-queue-size=4096 --tx-queue-size=4096 --rx-burst=64
> 
> The root cause is that, with the SMMU enabled, the mempool allocated
> for intra-NUMA pinning is aligned to the system page size instead of
> the hugepage size, while cross-NUMA pinning correctly uses hugepage
> size alignment. The smaller alignment causes a high SMMU TLB miss rate.
> 
> During mempool registration, align all memory ranges to hugepage
> boundaries when they all reside on hugepages of the same size, and
> merge ranges that overlap or touch after alignment. This reduces
> SMMU TLB misses and fixes the intra-NUMA performance degradation.
> 
> Fixes: 690b2a88c2f7 ("common/mlx5: add mempool registration facilities")
> Cc: stable at dpdk.org
> 
> Signed-off-by: Shuaisong Yang <yangshuaisong at h-partners.com>
> Signed-off-by: Xingui Yang <yangxingui at huawei.com>
> ---
>   .mailmap                             |  1 +
>   drivers/common/mlx5/mlx5_common_mr.c | 53 +++++++++++++++++++---------
>   2 files changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4001e5fb0e..e13e88db1b 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1979,3 +1979,4 @@ Zongyu Wu <wuzongyu1 at huawei.com>
>   Zorik Machulsky <zorik at amazon.com>
>   Zyta Szpak <zyta at marvell.com> <zr at semihalf.com>
>   Zyta Szpak <zyta at marvell.com> <zyta.szpak at semihalf.com>
> +Shuaisong Yang <yangshuaisong at h-partners.com>
> diff --git a/drivers/common/mlx5/mlx5_common_mr.c b/drivers/common/mlx5/mlx5_common_mr.c
> index aa2d5e88a4..aee037abb4 100644
> --- a/drivers/common/mlx5/mlx5_common_mr.c
> +++ b/drivers/common/mlx5/mlx5_common_mr.c
> @@ -1524,7 +1524,9 @@ mlx5_get_mempool_ranges(struct rte_mempool *mp, bool is_extmem,
>    * @param[in] is_extmem
>    *   Whether the pool is contains only external pinned buffers.
>    * @param[out] out
> - *   Receives memory ranges to register, aligned to the system page size.
> + *   Receives memory ranges to register. Aligned to the hugepage size
> + *   if all ranges reside on hugepages of the same size,
> + *   otherwise aligned to the system page size.
>    *   The caller must release them with free().
>    * @param[out] out_n
>    *   Receives the number of @p out items.
> @@ -1541,7 +1543,9 @@ mlx5_mempool_reg_analyze(struct rte_mempool *mp, bool is_extmem,
>   {
>   	struct mlx5_range *ranges = NULL;
>   	unsigned int i, ranges_n = 0;
> +	bool same_hugepage_sz = true;
>   	struct rte_memseg_list *msl;
> +	uint64_t hugepage_sz = 0;
>   
>   	if (mlx5_get_mempool_ranges(mp, is_extmem, &ranges, &ranges_n) < 0) {
>   		DRV_LOG(ERR, "Cannot get address ranges for mempool %s",
> @@ -1552,28 +1556,43 @@ mlx5_mempool_reg_analyze(struct rte_mempool *mp, bool is_extmem,
>   	*share_hugepage = false;
>   	msl = rte_mem_virt2memseg_list((void *)ranges[0].start);
>   	if (msl != NULL) {
> -		uint64_t hugepage_sz = 0;
> +		hugepage_sz = msl->page_sz;
>   
>   		/* Check that all ranges are on pages of the same size. */
>   		for (i = 0; i < ranges_n; i++) {
> -			if (hugepage_sz != 0 && hugepage_sz != msl->page_sz)
> +			struct rte_memseg_list *range_msl;
> +			range_msl = rte_mem_virt2memseg_list(
> +					(void *)ranges[i].start);
> +			if (range_msl == NULL ||
> +			    range_msl->page_sz != hugepage_sz) {
> +				same_hugepage_sz = false;
>   				break;
> -			hugepage_sz = msl->page_sz;
> +			}
>   		}
> -		if (i == ranges_n) {
> -			/*
> -			 * If the entire pool is within one hugepage,
> -			 * combine all ranges into one of the hugepage size.
> -			 */
> -			uintptr_t reg_start = ranges[0].start;
> -			uintptr_t reg_end = ranges[ranges_n - 1].end;
> -			uintptr_t hugepage_start =
> -				RTE_ALIGN_FLOOR(reg_start, hugepage_sz);
> -			uintptr_t hugepage_end = hugepage_start + hugepage_sz;
> -			if (reg_end < hugepage_end) {
> -				ranges[0].start = hugepage_start;
> +	}
> +	if (same_hugepage_sz && hugepage_sz > 0) {
> +		unsigned int orig_ranges_n = ranges_n;
> +
> +		for (i = 0; i < ranges_n; i++) {
> +			ranges[i].start = RTE_ALIGN_FLOOR(ranges[i].start,
> +							  hugepage_sz);
> +			ranges[i].end = RTE_ALIGN_CEIL(ranges[i].end,
> +							hugepage_sz);
> +		}
> +		ranges_n = 1;
> +		for (i = 1; i < orig_ranges_n; i++) {
> +			if (ranges[ranges_n - 1].end >= ranges[i].start)
> +				ranges[ranges_n - 1].end =
> +					RTE_MAX(ranges[ranges_n - 1].end,
> +						ranges[i].end);
> +			else
> +				ranges[ranges_n++] = ranges[i];
> +		}
> +		if (ranges_n == 1) {
> +			uintptr_t hugepage_end = ranges[0].start + hugepage_sz;
> +
> +			if (ranges[0].end <= hugepage_end) {
>   				ranges[0].end = hugepage_end;
> -				ranges_n = 1;
>   				*share_hugepage = true;
>   			}
>   		}
> 
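
For anyone reviewing the new alignment logic, the align-then-merge step
in the second hunk can be tried in isolation. The following is only a
minimal standalone sketch, not the driver code: struct range,
align_floor(), align_ceil() and align_and_merge() are hypothetical
stand-ins for struct mlx5_range, RTE_ALIGN_FLOOR, RTE_ALIGN_CEIL and the
loop in mlx5_mempool_reg_analyze(). It assumes the ranges arrive sorted
by start address and that the page size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mlx5_range: [start, end). */
struct range {
	uintptr_t start;
	uintptr_t end;
};

/* Round down to a power-of-two boundary (stand-in for RTE_ALIGN_FLOOR). */
static uintptr_t
align_floor(uintptr_t v, uintptr_t align)
{
	return v & ~(align - 1);
}

/* Round up to a power-of-two boundary (stand-in for RTE_ALIGN_CEIL). */
static uintptr_t
align_ceil(uintptr_t v, uintptr_t align)
{
	return align_floor(v + align - 1, align);
}

/*
 * Align every range to page_sz, then merge in place the ranges that
 * overlap or touch after alignment.
 * Assumption: ranges are sorted by start address.
 * Returns the number of ranges left at the front of the array.
 */
static unsigned int
align_and_merge(struct range *r, unsigned int n, uintptr_t page_sz)
{
	unsigned int i, out = 1;

	assert(n > 0);
	assert(page_sz != 0 && (page_sz & (page_sz - 1)) == 0);

	for (i = 0; i < n; i++) {
		r[i].start = align_floor(r[i].start, page_sz);
		r[i].end = align_ceil(r[i].end, page_sz);
	}
	for (i = 1; i < n; i++) {
		if (r[out - 1].end >= r[i].start) {
			/* Overlapping or adjacent: extend the previous range. */
			if (r[i].end > r[out - 1].end)
				r[out - 1].end = r[i].end;
		} else {
			/* Gap of at least one page: keep as a separate range. */
			r[out++] = r[i];
		}
	}
	return out;
}
```

With a 2 MiB page size, the three ranges [0x201000, 0x205000),
[0x208000, 0x400100) and [0xa00010, 0xa00020) collapse to two:
[0x200000, 0x600000) and [0xa00000, 0xc00000). The first two share a
hugepage after alignment and are merged, while the third stays separate
because a gap of whole hugepages remains between them.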