[dpdk-dev] [PATCH 2/2] eal/x86: Use lock-prefixed instructions to reduce cost of rte_smp_mb()

Stephen Hemminger stephen at networkplumber.org
Fri Dec 1 19:04:18 CET 2017


On Fri,  1 Dec 2017 11:12:51 +0000
Konstantin Ananyev <konstantin.ananyev at intel.com> wrote:

> On x86 it  is possible to use lock-prefixed instructions to get
> the similar effect as mfence.
> As pointed by Java guys, on most modern HW that gives a better
> performance than using mfence:
> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> That patch adopts that technique for rte_smp_mb() implementation.
> On BDW 2.2 mb_autotest on single lcore reports 2X cycle reduction,
> i.e. from ~110 to ~55 cycles per operation.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev at intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic.h           | 45 +++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> index 4eac66631..07b7fa7f7 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
> @@ -55,12 +55,53 @@ extern "C" {
>  
>  #define	rte_rmb() _mm_lfence()
>  
> -#define rte_smp_mb() rte_mb()
> -
>  #define rte_smp_wmb() rte_compiler_barrier()
>  
>  #define rte_smp_rmb() rte_compiler_barrier()
>  
> +/*
> + * From Intel Software Development Manual; Vol 3;
> + * 8.2.2 Memory Ordering in P6 and More Recent Processor Families:
> + * ...
> + * . Reads are not reordered with other reads.
> + * . Writes are not reordered with older reads.
> + * . Writes to memory are not reordered with other writes,
> + *   with the following exceptions:
> + *   . streaming stores (writes) executed with the non-temporal move
> + *     instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
> + *   . string operations (see Section 8.2.4.1).
> + *  ...
> + * . Reads may be reordered with older writes to different locations but not
> + * with older writes to the same location.
> + * . Reads or writes cannot be reordered with I/O instructions,
> + * locked instructions, or serializing instructions.
> + * . Reads cannot pass earlier LFENCE and MFENCE instructions.
> + * . Writes ... cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
> + * . LFENCE instructions cannot pass earlier reads.
> + * . SFENCE instructions cannot pass earlier writes ...
> + * . MFENCE instructions cannot pass earlier reads, writes ...
> + *
> + * As pointed by Java guys, that makes possible to use lock-prefixed
> + * instructions to get the same effect as mfence and on most modern HW
> + * that gives a better perfomarnce than using mfence:
> + * https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> + * So below we use that technique for rte_smp_mb() implementation.
> + */
> +
> +#ifdef RTE_ARCH_I686
> +#define	RTE_SP	RTE_STR(esp)
> +#else
> +#define	RTE_SP	RTE_STR(rsp)
> +#endif
> +
> +#define RTE_MB_DUMMY_MEMP	"-128(%%" RTE_SP ")"
> +
> +static __rte_always_inline void
> +rte_smp_mb(void)
> +{
> +	asm volatile("lock addl $0," RTE_MB_DUMMY_MEMP "; " ::: "memory");
> +}
> +
>  #define rte_io_mb() rte_mb()
>  
>  #define rte_io_wmb() rte_compiler_barrier()

The lock instruction is a stronger barrier than the compiler barrier
and has worse performance impact. Are you sure it is necessary to use it in DPDK.
Linux kernel has successfully used simple compiler reodering barrier for years.

Don't confuse rte_smp_mb with the required barrier for talking to I/O devices.


More information about the dev mailing list