[dpdk-dev] [PATCH] ring: empty and count optimizations
Morten Brørup
mb at smartsharesystems.com
Thu May 14 20:00:21 CEST 2020
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev,
> Konstantin
> Sent: Thursday, May 14, 2020 6:47 PM
> >
> > > > -static inline unsigned
> > > > +static inline unsigned int
> > > > rte_ring_count(const struct rte_ring *r)
> > > > {
> > > > uint32_t prod_tail = r->prod.tail;
> > > > uint32_t cons_tail = r->cons.tail;
> > > > uint32_t count = (prod_tail - cons_tail) & r->mask;
> > > > - return (count > r->capacity) ? r->capacity : count;
> > > > + return likely(count <= r->capacity) ? count : r->capacity;
> > >
> > > Honestly, I don't see there is any point of that change:
> > > I think it wouldn't change anything in terms of functionality
> > > or performance.
> >
> > Chapter 3.4.1 "Branch Prediction Optimization" in the Intel 64 and
> IA-32 Architectures Optimization Reference Manual recommends this
> > kind of optimization as Assembly/Compiler Coding Rule 3, which is why
> I rearranged the trigraph. Essentially, there is a limit to the number
> > of BTB (Branch Target Buffer) entries, so they should be conserved if
> possible.
> >
> > In addition to that, I have added the likely() because I consider it
> nearly impossible that the count will exceed the capacity.
> >
> > However, it's not the first time I see this kind of response to a
> suggested branch optimization on the DPDK mailing list. Everyone seem
> to
> > think that branch prediction is infinite and always works. It may
> seem as if infinite on trivial applications, but BTB entries may be a
> scarce
> > resource on complex applications. I assume Intel's recommendations
> are not just for the fun of it.
>
> I think it is better to leave such level of micro-optimizations to the
> compiler.
> BTW, in that particular case, compiler most likely will generate a code
> without any branches at all (at least for IA).
> Let say on my box with gcc 7.3:
>
> $ cat trc1.c
> #include <stdint.h>
> #include <rte_config.h>
> #include <rte_ring.h>
>
> uint32_t
> fffx1(const struct rte_ring *r)
> {
> uint32_t prod_tail = r->prod.tail;
> uint32_t cons_tail = r->cons.tail;
> uint32_t count = (prod_tail - cons_tail) & r->mask;
> return (count > r->capacity) ? r->capacity : count;
> }
>
> uint32_t
> fffx2(const struct rte_ring *r)
> {
> uint32_t prod_tail = r->prod.tail;
> uint32_t cons_tail = r->cons.tail;
> uint32_t count = (prod_tail - cons_tail) & r->mask;
> return likely(count <= r->capacity) ? count : r->capacity;
> }
>
> $ gcc -m64 -O3 -march=native -I${RTE_SDK}/x86_64-native-linuxapp-
> gcc/include -c trc1.c
>
> $ objdump -d trc1.o
>
> 0000000000000000 <fffx1>:
> 0: 8b 87 84 00 00 00 mov 0x84(%rdi),%eax
> 6: 8b 97 04 01 00 00 mov 0x104(%rdi),%edx
> c: 29 d0 sub %edx,%eax
> e: 8b 57 38 mov 0x38(%rdi),%edx
> 11: 23 47 34 and 0x34(%rdi),%eax
> 14: 39 d0 cmp %edx,%eax
> 16: 0f 47 c2 cmova %edx,%eax
> 19: c3 retq
> 1a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>
> 0000000000000020 <fffx2>:
> 20: 8b 87 84 00 00 00 mov 0x84(%rdi),%eax
> 26: 8b 97 04 01 00 00 mov 0x104(%rdi),%edx
> 2c: 29 d0 sub %edx,%eax
> 2e: 8b 57 38 mov 0x38(%rdi),%edx
> 31: 23 47 34 and 0x34(%rdi),%eax
> 34: 39 d0 cmp %edx,%eax
> 36: 0f 47 c2 cmova %edx,%eax
> 39: c3 retq
>
> As you can see, there is no difference.
>
Thank you for the detailed feedback.
Reality trumps theory, so I will leave the count function as is. :-)
> >
> > Konstantin, please note that I'm letting out my frustration about the
> general misconception about branch prediction here. You are doing a
> > great job, so I feel bad about responding like this to you.
>
> No worries, in fact I am glad to know that DPDK contributors
> read IA optimization manual that thoughtfully 😊
>
> Konstantin
More information about the dev
mailing list