[dpdk-dev] [PATCH v7 02/17] lib/ring: apis to support configurable element size

Ananyev, Konstantin konstantin.ananyev at intel.com
Thu Jan 9 01:48:57 CET 2020



> <snip>
> > > > > > > > > > +
> > > > > > > > > > +static __rte_always_inline void
> > > > > > > > > > +enqueue_elems_128(struct rte_ring *r, uint32_t prod_head,
> > > > > > > > > > +		const void *obj_table, uint32_t n)
> > > > > > > > > > +{
> > > > > > > > > > +	unsigned int i;
> > > > > > > > > > +	const uint32_t size = r->size;
> > > > > > > > > > +	uint32_t idx = prod_head & r->mask;
> > > > > > > > > > +	__uint128_t *ring = (__uint128_t *)&r[1];
> > > > > > > > > > +	const __uint128_t *obj = (const __uint128_t *)obj_table;
> > > > > > > > > > +	if (likely(idx + n < size)) {
> > > > > > > > > > +		for (i = 0; i < (n & ~0x1); i += 2, idx += 2) {
> > > > > > > > > > +			ring[idx] = obj[i];
> > > > > > > > > > +			ring[idx + 1] = obj[i + 1];
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > AFAIK, that implies 16B aligned obj_table...
> > > > > > > > > Would it always be the case?
> > > > > > > > I am not sure from the compiler perspective.
> > > > > > > > At least on Arm architecture, unaligned access (address that
> > > > > > > > is accessed is not aligned to the size of the data element
> > > > > > > > being
> > > > > > > > accessed) will result in faults or require additional cycles.
> > > > > > > > So, aligning on 16B should be fine.
> > > > > > > Further, I would be changing this to use 'rte_int128_t' as
> > > > > > > '__uint128_t' is not defined on 32b systems.
> > > > > >
> > > > > > What I am trying to say: with this code we imply a new requirement
> > > > > > for elems
> > > > > The only existing use case in DPDK for 16B is the event ring. The
> > > > > event ring already does a similar kind of copy (using 'struct rte_event').
> > > > > So, there is no change in expectations for the event ring.
> > > > > For future code, I think this expectation should be fine since it
> > > > > allows for optimal code.
> > > > >
> > > > > > in the ring: when sizeof(elem)==16 its alignment also has to be
> > > > > > at least 16.
> > > > > > Which from my perspective is not ideal.
> > > > > Any reasoning?
> > > >
> > > > New implicit requirement and inconsistency.
> > > > Code like that:
> > > >
> > > > struct ring_elem {uint64_t a, b;};
> > > > ....
> > > > struct ring_elem elem;
> > > > rte_ring_dequeue_elem(ring, &elem, sizeof(elem));
> > > >
> > > > might cause a crash.
> > > The alignment here is 8B. Assuming that the instructions generated
> > > require 16B alignment, it will result in a crash, if configured to
> > > generate an exception.
> > > But, these instructions are not atomic instructions. At least on
> > > aarch64, unaligned access will not result in an exception for non-atomic
> > > loads/stores. I believe it is the same behavior for x86 as well.
> >
> > On IA, there are 2 types of 16B load/store instructions: aligned and unaligned.
> > Aligned are a bit faster, but will cause an exception if used on non 16B aligned
> > address.
> > As you are using uint128_t *, the compiler will assume that both src and
> > dst are 16B aligned and might generate code with aligned instructions.
> Ok, looking at a few articles, I read that if the address is aligned, the unaligned instructions do not incur a penalty. Is this
> understanding correct?

Yes, from my experience the difference is negligible.

> 
> I see 2 solutions here:
> 1) We can switch this copy to use uint32_t pointer. It would still allow the compiler to generate (unaligned) instructions for up to 256b
> load/store. The 2 multiplications (to normalize the index and the size of copy) can use shifts. This should make it safer. If one wants
> performance, they can align the obj table to 16B (the ring itself is already aligned on the cache line boundary).

Sounds good to me.

> 
> 2) Considering that performance is paramount, we could document that the obj table needs to be aligned on 16B boundary. This would
> affect event dev (if we go ahead with replacing the event ring implementation) significantly.

I don't think perf difference would be that significant to justify such constraint.
I am in favor of #1.
 
> Note that we have to do the same thing for 64b elements as well.

I don't mind having one unified copy procedure, which would always use 32-bit elems,
but AFAIK, on IA there is no such limitation for 64-bit load/stores.


> 
> >
> > >
> > > > While exactly the same code with:
> > > >
> > > > struct ring_elem {uint64_t a, b, c;};
> > > > OR
> > > > struct ring_elem {uint64_t a, b, c, d;};
> > > >
> > > > will work ok.
> > > The alignment for these structures is still 8B. Are you saying this
> > > will work because these will be copied using a pointer to uint32_t
> > > (whose alignment is 4B)?
> >
> > Yes, as we are doing uint32_t copies, the compiler can't assume the data
> > will be 16B aligned and will use unaligned instructions.
> >
> > >
> > > >
> > > > >
> > > > > > Note that for elem sizes > 16 (24, 32), there is no such constraint.
> > > > > The rest of them need to be aligned on a 4B boundary. However, this
> > > > > should not affect the existing code.
> > > > > The code for 8B and 16B is kept as is to ensure the performance is
> > > > > not affected for the existing code.
> > > <snip>


