[dpdk-dev] [EXT] Re: [PATCH v3 25/27] mempool/octeontx2: add optimized dequeue operation for arm64
Aaron Conole
aconole at redhat.com
Fri Jun 21 21:26:56 CEST 2019
Pavan Nikhilesh Bhagavatula <pbhagavatula at marvell.com> writes:
> Hi Aaron,
>
>>-----Original Message-----
>>From: Aaron Conole <aconole at redhat.com>
>>Sent: Tuesday, June 18, 2019 2:55 AM
>>To: Jerin Jacob Kollanukkaran <jerinj at marvell.com>
>>Cc: dev at dpdk.org; Nithin Kumar Dabilpuram
>><ndabilpuram at marvell.com>; Vamsi Krishna Attunuru
>><vattunuru at marvell.com>; Pavan Nikhilesh Bhagavatula
>><pbhagavatula at marvell.com>; Olivier Matz <olivier.matz at 6wind.com>
>>Subject: [EXT] Re: [dpdk-dev] [PATCH v3 25/27] mempool/octeontx2:
>>add optimized dequeue operation for arm64
>>
>>> From: Pavan Nikhilesh <pbhagavatula at marvell.com>
>>>
>>> This patch adds an optimized arm64 instruction based routine to leverage
>>> the CPU pipeline characteristics of octeontx2. The theme is to fill the
>>> pipeline with as many CASP operations as the HW can handle so that the
>>> HW can run alloc() ops at full throttle.
>>>
>>> Cc: Olivier Matz <olivier.matz at 6wind.com>
>>> Cc: Aaron Conole <aconole at redhat.com>
>>>
>>> Signed-off-by: Pavan Nikhilesh <pbhagavatula at marvell.com>
>>> Signed-off-by: Jerin Jacob <jerinj at marvell.com>
>>> Signed-off-by: Vamsi Attunuru <vattunuru at marvell.com>
>>> ---
>>> drivers/mempool/octeontx2/otx2_mempool_ops.c | 291 +++++++++++++++++++
>>> 1 file changed, 291 insertions(+)
>>>
>>> diff --git a/drivers/mempool/octeontx2/otx2_mempool_ops.c b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> index c59bd73c0..e6737abda 100644
>>> --- a/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> +++ b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> @@ -37,6 +37,293 @@ npa_lf_aura_op_alloc_one(const int64_t wdata, int64_t * const addr,
>>> 	return -ENOENT;
>>> }
>>>
>>> +#if defined(RTE_ARCH_ARM64)
>>> +static __rte_noinline int
>>> +npa_lf_aura_op_search_alloc(const int64_t wdata, int64_t * const addr,
>>> +		void **obj_table, unsigned int n)
>>> +{
>>> +	uint8_t i;
>>> +
>>> +	for (i = 0; i < n; i++) {
>>> +		if (obj_table[i] != NULL)
>>> +			continue;
>>> +		if (npa_lf_aura_op_alloc_one(wdata, addr, obj_table, i))
>>> +			return -ENOENT;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static __attribute__((optimize("-O3"))) __rte_noinline int __hot
>>
>>Sorry if I missed this before.
>>
>>Is there a good reason to hard-code this optimization, rather than let
>>the build system provide it?
>
> Some compiler versions don't support __int128_t operands for CASP inline asm, i.e. if the
> optimization level is reduced to -O0, the CASP restrictions aren't followed and the compiler
> might end up violating the CASP rules, for example:
>
> /tmp/ccSPMGzq.s:1648: Error: reg pair must start from even reg at
> operand 1 - `casp x21,x22,x0,x1,[x19]'
> /tmp/ccSPMGzq.s:1706: Error: reg pair must start from even reg at
> operand 1 - `casp x13,x14,x0,x1,[x11]'
> /tmp/ccSPMGzq.s:1745: Error: reg pair must start from even reg at
> operand 1 - `casp x9,x10,x0,x1,[x7]'
> /tmp/ccSPMGzq.s:1775: Error: reg pair must start from even reg at
> operand 1 - `casp x7,x8,x0,x1,[x5]'
>
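For reference, the rule the assembler is enforcing in the errors above can be modelled in a few lines of C. This is only an illustrative sketch: casp_pair_ok() is a hypothetical helper, not part of DPDK or the patch. It just encodes the constraint named in the error message, namely that each CASP register pair must start at an even-numbered register and continue with the next register.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not from the patch): models the CASP operand rule
 * reported by the assembler. A register pair Xn, Xm is acceptable only if
 * n is even and m == n + 1.
 */
static bool
casp_pair_ok(unsigned int first, unsigned int second)
{
	return (first % 2 == 0) && (second == first + 1);
}
```

Under this model casp_pair_ok(21, 22) is false, which matches the x21,x22 case in the log, while casp_pair_ok(0, 1) is true, matching the x0,x1 pair that the same instructions use without complaint.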
> Forcing -O3 with __rte_noinline in place fixes it, as the register-pair alignment then works out.

It makes sense to document this, since it isn't apparent why the attribute
is needed. It would be good to put a comment just before the attribute that
explains it, preferably naming the compilers that misbehave. This would help
in the future to determine when it would be safe to drop the flag.
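Something along these lines is what such a comment could look like. This is a rough sketch only: FORCE_O3 and bulk_alloc_stub() are hypothetical names used for illustration, the stub body stands in for the real CASP sequence, and the affected compiler versions are left as a placeholder because they have not been identified in this thread. The guard assumes the optimize attribute should only be applied when building with GCC.

```c
/*
 * Assumption: the optimize attribute is GCC-specific, so apply it only
 * when building with GCC. FORCE_O3 is a hypothetical macro name.
 */
#if defined(__GNUC__) && !defined(__clang__)
#define FORCE_O3 __attribute__((optimize("O3")))
#else
#define FORCE_O3
#endif

/*
 * The CASP inline asm in the real function takes 128-bit operands, which
 * must be placed in register pairs starting at an even register. At -O0
 * some compiler versions (to be listed here once identified) pick pairs
 * starting at an odd register, and the assembler rejects the build with
 * "reg pair must start from even reg". Forcing -O3 on this non-inlined
 * function avoids that. Drop FORCE_O3 once the affected compilers are no
 * longer supported.
 */
static FORCE_O3 __attribute__((noinline)) int
bulk_alloc_stub(unsigned int n)
{
	/* Stand-in body; the real function would hold the CASP sequence. */
	return (int)n;
}
```

Keeping the attribute behind a macro with the explanation next to it would make it a one-line change to remove later.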
> Regards,
> Pavan.
>
>>
>>> +npa_lf_aura_op_alloc_bulk(const int64_t wdata, int64_t * const addr,
>>> +			  unsigned int n, void **obj_table)
>>> +{
>>> +	const __uint128_t wdata128 = ((__uint128_t)wdata << 64) | wdata;
>>> +	uint64x2_t failed = vdupq_n_u64(~0);
>>> +
>>> +	switch (n) {
>>> +	case 32:
>>> +	{
>>> +		__uint128_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
>>> +		__uint128_t t10, t11;
>>> +
>>> +		asm volatile (
>>> +		".cpu generic+lse\n"
>>> +		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"