[dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function

Jianbo Liu jianbo.liu at linaro.org
Thu Feb 9 04:49:57 CET 2017


On 9 February 2017 at 03:53, Ananyev, Konstantin
<konstantin.ananyev at intel.com> wrote:
>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Ananyev, Konstantin
>> Sent: Wednesday, February 8, 2017 6:54 PM
>> To: Yigit, Ferruh <ferruh.yigit at intel.com>; Jianbo Liu <jianbo.liu at linaro.org>; dev at dpdk.org; Zhang, Helin <helin.zhang at intel.com>;
>> jerin.jacob at caviumnetworks.com
>> Subject: Re: [dpdk-dev] [PATCH v2 1/2] net/ixgbe: calculate the correct number of received packets in bulk alloc function
>>
>> Hi Ferruh,
>>
>> >
>> > On 2/4/2017 1:26 PM, Ananyev, Konstantin wrote:
>> > >>
>> > >> To get better performance, Rx bulk alloc recv function will scan 8 descs
>> > >> in one time, but the statuses are not consistent on ARM platform because
>> > >> the memory allocated for Rx descriptors is cacheable hugepages.
>> > >> This patch is to calculate the number of received packets by scan DD bit
>> > >> sequentially, and stops when meeting the first packet with DD bit unset.
>> > >>
>> > >> Signed-off-by: Jianbo Liu <jianbo.liu at linaro.org>
>> > >> ---
>> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 16 +++++++++-------
>> > >>  1 file changed, 9 insertions(+), 7 deletions(-)
>> > >>
>> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> index 36f1c02..613890e 100644
>> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> > >> @@ -1460,17 +1460,19 @@ static inline int __attribute__((always_inline))
>> > >>          for (i = 0; i < RTE_PMD_IXGBE_RX_MAX_BURST;
>> > >>               i += LOOK_AHEAD, rxdp += LOOK_AHEAD, rxep += LOOK_AHEAD) {
>> > >>                  /* Read desc statuses backwards to avoid race condition */
>> > >> -                for (j = LOOK_AHEAD-1; j >= 0; --j)
>> > >> +                for (j = 0; j < LOOK_AHEAD; j++)
>> > >>                          s[j] = rte_le_to_cpu_32(rxdp[j].wb.upper.status_error);
>> > >>
>> > >> -                for (j = LOOK_AHEAD - 1; j >= 0; --j)
>> > >> -                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> > >> -                                                       lo_dword.data);
>> > >> +                rte_smp_rmb();
>> > >>
>> > >>                  /* Compute how many status bits were set */
>> > >> -                nb_dd = 0;
>> > >> -                for (j = 0; j < LOOK_AHEAD; ++j)
>> > >> -                        nb_dd += s[j] & IXGBE_RXDADV_STAT_DD;
>> > >> +                for (nb_dd = 0; nb_dd < LOOK_AHEAD &&
>> > >> +                                (s[nb_dd] & IXGBE_RXDADV_STAT_DD); nb_dd++)
>> > >> +                        ;
>> > >> +
>> > >> +                for (j = 0; j < nb_dd; j++)
>> > >> +                        pkt_info[j] = rte_le_to_cpu_32(rxdp[j].wb.lower.
>> > >> +                                                       lo_dword.data);
>> > >>
>> > >>                  nb_rx += nb_dd;
>> > >>
>> > >> --
>> > >
>> > > Acked-by: Konstantin Ananyev <konstantin.ananyev at intel.com>
>> >
>> > Hi Konstantin,
>> >
>> > Is the ack valid for v3 and both patches?
>>
>> No, I didn't look into the second one in details.
>> It is ARM specific, and I left it for people who are more familiar with ARM then me :)
>> Konstantin
>
> Actually, I had a quick look after your mail.
>
> +               /* A.1 load 1 pkts desc */
> +               descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> +               rte_smp_rmb();
>
>                 /* B.2 copy 2 mbuf point into rx_pkts  */
>                 vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
> @@ -271,10 +270,11 @@
>                 /* B.1 load 1 mbuf point */
>                 mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
>
> -               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> -               /* B.1 load 2 mbuf point */
>                 descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
> -               descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> +
> +               /* A.1 load 2 pkts descs */
> +               descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> +               descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
>
> Assuming that on all ARM-NEON platforms 16B reads are atomic,
> I think there is no need for smp_rmb() after the desc[0] read.
> What looks more appropriate to me:

With checking DDs in sequence, it doesn't matter much where the rmb is.
But there is a little performance improvement (0.02%) in my testing
with your suggestion.
So I'll send a new version. Thanks!

>
> descs[0] =  vld1q_u64((uint64_t *)(rxdp));
> descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
> descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
> descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
>
> rte_smp_rmb();
>
> ...
>
> But, as I said would be good if some ARM guys have a look here.
> Konstantin
>
>
>>
>> >
>> > Thanks,
>> > ferruh
>> >
>> > >
>> > >> 1.8.3.1
>> > >
>


More information about the dev mailing list