[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx

Vladyslav Buslov Vladyslav.Buslov at harmonicinc.com
Tue Oct 11 11:24:37 CEST 2016


Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
> Sent: Tuesday, October 11, 2016 11:51 AM
> To: Vladyslav Buslov; Wu, Jingjing; Yigit, Ferruh; Zhang, Helin
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> Hi Vladyslav,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vladyslav Buslov
> > Sent: Monday, October 10, 2016 6:06 PM
> > To: Wu, Jingjing <jingjing.wu at intel.com>; Yigit, Ferruh
> > <ferruh.yigit at intel.com>; Zhang, Helin <helin.zhang at intel.com>
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > > -----Original Message-----
> > > From: Wu, Jingjing [mailto:jingjing.wu at intel.com]
> > > Sent: Monday, October 10, 2016 4:26 PM
> > > To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> > > Cc: dev at dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > instructions for bulk rx
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yigit, Ferruh
> > > > Sent: Wednesday, September 14, 2016 9:25 PM
> > > > To: Vladyslav Buslov <vladyslav.buslov at harmonicinc.com>; Zhang,
> > > > Helin <helin.zhang at intel.com>; Wu, Jingjing
> > > > <jingjing.wu at intel.com>
> > > > Cc: dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > > > instructions for bulk rx
> > > >
> > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> > > > > Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> > > > >
> > > > > Signed-off-by: Vladyslav Buslov
> > > > > <vladyslav.buslov at harmonicinc.com>
> > > > > ---
> > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > > > > index d3cfb98..e493fb4 100644
> > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > >                         mb = rxep[j].mbuf;
> > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
> > >
> > > Why prefetch here? I think if the application needs to deal with the
> > > packet, it is more suitable to put the prefetch in the application.
> > >
> > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > >                         pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > > > >
> > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > >                         /* Prefetch next mbuf */
> > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> 
> I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> specifically for that case.

Thanks for pointing that out.
I'll submit a new patch if you decide to move forward with this development.
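
For reference, a minimal sketch of how that could look (illustration only, not the final patch; 'struct sw_entry' just mirrors the i40e sw_ring entry layout):

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_branch_prediction.h>

/* Sketch: prefetch both cache lines of the next sw-ring mbuf using the
 * rte_mbuf.h helpers instead of touching cacheline0/cacheline1 directly. */
struct sw_entry {
	struct rte_mbuf *mbuf;
};

static inline void
prefetch_next_mbuf(struct sw_entry *rxep, uint16_t i, uint16_t rx_free_thresh)
{
	if (likely(i < (uint16_t)(rx_free_thresh - 1))) {
		rte_mbuf_prefetch_part1(rxep[i + 1].mbuf);  /* first cache line */
		rte_mbuf_prefetch_part2(rxep[i + 1].mbuf);  /* second cache line */
	}
}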

> 
> > > > > +               }
> > > Agree with this change. But when I tested it with testpmd in iofwd
> > > mode, no performance increase was observed, only a minor decrease.
> > > Can you share with us when it will benefit performance in your
> > > scenario?
> > >
> > >
> > > Thanks
> > > Jingjing
> >
> > Hello Jingjing,
> >
> > Thanks for code review.
> >
> > My use case: We have a simple distributor thread that receives packets
> > from a port and distributes them among worker threads according to a
> > VLAN and MAC address hash.
> >
> > While working on performance optimization we determined that most of
> > this thread's CPU usage was inside DPDK.
> > As an optimization we decided to switch to the rx burst alloc function,
> > however that caused additional performance degradation compared to
> > scatter rx mode.
> > In the profiler the two major culprits were:
> >   1. Access to the packet's Ethernet header in application code (cache miss).
> >   2. Setting the next field to NULL in the DPDK i40e_rx_alloc_bufs code
> >      (this field is in the second mbuf cache line, which was not prefetched).
> 
> I wonder what will happen if we remove any prefetches here?
> Would it make things better or worse (and by how much)?

In our case it causes a few percent PPS degradation on the next=NULL assignment, but it seems that Jingjing's test doesn't confirm it.
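
For context, a paraphrase of the per-mbuf init in i40e_rx_alloc_bufs(); field placement is stated from memory, so treat it as an assumption: every store below except the one to 'next' hits the mbuf's first cache line, which is why the cacheline1 prefetch matters.

#include <rte_mbuf.h>

/* Paraphrase of the i40e_rx_alloc_bufs() per-mbuf reset: 'next' lives in
 * the mbuf's second cache line, the other fields in the first. */
static inline void
reset_rx_mbuf(struct rte_mbuf *mb, uint8_t port_id)
{
	rte_mbuf_refcnt_set(mb, 1);             /* first cache line */
	mb->next = NULL;                        /* second cache line */
	mb->data_off = RTE_PKTMBUF_HEADROOM;    /* first cache line */
	mb->nb_segs = 1;
	mb->port = port_id;
}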

> 
> > After applying my fixes, performance improved compared to scatter rx
> > mode.
> >
> > I assumed that prefetching the first cache line of packet data belongs
> > in DPDK because it is already done in scatter rx mode (in
> > i40e_recv_scattered_pkts).
> > It can be moved to the application side, but IMO it is better to be
> > consistent across all rx modes.
> 
> I would agree with Jingjing here; the PMD should probably avoid
> prefetching the packet's data.

Actually, I can see some valid use cases where it is beneficial to have this prefetch in the driver.
In our software distributor case it is trivial to prefetch the next packet on each iteration because packets are processed one by one.
However, when we move this functionality to hardware by means of RSS/virtual functions/Flow Director (our long-term goal), worker threads will receive packets directly from the NIC's rx queues.
The first operation of a worker thread is a bulk lookup in a hash table by destination MAC. This causes a cache miss on accessing each Ethernet header and can't easily be mitigated in application code; a rough sketch of the pattern follows below.
I assume this is a ubiquitous use case for DPDK.
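
Something like this (names and the hash setup are illustrative, not our actual code; the point is that the first touch of each packet is the Ethernet header read for the lookup key):

#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_hash.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch of such a worker: receive a burst straight from an RSS rx queue
 * and build keys for a bulk hash lookup by destination MAC.  Without a
 * payload prefetch in the PMD, each rte_pktmbuf_mtod() access below is a
 * cold cache miss. */
static void
worker_poll(uint8_t port_id, uint16_t queue_id, const struct rte_hash *mac_table)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	const void *keys[BURST_SIZE];
	int32_t positions[BURST_SIZE];
	uint16_t nb_rx, i;

	nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
	if (nb_rx == 0)
		return;

	for (i = 0; i < nb_rx; i++) {
		struct ether_hdr *eth =
			rte_pktmbuf_mtod(pkts[i], struct ether_hdr *);
		keys[i] = &eth->d_addr;  /* first access to packet data */
	}

	rte_hash_lookup_bulk(mac_table, keys, nb_rx, positions);

	/* ... dispatch pkts[i] to the right handler based on positions[i] ... */
}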

Regards,
Vladyslav

