[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx

Ananyev, Konstantin konstantin.ananyev at intel.com
Wed Oct 12 02:04:39 CEST 2016


Hi Vladislav,

> > > > >
> > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > Added prefetch of first packet payload cacheline in
> > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > i40e_rx_alloc_bufs
> > > > > >
> > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > <vladyslav.buslov at harmonicinc.com>
> > > > > > ---
> > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > >                         mb = rxep[j].mbuf;
> > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
> > > >
> > > > Why prefetch here? I think if the application needs to deal with the
> > > > packet data, it is more suitable to put the prefetch in the application.
> > > >
> > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > >                         pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > > > > >
> > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > >                         /* Prefetch next mbuf */
> > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> >
> > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > specifically for that case.
> 
> Thanks for pointing that out.
> I'll submit new patch if you decide to move forward with this development.
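> 
> For reference, a minimal sketch of that hunk rewritten with those helpers
> (untested, just to illustrate):
> 
>                 if (likely(i < (rxq->rx_free_thresh - 1))) {
>                         /* Prefetch both cache lines of the next mbuf */
>                         rte_mbuf_prefetch_part1(rxep[i + 1].mbuf);
>                         rte_mbuf_prefetch_part2(rxep[i + 1].mbuf);
>                 }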
> 
> >
> > > > > > +               }
> > > > Agree with this change. However, when I tested it with testpmd in iofwd
> > > > mode, no performance increase was observed, only a minor decrease.
> > > > Can you share with us when it benefits performance in your scenario?
> > > >
> > > >
> > > > Thanks
> > > > Jingjing
> > >
> > > Hello Jingjing,
> > >
> > > Thanks for the code review.
> > >
> > > My use case: we have a simple distributor thread that receives packets
> > > from a port and distributes them among worker threads according to a
> > > VLAN and MAC address hash.
> > >
> > > While working on performance optimization we determined that most of the
> > > CPU usage of this thread is in DPDK.
> > > As an optimization we decided to switch to the rx burst alloc function;
> > > however, that caused additional performance degradation compared to
> > > scatter rx mode.
> > > In the profiler, the two major culprits were:
> > >   1. Access to the packet's Ethernet header in application code (cache miss).
> > >   2. Setting the next field to NULL in the DPDK i40e_rx_alloc_bufs code
> > >      (this field is in the second mbuf cache line, which was not prefetched).
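> > >
> > > For illustration, the per-mbuf reset in that refill loop looks roughly
> > > like this, and the next = NULL store is where we saw the miss:
> > >
> > >         mb = rxep[i].mbuf;
> > >         rte_mbuf_refcnt_set(mb, 1);
> > >         mb->next = NULL;  /* 'next' sits in the mbuf's 2nd cache line */
> > >         mb->data_off = RTE_PKTMBUF_HEADROOM;
> > >         mb->nb_segs = 1;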
> >
> > I wonder what would happen if we removed all the prefetches here?
> > Would it make things better or worse (and by how much)?
> 
> In our case it causes a few percent PPS degradation on the next=NULL assignment, but it seems that Jingjing's test doesn't confirm it.
> 
> >
> > > After applying my fixes, performance improved compared to scatter rx mode.
> > >
> > > I assumed that the prefetch of the first cache line of packet data belongs
> > > in DPDK because it is done in scatter rx mode (in i40e_recv_scattered_pkts).
> > > It could be moved to the application side, but IMO it is better to be
> > > consistent across all rx modes.
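> > > (As far as I can tell, the scattered path issues something like
> > >         rte_prefetch0(RTE_PTR_ADD(rxm->buf_addr, RTE_PKTMBUF_HEADROOM));
> > > before handing the mbuf to the application.)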
> >
> > I would agree with Jingjing here; probably the PMD should avoid prefetching
> > packet data.
> 
> Actually, I can see some valid use cases where it is beneficial to have this prefetch in the driver.
> In our sw distributor case it is trivial to just prefetch the next packet on each iteration, because packets are processed one by one.
> However, when we move this functionality to hw by means of RSS/vfunction/FlowDirector (our long-term goal), worker threads will receive
> packets directly from the rx queues of the NIC.
> The first operation of a worker thread is a bulk lookup in a hash table by destination MAC (see the sketch below). This causes a cache miss on
> accessing each eth header and can't be easily mitigated in application code.
> I assume this is a ubiquitous use case for DPDK.
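> 
> A minimal sketch of that worker-side pattern (names such as 'hash', 'port'
> and BURST_SIZE are illustrative, not our actual code):
> 
>         struct rte_mbuf *pkts[BURST_SIZE];
>         const void *keys[BURST_SIZE];
>         int32_t pos[BURST_SIZE];
>         uint16_t i, n;
> 
>         n = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
>         for (i = 0; i < n; i++) {
>                 struct ether_hdr *eth =
>                         rte_pktmbuf_mtod(pkts[i], struct ether_hdr *);
>                 /* First touch of packet data: one cache miss per packet
>                  * unless the payload was prefetched earlier. */
>                 keys[i] = &eth->d_addr;
>         }
>         rte_hash_lookup_bulk(hash, keys, n, pos);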

Yes, it is quite a common use-case.
Though in many cases it is possible to reorder the user code to hide (or minimize) that data-access latency (see the sketch below).
From the other side, there are scenarios where this prefetch is excessive and can cause some drop in performance.
Again, as far as I know, none of the PMDs for Intel devices prefetches packet data in simple (single-segment) RX mode.
Another thing that some people may then argue: why is only one cache line prefetched, when some use-cases might need to look at the second one as well.
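
For example, a software-pipelined loop along these lines (a hypothetical
sketch; PREFETCH_OFFSET and process_packet() are made up for illustration):

	for (i = 0; i < n; i++) {
		if (i + PREFETCH_OFFSET < n)
			rte_prefetch0(rte_pktmbuf_mtod(
				pkts[i + PREFETCH_OFFSET], void *));
		process_packet(pkts[i]);
	}

This is essentially what the l3fwd sample application does to hide the
data-access latency without any prefetch in the PMD.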

Konstantin

> 
> Regards,
> Vladyslav

