[dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
    Ananyev, Konstantin 
    konstantin.ananyev at intel.com
       
    Thu Oct 13 12:30:09 CEST 2016
    
    
  
> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, October 13, 2016 11:19 AM
> To: Ananyev, Konstantin <konstantin.ananyev at intel.com>
> Cc: Vladyslav Buslov <Vladyslav.Buslov at harmonicinc.com>; Wu, Jingjing <jingjing.wu at intel.com>; Yigit, Ferruh <ferruh.yigit at intel.com>;
> Zhang, Helin <helin.zhang at intel.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch instructions for bulk rx
> 
> On Wed, Oct 12, 2016 at 12:04:39AM +0000, Ananyev, Konstantin wrote:
> > Hi Vladislav,
> >
> > > > > > >
> > > > > > > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > > > > > > Added prefetch of first packet payload cacheline in
> > > > > > > > i40e_rx_scan_hw_ring Added prefetch of second mbuf cacheline in
> > > > > > > > i40e_rx_alloc_bufs
> > > > > > > >
> > > > > > > > Signed-off-by: Vladyslav Buslov
> > > > > > > > <vladyslav.buslov at harmonicinc.com>
> > > > > > > > ---
> > > > > > > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > > > > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > > > > > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > > > > > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct
> > > > i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >                 /* Translate descriptor info to mbuf parameters */
> > > > > > > >                 for (j = 0; j < nb_dd; j++) {
> > > > > > > >                         mb = rxep[j].mbuf;
> > > > > > > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr,
> > > > > > > RTE_PKTMBUF_HEADROOM));
> > > > > >
> > > > > > Why did prefetch here? I think if application need to deal with
> > > > > > packet, it is more suitable to put it in application.
> > > > > >
> > > > > > > >                         qword1 = rte_le_to_cpu_64(\
> > > > > > > >                                 rxdp[j].wb.qword1.status_error_len);
> > > > > > > >                         pkt_len = ((qword1 &
> > > > > > > I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > > > > > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue
> > > > > > *rxq)
> > > > > > > >
> > > > > > > >         rxdp = &rxq->rx_ring[alloc_idx];
> > > > > > > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > > > > > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > > > > > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > > > > > > >                         /* Prefetch next mbuf */
> > > > > > > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > > > > > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > > > > > > +                       rte_prefetch0(&rxep[i +
> > > > > > > > + 1].mbuf->cacheline1);
> > > >
> > > > I think there are rte_mbuf_prefetch_part1/part2 defined in rte_mbuf.h,
> > > > specially for that case.
> > >
> > > Thanks for pointing that out.
> > > I'll submit new patch if you decide to move forward with this development.
> > >
> > > >
> > > > > > > > +               }
> > > > > > Agree with this change. And when I test it by testpmd with iofwd, no
> > > > > > performance increase is observed but minor decrease.
> > > > > > Can you share will us when it will benefit the performance in your
> > > > scenario ?
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > > Jingjing
> > > > >
> > > > > Hello Jingjing,
> > > > >
> > > > > Thanks for code review.
> > > > >
> > > > > My use case: We have simple distributor thread that receives packets
> > > > > from port and distributes them among worker threads according to VLAN
> > > > and MAC address hash.
> > > > >
> > > > > While working on performance optimization we determined that most of
> > > > CPU usage of this thread is in DPDK.
> > > > > As and optimization we decided to switch to rx burst alloc function,
> > > > > however that caused additional performance degradation compared to
> > > > scatter rx mode.
> > > > > In profiler two major culprits were:
> > > > >   1. Access to packet data Eth header in application code. (cache miss)
> > > > >   2. Setting next packet descriptor field to NULL in DPDK
> > > > > i40e_rx_alloc_bufs code. (this field is in second descriptor cache
> > > > > line that was not
> > > > > prefetched)
> > > >
> > > > I wonder what will happen if we'll remove any prefetches here?
> > > > Would it make things better or worse (and by how much)?
> > >
> > > In our case it causes few per cent PPS degradation on next=NULL assignment but it seems that JingJing's test doesn't confirm it.
> > >
> > > >
> > > > > After applying my fixes performance improved compared to scatter rx
> > > > mode.
> > > > >
> > > > > I assumed that prefetch of first cache line of packet data belongs to
> > > > > DPDK because it is done in scatter rx mode. (in
> > > > > i40e_recv_scattered_pkts)
> > > > > It can be moved to application side but IMO it is better to be consistent
> > > > across all rx modes.
> > > >
> > > > I would agree with Jingjing here, probably PMD should avoid to prefetch
> > > > packet's data.
> > >
> > > Actually I can see some valid use cases where it is beneficial to have this prefetch in driver.
> > > In our sw distributor case it is trivial to just prefetch next packet on each iteration because packets are processed one by one.
> > > However when we move this functionality to hw by means of RSS/vfunction/FlowDirector(our long term goal) worker threads will
> receive
> > > packets directly from rx queues of NIC.
> > > First operation of worker thread is to perform bulk lookup in hash table by destination MAC. This will cause cache miss on accessing
> each
> > > eth header and can't be easily mitigated in application code.
> > > I assume it is ubiquitous use case for DPDK.
> >
> > Yes it is a quite common use-case.
> > Though I many cases it is possible to reorder user code to hide (or minimize) that data-access latency.
> > From other side there are scenarios where this prefetch is excessive and can cause some drop in performance.
> > Again, as I know, none of PMDs for Intel devices prefetches packet's data in  simple (single segment) RX mode.
> > Another thing that some people may argue then - why only one cache line is prefetched,
> > in some use-cases might need to look at 2-nd one.
> >
> There is a build-time config setting for this behaviour for exactly the reasons
> called out here - in some apps you get a benefit, in others you see a perf
> hit. The default is "on", which makes sense for most cases, I think.
> From common_base:
> 
> CONFIG_RTE_PMD_PACKET_PREFETCH=y$
Yes, but right now i40e and ixgbe non-scattered RX (both vector and scalar) just ignore that flag.
Though yes, might be a good thing to make them to obey that flag properly.
Konstantin
    
    
More information about the dev
mailing list