[PATCH v2 1/3] net/bonding: support Tx prepare

Chas Williams 3chas3 at gmail.com
Sun Sep 25 12:32:12 CEST 2022



On 9/21/22 22:12, fengchengwen wrote:
> 
> 
> On 2022/9/20 7:02, Chas Williams wrote:
>>
>>
>> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>>
>>>>
>>>> On 9/16/22 22:35, fengchengwen wrote:
>>>>> Hi Chas,
>>>>>
>>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>>
>>>>>>> The main problem is that it is hard to design a tx_prepare for a bonding device:
>>>>>>> 1. As Chas Williams said, the hash may be calculated twice to get the target
>>>>>>>        slave devices.
>>>>>>> 2. More importantly, if the slave devices change (e.g. a slave device's link
>>>>>>>        goes down or the device is removed) between bond-tx-prepare and
>>>>>>>        bond-tx-burst, the output slave will change, and this may lead to checksum
>>>>>>>        failures. (Note: the slave devices of a bond device may come from different
>>>>>>>        vendors and may have different requirements, e.g. slave-A calculates the
>>>>>>>        IPv4 pseudo-header automatically (no driver pre-calc needed), but slave-B
>>>>>>>        needs the driver to pre-calculate it.)
>>>>>>>
>>>>>>> The current design covers the above two scenarios by using in-place tx-prepare.
>>>>>>> In addition, bond devices are not transparent to applications, so I think this
>>>>>>> is a practical way to provide tx-prepare support.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (it simply called tx-prepare by default), but
>>>>> the community was concerned about the performance impact.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_tx_prepare API affects
>>>> performance adversely, that is not a bonding problem. All applications
>>>> should be calling rte_eth_tx_prepare. There's no defined API
>>>> to determine if rte_eth_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMDs that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send a V3 to fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>    From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams" <chwillia at ciena.com>
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams <chwillia at ciena.com>
>>>>>
>>>>> ...
>>>>>
>>>>>>                  * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>         /* Transmit burst on each active slave */
>>>>>>         for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                         bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> tx-prepare may edit packet data, and broadcast mode sends the same packet to all slaves,
>>>>> so the packet data may be sent and edited at the same time. Is this likely to cause problems?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My understanding was quite the opposite: tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() is going to free the mbuf and put it back to the mempool),
>>> while tx_prepare() can. Actually, as I remember, that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere? It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost drivers call rte_vlan_insert. Before 2019,
>> the virtio driver also called rte_vlan_insert in its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header. Regardless, it looks like rte_eth_tx_prepare should always be
>> called. Handling that correctly in broadcast mode probably means always
>> making a deep copy of the packet, or checking whether all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during addition/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
> 
> rte_eth_tx_prepare has this note:
>      * Since this function can modify packet data, provided mbufs must be safely
>      * writable (e.g. modified data cannot be in shared segment).
> but rte_eth_tx_burst has no such requirement.
> 
> Besides the rte_vlan_insert examples above, some PMDs also modify the mbuf's header
> and data, e.g. hns3/ark/bnxt invoke rte_pktmbuf_append when the pkt-len is too small.
> 
> I would prefer that rte_eth_tx_burst add such a restriction: the PMD should not modify the mbuf
> unless refcnt==1, so that applications can rely on an explicit definition.
> 
> 
> As for this bonding scenario, we have three alternatives:
> 1) As in the patch Chas provided, always do tx-prepare before tx-burst. It is simple, but it
> may modify the mbuf without the application being able to detect it (unless specially documented).
> 2) My patch: the application can invoke prepare_enable/disable to control whether to do prepare.
> 3) Implement the bonding PMD's tx-prepare, which does tx-prepare for each slave. But there is a problem:
> if the slave devices change (e.g. a new device is added), some packet errors may occur because we have
> not done prepare for the newly added device.
> 
> note1: 1 and 2 above both violate rte_eth_tx_burst's requirement, so we should document this specially.
> note2: we can do some optimization for 3, e.g. if the same driver name is detected on multiple slave
>         devices, tx-prepare only needs to be performed once. But the problem described above still exists
>         because slave devices can change dynamically at runtime.
> 
> Hope for more discussion. @Ferruh @Chas @Humin @Konstantin

I don't think adding an additional API due to concerns about performance is
the solution to the performance problem. If the tx_prepare API is slow,
that's what needs to be fixed. I imagine that more drivers will be using
the tx_prepare API over time, not fewer. It would be a good idea to get
used to calling it.
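
To make the calling pattern concrete, here is a minimal sketch of
"prepare, then burst only what was prepared". It does not use DPDK itself:
mock_mbuf, mock_tx_prepare and mock_tx_burst are hypothetical stand-ins for
struct rte_mbuf, rte_eth_tx_prepare and rte_eth_tx_burst. The one property
assumed from the real API is that prepare returns the number of packets it
accepted, stopping at the first packet it cannot handle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only what the sketch needs. */
struct mock_mbuf {
    int prepared; /* set once the mock prepare step has touched the packet */
    int bad;      /* simulates a packet the driver cannot prepare */
};

/* Mock prepare: stops at the first packet it rejects and returns how many
 * packets were accepted before it. */
static uint16_t
mock_tx_prepare(struct mock_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        if (pkts[i]->bad)
            break;
        pkts[i]->prepared = 1;
    }
    return i;
}

/* Mock burst: pretends the device accepted every packet it was given. */
static uint16_t
mock_tx_burst(struct mock_mbuf **pkts, uint16_t nb_pkts)
{
    (void)pkts;
    return nb_pkts;
}

/* Always call prepare, then hand only the prepared prefix to burst. */
static uint16_t
send_prepared(struct mock_mbuf **pkts, uint16_t nb_pkts)
{
    uint16_t nb_prep = mock_tx_prepare(pkts, nb_pkts);

    assert(nb_prep <= nb_pkts);
    return mock_tx_burst(pkts, nb_prep);
}
```

The caller still owns the packets from index nb_prep onward and has to decide
whether to drop or retry them; the sketch leaves that out.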

As for broadcast mode, let's just call tx_prepare once for any given
packet. For now, assume that no one would attempt to bond different
PMDs together. In my experience, that would be unusual. I have never
seen anyone do that in a production context. If a bug report comes in
about this failing for someone, we can fix it then.
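
Roughly, the broadcast loop would then look like the sketch below. This is
an illustration only, not the patch: mock_member, mock_member_prepare and
mock_member_burst are hypothetical stand-ins for a bonding member and for
rte_eth_tx_prepare / rte_eth_tx_burst, and it relies on the assumption above
that all members are the same PMD, so preparing against the first member is
good enough for the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one bonding member port. */
struct mock_member {
    uint16_t prep_limit;        /* most packets the mock prepare accepts */
    unsigned int prepare_calls; /* how often prepare ran on this member */
    unsigned int burst_calls;   /* how often burst ran on this member */
};

/* Mock prepare: accepts at most prep_limit packets. */
static uint16_t
mock_member_prepare(struct mock_member *m, uint16_t nb_pkts)
{
    m->prepare_calls++;
    return nb_pkts < m->prep_limit ? nb_pkts : m->prep_limit;
}

/* Mock burst: pretends the member sent everything it was given. */
static uint16_t
mock_member_burst(struct mock_member *m, uint16_t nb_pkts)
{
    m->burst_calls++;
    return nb_pkts;
}

/* Prepare once (via the first member), then burst the prepared packets on
 * every member. Returns 1 if any member sent fewer than nb_pkts. */
static int
broadcast_prepare_once(struct mock_member *members, uint16_t num_members,
        uint16_t nb_pkts, uint16_t *tx_total)
{
    int tx_failed = 0;
    uint16_t nb_prep, i;

    assert(tx_total != NULL);
    if (num_members == 0)
        return 0;

    nb_prep = mock_member_prepare(&members[0], nb_pkts);
    for (i = 0; i < num_members; i++) {
        tx_total[i] = mock_member_burst(&members[i], nb_prep);
        if (tx_total[i] < nb_pkts)
            tx_failed = 1;
    }
    return tx_failed;
}
```

If mixed-PMD bonds ever show up, the single prepare call is the spot that
would need to become per-member, along with a copy of the packet.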


>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_tx_prepare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>             if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                 tx_failed_flag = 1;
>> .

