[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
Vlad Zolotarov
vladz at cloudius-systems.com
Thu Aug 20 11:06:50 CEST 2015
On 08/20/15 12:05, Vlad Zolotarov wrote:
>
>
> On 08/20/15 11:56, Vlad Zolotarov wrote:
>>
>>
>> On 08/20/15 11:41, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>>
>>>> -----Original Message-----
>>>> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
>>>> Sent: Wednesday, August 19, 2015 11:03 AM
>>>> To: Ananyev, Konstantin; Lu, Wenzhuo
>>>> Cc: dev at dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
>>>> above 1 for all NICs but 82598
>>>>
>>>>
>>>>
>>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>>>> Hi Vlad,
>>>>> Sorry for delay with review, I am OOO till next week.
>>>>> Meanwhile, few questions/comments from me.
>>>> Hi, Konstantin, long time no see... ;)
>>>>
>>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>>>>>>>> seastar-based application on x540 NIC.
>>>>>>>>> Could you help to share with us how to reproduce the tx hang
>>>>>>>>> issue using typical DPDK examples?
>>>>>>>> Sorry. I'm not familiar enough with the typical DPDK examples to
>>>>>>>> help u here. However, this is quite irrelevant, since without this
>>>>>>>> patch the ixgbe PMD obviously abuses the HW spec, as has been
>>>>>>>> explained above.
>>>>>>>>
>>>>>>>> We saw the issue when stressing the xmit path with a lot of highly
>>>>>>>> fragmented TCP frames (packets with up to 33 fragments, with
>>>>>>>> non-header fragments as small as 4 bytes) with all offload features
>>>>>>>> enabled.
>>>>> Could you provide us with the pcap file to reproduce the issue?
>>>> Well, the thing is it takes some time to reproduce it (a few
>>>> minutes of
>>>> heavy load) therefore a pcap would be quite large.
>>> Probably you can upload it to some place, from which we will be able
>>> to download it?
>>
>> I'll see what I can do but no promises...
>
> On second thought, a pcap file won't help u much, since in order to
> reproduce the issue u have to reproduce exactly the same structure of
> clusters I give to the HW, and that's not what u see on the wire in a
> TSO case.
And not only in a TSO case... ;)
>
>>
>>> Or maybe you have some sort of scapy script to generate it?
>>> I suppose we'll need something to reproduce the issue and verify the
>>> fix.
>>
>> Since the original code abuses the HW spec u don't have to... ;)
>>
>>>
>>>>> My concern with your approach is that it would affect TX performance.
>>>> It certainly will ;) But it seems inevitable. See below.
>>>>
>>>>> Right now, for simple TX, the PMD usually reads only
>>>>> (nb_tx_desc/tx_rs_thresh) TXDs,
>>>>> while with your patch (if I understand it correctly) it has to
>>>>> read all TXDs in the HW TX ring.
>>>> If by "simple" u mean always a single fragment per Tx packet - then
>>>> u are absolutely correct.
>>>>
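To put a number on the read cost mentioned above, here is a rough sketch, not PMD code. dd_checks_per_ring() is a hypothetical helper, and it only models the single-fragment case where one DD check covers tx_rs_thresh descriptors.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ixgbe: number of descriptors whose
 * DD bit has to be read to reclaim one full pass over the Tx ring,
 * assuming RS is set on every tx_rs_thresh-th descriptor.
 */
static uint32_t
dd_checks_per_ring(uint32_t nb_tx_desc, uint32_t tx_rs_thresh)
{
	assert(tx_rs_thresh > 0);
	return nb_tx_desc / tx_rs_thresh;
}
```

With illustrative values (not taken from the driver), a 512-descriptor ring and a threshold of 32 need 16 DD reads per pass, while a threshold of 1 needs all 512.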
>>>> My initial patch was to only set RS on every EOP descriptor without
>>>> changing the rs_thresh value and this patch worked.
>>>> However HW spec doesn't ensure in a general case that packets are
>>>> always
>>>> handled/completion write-back completes in the same order the packets
>>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>>>> spec for instance). Therefore AFAIU one should not assume that if
>>>> packet[x+1] DD bit is set then packet[x] is completed too.
>>> From my understanding, TX arbitration controls the order in which
>>> TXDs from
>>> different queues are fetched/processed.
>>> But descriptors from the same TX queue are processed in FIFO order.
>>> So, I think that - yes, if TXD[x+1] DD bit is set, then TXD[x] is
>>> completed too,
>>> and setting RS on every EOP TXD should be enough.
>>
>> Ok. I'll rework the patch under this assumption then.
>>
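For the record, this is roughly what "RS on every EOP plus in-order completion" would mean for the cleanup path. It is only a sketch under that assumption: struct txd_model, queue_packet() and reclaim() are hypothetical names for a software model, not the ixgbe descriptor layout or the PMD's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical ring size, power of two */

/* Hypothetical software model of a Tx descriptor. */
struct txd_model {
	bool eop; /* last descriptor of a packet */
	bool rs;  /* Report Status requested */
	bool dd;  /* Descriptor Done, written back by the "HW" */
};

/* Queue one packet of nb_segs descriptors; RS goes on the EOP one only. */
static uint16_t
queue_packet(struct txd_model *ring, uint16_t tail, uint16_t nb_segs)
{
	assert(nb_segs > 0 && nb_segs < RING_SIZE);
	for (uint16_t i = 0; i < nb_segs; i++) {
		struct txd_model *d = &ring[(tail + i) & (RING_SIZE - 1)];

		d->eop = (i == nb_segs - 1);
		d->rs = d->eop;
		d->dd = false;
	}
	return (tail + nb_segs) & (RING_SIZE - 1);
}

/*
 * Reclaim descriptors in [head, tail). Only EOP descriptors are checked
 * for DD. Assumes in-order completion within the queue, so a set DD on
 * an EOP descriptor covers all earlier descriptors of that packet.
 * Returns the new head and stores the number of freed descriptors.
 */
static uint16_t
reclaim(const struct txd_model *ring, uint16_t head, uint16_t tail,
	uint16_t *nb_freed)
{
	uint16_t done = head;
	uint16_t pending = 0;

	*nb_freed = 0;
	for (uint16_t i = head; i != tail; i = (i + 1) & (RING_SIZE - 1)) {
		pending++;
		if (!ring[i].eop)
			continue;
		if (!ring[i].dd)
			break; /* this packet is not completed yet */
		*nb_freed += pending;
		pending = 0;
		done = (i + 1) & (RING_SIZE - 1);
	}
	return done;
}
```

The point of the model is that the number of DD reads is per packet, not per descriptor, so a 33-fragment packet still costs a single check.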
>>>
>>>> That's why I changed the patch to be as u see it now. However, if I
>>>> am missing something here and your HW people guarantee in-order
>>>> completion, this may of course be changed back.
>>>>
>>>>> Even if we really need to set the RS bit in each TXD (I still doubt
>>>>> we really do),
>>>> Well, if u doubt u may ask the guys from the Intel networking division
>>>> that wrote the 82599 and x540 HW specs where they clearly state
>>>> that. ;)
>>> Good point, we'll see what we can do here :)
>>> Konstantin
>>>
>>>>> I think inside PMD it still should be possible to check TX
>>>>> completion in chunks.
>>>>> Konstantin
>>>>>
>>>>>
>>>>>>>> Thanks,
>>>>>>>> vlad
>>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/net/ixgbe/ixgbe_ethdev.c | 9 +++++++++
>>>>>>>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 23 ++++++++++++++++++++++-
>>>>>>>>>> 2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>>>>>>>> .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>>> };
>>>>>>>>>> +
>>>>>>>>>> + /*
>>>>>>>>>> + * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>>>> + * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>>> + */
>>>>>>>>>> + if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>>>> +
>>>>>>>>>> dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>>>>> dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>>> dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
>>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>>> index 91023b9..8dbdffc 100644
>>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>> struct ixgbe_tx_queue *txq;
>>>>>>>>>> struct ixgbe_hw *hw;
>>>>>>>>>> uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>>>> + bool rs_deferring_allowed;
>>>>>>>>>>
>>>>>>>>>> PMD_INIT_FUNC_TRACE();
>>>>>>>>>> hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>> + * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>>>> + * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>>> + */
>>>>>>>>>> + rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>>>> +
>>>>>>>>>> + /*
>>>>>>>>>> * Validate number of transmit descriptors.
>>>>>>>>>> * It must not exceed hardware maximum, and must be multiple
>>>>>>>>>> * of IXGBE_ALIGN.
>>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>> * to transmit a packet is greater than the number of free TX
>>>>>>>>>> * descriptors.
>>>>>>>>>> * The following constraints must be satisfied:
>>>>>>>>>> + * tx_rs_thresh must be less than 2 for NICs for which RS deferring is
>>>>>>>>>> + * forbidden (all but 82598).
>>>>>>>>>> * tx_rs_thresh must be greater than 0.
>>>>>>>>>> * tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>>>>> * tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>>> * When set to zero use default values.
>>>>>>>>>> */
>>>>>>>>>> tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>>>> - tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>>>> + tx_conf->tx_rs_thresh :
>>>>>>>>>> + (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
>>>>>>>>>> tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>>> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>>>>>> +
>>>>>>>>>> + if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>>>> + PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
>>>>>>>>>> + "must be set for every packet for this HW. "
>>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>>>> + (unsigned int)tx_rs_thresh,
>>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);
>>>>>>>>>> + return -(EINVAL);
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>>> PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
>>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u "
>>>>>>>>>> --
>>>>>>>>>> 2.1.0
>>
>