[dpdk-dev] [PATCH] mbuf: fix reset on mbuf free

Andrew Rybchenko andrew.rybchenko at oktetlabs.ru
Sun Nov 8 15:16:20 CET 2020


On 11/6/20 3:23 PM, Morten Brørup wrote:
>> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
>> Sent: Friday, November 6, 2020 12:54 PM
>>
>>>>>>>>>>>>>>>>>> Hi Olivier,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> m->nb_seg must be reset on mbuf free
>>>> whatever
>>>>>> the
>>>>>>>> value
>>>>>>>>>> of m->next,
>>>>>>>>>>>>>>>>>>> because it can happen that m->nb_seg is
>> !=
>>>> 1.
>>>>>> For
>>>>>>>>>> instance in this
>>>>>>>>>>>>>>>>>>> case:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    m1 = rte_pktmbuf_alloc(mp);
>>>>>>>>>>>>>>>>>>>    rte_pktmbuf_append(m1, 500);
>>>>>>>>>>>>>>>>>>>    m2 = rte_pktmbuf_alloc(mp);
>>>>>>>>>>>>>>>>>>>    rte_pktmbuf_append(m2, 500);
>>>>>>>>>>>>>>>>>>>    rte_pktmbuf_chain(m1, m2);
>>>>>>>>>>>>>>>>>>>    m0 = rte_pktmbuf_alloc(mp);
>>>>>>>>>>>>>>>>>>>    rte_pktmbuf_append(m0, 500);
>>>>>>>>>>>>>>>>>>>    rte_pktmbuf_chain(m0, m1);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As rte_pktmbuf_chain() does not reset
>>>> nb_seg in
>>>>>> the
>>>>>>>>>> initial m1
>>>>>>>>>>>>>>>>>>> segment (this is not required), after
>> this
>>>> code
>>>>>> the
>>>>>>>>>> mbuf chain
>>>>>>>>>>>>>>>>>>> have 3 segments:
>>>>>>>>>>>>>>>>>>>    - m0: next=m1, nb_seg=3
>>>>>>>>>>>>>>>>>>>    - m1: next=m2, nb_seg=2
>>>>>>>>>>>>>>>>>>>    - m2: next=NULL, nb_seg=1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Freeing this mbuf chain will not
>> restore
>>>>>> nb_seg=1
>>>>>>>> in
>>>>>>>>>> the second
>>>>>>>>>>>>>>>>>>> segment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hmm, not sure why is that?
>>>>>>>>>>>>>>>>>> You are talking about freeing m1, right?
>>>>>>>>>>>>>>>>>> rte_pktmbuf_prefree_seg(struct rte_mbuf
>> *m)
>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>> 	...
>>>>>>>>>>>>>>>>>> 	if (m->next != NULL) {
>>>>>>>>>>>>>>>>>>                          m->next = NULL;
>>>>>>>>>>>>>>>>>>                          m->nb_segs = 1;
>>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> m1->next != NULL, so it will enter the
>> if()
>>>>>> block,
>>>>>>>>>>>>>>>>>> and will reset both next and nb_segs.
>>>>>>>>>>>>>>>>>> What I am missing here?
>>>>>>>>>>>>>>>>>> Thinking in more generic way, that
>> change:
>>>>>>>>>>>>>>>>>>   -		if (m->next != NULL) {
>>>>>>>>>>>>>>>>>>   -			m->next = NULL;
>>>>>>>>>>>>>>>>>>   -			m->nb_segs = 1;
>>>>>>>>>>>>>>>>>>   -		}
>>>>>>>>>>>>>>>>>>   +		m->next = NULL;
>>>>>>>>>>>>>>>>>>   +		m->nb_segs = 1;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ah, sorry. I oversimplified the example
>> and
>>>> now
>>>>>> it
>>>>>>>> does
>>>>>>>>>> not
>>>>>>>>>>>>>>>>> show the issue...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The full example also adds a split() to
>> break
>>>> the
>>>>>>>> mbuf
>>>>>>>>>> chain
>>>>>>>>>>>>>>>>> between m1 and m2. The kind of thing that
>>>> would
>>>>>> be
>>>>>>>> done
>>>>>>>>>> for
>>>>>>>>>>>>>>>>> software TCP segmentation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If so, may be the right solution is to care
>>>> about
>>>>>>>> nb_segs
>>>>>>>>>>>>>>>> when next is set to NULL on split? Any
>> place
>>>> when
>>>>>> next
>>>>>>>> is
>>>>>>>>>> set
>>>>>>>>>>>>>>>> to NULL. Just to keep the optimization in a
>>>> more
>>>>>>>> generic
>>>>>>>>>> place.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem with that approach is that there
>> are
>>>>>> already
>>>>>>>>>> several
>>>>>>>>>>>>>>> existing split() or trim() implementations in
>>>>>> different
>>>>>>>> DPDK-
>>>>>>>>>> based
>>>>>>>>>>>>>>> applications. For instance, we have some in
>>>>>> 6WINDGate. If
>>>>>>>> we
>>>>>>>>>> force
>>>>>>>>>>>>>>> applications to set nb_seg to 1 when
>> resetting
>>>> next,
>>>>>> it
>>>>>>>> has
>>>>>>>>>> to be
>>>>>>>>>>>>>>> documented because it is not straightforward.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think it is better to go that way.
>>>>>>>>>>>>>>  From my perspective it seems natural to reset
>>>> nb_seg at
>>>>>>>> same
>>>>>>>>>> time
>>>>>>>>>>>>>> we reset next, otherwise inconsistency will
>> occur.
>>>>>>>>>>>>>
>>>>>>>>>>>>> While it is not explicitly stated for nb_segs, to
>> me
>>>> it
>>>>>> was
>>>>>>>> clear
>>>>>>>>>> that
>>>>>>>>>>>>> nb_segs is only valid in the first segment, like
>> for
>>>> many
>>>>>>>> fields
>>>>>>>>>> (port,
>>>>>>>>>>>>> ol_flags, vlan, rss, ...).
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we say that nb_segs has to be valid in any
>>>> segments,
>>>>>> it
>>>>>>>> means
>>>>>>>>>> that
>>>>>>>>>>>>> chain() or split() will have to update it in all
>>>>>> segments,
>>>>>>>> which
>>>>>>>>>> is not
>>>>>>>>>>>>> efficient.
>>>>>>>>>>>>
>>>>>>>>>>>> Why in all?
>>>>>>>>>>>> We can state that nb_segs on non-first segment
>> should
>>>>>> always
>>>>>>>> equal
>>>>>>>>>> 1.
>>>>>>>>>>>> As I understand in that case, both split() and
>> chain()
>>>> have
>>>>>> to
>>>>>>>>>> update nb_segs
>>>>>>>>>>>> only for head mbufs, rest ones will remain
>> untouched.
>>>>>>>>>>>
>>>>>>>>>>> Well, anyway, I think it's strange to have a
>> constraint
>>>> on m-
>>>>>>>>> nb_segs
>>>>>>>>>> for
>>>>>>>>>>> non-first segment. We don't have that kind of
>> constraints
>>>> for
>>>>>>>> other
>>>>>>>>>> fields.
>>>>>>>>>>
>>>>>>>>>> True, we don't. But this is one of the fields we
>> consider
>>>>>> critical
>>>>>>>>>> for proper work of mbuf alloc/free mechanism.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure that requiring m->nb_segs == 1 on non-first
>>>>>> segments
>>>>>>>> will provide any benefits.
>>>>>>>>
>>>>>>>> It would make this patch unneeded.
>>>>>>>> So, for direct, non-segmented mbufs  pktmbuf_free() will
>> remain
>>>>>> write-
>>>>>>>> free.
>>>>>>>
>>>>>>> I see. Then I agree with Konstantin that alternative
>> solutions
>>>> should
>>>>>> be considered.
>>>>>>>
>>>>>>> The benefit regarding free()'ing non-segmented mbufs - which
>> is a
>>>>>> very common operation - certainly outweighs the cost of
>> requiring
>>>>>> split()/chain() operations to set the new head mbuf's nb_segs =
>> 1.
>>>>>>>
>>>>>>> Nonetheless, the bug needs to be fixed somehow.
>>>>>>>
>>>>>>> If we can't come up with a better solution that doesn't break
>> the
>>>>>> ABI, we are forced to accept the patch.
>>>>>>>
>>>>>>> Unless the techboard accepts to break the ABI in order to
>> avoid
>>>> the
>>>>>> performance cost of this patch.
>>>>>>
>>>>>> Did someone notice a performance drop with this patch?
>>>>>> On my side, I don't see any regression on a L3 use case.
>>>>>
>>>>> I am afraid that the DPDK performance regression tests are based
>> on
>>>> TX immediately following RX, so cache misses in TX may go by
>> unnoticed
>>>> because RX warmed up the cache for TX already. And similarly for RX
>>>> reusing mbufs that have been warmed up by the preceding free() at
>> TX.
>>>>>
>>>>> Please consider testing the performance difference with the mbuf
>>>> being completely cold at TX, and going completely cold again before
>>>> being reused for RX.
>>>>>
>>>>>>
>>>>>> Let's sumarize: splitting a mbuf chain and freeing it causes
>>>> subsequent
>>>>>> mbuf
>>>>>> allocation to return a mbuf which is not correctly initialized.
>>>> There
>>>>>> are 2
>>>>>> options to fix it:
>>>>>>
>>>>>> 1/ change the mbuf free function (this patch)
>>>>>>
>>>>>>     - m->nb_segs would behave like many other field: valid in
>> the
>>>> first
>>>>>>       segment, ignored in other segments
>>>>>>     - may impact performance (suspected)
>>>>>>
>>>>>> 2/ change all places where a mbuf chain is split, or trimmed
>>>>>>
>>>>>>     - m->nb_segs would have a specific behavior: count the
>> number of
>>>>>>       segments in the first mbuf, should be 1 in the last
>> segment,
>>>>>>       ignored in other ones.
>>>>>>     - no code change in mbuf library, so no performance impact
>>>>>>     - need to patch all places where we do a mbuf split or trim.
>>>> From
>>>>>> afar,
>>>>>>       I see at least mbuf_cut_seg_ofs() in DPDK. Some external
>>>>>> applications
>>>>>>       may have to be patched (for instance, I already found 3
>> places
>>>> in
>>>>>>       6WIND code base without a deep search).
>>>>>>
>>>>>> In my opinion, 1/ is better, except we notice a significant
>>>>>> performance,
>>>>>> because the (implicit) behavior is unchanged.
>>>>>>
>>>>>> Whatever the solution, some documentation has to be added.
>>>>>>
>>>>>> Olivier
>>>>>>
>>>>>
>>>>> Unfortunately, I don't think that anything but the first option
>> will
>>>> go into 20.11 and stable releases of older versions, so I stand by
>> my
>>>> acknowledgment of the patch.
>>>>
>>>> If we are affraid about 20.11 performance (it is legitimate, few
>> days
>>>> before the release), we can target 21.02. After all, everybody
>> lives
>>>> with this bug since 2017, so there is no urgency. If accepted and
>> well
>>>> tested, it can be backported in stable branches.
>>>
>>> +1
>>>
>>> Good thinking, Olivier!
>>
>> Looking at the changes once again, it probably can be reworked a bit:
>>
>> -	if (m->next != NULL) {
>> -		m->next = NULL;
>> -		m->nb_segs = 1;
>> -	}
>>
>> +	if (m->next != NULL)
>> +		m->next = NULL;
>> +	if (m->nb_segs != 1)
>> +		m->nb_segs = 1;
>>
>> That way we add one more condition checking, but I suppose it
>> shouldn't be that perf critical.
>> That way for direct,non-segmented mbuf it still should be write-free.
>> Except cases as you described above: chain(), then split().
>>
>> Of-course we still need to do perf testing for that approach too.
>> So if your preference it to postpone it till 21.02 - that's ok for me.
>> Konstantin
> 
> With this suggestion, I cannot imagine any performance drop for direct, non-segmented mbufs: It now reads m->nb_segs, residing in the mbuf's first cache line, but the function already reads m->refcnt in the first cache line; so no cache misses are introduced.

+1




More information about the dev mailing list