[dpdk-dev] [PATCH v2 00/15] Unit tests fixes for CI

Michael Santana Francisco msantana at redhat.com
Tue Jul 9 17:50:49 CEST 2019


On 7/1/19 2:07 PM, Michael Santana Francisco wrote:
>>
>>
>> On Mon, Jul 1, 2019 at 6:04 PM Aaron Conole <aconole at redhat.com> wrote:
>>>>> - rwlock_autotest and hash_readwrite_lf_autotest are taking a little more
>>>>>    than 10s,
>>> Occasionally the distributor test times out as well.  I've moved them as
>>> part of a separate patch, that I'll post along with a bigger series to
>>> enable the unit tests under travis.  Michael and I are leaning toward
>>> introducing a new variable called RUN_TESTS which will do the docs and
>>> unit testing since those combined would add quite a bit to the execution
>>> time of each job (and feel free to bike shed the name, since the patches
>>> aren't final).
>>
>> Seeing how the distributor autotest usually takes less than a second to complete, this sounds like a bug.
>> I don't think I caught this so far.
> So I actually ran into the distributor test timing out. I agree with
> David in that it is a bug with the test. Looking at the logs that test
> normally finishes in less than 1/2 a second, so running to 10 seconds
> and timing out is a big jump in run time. I ran into the issue where
> it timedout, so I restarted the job and it finished no problem.
> The test fails every so often for no good reason and the logs[1] dont
> really say much. I speculate that it is waiting for a resource to
> become available or in the worse case a deadlock. Seeing that it only
> fails every so often and it passes when restarted I don't think it's a
> big deal, nevertheless it's worth investing time figuring out what's
> wrong
>
> [1] https://api.travis-ci.com/v3/job/212335916/log.txt

I investigated a little bit on this this test. CC'd David Hunt,

I was able to reproduce the problem on v19.08-rc1 with:

`while sudo sh -c "echo 'distributor_autotest' | 
./build/app/test/dpdk-test"; do :; done`

It runs a couple of times fine showing output and showing progress, but 
then at some point after a couple of seconds it just stops - no longer 
getting any output. It just sits there with no further output. I let it 
sit there for a whole minute and nothing happens. So I attach gdb to try 
to figure out what is happening. One thread seems to be stuck on a while 
loop, see lib/librte_distributor/rte_distributor.c:310.

I looked at the assembly code (layout asm, ni) and I saw these four 
lines below (which correspond to the while loop) being executed 
repeatedly and indefinitely. It looks like this thread is waiting for 
the variable bufptr64[0] to change state.

0xa064d0 <release+32>   pause
0xa064d2 <release+34>   mov    0x3840(%rdx),%rax
0xa064d9 <release+41>   test   $0x1,%al
0xa064db <release+43>   je     0xa064d0 <release+32>


While the first thread is waiting on bufptr64[0] to change state, there 
is another thread that is also stuck on another while loop on 
lib/librte_distributor/rte_distributor.c:53. It seems that this thread 
is stuck waiting for retptr64 to change state. Corresponding assembly 
being executed indefinitely:

0xa06de0 <rte_distributor_request_pkt_v1705+592> mov    0x38c0(%r8),%rax
0xa06de7 <rte_distributor_request_pkt_v1705+599> test   $0x1,%al
0xa06de9 <rte_distributor_request_pkt_v1705+601> je     0xa06bbd 
<rte_distributor_request_pkt_v1705+45>
0xa06def <rte_distributor_request_pkt_v1705+607>        nop
0xa06df0 <rte_distributor_request_pkt_v1705+608> pause
0xa06df2 <rte_distributor_request_pkt_v1705+610> rdtsc
0xa06df4 <rte_distributor_request_pkt_v1705+612> mov    %rdx,%r10
0xa06df7 <rte_distributor_request_pkt_v1705+615> shl    $0x20,%r10
0xa06dfb <rte_distributor_request_pkt_v1705+619> mov    %eax,%eax
0xa06dfd <rte_distributor_request_pkt_v1705+621> or     %r10,%rax
0xa06e00 <rte_distributor_request_pkt_v1705+624> lea    0x64(%rax),%r10
0xa06e04 <rte_distributor_request_pkt_v1705+628> jmp    0xa06e12 
<rte_distributor_request_pkt_v1705+642>
0xa06e06 <rte_distributor_request_pkt_v1705+630> nopw   %cs:0x0(%rax,%rax,1)
0xa06e10 <rte_distributor_request_pkt_v1705+640> pause
0xa06e12 <rte_distributor_request_pkt_v1705+642> rdtsc
0xa06e14 <rte_distributor_request_pkt_v1705+644> shl    $0x20,%rdx
0xa06e18 <rte_distributor_request_pkt_v1705+648> mov    %eax,%eax
0xa06e1a <rte_distributor_request_pkt_v1705+650> or     %rdx,%rax
0xa06e1d <rte_distributor_request_pkt_v1705+653> cmp    %rax,%r10
0xa06e20 <rte_distributor_request_pkt_v1705+656> ja     0xa06e10 
<rte_distributor_request_pkt_v1705+640>
0xa06e22 <rte_distributor_request_pkt_v1705+658> jmp    0xa06de0 
<rte_distributor_request_pkt_v1705+592>


My guess is that these threads are interdependent, so one thread is 
waiting for the other thread to change the state of the control 
variable. I can't say for sure if this is what is happening or why the 
these variables don't change state, so I would like ask someone who is 
more familiar with this particular code to take a look

>>
>> Yes, we need a variable to control this and select the targets that will do the tests and/or build the doc.
>> About the name, RUN_TESTS is ok for me.
>>
>> What do you want to make of this variable?
>> Have it as a simple boolean that enables everything? Or a selector with strings like unit-tests+doc+perf-tests?
>>
>>
>>>>> - librte_table unit test crashes on ipv6 [2],
>>> I guess we're waiting on a patch from Jananee (CC'd)?
>>
>> Yep.
>>
>>
>> --
>> David Marchand




More information about the dev mailing list