rte_exit() does not terminate the program -- is it a bug or a new feature?
Stephen Hemminger
stephen at networkplumber.org
Fri Sep 15 17:06:08 CEST 2023
On Fri, 15 Sep 2023 10:24:01 +0200
Gabor LENCSE <lencse at hit.bme.hu> wrote:
> Dear DPDK Developers and Users,
>
> I have met the following issue with my RFC 8219 compliant SIIT and
> stateful NAT64/NAT44 tester, siitperf:
> https://github.com/lencsegabor/siitperf
>
> Its main program starts two sending threads and two receiving threads on
> their exclusively used CPU cores using the rte_eal_remote_launch()
> function, e.g., the code is as follows:
>
> // start left sender
> if ( rte_eal_remote_launch(send, &spars1, cpu_left_sender) )
>     std::cout << "Error: could not start Left Sender." << std::endl;
>
> When the test frame sending is finished, the senders check the sending
> time, and if the allowed time was significantly exceeded, the sender
> gives an error message and terminates (itself and also the main program)
> using the rte_exit() function.
>
> This is the code:
>
> elapsed_seconds = (double)(rte_rdtsc()-start_tsc)/hz;
> printf("Info: %s sender's sending took %3.10lf seconds.\n", side, elapsed_seconds);
> if ( elapsed_seconds > duration*TOLERANCE )
>     rte_exit(EXIT_FAILURE, "%s sending exceeded the %3.10lf seconds limit, the test is invalid.\n",
>              side, duration*TOLERANCE);
> printf("%s frames sent: %lu\n", side, sent_frames);
>
> return 0;
>
> The above code worked as I expected while I used siitperf under Debian
> 9.13 with DPDK 16.11.11-1+deb9u2. It always displayed the execution time
> of test frame sending, and if the allowed time was significantly
> exceeded, it gave an error message and terminated, so the sender did not
> print the number of sent frames. The main program was also terminated by
> the call of this function: it did not write out the "Info: Test
> finished." message.
>
> However, when I updated siitperf to use it with Ubuntu 22.04 with DPDK
> version "21.11.3-0ubuntu0.22.04.1 amd64", then I experienced something
> rather strange:
>
> In the case when the sending time is significantly exceeded, I get the
> following messages from the program (I copy the full output here, as it
> may be useful):
>
> root at x033:~/siitperf# cat temp.out
> EAL: Detected CPU lcores: 56
> EAL: Detected NUMA nodes: 4
> EAL: Detected shared linkage of DPDK
> EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
> EAL: Selected IOVA mode 'PA'
> EAL: No free 2048 kB hugepages reported on node 0
> EAL: No free 2048 kB hugepages reported on node 1
> EAL: No free 2048 kB hugepages reported on node 2
> EAL: No free 2048 kB hugepages reported on node 3
> EAL: No available 2048 kB hugepages reported
> EAL: VFIO support initialized
> EAL: Probe PCI driver: net_ice (8086:159b) device: 0000:98:00.0 (socket 2)
> ice_load_pkg_type(): Active package is: 1.3.26.0, ICE OS Default Package
> (single VLAN mode)
> EAL: Probe PCI driver: net_ice (8086:159b) device: 0000:98:00.1 (socket 2)
> ice_load_pkg_type(): Active package is: 1.3.26.0, ICE OS Default Package
> (single VLAN mode)
> TELEMETRY: No legacy callbacks, legacy socket not created
> ice_set_rx_function(): Using AVX2 Vector Rx (port 0).
> ice_set_rx_function(): Using AVX2 Vector Rx (port 1).
> Info: Left port and Left Sender CPU core belong to the same NUMA node: 2
> Info: Right port and Right Receiver CPU core belong to the same NUMA node: 2
> Info: Right port and Right Sender CPU core belong to the same NUMA node: 2
> Info: Left port and Left Receiver CPU core belong to the same NUMA node: 2
> Info: Testing initiated at 2023-09-15 07:50:17
> EAL: Error - exiting with code: 1
> Cause: Forward sending exceeded the 60.0006000000 seconds limit, the
> test is invalid.
> EAL: Error - exiting with code: 1
> Cause: Reverse sending exceeded the 60.0006000000 seconds limit, the
> test is invalid.
> root at x033:~/siitperf#
>
> The rte_exit() function seems to work: the error message appears and
> the number of sent frames is not displayed. However, the "Info: ..."
> message about the sending time (printed earlier in the code) is
> missing! This is rather strange!
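The missing "Info:" line above is most likely stdio buffering rather than
rte_exit() itself: with output redirected to temp.out, stdout is fully
buffered, and because the process never exits normally the buffer is never
flushed. A minimal sketch of the usual fix, with stand-in names for the
variables in the quoted code:

```c
#include <stdio.h>

/* Sketch: emit the timing line and flush explicitly, so the line cannot
 * be left sitting in a full stdio buffer if a later call (such as
 * rte_exit()) blocks and never returns. Names are stand-ins for those
 * in the quoted code. */
int report_elapsed(const char *side, double elapsed_seconds)
{
    int n = printf("Info: %s sender's sending took %3.10lf seconds.\n",
                   side, elapsed_seconds);
    fflush(stdout);   /* the actual fix: flush before any call that may not return */
    return n;         /* number of characters written */
}
```

On the old DPDK, rte_exit() reached exit(), which flushes stdio buffers on
the way out; if the new code path hangs instead, anything still buffered is
simply lost, which would explain the vanished line.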
>
> What is worse, the program does not stop, but *the sender threads and
> the main program remain running (forever)*.
>
> Here is the output of the "top" command:
>
> top - 07:54:24 up 1 day, 14:12, 2 users, load average: 3.02, 2.41, 2.10
> Tasks: 591 total, 2 running, 589 sleeping, 0 stopped, 0 zombie
> %Cpu0  :100.0 us, 0.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu1  :100.0 us, 0.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu2  :  0.0 us, 0.0 sy, 0.0 ni, 94.1 id, 0.0 wa, 0.0 hi, 5.9 si, 0.0 st
> %Cpu3  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu4  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu5  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu6  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu7  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu8  :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu9  :100.0 us, 0.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu10 :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu11 :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu12 :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu13 :  0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> CPU0 is the main core, the left sender and the right sender use CPU1 and
> CPU9, respectively. (The receivers that used CPU5 and CPU13 already
> terminated due to their timeout.)
>
> Thus, rte_exit() behaves differently now: it used to terminate the
> main program, but now it does not. (And it also suppresses some output
> that was printed earlier.)
>
> Is it a bug or a new feature?
>
> *How could I achieve the old behavior?* (Or at least the termination of
> the main program by sender threads?)
>
> Thank you very much for your guidance!
>
> Best regards,
>
> Gábor Lencse
Please get a backtrace. A simple way is to attach gdb to the stuck process.
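For example (the PID is whatever top/ps shows for the stuck siitperf
process):

```
$ sudo gdb -p <PID>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit
```

The "thread apply all bt" output will show where the main lcore and the
senders are blocked, e.g. whether rte_exit() is stuck inside eal_cleanup.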
I suspect that since rte_exit() calls the internal eal_cleanup function,
and that calls close in the driver, the ICE driver close function has a
bug. Perhaps the ice close function does not correctly handle the case
where the device has not been started.
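If that is the cause, a workaround on the application side is to avoid
calling rte_exit() from a worker lcore at all: let the lcore function
return a non-zero status, collect it in the main thread with
rte_eal_wait_lcore(), and call rte_exit() only from the main lcore. A
hedged sketch, reusing the names from the quoted code:

```
/* In the worker: report failure instead of exiting from the lcore. */
if ( elapsed_seconds > duration*TOLERANCE ) {
    printf("%s sending exceeded the %3.10lf seconds limit, the test is invalid.\n",
           side, duration*TOLERANCE);
    fflush(stdout);
    return -1;    /* becomes the lcore's return value */
}

/* In main(), after launching the workers: */
if ( rte_eal_wait_lcore(cpu_left_sender) < 0 )   /* joins send() and returns its value */
    rte_exit(EXIT_FAILURE, "Left Sender failed, the test is invalid.\n");
```

This keeps EAL teardown on the main thread, which is where it was always
implicitly running in the old single-process exit path.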