[dpdk-dev] Kernel panic in KNI

Sanford, Robert rsanford at akamai.com
Thu Apr 7 00:30:29 CEST 2016


Hi Jay,

I won't try to interpret your kernel stack trace. But, I'll tell you about
a KNI-related problem that we once experienced, and the symptom was a
kernel hang.

The problem was that we were passing mbufs allocated out of one mempool,
to a KNI context that we had set up with a different mempool (on a
different CPU socket). The KNI kernel driver, converts the user-space mbuf
virtual address (VA) to a kernel VA by adding the difference between the
user and kernel VAs of the mempool used to create the KNI context. So, if
an mbuf comes from a different mempool, the calculated address will
probably be VERY BAD.

Could this be your problem?

--
Robert


On 4/6/16 4:16 PM, "Jay Rolette" <rolette at infinite.io> wrote:

>I had a system lockup hard a couple of days ago and all we were able to
>get
>was a photo of the LCD monitor with most of the kernel panic on it. No way
>to scroll back the buffer and nothing in the logs after we rebooted. Not
>surprising with a kernel panic due to an exception during interrupt
>processing. We have a serial console attached in case we are able to get
>it
>to happen again, but it's not easy to reproduce (hours of runtime for this
>instance).
>
>Ran the photo through OCR software to get a text version of the dump, so
>possible I missed some fixups in this:
>
>[39178.433262] RDX: 00000000000000ba RSI: ffff881fd2f350ee RDI:
>a12520669126180a
>[39178.464020] RBP: ffff880433966970 R08: a12520669126180a R09:
>ffff881fd2f35000
>[39178.495091] R10: 000000000000ffff R11: ffff881fd2f88000 R12:
>ffff883fdla75ee8
>[39178.526594] R13: 00000000000000ba R14: 00007fdad5a66780 R15:
>ffff883715ab6780
>[39178.559011] FS:  00007ffff7fea740(0000) GS:ffff88lfffc00000(0000)
>knlGS:0000000000000000
>[39178.592005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[39178.623931] CR2: 00007ffff7ea2000 CR3: 0000001fd156f000 CR4:
>00000000001407f0
>[39178.656187] Stack:
>[39178.689025] ffffffffc067c7ef 00000000000000ba 00000000000000ba
>ffff881fd2f88000
>[39178.722682] 0000000000004000 ffff8B3fd0bbd09c ffff883fdla75ee8
>ffff8804339bb9c8
>[39178.756525] ffffffff81658456 ffff881fcd2ec40c ffffffffc0680700
>ffff880436bad800
>[39178.790577] Call Trace:
>[39178.824420] [<ffffffffc067c7ef>] ? kni_net_tx+0xef/0x1a0 [rte_kni]
>[39178.859190] [<ffffffff81658456>] dev_hard_start_xmit+0x316/0x5c0
>[39178.893426] [<ffffffff816797ae>] sch_direct_xmit+0xee/0xic0
>[39178.927435] [<ffffffff81658900>l __dev_queue_xmit+0x200/0x4d0
>[39178.961684] [<ffffffff816586e0>l dev_queue_xmit+0x10/0x20
>[39178.996194] [<ffffffff81661477>] neigh_connected_output+0x67/0x100
>[39179.031098] [<ffffffff81693748>] ip_finish_output+0xid8/0x850
>[39179.066709] [<ffffffff816950e8>l ip_output+0x58/0x90
>[39179.101551] [<ffffffff81694860>] ip_local_out_sk+0x30/0x40
>[39179.136823] [<ffffffff81694c1f>] ip_queue_xmit+0xl3f/0x3d0
>[39179.171742] [<ffffffff816aba2c>] tcp_transmit_skb+0x47c/0x900
>[39179.206854] [<ffffffff816abfc0>l tcp_write_xmit+0x110/0xcb0
>[39179.242335] [<ffffffff816acdce>] __tcp_push_pending_frames+0x2e/0xc0
>[39179.277632] [<ffffffff8169c41c>] tcp_push+0xec/0x120
>[39179.311768] [<ffffffff8169e199>] tcp_sendmsg+0xb9/0xce0
>[39179.346934] [<ffffffff8169fa22>] ? tcp_recvmsg+0x6e2/0xba0
>[39179.385586] [<ffffffff816c8c94>] inet_sendmsg+0x64/0x60
>[39179.424228] [<ffffffff81324461>] ? apparmor_socket_sendmsg+0x21/0x30
>[39179.4586581 [<ffffffff8163d32b>] sock_sendmsg+0x86/0xc0
>[39179.493220] [<ffffffff816c7995>] ? __inet_stream_connect+0xa5/0x320
>[39179.528033] [<ffffffff811ec163>] ? __fdget+0x13/0x20
>[39179.561214] [<ffffffff8163d4d1>] SYSC_sendto+0x121/0x1c0
>[39179.594665] [<ffffffff8132edfd>] ? aa_sk_perm.isra.4+0x6d/0x150
>[39179.6268931 [<ffffffff8101bd49>] ? read_tsc+0x9/0x20
>[39179.6586541 [<ffffffff810d6128>] ? ktime_get_ts+0x48/0xe0
>[39179.689944] [<ffffffff8163dfae>] SyS_sendto+0xe/0x10
>[39179.719575] [<ffffffff8175d48d>] system_call_fastpath+0xia/0xif
>[39179.748760] Code: 43 58 48 Zb 43 50 88 43 4e 5b 5d c3 66 Of if 84 00 00
>00 00 00 e8 fb fb ff ff eb e2 90 90 90 90 90 90 90
> 90 48 89 f8 48 89 d1 <f3> a4 c3 03 83 eZ 07 f3 48 .15 89 di f3 a4 c3 20
>4c
>8b % 4c 86
>[39179.808690] RIP  [<ffffffff8138d8e6>] memcpy+0x6/0x110
>[39179.837238]  RSP <ffff8804339bb938>
>[39179.933755] ---[ end trace 2971562f425e2cf8 ]---
>[39179.964856] Kernel panic - not syncing: Fatal exception in interrupt
>[39179.992896] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation
>range: 0xffffffff80000000-0xffffffffbfffffff)
>[39180.024617] ---[ end Kernel panic - not syncing: Fatal exception in
>interrupt
>
>It blew up when kni_net_tx() called memcpy() to copy data from the skb to
>an mbuf.
>
>Disclosure: I'm not a Linux device driver guy. I dip into the kernel as
>needed. Plenty of experience doing RTOS and bare metal development, but
>not
>a Linux kernel expert.
>
>What context does kni_net_tx() run in? On the receive path, my
>understanding is that KNI always runs in process context on a kthread.
>I've
>been assuming that the transmit path was also in process context (albeit
>on
>the app's process), so the "Fatal exception in interrupt" is throwing me.
>
>Does kni_net_tx() ever run in interrupt (or soft-interrupt) context?
>
>Thanks,
>Jay



More information about the dev mailing list