[dpdk-dev] [PATCH v2 1/2] doc: add guide for debug and troubleshoot

Vipin Varghese vipin.varghese at intel.com
Fri Nov 9 11:06:01 CET 2018


Add a user guide on debugging and troubleshooting common issues and
bottlenecks found in various application models running in single or
multiple stages.

Signed-off-by: Vipin Varghese <vipin.varghese at intel.com>
Acked-by: Marko Kovacevic <marko.kovacevic at intel.com>
---

V2:
 - add offload flag check - Vipin Varghese
 - change tab to space - Marko Kovacevic
 - Spelling correction - Marko Kovacevic
 - remove extra characters - Marko Kovacevic
 - add ACK by Marko - Vipin Varghese
---
 doc/guides/howto/debug_troubleshoot_guide.rst | 349 ++++++++++++++++++
 doc/guides/howto/index.rst                    |   1 +
 2 files changed, 350 insertions(+)
 create mode 100644 doc/guides/howto/debug_troubleshoot_guide.rst

diff --git a/doc/guides/howto/debug_troubleshoot_guide.rst b/doc/guides/howto/debug_troubleshoot_guide.rst
new file mode 100644
index 000000000..a76000231
--- /dev/null
+++ b/doc/guides/howto/debug_troubleshoot_guide.rst
@@ -0,0 +1,349 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2018 Intel Corporation.
+
+.. _debug_troubleshoot_via_pmd:
+
+Debug & Troubleshoot guide via PMD
+==================================
+
+DPDK applications can be designed to run as a single thread with a simple
+stage, or as multiple threads with complex pipeline stages. These
+applications can use poll mode drivers, which help offload CPU cycles. A
+few models are
+
+	*  single primary
+	*  multiple primary
+	*  single primary single secondary
+	*  single primary multiple secondary
+
+In all the above cases, it is a tedious task to isolate, debug and understand
+odd behaviour which occurs randomly or periodically. The goal of this guide
+is to share and explore a few commonly seen patterns and behaviours, then
+isolate and identify the root cause via step by step debugging at various
+processing stages.
+
+Application Overview
+--------------------
+
+Let us take an example application as a reference for explaining the issues
+and patterns commonly seen. The sample application under discussion uses a
+single primary model with various pipeline stages, and makes use of PMDs and
+libraries such as service cores, mempool, pkt mbuf, event, crypto, QoS
+and eth.
+
+The overview of an application modeled using PMD is shown in
+:numref:`dtg_sample_app_model`.
+
+.. _dtg_sample_app_model:
+
+.. figure:: img/dtg_sample_app_model.*
+
+   Overview of the pipeline stages of an application
+
+Bottleneck Analysis
+-------------------
+
+To debug bottleneck and performance issues, the application under test is
+run in an environment matching the below:
+
+-  Linux 64-bit|32-bit
+-  DPDK PMDs and libraries are used
+-  Libraries and PMDs are either static or shared, but not both
+-  Machine flag optimizations of gcc or the compiler are held constant
+
+Is there a mismatch in the packet rate (received < sent)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RX port and the associated core are shown in :numref:`dtg_rx_rate`.
+
+.. _dtg_rx_rate:
+
+.. figure:: img/dtg_rx_rate.*
+
+   RX send rate compared against received rate
+
+#. Is the generic configuration correct? (see the sketch after this list)
+	-  What are the port speed and duplex? rte_eth_link_get()
+	-  Are packets of larger sizes dropped? rte_eth_dev_get_mtu()
+	-  Are only specific MACs received? rte_eth_promiscuous_get()
+
+#. Are there NIC specific drops?
+	-  Check rte_eth_rx_queue_info_get() for nb_desc and scattered_rx
+	-  Check rte_eth_stats_get() for stats per queue
+	-  Do the stats of the other queues show no change? Check the RSS
+	   configuration via rte_eth_dev_rss_hash_conf_get()
+
+#. If the problem still persists, it might be at the RX lcore thread
+	-  Check if the RX thread, distributor or event RX adapter is holding
+	   or processing more than required
+	-  Try using rte_prefetch_non_temporal() to hint that the pulled mbuf
+	   is needed in cache only temporarily
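+
+Below is a minimal sketch of these generic port checks, assuming an
+initialized EAL and a configured port; the variable port_id is
+hypothetical, and the exact fields to inspect depend on the PMD in use.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_ethdev.h and inttypes.h. */
+   struct rte_eth_link link;
+   struct rte_eth_stats stats;
+   uint16_t mtu;
+
+   rte_eth_link_get(port_id, &link);
+   printf("speed %u Mbps, %s-duplex, link %s\n", link.link_speed,
+          link.link_duplex == ETH_LINK_FULL_DUPLEX ? "full" : "half",
+          link.link_status ? "up" : "down");
+
+   if (rte_eth_dev_get_mtu(port_id, &mtu) == 0)
+       printf("MTU %u\n", mtu);
+
+   printf("promiscuous mode %s\n",
+          rte_eth_promiscuous_get(port_id) == 1 ? "on" : "off");
+
+   /* drop counters help isolate where packets are lost */
+   if (rte_eth_stats_get(port_id, &stats) == 0)
+       printf("imissed %" PRIu64 " ierrors %" PRIu64
+              " rx_nombuf %" PRIu64 "\n",
+              stats.imissed, stats.ierrors, stats.rx_nombuf);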
+
+
+Are there packet drops (receive|transmit)?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The RX-TX ports and associated cores are shown in :numref:`dtg_rx_tx_drop`.
+
+.. _dtg_rx_tx_drop:
+
+.. figure:: img/dtg_rx_tx_drop.*
+
+   RX-TX drops
+
+#. At RX
+	-  Get the number of RX queues via rte_eth_dev_info_get() for
+	   nb_rx_queues
+	-  Check for misses, errors and queue drops via rte_eth_stats_get()
+	   for imissed, ierrors, q_errors and rx_nombuf; also check the mbuf
+	   reference counts via rte_mbuf_refcnt_read()
+
+#. At TX
+	-  Are we transmitting in bulk to reduce the TX descriptor overhead?
+	   (see the sketch after this list)
+	-  Check rte_eth_stats_get() for oerrors and q_errors, and the mbuf
+	   reference counts
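+
+The sketch below shows bulk transmit with drop accounting; it assumes a
+configured port and a filled packet array. The variables port_id, queue_id,
+pkts and nb_pkts are hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_ethdev.h, rte_mbuf.h and inttypes.h. */
+   uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
+   uint16_t i;
+
+   /* mbufs not accepted by the TX descriptor ring must be freed by
+    * the application, otherwise they leak */
+   for (i = nb_tx; i < nb_pkts; i++)
+       rte_pktmbuf_free(pkts[i]);
+
+   struct rte_eth_stats stats;
+   if (rte_eth_stats_get(port_id, &stats) == 0)
+       printf("oerrors %" PRIu64 " q_errors[0] %" PRIu64 "\n",
+              stats.oerrors, stats.q_errors[0]);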
+
+Are there object drops at the producer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The producer point for a ring is shown in :numref:`dtg_producer_ring`.
+
+.. _dtg_producer_ring:
+
+.. figure:: img/dtg_producer_ring.*
+
+   Producer point for Rings
+
+#. Performance for the producer
+	-  Fetch the type of the ring via rte_ring_dump() for flags
+	   (RING_F_SP_ENQ)
+	-  If '(burst enqueue - actual enqueue) > 0', check rte_ring_count()
+	   or rte_ring_free_count()
+	-  If a burst or single enqueue returns 0, there is no more space;
+	   check with rte_ring_full() (see the sketch after this list)
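+
+A minimal sketch of producer-side drop detection, assuming a created ring;
+the variables ring, objs and nb_objs are hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_ring.h. */
+   unsigned int free_space;
+   unsigned int n = rte_ring_enqueue_burst(ring, (void **)objs,
+                                           nb_objs, &free_space);
+
+   if (n < nb_objs) {
+       /* the consumer is too slow or the ring is too small */
+       printf("enqueued %u of %u, free entries %u, full %d\n",
+              n, nb_objs, rte_ring_free_count(ring),
+              rte_ring_full(ring));
+   }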
+
+Are there object drops at the consumer point for the ring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The consumer point for a ring is shown in :numref:`dtg_consumer_ring`.
+
+.. _dtg_consumer_ring:
+
+.. figure:: img/dtg_consumer_ring.*
+
+   Consumer point for Rings
+
+#. Performance for the consumer
+	-  Fetch the type of the ring via rte_ring_dump() for flags
+	   (RING_F_SC_DEQ)
+	-  If '(burst dequeue - actual dequeue) > 0', check
+	   rte_ring_free_count()
+	-  If a burst or single dequeue always returns 0, check whether the
+	   ring is empty via rte_ring_empty() (see the sketch after this list)
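+
+A minimal sketch of the consumer-side checks; the variables ring, objs and
+burst_size are hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_ring.h. */
+   unsigned int avail;
+   unsigned int n = rte_ring_dequeue_burst(ring, (void **)objs,
+                                           burst_size, &avail);
+
+   if (n == 0 && rte_ring_empty(ring)) {
+       /* the producer is not keeping up; debug the upstream stage */
+       printf("ring empty, used entries %u\n", rte_ring_count(ring));
+   }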
+
+Are packets or objects not processed at the desired rate?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Memory objects close to the NUMA node are shown in :numref:`dtg_mempool`.
+
+.. _dtg_mempool:
+
+.. figure:: img/dtg_mempool.*
+
+   Memory objects have to be close to the device per NUMA
+
+#. Is the performance low?
+	-  Are packets received from multiple NICs? rte_eth_dev_count_all()
+	-  Are the NIC interfaces on different sockets? Use
+	   rte_eth_dev_socket_id()
+	-  Is the mempool created on the right socket? rte_mempool_create() or
+	   rte_pktmbuf_pool_create() (see the sketch after this list)
+	-  Are we seeing drops on a specific socket? It might require more
+	   mempool objects; try allocating more objects
+	-  Is there a single RX thread for multiple NICs? Try having multiple
+	   lcores read from fixed interfaces, or we might be hitting the cache
+	   limit, so increase cache_size for pool_create()
+
+#. Are we still seeing low performance?
+	-  Check if there are sufficient objects in the mempool via
+	   rte_mempool_avail_count()
+	-  Is there a failure for some packets? We might be getting packets
+	   with size > mbuf data size; check rte_pktmbuf_is_contiguous()
+	-  If a user pthread is used to access objects, use
+	   rte_mempool_cache_create()
+	-  Try using 1GB huge pages instead of 2MB. If there is a difference,
+	   then try rte_mem_lock_page() for 2MB pages
+
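+A minimal sketch of NUMA-aware pool creation and a mempool health check;
+the variable port_id and the pool sizing values are hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_ethdev.h, rte_mbuf.h and rte_mempool.h. */
+   int socket = rte_eth_dev_socket_id(port_id);
+   struct rte_mempool *mp = rte_pktmbuf_pool_create("rx_pool", 8192,
+           256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);
+
+   if (mp == NULL)
+       printf("pool create failed on socket %d\n", socket);
+
+   /* poll periodically: a steadily shrinking count hints at a leak
+    * or a stalled stage holding on to mbufs */
+   printf("available objects: %u\n", rte_mempool_avail_count(mp));
+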
+.. note::
+  A stall in the release of mbufs can be because
+
+	*  The processing pipeline is too heavy
+	*  There are too many stages
+	*  TX is not transferred at the desired rate
+	*  Multi segment is not offloaded at the TX device
+	*  Application misuse scenarios, such as
+		-  not freeing packets
+		-  invalid rte_pktmbuf_refcnt_set
+		-  invalid rte_pktmbuf_prefree_seg
+
+Is there a difference in performance for crypto?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The crypto device and PMD are shown in :numref:`dtg_crypto`.
+
+.. _dtg_crypto:
+
+.. figure:: img/dtg_crypto.*
+
+   Crypto and interaction with the PMD device
+
+#. Is the generic configuration correct?
+	-  Get the total crypto devices via rte_cryptodev_count()
+	-  Cross check that the SW or HW flags are configured properly via
+	   rte_cryptodev_info_get() for feature_flags
+
+#. Is the enqueue request > actual enqueue (drops)?
+	-  Is the queue pair set up on the proper node?
+	   rte_cryptodev_queue_pair_setup() for socket_id
+	-  Is the session_pool created from the same socket_id as the queue
+	   pair?
+	-  Is the enqueue thread on the same socket_id?
+	-  Check rte_cryptodev_stats_get() for the enqueue or dequeue error
+	   counts
+	-  Are there multiple threads enqueueing or dequeueing from the same
+	   queue pair?
+
+#. Is the enqueue rate > dequeue rate?
+	-  Is the dequeue lcore thread on the same socket_id?
+	-  If SW crypto is in use, check if the crypto library is built with
+	   the right (SIMD) flags, or check if the queue pair uses the CPU ISA
+	   via rte_cryptodev_info_get() for feature_flags for AVX|SSE
+	-  If HW crypto is in use, is the card on the same NUMA socket as the
+	   queue pair and session pool? (see the sketch after this list)
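+
+A minimal sketch of the generic crypto checks above; the variable cdev_id
+is hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_cryptodev.h and inttypes.h. */
+   struct rte_cryptodev_info info;
+   struct rte_cryptodev_stats stats;
+
+   printf("crypto devices: %u\n", rte_cryptodev_count());
+
+   rte_cryptodev_info_get(cdev_id, &info);
+   printf("HW accelerated: %s\n",
+          (info.feature_flags & RTE_CRYPTODEV_FF_HW_ACCELERATED) ?
+          "yes" : "no");
+
+   /* error counters expose drops at enqueue or dequeue */
+   if (rte_cryptodev_stats_get(cdev_id, &stats) == 0)
+       printf("enq errors %" PRIu64 " deq errors %" PRIu64 "\n",
+              stats.enqueue_err_count, stats.dequeue_err_count);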
+
+Worker functions not giving performance?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A custom worker function is shown in :numref:`dtg_distributor_worker`.
+
+.. _dtg_distributor_worker:
+
+.. figure:: img/dtg_distributor_worker.*
+
+   Custom worker function performance drops
+
+#. Performance
+	-  Are threads context switching more frequently? Identify the lcore
+	   with rte_lcore_id() and the lcore index mapping with
+	   rte_lcore_index(). Performance is best when the mapping of thread
+	   to core is 1:1.
+	-  Check the lcore role type and state via rte_eal_lcore_role() for
+	   rte, off and service. A user function on a service core might be
+	   sharing timeslots with other functions.
+	-  Check the CPU core via rte_thread_get_affinity() and
+	   rte_eal_get_lcore_state() for the run state.
+
+#. Debug
+	-  Mode of operation? rte_eal_get_configuration() for master, the
+	   lcore|service|NUMA counts and process_type.
+	-  Check the lcore run mode via rte_eal_lcore_role() for rte, off and
+	   service.
+	-  Process details? rte_dump_stack(), rte_dump_registers() and
+	   rte_memdump() give insights (see the sketch after this list).
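+
+A minimal sketch of inspecting the current lcore, run from the thread being
+debugged.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_lcore.h and rte_debug.h. */
+   unsigned int lcore = rte_lcore_id();
+
+   printf("lcore %u, index %d, role %d, state %d\n",
+          lcore, rte_lcore_index(lcore),
+          rte_eal_lcore_role(lcore),
+          rte_eal_get_lcore_state(lcore));
+
+   rte_dump_stack();    /* where is this thread right now? */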
+
+Service functions are not frequent enough?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Service functions on service cores are shown in :numref:`dtg_service`.
+
+.. _dtg_service:
+
+.. figure:: img/dtg_service.*
+
+   Functions running on service cores
+
+#. Performance
+	-  Get the service core count via rte_service_lcore_count() and
+	   compare with the result of rte_eal_get_configuration()
+	-  Check if the registered service is available via
+	   rte_service_get_by_name(), rte_service_get_count() and
+	   rte_service_get_name()
+	-  Is the given service running in parallel on multiple lcores?
+	   rte_service_probe_capability() and rte_service_map_lcore_get()
+	-  Is the service running? rte_service_runstate_get()
+
+#. Debug
+	-  Find how many services are running on a specific service lcore via
+	   rte_service_lcore_count_services()
+	-  Generic debug via rte_service_dump() (see the sketch after this
+	   list)
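+
+A minimal sketch of the service checks above; "my_service" is a
+hypothetical registered service name.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_service.h, stdint.h and stdio.h. */
+   uint32_t id;
+
+   printf("service lcores: %d\n", rte_service_lcore_count());
+
+   if (rte_service_get_by_name("my_service", &id) == 0)
+       printf("runstate %d\n", rte_service_runstate_get(id));
+
+   rte_service_dump(stdout, UINT32_MAX);    /* dump all services */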
+
+Is there a bottleneck in eventdev?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. Is the generic configuration correct?
+	-  Get the number of eventdev devices via rte_event_dev_count()
+	-  Are they created on the correct socket_id? rte_event_dev_socket_id()
+	-  Check the HW or SW capabilities via rte_event_dev_info_get() for
+	   event_qos, queue_all_types, burst_mode, multiple_queue_port,
+	   max_event_queue|dequeue_depth
+	-  Are packets stuck in a queue? Check for stages (event queues) where
+	   packets are looped back to the same or previous stages.
+
+#. Performance drops in enqueue (event count > actual enqueue)?
+	-  Dump the eventdev information via rte_event_dev_dump() (see the
+	   sketch after this list)
+	-  Check the stats for the eventdev queues and ports
+	-  Check the inflight count and the current queue elements for
+	   enqueue|dequeue
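+
+A minimal sketch of the eventdev checks above; the variable dev_id is
+hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_eventdev.h. */
+   struct rte_event_dev_info info;
+
+   printf("event devices: %u\n", rte_event_dev_count());
+
+   if (rte_event_dev_info_get(dev_id, &info) == 0)
+       printf("socket %d, max queues %u, burst mode %s\n",
+              rte_event_dev_socket_id(dev_id),
+              info.max_event_queues,
+              (info.event_dev_cap & RTE_EVENT_DEV_CAP_BURST_MODE) ?
+              "yes" : "no");
+
+   rte_event_dev_dump(dev_id, stdout);    /* queue and port state */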
+
+How to debug QoS via TM?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+TM on the TX interface is shown in :numref:`dtg_qos_tx`.
+
+.. _dtg_qos_tx:
+
+.. figure:: img/dtg_qos_tx.*
+
+   Traffic Manager just before TX
+
+#. Is the configuration right?
+	-  Get the current capabilities for the DPDK port via
+	   rte_tm_capabilities_get() for max nodes, levels, shaper_private,
+	   shaper_shared, sched_n_children and stats_mask
+	-  Check if the current leaves are configured identically via
+	   rte_tm_capabilities_get() for leaf_nodes_identical
+	-  Get the leaf nodes for a DPDK port via
+	   rte_tm_get_number_of_leaf_nodes()
+	-  Check the level capabilities via rte_tm_level_capabilities_get()
+	   for n_nodes
+		-  max, nonleaf_max, leaf_max
+		-  identical, non_identical
+		-  shaper_private_supported
+		-  stats_mask
+		-  cman WRED packet|byte supported
+		-  cman head drop supported
+	-  Check the node capabilities via rte_tm_node_capabilities_get() for
+	   n_nodes
+		-  shaper_private_supported
+		-  stats_mask
+		-  cman WRED packet|byte supported
+		-  cman head drop supported
+	-  Debug via stats: rte_tm_node_stats_update() and
+	   rte_tm_node_stats_read() (see the sketch after this list)
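+
+A minimal sketch of the TM capability and statistics checks; the variables
+port_id and node_id are hypothetical.
+
+.. code-block:: c
+
+   /* Fragment; requires rte_tm.h and inttypes.h. */
+   struct rte_tm_capabilities cap;
+   struct rte_tm_node_stats stats;
+   struct rte_tm_error error;
+   uint64_t stats_mask;
+
+   if (rte_tm_capabilities_get(port_id, &cap, &error) == 0)
+       printf("max nodes %u, levels %u, leaf nodes identical %d\n",
+              cap.n_nodes_max, cap.n_levels_max,
+              cap.leaf_nodes_identical);
+
+   /* read without clearing (clear = 0) */
+   if (rte_tm_node_stats_read(port_id, node_id, &stats,
+                              &stats_mask, 0, &error) == 0)
+       printf("node packets %" PRIu64 "\n", stats.n_pkts);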
+
+Is the packet not of the right format?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Packet capture before and after processing is shown in :numref:`dtg_pdump`.
+
+.. _dtg_pdump:
+
+.. figure:: img/dtg_pdump.*
+
+   Capture points of Traffic at RX-TX
+
+#. With capture enabled in the primary process, the secondary process can
+   access it; packets from specific RX or TX queues are copied into the
+   secondary process's ring buffers.
+
+.. note::
+  Need to explore:
+
+	*  If the secondary shares the same interface, can capture be enabled
+	   from the secondary for RX|TX happening on the primary?
+	*  PMD-specific private data: dump the details
+	*  User private data, if present: dump the details
+
+How to develop custom code to debug?
+------------------------------------
+
+-  For a single process, the debug functionality is to be added to the
+   same process
+-  For multiple processes, the debug functionality can be added to a
+   secondary multi-process application
+
+..
+
+These can be achieved by the primary's debug functions invoked via
+
+	#. Timer call-back
+	#. Service function under a service core
+	#. USR1 or USR2 signal handler (see the sketch below)
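+
+A minimal sketch of the signal handler approach; the handler name is
+hypothetical and the dump calls are illustrative.
+
+.. code-block:: c
+
+   /* Fragment; requires signal.h and rte_debug.h. Dump state on
+    * demand with "kill -USR1 <pid>". */
+   static void
+   debug_signal_handler(int sig)
+   {
+       if (sig == SIGUSR1) {
+           rte_dump_stack();
+           /* add application-specific stats dumps here */
+       }
+   }
+
+   /* during initialization */
+   signal(SIGUSR1, debug_signal_handler);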
+
diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index a642a2be1..ca4905e29 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -18,3 +18,4 @@ HowTo Guides
     virtio_user_as_exceptional_path
     packet_capture_framework
     telemetry
+    debug_troubleshoot_guide
-- 
2.17.1


