[dpdk-dev] [PATCH v5 0/8] Introduce event vectorization

Pavan Nikhilesh Bhagavatula pbhagavatula at marvell.com
Wed Mar 24 07:44:59 CET 2021


>> From: pbhagavatula at marvell.com <pbhagavatula at marvell.com>
>> Sent: Wednesday, March 24, 2021 10:35 AM
>> To: jerinj at marvell.com; Jayatheerthan, Jay <jay.jayatheerthan at intel.com>;
>>     Carrillo, Erik G <erik.g.carrillo at intel.com>; Gujjar, Abhinandan S
>>     <abhinandan.gujjar at intel.com>; McDaniel, Timothy
>>     <timothy.mcdaniel at intel.com>; hemant.agrawal at nxp.com;
>>     Van Haaren, Harry <harry.van.haaren at intel.com>; mattias.ronnblom
>>     <mattias.ronnblom at ericsson.com>; Ma, Liang J <liang.j.ma at intel.com>
>> Cc: dev at dpdk.org; Pavan Nikhilesh <pbhagavatula at marvell.com>
>> Subject: [dpdk-dev] [PATCH v5 0/8] Introduce event vectorization
>>
>> From: Pavan Nikhilesh <pbhagavatula at marvell.com>
>>
>> In the traditional event programming model, events are identified by a
>> flow-id and a uintptr_t. The flow-id uniquely identifies a given event
>> and determines the order of scheduling based on the schedule type; the
>> uintptr_t holds a single object.
>>
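>> In this model a single object is attached per event, e.g. (a minimal
>> sketch; the variable names are illustrative):
>>
>> 	struct rte_event ev = {0};
>>
>> 	ev.flow_id = flow_id;	/* determines scheduling/ordering */
>> 	ev.sched_type = RTE_SCHED_TYPE_ATOMIC;
>> 	ev.queue_id = queue_id;
>> 	ev.event_type = RTE_EVENT_TYPE_ETHDEV;
>> 	ev.op = RTE_EVENT_OP_NEW;
>> 	ev.mbuf = pkt;		/* the single object */
>> 	rte_event_enqueue_burst(event_id, port_id, &ev, 1);
>>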
>> Event devices also support burst mode with a configurable dequeue
>> depth, i.e. each dequeue call returns multiple events, and each event
>> might be at a different stage of the pipeline.
>> A dequeue burst containing events that belong to different stages is
>> not only difficult to vectorize but also increases the scheduler
>> overhead and the application overhead of pipelining events further.
>> Using event vectors we see a performance gain of ~628%, as shown in [1].
>This is a very impressive performance boost. Thanks so much for putting
>this patchset together! Just curious, was any performance measurement
>done for existing applications (non-vector)?
>>
>> By introducing event vectorization, each event will be capable of
>> holding multiple uintptr_t of the same flow, thereby allowing
>> applications to vectorize their pipeline and reducing the complexity
>> of pipelining events across multiple stages. This also reduces the
>> complexity of handling enqueue and dequeue on an event device.
>>
>> Since event devices are transparent to the events they schedule, event
>> producers such as the eth_rx_adapter, crypto_adapter, etc. are
>> responsible for aggregating buffers of the same flow into a single
>> vector event.
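>>
>> Whether a given producer can create vector events is advertised through
>> its capability flags; a minimal sketch for the Rx adapter (the
>> capability flag is the one used in the configuration snippet below):
>>
>> 	uint32_t cap;
>>
>> 	rte_event_eth_rx_adapter_caps_get(event_id, eth_id, &cap);
>> 	if (cap & RTE_EVENT_ETH_RX_ADAPTER_CAP_EVENT_VECTOR)
>> 		enable_vector = true;	/* illustrative flag */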
>>
>> The series also breaks ABI in patch [8/8], which is targeted at the
>> v21.11 release.
>>
>> The dpdk-test-eventdev application has been updated with options to
>> test multiple vector sizes and timeouts.
>>
>> [1]
>> As for performance improvement: with an ARM Cortex-A72 equivalent
>> processor, a software event device (--vdev=event_sw0), a single worker
>> core, a single stage, and one service core used for the Rx adapter, Tx
>> adapter and scheduling.
>>
>> Without event vectorization:
>>     ./build/app/dpdk-test-eventdev -l 7-23 -s 0x700 --vdev="event_sw0" --
>>          --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
>>          --stlist=a --wlcores=20
>>     Port[0] using Rx adapter[0] configured
>>     Port[0] using Tx adapter[0] Configured
>>     4.728 mpps avg 4.728 mpps
>Is this number from before the patchset? If so, it would help to put a
>similar number with the patchset applied but vectorization not used.

I don't remember the exact clock frequency I was using when I ran
the above test, but with equal clocks:
1. Without the patchset applied:
	5.071 mpps
2. With the patchset applied, without enabling vectorization:
	5.123 mpps
3. With the patchset applied and vectorization enabled:
	vector_sz @ 256: 42.715 mpps
	vector_sz @ 512: 45.335 mpps

>>
>> With event vectorization:
>>     ./build/app/dpdk-test-eventdev -l 7-23 -s 0x700 --vdev="event_sw0" --
>>         --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
>>         --stlist=a --wlcores=20 --enable_vector --nb_eth_queues 1
>>         --vector_size 256
>>     Port[0] using Rx adapter[0] configured
>>     Port[0] using Tx adapter[0] Configured
>>     34.383 mpps avg 34.383 mpps
>>
>> Having a dedicated service core for each Rx queue and tweaking the
>> vector size and dequeue burst size would further improve performance,
>> e.g. as sketched below.
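>>
>> A variant of the above command with three service cores and a larger
>> vector could look like this (a sketch only; --vector_tmo_ns is the
>> timeout option added to dpdk-test-eventdev by this series, and the
>> core mask and timeout value here are illustrative):
>>
>>     ./build/app/dpdk-test-eventdev -l 7-23 -s 0x3800 --vdev="event_sw0" --
>>         --prod_type_ethdev --nb_pkts=0 --verbose 2 --test=pipeline_queue
>>         --stlist=a --wlcores=20 --enable_vector --nb_eth_queues 4
>>         --vector_size 512 --vector_tmo_ns 250000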
>>
>> API usage is shown below:
>>
>> Configuration:
>>
>> 	struct rte_event_eth_rx_adapter_event_vector_config vec_conf;
>>
>> 	vector_pool = rte_event_vector_pool_create("vector_pool",
>> 			nb_elem, 0, vector_size, socket_id);
>>
>> 	rte_event_eth_rx_adapter_create(id, event_id, &adptr_conf);
>> 	rte_event_eth_rx_adapter_queue_add(id, eth_id, -1, &queue_conf);
>> 	if (cap & RTE_EVENT_ETH_RX_ADAPTER_CAP_EVENT_VECTOR) {
>> 		vec_conf.vector_sz = vector_size;
>> 		vec_conf.vector_timeout_ns = vector_tmo_nsec;
>> 		vec_conf.vector_mp = vector_pool;
>> 		rte_event_eth_rx_adapter_queue_event_vector_config(id,
>> 				eth_id, -1, &vec_conf);
>> 	}
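>>
>> The requested size and timeout can be validated against the device's
>> limits beforehand; a minimal sketch using the limits query added in
>> this series (field names per the patch introducing it):
>>
>> 	struct rte_event_eth_rx_adapter_vector_limits limits;
>>
>> 	rte_event_eth_rx_adapter_vector_limits_get(event_id, eth_id,
>> 						    &limits);
>> 	if (vector_size < limits.min_sz || vector_size > limits.max_sz)
>> 		return -EINVAL;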
>>
>> Fastpath:
>>
>> 	num = rte_event_dequeue_burst(event_id, port_id, &ev, 1, 0);
>> 	if (!num)
>> 		continue;
>>
>> 	if (ev.event_type & RTE_EVENT_TYPE_VECTOR) {
>> 		switch (ev.event_type) {
>> 		case RTE_EVENT_TYPE_ETHDEV_VECTOR:
>> 		case RTE_EVENT_TYPE_ETH_RX_ADAPTER_VECTOR: {
>> 			struct rte_mbuf **mbufs;
>>
>> 			mbufs = ev.vector_ev->mbufs;
>> 			for (i = 0; i < ev.vector_ev->nb_elem; i++) {
>> 				/* Process mbufs[i]. */
>> 			}
>> 			break;
>> 		}
>> 		case ...
>> 		}
>> 	}
>> 	...
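>>
>> After the stages complete, the same vector event can be handed to the
>> Tx adapter, which transmits every mbuf in the vector; a minimal sketch
>> (it assumes the vector's Tx port/queue attributes were set and that
>> the device supports the Tx adapter enqueue API):
>>
>> 	ev.op = RTE_EVENT_OP_FORWARD;
>> 	rte_event_eth_tx_adapter_enqueue(event_id, port_id, &ev, 1, 0);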
>>
>> v5 Changes:
>> - Make `rte_event_vector_pool_create` non-inline to ease ABI
>>   stability. (Ray)
>> - Move `rte_event_eth_rx_adapter_queue_event_vector_config` and
>>   `rte_event_eth_rx_adapter_vector_limits_get` implementation to the
>>   patch where they are initially defined. (Ray)
>> - Multiple grammatical and style fixes. (Jerin)
>> - Add missing release notes. (Jerin)
>>
>> v4 Changes:
>> - Fix missing event vector structure in event structure. (Jay)
>>
>> v3 Changes:
>> - Fix unintended formatting changes.
>>
>> v2 Changes:
>> - Multiple grammatical and style fixes. (Jerin)
>> - Add parameter to define vector size as a power of 2. (Jerin)
>> - Redo patch series w/o breaking ABI till the last patch. (David)
>> - Add deprecation notice to announce ABI break in 21.11. (David)
>> - Add vector limits validation to app/test-eventdev.
>>
>> Pavan Nikhilesh (8):
>>   eventdev: introduce event vector capability
>>   eventdev: introduce event vector Rx capability
>>   eventdev: introduce event vector Tx capability
>>   eventdev: add Rx adapter event vector support
>>   eventdev: add Tx adapter event vector support
>>   app/eventdev: add event vector mode in pipeline test
>>   doc: announce event Rx adapter config changes
>>   eventdev: simplify Rx adapter event vector config
>>
>>  app/test-eventdev/evt_common.h                |   4 +
>>  app/test-eventdev/evt_options.c               |  52 +++
>>  app/test-eventdev/evt_options.h               |   4 +
>>  app/test-eventdev/test_pipeline_atq.c         | 310 +++++++++++++++--
>>  app/test-eventdev/test_pipeline_common.c      | 105 +++++-
>>  app/test-eventdev/test_pipeline_common.h      |  18 +
>>  app/test-eventdev/test_pipeline_queue.c      | 320 ++++++++++++++++--
>>  .../prog_guide/event_ethernet_rx_adapter.rst  |  38 +++
>>  .../prog_guide/event_ethernet_tx_adapter.rst  |  12 +
>>  doc/guides/prog_guide/eventdev.rst            |  36 +-
>>  doc/guides/rel_notes/deprecation.rst          |   9 +
>>  doc/guides/rel_notes/release_21_05.rst        |   8 +
>>  doc/guides/tools/testeventdev.rst             |  45 ++-
>>  lib/librte_eventdev/eventdev_pmd.h            |  31 +-
>>  .../rte_event_eth_rx_adapter.c                | 305 ++++++++++++++++-
>>  .../rte_event_eth_rx_adapter.h                |  78 +++++
>>  .../rte_event_eth_tx_adapter.c                |  66 +++-
>>  lib/librte_eventdev/rte_eventdev.c            |  53 ++-
>>  lib/librte_eventdev/rte_eventdev.h            | 113 ++++++-
>>  lib/librte_eventdev/version.map               |   4 +
>>  20 files changed, 1524 insertions(+), 87 deletions(-)
>>
>> --
>> 2.17.1
>
>Just a heads up: the v5 patchset doesn't apply cleanly on HEAD
>(5f0849c1155849dfdbf950c91c52cdf9cd301f59), although it applies cleanly
>on "app/eventdev: fix timeout accuracy"
>(c33d48387dc8ccf1b432820f6e0cd4992ab486df).

This patchset is currently rebased on the main branch; I will rebase it
on dpdk-next-event in the next version.



