[dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

Dumitrescu, Cristian cristian.dumitrescu at intel.com
Mon Feb 16 23:44:31 CET 2015


Hi Stephen,

Sorry, NACK.

1. Overflow issue
As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, the line of code below results in overflow:
	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / params->rate;
Therefore, there is most likely a significant accuracy loss, which might result in more packets allowed to go out than it should.

2. Integer division has a higher cost than floating point division
My understanding is we are considering a performance improvement by replacing the double precision floating point division in:
	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
with an integer division:
	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / port->cycles_per_byte;
I don't think this is going to have the claimed benefit, as acording to "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix C), the latency of the integer division instruction is significantly bigger than the latency of integer division:
	Instruction FDIV double precision: latency = 38-40 cycles
	Instruction IDIV: latency = 56 - 80 cycles

3. Alternative
I hear though your suggestion about replacing the floating point division with a more performant construction. One suggestion would be to replace it with an integer multiplication followed by a shift right, probably by using a uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need to prototype this code myself. Would you be OK to look into providing an alternative implementation?

Thanks,
Cristian


-----Original Message-----
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
Sent: Thursday, February 5, 2015 6:14 AM
To: dev at dpdk.org
Cc: Stephen Hemminger
Subject: [dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

From: Stephen Hemminger <shemming at brocade.com>

The old code was doing a floating point divide for each rte_dequeue()
which is very expensive. Change to using fixed point scaled math instead.
This improved performance from 5Gbit/sec to 10 Gbit/sec

Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
---
 lib/librte_sched/rte_sched.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 55fbc14..3023457 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -102,6 +102,9 @@
 
 #define RTE_SCHED_BMP_POS_INVALID             UINT32_MAX
 
+/* For cycles_per_byte calculation */
+#define RTE_SCHED_TIME_SHIFT		      20
+
 struct rte_sched_subport {
 	/* Token bucket (TB) */
 	uint64_t tb_time; /* time of last update */
@@ -239,7 +242,7 @@ struct rte_sched_port {
 	uint64_t time_cpu_cycles;     /* Current CPU time measured in CPU cyles */
 	uint64_t time_cpu_bytes;      /* Current CPU time measured in bytes */
 	uint64_t time;                /* Current NIC TX time measured in bytes */
-	double cycles_per_byte;       /* CPU cycles per byte */
+	uint32_t cycles_per_byte;       /* CPU cycles per byte (scaled) */
 
 	/* Scheduling loop detection */
 	uint32_t pipe_loop;
@@ -657,7 +660,9 @@ rte_sched_port_config(struct rte_sched_port_params *params)
 	port->time_cpu_cycles = rte_get_tsc_cycles();
 	port->time_cpu_bytes = 0;
 	port->time = 0;
-	port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) params->rate);
+
+	port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT)
+		/ params->rate;
 
 	/* Scheduling loop detection */
 	port->pipe_loop = RTE_SCHED_PIPE_INVALID;
@@ -2156,11 +2161,12 @@ rte_sched_port_time_resync(struct rte_sched_port *port)
 {
 	uint64_t cycles = rte_get_tsc_cycles();
 	uint64_t cycles_diff = cycles - port->time_cpu_cycles;
-	double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
+	uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT)
+		/ port->cycles_per_byte;
 
 	/* Advance port time */
 	port->time_cpu_cycles = cycles;
-	port->time_cpu_bytes += (uint64_t) bytes_diff;
+	port->time_cpu_bytes += bytes_diff;
 	if (port->time < port->time_cpu_bytes) {
 		port->time = port->time_cpu_bytes;
 	}
-- 
2.1.4

--------------------------------------------------------------
Intel Shannon Limited
Registered in Ireland
Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
Registered Number: 308263
Business address: Dromore House, East Park, Shannon, Co. Clare

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.




More information about the dev mailing list