No, I do not have any locks or critical sections in my code.
I added logging that prints the core id, source port, destination port, and queue id for each worker. Worker 0 runs on core 1 and runs macswap, which is very light; its throughput is about 4.5 Mpps. Worker 1 runs on core 2 and runs a load balancer, which is heavy; its throughput is also about 4.5 Mpps. This does not make sense at all.

***thread core_id=1, src_port=0, dst_port=0, rx_queue_id=0, tx_queue_id=0

***thread core_id=2, src_port=0, dst_port=0, rx_queue_id=1, tx_queue_id=1

Core 0: Running stat thread

worker_id=0, core_id=1, pkt_rate=4418972

worker_id=1, core_id=2, pkt_rate=4419808

worker_id=0, core_id=1, pkt_rate=4631684

worker_id=1, core_id=2, pkt_rate=4632928
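
To put a number on how close the two rates are, here is a small sketch in Python that parses the pkt_rate lines above and computes the relative gap between the workers for each stats sample. The function names (parse_rates, relative_gaps) are made up for illustration, and pairing samples by position assumes the stat thread prints worker 0 and then worker 1 once per interval, as it appears to in the log.

```python
import re
from collections import defaultdict

# The four stat lines copied from the log output above.
LOG = """\
worker_id=0, core_id=1, pkt_rate=4418972
worker_id=1, core_id=2, pkt_rate=4419808
worker_id=0, core_id=1, pkt_rate=4631684
worker_id=1, core_id=2, pkt_rate=4632928
"""

LINE_RE = re.compile(r"worker_id=(\d+), core_id=(\d+), pkt_rate=(\d+)")


def parse_rates(text):
    """Return {worker_id: [pkt_rate, ...]} in the order the lines appear."""
    rates = defaultdict(list)
    for match in LINE_RE.finditer(text):
        worker_id, _core_id, pkt_rate = map(int, match.groups())
        rates[worker_id].append(pkt_rate)
    return dict(rates)


def relative_gaps(rates, a=0, b=1):
    """Relative difference between workers a and b for each sample.

    Assumption: the i-th sample of worker a and the i-th sample of
    worker b come from the same stats interval.
    """
    return [abs(x - y) / max(x, y) for x, y in zip(rates[a], rates[b])]


if __name__ == "__main__":
    rates = parse_rates(LOG)
    for i, gap in enumerate(relative_gaps(rates)):
        print(f"sample {i}: worker0={rates[0][i]} "
              f"worker1={rates[1][i]} gap={gap:.4%}")
```

In both samples the two workers differ by well under 0.1%, so the rates are effectively identical even though the per-packet work is very different.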

>> I have two threads that process packets in different ways. Thread A (core 0) is very heavy and thread B (core 1) is very light. If I run each of them alone, their throughput differs hugely with small packets. Thread A polls queue 0 of port 0 and thread B polls queue 1 of port 0. If I run them at the same time, why do thread A and thread B get the same throughput? This makes me very confused. Does anyone have the same experience or know some possible reasons?
>Can you give some examples with numbers? My first thought is that
>maybe the two threads are contending for the same physical core. You
>don't have any locking/critical sections, do you?
