[dpdk-dev] [PATCH v3 15/29] crypto/qat: use eal I/O device memory read/write API

Ferruh Yigit ferruh.yigit at intel.com
Fri Jan 13 19:20:46 CET 2017


On 1/13/2017 4:21 PM, Jerin Jacob wrote:
> On Fri, Jan 13, 2017 at 03:50:59PM +0000, Ferruh Yigit wrote:
>> On 1/13/2017 2:57 PM, Jerin Jacob wrote:
>>> On Fri, Jan 13, 2017 at 11:32:29AM +0000, Ferruh Yigit wrote:
>>>> On 1/13/2017 8:17 AM, Jerin Jacob wrote:
>>>>> On Thu, Jan 12, 2017 at 07:09:22PM +0000, Ferruh Yigit wrote:
>>>>>> Hi Jerin,
>>>>>>
>>>>>> On 1/12/2017 9:17 AM, Jerin Jacob wrote:
>>>>>> <...>
>>>>>>
>>>>>>> +#include <rte_io.h>
>>>>>>> +
>>>>>>>  /* CSR write macro */
>>>>>>> -#define ADF_CSR_WR(csrAddr, csrOffset, val) \
>>>>>>> -	(void)((*((volatile uint32_t *)(((uint8_t *)csrAddr) + csrOffset)) \
>>>>>>> -			= (val)))
>>>>>>> +#define ADF_CSR_WR(csrAddr, csrOffset, val)		\
>>>>>>> +	rte_write32(val, (((uint8_t *)csrAddr) + csrOffset))
>>>>>>
>>>>>> For IA, this update introduces an extra compiler barrier (rte_io_wmb()),
>>>>>> which is indeed not a must, is this correct?
>>>>>
>>>>> AFAIK, Compiler barrier is required for IA. I am not an IA expert, if
>>>>> someone thinks it needs to changed then I can fix it in following commit
>>>>> in this patch series by making rte_io_wmb() and rte_io_rmb() as empty.
>>>>>
>>>>> Let me know.
>>>>>
>>>>> AFAIK, Linux kernel code has a barrier in readl/writel for IA.
>>>>>
>>>>> Typically we don't use any non relaxed versions in fast path.In fast
>>>>> typically all the drivers has explicit write barrier for doorbell write
>>>>> and followed by a relaxed version of write. IMO, In any event, it won't
>>>>> generate performance regression.
>>>>>
>>>>> [dpdk-master] $ git show
>>>>> 70c343bdc8c33a51a9db23cd58122bdfc120a58f
>>>>> commit 70c343bdc8c33a51a9db23cd58122bdfc120a58f
>>>>> Author: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>>>>> Date:   Mon Dec 5 06:36:49 2016 +0530
>>>>>
>>>>>     eal/x86: define I/O device memory barriers for IA
>>>>>
>>>>>     The patch does not provide any functional change for IA.
>>>>>     I/O barriers are mapped to existing smp barriers.
>>>>>
>>>>>     CC: Bruce Richardson <bruce.richardson at intel.com>
>>>>>     CC: Konstantin Ananyev <konstantin.ananyev at intel.com>
>>>>>     Signed-off-by: Jerin Jacob <jerin.jacob at caviumnetworks.com>
>>>>>
>>>>> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>>>>> b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>>>>> index 00b1cdf..4eac666 100644
>>>>> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>>>>> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
>>>>> @@ -61,6 +61,12 @@ extern "C" {
>>>>>  
>>>>>  #define rte_smp_rmb() rte_compiler_barrier()
>>>>>  
>>>>> +#define rte_io_mb() rte_mb()
>>>>> +
>>>>> +#define rte_io_wmb() rte_compiler_barrier()
>>>>> +
>>>>> +#define rte_io_rmb() rte_compiler_barrier()
>>>>> +
>>>>>  /*------------------------- 16 bit atomic operations
>>>>>  * -------------------------*/
>>>>>  
>>>>>  #ifndef RTE_FORCE_INTRINSICS
>>>>>
>>>>>>
>>>>>> If so, does it make sense to override these functions for x86, and make
>>>>>> rte_writeX = rte_writeX_relaxed
>>>>>> rte_readX = rte_readX_relaxed
>>>>>>
>>>>>>>  
>>>>>>>  /* CSR read macro */
>>>>>>> -#define ADF_CSR_RD(csrAddr, csrOffset) \
>>>>>>> -	(*((volatile uint32_t *)(((uint8_t *)csrAddr) + csrOffset)))
>>>>>>> +#define ADF_CSR_RD(csrAddr, csrOffset)			\
>>>>>>> +	rte_read32((((uint8_t *)csrAddr) + csrOffset))
>>>>>>
>>>>>> This patchset both introduces new rte_readX/rte_writeX functions, also
>>>>>> applies them into drivers.
>>>>>>
>>>>>> While applying them, it changes the behavior.
>>>>>> Like above code was doing a read, but after update it does read and
>>>>>> read_memory_barrier.
>>>>>>
>>>>>> What do you think this patchset updates usage in a manner that keeps
>>>>>> behavior exact same. Like using rte_read32_relaxed for this case.
>>>>>> And doing architecture related updates in a different patchset?
>>>>>
>>>>> Need to use rte_read32 at this commit otherwise it will break for ARM.
>>>>> That's was all point for this patchset.
>>>>
>>>> Why it breaks the ARM, is it because rte_*mb() updated for ARM in this
>>>> patchset (patch 7/29) ?
>>>
>>> Yes.
>>>
>>>
>>>>
>>>> I believe it is good to make these modifications in two phase:
>>>
>>> It is in two phases only. First introduced the API with implementation and
>>> enabled in each driver. Why did you think other-way around it is better?
>>
>> For two things:
>> 1- If something goes wrong, find the source of problem easier.
> 
> How?
> 
> Are you suggesting like this below,
> 0) Introduce rte_io_?mb()
> 1) Introduce readxx_relaxed and writexx_relaxed
> 2) Change all the drivers with readxx_relaxed and writexx_relaxed(15
> change sets) to keep the same behavior
> 3) Introduce readxx and writexx
> 4) revert step 2 changes and make driver based on readxx and
> writexx(again 15 change sets)

Almost yes, instead:

0) Introduce generic rte_io_?mb()
1) Introduce generic readxx_relaxed and writexx_relaxed
2) Introduce generic readxx and writexx
3) Change all the drivers with readxx_relaxed and writexx_relaxed(15
change sets) to keep the same behavior

At this point all same functionality, with new functions.

4) Introduce arch specific rte_io_?mb()
5) Introduce arch specific readxx_relaxed and writexx_relaxed
6) Introduce arch specific readxx and writexx
7) _Update_ step 3 changes and replace some readxx_relaxed and
writexx_relaxed with readxx and writexx in drivers (I expect this to be
less change comparing step 3)

Steps 0,1,2 and 4,5,6 already separated in your patchset.

> 
> 
> Instead of(existing one)
> 0) Introduce rte_io_?mb()
> 1) Introduce readxx_relaxed and writexx_relaxed
> 2) Introduce readxx and writexx
> 3) Change all the drivers with readxx_[relaxed] and writexx_[relaxed]
> 
> Proposed scheme makes driver authors to review two check-ins.

Yes, this brings some extra work, that is why I hesitate.

> And git bisect fail on fourth case of instead of thrird case with
> existing one.

Perhaps you are right here.

> 
> Not sure what we soloving here?

Testing first patchset will show if "readxx and writexx" conversion done
correctly or not, and second patchset will show what has been done arch
specific and it's effect.

But not really sure if this worth the effort, so,
If there is no any other objection, let's continue with existing one.

> 
> Thomas,
> Any thoughts?
> 
> 
>> 2- Make architectural changes obvious, right now it is a little hard to
>> see, and this again for item 1.
>>
>> But I also would like to hear more comments before you change/try anything.
>>
>>> I can rework and test if there is any value addition. If you concerned
>>> about git bisect ability then I don't think we are loosing that in this
>>> model.
>>>
>>> Thoughts?
>>>
>>>> - First replace old usage with rte_readX/rte_writeX while keeping exact
>>>> same behavior
>>>>
>>>> - Second, do architecture specific changes. Both in eal and drivers
>>>> level if required.
>>>>
>>>> Thanks,
>>>> ferruh
>>>>
>>>>> For performance regression, we can always verify by taking delta
>>>>> between this changeset and the previous changeset. If you think, I need
>>>>> to make rte_io_wmb()/rte_io_rmb() as empty for IA then I could do that
>>>>> as well.
>>>>>
>>>>>
>>>>>>
>>>>>> This both makes easy to see architecture specific updates, and makes
>>>>>> easy to trace any possible performance issues by this patchset.
>>>>>>
>>>>>>>  
>>>>>>>  #define ADF_BANK_INT_SRC_SEL_MASK_0 0x4444444CUL
>>>>>>>  #define ADF_BANK_INT_SRC_SEL_MASK_X 0x44444444UL
>>>>>>>
>>>>>>
>>>>
>>



More information about the dev mailing list