[dpdk-dev] [PATCH 3/4] eal/arm: Enable lpm/table/pipeline libs

Jianbo Liu jianbo.liu at linaro.org
Wed Dec 2 14:13:51 CET 2015


On 2 December 2015 at 18:39, Jerin Jacob <jerin.jacob at caviumnetworks.com> wrote:
> On Wed, Dec 02, 2015 at 05:49:41PM +0800, Jianbo Liu wrote:
>> On 2 December 2015 at 16:03, Jerin Jacob <jerin.jacob at caviumnetworks.com> wrote:
>> > On Wed, Dec 02, 2015 at 02:54:52PM +0800, Jianbo Liu wrote:
>> >> On 2 December 2015 at 00:41, Jerin Jacob <jerin.jacob at caviumnetworks.com> wrote:
>> >> > On Tue, Dec 01, 2015 at 01:41:15PM -0500, Jianbo Liu wrote:
>> >> >> Adds ARM NEON support for lpm.
>> >> >> And enables table/pipeline libraries which depend on lpm.
>> >> >
>> >> > I already sent the patch on the same yesterday.
>> >> > We can converge the patches after the discussion.
>> >> > Please check "[dpdk-dev] [PATCH 0/3] add lpm support for NEON" on ml
>> >> >
>> >> Yes, I have read your patch. But there are many differences, so I sent
>> >> mine for your reviewing :)
>> >>
>> >> >
>> >> >>
>> >> >> Signed-off-by: Jianbo Liu <jianbo.liu at linaro.org>
>> >> >> ---
>> >> >>  config/defconfig_arm-armv7a-linuxapp-gcc          |  3 -
>> >> >>  config/defconfig_arm64-armv8a-linuxapp-gcc        |  3 -
>> >> >>  lib/librte_eal/common/include/arch/arm/rte_vect.h | 28 ++++++++++
>> >> >>  lib/librte_lpm/rte_lpm.h                          | 68 ++++++++++++++++-------
>> >> >>  4 files changed, 77 insertions(+), 25 deletions(-)
>> >> >>
>> >> >> diff --git a/config/defconfig_arm-armv7a-linuxapp-gcc b/config/defconfig_arm-armv7a-linuxapp-gcc
>> >> >> index cbebd64..efffa1f 100644
>> >> >> --- a/config/defconfig_arm-armv7a-linuxapp-gcc
>> >> >> +++ b/config/defconfig_arm-armv7a-linuxapp-gcc
>> >> >> @@ -53,9 +53,6 @@ CONFIG_RTE_LIBRTE_KNI=n
>> >> >>  CONFIG_RTE_EAL_IGB_UIO=n
>> >> >>
>> >> >>  # fails to compile on ARM
>> >> >> -CONFIG_RTE_LIBRTE_LPM=n
>> >> >> -CONFIG_RTE_LIBRTE_TABLE=n
>> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n
>> >> >>  CONFIG_RTE_SCHED_VECTOR=n
>> >> >>
>> >> >>  # cannot use those on ARM
>> >> >> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc
>> >> >> index 504f3ed..57f7941 100644
>> >> >> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc
>> >> >> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc
>> >> >> @@ -51,7 +51,4 @@ CONFIG_RTE_LIBRTE_IVSHMEM=n
>> >> >>  CONFIG_RTE_LIBRTE_FM10K_PMD=n
>> >> >>  CONFIG_RTE_LIBRTE_I40E_PMD=n
>> >> >>
>> >> >> -CONFIG_RTE_LIBRTE_LPM=n
>> >> >> -CONFIG_RTE_LIBRTE_TABLE=n
>> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n
>> >> >>  CONFIG_RTE_SCHED_VECTOR=n
>> >> >> diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h b/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> >> >> index a33c054..7437711 100644
>> >> >> --- a/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> >> >> +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h
>> >> >> @@ -41,6 +41,8 @@ extern "C" {
>> >> >>
>> >> >>  typedef int32x4_t xmm_t;
>> >> >>
>> >> >> +typedef int32x4_t __m128i;
>> >> >> +
>> >> >>  #define      XMM_SIZE        (sizeof(xmm_t))
>> >> >>  #define      XMM_MASK        (XMM_SIZE - 1)
>> >> >>
>> >> >> @@ -53,6 +55,32 @@ typedef union rte_xmm {
>> >> >>       double   pd[XMM_SIZE / sizeof(double)];
>> >> >>  } __attribute__((aligned(16))) rte_xmm_t;
>> >> >>
>> >> >> +static __inline __m128i
>> >> >> +_mm_set_epi32(int i3, int i2, int i1, int i0)
>> >> >> +{
>> >> >> +     int32_t r[4] = {i0, i1, i2, i3};
>> >> >> +
>> >> >> +     return vld1q_s32(r);
>> >> >> +}
>> >> >> +
>> >> >> +static __inline __m128i
>> >> >> +_mm_loadu_si128(__m128i *p)
>> >> >> +{
>> >> >> +     return vld1q_s32((int32_t *)p);
>> >> >> +}
>> >> >> +
>> >> >> +static __inline __m128i
>> >> >> +_mm_set1_epi32(int i)
>> >> >> +{
>> >> >> +     return vdupq_n_s32(i);
>> >> >> +}
>> >> >> +
>> >> >> +static __inline __m128i
>> >> >> +_mm_and_si128(__m128i a, __m128i b)
>> >> >> +{
>> >> >> +     return vandq_s32(a, b);
>> >> >> +}
>> >> >> +
>> >
>> > IMO, it's not always good to emulate GCC defined intrinsics of
>> > other architecture. What if a legacy DPDK application has such mappings
>> > then BOOM, multiple definition, which one is correct? which one
>> > to comment it out? Integration pain starts for DPDK library consumer:-(
>> >
>> They can include rte_vect.h in build/include directly, which is linked correctly
>> to the one for that ARCH, so there is no need to worry about.
>
> I think you missed the point,I was trying to say that
> legacy DPDK application and third party stacks uses SSE2NEON kind of
> libraries
> for quick integration, for example, something like this
> https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h
>
> AND they include "rte_lpm.h"(it internally includes rte_vect.h)
> that lead to multiple definition and its not good.
>
But you will have similar issue since "typedef int32x4_t __m128i"
appears in both your patch and this header file.

>>
>>
>> >> >
>> >> > IMO, it makes sense to not emulate the SSE intrinsics with NEON
>> >> > Let's create the rte_vect_* as required. look at the existing patch.
>> >> >
>> >> I thought of creating a layer of SIMD over all the platforms before.
>> >> But can't you see it make things complicated, considering there are
>> >> only few simple intrinsic to implement?
>> >
>> > Not true, There were, a lot of SSE intrinsics needs be to emulated for ACL NEON
>> > implementation if I were to take this approach and emulation comes with
>> > the cost.
>> >
>> No, I will not re-implement all the intrinsic like that .
>> I only do with the simple intrinsic, such as load/store, as you said below.
>
> but you forced to add _mm_and_si128 also to the list and emulated
> _mm_and_si128 intrinsic. Am just saying no emulation.
>
I means simple intrinsic, not load/store only.
Depends on how you define emulation. Actually, these simple intrisinic
could be only one NEON instruction, and will not bring cost.

>
>>
>> > So my take is,
>> > lets the each architecture implementation for specific SIMD version of DPDK
>> > API in the library should have the freedom to implement the API in
>> > NATIVE.
>> >
>> > And let's create only rte_vect_* abstraction only for using
>> > that API/library. Which boils down to have very minimal rte_vect_*
>> > abstraction to load, store, set not beyond that.
>> >
>> > This makes clear "contract" between DPDK library and the applications.
>> > and make easy for remaning new architecture  porting effort in DPDK.
>> >
>> Agree.
>> But I reuse existing intrinsic names, and you recreate new ones.
>> And I try to do as few changes as possible, and try to avoid any
>> mistaken which may cause code un-compiled.
>
> Its trival to verify. Just compile it
>
>> I think it's design level question, we need to hear what others talk about it.
>>
>> > Imagine how your proposed function will look like if new architecture
>> > wants to implement "optimized" version of rte_lpm_lookupx4
>> >
>> There is no optimization for this (simple) rte_lpm_lookupx4, otherwise
>> you have done that in your patch.
>> If there is for other new platform, defintely they should do like
>> yours, as you did for NEON ACL.
>>
>> >
>> >> If do so, we also need to explain to others how to use these interfaces.
>> >> Besides, this patch did the smallest changes to the original code, and
>> >> more likely to be accepted by others.
>> >
>> > other patch makes no changes to IA version of rte_lpm_lookupx4.I thought
>> > that make reviewer easy to review the changes in architecture
>> > perspective.
>> >
>> As I know, they don't enable LPM for PPC, and ARM is the first one to
>> touch this issue.
>>
>> >>
>> >> >
>> >> >>  #ifdef RTE_ARCH_ARM
>> >> >>  /* NEON intrinsic vqtbl1q_u8() is not supported in ARMv7-A(AArch32) */
>> >> >>  static __inline uint8x16_t
>> >> >> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
>> >> >> index c299ce2..c76c07d 100644
>> >> >> --- a/lib/librte_lpm/rte_lpm.h
>> >> >> +++ b/lib/librte_lpm/rte_lpm.h
>> >> >> @@ -361,6 +361,47 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips,
>> >> >>  /* Mask four results. */
>> >> >>  #define       RTE_LPM_MASKX4_RES     UINT64_C(0x00ff00ff00ff00ff)
>> >> >>
>> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
>> >> >
>> >> > Separate out arm implementation to the different header file.
>> >> > Too many ifdef looks odd in the header file and difficult to manage.
>> >> >
>> >> But there are many ifdefs already.
>> >> And It seems unreasonable to add a new file only for one small function.
>> >>
>> >
>> > small or big, its matter of each architecture to have
>> > the freedom for the optimized version for the implementation.
>> >
>> > What if  other architecture demands to write this function in assembly
>> > or restructure it for performance improvement?
>> >
>> If there is such demands, should do like that.
>> But I don't see any restructure in your patch, and you still follow
>> the logic as x86, is it worth adding a new file?
>
> SIMD Logic on getting  4 indexes for tbl24[] is different.
>
> /* get 4 indexes for tbl24[]. */
> i24 = _mm_srli_epi32(ip, CHAR_BIT);
>
> /* extract values from tbl24[] */
> idx = _mm_cvtsi128_si64(i24);
> i24 = _mm_srli_si128(i24, sizeof(uint64_t));
>
> tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>
> idx = _mm_cvtsi128_si64(i24);
>
> tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>
> VS
>
> /* extract values from tbl24[] */
> idx = vgetq_lane_u64((uint64x2_t)i24, 0);
>
> tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>
> idx = vgetq_lane_u64((uint64x2_t)i24, 1);
>
> tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
> tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>
It's only the optimazation of part of code in that function. I did the
similar in my patch.
But, looking from the whole, this function is not restructured, and
the logic is the same as x86.

>>
>> >
>> >> >
>> >> >> +static inline void
>> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t tbl[4])
>> >> >> +{
>> >> >> +     uint32x4_t i24;
>> >> >> +     uint32_t idx[4];
>> >> >> +
>> >> >> +     /* get 4 indexes for tbl24[]. */
>> >> >> +     i24 = vshrq_n_u32(vreinterpretq_u32_s32(ip), CHAR_BIT);
>> >> >> +     vst1q_u32(idx, i24);
>> >> >> +
>> >> >> +     /* extract values from tbl24[] */
>> >> >> +     tbl[0] = *(const uint16_t *)&lpm->tbl24[idx[0]];
>> >> >> +     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx[1]];
>> >> >> +     tbl[2] = *(const uint16_t *)&lpm->tbl24[idx[2]];
>> >> >> +     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx[3]];
>> >> >> +}
>> >> >
>> >> > Nice. There is an improvement in this portion code wrt my patch. This is
>> >> > a candidate for convergence.
>> >> >
>> >> >
>> >> >> +#else
>> >> >> +static inline void
>> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, __m128i ip, uint16_t tbl[4])
>> >> >> +{
>> >> >> +     __m128i i24;
>> >> >> +     uint64_t idx;
>> >> >> +
>> >> >> +     /* get 4 indexes for tbl24[]. */
>> >> >> +     i24 = _mm_srli_epi32(ip, CHAR_BIT);
>> >> >> +
>> >> >> +     /* extract values from tbl24[] */
>> >> >> +     idx = _mm_cvtsi128_si64(i24);
>> >> >> +     i24 = _mm_srli_si128(i24, sizeof(uint64_t));
>> >> >> +
>> >> >> +     tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
>> >> >> +     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>> >> >> +
>> >> >> +     idx = _mm_cvtsi128_si64(i24);
>> >> >> +
>> >> >> +     tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
>> >> >> +     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>> >> >> +}
>> >> >> +#endif
>> >> >> +
>> >> >>  /**
>> >> >>   * Lookup four IP addresses in an LPM table.
>> >> >>   *
>> >> >> @@ -381,17 +422,19 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips,
>> >> >>   *   if lookup would fail.
>> >> >>   */
>> >> >>  static inline void
>> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64)
>> >> >> +rte_lpm_lookupx4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t hop[4],
>> >> >> +     uint16_t defv)
>> >> >
>> >> > This would call for change in the change the ABI,
>> >> > IMO, __m128i can be used to represent 128bit vector to avoid ABI chang
>> >> >
>> >> This redefine rte_lpm_lookupx4 is unncessary, I will remove it, so no
>> >> ABI change.
>> >> And there only one ifdef for ARM platforms left.
>> >>
>> >> >
>> >> >> +#else
>> >> > separate out arm implementation to the different header file. Too many
>> >> > ifdef looks odd in the header file.
>> >> >
>> >> > Could you  rebase your patch based on existing patch and send the
>> >> > improvement portion as separate patch or I can send update patch with
>> >> > your improvements and with your signoff.
>> >> >
>> >> >
>> >> >>  rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4],
>> >> >>       uint16_t defv)
>> >> >> +#endif
>> >> >>  {
>> >> >> -     __m128i i24;
>> >> >>       rte_xmm_t i8;
>> >> >>       uint16_t tbl[4];
>> >> >> -     uint64_t idx, pt;
>> >> >> -
>> >> >> -     const __m128i mask8 =
>> >> >> -             _mm_set_epi32(UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX);
>> >> >> +     uint64_t pt;
>> >> >>
>> >> >> +     const __m128i mask8 = _mm_set1_epi32(UINT8_MAX);
>> >> >>       /*
>> >> >>        * RTE_LPM_VALID_EXT_ENTRY_BITMASK for 4 LPM entries
>> >> >>        * as one 64-bit value (0x0300030003000300).
>> >> >> @@ -412,20 +455,7 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4],
>> >> >>               (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 32 |
>> >> >>               (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 48);
>> >> >>
>> >> >> -     /* get 4 indexes for tbl24[]. */
>> >> >> -     i24 = _mm_srli_epi32(ip, CHAR_BIT);
>> >> >> -
>> >> >> -     /* extract values from tbl24[] */
>> >> >> -     idx = _mm_cvtsi128_si64(i24);
>> >> >> -     i24 = _mm_srli_si128(i24, sizeof(uint64_t));
>> >> >> -
>> >> >> -     tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
>> >> >> -     tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>> >> >> -
>> >> >> -     idx = _mm_cvtsi128_si64(i24);
>> >> >> -
>> >> >> -     tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx];
>> >> >> -     tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
>> >> >> +     rte_lpm_tbl24_val4(lpm, ip, tbl);
>> >> >>
>> >> >>       /* get 4 indexes for tbl8[]. */
>> >> >>       i8.x = _mm_and_si128(ip, mask8);
>> >> >> --
>> >> >> 1.8.3.1
>> >> >>


More information about the dev mailing list