[dpdk-dev] [PATCH v2 1/2] Add RIB library

Vladimir Medvedkin medvedkinv at gmail.com
Sun Mar 25 20:17:20 CEST 2018


Hi,

2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson at intel.com>:

> On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > RIB is an alternative to current LPM library.
> > It solves the following problems
> >  - Increases the speed of control plane operations against lpm such as
> >    adding/deleting routes
> >  - Adds abstraction from dataplane algorithms, so it is possible to add
> >    different ip route lookup algorythms such as DXR/poptrie/lpc-trie/etc
> >    in addition to current dir24_8
> >  - It is possible to keep user defined application specific additional
> >    information in struct rte_rib_node which represents route entry.
> >    It can be next hop/set of next hops (i.e. active and feasible),
> >    pointers to link rte_rib_node based on some criteria (i.e. next_hop),
> >    plenty of additional control plane information.
> >  - For dir24_8 implementation it is possible to remove
> rte_lpm_tbl_entry.depth
> >    field that helps to save 6 bits.
> >  - Also new dir24_8 implementation supports different next_hop sizes
> >    (1/2/4/8 bytes per next hop)
> >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate ternary
> operator.
> >    Instead it returns special default value if there is no route.
> >
> > Signed-off-by: Medvedkin Vladimir <medvedkinv at gmail.com>
> > ---
> >  config/common_base                 |   6 +
> >  doc/api/doxy-api.conf              |   1 +
> >  lib/Makefile                       |   2 +
> >  lib/librte_rib/Makefile            |  22 ++
> >  lib/librte_rib/rte_dir24_8.c       | 482 ++++++++++++++++++++++++++++++
> +++
> >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> >  lib/librte_rib/rte_rib.c           | 526 ++++++++++++++++++++++++++++++
> +++++++
> >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> >  lib/librte_rib/rte_rib_version.map |  18 ++
> >  mk/rte.app.mk                      |   1 +
> >  10 files changed, 1496 insertions(+)
> >  create mode 100644 lib/librte_rib/Makefile
> >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> >  create mode 100644 lib/librte_rib/rte_rib.c
> >  create mode 100644 lib/librte_rib/rte_rib.h
> >  create mode 100644 lib/librte_rib/rte_rib_version.map
> >
>
> First pass review comments. For now just reviewed the main public header
> file rte_rib.h. Later reviews will cover the other files as best I can.
>
> /Bruce
>
> <snip>
> > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > new file mode 100644
> > index 0000000..6eac8fb
> > --- /dev/null
> > +++ b/lib/librte_rib/rte_rib.h
> > @@ -0,0 +1,322 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv at gmail.com>
> > + */
> > +
> > +#ifndef _RTE_RIB_H_
> > +#define _RTE_RIB_H_
> > +
> > +/**
> > + * @file
> > + * Compressed trie implementation for Longest Prefix Match
> > + */
> > +
> > +/** @internal Macro to enable/disable run-time checks. */
> > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > +     if (cond)                                       \
> > +             return retval;                          \
> > +} while (0)
> > +#else
> > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > +#endif
>
> use RTE_ASSERT?
>
it was done just like it was done in the LPM lib. But if you think it
should be RTE_ASSERT so be it.


>
> > +
> > +#define RTE_RIB_VALID_NODE   1
>
> should there be an INVALID_NODE macro?
>
No


>
> > +#define RTE_RIB_GET_NXT_ALL  0
> > +#define RTE_RIB_GET_NXT_COVER        1
> > +
> > +#define RTE_RIB_INVALID_ROUTE        0
> > +#define RTE_RIB_VALID_ROUTE  1
> > +
> > +/** Max number of characters in RIB name. */
> > +#define RTE_RIB_NAMESIZE     64
> > +
> > +/** Maximum depth value possible for IPv4 RIB. */
> > +#define RTE_RIB_MAXDEPTH     32
>
> I think we should have IPv4 in the name here. Will it not be extended to
> support IPv6 in future?
>
I think there should be a separate implementation of the library for ipv6


>
> > +
> > +/**
> > + * Macro to check if prefix1 {key1/depth1}
> > + * is covered by prefix2 {key2/depth2}
> > + */
> > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
>      \
> > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) == 0)\
> > +             && (depth1 > depth2))
> Neat check!
>
> Any particular reason for using UINT64_MAX here rather than UINT32_MAX?

in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
UINT32_MAX because shift count will be masked to 5 bits.

I think you can avoid the casting and have a slightly shorter mask by
> changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> "~(UINT32_MAX >> depth2)"
> I'd also suggest for readability putting the second check first, and,
> for maintainability, using an inline function rather than a macro.
>
 Agree, it looks clearer


> > +
> > +/** @internal Macro to get next node in tree*/
> > +#define RTE_RIB_GET_NXT_NODE(node, key)
>       \
> > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > +/** @internal Macro to check if node is right child*/
> > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
>
> Again, consider inline fns rather than macros.
>
Ok

For the latter macro, rather than doing additional pointer derefs to
> parent, can you also get if it's a right node by using:
> "(node->key & (1 << (32 - node->depth)))"?
>
No, it is not possible. Decision whether node be left or right is made
using parent and child common depth.
Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common depth
will be /8 and 10.128.0.0/24 will be right child.


> > +
> > +
> > +struct rte_rib_node {
> > +     struct rte_rib_node *left;
> > +     struct rte_rib_node *right;
> > +     struct rte_rib_node *parent;
> > +     uint32_t        key;
> > +     uint8_t         depth;
> > +     uint8_t         flag;
> > +     uint64_t        nh;
> > +     uint64_t        ext[0];
> > +};
> > +
> > +struct rte_rib;
> > +
> > +/** Type of FIB struct*/
> > +enum rte_rib_type {
> > +     RTE_RIB_DIR24_8_1B,
> > +     RTE_RIB_DIR24_8_2B,
> > +     RTE_RIB_DIR24_8_4B,
> > +     RTE_RIB_DIR24_8_8B,
> > +     RTE_RIB_TYPE_MAX
> > +};
>
> If the plan is to support multiple underlying fib types and algorithms
> under the rib library, would it not be better to separate out the
> algorithm part from the data storage part? So have the type just be
> DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
>
Yes, we were talk about it in IRC, agree. Now I pass next hop size in
union rte_rib_fib_conf inside rte_rib_conf


>
> > +
> > +enum rte_rib_op {
> > +     RTE_RIB_ADD,
> > +     RTE_RIB_DEL
> > +};
> > +
> > +/** RIB nodes allocation type */
> > +enum rte_rib_alloc_type {
> > +     RTE_RIB_MALLOC,
> > +     RTE_RIB_MEMPOOL,
> > +     RTE_RIB_ALLOC_MAX
> > +};
>
> Not sure you need this any more. Malloc allocations and mempool
> allocations are now pretty much the same thing.
>
Actually I think to remove malloc. On performance tests with
adding/deleting huge amount of routes malloc is slower. Maybe because of
fragmentation.
What do you think?


> > +
> > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t key,
> > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
>
> Do you anticipate more ops in future than just add and delete? If not,
> why not just split this function into two and drop the op struct.
>
It is difficult question. I'm not ready to make decision at the moment.


>
> > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t *ips,
> > +     uint64_t *next_hops, const unsigned n);
> > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct rte_rib
> *rib);
> > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > +     struct rte_rib_node *node);
> > +
> > +struct rte_rib {
> > +     char name[RTE_RIB_NAMESIZE];
> > +     /*pointer to rib trie*/
> > +     struct rte_rib_node     *trie;
> > +     /*pointer to dataplane struct*/
> > +     void    *fib;
> > +     /*prefix modification*/
> > +     rte_rib_modify_fn_t     modify;
> > +     /* Bulk lookup fn*/
> > +     rte_rib_tree_lookup_fn_t        lookup;
> > +     /*alloc trie element*/
> > +     rte_rib_alloc_node_fn_t alloc_node;
> > +     /*free trie element*/
> > +     rte_rib_free_node_fn_t  free_node;
> > +     struct rte_mempool      *node_pool;
> > +     uint32_t                cur_nodes;
> > +     uint32_t                cur_routes;
> > +     int                     max_nodes;
> > +     int                     node_sz;
> > +     enum rte_rib_type       type;
> > +     enum rte_rib_alloc_type alloc_type;
> > +};
> > +
> > +/** RIB configuration structure */
> > +struct rte_rib_conf {
> > +     enum rte_rib_type       type;
> > +     enum rte_rib_alloc_type alloc_type;
> > +     int     max_nodes;
> > +     size_t  node_sz;
> > +     uint64_t def_nh;
> > +};
> > +
> > +/**
> > + * Lookup an IP into the RIB structure
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  IP to be looked up in the RIB
> > + * @return
> > + *  pointer to struct rte_rib_node on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > +
> > +/**
> > + * Lookup less specific route into the RIB structure
> > + *
> > + * @param ent
> > + *  Pointer to struct rte_rib_node that represents target route
> > + * @return
> > + *  pointer to struct rte_rib_node that represents
> > + *  less specific route on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > +
> > +/**
> > + * Lookup prefix into the RIB structure
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be looked up in the RIB
> > + * @param depth
> > + *  prefix length
> > + * @return
> > + *  pointer to struct rte_rib_node on success,
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
>
> Can you explain the difference between this and regular lookup, and how
> they would be used. I don't think the names convey the differences
> sufficiently, and so we should look to rename one or both to be clearer.
>
Regular lookup (rte_rib_tree_lookup) will lookup for most specific node for
passed key.
rte_rib_tree_lookup_exact will lookup node contained key and depth equal to
passed in args. It used to find exact route.


>
> > +
> > +/**
> > + * Retrieve next more specific prefix from the RIB
> s/more/most/
>

> > + * that is covered by key/depth supernet
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net address of supernet prefix that covers returned more specific
> prefixes
> > + * @param depth
> > + *  supernet prefix length
> > + * @param cur
> > + *   pointer to the last returned prefix to get next prefix
> > + *   or
> > + *   NULL to get first more specific prefix
> > + * @param flag
> > + *  -RTE_RIB_GET_NXT_ALL
> > + *   get all prefixes from subtrie
>
> By all prefixes do you mean more specific, i.e. the final prefix?
>
What do you mean the final prefix?


> > + *  -RTE_RIB_GET_NXT_COVER
> > + *   get only first more specific prefix even if it have more specifics
> > + * @return
> > + *  pointer to the next more specific prefix
> > + *  or
> > + *  NULL if there is no prefixes left
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t depth,
> > +     struct rte_rib_node *cur, int flag);
> > +
> > +/**
> > + * Remove prefix from the RIB
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be removed from the RIB
> > + * @param depth
> > + *  prefix length
> > + */
> > +void
> > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > +
> > +/**
> > + * Insert prefix into the RIB
> > + *
> > + * @param rib
> > + *  RIB object handle
> > + * @param key
> > + *  net to be inserted to the RIB
> > + * @param depth
> > + *  prefix length
> > + * @return
> > + *  pointer to new rte_rib_node on success
> > + *  NULL otherwise
> > + */
> > +struct rte_rib_node *
> > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t depth);
> > +
> > +/**
> > + * Create RIB
> > + *
> > + * @param name
> > + *  RIB name
> > + * @param socket_id
> > + *  NUMA socket ID for RIB table memory allocation
> > + * @param conf
> > + *  Structure containing the configuration
> > + * @return
> > + *  Handle to RIB object on success
> > + *  NULL otherwise with rte_errno set to an appropriate values.
> > + */
> > +struct rte_rib *
> > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> *conf);
> > +
> > +/**
> > + * Find an existing RIB object and return a pointer to it.
> > + *
> > + * @param name
> > + *  Name of the rib object as passed to rte_rib_create()
> > + * @return
> > + *  Pointer to rib object or NULL if object not found with rte_errno
> > + *  set appropriately. Possible rte_errno values include:
> > + *   - ENOENT - required entry not available to return.
> > + */
> > +struct rte_rib *
> > +rte_rib_find_existing(const char *name);
> > +
> > +/**
> > + * Free an RIB object.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @return
> > + *   None
> > + */
> > +void
> > +rte_rib_free(struct rte_rib *rib);
> > +
> > +/**
> > + * Add a rule to the RIB.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @param ip
> > + *   IP of the rule to be added to the RIB
> > + * @param depth
> > + *   Depth of the rule to be added to the RIB
> > + * @param next_hop
> > + *   Next hop of the rule to be added to the RIB
> > + * @return
> > + *   0 on success, negative value otherwise
> > + */
> > +int
> > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth, uint64_t
> next_hop);
> > +
> > +/**
> > + * Delete a rule from the RIB.
> > + *
> > + * @param rib
> > + *   RIB object handle
> > + * @param ip
> > + *   IP of the rule to be deleted from the RIB
> > + * @param depth
> > + *   Depth of the rule to be deleted from the RIB
> > + * @return
> > + *   0 on success, negative value otherwise
> > + */
> > +int
> > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > +
> > +/**
> > + * Lookup multiple IP addresses in an FIB. This may be implemented as a
> > + * macro, so the address of the function should not be used.
> > + *
> > + * @param RIB
> > + *   RIB object handle
> > + * @param ips
> > + *   Array of IPs to be looked up in the FIB
> > + * @param next_hops
> > + *   Next hop of the most specific rule found for IP.
> > + *   This is an array of eight byte values.
> > + *   If the lookup for the given IP failed, then corresponding element
> would
> > + *   contain default value, see description of then next parameter.
> > + * @param n
> > + *   Number of elements in ips (and next_hops) array to lookup. This
> should be a
> > + *   compile time constant, and divisible by 8 for best performance.
> > + * @param defv
> > + *   Default value to populate into corresponding element of hop[]
> array,
> > + *   if lookup would fail.
> > + *  @return
> > + *   -EINVAL for incorrect arguments, otherwise 0
> > + */
> > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > +     rib->lookup(rib->fib, ips, next_hops, n)
>
> My main thought here is whether this needs to be a function at all?
> Given that it takes a full burst of addresses in a single go, how much
> performance would actually be lost by making this a regular function in
> the C file?
> IF we do convert this to a regular function, then a lot of the structure
> definitions above - most importantly, the rib structure itself - can
> probably be moved to a private header file and not exposed to
> applications at all. This will make ABI compatibility a *lot* easier, as
> the structures can be changed without affecting the public ABI.
>
I didn't quite understand what you mean.


> /Bruce
>
> > +
> > +#endif /* _RTE_RIB_H_ */
>



-- 
Regards,
Vladimir


More information about the dev mailing list