[dpdk-dev] [PATCH v2 1/2] Add RIB library

Vladimir Medvedkin medvedkinv at gmail.com
Thu Mar 29 21:59:33 CEST 2018


2018-03-26 12:50 GMT+03:00 Bruce Richardson <bruce.richardson at intel.com>:

> On Sun, Mar 25, 2018 at 09:17:20PM +0300, Vladimir Medvedkin wrote:
> > Hi,
> >
> > 2018-03-14 14:09 GMT+03:00 Bruce Richardson <bruce.richardson at intel.com
> >:
> >
> > > On Wed, Feb 21, 2018 at 09:44:54PM +0000, Medvedkin Vladimir wrote:
> > > > RIB is an alternative to current LPM library.
> > > > It solves the following problems
> > > >  - Increases the speed of control plane operations against lpm such
> as
> > > >    adding/deleting routes
> > > >  - Adds abstraction from dataplane algorithms, so it is possible to
> add
> > > >    different ip route lookup algorythms such as
> DXR/poptrie/lpc-trie/etc
> > > >    in addition to current dir24_8
> > > >  - It is possible to keep user defined application specific
> additional
> > > >    information in struct rte_rib_node which represents route entry.
> > > >    It can be next hop/set of next hops (i.e. active and feasible),
> > > >    pointers to link rte_rib_node based on some criteria (i.e.
> next_hop),
> > > >    plenty of additional control plane information.
> > > >  - For dir24_8 implementation it is possible to remove
> > > rte_lpm_tbl_entry.depth
> > > >    field that helps to save 6 bits.
> > > >  - Also new dir24_8 implementation supports different next_hop sizes
> > > >    (1/2/4/8 bytes per next hop)
> > > >  - Removed RTE_LPM_LOOKUP_SUCCESS to save 1 bit and to eleminate
> ternary
> > > operator.
> > > >    Instead it returns special default value if there is no route.
> > > >
> > > > Signed-off-by: Medvedkin Vladimir <medvedkinv at gmail.com>
> > > > ---
> > > >  config/common_base                 |   6 +
> > > >  doc/api/doxy-api.conf              |   1 +
> > > >  lib/Makefile                       |   2 +
> > > >  lib/librte_rib/Makefile            |  22 ++
> > > >  lib/librte_rib/rte_dir24_8.c       | 482
> ++++++++++++++++++++++++++++++
> > > +++
> > > >  lib/librte_rib/rte_dir24_8.h       | 116 ++++++++
> > > >  lib/librte_rib/rte_rib.c           | 526
> ++++++++++++++++++++++++++++++
> > > +++++++
> > > >  lib/librte_rib/rte_rib.h           | 322 +++++++++++++++++++++++
> > > >  lib/librte_rib/rte_rib_version.map |  18 ++
> > > >  mk/rte.app.mk                      |   1 +
> > > >  10 files changed, 1496 insertions(+)
> > > >  create mode 100644 lib/librte_rib/Makefile
> > > >  create mode 100644 lib/librte_rib/rte_dir24_8.c
> > > >  create mode 100644 lib/librte_rib/rte_dir24_8.h
> > > >  create mode 100644 lib/librte_rib/rte_rib.c
> > > >  create mode 100644 lib/librte_rib/rte_rib.h
> > > >  create mode 100644 lib/librte_rib/rte_rib_version.map
> > > >
> > >
> > > First pass review comments. For now just reviewed the main public
> header
> > > file rte_rib.h. Later reviews will cover the other files as best I can.
> > >
> > > /Bruce
> > >
> > > <snip>
> > > > diff --git a/lib/librte_rib/rte_rib.h b/lib/librte_rib/rte_rib.h
> > > > new file mode 100644
> > > > index 0000000..6eac8fb
> > > > --- /dev/null
> > > > +++ b/lib/librte_rib/rte_rib.h
> > > > @@ -0,0 +1,322 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright(c) 2018 Vladimir Medvedkin <medvedkinv at gmail.com>
> > > > + */
> > > > +
> > > > +#ifndef _RTE_RIB_H_
> > > > +#define _RTE_RIB_H_
> > > > +
> > > > +/**
> > > > + * @file
> > > > + * Compressed trie implementation for Longest Prefix Match
> > > > + */
> > > > +
> > > > +/** @internal Macro to enable/disable run-time checks. */
> > > > +#if defined(RTE_LIBRTE_RIB_DEBUG)
> > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval) do {    \
> > > > +     if (cond)                                       \
> > > > +             return retval;                          \
> > > > +} while (0)
> > > > +#else
> > > > +#define RTE_RIB_RETURN_IF_TRUE(cond, retval)
> > > > +#endif
> > >
> > > use RTE_ASSERT?
> > >
> > it was done just like it was done in the LPM lib. But if you think it
> > should be RTE_ASSERT so be it.
> >
> >
> > >
> > > > +
> > > > +#define RTE_RIB_VALID_NODE   1
> > >
> > > should there be an INVALID_NODE macro?
> > >
> > No
> >
> >
> > >
> > > > +#define RTE_RIB_GET_NXT_ALL  0
> > > > +#define RTE_RIB_GET_NXT_COVER        1
> > > > +
> > > > +#define RTE_RIB_INVALID_ROUTE        0
> > > > +#define RTE_RIB_VALID_ROUTE  1
> > > > +
> > > > +/** Max number of characters in RIB name. */
> > > > +#define RTE_RIB_NAMESIZE     64
> > > > +
> > > > +/** Maximum depth value possible for IPv4 RIB. */
> > > > +#define RTE_RIB_MAXDEPTH     32
> > >
> > > I think we should have IPv4 in the name here. Will it not be extended
> to
> > > support IPv6 in future?
> > >
> > I think there should be a separate implementation of the library for ipv6
> >
> I can understand the need for a separate LPM implementation, but should
> they both not be under the same rib library?
>
I planned to have sepatate rib6 for ipv6.
Of course it is possible to make universal abstraction for both v4 and v6.
But in this case there will be universal rte_rib_node with type (v4|v6) and
variable length, keys will become union of uint32_t for v4 and uint8_t [16]
for v6. I think it is overcomplication.


> >
> > >
> > > > +
> > > > +/**
> > > > + * Macro to check if prefix1 {key1/depth1}
> > > > + * is covered by prefix2 {key2/depth2}
> > > > + */
> > > > +#define RTE_RIB_IS_COVERED(key1, depth1, key2, depth2)
> > >      \
> > > > +     ((((key1 ^ key2) & (uint32_t)(UINT64_MAX << (32 - depth2))) ==
> 0)\
> > > > +             && (depth1 > depth2))
> > > Neat check!
> > >
> > > Any particular reason for using UINT64_MAX here rather than UINT32_MAX?
> >
> > in case when depth2 = 0 UINT32_MAX shifted left by 32 bit will remain
> > UINT32_MAX because shift count will be masked to 5 bits.
> >
> > I think you can avoid the casting and have a slightly shorter mask by
> > > changing "(uint32_t)(UINT64_MAX << (32 - depth2)" to
> > > "~(UINT32_MAX >> depth2)"
> > > I'd also suggest for readability putting the second check first, and,
> > > for maintainability, using an inline function rather than a macro.
> > >
> >  Agree, it looks clearer
> >
> >
> > > > +
> > > > +/** @internal Macro to get next node in tree*/
> > > > +#define RTE_RIB_GET_NXT_NODE(node, key)
> > >       \
> > > > +     ((key & (1 << (31 - node->depth))) ? node->right : node->left)
> > > > +/** @internal Macro to check if node is right child*/
> > > > +#define RTE_RIB_IS_RIGHT_NODE(node)  (node->parent->right == node)
> > >
> > > Again, consider inline fns rather than macros.
> > >
> > Ok
> >
> > For the latter macro, rather than doing additional pointer derefs to
> > > parent, can you also get if it's a right node by using:
> > > "(node->key & (1 << (32 - node->depth)))"?
> > >
> > No, it is not possible. Decision whether node be left or right is made
> > using parent and child common depth.
> > Consider case with 10.0.0.0/8 and 10.128.0.0/24. In this way common
> depth
> > will be /8 and 10.128.0.0/24 will be right child.
> >
> >
> > > > +
> > > > +
> > > > +struct rte_rib_node {
> > > > +     struct rte_rib_node *left;
> > > > +     struct rte_rib_node *right;
> > > > +     struct rte_rib_node *parent;
> > > > +     uint32_t        key;
> > > > +     uint8_t         depth;
> > > > +     uint8_t         flag;
> > > > +     uint64_t        nh;
> > > > +     uint64_t        ext[0];
> > > > +};
> > > > +
> > > > +struct rte_rib;
> > > > +
> > > > +/** Type of FIB struct*/
> > > > +enum rte_rib_type {
> > > > +     RTE_RIB_DIR24_8_1B,
> > > > +     RTE_RIB_DIR24_8_2B,
> > > > +     RTE_RIB_DIR24_8_4B,
> > > > +     RTE_RIB_DIR24_8_8B,
> > > > +     RTE_RIB_TYPE_MAX
> > > > +};
> > >
> > > If the plan is to support multiple underlying fib types and algorithms
> > > under the rib library, would it not be better to separate out the
> > > algorithm part from the data storage part? So have the type just be
> > > DIR_24_8, and have the 1, 2, 4 or 8 specified separately.
> > >
> > Yes, we were talk about it in IRC, agree. Now I pass next hop size in
> > union rte_rib_fib_conf inside rte_rib_conf
> >
> >
> > >
> > > > +
> > > > +enum rte_rib_op {
> > > > +     RTE_RIB_ADD,
> > > > +     RTE_RIB_DEL
> > > > +};
> > > > +
> > > > +/** RIB nodes allocation type */
> > > > +enum rte_rib_alloc_type {
> > > > +     RTE_RIB_MALLOC,
> > > > +     RTE_RIB_MEMPOOL,
> > > > +     RTE_RIB_ALLOC_MAX
> > > > +};
> > >
> > > Not sure you need this any more. Malloc allocations and mempool
> > > allocations are now pretty much the same thing.
> > >
> > Actually I think to remove malloc. On performance tests with
> > adding/deleting huge amount of routes malloc is slower. Maybe because of
> > fragmentation.
> > What do you think?
> >
> Yes, definitely mempool allocations are the way to go!
>
> >
> > > > +
> > > > +typedef int (*rte_rib_modify_fn_t)(struct rte_rib *rib, uint32_t
> key,
> > > > +     uint8_t depth, uint64_t next_hop, enum rte_rib_op op);
> > >
> > > Do you anticipate more ops in future than just add and delete? If not,
> > > why not just split this function into two and drop the op struct.
> > >
> > It is difficult question. I'm not ready to make decision at the moment.
> >
> >
> > >
> > > > +typedef int (*rte_rib_tree_lookup_fn_t)(void *fib, const uint32_t
> *ips,
> > > > +     uint64_t *next_hops, const unsigned n);
> > > > +typedef struct rte_rib_node *(*rte_rib_alloc_node_fn_t)(struct
> rte_rib
> > > *rib);
> > > > +typedef void (*rte_rib_free_node_fn_t)(struct rte_rib *rib,
> > > > +     struct rte_rib_node *node);
> > > > +
> > > > +struct rte_rib {
> > > > +     char name[RTE_RIB_NAMESIZE];
> > > > +     /*pointer to rib trie*/
> > > > +     struct rte_rib_node     *trie;
> > > > +     /*pointer to dataplane struct*/
> > > > +     void    *fib;
> > > > +     /*prefix modification*/
> > > > +     rte_rib_modify_fn_t     modify;
> > > > +     /* Bulk lookup fn*/
> > > > +     rte_rib_tree_lookup_fn_t        lookup;
> > > > +     /*alloc trie element*/
> > > > +     rte_rib_alloc_node_fn_t alloc_node;
> > > > +     /*free trie element*/
> > > > +     rte_rib_free_node_fn_t  free_node;
> > > > +     struct rte_mempool      *node_pool;
> > > > +     uint32_t                cur_nodes;
> > > > +     uint32_t                cur_routes;
> > > > +     int                     max_nodes;
> > > > +     int                     node_sz;
> > > > +     enum rte_rib_type       type;
> > > > +     enum rte_rib_alloc_type alloc_type;
> > > > +};
> > > > +
> > > > +/** RIB configuration structure */
> > > > +struct rte_rib_conf {
> > > > +     enum rte_rib_type       type;
> > > > +     enum rte_rib_alloc_type alloc_type;
> > > > +     int     max_nodes;
> > > > +     size_t  node_sz;
> > > > +     uint64_t def_nh;
> > > > +};
> > > > +
> > > > +/**
> > > > + * Lookup an IP into the RIB structure
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  IP to be looked up in the RIB
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup(struct rte_rib *rib, uint32_t key);
> > > > +
> > > > +/**
> > > > + * Lookup less specific route into the RIB structure
> > > > + *
> > > > + * @param ent
> > > > + *  Pointer to struct rte_rib_node that represents target route
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node that represents
> > > > + *  less specific route on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup_parent(struct rte_rib_node *ent);
> > > > +
> > > > +/**
> > > > + * Lookup prefix into the RIB structure
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be looked up in the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + * @return
> > > > + *  pointer to struct rte_rib_node on success,
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_lookup_exact(struct rte_rib *rib, uint32_t key,
> uint8_t
> > > depth);
> > >
> > > Can you explain the difference between this and regular lookup, and how
> > > they would be used. I don't think the names convey the differences
> > > sufficiently, and so we should look to rename one or both to be
> clearer.
> > >
> > Regular lookup (rte_rib_tree_lookup) will lookup for most specific node
> for
> > passed key.
> > rte_rib_tree_lookup_exact will lookup node contained key and depth equal
> to
> > passed in args. It used to find exact route.
> >
> So if there is no node exactly matching the parameters, it the lookup_exact
> returns failure? E.g. if you request a /24 node, it won't return a /8 node
> that would cover the /24?
>
yes, it returns failure. Use  rte_rib_tree_lookup(without depth) and if you
want use  rte_rib_tree_lookup_parent after.


> >
> > >
> > > > +
> > > > +/**
> > > > + * Retrieve next more specific prefix from the RIB
> > > s/more/most/
> > >
> >
> > > > + * that is covered by key/depth supernet
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net address of supernet prefix that covers returned more
> specific
> > > prefixes
> > > > + * @param depth
> > > > + *  supernet prefix length
> > > > + * @param cur
> > > > + *   pointer to the last returned prefix to get next prefix
> > > > + *   or
> > > > + *   NULL to get first more specific prefix
> > > > + * @param flag
> > > > + *  -RTE_RIB_GET_NXT_ALL
> > > > + *   get all prefixes from subtrie
> > >
> > > By all prefixes do you mean more specific, i.e. the final prefix?
> > >
> > What do you mean the final prefix?
> >
> The most specific one, or the longest prefix.

This function is created for different task, not for lpm lookup. This
function is for traverse on trie and retrieve routes that falls under the
key/depth so term the longest prefix is irrelevant here.
Let me explain with an example. Imagine there are 10.0.0.0/8, 10.0.0.0/24,
10.0.0.10/32 and 10.64.0.0/10 in routing table.
You have code like:

rte_rib_node *tmp = NULL;
do {
    tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 8,
RTE_RIB_GET_NXT_ALL); /* retrieve all routes that belongs to 10.0.0.0/8 */
    if (node)
          printf("%u/%u\n", tmp->key, tmp->depth);
} while (tmp);

in this case you will see all subprefixes, but without 10.0.0.0/8:
10.0.0.0/24
10.0.0.10/32
10.64.0.0/10

If you want 10.0.0.0/8 do
tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 7, RTE_RIB_GET_NXT_ALL); /*
retrieve all routes that belongs to 10.0.0.0/7 */

And if you call it with RTE_RIB_GET_NXT_COVER like
tmp = rte_rib_tree_get_nxt(rib, IPv4(10,0,0,0), 8, RTE_RIB_GET_NXT_COVER);
you will get
10.0.0.0/24
10.64.0.0/10
without
10.0.0.10/32
This is useful if you want to get gaps for 10.0.0.0/8 that not covered by
presented routes.


>
> >
> > > > + *  -RTE_RIB_GET_NXT_COVER
> > > > + *   get only first more specific prefix even if it have more
> specifics
> > > > + * @return
> > > > + *  pointer to the next more specific prefix
> > > > + *  or
> > > > + *  NULL if there is no prefixes left
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_get_nxt(struct rte_rib *rib, uint32_t key, uint8_t
> depth,
> > > > +     struct rte_rib_node *cur, int flag);
> > > > +
> > > > +/**
> > > > + * Remove prefix from the RIB
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be removed from the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + */
> > > > +void
> > > > +rte_rib_tree_remove(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
> > > > +
> > > > +/**
> > > > + * Insert prefix into the RIB
> > > > + *
> > > > + * @param rib
> > > > + *  RIB object handle
> > > > + * @param key
> > > > + *  net to be inserted to the RIB
> > > > + * @param depth
> > > > + *  prefix length
> > > > + * @return
> > > > + *  pointer to new rte_rib_node on success
> > > > + *  NULL otherwise
> > > > + */
> > > > +struct rte_rib_node *
> > > > +rte_rib_tree_insert(struct rte_rib *rib, uint32_t key, uint8_t
> depth);
> > > > +
> > > > +/**
> > > > + * Create RIB
> > > > + *
> > > > + * @param name
> > > > + *  RIB name
> > > > + * @param socket_id
> > > > + *  NUMA socket ID for RIB table memory allocation
> > > > + * @param conf
> > > > + *  Structure containing the configuration
> > > > + * @return
> > > > + *  Handle to RIB object on success
> > > > + *  NULL otherwise with rte_errno set to an appropriate values.
> > > > + */
> > > > +struct rte_rib *
> > > > +rte_rib_create(const char *name, int socket_id, struct rte_rib_conf
> > > *conf);
> > > > +
> > > > +/**
> > > > + * Find an existing RIB object and return a pointer to it.
> > > > + *
> > > > + * @param name
> > > > + *  Name of the rib object as passed to rte_rib_create()
> > > > + * @return
> > > > + *  Pointer to rib object or NULL if object not found with rte_errno
> > > > + *  set appropriately. Possible rte_errno values include:
> > > > + *   - ENOENT - required entry not available to return.
> > > > + */
> > > > +struct rte_rib *
> > > > +rte_rib_find_existing(const char *name);
> > > > +
> > > > +/**
> > > > + * Free an RIB object.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @return
> > > > + *   None
> > > > + */
> > > > +void
> > > > +rte_rib_free(struct rte_rib *rib);
> > > > +
> > > > +/**
> > > > + * Add a rule to the RIB.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @param ip
> > > > + *   IP of the rule to be added to the RIB
> > > > + * @param depth
> > > > + *   Depth of the rule to be added to the RIB
> > > > + * @param next_hop
> > > > + *   Next hop of the rule to be added to the RIB
> > > > + * @return
> > > > + *   0 on success, negative value otherwise
> > > > + */
> > > > +int
> > > > +rte_rib_add(struct rte_rib *rib, uint32_t ip, uint8_t depth,
> uint64_t
> > > next_hop);
> > > > +
> > > > +/**
> > > > + * Delete a rule from the RIB.
> > > > + *
> > > > + * @param rib
> > > > + *   RIB object handle
> > > > + * @param ip
> > > > + *   IP of the rule to be deleted from the RIB
> > > > + * @param depth
> > > > + *   Depth of the rule to be deleted from the RIB
> > > > + * @return
> > > > + *   0 on success, negative value otherwise
> > > > + */
> > > > +int
> > > > +rte_rib_delete(struct rte_rib *rib, uint32_t ip, uint8_t depth);
> > > > +
> > > > +/**
> > > > + * Lookup multiple IP addresses in an FIB. This may be implemented
> as a
> > > > + * macro, so the address of the function should not be used.
> > > > + *
> > > > + * @param RIB
> > > > + *   RIB object handle
> > > > + * @param ips
> > > > + *   Array of IPs to be looked up in the FIB
> > > > + * @param next_hops
> > > > + *   Next hop of the most specific rule found for IP.
> > > > + *   This is an array of eight byte values.
> > > > + *   If the lookup for the given IP failed, then corresponding
> element
> > > would
> > > > + *   contain default value, see description of then next parameter.
> > > > + * @param n
> > > > + *   Number of elements in ips (and next_hops) array to lookup. This
> > > should be a
> > > > + *   compile time constant, and divisible by 8 for best performance.
> > > > + * @param defv
> > > > + *   Default value to populate into corresponding element of hop[]
> > > array,
> > > > + *   if lookup would fail.
> > > > + *  @return
> > > > + *   -EINVAL for incorrect arguments, otherwise 0
> > > > + */
> > > > +#define rte_rib_fib_lookup_bulk(rib, ips, next_hops, n)      \
> > > > +     rib->lookup(rib->fib, ips, next_hops, n)
> > >
> > > My main thought here is whether this needs to be a function at all?
> > > Given that it takes a full burst of addresses in a single go, how much
> > > performance would actually be lost by making this a regular function in
> > > the C file?
> > > IF we do convert this to a regular function, then a lot of the
> structure
> > > definitions above - most importantly, the rib structure itself - can
> > > probably be moved to a private header file and not exposed to
> > > applications at all. This will make ABI compatibility a *lot* easier,
> as
> > > the structures can be changed without affecting the public ABI.
> > >
> > I didn't quite understand what you mean.
> >
> Sorry, by "needs to be a function" in first line read "needs to be a
> macro". Basically, the point is to not inline anything that doesn't need
> it. If a function works on a burst of packets, it probably will be fine
> being a regular function than a macro or inline function.
>
ok, got it after conversation in IRC

>
> /Bruce
>



-- 
Regards,
Vladimir


More information about the dev mailing list