[RFC PATCH 0/4] VRF support in FIB library

Maxime Leroy maxime at leroys.fr
Thu Apr 2 18:51:20 CEST 2026


Hi Vladimir,

On Fri, Mar 27, 2026 at 7:28 PM Medvedkin, Vladimir
<vladimir.medvedkin at intel.com> wrote:
>
> Hi Maxime,
>
> On 3/25/2026 9:43 PM, Maxime Leroy wrote:
> > Hi Vladimir,
> >
> > On Wed, Mar 25, 2026 at 4:56 PM Medvedkin, Vladimir
> > <vladimir.medvedkin at intel.com> wrote:
> >>
> >> On 3/24/2026 9:19 AM, Maxime Leroy wrote:
> >>> Hi Vladimir,
> >>>
> >>> On Mon, Mar 23, 2026 at 7:46 PM Medvedkin, Vladimir
> >>> <vladimir.medvedkin at intel.com> wrote:
> >>>> On 3/23/2026 2:53 PM, Maxime Leroy wrote:
> >>>>> On Mon, Mar 23, 2026 at 1:49 PM Medvedkin, Vladimir
> >>>>> <vladimir.medvedkin at intel.com> wrote:
> >>>>>> Hi Maxime,
> >>>>>>
> >>>>>> On 3/23/2026 11:27 AM, Maxime Leroy wrote:
> >>>>>>>      Hi Vladimir,
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, Mar 22, 2026 at 4:42 PM Vladimir Medvedkin
> >>>>>>> <vladimir.medvedkin at intel.com> wrote:
> >>>>>>>> This series adds multi-VRF support to both IPv4 and IPv6 FIB paths by
> >>>>>>>> allowing a single FIB instance to host multiple isolated routing domains.
> >>>>>>>>
> >>>>>>>> Currently a FIB instance represents one routing instance. For workloads that
> >>>>>>>> need multiple VRFs, the only option is to create multiple FIB objects. In a
> >>>>>>>> burst oriented datapath, packets in the same batch can belong to different VRFs, so
> >>>>>>>> the application either does per-packet lookup in different FIB instances or
> >>>>>>>> regroups packets by VRF before lookup. Both approaches are expensive.
> >>>>>>>>
> >>>>>>>> To remove that cost, this series keeps all VRFs inside one FIB instance and
> >>>>>>>> extends lookup input with per-packet VRF IDs.
> >>>>>>>>
> >>>>>>>> The design follows the existing fast-path structure for both families. IPv4 and
> >>>>>>>> IPv6 use multi-ary trees with 2^24 associativity on the first level (tbl24). The
> >>>>>>>> first-level table scales per configured VRF. This increases memory usage, but
> >>>>>>>> keeps performance and lookup complexity on par with the non-VRF implementation.
> >>>>>>>>
> >>>>>>> Thanks for the RFC. Some thoughts below.
> >>>>>>>
> >>>>>>> Memory cost: the flat TBL24 replicates the entire table for every VRF
> >>>>>>> (num_vrfs * 2^24 * nh_size). With 256 VRFs and 8B nexthops that is
> >>>>>>> 32 GB for TBL24 alone. In grout we support up to 256 VRFs allocated
> >>>>>>> on demand -- this approach forces the full cost upfront even if most
> >>>>>>> VRFs are empty.
> >>>>>> Yes, increased memory consumption is the trade-off. We make this
> >>>>>> choice in DPDK quite often, such as pre-allocated mbufs, mempools
> >>>>>> and many other things allocated in advance to gain performance.
> >>>>>> For FIB, I chose to replicate TBL24 per VRF for this same reason.
> >>>>>>
> >>>>>> And, as Morten mentioned earlier, if memory is the priority, a table
> >>>>>> instance per VRF allocated on-demand is still supported.
> >>>>>>
> >>>>>> The high memory cost stems from TBL24's design: for IPv4, it was
> >>>>>> justified by the BGP filtering convention (no prefixes more specific
> >>>>>> than /24 in BGPv4 full view), ensuring most lookups hit with just one
> >>>>>> random memory access. For IPv6, we should likely switch to a 16-bit TRIE
> >>>>>> scheme on all layers. For IPv4, alternative algorithms with smaller
> >>>>>> footprints (like DXR or DIR16-8-8, as used in VPP) may be worth
> >>>>>> exploring if BGP full view is not required for those VRFs.
> >>>>>>
> >>>>>>> Per-packet VRF lookup: Rx bursts come from one port, thus one VRF.
> >>>>>>> Mixed-VRF bulk lookups do not occur in practice. The three AVX512
> >>>>>>> code paths add complexity for a scenario that does not exist, at
> >>>>>>> least for a classic router. Am I missing a use-case?
> >>>>>> That's not true, you're missing a lot of established core use
> >>>>>> cases that are at least two decades old:
> >>>>>>
> >>>>>> - VLAN subinterface abstraction. Each subinterface may belong to a
> >>>>>> separate VRF
> >>>>>>
> >>>>>> - MPLS VPN
> >>>>>>
> >>>>>> - Policy based routing
> >>>>>>
> >>>>> Fair point on VLAN subinterfaces and MPLS VPN. SRv6 L3VPN (End.DT4/
> >>>>> End.DT6) also fits that pattern after decap.
> >>>>>
> >>>>> I agree DPDK often pre-allocates for performance, but I wonder if the
> >>>>> flat TBL24 actually helps here. Each VRF's working set is spread
> >>>>> 128 MB apart in the flat table. Would regrouping packets by VRF and
> >>>>> doing one bulk lookup per VRF with separate contiguous TBL24s be
> >>>>> more cache-friendly than a single mixed-VRF gather? Do you have
> >>>>> benchmarks comparing the two approaches?
> >>>> It depends. Generally, if we assume that we are working with wide
> >>>> internet traffic, then even for a single VRF we will most likely miss
> >>>> the cache for TBL24; thus, regardless of the size of the tbl24, each
> >>>> memory access will be performed directly to DRAM.
> >>> If the lookup is DRAM-bound anyway, then the 10 cycles/addr cost
> >>> is dominated by memory latency, not CPU. The CPU cost of a bucket
> >>> sort on 32-64 packets is negligible next to a DRAM access (~80-100
> >>> ns per cache miss).
> >> memory accesses are independent and executed in parallel in the CPU
> >> pipeline
> >>>    That actually makes the case for regroup +
> >>> per-VRF lookup: the regrouping is pure CPU work hidden behind
> >>> memory stalls,
> >> regrouping must be performed before the memory accesses, so it cannot
> >> be amortized in between memory reads
> > With internet traffic, TBL24 lookups quickly become limited by
> > cache misses, not CPU cycles. Even if some bursts hit the same
> > routes and benefit from cache locality, the CPU has a limited
> > number of outstanding misses (load buffer entries, MSHRs) --
> > out-of-order execution helps, but it is not magic.
> Correct, but this does not contradict what I'm saying
> >
> > The whole point of vector/graph processing (VPP, DPDK graph, etc.)
> > is to amortize that memory latency: prefetch for packet N+1 while
> > processing packet N. This works because all packets in a batch
> > hit the same data structure in a tight loop.
> https://github.com/DPDK/dpdk/blob/626d4e39327333cd5508885162e45ca7fb94ef7f/lib/fib/dir24_8.h#L161
>
> >
> > With separate per-VRF TBL24s, a bucket sort by VRF -- a few
> > dozen cycles, all in L1 -- gives you clean batches where
> > prefetching works as designed. This is exactly what graph nodes
> > already do: classify, then process per-class in a tight loop.
> How is lookup performed in this design? Do I understand it right:
> 1. sort the batch by VRF id, splitting it into sub-batches of IPs
> belonging to the same VRF
> 2. for each subset of IPs perform lookup in tbl24[batch_common_vrf_id]
> 3. unsort nexthops
>
> Correct?

No sort/unsort. This is how rte_graph classification works:

 ip_input (validation)
   -> ip_lookup-v0  (bulk fib4_lookup on homogeneous VRF 0 burst)
   -> ip_lookup-v1  (bulk fib4_lookup on homogeneous VRF 1 burst)
   -> ip_forward / ip_input_local / ...

 ip_input already iterates over packets for header validation and
 enqueues them to different next nodes. Adding a per-VRF edge costs
 one iface->vrf_id load (already in L1) and one rte_node_enqueue_x1()
 (already done today). Each ip_lookup-vN clone holds its VRF's
 rte_fib in node context and calls rte_fib_lookup_bulk() on the
 whole burst at once.

 We do not use bulk lookups yet in grout (each packet does its own
 rte_fib_lookup_bulk(..., 1) today), but this is how we would
 implement it.

 The tradeoff is batch fragmentation: with traffic spread across K
 active VRFs, each sub-batch is ~N/K packets. But in practice, most
 deployments have 1-3 hot VRFs, so batches stay large. And even
 fragmented batches benefit from the vectorized lookup -- 8 packets
 is still one AVX512 iteration, vs. 8 scalar lookups today.

> >
> >>>    and each per-VRF bulk lookup hits a contiguous
> >>> TBL24 instead of scattering across 128 MB-apart VRF regions.
> >> why is a contiguous 128 MB single-VRF TBL24 OK for you, but a bigger
> >> contiguous multi-VRF TBL24 is not OK in the context of lookup (here we
> >> are talking about lookup, omitting the problem of memory consumption
> >> at init)?
> > The performance difference may be small, but the flat approach
> > is not faster either -- while costing 64 GB upfront.
> it seems you are implicitly assuming 256 VRFs. Does my use case with a
> few VRFs have a right to exist?
> >
> >> In both of these cases, memory access behaves the same way within a
> >> single batch of packets during lookup, i.e. the first hit is likely a
> >> cache miss, regardless of whether we are dealing with one or more VRFs,
> >> it will not maintain TBL24 in L3$ in any way in a real dataplane app.
> >>
> >>>> And if the addresses are localized (i.e. most traffic is internal), then
> >>>> having multiple TBL24 won't make the situation much worse.
> >>>>
> >>> With localized traffic, regrouping by VRF + per-VRF lookup on
> >>> contiguous TBL24s would benefit from cache locality,
> >> why so? There will be no differences within a single batch with a
> >> reasonable size (for example 64), because within the lookup session, no
> >> matter with or without regrouping, temporal cache locality will be the same.
> >>
> >> Let's look at it from a different angle. Is it worth regrouping IP
> >> addresses by /8 (i.e. 8 MSBs) with the current implementation of a
> >> single-VRF FIB?
> >>
> >>>    while the
> >>> flat multi-VRF table spreads hot entries 128 MB apart. The flat
> >>> approach may actually be worse in that scenario
> >>>
> >>>> I don't have any benchmarks for regrouping, however I have 2 things to
> >>>> consider:
> >>>>
> >>>> 1. lookup is relatively fast (for IPv4 it is about 10 cycles per
> >>>> address, and I don't really want to slow it down)
> >>>>
> >>>> 2. incoming addresses and their corresponding VRFs are not controlled by
> >>>> "us", so this is a random set. Regrouping effectively is sorting. I'm
> >>>> not really happy to have nlogn complexity on a fast path :)
> >>> Without benchmarks, we do not know whether the flat approach is
> >>> actually faster than regroup + per-VRF lookup.
> >> feel free to share benchmark results. The only thing you need to add is
> >> the packets regrouping logic, and then use separate single-VRF FIB
> >> instances.
> > Your series introduces a new API that optimizes multi-VRF lookup.
> > The performance numbers should come with the proposal.
> By policy we cannot share raw performance numbers, and I think this is
> unnecessary, because performance depends on the testing environment
> (content of the routing table, CPU model, etc.).
>
> Tests I've done on my board with an IPv4 full view (782940 routes) and
> 4 VRFs, performing random lookups in all of them, showed 180% of the
> cost compared to a single VRF with the same RT content.
>
> You can test it in your environment with
> dpdk-test-fib -l 1,2  --no-pci -- -f <path to your routes> -e 4 -l
> 100000000 -V <number of VRFs>

The dpdk-test-fib benchmark is useful for measuring raw lookup
 throughput, but it does not capture the full picture. In a real
 router stack with rte_graph, the classification by VRF happens
 naturally as part of packet processing -- it is not an extra
 sorting step. The only way to compare both approaches fairly is
 to measure end-to-end forwarding performance in a real datapath.

 grout is an open source DPDK-based router built on rte_graph,
 designed to exercise and validate DPDK APIs in realistic
 conditions. I would be happy to help benchmark both approaches
 there.

>
> >
> >>>>> On the memory trade-off and VRF ID mapping: the API uses vrf_id as
> >>>>> a direct index (0 to max_vrfs-1). With 256 VRFs and 8B nexthops,
> >>>>> TBL24 alone costs 32 GB for IPv4 and 32 GB for IPv6 -- 64 GB total
> >>>>> at startup. In grout, VRF IDs are interface IDs that can be any
> >>>>> uint16_t, so we would also need to maintain a mapping between our
> >>>>> VRF IDs and FIB slot indices.
> >>>> of course, this is an application responsibility. In FIB, VRFs are in
> >>>> a contiguous range.
> >>>>>     We would need to introduce a max_vrfs
> >>>>> limit, which forces a bad trade-off: either set it low (e.g. 16)
> >>>>> and limit deployments, or set it high (e.g. 256) and pay 64 GB at
> >>>>> startup even with a single VRF. With separate FIB instances per VRF,
> >>>>> we only allocate what we use.
> >>>> Yes, I understand this. In the end, if the user wants to use 256 VRFs,
> >>>> the memory footprint will be at least 64 GB anyway.
> >>> The difference is when the memory is committed.
> >> yes, this is the only difference. It all comes down to the static vs
> >> dynamic memory allocation problem. And each of these approaches is good
> >> for solving a specific task. For the task of creating a new VRF, which
> >> is preferable - failing at init or at runtime?
> > The main problem is that your series imposes contiguous VRF IDs
> > (0 to max_vrfs-1). How a VRF is represented is a network stack
> > design decision
> exactly - a network stack decision. FIB is not a network stack.
> >   -- in Linux it is an ifindex,
> so every interface lives in its own private VRF?
> >   in Cisco a name,
> are you going to pass an array of strings on lookup?
> > in grout an interface ID.
> haven't we decided this is a problematic design (VLANs, L3VPN, etc.)?
> >   Any application using this API needs
> > a mapping layer on top.
> I think from my rhetorical questions this should be obvious
> >
> > In grout, everything is allocated dynamically: mempools, FIBs,
> > conntrack tables. Pre-allocating everything at init forces
> > hardcoded arbitrary limits and prevents memory reuse between
> > subsystems -- memory reserved for FIB TBL24 cannot be used for
> > conntrack when the VRF has no routes, and vice versa. We prefer
> > to allocate resources only when needed. It is simpler for users
> > and more efficient for memory.
> >
> >>>    With separate FIB
> >>> instances per VRF, you allocate 128 MB only when a VRF is actually
> >>> created at runtime. With the flat multi-VRF approach, you pay
> >>> max_vrfs * 128 MB at startup, even if only one VRF is active.
> >>>
> >>> On top of that, the API uses vrf_id as a direct index (0 to
> >>> max_vrfs-1). As Stephen noted, there are multiple ways to model
> >>> VRFs. Depending on the networking stack, VRFs are identified by
> >>> ifindex (Linux l3mdev), by name (Cisco, Juniper), or by some
> >>> other scheme. This means the application must maintain a mapping
> >>> between its own VRF representation and the FIB slot indices, and
> >>> choose max_vrfs upfront. What is the benefit of this flat
> >>> multi-VRF FIB if the application still needs to manage a
> >>> translation layer and pre-commit memory for VRFs that may never
> >>> exist?
> >> This is the control plane task.
> >>>> As a trade-off for a bad trade-off ;) I can suggest to allocate it in
> >>>> chunks. Let's say you are starting with 16 VRFs, and during runtime, if
> >>>> the user wants to increase the number of VRFs above this limit, you can
> >>>> allocate another 16xVRF FIB. Then, of course, you need to split
> >>>> addresses into 2 bursts, one for each FIB handle.
> >>> But then we are back to regrouping packets -- just by chunk of
> >>> VRFs instead of by individual VRF. If we have to sort the burst
> >>> anyway, what does the flat multi-VRF table buy us?
> >>>
> >>>>>>> I am not too familiar with DPDK FIB internals, but would it be
> >>>>>>> possible to keep a separate TBL24 per VRF and only share the TBL8
> >>>>>>> pool?
> >>>>>> that is how it is implemented right now, with one note - TBL24s
> >>>>>> are pre-allocated.
> >>>>>>> Something like pre-allocating an array of max_vrfs TBL24
> >>>>>>> pointers, allocating each TBL24 on demand at VRF add time,
> >>>>>> and you are suggesting allocating TBL24 on demand by adding an extra
> >>>>>> indirection layer. This will lead to lower performance, which I would like to avoid.
> >>>>>>>      and
> >>>>>>> having them all point into a shared TBL8 pool. The TBL8 index in
> >>>>>>> TBL24 entries seems to already be global, so would that work without
> >>>>>>> encoding changes?
> >>>>>>>
> >>>>>>> Going further: could the same idea extend to IPv6? The dir24_8 and
> >>>>>>> trie seem to use the same TBL8 block format (256 entries, same
> >>>>>>> (nh << 1) | ext_bit encoding, same size). Would unifying the TBL8
> >>>>>>> allocator allow a single pool shared across IPv4, IPv6, and all
> >>>>>>> VRFs? That could be a bigger win for /32-heavy and /128-heavy tables
> >>>>>>> and maybe a good first step before multi-VRF.
> >>>>>> So, you are suggesting merging IPv4 and IPv6 into a single unified FIB?
> >>>>>> I'm not sure how this can be a bigger win, could you please elaborate
> >>>>>> more on this?
> >>>>> On the IPv4/IPv6 TBL8 pool: I was not suggesting merging FIBs, just
> >>>>> sharing the TBL8 block allocator between separate FIB instances.
> >>>>> This is possible since dir24_8 and trie use the same TBL8 block
> >>>>> format (256 entries, same encoding, same size).
> >>>>>
> >>>>> Would it be possible to pass a shared TBL8 pool at rte_fib_create()
> >>>>> time? Each FIB keeps its own TBL24 and RIB, but TBL8 is shared
> >>>>> across all FIBs and potentially across IPv4/IPv6. Users would no
> >>>>> longer have to guess num_tbl8 per FIB.
> >>>> Yes, this is possible. However, this will significantly complicate
> >>>> working with the library while solving a not-so-big problem.
> >>> Your series already shares TBL8 across all VRFs within a single
> >>> FIB -- that part is useful, and it does not require the flat
> >>> multi-VRF TBL24.
> >>>
> >>> In grout, routes arrive from FRR (BGP, OSPF, etc.) at runtime.
> >>> We cannot predict TBL8 usage per VRF in advance
> >> and you don't need it (knowing per-VRF consumption) now. If I understood
> >> your request here properly, you want to share TBL8 between IPv4 and
> >> IPv6 FIBs? I don't think this is a good idea at all. At the very least,
> >> keeping them split means that if one AF consumes all TBL8s (because of
> >> an attack or a bogus CP), the other AF remains intact.
> > If TBL8 isolation per AF is meant as a protection against route
> > floods, then the same argument applies between VRFs: your series
> > shares TBL8 across all VRFs within a single FIB, so a bogus
> > control plane in one VRF exhausts TBL8 for all other VRFs.
> >
> > But more fundamentally, this is not how route flood protection
> > works. It is handled in the control plane: the routing daemon
> > limits the number of prefixes accepted per BGP session
> > (max-prefix) and selects which routes are installed via prefix
> > filters -- before those routes ever reach the forwarding table.
> >
> > The Linux kernel is a good reference here. IPv6 used to enforce
> > a max_size limit on FIB + cache entries (net.ipv6.route.max_size,
> > defaulting to 4096). It caused real production issues and was
> > removed in kernel 6.3. IPv4 never had a FIB route limit. There
> > is no per-VRF route limit either. The kernel relies entirely on
> > the control plane for route flood protection.
>
> FIB is neither the Linux kernel nor a network stack. We cannot rely on
> control plane protection, since the control plane is 3rd-party
> software.
>
> Also, I think allocating a very algorithm-specific entity such as a
> pool of TBL8s prior to calling rte_fib_create, and passing a pointer to
> it, could be confusing for many users and would bloat the API.
>
> FIB supports pluggable lookup algorithms; you can write your own and
> specify a pointer to the tbl8_pool in an algorithm-specific
> configuration defined for your algorithm, where you may also create a
> dynamic table of TBL24 pointers, one per VRF. If you need any help with
> this task, I would be happy to help.
I have sent this RFC:
https://mails.dpdk.org/archives/dev/2026-March/335512.html

Thanks in advance for your help.

>
> >
> >>>    -- it depends on
> >>> prefix length distribution which varies per VRF and changes over
> >>> time. No production LPM (Linux kernel, JunOS, IOS) asks the
> >>> operator to size these structures per routing table upfront.
> >> - they are using different LPM algorithms
> >> - when you use these facilities, their developers have properly tuned
> >> them. FIB is a low-level library; it cannot be used without any
> >> knowledge, and it will not solve every problem with a single red
> >> button "make it work, don't make any bugs"
> >> P.S. how do you know how JunOS/IOS implement their LPMs? ;)
> >>
> > I do not need to know their LPM implementation -- I only need
> > to know how they are configured. No production router requires
> > the operator to size internal LPM structures.
> >
> > We can impose a maximum number of IPv4/IPv6 routes on the user
> > -- even though the kernel does not need this either. But TBL8
> > is a different problem: the application cannot predict TBL8
> > consumption because it depends on prefix length distribution,
> > which varies per VRF and changes over time with dynamic routing.
> > Today there is no API to query TBL8 usage, and no API to resize
> > a FIB without destroying it.
> >
> > This is exactly why a shared TBL8 pool across VRFs is useful:
> > VRFs with few long prefixes naturally leave room for VRFs that
> > need more.
>
> but this is already implemented. I don't know why you keep raising
> this concern. We are aligned on this, and the feature is already
> there - in the patch.
> On the other hand, what we disagreed on is sharing not only across
> VRFs but also across address families. If you don't understand the
> amount of TBL8 per AF, how would you magically understand the number of
> TBL8s for a merged pool?
>
> >   This is the valuable part of your series. But it
> > does not require a flat multi-VRF TBL24 -- separate per-VRF
> > TBL24s sharing a common TBL8 pool would give the same benefit
> > without the 64 GB upfront cost.
>
> and that's a completely different problem. Please, let's separate the
> problems and not mix them up.
>
> I understand your concern about memory consumption. I have some ideas
> on how to solve this problem in parallel to the proposed solution.
>
> >
> >>> Today we do not even have TBL8 usage stats (Robin's series
> >>> addresses that)
> >> I will try to find time to review this patch in the near future
> > Thanks, Robin's TBL8 stats series would help users understand
> > their TBL8 consumption -- a more practical improvement for
> > current users.
> >
> > Regards,
> > Maxime
>
> --
> Regards,
> Vladimir
>

--
Regards,

Maxime

