[RFC PATCH 0/4] VRF support in FIB library
Medvedkin, Vladimir
vladimir.medvedkin at intel.com
Fri Mar 27 19:27:55 CET 2026
Hi Maxime,
On 3/25/2026 9:43 PM, Maxime Leroy wrote:
> Hi Vladimir,
>
> On Wed, Mar 25, 2026 at 4:56 PM Medvedkin, Vladimir
> <vladimir.medvedkin at intel.com> wrote:
>>
>> On 3/24/2026 9:19 AM, Maxime Leroy wrote:
>>> Hi Vladimir,
>>>
>>> On Mon, Mar 23, 2026 at 7:46 PM Medvedkin, Vladimir
>>> <vladimir.medvedkin at intel.com> wrote:
>>>> On 3/23/2026 2:53 PM, Maxime Leroy wrote:
>>>>> On Mon, Mar 23, 2026 at 1:49 PM Medvedkin, Vladimir
>>>>> <vladimir.medvedkin at intel.com> wrote:
>>>>>> Hi Maxime,
>>>>>>
>>>>>> On 3/23/2026 11:27 AM, Maxime Leroy wrote:
>>>>>>> Hi Vladimir,
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 22, 2026 at 4:42 PM Vladimir Medvedkin
>>>>>>> <vladimir.medvedkin at intel.com> wrote:
>>>>>>>> This series adds multi-VRF support to both IPv4 and IPv6 FIB paths by
>>>>>>>> allowing a single FIB instance to host multiple isolated routing domains.
>>>>>>>>
>>>>>>>> Currently, a FIB instance represents one routing instance. For workloads
>>>>>>>> that need multiple VRFs, the only option is to create multiple FIB objects.
>>>>>>>> In a burst-oriented datapath, packets in the same batch can belong to
>>>>>>>> different VRFs, so the application either does per-packet lookups in
>>>>>>>> different FIB instances or regroups packets by VRF before lookup. Both
>>>>>>>> approaches are expensive.
>>>>>>>>
>>>>>>>> To remove that cost, this series keeps all VRFs inside one FIB instance and
>>>>>>>> extends lookup input with per-packet VRF IDs.
>>>>>>>>
>>>>>>>> The design follows the existing fast-path structure for both families. IPv4
>>>>>>>> and IPv6 use multi-ary trees with 2^24 associativity at the first level
>>>>>>>> (tbl24). The first-level table scales per configured VRF. This increases
>>>>>>>> memory usage, but keeps performance and lookup complexity on par with the
>>>>>>>> non-VRF implementation.
>>>>>>>>
>>>>>>> Thanks for the RFC. Some thoughts below.
>>>>>>>
>>>>>>> Memory cost: the flat TBL24 replicates the entire table for every VRF
>>>>>>> (num_vrfs * 2^24 * nh_size). With 256 VRFs and 8B nexthops that is
>>>>>>> 32 GB for TBL24 alone. In grout we support up to 256 VRFs allocated
>>>>>>> on demand -- this approach forces the full cost upfront even if most
>>>>>>> VRFs are empty.
>>>>>> Yes, increased memory consumption is the trade-off. We make this choice
>>>>>> in DPDK quite often, such as pre-allocated mbufs, mempools, and many
>>>>>> other structures allocated in advance to gain performance.
>>>>>> For FIB, I chose to replicate TBL24 per VRF for this same reason.
>>>>>>
>>>>>> And, as Morten mentioned earlier, if memory is the priority, a table
>>>>>> instance per VRF allocated on-demand is still supported.
>>>>>>
>>>>>> The high memory cost stems from TBL24's design: for IPv4, it was
>>>>>> justified by the BGP filtering convention (no prefixes more specific
>>>>>> than /24 in BGPv4 full view), ensuring most lookups hit with just one
>>>>>> random memory access. For IPv6, we should likely switch to a 16-bit TRIE
>>>>>> scheme on all layers. For IPv4, alternative algorithms with smaller
>>>>>> footprints (like DXR or DIR16-8-8, as used in VPP) may be worth
>>>>>> exploring if BGP full view is not required for those VRFs.
>>>>>>
>>>>>>> Per-packet VRF lookup: Rx bursts come from one port, thus one VRF.
>>>>>>> Mixed-VRF bulk lookups do not occur in practice. The three AVX512
>>>>>>> code paths add complexity for a scenario that does not exist, at
>>>>>>> least for a classic router. Am I missing a use-case?
>>>>>> That's not true, you're missing out on a lot of established core use
>>>>>> cases that are at least 2 decades old:
>>>>>>
>>>>>> - VLAN subinterface abstraction. Each subinterface may belong to a
>>>>>> separate VRF
>>>>>>
>>>>>> - MPLS VPN
>>>>>>
>>>>>> - Policy based routing
>>>>>>
>>>>> Fair point on VLAN subinterfaces and MPLS VPN. SRv6 L3VPN (End.DT4/
>>>>> End.DT6) also fits that pattern after decap.
>>>>>
>>>>> I agree DPDK often pre-allocates for performance, but I wonder if the
>>>>> flat TBL24 actually helps here. Each VRF's working set is spread
>>>>> 128 MB apart in the flat table. Would regrouping packets by VRF and
>>>>> doing one bulk lookup per VRF with separate contiguous TBL24s be
>>>>> more cache-friendly than a single mixed-VRF gather? Do you have
>>>>> benchmarks comparing the two approaches?
>>>> It depends. Generally, if we assume that we are working with wide
>>>> internet traffic, then even for a single VRF we most likely will miss
>>>> the cache for TBL24, thus, regardless of the size of the tbl24, each
>>>> memory access will be performed directly to DRAM.
>>> If the lookup is DRAM-bound anyway, then the 10 cycles/addr cost
>>> is dominated by memory latency, not CPU. The CPU cost of a bucket
>>> sort on 32-64 packets is negligible next to a DRAM access (~80-100
>>> ns per cache miss).
>> memory accesses are independent and executed in parallel in the CPU
>> pipeline
>>> That actually makes the case for regroup +
>>> per-VRF lookup: the regrouping is pure CPU work hidden behind
>>> memory stalls,
>> regrouping must be performed before memory accesses, so it cannot be
>> amortized between memory reads
> With internet traffic, TBL24 lookups quickly become limited by
> cache misses, not CPU cycles. Even if some bursts hit the same
> routes and benefit from cache locality, the CPU has a limited
> number of outstanding misses (load buffer entries, MSHRs) --
> out-of-order execution helps, but it is not magic.
Correct, but this does not contradict what I'm saying
>
> The whole point of vector/graph processing (VPP, DPDK graph, etc.)
> is to amortize that memory latency: prefetch for packet N+1 while
> processing packet N. This works because all packets in a batch
> hit the same data structure in a tight loop.
https://github.com/DPDK/dpdk/blob/626d4e39327333cd5508885162e45ca7fb94ef7f/lib/fib/dir24_8.h#L161
>
> With separate per-VRF TBL24s, a bucket sort by VRF -- a few
> dozen cycles, all in L1 -- gives you clean batches where
> prefetching works as designed. This is exactly what graph nodes
> already do: classify, then process per-class in a tight loop.
How is lookup performed in this design? Do I understand it right:
1. sort the batch by VRF IDs, splitting the batch into sub-batches of
IPs belonging to the same VRF ID
2. for each sub-batch of IPs, perform lookup in tbl24[batch_common_vrf_id]
3. unsort the nexthops
Correct?
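If so, a minimal sketch of those three steps in plain C may help pin down what we are comparing. The per-VRF bulk lookup is stubbed out; lookup_bulk_stub and every other name here is illustrative, not DPDK API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VRFS 16
#define BURST 8

/* Stub standing in for a per-VRF bulk lookup; a real implementation
 * would walk that VRF's own tbl24. Here: next hop = vrf * 100 + ip. */
static void
lookup_bulk_stub(uint16_t vrf, const uint32_t *ips, uint64_t *nhs, int n)
{
	for (int i = 0; i < n; i++)
		nhs[i] = (uint64_t)vrf * 100 + ips[i];
}

/* Steps 1-3 from the thread: bucket-sort indices by VRF id,
 * do one bulk lookup per VRF, then scatter the results back
 * into the original packet order. */
static void
regroup_lookup(const uint32_t *ips, const uint16_t *vrfs,
	       uint64_t *nhs, int n)
{
	int cnt[MAX_VRFS] = {0}, off[MAX_VRFS] = {0};
	int order[BURST];
	uint32_t sorted_ips[BURST];
	uint64_t sorted_nhs[BURST];

	for (int i = 0; i < n; i++)          /* count packets per VRF */
		cnt[vrfs[i]]++;
	for (int v = 1; v < MAX_VRFS; v++)   /* prefix sums -> sub-batch offsets */
		off[v] = off[v - 1] + cnt[v - 1];

	int pos[MAX_VRFS];
	memcpy(pos, off, sizeof(pos));
	for (int i = 0; i < n; i++) {        /* step 1: sort by VRF */
		int p = pos[vrfs[i]]++;
		sorted_ips[p] = ips[i];
		order[p] = i;
	}
	for (int v = 0; v < MAX_VRFS; v++)   /* step 2: per-VRF bulk lookup */
		if (cnt[v] > 0)
			lookup_bulk_stub((uint16_t)v, &sorted_ips[off[v]],
					 &sorted_nhs[off[v]], cnt[v]);
	for (int i = 0; i < n; i++)          /* step 3: unsort next hops */
		nhs[order[i]] = sorted_nhs[i];
}
```

The bucket sort is O(n + MAX_VRFS), not n log n, which is presumably what makes the regrouping cost acceptable in this argument.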
>
>>> and each per-VRF bulk lookup hits a contiguous
>>> TBL24 instead of scattering across 128 MB-apart VRF regions.
>> why is a contiguous 128 MB single-VRF TBL24 OK for you, but a bigger
>> contiguous multi-VRF TBL24 not OK in the context of lookup (here we
>> are talking about lookup, setting aside the problem of memory
>> consumption on init)?
> The performance difference may be small, but the flat approach
> is not faster either -- while costing 64 GB upfront.
it seems you are implicitly assuming 256 VRFs. Does my use case with a
few VRFs have a right to exist?
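Just to keep the numbers concrete, the footprints quoted in this thread follow directly from the table geometry (2^24 first-level entries per VRF times the next-hop entry size). A trivial sketch, with tbl24_bytes being an illustrative helper, not an API:

```c
#include <assert.h>
#include <stdint.h>

/* Flat TBL24 footprint: one 2^24-entry first-level table per VRF,
 * nh_size bytes per entry. */
static uint64_t
tbl24_bytes(uint32_t num_vrfs, uint32_t nh_size)
{
	return (uint64_t)num_vrfs * (1ULL << 24) * nh_size;
}
```

With 8-byte next hops this gives 128 MB for a single VRF, 512 MB for my 4-VRF case, and the 32 GB per address family quoted for 256 VRFs.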
>
>> In both of these cases, memory access behaves the same way within a
>> single batch of packets during lookup, i.e. the first hit is likely a
>> cache miss, regardless of whether we are dealing with one or more VRFs,
>> it will not maintain TBL24 in L3$ in any way in a real dataplane app.
>>
>>>> And if the addresses are localized (i.e. most traffic is internal), then
>>>> having multiple TBL24s won't make the situation much worse.
>>>>
>>> With localized traffic, regrouping by VRF + per-VRF lookup on
>>> contiguous TBL24s would benefit from cache locality,
>> why so? There will be no differences within a single batch with a
>> reasonable size (for example 64), because within the lookup session, no
>> matter with or without regrouping, temporal cache locality will be the same.
>>
>> Let's look at it from a different angle. Is it worth regrouping IP
>> addresses by /8 (i.e. 8 MSBs) with the current implementation of a
>> single-VRF FIB?
>>
>>> while the
>>> flat multi-VRF table spreads hot entries 128 MB apart. The flat
>>> approach may actually be worse in that scenario
>>>
>>>> I don't have any benchmarks for regrouping, however I have 2 things to
>>>> consider:
>>>>
>>>> 1. lookup is relatively fast (for IPv4 it is about 10 cycles per
>>>> address, and I don't really want to slow it down)
>>>>
>>>> 2. incoming addresses and their corresponding VRFs are not controlled by
>>>> "us", so this is a random set. Regrouping effectively is sorting. I'm
>>>> not really happy to have nlogn complexity on a fast path :)
>>> Without benchmarks, we do not know whether the flat approach is
>>> actually faster than regroup + per-VRF lookup.
>> feel free to share benchmark results. The only thing you need to add is
>> the packets regrouping logic, and then use separate single-VRF FIB
>> instances.
> Your series introduces a new API that optimizes multi-VRF lookup.
> The performance numbers should come with the proposal.
By policy we cannot share raw performance numbers, and I think this is
unnecessary, because performance depends on the testing environment
(content of the routing table, CPU model, etc.).
Tests I've done on my board with an IPv4 full view (782940 routes) and 4
VRFs, performing random lookups in all of them, showed about 180% of the
cost compared to a single VRF with the same RT content.
You can test it in your environment with:
dpdk-test-fib -l 1,2 --no-pci -- -f <path to your routes> -e 4 -l
100000000 -V <number of VRFs>
>
>>>>> On the memory trade-off and VRF ID mapping: the API uses vrf_id as
>>>>> a direct index (0 to max_vrfs-1). With 256 VRFs and 8B nexthops,
>>>>> TBL24 alone costs 32 GB for IPv4 and 32 GB for IPv6 -- 64 GB total
>>>>> at startup. In grout, VRF IDs are interface IDs that can be any
>>>>> uint16_t, so we would also need to maintain a mapping between our
>>>>> VRF IDs and FIB slot indices.
>>>> of course, this is the application's responsibility. In FIB, VRFs are
>>>> in a contiguous range.
>>>>> We would need to introduce a max_vrfs
>>>>> limit, which forces a bad trade-off: either set it low (e.g. 16)
>>>>> and limit deployments, or set it high (e.g. 256) and pay 64 GB at
>>>>> startup even with a single VRF. With separate FIB instances per VRF,
>>>>> we only allocate what we use.
>>>> Yes, I understand this. In the end, if the user wants to use 256 VRFs,
>>>> the memory footprint will be at least 64 GB anyway.
>>> The difference is when the memory is committed.
>> yes, this is the only difference. It all comes down to the static vs.
>> dynamic memory allocation problem. And each of these approaches is good
>> for solving a specific task. For the task of creating a new VRF, which
>> is more preferable - to fail at init or at runtime?
> The main problem is that your series imposes contiguous VRF IDs
> (0 to max_vrfs-1). How a VRF is represented is a network stack
> design decision
exactly - a network stack decision. FIB is not a network stack.
> -- in Linux it is an ifindex,
so every interface lives in its own private VRF?
> in Cisco a name,
are you going to pass an array of strings on lookup?
> in grout an interface ID.
haven't we decided this is a problematic design (VLANs, L3VPN, etc.)?
> Any application using this API needs
> a mapping layer on top.
I think from my rhetorical questions above this should be obvious
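To be explicit about why I consider such a layer trivial: a sketch of an application-side mapping from arbitrary uint16_t VRF identifiers (ifindex, interface ID, a hash of a name) to the contiguous slot range FIB expects, allocated on demand. All names are illustrative, not a proposed API:

```c
#include <assert.h>
#include <stdint.h>

#define FIB_MAX_VRFS 16
#define SLOT_NONE UINT16_MAX

/* Application-side mapping: arbitrary uint16_t VRF ids to contiguous
 * FIB slot indices 0..FIB_MAX_VRFS-1, allocated on first use. */
struct vrf_map {
	uint16_t slot[UINT16_MAX + 1]; /* VRF id -> slot, SLOT_NONE if unused */
	uint16_t next_slot;
};

static void
vrf_map_init(struct vrf_map *m)
{
	for (uint32_t i = 0; i <= UINT16_MAX; i++)
		m->slot[i] = SLOT_NONE;
	m->next_slot = 0;
}

/* Return the FIB slot for vrf_id, allocating one on demand;
 * SLOT_NONE if the configured slot limit is exhausted. */
static uint16_t
vrf_map_get(struct vrf_map *m, uint16_t vrf_id)
{
	if (m->slot[vrf_id] == SLOT_NONE) {
		if (m->next_slot >= FIB_MAX_VRFS)
			return SLOT_NONE;
		m->slot[vrf_id] = m->next_slot++;
	}
	return m->slot[vrf_id];
}
```

On the fast path this is a single L1-resident array read per packet to translate the id before lookup.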
>
> In grout, everything is allocated dynamically: mempools, FIBs,
> conntrack tables. Pre-allocating everything at init forces
> hardcoded arbitrary limits and prevents memory reuse between
> subsystems -- memory reserved for FIB TBL24 cannot be used for
> conntrack when the VRF has no routes, and vice versa. We prefer
> to allocate resources only when needed. It is simpler for users
> and more efficient for memory.
>
>>> With separate FIB
>>> instances per VRF, you allocate 128 MB only when a VRF is actually
>>> created at runtime. With the flat multi-VRF approach, you pay
>>> max_vrfs * 128 MB at startup, even if only one VRF is active.
>>>
>>> On top of that, the API uses vrf_id as a direct index (0 to
>>> max_vrfs-1). As Stephen noted, there are multiple ways to model
>>> VRFs. Depending on the networking stack, VRFs are identified by
>>> ifindex (Linux l3mdev), by name (Cisco, Juniper), or by some
>>> other scheme. This means the application must maintain a mapping
>>> between its own VRF representation and the FIB slot indices, and
>>> choose max_vrfs upfront. What is the benefit of this flat
>>> multi-VRF FIB if the application still needs to manage a
>>> translation layer and pre-commit memory for VRFs that may never
>>> exist?
>> This is a control plane task.
>>>> As a trade-off for a bad trade-off ;) I can suggest to allocate it in
>>>> chunks. Let's say you are starting with 16 VRFs, and during runtime, if
>>>> the user wants to increase the number of VRFs above this limit, you can
>>>> allocate another 16xVRF FIB. Then, of course, you need to split
>>>> addresses into 2 bursts each for each FIB handle.
>>> But then we are back to regrouping packets -- just by chunk of
>>> VRFs instead of by individual VRF. If we have to sort the burst
>>> anyway, what does the flat multi-VRF table buy us?
>>>
>>>>>>> I am not too familiar with DPDK FIB internals, but would it be
>>>>>>> possible to keep a separate TBL24 per VRF and only share the TBL8
>>>>>>> pool?
>>>>>> this is how it is implemented right now, with one note - TBL24s are
>>>>>> pre-allocated.
>>>>>>> Something like pre-allocating an array of max_vrfs TBL24
>>>>>>> pointers, allocating each TBL24 on demand at VRF add time,
>>>>>> and you are suggesting allocating TBL24 on demand by adding an extra
>>>>>> indirection layer. This will lead to lower performance, which I would
>>>>>> like to avoid.
>>>>>>> and
>>>>>>> having them all point into a shared TBL8 pool. The TBL8 index in
>>>>>>> TBL24 entries seems to already be global, so would that work without
>>>>>>> encoding changes?
>>>>>>>
>>>>>>> Going further: could the same idea extend to IPv6? The dir24_8 and
>>>>>>> trie seem to use the same TBL8 block format (256 entries, same
>>>>>>> (nh << 1) | ext_bit encoding, same size). Would unifying the TBL8
>>>>>>> allocator allow a single pool shared across IPv4, IPv6, and all
>>>>>>> VRFs? That could be a bigger win for /32-heavy and /128-heavy tables
>>>>>>> and maybe a good first step before multi-VRF.
>>>>>> So, you are suggesting merging IPv4 and IPv6 into a single unified FIB?
>>>>>> I'm not sure how this can be a bigger win, could you please elaborate
>>>>>> more on this?
>>>>> On the IPv4/IPv6 TBL8 pool: I was not suggesting merging FIBs, just
>>>>> sharing the TBL8 block allocator between separate FIB instances.
>>>>> This is possible since dir24_8 and trie use the same TBL8 block
>>>>> format (256 entries, same encoding, same size).
>>>>>
>>>>> Would it be possible to pass a shared TBL8 pool at rte_fib_create()
>>>>> time? Each FIB keeps its own TBL24 and RIB, but TBL8 is shared
>>>>> across all FIBs and potentially across IPv4/IPv6. Users would no
>>>>> longer have to guess num_tbl8 per FIB.
>>>> Yes, this is possible. However, this will significantly complicate
>>>> working with the library while solving a not-so-big problem.
>>> Your series already shares TBL8 across all VRFs within a single
>>> FIB -- that part is useful, and it does not require the flat
>>> multi-VRF TBL24.
>>>
>>> In grout, routes arrive from FRR (BGP, OSPF, etc.) at runtime.
>>> We cannot predict TBL8 usage per VRF in advance
>> and you don't need it (knowing per-VRF consumption) now. If I understood
>> your request here properly, you want to share TBL8 between IPv4 and
>> IPv6 FIBs? I don't think this is a good idea at all. At the very least,
>> keeping them split means that if one AF consumes all TBL8s (because of
>> an attack/bogus CP), the other AF remains intact.
> If TBL8 isolation per AF is meant as a protection against route
> floods, then the same argument applies between VRFs: your series
> shares TBL8 across all VRFs within a single FIB, so a bogus
> control plane in one VRF exhausts TBL8 for all other VRFs.
>
> But more fundamentally, this is not how route flood protection
> works. It is handled in the control plane: the routing daemon
> limits the number of prefixes accepted per BGP session
> (max-prefix) and selects which routes are installed via prefix
> filters -- before those routes ever reach the forwarding table.
>
> The Linux kernel is a good reference here. IPv6 used to enforce
> a max_size limit on FIB + cache entries (net.ipv6.route.max_size,
> defaulting to 4096). It caused real production issues and was
> removed in kernel 6.3. IPv4 never had a FIB route limit. There
> is no per-VRF route limit either. The kernel relies entirely on
> the control plane for route flood protection.
FIB is not the Linux kernel, nor is it a network stack. We cannot rely
on control plane protection, since the control plane is 3rd-party
software.
Also, I think allocating a very algorithm-specific entity such as a pool
of TBL8s prior to calling rte_fib_create, and passing a pointer to it,
could be confusing for many users and would bloat the API.
FIB supports pluggable lookup algorithms: you can write your own and
specify a pointer to the tbl8_pool in an algorithm-specific
configuration defined for your algorithm, where you may also create a
dynamic table of TBL24 pointers, one per VRF. If you need any help with
this task, I would be happy to help.
>
>>> -- it depends on
>>> prefix length distribution which varies per VRF and changes over
>>> time. No production LPM (Linux kernel, JunOS, IOS) asks the
>>> operator to size these structures per routing table upfront.
>> - they are using different lpm algorithms
>> - you use these facilities, developers properly tuned them. FIB is a low
>> level library, it cannot be used without any knowledge, it will not
>> solve all the problems with a single red button "make it work, don't do
>> any bugs"
>> P.S. how do you know how JunOS /IOS implements their LPMs? ;)
>>
> I do not need to know their LPM implementation -- I only need
> to know how they are configured. No production router requires
> the operator to size internal LPM structures.
>
> We can impose a maximum number of IPv4/IPv6 routes on the user
> -- even though the kernel does not need this either. But TBL8
> is a different problem: the application cannot predict TBL8
> consumption because it depends on prefix length distribution,
> which varies per VRF and changes over time with dynamic routing.
> Today there is no API to query TBL8 usage, and no API to resize
> a FIB without destroying it.
>
> This is exactly why a shared TBL8 pool across VRFs is useful:
> VRFs with few long prefixes naturally leave room for VRFs that
> need more.
but this is already implemented. I don't know why you keep raising this
concern. We are aligned on this, and the feature is already there - in
the patch.
On the other hand, what we disagreed on is sharing not only across
VRFs, but also across address families. If you can't predict the amount
of TBL8 per AF, how would you magically predict the number of TBL8s for
a merged pool?
> This is the valuable part of your series. But it
> does not require a flat multi-VRF TBL24 -- separate per-VRF
> TBL24s sharing a common TBL8 pool would give the same benefit
> without the 64 GB upfront cost.
and that's a completely different problem. Please, let's separate the
problems and not mix them up.
I understand your concern about memory consumption. I have some ideas on
how to solve this problem in parallel to the proposed solution.
>
>>> Today we do not even have TBL8 usage stats (Robin's series
>>> addresses that)
>> I will try to find time to review this patch in the near future
> Thanks, Robin's TBL8 stats series would help users understand
> their TBL8 consumption -- a more practical improvement for
> current users.
>
> Regards,
> Maxime
--
Regards,
Vladimir