[dpdk-dev] packet data access bug in bpf and pdump libs
Ananyev, Konstantin
konstantin.ananyev at intel.com
Thu Oct 10 17:36:30 CEST 2019
> -----Original Message-----
> From: Morten Brørup [mailto:mb at smartsharesystems.com]
> Sent: Thursday, October 10, 2019 8:30 AM
> To: Stephen Hemminger <stephen at networkplumber.org>
> Cc: Ananyev, Konstantin <konstantin.ananyev at intel.com>; dpdk-dev <dev at dpdk.org>; Jerin Jacob <jerinj at marvell.com>
> Subject: RE: [dpdk-dev] packet data access bug in bpf and pdump libs
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Stephen Hemminger
> > Sent: Wednesday, October 9, 2019 7:25 PM
> >
> > On Wed, 9 Oct 2019 17:20:58 +0200
> > Morten Brørup <mb at smartsharesystems.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > > > Sent: Wednesday, October 9, 2019 5:15 PM
> > > >
> > > > On Wed, 9 Oct 2019 17:06:24 +0200
> > > > Morten Brørup <mb at smartsharesystems.com> wrote:
> > > >
> > > > > > -----Original Message-----
> > > > > > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > > > > > Sent: Wednesday, October 9, 2019 5:02 PM
> > > > > >
> > > > > > On Wed, 9 Oct 2019 11:11:46 +0000
> > > > > > "Ananyev, Konstantin" <konstantin.ananyev at intel.com> wrote:
> > > > > >
> > > > > > > Hi Morten,
> > > > > > >
> > > > > > > >
> > > > > > > > Hi Konstantin and Stephen,
> > > > > > > >
> > > > > > > > I just noticed the same bug in your bpf and pcap libraries:
> > > > > > > >
> > > > > > > > You are using rte_pktmbuf_mtod(), but should be using
> > > > > > > > rte_pktmbuf_read(). Otherwise you cannot read data across
> > > > > > > > multiple segments.
> > > > > > >
> > > > > > > In plain data buffer mode, the expected input for the BPF
> > > > > > > program is the start of the first segment's packet data.
> > > > > > > Other segments are simply not available to the BPF program
> > > > > > > in that mode.
> > > > > > > AFAIK, cBPF uses the same model.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Med venlig hilsen / kind regards
> > > > > > > > - Morten Brørup
> > > > > > >
> > > > > >
> > > > > > For packet capture, the BPF program is only allowed to look at
> > > > > > the first segment.
> > > > > > pktmbuf_read is expensive and can cause a copy.
> > > > >
> > > > > It is only expensive if going beyond the first segment:
> > > > >
> > > > > static inline const void *rte_pktmbuf_read(const struct rte_mbuf *m,
> > > > >         uint32_t off, uint32_t len, void *buf)
> > > > > {
> > > > >         if (likely(off + len <= rte_pktmbuf_data_len(m)))
> > > > >                 return rte_pktmbuf_mtod_offset(m, char *, off);
> > > > >         else
> > > > >                 return __rte_pktmbuf_read(m, off, len, buf);
> > > > > }
> > > >
> > > > But it would mean a potentially big buffer on the stack (in case)
> > >
> > > No, the buffer only needs to be the size of the accessed data. I use
> > > it like this:
> > >
> > > char buffer[sizeof(uint32_t)];
> > >
> > > for (;; pc++) {
> > >         switch (pc->code) {
> > >         case BPF_LD_ABS_32:
> > >                 p = rte_pktmbuf_read(m, pc->k, sizeof(uint32_t), buffer);
> > >                 if (unlikely(p == NULL))
> > >                         return 0; /* Attempting to read beyond packet. Bail out. */
> > >                 a = rte_be_to_cpu_32(*(const uint32_t *)p);
> > >                 continue;
> > >         case BPF_LD_ABS_16:
> > >                 p = rte_pktmbuf_read(m, pc->k, sizeof(uint16_t), buffer);
> > >                 if (unlikely(p == NULL))
> > >                         return 0; /* Attempting to read beyond packet. Bail out. */
> > >                 a = rte_be_to_cpu_16(*(const uint16_t *)p);
> > >                 continue;
> > >
> >
> > Reading down the chain of mbuf segments to find a uint32_t (one that
> > potentially crosses segments) seems like a waste.
+1
Again just imagine how painful it would be to support it in JIT...
Another thing - in librte_bpf, if RTE_BPF_ARG_PTR is used,
the input parameter for exec/jit is just a plain data buffer, and the library
makes no assumptions about it (mbuf or not, etc.).
It is still possible to access mbuf metadata and call rte_pktmbuf_read()
or any other function from your eBPF code,
but for that you need to specify RTE_BPF_ARG_PTR_MBUF at load stage.
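For completeness, roughly what the load stage looks like in that mode - this is
only a sketch from memory, so check rte_bpf.h for the exact structure fields;
the instruction array is assumed to come from elsewhere:

#include <rte_bpf.h>
#include <rte_mbuf.h>

/* Sketch: load an eBPF program whose single argument is the mbuf itself,
 * so the program (or helpers such as rte_pktmbuf_read()) can reach data
 * beyond the first segment. 'ins'/'nb_ins' hold the eBPF instructions. */
static uint64_t
run_filter_on_mbuf(const struct ebpf_insn *ins, uint32_t nb_ins,
                struct rte_mbuf *mb)
{
        struct rte_bpf_prm prm = {
                .ins = ins,
                .nb_ins = nb_ins,
                .prog_arg = {
                        .type = RTE_BPF_ARG_PTR_MBUF,
                        .size = sizeof(struct rte_mbuf),
                },
        };
        struct rte_bpf *bpf = rte_bpf_load(&prm);
        uint64_t rc;

        if (bpf == NULL)
                return 0;
        rc = rte_bpf_exec(bpf, mb);     /* ctx passed to the program is the mbuf */
        rte_bpf_destroy(bpf);
        return rc;
}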
> >
>
> Slow and painful is the only way to read beyond the first segment, I agree.
>
> But when reading from the first segment, rte_pktmbuf_read() basically does the same as your code. So there shouldn't be any performance
> penalty from supporting both by using rte_pktmbuf_read() instead.
>
> I think the modification in the pdump library is simple, as you already pass the mbuf. But the bpf library requires more work, as it passes a
> pointer to the data in the first segment to the processing function instead of passing the mbuf.
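Just to illustrate the direction (a made-up callback for illustration, not the
current pdump API): a filter that receives the mbuf instead of a data pointer
could use rte_pktmbuf_read() and work regardless of segmentation, e.g.:

#include <rte_byteorder.h>
#include <rte_mbuf.h>

/* Hypothetical filter signature: it gets the mbuf, not a first-segment
 * data pointer. */
typedef int (*pkt_filter_t)(const struct rte_mbuf *m);

/* Example filter: match UDP destination port 4789 (VXLAN) on IPv4,
 * assuming an untagged Ethernet header and no IP options. */
static int
match_udp_dst_4789(const struct rte_mbuf *m)
{
        uint16_t dport;
        const uint16_t *p;

        /* 14 (Ethernet) + 20 (IPv4) + 2 = offset of the UDP dest port;
         * the read works even if the headers straddle segments. */
        p = rte_pktmbuf_read(m, 14 + 20 + 2, sizeof(dport), &dport);
        if (p == NULL)
                return 0;       /* packet too short */
        return rte_be_to_cpu_16(*p) == 4789;
}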
>
> > The purpose of the filter is to look at packet headers.
>
> Some might look deeper. So why prevent it? E.g. our StraightShaper appliance sometimes looks deeper, but for performance reasons we
> stopped using BPF for this a long time ago.
>
> > Any driver making mbufs that are dribbles of data is broken.
>
> I agree very much with you on this regarding general-purpose NICs! Although I know of an exception that confirms the rule... a few years
> ago we worked with a component vendor on some very clever performance optimizations doing exactly this for specific purposes.
> Unfortunately it's under NDA, so I can't go into details.
>
> > Chaining is really meant for the jumbo-frame or TSO case.
> >
>
> Try thinking beyond PMD ingress. There are multiple use cases in egress. Here are a couple:
>
> - IP Multicast to multiple subnets on a Layer 3 switch. The VLAN ID and Source MAC must be replaced in each packet; this can be done using
> segments.
> - Tunnel encapsulation. E.g. putting a packet into a VXLAN tunnel could be done using segments.
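Regarding the tunnel-encapsulation case, the usual segment-based approach is to
build the outer headers in their own mbuf and chain the original packet behind
it, instead of copying the payload. A rough sketch (function name and error
handling are mine, not an existing API; the prebuilt header contents are
assumed to come from elsewhere):

#include <string.h>
#include <rte_mbuf.h>

/* Sketch: encapsulate 'pkt' by prepending a separately allocated header
 * segment. 'hdr'/'hdr_len' hold the prebuilt outer Ethernet/IP/UDP/VXLAN
 * headers; the original packet data is left untouched. */
static struct rte_mbuf *
encap_with_header_segment(struct rte_mempool *mp, struct rte_mbuf *pkt,
                const void *hdr, uint16_t hdr_len)
{
        struct rte_mbuf *h = rte_pktmbuf_alloc(mp);
        char *p;

        if (h == NULL)
                return NULL;
        p = rte_pktmbuf_append(h, hdr_len);
        if (p == NULL || rte_pktmbuf_chain(h, pkt) != 0) {
                rte_pktmbuf_free(h);    /* caller still owns 'pkt' */
                return NULL;
        }
        memcpy(p, hdr, hdr_len);        /* fill in the outer headers */
        return h;                       /* new head: header seg + original pkt */
}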