[dpdk-dev] [PATCH 01/17] mbuf: add definitions of unified packet types

Zhang, Helin helin.zhang at intel.com
Tue Feb 3 04:18:59 CET 2015



> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz at 6wind.com]
> Sent: Monday, February 2, 2015 7:18 PM
> To: Zhang, Helin; dev at dpdk.org
> Cc: Stephen Hemminger
> Subject: Re: [dpdk-dev] [PATCH 01/17] mbuf: add definitions of unified packet
> types
> 
> Hi Helin,
> 
> On 02/02/2015 02:43 AM, Zhang, Helin wrote:
> >>> +/*
> >>> + * Sixteen bits are divided into several fields to mark packet types.
> >>> +Note that
> >>> + * each field is indexical.
> >>> + * - Bit 3:0 is for tunnel types.
> >>> + * - Bit 7:4 is for L3 or outer L3 (for tunneling case) types.
> >>> + * - Bit 10:8 is for L4 types. It can also be used for inner L4 types for
> >>> + *   tunneling packets.
> >>> + * - Bit 13:11 is for inner L3 types.
> >>> + * - Bit 15:14 is reserved.
> >>
> >> Is there a reason why using this specific order?
> > Yes, to support ixgbe Vector PMD, outer L3 types and L4 types need to
> > be contiguous and in this order.
> 
> When you say "need to be", do you mean it's impossible to do in another
> manner or just that it would be slower?
It was designed to be like this, otherwise, performance drop must be expected.

> 
> >> Also, there are 4 bits for outer L3 types and 3 bits for inner L3
> >> types, but both of them have 6 different supported types. Is it intentional?
> > Yes, it is to support ixgbe Vector PMD. Contiguous 7 bits are needed, though
> 1 bit wasted.
> 
> To be honnest, I'm always a surprised that in dpdk we prefer having a strange
> API just because it's faster or easier to do on one specific driver (usually i40e or
> ixgbe). Unfortunately, trying to optimize the API for one driver may result in
> making the rest of the code (application and other drivers) slower and more
> complex.
Based on my understanding, 'faster' is most of DPDK customers wanted. Otherwise,
they don't need DPDK. Different hardware must have different capabilities, I am trying
to unify at least packet types to get things easier.

> 
> In your proposition, there is no inner l4_type. I consider it's as useful as the
> other fields. From what I see, there are only 2 bits left. What do you think about
> changing the packet type to 64 bits now?
For tunneling cases, L4_type is for inner L4 type, outer L4 type is not needed, as it
can be in tunnel type.
I can expect 64 bits are needed in the future. But for now, I don't see any strong
demand on that for currently supported hardware.
In addition, there is no free bit in the first cache line of mbuf header, mbuf changes
are needed to expand it. I'd prefer to do it later to make things easier.

> 
> From an API point of view, I think it would be good to have the same structure
> for inner and outer types. For instance (this is just an example):
> 
> union layer_pkt_type {
> 	struct {
> 		uint16_t l2_type:4;
> 		uint16_t l3_type:4;
> 		uint16_t l4_type:4;
> 		uint16_t tun_type:4;
> 	};
> 	uint16_t u16;
> };
> 
> struct pkt_type {
> 	union layer_pkt_type outer;
> 	union layer_pkt_type inner;
> };
> 
> When your application decapsulates tunnels, you can just do outer = inner and
> enter into the same code.
Expanding packet_type is not easy, as there is no free bits in the first cache line.
Is there any tunnel type in inner packet? Is it a waste?
Is L2 type really needed? I don't know.

> 
> 
> >>> + * RTE_PTYPE_L3_IPV6, RTE_PTYPE_L3_IPV6_EXT, RTE_PTYPE_L4_TCP,
> >>> +RTE_PTYPE_L4_UDP
> >>> + * and RTE_PTYPE_L4_SCTP should be kept as below in a contiguous 7
> bits.
> >>> + *
> >>> + * Note that L3 types values are selected for checking IPV4/IPV6
> >>> +header from
> >>> + * performance point of view. Reading annotations of
> >>> +RTE_ETH_IS_IPV4_HDR and
> >>> + * RTE_ETH_IS_IPV6_HDR is needed for any future changes of L3 type
> >> values.
> >>> + */
> >>> +#define RTE_PTYPE_UNKNOWN                   0x0000 /*
> >> 0b0000000000000000 */
> >>> +/* bit 3:0 for tunnel types */
> >>> +#define RTE_PTYPE_TUNNEL_IP                 0x0001 /*
> >> 0b0000000000000001 */
> >>> +#define RTE_PTYPE_TUNNEL_TCP                0x0002 /*
> >> 0b0000000000000010 */
> >>> +#define RTE_PTYPE_TUNNEL_UDP                0x0003 /*
> >> 0b0000000000000011 */
> >>> +#define RTE_PTYPE_TUNNEL_GRE                0x0004 /*
> >> 0b0000000000000100 */
> >>> +#define RTE_PTYPE_TUNNEL_VXLAN              0x0005 /*
> >> 0b0000000000000101 */
> >>> +#define RTE_PTYPE_TUNNEL_NVGRE              0x0006 /*
> >> 0b0000000000000110 */
> >>> +#define RTE_PTYPE_TUNNEL_GENEVE             0x0007 /*
> >> 0b0000000000000111 */
> >>> +#define RTE_PTYPE_TUNNEL_GRENAT             0x0008 /*
> >> 0b0000000000001000 */
> >>> +#define RTE_PTYPE_TUNNEL_GRENAT_MAC         0x0009 /*
> >> 0b0000000000001001 */
> >>> +#define RTE_PTYPE_TUNNEL_GRENAT_MACVLAN     0x000a /*
> >> 0b0000000000001010 */
> >>> +#define RTE_PTYPE_TUNNEL_MASK               0x000f /*
> >> 0b0000000000001111 */
> >>> +/* bit 7:4 for L3 types */
> >>> +#define RTE_PTYPE_L3_IPV4                   0x0010 /*
> >> 0b0000000000010000 */
> >>> +#define RTE_PTYPE_L3_IPV4_EXT               0x0030 /*
> >> 0b0000000000110000 */
> >>> +#define RTE_PTYPE_L3_IPV6                   0x0040 /*
> >> 0b0000000001000000 */
> >>> +#define RTE_PTYPE_L3_IPV4_EXT_UNKNOWN       0x0090 /*
> >> 0b0000000010010000 */
> >>> +#define RTE_PTYPE_L3_IPV6_EXT               0x00c0 /*
> >> 0b0000000011000000 */
> >>> +#define RTE_PTYPE_L3_IPV6_EXT_UNKNOWN       0x00e0 /*
> >> 0b0000000011100000 */
> >>> +#define RTE_PTYPE_L3_MASK                   0x00f0 /*
> >> 0b0000000011110000 */
> >>
> >> can we expect that when RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV4_EXT or
> >> RTE_PTYPE_L3_IPV4_EXT_UNKNOWN is set, the hardware also verified the
> >> L3 checksum?
> > RTE_PTYPE_L3_IPV4 means there is NONE-EXT. Each time only one of above 3
> can be set.
> > These bits don't indicate any checksum, checksum should be indicated by
> other flags.
> > They are just for packet types hardware can recognized.
> 
> I think these 2 information are linked:
> 
> - if the hardware cannot recognize packet, it cannot calculate the
>   checksum because it does not know the packet type
> - if the hardware can recognize the packet, it can verify that the
>   checksum is good or wrong
We cannot know how hardware works, we care about what hardware can report.

> 
> Today, we have:
> 
> - PKT_RX_IPV4_HDR and PKT_RX_IPV4_HDR_EXT to tell if the packet is
>   seen as IPv4 by the hw.
> 
> - We can suppose that:
> 
>   - PKT_RX_IPV4_HDR(_EXT)=0 -> no hw checksum information
>   - PKT_RX_IPV4_HDR(_EXT)=1 and PKT_RX_IP_CKSUM_BAD=0 -> checksum
>     is correct
>   - PKT_RX_IPV4_HDR(_EXT)=1 and PKT_RX_IP_CKSUM_BAD=1 -> checksum
>     is not correct
> 
> - We cannot do the same with L4 because we have no L4 type info,
>   but it would be good to be able to do the same.
> 
> With your patch, you are removing the PKT_RX_IPV4_HDR and
> PKT_RX_IPV4_HDR_EXT flags, but I think the above assumption about
> checksum should be kept. As you are adding a L4 type info, the same method
> could be applied to L4 checksums.
> 
> I think this would definitely solve the problem described by Stephen.
I think packet type and checksum are different things. They are reported by different fields.
PKT_RX_IPV4_HDR and PKT_RX_IPV4_HDR_EXT mean packet type only,
nothing about checksum. Checksum GOOD/BAD can be reported by other flags in ol_flags.

> 
> 
> >> My understanding is:
> >>
> >> - if packet_type is IPv4* and PKT_RX_IP_CKSUM_BAD is 0
> >>    -> checksum was checked by hw and is good
> >> - if packet_type is IPv4* and PKT_RX_IP_CKSUM_BAD is 1
> >>    -> checksum was checked by hw and is bad
> >> - if packet_type is not IPv4*
> >>    -> checksum was not checked by hw
> >>
> >> I think it would solve the problem asked by Stephen
> >> http://dpdk.org/ml/archives/dev/2015-January/011550.html
> >>
> >>> +/* bit 10:8 for L4 types */
> >>> +#define RTE_PTYPE_L4_TCP                    0x0100 /*
> >> 0b0000000100000000 */
> >>> +#define RTE_PTYPE_L4_UDP                    0x0200 /*
> >> 0b0000001000000000 */
> >>> +#define RTE_PTYPE_L4_FRAG                   0x0300 /*
> >> 0b0000001100000000 */
> >>> +#define RTE_PTYPE_L4_SCTP                   0x0400 /*
> >> 0b0000010000000000 */
> >>> +#define RTE_PTYPE_L4_ICMP                   0x0500 /*
> >> 0b0000010100000000 */
> >>> +#define RTE_PTYPE_L4_NONFRAG                0x0600 /*
> >> 0b0000011000000000 */
> >>> +#define RTE_PTYPE_L4_MASK                   0x0700 /*
> >> 0b0000011100000000 */
> >>
> >> Same question for L4.
> >>
> >> Note: it would means that if a hardware is able to recognize a TCP
> >> packet but not to verify the checksum, it has to set RTE_PTYPE_L4 to
> unknown.
> >>
> >>> +/* bit 13:11 for inner L3 types */
> >>> +#define RTE_PTYPE_INNER_L3_IPV4             0x0800 /*
> >> 0b0000100000000000 */
> >>> +#define RTE_PTYPE_INNER_L3_IPV4_EXT         0x1000 /*
> >> 0b0001000000000000 */
> >>> +#define RTE_PTYPE_INNER_L3_IPV6             0x1800 /*
> >> 0b0001100000000000 */
> >>> +#define RTE_PTYPE_INNER_L3_IPV6_EXT         0x2000 /*
> >> 0b0010000000000000 */
> >>> +#define RTE_PTYPE_INNER_L3_IPV4_EXT_UNKNOWN 0x2800 /*
> >>> +0b0010100000000000 */ #define
> >> RTE_PTYPE_INNER_L3_IPV6_EXT_UNKNOWN 0x3000 /*
> 0b0011000000000000 */
> > We cannot define the hardware behaviors, it just reports the hardware
> > recognized packet information directly to the mbuf.
> > Based on my experiment on i40e hardware, if a IPV4 packet with wrong
> > checksum, by default, the PMD driver cannot see the packet at all. So
> > we don't need to care about it too much!
> 
> I agree that the hardware reports some info that can be different depending on
> the hw. But the role of the driver is to convert these info into a common API
> with a well-defined behavior.
Yes, driver should report the received packet information to a well-defined behavior,
but not the same behavior, even for the same packet.
Capability can be queried for each port, and then the application can know the port
capability well, and know what the hardware can report, and what the hardware
cannot report.
Driver should enable the hardware with its advanced capabilities as most as possible.

> 
> Regards,
> Olivier


More information about the dev mailing list