[PATCH v5 0/4] add pointer compression API
    Morten Brørup 
    mb at smartsharesystems.com
       
    Thu May 16 00:34:33 CEST 2024
    
    
  
> From: Paul Szczepanek [mailto:paul.szczepanek at arm.com]
> Sent: Wednesday, 15 May 2024 19.01
> 
> On 04/03/2024 14:44, Konstantin Ananyev wrote:
> >> This feature is targeted for pipeline mode of applications. We see
> many customers using pipeline mode. This feature helps in reducing
> >> the cost of transferring the packets between cores by reducing the
> copies involved.
> >
> > I do understand the intention, and I am not arguing about usefulness
> of the pipeline model.
> > My point is you are introducing new API: compress/decompress pointers,
> > but don't provide (or even describe) any proper way for the developer
> to use it in a safe and predictable manner.
> > Which from my perspective make it nearly useless and misleading.
> 
> In the latest version there is an example in the docs showing how to use
> it. There is an integration test that shows how to use it. The comments
> in the header also provide detailed guidance.
> 
> >> For an application with multiple pools, it depends on how the
> applications are using multiple pools. But, if there is a bunch of
> packets
> >> belonging to multiple mempools, compressing those mbufs may not be
> possible. But if those mbufs are grouped per mempool and
> >> are transferred on different queues, then it is possible. Hence the
> APIs are implemented very generically.
> >
> > Ok, let's consider even more simplistic scenario - all pointers belong
> to one mempool.
> > AFAIK, even one mempool can contain elements from different memzones,
> > and these memzones are not guaranteed to have consecutive VAs.
> > So even one mempool, with total size <=4GB can contain elements with
> distances between them more than 4GB.
> > Now let say at startup user created a mempool, how he can determine
> programmatically
> > can he apply your compress API safely on it or not?
> > I presume that if you are serious about this API usage, then such
> ability has to be provided.
> > Something like:
> >
> > int compress_pointer_deduce_mempool_base(const struct rte_memepool
> *mp[],
> > 	uint32_t nb_mp, uint32_t compress_size, uintptr_t *base_ptr);
> >
> > Or probably even more generic one:
> >
> > struct mem_buf {uintptr_t base, size_t len;};
> > int compress_pointer_deduce_base(const struct mem_buf *mem_buf[],
> > 	uint32_t nb_membuf, uint32_t compress_size, uintptr_t *base_ptr);
> >
> > Even with these functions in-place, user has to be extra careful:
> >  - he can't add new memory chunks to these mempools (or he'll need to
> re-calcualte the new base_ptr)
> >  - he needs to make sure that pointers from only these mempools will
> be used by compress/decompress.
> > But at least it provides some ability to use this feature in real
> apps.
> >
> > With such API in place it should be possible to make the auto-test
> more realistic:
> > - allocate mempool
> > - deduce base_pointer
> > - then we can have a loop with producer/consumer to mimic realistic
> workload.
> >     As an example:
> >      producer(s):  mempool_alloc(); <fill mbuf with some values>;
> ring_enqueue();
> >      consumer(s): ring_dequeue(); <read_and_check_mbuf_data>;
> free_mbuf();
> > - free mempool
> 
> I understand your objections and agree that the proposed API is limited
> in its applicability due to its strict requirements.
> 
> AFAIK DPDK rte_mempool does require the addresses to be virtually
> contiguous as the memory reservation is done during creation of the
> mempool and a single memzone is reserved.
No, it does not.
rte_pktmbuf_pool_create() first creates an empty mempool using rte_mempool_create_empty(), and then populates it using rte_mempool_populate_default(), which may use multiple memzones.
As one possible solution to this, the application can call rte_pktmbuf_pool_create() as usual, and then check that mp->nb_mem_chunks == 1 to confirm that all objects in the created mempool reside in one contiguous chunk of memory and thus compression can be used.
Or even better, add a new mempool flag to the mempool library, e.g. RTE_MEMPOOL_F_COMPRESSIBLE, specifying that the mempool objects must be allocated as one chunk of memory with contiguous addresses.
Unfortunately, rte_pktmbuf_pool_create() is missing the memzone flags parameter, so a new rte_pktmbuf_pool_create() API with the flags parameter added would also need to be added.
> However, I do not require
> users to use the rte_mempool as the API accepts pointers so other
> mempool implementations could indeed allow non-contiguous VAs.
> 
> To help decide at compile time if compression is allowed I will add
> extra macros to the header
> 
> #define BITS_REQUIRED_TO_STORE_VALUE(x) \
> 	((x) == 0 ? 1 : ((sizeof(x)) - __builtin_clzl(x)))
> 
> #define BIT_SHIFT_FROM_ALIGNMENT(x) ((x) == 0 ? 0 : __builtin_clzl(x))
> 
> #define CAN_USE_RTE_PTR_COMPRESS_16(memory_range, object_alignment) \
> 	((BITS_REQUIRED_TO_STORE_VALUE(memory_range) - \
> 	BIT_SHIFT_FROM_ALIGNMENT(object_alignment)) <= 16 ? 1 : 0)
> 
> #define CAN_USE_RTE_PTR_COMPRESS_32(memory_range, object_alignment) \
> 	((BITS_REQUIRED_TO_STORE_VALUE(memory_range) - \
> 	BIT_SHIFT_FROM_ALIGNMENT(object_alignment)) <= 32 ? 1 : 0)
> 
> And explain usage in the docs.
> 
> The API is very generic and does not even require you to use a mempool.
> There are no runtime checks to verify or calculate if the pointers can
> be compressed. This is because doing this at runtime would remove any
> gains achieved through compression. The code doing the compression needs
> to remain limited in size, branching and execution time to remain fast.
> 
> This is IMHO the nature of C applications, they trade off runtime checks
> for performance. Program correctness needs to be enforced through other
> means, linting, valgrind, tests, peer review, etc. It is up to the
> programmer to calculate and decide on the viability of compression as it
> cannot be done at compile time automatically. There is no way for me to
> programmatically verify the alignment and distance of the pointers being
> passed in at compile time as I don't require the user to use any
> particular mempool implementation.
> 
> These limitations are clearly documented in the API and the guide.
> 
> > Or probably you can go even further: take some existing pipeline
> sample app and make it use compress/decompress API.
> > That will provide people with some ability to test it and measure it's
> perf impact.
> > Again, it will provide an example of the amount of changes required to
> enable it.
> > My speculation here that majority of users will find the effort too
> big,
> > while the gain way too limited and fragile.
> > But at least, there would be some realistic reference point for it and
> users can decide themselves is it worth it or not.
> 
> I have added a performance test that runs the compression and
> decompression loop with different burst sizes so that you can easily
> test if attempting compression is worth the the effort in your usecase.
> This is documented in the guide.
> 
> >
> >>>
> >>> I would like to add:
> >>> If we want to offer optimizations specifically for applications with
> a single mbuf pool, I think it should be considered in a system-wide
> >> context to determine if performance could be improved in more areas.
> >>> E.g. removing the pool field from the rte_mbuf structure might free
> up space to move hot fields from the second cache line to the
> >> first, so the second cache line rarely needs to be touched. (As an
> alternative to removing the pool field, it could be moved to the
> >> second cache line, only to be used if the global "single mbuf pool"
> is NULL.)
> >> Agree on this. The feedback I have received is on similar lines, many
> are using simple features. I also received feedback that 90% of
> >> the applications use less than 4GB of memory for mbuf and burst sizes
> are up to 256.
> >
> > Well, from my perspective the story is completely different:
> > Majority of real-world apps I am aware do use multiple mempools,
> > it is also not uncommon to have a mempools with size bigger then 4GB
> (8/16).
> > Again, there are queries to make mempools growable/shrinkable on
> demand.
> 
> You can use this API with mempools even as big as 32GB as long as your
> alignment allows for sufficient shift (as explained in the headers and
> docs) and rte_mempool objects will have at least 8 bytes alignment so
> can fit a 32GB mempool in. It is true that you cannot use it if you
> cannot guarantee the address range at compile time. This utility is not
> a golden bullet to use on every pointer.
For future proofing, please rename the compression functions to include the compression algorithm, i.e. "shift" or similar, in the function names.
Specifically I'm thinking about an alternative "multiply" compression algorithm based on division/multiplication by a constant "multiplier" parameter (instead of the "bit_shift" parameter).
This "multiplier" would typically be the object size of the packet mbuf mempool.
The "multiplier" could be constant at built time, e.g. 2368, or determined at runtime.
I don't know the performance of division/multiplication compared to bit shifting for various CPUs, but it would make compression to 16 bit compressed pointers viable for more applications.
The perf test in this series could be used to determine compression/decompression performance of such an algorithm, and the application developer can determine which algorithm to use; "shift" with 32 bit compressed pointers, or "multiply" with 16 bit compressed pointers.
    
    
More information about the dev
mailing list