<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Thu, Oct 9, 2025 at 10:11 PM Bruce Richardson <<a href="mailto:bruce.richardson@intel.com">bruce.richardson@intel.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, Jul 16, 2025 at 04:04:39PM +0530, Shreesh Adiga wrote:<br> > Replace the clearing of lower 32 bits of XMM register with blend of<br> > zero register.<br> > Replace the clearing of upper 64 bits of XMM register with _mm_move_epi64.<br> > Clang is able to optimize away the AND + memory operand with the<br> > above sequence, however GCC is still emitting the code for AND with<br> > memory operands which is being explicitly eliminated here.<br> > <br> > Additionally replace the 48 byte crc_xmm_shift_tab with the contents of<br> > shf_table which is 32 bytes, achieving the same functionality.<br> > <br> > Signed-off-by: Shreesh Adiga <<a href="mailto:16567adigashreesh@gmail.com" target="_blank">16567adigashreesh@gmail.com</a>><br> > ---<br> > lib/net/net_crc_sse.c | 30 +++++++-----------------------<br> > 1 file changed, 7 insertions(+), 23 deletions(-)<br> > <br> <br> See inline below. Changes to the reduce_64_to_32 look ok, I don't know<br> enough to understand fully the other changes you made. 
Maybe split the<br> patch into two patches for review and merge separately?<br> <br> /Bruce<br> <br> > diff --git a/lib/net/net_crc_sse.c b/lib/net/net_crc_sse.c<br> > index 112dc94ac1..eec854e587 100644<br> > --- a/lib/net/net_crc_sse.c<br> > +++ b/lib/net/net_crc_sse.c<br> > @@ -96,20 +96,13 @@ crcr32_reduce_128_to_64(__m128i data128, __m128i precomp)<br> > static __rte_always_inline uint32_t<br> > crcr32_reduce_64_to_32(__m128i data64, __m128i precomp)<br> > {<br> > - static const alignas(16) uint32_t mask1[4] = {<br> > - 0xffffffff, 0xffffffff, 0x00000000, 0x00000000<br> > - };<br> > -<br> > - static const alignas(16) uint32_t mask2[4] = {<br> > - 0x00000000, 0xffffffff, 0xffffffff, 0xffffffff<br> > - };<br> > __m128i tmp0, tmp1, tmp2;<br> > <br> > - tmp0 = _mm_and_si128(data64, _mm_load_si128((const __m128i *)mask2));<br> > + tmp0 = _mm_blend_epi16(_mm_setzero_si128(), data64, 252);<br> <br> Minor nit: 252 would be better in hex to make it clearer that it's the<br> lower two bits are unset. Even better, how about switching the operands so<br> that the constant is just "3", which is clearer again.<br></blockquote><div>Okay I will update it with your suggestion in the next patch. 
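<div>As a sanity check on the operand swap, here is a scalar sketch of the blend semantics (lane i is taken from the second operand when bit i of the immediate is set, otherwise from the first). blend_epi16_model and blend_forms_agree are hypothetical names used only for illustration; this is not DPDK code.</div>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm_blend_epi16(a, b, imm8): for each of the eight
 * 16-bit lanes, take the lane from b if the matching bit of imm8 is set,
 * otherwise take it from a. */
static void blend_epi16_model(const uint16_t a[8], const uint16_t b[8],
                              unsigned int imm8, uint16_t dst[8])
{
    assert(imm8 <= 0xff);
    for (int i = 0; i < 8; i++)
        dst[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
}

/* Returns 1 if blend(zero, data64, 0xfc) and blend(data64, zero, 0x03)
 * give the same vector and that vector has its low 32 bits cleared. */
static int blend_forms_agree(const uint16_t data64[8])
{
    const uint16_t zero[8] = {0};
    uint16_t v1[8], v2[8];

    blend_epi16_model(zero, data64, 0xfc, v1);  /* form used in the patch (252) */
    blend_epi16_model(data64, zero, 0x03, v2);  /* suggested form, operands swapped */

    if (memcmp(v1, v2, sizeof(v1)) != 0)
        return 0;
    if (v1[0] != 0 || v1[1] != 0)               /* lanes 0 and 1 are the low 32 bits */
        return 0;
    for (int i = 2; i < 8; i++)                 /* upper 96 bits pass through */
        if (v1[i] != data64[i])
            return 0;
    return 1;
}
```

<div>In this model both forms clear lanes 0 and 1 and pass the remaining six lanes through unchanged, which matches the effect of the removed mask2 AND.</div>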
</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> > <br> > tmp1 = _mm_clmulepi64_si128(tmp0, precomp, 0x00);<br> > tmp1 = _mm_xor_si128(tmp1, tmp0);<br> > - tmp1 = _mm_and_si128(tmp1, _mm_load_si128((const __m128i *)mask1));<br> > + tmp1 = _mm_move_epi64(tmp1);<br> > <br> <br> This change LGTM.<br> <br> > tmp2 = _mm_clmulepi64_si128(tmp1, precomp, 0x10);<br> > tmp2 = _mm_xor_si128(tmp2, tmp1);<br> > @@ -118,13 +111,11 @@ crcr32_reduce_64_to_32(__m128i data64, __m128i precomp)<br> > return _mm_extract_epi32(tmp2, 2);<br> > }<br> > <br> > -static const alignas(16) uint8_t crc_xmm_shift_tab[48] = {<br> > - 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,<br> > - 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,<br> > +static const alignas(16) uint8_t crc_xmm_shift_tab[32] = {<br> > + 0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,<br> > + 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,<br> > 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,<br> > - 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,<br> > - 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,<br> > - 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff<br> > + 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f<br> > };<br> > <br> <br> Can you perhaps explain how changing this table doesn't break existing uses<br> of the table as it now is in the code? Specifically, does xmm_shift_left<br> function not now have different behaviour?<br></blockquote><div>Sure, crc_xmm_shift_tab is only used inside xmm_shift_left which is only used when</div><div>the total data_len is < 16. We call xmm_shift_left(fold, 8 - data_len) when len <= 4 and</div><div>xmm_shift_left(fold, 16 - data_len) when 5 <= len <= 15. This results in accessing</div><div>crc_xmm_shift_tab between 1 and 31, i.e. 
element 0 is never accessed.</div><div><br></div><div>Now take the specific case of xmm_shift_left(fold, 10). Previously the shuffle register</div><div>would get loaded from (crc_xmm_shift_tab + 16 - 10), i.e. crc_xmm_shift_tab + 6:</div><div>{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05}, which</div><div>when used with PSHUFB results in the first 10 bytes of reg being 0 and the lower 6</div><div>elements moving left: {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, d1, d2, d3, d4, d5, d6}. The zeros get inserted</div><div>because PSHUFB writes 0 for any index byte > 0x7f (MSB set).</div><div><br></div><div>Now that we have replaced the contents of crc_xmm_shift_tab with shf_table, we will load:</div><div>{0x86, 0x87, 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05}</div><div>into the index register. Since the first 10 elements have the MSB set, PSHUFB again produces the</div><div>same vector as above, with the first 10 elements being zeroes and the lower 6 moving left as intended.</div><div>Since xmm_shift_left is called with num between 1 and 11, we don't access crc_xmm_shift_tab[0],</div><div>and the remaining elements 0x8{i} behave identically to 0xff when used with PSHUFB. 
Thus</div><div>xmm_shift_left behaves identically for every num value currently used in this file.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <br> > /**<br> > @@ -216,19 +207,12 @@ crc32_eth_calc_pclmulqdq(<br> > 0x80808080, 0x80808080, 0x80808080, 0x80808080<br> > };<br> > <br> > - const alignas(16) uint8_t shf_table[32] = {<br> > - 0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,<br> > - 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,<br> > - 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,<br> > - 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f<br> > - };<br> > -<br> > __m128i last16, a, b;<br> > <br> > last16 = _mm_loadu_si128((const __m128i *)&data[data_len - 16]);<br> > <br> > temp = _mm_loadu_si128((const __m128i *)<br> > - &shf_table[data_len & 15]);<br> > + &crc_xmm_shift_tab[data_len & 15]);<br> > a = _mm_shuffle_epi8(fold, temp);<br> > <br> > temp = _mm_xor_si128(temp,<br> > -- <br> > 2.49.1<br> > <br> </blockquote></div></div>
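<div>The PSHUFB argument in the reply above can be checked with a small scalar model. This is only a sketch, not DPDK code: pshufb_model, shift_left_model and tables_agree are hypothetical helper names, the two tables are copied from the diff, and the (tab + 16 - num) load follows the description of xmm_shift_left given in the reply.</div>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Old 48-byte crc_xmm_shift_tab, as removed by the patch. */
static const uint8_t old_tab[48] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
};

/* New 32-byte table (former shf_table contents), as added by the patch. */
static const uint8_t new_tab[32] = {
    0x00, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
    0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
};

/* Scalar model of PSHUFB: an index byte with the MSB set yields 0,
 * otherwise the low 4 bits of the index select a source byte. */
static void pshufb_model(const uint8_t src[16], const uint8_t idx[16],
                         uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}

/* Model of xmm_shift_left: use the 16 bytes at tab + 16 - num as indices. */
static void shift_left_model(const uint8_t *tab, const uint8_t reg[16],
                             unsigned int num, uint8_t out[16])
{
    assert(num <= 16);  /* keeps the 16-byte read inside both tables */
    pshufb_model(reg, tab + 16 - num, out);
}

/* Returns 1 if the old and new tables produce the same shifted vector. */
static int tables_agree(unsigned int num)
{
    uint8_t reg[16], a[16], b[16];

    for (int i = 0; i < 16; i++)
        reg[i] = (uint8_t)(0xd1 + i);  /* arbitrary non-zero test bytes */

    shift_left_model(old_tab, reg, num, a);
    shift_left_model(new_tab, reg, num, b);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

<div>Under this model the two tables agree for every num from 1 to 15 and differ only at num = 16, which is the one case that would read element 0 of the table.</div>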