optimization - Optimizing a loop with few instructions (SSE2, SSE4) with TBB


I have a simple image-processing algorithm. Briefly, a float image (mean) is subtracted from an 8-bit image (src), and the result is saved to a float image (dest).

The function is written with intrinsics.

I have tried to optimize this function with TBB, using parallel_for, but I received no gain in speed, only a penalty.

What should I do? Should I use a more low-level scheme, such as TBB tasks, to optimize the code?

    float           *m, **m_data,
                    *o, **o_data;
    unsigned char   *p, **src_data;
    register unsigned long len, i;
    unsigned long   nr,
                    nc;

    src_data    =   src->ubytedata;    // 2d array
    m_data      =   mean->floatdata;   // 2d array
    o_data      =   dest->floatdata;   // 2d array
    nr          =   src->rows;
    nc          =   src->cols;

    __m128i xmm0;

    for(i=0; i<nr; i++)
    {
        m = m_data[i];
        o = o_data[i];
        p = src_data[i];
        len = nc;
        // 16 pixels per iteration: assumes nc is a multiple of 16
        // and that every row pointer is 16-byte aligned
        do
        {
            _mm_prefetch((const char *)(p + 16),  _MM_HINT_NTA);
            _mm_prefetch((const char *)(m + 16),  _MM_HINT_NTA);

            // load 16 unsigned 8-bit pixels
            xmm0 = _mm_load_si128((__m128i *) (p));

            // widen 4 pixels at a time to float, subtract the mean,
            // and write the result with a non-temporal store
            _mm_stream_ps(
                            o,
                            _mm_sub_ps(
                                        _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 0))),
                                        _mm_load_ps(m + offset)
                                    )
                        );
            _mm_stream_ps(
                            o + 4,
                            _mm_sub_ps(
                                        _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 4))),
                                        _mm_load_ps(m + offset + 4)
                                    )
                        );
            _mm_stream_ps(
                            o + 8,
                            _mm_sub_ps(
                                        _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 8))),
                                        _mm_load_ps(m + offset + 8)
                                    )
                        );
            _mm_stream_ps(
                            o + 12,
                            _mm_sub_ps(
                                        _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 12))),
                                        _mm_load_ps(m + offset + 12)
                                    )
                        );

            p += 16;
            m += 16;
            o += 16;
            len -= 16;
        }
        while(len);
    }

You are doing very little computation here relative to the number of loads and stores, so the loop is most likely limited by memory bandwidth rather than by computation. This would explain why you don't see any improvement in throughput when you optimise the computation.
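As a rough illustration of that point, the memory traffic per pixel can be counted directly: one byte read from src, four bytes read from mean, and four bytes written to dest, against a single convert and a single subtract. The sketch below turns this into a lower bound on run time. The function name `min_seconds` and the bandwidth figure passed to it are assumptions for illustration, not measurements, and the count assumes the streaming stores avoid any extra read of the destination.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per pixel by the loop in the question:
// 1 byte read (src) + 4 bytes read (mean) + 4 bytes written (dest).
constexpr std::size_t kBytesPerPixel =
    sizeof(unsigned char) + sizeof(float) + sizeof(float);

// Lower bound on run time imposed by memory traffic alone.
// bandwidth_bytes_per_s is an assumed, machine-specific figure
// (sustained bandwidth shared by all cores), not something measured here.
double min_seconds(std::size_t rows, std::size_t cols,
                   double bandwidth_bytes_per_s) {
    assert(bandwidth_bytes_per_s > 0.0);
    const double bytes = static_cast<double>(rows) *
                         static_cast<double>(cols) *
                         static_cast<double>(kBytesPerPixel);
    return bytes / bandwidth_bytes_per_s;
}
```

If a single thread already runs close to this bound, adding threads with parallel_for cannot help much, because every thread draws on the same memory bandwidth.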

I would get rid of the _mm_prefetch instructions, though: they are not helping here and may even be hurting performance.

If possible, you should combine this loop with the other operations you are doing before or after it. That way you amortise the cost of the memory I/O over more computation.
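A minimal sketch of what such a combined loop might look like is shown here in plain scalar C++ for clarity. The follow-up step (a sum of squared differences) and the function name `subtract_mean_and_sum_squares` are purely hypothetical, since the question does not say what happens to dest afterwards. The point is only that each value is used again while it is still in a register, instead of being written out and read back in a second pass.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fused kernel: computes dest = src - mean and, in the same
// pass, accumulates the sum of squared differences. The second operation
// is an assumed example of "other work"; substitute the real next step.
double subtract_mean_and_sum_squares(const unsigned char* src,
                                     const float* mean,
                                     float* dest,
                                     std::size_t n) {
    assert(n == 0 || (src != nullptr && mean != nullptr && dest != nullptr));
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Same arithmetic as the original loop: widen the pixel, subtract.
        const float d = static_cast<float>(src[i]) - mean[i];
        dest[i] = d;
        // Extra work done while d is still in a register.
        sum_sq += static_cast<double>(d) * static_cast<double>(d);
    }
    return sum_sq;
}
```

The same idea carries over to the intrinsics version: the additional operation would go between the `_mm_sub_ps` and the store.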

