optimization - Optimizing a loop with few instructions (SSE2, SSE4) with TBB
I have a simple image-processing algorithm. Briefly, a float image (mean) is subtracted from an 8-bit image and the result is saved to a float image (dest).
The function is written with intrinsics.
I have tried to optimize the function with TBB's parallel_for, but I got no gain in speed, only a penalty.
What should I do? Should I use a more low-level scheme, such as TBB tasks, to optimize the code?
    float *m, **m_data, *o, **o_data;
    unsigned char *p, **src_data;
    register unsigned long len, i;
    unsigned long nr, nc;

    src_data = src->ubytedata;  // 2d array
    m_data = mean->floatdata;   // 2d array
    o_data = dest->floatdata;   // 2d array
    nr = src->rows;
    nc = src->cols;
    __m128i xmm0;

    for (i = 0; i < nr; i++)
    {
        m = m_data[i];
        o = o_data[i];
        p = src_data[i];
        len = nc;  // the loop assumes nc is a nonzero multiple of 16 and rows are 16-byte aligned
        do
        {
            _mm_prefetch((const char *)(p + 16), _MM_HINT_NTA);
            _mm_prefetch((const char *)(m + 16), _MM_HINT_NTA);

            // load 16 source pixels, then widen 4 at a time to float
            xmm0 = _mm_load_si128((__m128i *)(p));

            // note: offset is not declared in the snippet as posted
            _mm_stream_ps(o,
                _mm_sub_ps(
                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 0))),
                    _mm_load_ps(m + offset)));
            _mm_stream_ps(o + 4,
                _mm_sub_ps(
                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 4))),
                    _mm_load_ps(m + offset + 4)));
            _mm_stream_ps(o + 8,
                _mm_sub_ps(
                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 8))),
                    _mm_load_ps(m + offset + 8)));
            _mm_stream_ps(o + 12,
                _mm_sub_ps(
                    _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_srli_si128(xmm0, 12))),
                    _mm_load_ps(m + offset + 12)));

            p += 16;
            m += 16;
            o += 16;
            len -= 16;
        } while (len);
    }
You are doing almost no computation here relative to the number of loads and stores, so it is most likely being limited by memory bandwidth rather than by computation. This would explain why you don't see any improvement in throughput when you optimise the computation.
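To put rough numbers on the bandwidth argument: each pixel costs one byte read from the 8-bit source, four bytes read from the float mean and four bytes written to the float destination, against a single convert and a single subtract. The sketch below just does that arithmetic. The function names and the bandwidth parameter are made up for illustration, and the bandwidth figure has to be measured on the actual machine.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved per pixel by the subtraction loop:
 * 1 byte read from the 8-bit source, 4 bytes read from the float mean,
 * 4 bytes written to the float destination. */
enum { SRC_BYTES = 1, MEAN_BYTES = 4, DEST_BYTES = 4 };

/* Total bytes transferred for a rows x cols image (hypothetical helper). */
static double traffic_bytes(size_t rows, size_t cols)
{
    return (double)rows * (double)cols
           * (double)(SRC_BYTES + MEAN_BYTES + DEST_BYTES);
}

/* Lower bound on run time in seconds if memory were the only limit.
 * bandwidth_bytes_per_s is an assumed figure, not something taken
 * from the question. */
static double min_seconds(size_t rows, size_t cols,
                          double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return traffic_bytes(rows, cols) / bandwidth_bytes_per_s;
}
```

If the measured time for one thread is already close to this bound, adding more threads with parallel_for cannot help much, because all cores share the same memory bus.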
I would get rid of the _mm_prefetch instructions though. They are not helping here and may even be hurting performance.
If possible, you should combine this loop with other operations that you are doing before or after it. That way you amortise the cost of the memory I/O over more computation.
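As a minimal scalar sketch of what combining loops could look like: assume, purely for illustration, that the step after the subtraction is a sum of squared differences. The question does not say what the neighbouring operations are, so the function name and that follow-up step are hypothetical. Doing both in one pass means each difference is used while it is still in a register instead of being written to dest and read back later.

```c
#include <assert.h>
#include <stddef.h>

/* Subtract the mean from one row and, in the same pass, accumulate the
 * sum of squared differences. The accumulation stands in for whatever
 * (hypothetical) step follows the subtraction in the real pipeline. */
static double subtract_mean_and_sum_sq(const unsigned char *src,
                                       const float *mean,
                                       float *dest,
                                       size_t n)
{
    assert(src && mean && dest);
    double sum_sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        float d = (float)src[i] - mean[i];  /* same result as the original loop */
        dest[i] = d;                        /* still stored for later use */
        sum_sq += (double)d * (double)d;    /* extra work on data already loaded */
    }
    return sum_sq;
}
```

The same idea carries over to the intrinsics version: keep the loads and stores as they are and add the extra arithmetic between them.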