gcc - SSE2 instructions not working in inline assembly with C++
I have a function that uses SSE2 to add values. It is supposed to add lhs and rhs and store the result in lhs:
    template<typename t>
    void simdadd(t *lhs,t *rhs)
    {
        asm volatile("movups %0,%%xmm0"::"m"(lhs));
        asm volatile("movups %0,%%xmm1"::"m"(rhs));
        switch(sizeof(t))
        {
            case sizeof(uint8_t):
                asm volatile("paddb %%xmm0,%%xmm1":);
                break;
            case sizeof(uint16_t):
                asm volatile("paddw %%xmm0,%%xmm1":);
                break;
            case sizeof(float):
                asm volatile("addps %%xmm0,%%xmm1":);
                break;
            case sizeof(double):
                asm volatile("addpd %%xmm0,%%xmm1":);
                break;
            default:
                std::cout<<"error"<<std::endl;
                break;
        }
        asm volatile("movups %%xmm0,%0":"=m"(lhs));
    }
The code that uses this function looks like this:
    float *values=new float[4];
    float *values2=new float[4];
    values[0]=1.0f; values[1]=2.0f; values[2]=3.0f; values[3]=4.0f;
    values2[0]=1.0f; values2[1]=2.0f; values2[2]=3.0f; values2[3]=4.0f;
    simdadd(values,values2);
    for(uint32_t count=0;count<4;count++)
        std::cout<<values[count]<<std::endl;
However, it isn't working: when the code runs it outputs 1, 2, 3, 4 instead of 2, 4, 6, 8.
I've found that inline assembly support isn't reliable in modern compilers (as in, the implementations are just plain buggy). You're generally better off using compiler intrinsics, which are declarations that look like C functions but compile to a specific opcode.
Intrinsics let you specify an exact sequence of opcodes but leave the register coloring to the compiler. That's much more reliable than trying to move data between C variables and asm registers, which is where inline assemblers have always fallen down for me. It also lets the compiler schedule the instructions, which can give better performance if it works around pipeline hazards. In this case you could do:
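    #include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

    void simdadd(float *lhs,float *rhs)
    {
        // Load four floats from each pointer, add them, store the sum back to lhs.
        _mm_storeu_ps( lhs, _mm_add_ps(_mm_loadu_ps( lhs ), _mm_loadu_ps( rhs )) );
    }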
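The same intrinsic approach could probably be extended to the other element types in the question's switch. The following is only a sketch: it assumes plain overloads instead of the sizeof-based template, and the const qualifiers on rhs are an addition of this example, not something from the original code. Each overload processes exactly one 16-byte vector.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides the SSE float ones

// Each overload adds one 16-byte vector at rhs into the one at lhs.
// Unaligned loads and stores are used, matching movups in the question.

inline void simdadd(float *lhs, const float *rhs)          // addps
{
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));
}

inline void simdadd(double *lhs, const double *rhs)        // addpd
{
    _mm_storeu_pd(lhs, _mm_add_pd(_mm_loadu_pd(lhs), _mm_loadu_pd(rhs)));
}

inline void simdadd(std::uint8_t *lhs, const std::uint8_t *rhs)    // paddb
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi8(a, b));
}

inline void simdadd(std::uint16_t *lhs, const std::uint16_t *rhs)  // paddw
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi16(a, b));
}
```

Overloading on the pointer type sidesteps one trap in the original switch: sizeof(float) and sizeof(uint32_t) are typically equal, so a sizeof-based dispatch cannot tell an integer add from a floating-point one.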
In your case, anyway, you've got two problems:
- The terrible GCC inline assembly syntax creates great confusion about the difference between pointers and values. Use *lhs and *rhs instead of lhs and rhs; apparently the "=m" syntax means "implicitly use a pointer to the thing I'm passing instead of the thing itself." As written, the movups instructions read and write the memory of the pointer variables themselves, not the arrays they point to.
- GCC uses source, destination operand order: addps stores its result in the second parameter, so you need to output xmm1, not xmm0.
I've put a fixed example on codepad (to avoid cluttering this answer, and to demonstrate that it works).
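The codepad listing isn't reproduced here, so the following is only a sketch of how the two fixes above might look for the float case, assuming GCC or Clang on x86. The name simdadd_asm is hypothetical. It also goes one step beyond what the answer asks for: the separate asm statements are folded into a single one with xmm0 and xmm1 declared as clobbers, since the compiler is otherwise free to use those registers between statements.

```cpp
// Sketch only: GCC/Clang extended asm, x86 with SSE, AT&T operand order.
inline void simdadd_asm(float *lhs, const float *rhs)
{
    asm volatile(
        "movups %0, %%xmm0\n\t"       // xmm0 = four floats at lhs
        "movups %1, %%xmm1\n\t"       // xmm1 = four floats at rhs
        "addps  %%xmm0, %%xmm1\n\t"   // source, destination: xmm1 = xmm1 + xmm0
        "movups %%xmm1, %0"           // store the sum (in xmm1) back to lhs
        : "+m"(*lhs)                  // fix 1: pass *lhs, the pointed-to memory
        : "m"(*rhs)                   // fix 1: pass *rhs, not the pointer itself
        : "xmm0", "xmm1", "memory");  // "memory": all 16 bytes are touched, not
                                      // only the single float each operand names
}
```

Called with the arrays from the question, values should then hold 2, 4, 6, 8.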