simd - ARM NEON count compare result
I need to do a parallel compare on uint16x8_t vectors and increment a local variable (a counter) according to the result, for example by +8 if all elements of the vector compared true. I implemented this algorithm:
    ...
    register int objects = 0;
    uint16x8_t vcmp0, vobj;
    uint32x2_t dobj;
    register uint32_t temp0;
    ...
    vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0)));
    vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj));
    vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj));
    vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj)));
    dobj = vmovn_u64(vreinterpretq_u64_u16(vobj));
    dobj = vreinterpret_u32_u64(vpaddl_u32(dobj));
    __asm__ __volatile__ (
        "vmov.u32 %[temp0], %[dobj][0] \n\t"
        "add %[objects], %[objects], %[temp0], asr #4 \n\t"
        : [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
        :
        : "memory"
    );
    ...
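The vector vcmp0 contains the results of the compare, vobj and dobj are used for the computation, and objects is the counter. I am using a count of set bits followed by pairwise adds: each true lane contributes 16 set bits, which is why the total is shifted right by 4. Is there a faster way to do this?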
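One direction that may be worth measuring, sketched here under assumptions: if the compare runs inside a loop, the per-vector horizontal reduction can be avoided by accumulating the masks in a vector register and reducing only once per chunk. A true lane is 0xFFFF, which is -1 as a 16-bit value, so subtracting the mask from an accumulator adds 1 to each matching lane. The function name count_equal_u16, the use of an equality compare (vceqq_u16), and the loop over two arrays are all hypothetical, since the question does not show which compare produces vcmp0 or how the data is loaded. The scalar fallback is only there so the sketch builds on non-NEON targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the question): counts positions where
 * a[i] == b[i]. Equality is an assumed stand-in for the real compare. */
size_t count_equal_u16(const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t count = 0;
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (n - i >= 8) {
        /* A 16-bit lane can absorb at most 65535 increments before it
         * wraps, so process the input in chunks of at most 65535 vectors. */
        size_t blocks = (n - i) / 8;
        if (blocks > 65535)
            blocks = 65535;

        uint16x8_t acc = vdupq_n_u16(0);
        for (size_t k = 0; k < blocks; ++k, i += 8) {
            uint16x8_t mask = vceqq_u16(vld1q_u16(a + i), vld1q_u16(b + i));
            /* True lanes are 0xFFFF (-1): subtracting adds 1 per match. */
            acc = vsubq_u16(acc, mask);
        }

        /* One horizontal reduction per chunk: u16x8 -> u32x4 -> u64x2. */
        uint64x2_t sum = vpaddlq_u32(vpaddlq_u16(acc));
        count += (size_t)(vgetq_lane_u64(sum, 0) + vgetq_lane_u64(sum, 1));
    }
#endif

    /* Scalar tail, and the whole array when NEON is not available. */
    for (; i < n; ++i)
        count += (a[i] == b[i]);

    return count;
}
```

With this layout the inner loop contains only the loads, the compare, and one vector subtract, and no inline assembly is needed to move the result into a general-purpose register. Whether it is actually faster than the bit-count version depends on the core and on how many vectors are processed per reduction, so it should be benchmarked on the target.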