assembly - NEON memcpy, memset and using .c with .s files
I am trying to get familiar with the NEON instructions, both in assembly and through intrinsics. I would like to use the NEON memcpy from the official ARM page with GCC v4.8.2 (hardfp).
I've also found this topic, but its version is slightly different from the official ARM page implementation.
Unfortunately, I have never used C together with .s files, so I need some help. My .c file looks like this:
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>
    #include <time.h>
    #include <stdint.h>
    #include <arm_neon.h>

    int main()
    {
        clock_t start, end; // timer variable
        uint32_t i, X = 100;
        size_t size = 2048 * 32 /* arbitrary */;
        size_t offset = 1;
        char *src = malloc(sizeof(char) * (size + offset));
        char *dst = malloc(sizeof(char) * (size));
        NEONCopyPLD(dst, src + offset, size);
        memcpy(dst, src + offset, size);
        return (0);
    }

and the assembly.s file is the following:
    .global NEONCopyPLD
    NEONCopyPLD:
        PLD [r1, #0xC0]
        VLDM r1, {d0-d7}
        VSTM r0, {d0-d7}
        SUBS r2, r2, #0x40
        BGE NEONCopyPLD

I use the following compilation command:
    arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -Ofast -fprefetch-loop-arrays assembly.s asm_pr.c -o output

and I get the following error:
    Potentially unexpected fatal signal 11.
    CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
    task: BF 907 Siand ti: Beef 4 EFT task.ti: Beef 4 Aaftiaks
    PC is at 0x4c90
    LR is at 0x852d
    pc : [<00004c90>]    lr : [<0000852d>]    psr: 40030030
    sp : 7 EME 9 8 CB  ip : 00000107  fp : 00000000
    r10: 76f91000  r9 : 00000000  r8 : 00000000
    r7 : 00001017  r6 : 0001855  r5 : 00e75009  r4 : 00010001
    r3 : 000f4240  r2 : 00010000  r1 : 00e75009  r0 : 00e85010
    Flags: nZcv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
    Control: 10c5387d  Table: 4F 7404  DAC: 00000015
    CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
    Backtrace:
    [<800120a4>] (dump_backtrace+0x0/0x118) from [<80012318>] (show_stack+0x20/0x24)
    [<800122f8>] (show_stack+0x0/0x24) from [<804fab0c>] (dump_stack+0x24/0x28)
    [<804faae8>] (dump_stack+0x0/0x28) from [<8000f560>] (show_regs+0x30/0x34)
    [<8000f530>] (show_regs+0x0/0x34) from [<8003349c>] (get_signal_to_deliver+0x318/0x668)
    [<80033184>] (get_signal_to_deliver+0x0/0x668) from [<80011664>] (do_signal+0x11c/0x450)
    [<80011548>] (do_signal+0x0/0x450) from [<80011b20>] (do_work_pending+0x74/0xac)
    [<80011aac>] (do_work_pending+0x0/0xac) from [<8000e500>] (work_pending+0xc/0x20)
    Segmentation fault

I have another question: can we use SIMD instructions (through intrinsics or auto-vectorization) to speed up the initialization of an array with 0? I have seen that the following code cannot be auto-vectorized:
    for (i = 0; i < n; i++) a[i] = 0;

whereas this code block can be auto-vectorized:
    for (i = 0; i < n; i++) a[i] = i;

My ultimate goal is to check whether I can write a NEON function that runs faster than memset(). Finally, I would like to ask about loops with an ambiguous end. The following code cannot be auto-vectorized:
    while (*p != NULL) { *q++ = *p++; }

Is it possible to use intrinsics or assembly to develop a fast version of this loop? If you have done something like this, could you post it here?
You never return from your assembler function, so whatever code happens to be stored after the assembler function will be executed. This will crash sooner or later.
End your function with this:

    mov pc, lr

This is very likely to fix your problem. You should also check which registers (NEON and general-purpose registers) you must preserve across function calls. This page is a useful resource that shows examples of how to do this.
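As for the intrinsics route asked about in the question, here is a minimal sketch rather than a tuned implementation. The names neon_copy_sketch and neon_zero_sketch are hypothetical and exist only for this illustration. The sketch assumes a compiler that defines __ARM_NEON or __ARM_NEON__ when NEON is enabled (for example with -mfpu=neon); on any other target it simply falls back to memcpy() and memset(). Whether it beats the library routines has to be measured on your board.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: copy n bytes, 16 bytes per iteration when NEON
 * is available. The leftover tail (or everything, without NEON) is
 * handed to the library memcpy(). */
void neon_copy_sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && src != NULL);
#if HAVE_NEON
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i); /* 128-bit load  */
        vst1q_u8(dst + i, v);             /* 128-bit store */
    }
#endif
    memcpy(dst + i, src + i, n - i);
}

/* Hypothetical helper: set n bytes to zero, 16 bytes per iteration
 * when NEON is available, with memset() for the remaining tail. */
void neon_zero_sketch(uint8_t *dst, size_t n)
{
    size_t i = 0;
    assert(dst != NULL);
#if HAVE_NEON
    const uint8x16_t zero = vdupq_n_u8(0); /* vector of sixteen 0 bytes */
    for (; i + 16 <= n; i += 16)
        vst1q_u8(dst + i, zero);
#endif
    memset(dst + i, 0, n - i);
}
```

Called as neon_copy_sketch((uint8_t *)dst, (const uint8_t *)(src + offset), size), it stands in for the NEONCopyPLD call in the question's main(). Because it is plain C, the compiler generates the function return and takes care of register preservation for you.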