caching - C++ pool allocator vs. static allocation, cache performance -


Given that I have two parallel, evenly sized arrays of the following structs:

    struct Matrix { float data[16]; };
    struct Vec4   { float data[4];  };

    // Matrix arrM[256];   // for example
    // Vec4   arrV[256];

Let's say I want to iterate over both arrays sequentially, as fast as possible. Let's say the function is something like:

    for (int i = 0; i < 256; ++i) {
        readonlyfunc(arrMPtr[i], arrVPtr[i]);
    }

Will I get the same cache locality and performance whether I store my data as:

a)

    // aligned
    static Matrix arrM[256];
    static Vec4   arrV[256];

    Matrix* arrMPtr = &arrM[0];
    Vec4*   arrVPtr = &arrV[0];

versus

b)

    // aligned
    char* ptr = (char*) malloc(256 * sizeof(Matrix) + 256 * sizeof(Vec4));

    Matrix* arrMPtr = (Matrix*) ptr;
    Vec4*   arrVPtr = (Vec4*) (ptr + 256 * sizeof(Matrix));

assuming the memory is otherwise equivalent, regardless of how it is allocated (stack/statically allocated vs. heap)?

Cacheability: since the two arrays are fairly large (16 KB for arrM and 4 KB for arrV), the precise alignment of the first and last elements probably doesn't matter much. (If you are using SSE instructions to access the data, however, alignment does matter!)
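To make that concrete, here is a minimal sketch (my own illustration, not code from the question or answer) of one way to guarantee 16-byte alignment for SSE on both layouts; alignas and std::aligned_alloc (C++17) are the assumed tools here:

    #include <cstdio>
    #include <cstdlib>

    struct alignas(16) Matrix { float data[16]; };
    struct alignas(16) Vec4   { float data[4];  };

    // The static arrays pick up the 16-byte alignment of their element type.
    static Matrix arrM[256];
    static Vec4   arrV[256];

    int main() {
        // Pool version: request the alignment explicitly. std::aligned_alloc
        // requires the total size to be a multiple of the alignment (20480 is).
        void* pool = std::aligned_alloc(16, 256 * sizeof(Matrix) + 256 * sizeof(Vec4));
        Matrix* arrMPtr = static_cast<Matrix*>(pool);
        Vec4*   arrVPtr = reinterpret_cast<Vec4*>(
                              static_cast<char*>(pool) + 256 * sizeof(Matrix));
        std::printf("%p %p\n", static_cast<void*>(arrMPtr), static_cast<void*>(arrVPtr));
        std::free(pool);
    }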

Whether the memory is contiguous or merely close together won't make a big difference either, unless the allocation is so big that it no longer fits easily in the cache, and that would take quite a lot of cache lines.

You may find that using a single struct holding all 20 float values (the matrix and the vector together) works better if you are always stepping through both arrays sequentially in lock-step, but that only helps if you never need to do anything else with the data for which separate arrays make more sense.
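As an illustration of that interleaved layout (a sketch of my own, not something the question defines; the MatVec name and the readonlyfunc signature are assumptions):

    // One struct per element, carrying all 20 floats together.
    void readonlyfunc(const float* m, const float* v);  // assumed signature, for illustration

    struct MatVec {
        float m[16];   // the Matrix part
        float v[4];    // the Vec4 part
    };                 // 80 bytes per element, so both parts of element i sit next to each other

    static MatVec arr[256];

    void process_all() {
        for (int i = 0; i < 256; ++i)
            readonlyfunc(arr[i].m, arr[i].v);   // one linear sweep through a single 20 KB block
    }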

There can be a difference in the compiler's ability to translate the code so that it avoids an extra memory access. It will obviously depend on the actual code (for example, does the compiler inline the function containing the for-loop, does it inline the readonlyfunc code, etc.). If it does, the statically allocated version can be translated into constant address calculations, whereas the pointer version has to load the value of the pointer first to get the address of the data. It probably doesn't make a big difference over such a large loop, though.
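A small sketch of what that means at the source level (illustrative only; the function names are mine):

    struct Matrix { float data[16]; };

    static Matrix arrM[256];      // the address of arrM is known at link time
    Matrix* arrMPtr = arrM;       // a pointer variable that itself lives in memory

    float via_static()  { return arrM[10].data[0];    }  // &arrM[10] can be folded into a constant address
    float via_pointer() { return arrMPtr[10].data[0]; }  // must first load arrMPtr, then index off it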

As always when it comes to performance, small things can sometimes make a big difference, so if it really matters, do some experiments, using your compiler and your actual code. We can only give relatively speculative advice based on our experience. Different compilers generate different code from the same source, and different processors do different things with the same machine code, both across instruction-set architectures (ARM vs. x86) and across implementations of the same one (AMD Opteron vs. Intel Atom, or ARM Cortex-A15 vs. ARM Cortex-M3). The memory configuration of the particular system will also affect things: how big the caches are, and so on.
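If you want to measure it, a minimal harness along these lines (my own sketch; the dummy readonlyfunc body and the repetition count are arbitrary) is usually enough to compare the two layouts with your real compiler flags:

    #include <chrono>
    #include <cstdio>

    struct Matrix { float data[16]; };
    struct Vec4   { float data[4];  };

    static Matrix arrM[256];
    static Vec4   arrV[256];
    static float sink;   // keeps the loop from being optimized away

    void readonlyfunc(const Matrix& m, const Vec4& v) { sink += m.data[0] * v.data[0]; }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100000; ++rep)
            for (int i = 0; i < 256; ++i)
                readonlyfunc(arrM[i], arrV[i]);
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%lld us (sink=%f)\n", (long long)us, sink);
    }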
