performance - NVIDIA CUDA Thrust device vector allocation is too slow
Does anyone know why the first device vector allocation takes so much longer when the code is compiled in debug mode? In my particular case (NVIDIA Quadro 3000M, CUDA Toolkit 6.0, Windows 7, MSVC 2010) the debug build takes more than 40 seconds on the first run; the next run (with no recompilation) takes about 10 times less (the device vector allocation in the release build takes about 1 second).
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/generate.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <iostream>

    int main(void)
    {
        clock_t t;

        t = clock();
        thrust::host_vector<int> h_vec(100);
        clock_t dt = clock() - t;
        printf("Allocation on host - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        thrust::generate(h_vec.begin(), h_vec.end(), rand);
        dt = clock() - t;
        printf("Initialization on host - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        thrust::device_vector<int> d_vec(100); // the first run of the debug build takes more than 40 seconds...
        dt = clock() - t;
        printf("Allocation on device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        d_vec[0] = h_vec[0];
        dt = clock() - t;
        printf("Copy one to device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        d_vec = h_vec;
        dt = clock() - t;
        printf("Copy all to device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        thrust::sort(d_vec.begin(), d_vec.end());
        dt = clock() - t;
        printf("Sort on device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
        dt = clock() - t;
        printf("Copy to host - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        for (int i = 0; i < 10; i++)
            printf("%d\n", h_vec[i]);
        dt = clock() - t;
        printf("Output - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        std::cin.ignore();
        return 0;
    }

The time you are measuring for the vector instantiation isn't the cost of the vector allocation and initialization; it is the overhead of CUDA runtime and driver context establishment. I suspect that if you changed your code like this:
    int main(void)
    {
        clock_t t;

        ....

        cudaFree(0); // this forces context establishment and pays the lazy runtime overheads

        t = clock();
        thrust::device_vector<int> d_vec(100); // the first run of the debug build takes more than 40 seconds...
        dt = clock() - t;
        printf("Allocation on device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        .....

you should see that the measured vector allocation time becomes the same between the first and second runs, even though the wall clock time still shows a big difference.
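To make that one-time startup cost visible on its own, you could also time the cudaFree(0) call separately before timing the device_vector constructor. A minimal sketch of that idea (the extra "Context establishment" timing is mine, not from the original post):

    #include <thrust/device_vector.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <ctime>

    int main(void)
    {
        clock_t t = clock();
        cudaFree(0); // forces CUDA context creation and lazy runtime initialization
        clock_t dt = clock() - t;
        printf("Context establishment - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        t = clock();
        thrust::device_vector<int> d_vec(100); // now this should measure only the allocation itself
        dt = clock() - t;
        printf("Allocation on device - %f seconds.\n", (float)dt / CLOCKS_PER_SEC);

        return 0;
    }

With the startup cost separated out like this, the allocation time itself should be small and stable across runs.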
I don't have a good explanation for why there is such a big difference in startup time between the first and second runs, but if I had to guess, I would say that some driver-level JIT recompilation happens on the first run and the driver caches the resulting code for later runs. One thing to check is that you are compiling the code for the correct architecture for your GPU, which would eliminate driver recompilation as the source of the time difference.
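To check which architecture to compile for, you could query the device's compute capability at runtime (the Quadro 3000M should report a Fermi-class 2.x capability, but the query avoids having to guess). A minimal sketch, assuming device 0 is the GPU in question:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0); // query properties of device 0
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        // Then build with a matching -gencode option, e.g. for a 2.1 device:
        //   nvcc -gencode arch=compute_20,code=sm_21 ...
        // so the driver does not have to JIT-recompile PTX on the first run.
        return 0;
    }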
The nvprof utility can give you an API trace and timings. You might want to run it and see where in the API call sequence the timing difference arises. It isn't beyond the realm of possibility that you are seeing the effect of some sort of driver bug, but it is impossible to say without more information.
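As an illustration (the exact command line is my assumption, not from the original answer, and ./your_app is a placeholder), an API trace could be collected with something like:

    nvprof --print-api-trace ./your_app

Comparing the traces from the first and second runs should show which API call accounts for the extra time.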