performance - OpenMP with "collapse()" for nested for-loops performs worse than without it
This code:

    double res1[nnn];
    #pragma omp parallel for collapse(3) schedule(dynamic)
    for (int i = 0; i < nnn; i++) {
        for (int j = 0; j < nnn; j++) {
            for (int k = 0; k < nnn; k++) {
                res1[i] = log(fabs(i*j*k));
            }
        }
    }
    std::cout << res1[10] << std::endl;

takes ~50 sec with collapse(3) and only ~6-7 seconds without it. This behavior puzzles me, since I had expected better performance with "collapse" than without.

Am I missing something?
I did some experiments and played with different configurations (nnn = 2500, 24 cores):

    schedule(static) && collapse(3) -> ~54 sec
    schedule(static) && collapse(2) -> ~8 sec
    schedule(static) && collapse(1) -> ~8 sec

I also tried the dynamic schedule, but it takes an enormous amount of time (several minutes).
In my original problem, I have 4-dimensional nested for-loops (over a 4D array): 51x23x51x23.

What is the best way to use OpenMP/MPI to minimize the running time? I have ~300 CPU cores in total. What is the best way to spread the array over these cores? The array lengths are flexible (I can fit them to the number of CPUs).

Any suggestions?
You are missing the impact that dynamic scheduling has on the OpenMP overhead. Dynamic scheduling exists for load-balancing problems, where each loop iteration can take a different amount of time and a static iteration distribution would therefore create work imbalance between the threads. Work imbalance leads to wasted CPU time, as threads that finish earlier simply wait for the other threads to finish. Dynamic scheduling overcomes this by distributing loop chunks on a first come, first served basis. This adds overhead, since the OpenMP runtime system has to keep track of which iterations have been handed out and which have not, and has to implement some type of synchronisation. Each thread must make at least one call into the OpenMP runtime every time it finishes its iteration block and goes looking for another one. With static scheduling the iteration blocks are precomputed and each thread runs over its part without any interaction with the OpenMP runtime environment.
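To make the trade-off concrete, here is a minimal sketch (the function, the loop body and the chunk size 16 are illustrative inventions, not taken from the question) of a loop whose per-iteration cost grows with i. A static split would leave the threads that got the low values of i idle while the others still work; dynamic scheduling balances the load at the price of the bookkeeping described above:

    #include <cmath>

    // Iteration i performs roughly i units of work, so the total work is
    // very unevenly distributed over the iteration space.
    void variable_cost(double *out, int n) {
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < i; j++)   // cost proportional to i
                s += std::sin(j * 0.001);
            out[i] = s;
        }
    }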
The crucial difference between static and dynamic scheduling is the iteration chunk size (i.e. the number of consecutive loop iterations each thread executes before seeking more work in another part of the iteration space). If omitted, the chunk size with static scheduling defaults to #_of_iterations/#_of_threads, while the default chunk size with dynamic scheduling is 1, i.e. each thread has to ask the OpenMP runtime for every single iteration of the distributed loop.
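The chunk size can be given explicitly as the second argument of the schedule clause, which keeps the load balancing of dynamic while cutting down the number of runtime calls. A small sketch (the chunk size of 64 is an arbitrary illustrative value):

    // schedule(static)      -> chunks of about nnn / #threads iterations
    // schedule(dynamic)     -> chunks of 1 iteration: a runtime call after every iteration
    // schedule(dynamic, 64) -> chunks of 64 iterations: a runtime call per 64 iterations
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < nnn; i++)
        res1[i] = log(i + 1.0);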
What happens in your case is that without collapse(3) there are nnn iteration chunks of the outer loop, and each thread runs nnn*nnn iterations (of the inner loops) before asking the OpenMP runtime for another chunk. When you collapse the loops, the number of iteration chunks grows to nnn*nnn*nnn (with nnn = 2500 that is about 1.56*10^10 chunks instead of 2500), and with the default dynamic chunk size of 1 each thread asks the OpenMP runtime for a new chunk after every single iteration.
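If you want to keep collapse(3) together with a dynamic schedule, one option (a sketch under the assumption that your loop bounds stay as posted; I have not timed it) is to request chunks of nnn*nnn collapsed iterations, so that each chunk again corresponds to one complete sweep of the two inner loops:

    // Each dynamic chunk now covers nnn*nnn consecutive collapsed iterations,
    // i.e. exactly one iteration of the outer loop.
    #pragma omp parallel for collapse(3) schedule(dynamic, nnn*nnn)
    for (int i = 0; i < nnn; i++)
        for (int j = 0; j < nnn; j++)
            for (int k = 0; k < nnn; k++)
                res1[i] = log(fabs(1.0*i*j*k));  // 1.0* promotes to double, avoiding int overflow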
This brings up another problem: when the inner loops are collapsed with the outermost one, it can happen that several threads work on iterations with the same value of i. That breaks the computation, since the order of execution is not guaranteed, and the last thread to write to res1[i] might not be the one that executes the last iteration of both inner loops.
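A race-free variant (again just a sketch) parallelises the outer loop only, so that every value of i is processed entirely by one thread and the final value of res1[i] comes from j = k = nnn-1, exactly as in the serial code:

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nnn; i++)
        for (int j = 0; j < nnn; j++)
            for (int k = 0; k < nnn; k++)
                res1[i] = log(fabs(1.0*i*j*k));  // 1.0* promotes to double, avoiding int overflow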