c - Parallelizing LibTomMath's Karatsuba impl. with OpenMP failed -


i thought i'll give openmp try before roll myself (it?) failed quite miserably.

the karatsuba multiplying , squaring needed small change recurse instead of common multiplication branches in bn_mp_mul.c , addition of openmp pragmas.

changes karatsuba multiplication diff (squaring done in same manner) libtommath master:

diff -rtbbw libtommath-master-original/bn_mp_mul.c libtommath-master/bn_mp_mul.c 32a34,37 > #ifdef use_open_mp > #include <omp.h> >          #pragma omp parallel >          #pragma omp master 33a39,41 > #else >          res = mp_karatsuba_mul(a, b, c); > #endif diff -rtbbw libtommath-master-original/bn_mp_karatsuba_mul.c  libtommath-master/bn_mp_karatsuba_mul.c 51a52,54 > #ifdef use_open_mp >    int e1,e2; > #endif 57c60,62 <  --- >    if (b < karatsuba_mul_cutoff) { >       return mp_mul(a,b,c); >    } 120c125,135 <    if (mp_mul(&x0, &y0, &x0y0) != mp_okay) --- > #ifdef use_open_mp > #include <omp.h> >    #pragma omp task shared(x0y0) >    e1 = mp_karatsuba_mul(&x0, &y0, &x0y0); >    #pragma omp task shared(x1y1) >    e2 = mp_karatsuba_mul(&x1, &y1, &x1y1); >    #pragma omp taskwait >    if ((e1 != mp_okay) || (e2 != mp_okay)) >       goto x1y1; > #else >    if (mp_karatsuba_mul(&x0, &y0, &x0y0) != mp_okay) 122c137 <    if (mp_mul(&x1, &y1, &x1y1) != mp_okay) --- >    if (mp_karatsuba_mul(&x1, &y1, &x1y1) != mp_okay) 123a139 > #endif 130,131d145 <    if (mp_mul(&t1, &x0, &t1) != mp_okay) <       goto x1y1;          /* t1 = (x1 + x0) * (y1 + y0) */ 132a147,157 > #ifdef use_open_mp > #include <omp.h> >    #pragma omp task shared(t1) >    e1 = mp_karatsuba_mul(&t1, &x0, &t1); >    #pragma omp taskwait >    if (e1 != mp_okay) >       goto x1y1; > #else >    if (mp_karatsuba_mul(&t1, &x0, &t1) != mp_okay) >       goto x1y1;          /* t1 = (x1 + x0) * (y1 + y0) */ > #endif diff -rtbbw libtommath-master-original/makefile  libtommath-master/makefile 17c17,19 < cflags += -o3 -funroll-loops --- > #cflags += -o3 -funroll-loops > # oficial macro _openmp > cflags += -g3 -o3  -funroll-loops -fopenmp -duse_open_mp 

using openmp's task recommended many such recursive function. test computing log(2) 50,000 bits (with libtomfloat) 1 thread (by adding num_threads(1) after parallel) gave expected result on amd a8-6600k (with gcc-4.8.real (ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4 , kernel linux debian2 3.16.0-38-generic, 64-bit).

real    0m7.748s user    0m7.728s sys 0m0.013s 

but more threads? well...with 2 threads

real    0m8.025s user    0m14.936s sys 0m0.060s 

and 4 threads

real    0m14.198s user    0m48.224s sys 0m0.109s 

you can see cpus sweating in graphing system-monitor no avail here.

i used atanh() method compute constant log(2) in libtomfloat , implemented calculation of acoth() binary splitting algorithm. that recursion works (switched off test above) , implemented openmp's task in same way karatsuba recursion. not gain log(2) (about 5%), gains @ least something although 5% quite close statistical error.

so, did wrong here?

edit

i ran more tests try overhead suggested richardbruce. karatsube set 250 (was 120 not 48, sorry richardbruce) optimum karatsuba 1.1 min. real time (for 200k bits) 1 , 4 threads , 1.5min. user time 4 threads: no gain. cutoff set 500 1.5 min. real time 1 thread , 1.3 4 threads , 2.3 min. user time 4 threads: relative gain absolute loss.

i didn't expect overhead openmp. handrolled pthreads or find grave error?


Comments

Popular posts from this blog

how to insert data php javascript mysql with multiple array session 2 -

multithreading - Exception in Application constructor -

windows - CertCreateCertificateContext returns CRYPT_E_ASN1_BADTAG / 8009310b -