multithreading - Performance loss with OpenMP in Fortran


First of all, I apologize for any grammatical errors. English is not my native language.

I am trying to reduce the performance loss that appears when I increase the number of threads using OpenMP in Fortran.

I have a machine with two Intel Xeon X5650 processors (12 physical cores in total) and 96 GB of RAM.

The best results I have achieved are:

1 proc -> 15.50 seconds; 2 proc -> 8.10 seconds; 4 proc -> 4.42 seconds; 8 proc -> 2.81 seconds; 12 proc -> 2.43 seconds

As you can see, the improvement gets smaller as more threads are used.
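To put numbers on that diminishing return, the timings above can be turned into speedup (time on 1 thread divided by time on p threads) and parallel efficiency (speedup divided by p). The following is only a small sketch; the program and variable names are made up for illustration, and the only data it uses are the five timings quoted above.

```fortran
program speedup_table
  implicit none
  ! Thread counts and wall-clock times copied from the timings above.
  integer, parameter :: n = 5
  integer, parameter :: threads(n) = [1, 2, 4, 8, 12]
  real, parameter :: seconds(n) = [15.50, 8.10, 4.42, 2.81, 2.43]
  real :: speedup, efficiency
  integer :: k

  write(*, '(a)') 'threads  speedup  efficiency'
  do k = 1, n
    speedup = seconds(1) / seconds(k)         ! T(1) / T(p)
    efficiency = speedup / real(threads(k))   ! speedup per thread
    write(*, '(i7, f9.2, f12.2)') threads(k), speedup, efficiency
  end do
end program speedup_table
```

With these timings the efficiency falls from roughly 0.96 on 2 threads to roughly 0.53 on 12 threads, which is the loss the question is about.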

Here is the code:

    allocate(SUMA(PUNTOST, PUNTOSP, NUM_DATA, 2))
    allocate(SUMATORIO(1))
    allocate(SUMATORIO(1)%regiones(REGIONES))
    DO i = 1, REGIONES
        allocate(SUMATORIO(1)%regiones(i)%valor(2, PUNTOSP, PUNTOST))
        SUMATORIO(1)%regiones(i)%valor = cmplx(0.0, 0.0)
    END DO
    allocate(valor_aux(2, PUNTOSP, PUNTOST))
    !...
    call SYSTEM_CLOCK(counti, count_rate)
    !$OMP PARALLEL NUM_THREADS(THREADS) DEFAULT(PRIVATE) FIRSTPRIVATE(REGIONES) &
    !$OMP SHARED(SUMATORIO, SUMA, PP, VEC_1, VEC_2, IDENT, TIPO, MUESTRA, PUNTOST, PUNTOSP)
    !$OMP DO SCHEDULE(DYNAMIC, 8)
    DO i = 1, REGIONES
        INDICE = VEC_1(i)
        valor_aux = cmplx(0.0, 0.0)
        DO j = 1, VEC_2(i)
            ii = IDENT(INDICE + 1)
            INDICE = INDICE + 1
            IF (TIPO(ii) .ne. 4) THEN
                J1 = MUESTRA(ii)
                DO I1 = 1, PUNTOST
                    DO I2 = 1, PUNTOSP
                        valor_aux(1, I2, I1) = valor_aux(1, I2, I1) + SUMA(I1, I2, J1, 1) * PP(ii)
                        valor_aux(2, I2, I1) = valor_aux(2, I2, I1) + SUMA(I1, I2, J1, 2) * PP(ii)
                    END DO
                END DO
            END IF
        END DO
        SUMATORIO(1)%regiones(i)%valor = valor_aux
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
    call SYSTEM_CLOCK(countf)
    dt = REAL(countf - counti) / REAL(count_rate)
    write(*, *) 'FASE_1: Time: ', dt, ' seconds'

  • Some points to note:

    • Except for the loop vectors, all data types are COMPLEX
    • NUM_DATA = 14000000
    • REGIONES = 1000000
    • The values contained in VEC_2 are between …
    • PUNTOSP = 20
    • All the allocated memory consumes approximately 60 GB of RAM

    I have tried changing the order of the array dimensions to reduce cache misses (e.g. (2, PUNTOSP, PUNTOST, NUM_DATA)), but the layout shown in the code is the one that gave me the best performance. I do not know why, because most of the documents I have read say that memory should be accessed contiguously so that the CPU can cache it and cache misses are minimized.

    In addition, I have changed the memory alignment to 32, 64 and 128 bytes, but nothing improved.

    I have also changed the SCHEDULE option to STATIC with different chunk sizes and to DYNAMIC with different chunk sizes, but the results are similar or worse.
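For this kind of comparison, OpenMP's SCHEDULE(RUNTIME) lets the schedule kind and chunk size be taken from the OMP_SCHEDULE environment variable, so each variant can be timed without recompiling. The program below is a minimal stand-alone sketch, not the code from the question: the loop body, the array-free reduction, and the size n are placeholders chosen only so that it compiles and runs.

```fortran
program schedule_runtime
  use omp_lib
  implicit none
  ! Hypothetical problem size, for illustration only.
  integer, parameter :: n = 100000
  real(kind=8) :: total, t0, t1
  integer :: i

  total = 0.0d0
  t0 = omp_get_wtime()
  ! The schedule is read from OMP_SCHEDULE when the loop starts.
  !$omp parallel do schedule(runtime) reduction(+:total)
  do i = 1, n
    total = total + sqrt(real(i, kind=8))
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  write(*, *) 'sum = ', total, ' time = ', t1 - t0, ' seconds'
end program schedule_runtime
```

Assuming a compiler with OpenMP support (for example gfortran with -fopenmp), the same executable can then be run with OMP_SCHEDULE set to values such as "static,64" or "dynamic,8" to compare the schedules.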

    Do you have any ideas that I could use to improve performance when using 8 or more cores?

    Thank you very much for your attention and support.

  • Your result is already quite good: with 12 processors, the CPU time is divided by about 6. In my applications, I rarely get more than a factor of 4 or 5 (but there is always a sequential part, which probably explains it).

    You can try the COLLAPSE clause, which allows two nested loops to be merged into a single one ... but I do not know if this is possible in your case, because there are conditions to fulfill (for example, no instructions between the two loops).
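As a rough illustration of what the COLLAPSE clause looks like, here is a minimal stand-alone sketch. It is not the loop nest from the question: the array acc, the sizes n_outer and n_inner, and the loop body are hypothetical and were chosen only to give a perfectly nested pair of loops with nothing between the two DO statements.

```fortran
program collapse_sketch
  implicit none
  ! Hypothetical sizes, chosen only for illustration.
  integer, parameter :: n_outer = 21, n_inner = 20
  real :: acc(n_inner, n_outer)
  integer :: i1, i2

  acc = 0.0
  ! collapse(2) merges both loops into one iteration space of
  ! n_outer * n_inner iterations, which is then divided among the threads.
  !$omp parallel do collapse(2)
  do i1 = 1, n_outer
    do i2 = 1, n_inner
      acc(i2, i1) = real(i1) + real(i2)
    end do
  end do
  !$omp end parallel do

  write(*, *) 'sum of acc = ', sum(acc)
end program collapse_sketch
```

Each element of acc is written by exactly one iteration, so no synchronization is needed. A loop nest with statements between the two DO lines would have to be restructured before COLLAPSE could be applied to it.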

