September 29, 2009


  1. BlueGene scaling results
  2. Clumps
  3. HYPRE
BlueGene Scaling

I have performed a simple scaling test on BlueGene/P:

Module/Problem description:  Shocked clump (template.shockedclump.f90)
Geometry:                    2D
Source terms used:           Cylindrical, Cooling
Resolution:                  16,384 x 4,096 (2,048 cells per clump radius)
Equivalent resolution:       8,192^2, or 406^3
Number of processors:        256    (16 x 16 decomposition)
                             512    (32 x 16 decomposition)
                             1,024  (64 x 16 decomposition)
Grid size per core:          256:   512^2, or 64^3
                             512:   362^2, or 51^3
                             1,024: 256^2, or 40^3
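
(For reference, the "equivalent resolution" and per-core grid sizes above follow directly from the total cell count:)

    16,384 x 4,096 = 67,108,864 cells total
    67,108,864^(1/2) ~ 8,192   and   67,108,864^(1/3) ~ 406
    67,108,864 / 256 = 262,144 = 512^2 = 64^3 cells per core, and similarly for 512 and 1,024 cores.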

Below is a plot of TGHOST, the time spent populating ghost cells.

We see several things:

  1. The 512 run was faster than the 256 run, though not twice as fast (~0.74x the 256 runtime rather than the ideal 0.5x).
  2. The 1,024 run was about the same speed as the 256 run.
  3. The fraction of time spent filling ghost cells increased from 28% at 256 cores, to 50% at 512, to 82% at 1,024.

I was unable to launch the problem on 32 nodes/128 cores (724^2, or 81^3, cells per core): the master process runs out of memory when allocating the q arrays. However, I was able to run it on 64 nodes/128 cores by specifying mpirun -mode DUAL instead of mpirun -mode VN. This makes each node act as if it has 2 cores instead of 4, allowing more memory per core. (Update: if we comment out the place where, during initial testing, the master allocates 3 Info structures, then the 32-node/128-core configuration runs.) This job took longer and had a reduced TGHOST fraction (~18%), implying that computation, not communication, was the bottleneck. In table form:

Nprocs   t_run / t_run,128   Ideal   Us vs. ideal
128      1.                  1.      1.
256      0.60                0.5     1.2
512      0.44                0.25    1.76
1,024    0.59                0.125   4.72
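
(The "Us vs. ideal" column is, as I read it, the measured runtime ratio divided by the ideal ratio; e.g., for 512 cores:)

    (t_512 / t_128) / (128 / 512) = 0.44 / 0.25 = 1.76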

In graphical form:

For comparison, here are the results from a previous meeting where I discussed fixed-grid scaling, at 1, 4, 16, and 64 processors on the Rice cluster (Myrinet):


(The numbers in parentheses are the N^2 values at 4, 16, and 64 procs.)

The problem I ran on BG/P was larger (it would be 4,096^2 per processor at 4 procs, 2,048^2 at 16, and 1,024^2 at 64). We appear to see similar behavior, however: for a given problem size there is an ideal number of processors, beyond which communication time eats up the gains.

Conclusion: I anticipate an improvement on BG/P over bluehive for large problems if we run at ~50^3 cells per core. The main benefit of BG/P in my mind right now is that no one is using it. There are 4,096 cores total, allocatable in blocks of 256. This means that, at the best anticipated performance and using all 4 cores per node, the expected problem sizes and the maximum number of concurrent jobs are:

Expected problem sizes for 2D hydro w/ tracers (7 components in q)

Job size       Simulation resolution    Simulation resolution      Number of
               (at 50^3 per core)       (at 128^3 per core;        concurrent jobs
                                        est. upper bound)
256 cores      320^3                    800^3                      16
512 cores      400^3                    1,024^3                    8
1,024 cores    500^3                    1,280^3                    4
2,048 cores    630^3                    1,620^3                    2
4,096 cores    800^3                    2,048^3                    1
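
(The resolutions in the table are just N_cores^(1/3) times the per-core linear size; e.g., for 256 cores:)

    256^(1/3) x 50  ~ 6.35 x 50  ~ 317  ->  ~320^3
    256^(1/3) x 128 ~ 6.35 x 128 ~ 813  ->  ~800^3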

Finally, this problem is ~1.78x larger than my R_1536 run, which was 12,288 x 3,072 (6,144^2, or 335^3, vs. 8,192^2, or 406^3). That simulation took roughly 2 months to run with base + 2 levels of AMR. The 512-core run was on track to finish the present simulation in about the same time (less than 60 days). Note that BG/P's cores are about 1/3 slower than bluehive's, so we would expect the 256-core run to be roughly on par with a 64-core job on bluehive, given the increased amount of communication at 256 cores vs. 64.
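
(The ~1.78x figure is just the ratio of total cell counts:)

    (16,384 x 4,096) / (12,288 x 3,072) = (8,192 / 6,144)^2 = (4/3)^2 ~ 1.78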


Clumps stuff

Why do these look different?


Top: adiabatic. Bottom: cooling

We know that the short answer is the location of the bow shock, that is, the bow-shock stand-off distance. Why should that matter?

It matters because of the velocity gradient between the bow shock and the clump. Here is an on-axis plot of the density:

And here is a plot of velocity on-axis, where it's easier to see the adiabatic bow shock:

We can examine the velocity shear if we zoom into the bow shock/clump surface region:

I haven't yet found an analytic expression relating the shear magnitude to the KH growth rate, though I'm sure one exists. Regardless, this answers the question: the greatly increased shear at the clump surface drives much faster-growing KH modes.

Something else to note: when considering the growth of KH instabilities at the clump surface, people typically use

t_KH = χ^(1/2) / (k v_rel)

where χ is the density contrast between the clump and the ambient preshock material, k is the wavenumber, and v_rel is the relative velocity between the postshock material and the clump. This velocity is typically assumed to be the postshock velocity. But as we can see from the images above, the velocity is reduced by a factor of 0.6--0.8 by the time it reaches the clump surface. This implies a KH growth time that is longer by a factor of 1.25--1.67, whereas it is typically assumed that t_KH ~ t_cc, the cloud-crushing time. Considering that we see KH growth on a timescale much shorter than t_cc, the shear strength must play a large part in setting the KH growth times.
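
(The 1.25--1.67 factor follows directly from t_KH being proportional to 1/v_rel; writing v_ps for the postshock velocity:)

    v_rel = 0.8 v_ps  =>  t_KH increases by 1/0.8 = 1.25
    v_rel = 0.6 v_ps  =>  t_KH increases by 1/0.6 ~ 1.67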

A similar argument, involving the density gradient (stratification), applies to the greatly reduced RT growth times that we observe.


An extremely brief overview of how HYPRE is used

In general, the prescription for using HYPRE is the same regardless of the physical problem being solved. HYPRE solves a system of the form

A⋅b = u

where A is a matrix, b the solution vector, and u the source vector. To get your physical equation (Poisson's equation for self-gravity; the diffusion equation for radiative transfer in the diffusive limit; etc.) into this matrix form, you discretize it and shuffle the components around a bit. Then, in order to use HYPRE to solve a problem, you do the following:

  1. Specify the matrix elements of A for each point in the grid.
  2. Specify the vector elements of u for each point in the grid.
  3. Specify the locations of the matrix elements via a "stencil" for each point in the grid. This is how different subdomains (or AMR levels) communicate information. Stencils indicate what cells border each other physically, and what the weighting should be (the weighting will be different across AMR levels).
  4. Specify boundary conditions as additional elements in the matrix A and vector u.
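
As a concrete (if simplified) example of what steps 1-3 amount to, consider Poisson's equation for self-gravity, del^2 φ = 4πGρ, discretized with the standard 5-point stencil on a uniform 2D grid of spacing Δx (the indices i, j here are just for illustration, not AstroBEAR's actual layout):

    (φ_{i+1,j} + φ_{i-1,j} + φ_{i,j+1} + φ_{i,j-1} - 4 φ_{i,j}) / Δx^2 = 4πG ρ_{i,j}

So the row of A belonging to cell (i,j) has -4/Δx^2 on the diagonal and 1/Δx^2 for each of its 4 neighbors (those 5 offsets are the stencil), the corresponding element of u is 4πG ρ_{i,j}, and the corresponding element of b is the unknown φ_{i,j}. Across an AMR level boundary the neighbor weights change, since Δx does.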

This information needs to be specified by every processor for every grid it possesses. Once this is done, a particular solver is chosen from HYPRE's list of available solvers, and the GO button is pushed.

Then the black box thinks about it for a while and returns something in the solution vector b. The user must then extract the data from this vector and put it back into the state variable q, in one of the components labeled as being for Elliptic Variables. The variable update (e.g., updating the velocity and energy based on the gravitational potential φ for self-gravity) is done elsewhere.
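
To make the above a bit more concrete, here is a rough sketch of what those steps look like against HYPRE's Struct interface. HYPRE's native API is C (AstroBEAR would go through the Fortran wrappers), and the function name, patch extents, stencil, and choice of SMG solver below are placeholders for illustration, not our actual setup:

    #include <mpi.h>
    #include "HYPRE_struct_mv.h"
    #include "HYPRE_struct_ls.h"

    /* Hypothetical helper: solve A.b = u on one 2D patch owned by this processor.
     * ilower/iupper give the patch extents in global index space; stencil_vals
     * holds the 5 matrix entries for every cell (with boundary conditions already
     * folded in, per step 4); source_vals holds u; phi_out receives the solution b.
     * None of these names come from AstroBEAR. */
    void solve_patch(MPI_Comm comm, int ilower[2], int iupper[2],
                     double *stencil_vals, double *source_vals, double *phi_out)
    {
        HYPRE_StructGrid    grid;
        HYPRE_StructStencil stencil;
        HYPRE_StructMatrix  A;
        HYPRE_StructVector  u, b;
        HYPRE_StructSolver  solver;
        int offsets[5][2] = {{0,0}, {-1,0}, {1,0}, {0,-1}, {0,1}};
        int entries[5]    = {0, 1, 2, 3, 4};
        int e;

        /* Describe the patch(es) this processor owns. */
        HYPRE_StructGridCreate(comm, 2, &grid);
        HYPRE_StructGridSetExtents(grid, ilower, iupper);
        HYPRE_StructGridAssemble(grid);

        /* Step 3: the 5-point stencil -- center plus 4 neighbors, as index offsets. */
        HYPRE_StructStencilCreate(2, 5, &stencil);
        for (e = 0; e < 5; e++)
            HYPRE_StructStencilSetElement(stencil, e, offsets[e]);

        /* Step 1: the matrix elements of A for every cell in the patch. */
        HYPRE_StructMatrixCreate(comm, grid, stencil, &A);
        HYPRE_StructMatrixInitialize(A);
        HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, 5, entries, stencil_vals);
        HYPRE_StructMatrixAssemble(A);

        /* Step 2: the source vector u, plus an initial guess for the solution b. */
        HYPRE_StructVectorCreate(comm, grid, &u);
        HYPRE_StructVectorInitialize(u);
        HYPRE_StructVectorSetBoxValues(u, ilower, iupper, source_vals);
        HYPRE_StructVectorAssemble(u);

        HYPRE_StructVectorCreate(comm, grid, &b);
        HYPRE_StructVectorInitialize(b);
        HYPRE_StructVectorSetConstantValues(b, 0.0);
        HYPRE_StructVectorAssemble(b);

        /* Pick a solver (SMG, purely as an example) and push GO.  HYPRE's argument
         * order is (solver, A, rhs, solution); in this page's notation the rhs is
         * u and the solution is b. */
        HYPRE_StructSMGCreate(comm, &solver);
        HYPRE_StructSMGSetTol(solver, 1.0e-8);
        HYPRE_StructSMGSetup(solver, A, u, b);
        HYPRE_StructSMGSolve(solver, A, u, b);

        /* Pull the solution out; the caller copies it into q's elliptic component. */
        HYPRE_StructVectorGetBoxValues(b, ilower, iupper, phi_out);

        HYPRE_StructSMGDestroy(solver);
        HYPRE_StructMatrixDestroy(A);
        HYPRE_StructVectorDestroy(u);
        HYPRE_StructVectorDestroy(b);
        HYPRE_StructStencilDestroy(stencil);
        HYPRE_StructGridDestroy(grid);
    }

The per-cell stencil weights and source would come from a discretization like the one sketched earlier; with AMR, HYPRE's semi-structured (SStruct) interface would presumably replace the plain Struct one, but the create/initialize/set values/assemble/solve/get values flow is the same.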

At present, the effects of self-gravity are applied at the same time as the other source terms. I'm hoping this will be sufficient and that we won't need to do a separate update just for self-gravity.