Sample layout for a simulation with a grid of [nx,ny,nz] grid points per processor. There are (px,py,pz) processors along the x, y, and z directions. The number of ghost layers is set here to NG and is of course variable in reality. This results in a global grid size of mxgrid=nx*px+2*NG in a monolithic layout. Likewise, the global grid size without ghost cells is defined as nxgrid=nx*px. The chunk dimensions are denoted in square brackets, the layout of the chunks across the writing processors in parentheses.

The main idea in the collect-x variant is to first combine the data along the x direction via MPI. This is done on all processors in parallel, and the communication within each x-pencil is independent of the others. In small setups with px=1 there would be no need for any communication along x. We can include the lower and upper x ghost layers in the collection along x, since one of these layers is in any case already local on the collecting processor. The group and dataset structure and dimensions would then be as follows:

data/
    ax      [mxgrid,ny,nz],(py,pz)
    ay      [mxgrid,ny,nz],(py,pz)
    az      [mxgrid,ny,nz],(py,pz)
    lnTT    [mxgrid,ny,nz],(py,pz)
    lnrho   [mxgrid,ny,nz],(py,pz)
    ux      [mxgrid,ny,nz],(py,pz)
    uy      [mxgrid,ny,nz],(py,pz)
    uz      [mxgrid,ny,nz],(py,pz)
ghost/
    y,z => see "chunked" scheme, only for y and z layers

The alternative is not to include the x ghost layers in the collection along x:

data/
    ax      [nxgrid,ny,nz],(py,pz)
    ay      [nxgrid,ny,nz],(py,pz)
    az      [nxgrid,ny,nz],(py,pz)
    lnTT    [nxgrid,ny,nz],(py,pz)
    lnrho   [nxgrid,ny,nz],(py,pz)
    ux      [nxgrid,ny,nz],(py,pz)
    uy      [nxgrid,ny,nz],(py,pz)
    uz      [nxgrid,ny,nz],(py,pz)
ghost/
    x,y,z => see "chunked" scheme

In summary, note that the actual writing (and access to the HDF5 library) is done here by only py×pz processors, therefore the chunks can be larger. The preceding collection step might be a bottleneck, because per x-pencil a single processor collects the data of all px processors, which is a sequential communication! Only after the collection step has finished may the writing via HDF5 calls begin. I wonder to what extent the above schemes perform better than the "chunked" scheme.
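
To make the collection step concrete, here is a minimal C/MPI sketch of how one x-pencil could be gathered onto its root rank. It assumes a 3D Cartesian communicator ordered as (x,y,z), contiguous double-precision data stored x-fastest, and it gathers only the nx interior points per processor (the mxgrid variant would additionally attach the NG outer ghost layers, one of which is already local on the collector). The routine name collect_x, the argument list, and the reordering loop are illustrative assumptions, not the actual Pencil Code implementation:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Collect the x-distributed interior blocks of one (py,pz) pencil
     * onto the pencil's root rank; only that root later calls HDF5.
     * local : interior block of size nx*ny*nz, stored x-fastest ([z][y][x])
     * return: buffer of size nxgrid*ny*nz on the pencil root, NULL elsewhere */
    double *collect_x(MPI_Comm cart, const double *local,
                      int nx, int ny, int nz, int px)
    {
        int rank, coords[3];
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 3, coords);   /* coords = (ipx,ipy,ipz) */

        /* ranks sharing (ipy,ipz) form one x-pencil; py*pz pencils
           gather concurrently and independently of each other */
        MPI_Comm xcomm;
        MPI_Comm_split(cart, coords[1] * 65536 + coords[2], coords[0], &xcomm);

        int xrank;
        MPI_Comm_rank(xcomm, &xrank);

        const size_t nloc   = (size_t)nx * ny * nz;
        const int    nxgrid = nx * px;

        double *stacked = NULL, *global = NULL;
        if (xrank == 0) {
            stacked = malloc((size_t)px * nloc * sizeof(double));
            global  = malloc((size_t)nxgrid * ny * nz * sizeof(double));
        }

        /* sequential fan-in: the root receives the data of all px ranks */
        MPI_Gather(local, (int)nloc, MPI_DOUBLE,
                   stacked, (int)nloc, MPI_DOUBLE, 0, xcomm);

        if (xrank == 0) {
            /* interleave the px stacked blocks into a contiguous x-direction */
            for (int p = 0; p < px; p++)
                for (int k = 0; k < nz; k++)
                    for (int j = 0; j < ny; j++)
                        memcpy(global + ((size_t)k * ny + j) * nxgrid + (size_t)p * nx,
                               stacked + (size_t)p * nloc + ((size_t)k * ny + j) * nx,
                               (size_t)nx * sizeof(double));
            free(stacked);
            /* the mxgrid variant would prepend/append the NG outer x ghost
               layers here; one of them is already local on this rank */
        }

        MPI_Comm_free(&xcomm);
        return global;   /* non-NULL only on the collecting processor */
    }

The explicit reordering could be avoided by letting the collector write each gathered block as its own hyperslab, at the cost of more, smaller HDF5 writes.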
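
The subsequent write by the py×pz collecting processors could then look roughly like the following parallel-HDF5 sketch for one variable of the nxgrid variant. It assumes wcomm contains only the writers (e.g. obtained from an MPI_Comm_split over the pencil roots) and that the buffer comes from the collection step above; write_collected and its arguments are hypothetical, and in practice the file and group would of course be created once rather than per variable. The dimension arrays appear reversed relative to the [nxgrid,ny,nz] notation above because the HDF5 C API uses row-major ordering:

    #include <hdf5.h>
    #include <mpi.h>

    /* Write one collected variable (e.g. "ux") into data/<dname> from the
     * py*pz writers; ipy,ipz give this writer's position in the (py,pz) layout. */
    void write_collected(MPI_Comm wcomm, const char *fname, const char *dname,
                         const double *buf, hsize_t nxgrid, hsize_t ny, hsize_t nz,
                         int ipy, int ipz, int py, int pz)
    {
        /* the file is accessed via MPI-IO by the writers only */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, wcomm, MPI_INFO_NULL);
        hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* global dataset [nzgrid,nygrid,nxgrid] in C order, chunked as
           [nz,ny,nxgrid], i.e. exactly one chunk per writing processor */
        hsize_t gdims[3] = { nz * pz, ny * py, nxgrid };
        hsize_t cdims[3] = { nz, ny, nxgrid };
        hid_t fspace = H5Screate_simple(3, gdims, NULL);
        hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 3, cdims);

        hid_t group = H5Gcreate(file, "data", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset  = H5Dcreate(group, dname, H5T_NATIVE_DOUBLE, fspace,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* each writer selects and writes its own chunk as a hyperslab */
        hsize_t start[3] = { (hsize_t)ipz * nz, (hsize_t)ipy * ny, 0 };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, cdims, NULL);
        hid_t mspace = H5Screate_simple(3, cdims, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Gclose(group);
        H5Pclose(dcpl); H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
    }

Because each chunk coincides with one writer's collected block, every writer touches exactly one chunk, which is the reason the chunks can be larger than in the "chunked" scheme.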