Sample layout for a simulation with a grid of [nx,ny,nz] grid points per processor. There are (px,py,pz) processors along the x, y, and z directions. The number of ghost layers is set here to NG and is of course variable in reality. This results in a global grid size of mxgrid=nx*px+2*NG in a monolithic layout. Likewise, the global grid size without ghost cells is defined as nxgrid=nx*px. The chunk dimensions are denoted in square brackets, the layout of the chunks across the writing processors in parentheses.

The main idea in the collect-x variant is to first combine the data along the x direction via MPI. This is done on all processors in parallel, and the communication within each x-pencil is independent of the others. In small setups with px=1 there would be no need for any communication along x. We can include the lower and upper x ghost layers in the collection along x, since one of these layers is in any case already local on the collecting processor. The group and dataset structure and dimensions would then be as follows:

data/
    ax      [mxgrid,ny,nz],(py,pz)
    ay      [mxgrid,ny,nz],(py,pz)
    az      [mxgrid,ny,nz],(py,pz)
    lnTT    [mxgrid,ny,nz],(py,pz)
    lnrho   [mxgrid,ny,nz],(py,pz)
    ux      [mxgrid,ny,nz],(py,pz)
    uy      [mxgrid,ny,nz],(py,pz)
    uz      [mxgrid,ny,nz],(py,pz)
ghost/
    y,z => see "chunked" scheme, only for y and z layers

The alternative is not to include the x ghost layers in the collection along x:

data/
    ax      [nxgrid,ny,nz],(py,pz)
    ay      [nxgrid,ny,nz],(py,pz)
    az      [nxgrid,ny,nz],(py,pz)
    lnTT    [nxgrid,ny,nz],(py,pz)
    lnrho   [nxgrid,ny,nz],(py,pz)
    ux      [nxgrid,ny,nz],(py,pz)
    uy      [nxgrid,ny,nz],(py,pz)
    uz      [nxgrid,ny,nz],(py,pz)
ghost/
    x,y,z => see "chunked" scheme

In summary, note that the actual writing (and access to the HDF5 library) is done here by only py×pz processors, therefore the chunks can be larger. The preceding collection step might be a bottleneck, because per x-pencil a single processor collects the data of all px processors, which is a sequential communication! Only after the collection step has finished may the writing via HDF5 calls begin. I wonder to what extent the above schemes perform better than the "chunked" scheme.
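
To make the collection step concrete, here is a minimal C/MPI sketch of how one x-pencil could be gathered onto its root rank. It assumes a 3D Cartesian communicator ordered as (x,y,z), contiguous double-precision data stored x-fastest, and it gathers only the nx interior points per processor (the mxgrid variant would additionally attach the NG outer ghost layers, one of which is already local on the collector). The routine name collect_x, the argument list, and the reordering loop are illustrative assumptions, not the actual Pencil Code implementation:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Collect the x-distributed interior blocks of one (py,pz) pencil
     * onto the pencil's root rank; only that root later calls HDF5.
     * local : interior block of size nx*ny*nz, stored x-fastest ([z][y][x])
     * return: buffer of size nxgrid*ny*nz on the pencil root, NULL elsewhere */
    double *collect_x(MPI_Comm cart, const double *local,
                      int nx, int ny, int nz, int px)
    {
        int rank, coords[3];
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 3, coords);   /* coords = (ipx,ipy,ipz) */

        /* ranks sharing (ipy,ipz) form one x-pencil; py*pz pencils
           gather concurrently and independently of each other */
        MPI_Comm xcomm;
        MPI_Comm_split(cart, coords[1] * 65536 + coords[2], coords[0], &xcomm);

        int xrank;
        MPI_Comm_rank(xcomm, &xrank);

        const size_t nloc   = (size_t)nx * ny * nz;
        const int    nxgrid = nx * px;

        double *stacked = NULL, *global = NULL;
        if (xrank == 0) {
            stacked = malloc((size_t)px * nloc * sizeof(double));
            global  = malloc((size_t)nxgrid * ny * nz * sizeof(double));
        }

        /* sequential fan-in: the root receives the data of all px ranks */
        MPI_Gather(local, (int)nloc, MPI_DOUBLE,
                   stacked, (int)nloc, MPI_DOUBLE, 0, xcomm);

        if (xrank == 0) {
            /* interleave the px stacked blocks into a contiguous x-direction */
            for (int p = 0; p < px; p++)
                for (int k = 0; k < nz; k++)
                    for (int j = 0; j < ny; j++)
                        memcpy(global + ((size_t)k * ny + j) * nxgrid + (size_t)p * nx,
                               stacked + (size_t)p * nloc + ((size_t)k * ny + j) * nx,
                               (size_t)nx * sizeof(double));
            free(stacked);
            /* the mxgrid variant would prepend/append the NG outer x ghost
               layers here; one of them is already local on this rank */
        }

        MPI_Comm_free(&xcomm);
        return global;   /* non-NULL only on the collecting processor */
    }

The explicit reordering could be avoided by letting the collector write each gathered block as its own hyperslab, at the cost of more, smaller HDF5 writes.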
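
The subsequent write by the py×pz collecting processors could then look roughly like the following parallel-HDF5 sketch for one variable of the nxgrid variant. It assumes wcomm contains only the writers (e.g. obtained from an MPI_Comm_split over the pencil roots) and that the buffer comes from the collection step above; write_collected and its arguments are hypothetical, and in practice the file and group would of course be created once rather than per variable. The dimension arrays appear reversed relative to the [nxgrid,ny,nz] notation above because the HDF5 C API uses row-major ordering:

    #include <hdf5.h>
    #include <mpi.h>

    /* Write one collected variable (e.g. "ux") into data/<dname> from the
     * py*pz writers; ipy,ipz give this writer's position in the (py,pz) layout. */
    void write_collected(MPI_Comm wcomm, const char *fname, const char *dname,
                         const double *buf, hsize_t nxgrid, hsize_t ny, hsize_t nz,
                         int ipy, int ipz, int py, int pz)
    {
        /* the file is accessed via MPI-IO by the writers only */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, wcomm, MPI_INFO_NULL);
        hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* global dataset [nzgrid,nygrid,nxgrid] in C order, chunked as
           [nz,ny,nxgrid], i.e. exactly one chunk per writing processor */
        hsize_t gdims[3] = { nz * pz, ny * py, nxgrid };
        hsize_t cdims[3] = { nz, ny, nxgrid };
        hid_t fspace = H5Screate_simple(3, gdims, NULL);
        hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 3, cdims);

        hid_t group = H5Gcreate(file, "data", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset  = H5Dcreate(group, dname, H5T_NATIVE_DOUBLE, fspace,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* each writer selects and writes its own chunk as a hyperslab */
        hsize_t start[3] = { (hsize_t)ipz * nz, (hsize_t)ipy * ny, 0 };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, cdims, NULL);
        hid_t mspace = H5Screate_simple(3, cdims, NULL);

        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset); H5Gclose(group);
        H5Pclose(dcpl); H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
    }

Because each chunk coincides with one writer's collected block, every writer touches exactly one chunk, which is the reason the chunks can be larger than in the "chunked" scheme.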