Sample layout for a simulation with [nx,ny,nz] grid points per processor and (px,py,pz) processors along the x, y, and z directions. The number of ghost layers is denoted NG here; in practice it is variable. In a monolithic layout, this results in a global grid size of mxgrid=nx*px+2*NG along x, and analogously for mygrid and mzgrid. The group and dataset structure and dimensions would be as follows:

data/
    ax     [mxgrid,mygrid,mzgrid]
    ay     [mxgrid,mygrid,mzgrid]
    az     [mxgrid,mygrid,mzgrid]
    lnTT   [mxgrid,mygrid,mzgrid]
    lnrho  [mxgrid,mygrid,mzgrid]
    ux     [mxgrid,mygrid,mzgrid]
    uy     [mxgrid,mygrid,mzgrid]
    uz     [mxgrid,mygrid,mzgrid]

This strategy is implemented in the "hdf5_io_parallel" module. We find that it is significantly slower than distributed IO, because the data has to be combined into the monolithic snapshot. This requires writing non-aligned data into the file: each processor writes array stripes of only nx or nx+NG points. This striping makes the overall write process very slow.
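
The size arithmetic above can be illustrated with a short Python sketch. It is not taken from the "hdf5_io_parallel" module; the function names global_size and write_range are hypothetical, and the sketch assumes that only the first and last processor along an axis also write the outer ghost layers, which is one way to arrive at stripes of nx or nx+NG points.

```python
def global_size(n, p, ng):
    """Global number of points along one axis, ghost layers included."""
    return n * p + 2 * ng


def write_range(n, p, ng, ip):
    """Half-open index range [start, stop) written by processor ip along one axis.

    Assumption (not confirmed by the source): interior processors write only
    their n inner points, while the first and last processor additionally
    write the ng outer ghost layers on their side.
    """
    start = ng + ip * n
    stop = start + n
    if ip == 0:
        start -= ng   # include the lower outer ghost layers
    if ip == p - 1:
        stop += ng    # include the upper outer ghost layers
    return start, stop


if __name__ == "__main__":
    nx, px, NG = 32, 4, 3   # example values chosen for illustration only
    print("mxgrid =", global_size(nx, px, NG))
    for ipx in range(px):
        start, stop = write_range(nx, px, NG, ipx)
        print(f"ipx={ipx}: x-indices [{start}, {stop}), stripe of {stop - start} points")
```

With these example values the stripes are 35, 32, 32, and 35 points long and together cover all 134 points along x. Each stripe is contiguous only along one axis of the global array, so a processor's write breaks up into many short segments in the file, which is consistent with the slowdown described above.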