Sample layout for a simulation with a grid of [nx,ny,nz] grid points per processor. There are (px,py,pz) processors along the x, y, and z directions. The number of ghost layers NG is set here to 3, but is of course variable in reality. This results in a global grid size of mxgrid=nx*px+2*NG in a monolithic layout. Likewise, the global grid size without ghost cells is defined as nxgrid=nx*px. The chunk dimensions are denoted in square brackets, the processor layout in parentheses. The group and dataset structure and dimensions would be as follows:

data/
  ax    [nx,ny,nz],(px,py,pz)
  ay    [nx,ny,nz],(px,py,pz)
  az    [nx,ny,nz],(px,py,pz)
  lnTT  [nx,ny,nz],(px,py,pz)
  lnrho [nx,ny,nz],(px,py,pz)
  ux    [nx,ny,nz],(px,py,pz)
  uy    [nx,ny,nz],(px,py,pz)
  uz    [nx,ny,nz],(px,py,pz)
ghost/
  ax/
    x_lower                 [ 3,ny,nz],(py,pz)
    x_upper                 [ 3,ny,nz],(py,pz)
    y_lower                 [nx, 3,nz],(px,pz)
    y_upper                 [nx, 3,nz],(px,pz)
    z_lower                 [nx,ny, 3],(px,py)
    z_upper                 [nx,ny, 3],(px,py)
    x_lower_y_lower         [ 3, 3,nz],(pz)
    x_upper_y_lower         [ 3, 3,nz],(pz)
    x_lower_y_upper         [ 3, 3,nz],(pz)
    x_upper_y_upper         [ 3, 3,nz],(pz)
    x_lower_z_lower         [ 3,ny, 3],(py)
    x_upper_z_lower         [ 3,ny, 3],(py)
    x_lower_z_upper         [ 3,ny, 3],(py)
    x_upper_z_upper         [ 3,ny, 3],(py)
    y_lower_z_lower         [nx, 3, 3],(px)
    y_upper_z_lower         [nx, 3, 3],(px)
    y_lower_z_upper         [nx, 3, 3],(px)
    y_upper_z_upper         [nx, 3, 3],(px)
    x_lower_y_lower_z_lower [3,3,3]
    x_upper_y_lower_z_lower [3,3,3]
    x_lower_y_upper_z_lower [3,3,3]
    x_upper_y_upper_z_lower [3,3,3]
    x_lower_y_lower_z_upper [3,3,3]
    x_upper_y_lower_z_upper [3,3,3]
    x_lower_y_upper_z_upper [3,3,3]
    x_upper_y_upper_z_upper [3,3,3]
  ay/
    ...
  .../

Alternatively, since some of these shapes are geometrically the same, they can be combined by adding a trailing dimension that enumerates the lower/upper boundaries. The chunk size would be the size of the first three dimensions only. Each chunk is a separate write operation, usually issued from a different processor. (L=lower, U=upper)

ghost/
  ax/
    x   [ 3,ny,nz],(py,pz),2
    y   [nx, 3,nz],(px,pz),2
    z   [nx,ny, 3],(px,py),2  => 1=L, 2=U
    xy  [ 3, 3,nz],(pz),4
    xz  [ 3,ny, 3],(py),4
    yz  [nx, 3, 3],(px),4     => 1=LL, 2=UL, 3=LU, 4=UU
    xyz [3,3,3],8             => 1=LLL, 2=ULL, 3=LUL, 4=UUL, 5=LLU, 6=ULU, 7=LUU, 8=UUU
  .../

One more alternative is to store the ghost layers collectively for each direction, because the amount of data is relatively small. Note that x denotes the yz-boundary layer. Note also that x contains all the y- and z-edges and all 8 corners, while y contains only the x-edges without any corners. The z (xy-boundary) layer contains only the inner xy-grid:

ghost/
  ax/
    x  (     3,mygrid,mzgrid),2
    y  (nxgrid,     3,mzgrid),2
    z  [    nx,    ny,     3],(px,py),2  => 1=L, 2=U
  .../

The advantage here is that reading the ghosts will be faster. As a trade-off, x and y have to be non-chunked datasets, because the data portion per processor is not the same for outer and inner processors. The z layer could be chunked, but the resulting speed-up is probably negligible.

PS: It seems beneficial to do an independent, multi-chunked I/O to write the f-array; see the discussion on pages 81-86 of https://www.alcf.anl.gov/files/Parallel_HDF5_1.pdf. This means setting H5Pset_dxpl_mpio() to H5FD_MPIO_COLLECTIVE and setting H5Pset_dxpl_mpio_collective_opt() to H5FD_MPIO_INDIVIDUAL_IO. The reason is that all processors would of course take part in writing the f-array, but in fact only one processor writes to each chunk, hence this is individual MPI-IO. The respective calls to activate the chunking should be:

CALL h5pcreate_f(H5P_DATASET_CREATE_F, param_list_id, h5_error)
CALL h5pset_chunk_f(param_list_id, dimensionality, chunk_dims, h5_error)
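
For concreteness, below is a minimal, self-contained sketch of such a write for one f-array slot: a chunked dataset data/ux of global size nxgrid x nygrid x nzgrid, with one [nx,ny,nz] chunk written by each processor in collective transfer mode. The variable names (ipx, ipy, ipz, ux), the file name var.h5, and the 1x1xpz processor layout are illustrative assumptions, not the actual Pencil Code interface; the H5Pset_dxpl_mpio_collective_opt() tuning mentioned above is left out, since a Fortran wrapper for it may not be available in all HDF5 versions.

program write_chunked
  use mpi
  use hdf5
  implicit none

  integer, parameter :: nx=32, ny=32, nz=32            ! local grid size (assumed)
  integer :: px, py, pz, ipx, ipy, ipz                 ! processor layout (assumed)
  integer :: mpi_err, h5_err, rank, nprocs
  integer(HID_T) :: fapl, file_id, group_id, dset_id
  integer(HID_T) :: filespace, memspace, dcpl, dxpl
  integer(HSIZE_T) :: global_dims(3), chunk_dims(3), local_dims(3), offset(3)
  real(kind=8) :: ux(nx,ny,nz)

  CALL MPI_Init(mpi_err)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, mpi_err)
  CALL MPI_Comm_size(MPI_COMM_WORLD, nprocs, mpi_err)

  ! Assumed 1x1xN domain decomposition, purely to keep the sketch self-contained.
  px = 1; py = 1; pz = nprocs
  ipx = 0; ipy = 0; ipz = rank
  ux = real(rank, kind=8)

  CALL h5open_f(h5_err)

  ! Open the file with the MPI-IO driver so that all processors write into it.
  CALL h5pcreate_f(H5P_FILE_ACCESS_F, fapl, h5_err)
  CALL h5pset_fapl_mpio_f(fapl, MPI_COMM_WORLD, MPI_INFO_NULL, h5_err)
  CALL h5fcreate_f('var.h5', H5F_ACC_TRUNC_F, file_id, h5_err, access_prp=fapl)

  ! Global dataset data/ux without ghost cells, chunked as [nx,ny,nz],(px,py,pz).
  global_dims = (/ int(nx*px,HSIZE_T), int(ny*py,HSIZE_T), int(nz*pz,HSIZE_T) /)
  chunk_dims  = (/ int(nx,HSIZE_T), int(ny,HSIZE_T), int(nz,HSIZE_T) /)
  local_dims  = chunk_dims

  CALL h5screate_simple_f(3, global_dims, filespace, h5_err)
  CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl, h5_err)
  CALL h5pset_chunk_f(dcpl, 3, chunk_dims, h5_err)
  CALL h5gcreate_f(file_id, 'data', group_id, h5_err)
  CALL h5dcreate_f(group_id, 'ux', H5T_NATIVE_DOUBLE, filespace, dset_id, h5_err, dcpl)

  ! Each processor selects exactly the chunk that corresponds to its subdomain,
  ! so every chunk is written by exactly one processor (individual MPI-IO).
  offset = (/ int(ipx*nx,HSIZE_T), int(ipy*ny,HSIZE_T), int(ipz*nz,HSIZE_T) /)
  CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, local_dims, h5_err)
  CALL h5screate_simple_f(3, local_dims, memspace, h5_err)

  ! Collective transfer mode, as suggested above.
  CALL h5pcreate_f(H5P_DATASET_XFER_F, dxpl, h5_err)
  CALL h5pset_dxpl_mpio_f(dxpl, H5FD_MPIO_COLLECTIVE_F, h5_err)

  CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, ux, local_dims, h5_err, &
                  mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl)

  CALL h5dclose_f(dset_id, h5_err)
  CALL h5sclose_f(memspace, h5_err)
  CALL h5sclose_f(filespace, h5_err)
  CALL h5gclose_f(group_id, h5_err)
  CALL h5pclose_f(dcpl, h5_err)
  CALL h5pclose_f(dxpl, h5_err)
  CALL h5pclose_f(fapl, h5_err)
  CALL h5fclose_f(file_id, h5_err)
  CALL h5close_f(h5_err)
  CALL MPI_Finalize(mpi_err)
end program write_chunked

The point of the sketch is only that each processor issues exactly one write that maps onto exactly one chunk of the dataset, which is what makes the individual-I/O optimisation from the slides applicable.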