Reading HDF5 Files

HDF5 Overview

Many TXPipe outputs are in the HDF5 format. This is a fast and flexible file type that can also be easily read/written in parallel.

HDF5 files contain three types of object:

  • datasets are equivalent to saved numpy arrays.

  • groups are like directories and can contain datasets or sub-groups.

  • attributes hold small pieces of metadata. They can be attached to whole files or to individual groups or datasets, and a set of attributes can be converted to a Python dictionary.

The naming scheme for datasets and groups is the same as for Unix files and folders, e.g. f['group/subgroup/dataset'].
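
To make the three object types and the path-style naming concrete, here is a minimal sketch that writes a small throwaway file and reads it back. The file name, the group and dataset names, and the attribute keys are all made up for illustration and are not part of any TXPipe output.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and object names, used only for illustration
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")

with h5py.File(path, "w") as f:
    # Intermediate groups are created automatically from the path
    dset = f.create_dataset("group/subgroup/dataset", data=np.arange(5))
    # Attributes can be attached to the file, a group, or a dataset
    f.attrs["creator"] = "demo"
    f["group"].attrs["description"] = "an example group"
    dset.attrs["units"] = "arbitrary"

with h5py.File(path, "r") as f:
    # A full path and step-by-step indexing reach the same dataset
    by_path = f["group/subgroup/dataset"][:]
    by_steps = f["group"]["subgroup"]["dataset"][:]
    file_attrs = dict(f.attrs)
    dataset_units = f["group/subgroup/dataset"].attrs["units"]
```

Both access styles return the same array, so the choice between them is a matter of convenience.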

From the command line, you can use the h5ls command to list the contents of an HDF5 file:

h5ls -r filename.hdf5

or the TXPipe command:

python bin/h5show.py filename_or_directory
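
If neither command is available, a similar recursive listing can be produced from Python. The sketch below uses h5py's visititems method. The helper name list_contents and the demo file are hypothetical, and this is not how h5show.py itself is implemented.

```python
import os
import tempfile

import h5py
import numpy as np


def list_contents(filename):
    """Return a sorted list of (name, description) for every object in the file."""
    entries = []

    def visit(name, obj):
        # visititems calls this once per group and dataset below the root
        if isinstance(obj, h5py.Dataset):
            entries.append((name, f"Dataset shape={obj.shape} dtype={obj.dtype}"))
        else:
            entries.append((name, "Group"))

    with h5py.File(filename, "r") as f:
        f.visititems(visit)
    return sorted(entries)


# Build a small stand-in file (hypothetical names) and list it
path = os.path.join(tempfile.mkdtemp(), "listing_demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("tomography/mean_e1", data=np.zeros(4, dtype="f4"))
    f.create_group("provenance")

contents = list_contents(path)
for name, description in contents:
    print(name, description)
```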

h5py

In Python, you can read these files with the h5py library. Here’s an example that opens one of the files generated by the example “laptop” pipeline in TXPipe:

import h5py
# Open in read-only mode so the file cannot be modified by accident
f = h5py.File("./data/example/outputs/shear_tomography_catalog.hdf5", "r")

# Print out the items in the root of the file
print(f.keys())
# prints <KeysViewHDF5 ['metacal_response', 'provenance', 'tomography']>
# showing the three groups generated by the tomography stage

We can create variables to represent groups in the file:

g = f["tomography"]
print(g.keys())
# prints <KeysViewHDF5 ['N_eff', 'N_eff_2d', 'mean_e1', 'mean_e1_2d', 'mean_e2', 'mean_e2_2d', 'sigma_e', 'sigma_e_2d', 'source_bin', 'source_counts', 'source_counts_2d']>

Printing a dataset doesn’t load it; it just shows the size and type of the data:

print(g["mean_e1"])
# prints <HDF5 dataset "mean_e1": shape (4,), type "<f4">

Instead, we load datasets as NumPy arrays by slicing them:

e = g["mean_e1"][:]
print(e)
# prints [ 0.00283134 -0.0140038   0.0011645  -0.01299088]

For longer arrays we may want to just read a subset of the data:

b = g["source_bin"][0:100]
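
When a dataset is too large to hold in memory, the same slicing can be applied in a loop so that only one block is read at a time. The following is a minimal sketch under that assumption. The helper iter_chunks, the chunk size of 100, and the stand-in file are hypothetical, with the dataset name chosen to mirror the example above.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_chunks(dataset, chunk_size):
    """Yield (start, end, values) for consecutive slices of a 1D dataset."""
    n = dataset.shape[0]
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # Only this slice is read from disk
        yield start, end, dataset[start:end]


# Stand-in file with a hypothetical integer dataset of bin indices
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.hdf5")
bins = np.arange(250) % 4
with h5py.File(path, "w") as f:
    f.create_dataset("tomography/source_bin", data=bins)

# Accumulate the number of objects per bin, 100 rows at a time
counts = np.zeros(4, dtype=np.int64)
with h5py.File(path, "r") as f:
    for start, end, values in iter_chunks(f["tomography/source_bin"], 100):
        counts += np.bincount(values, minlength=4)
```

The last chunk is shorter than the others, which is why the end index is clipped to the dataset length.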

Attributes

The easiest way to read attributes from h5py is to turn them into a dictionary:

d = dict(f['provenance'].attrs)
print(d)
# prints lots of provenance tracking information like all the package versions
# and configuration options
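
Attributes can also be read one at a time, since the attrs object behaves like a dictionary. The sketch below shows this on a stand-in file. The attribute names creation_note and process_id are invented for the example and are not the keys TXPipe actually writes.

```python
import os
import tempfile

import h5py

# Stand-in file with hypothetical attribute names
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.hdf5")
with h5py.File(path, "w") as f:
    g = f.create_group("provenance")
    g.attrs["creation_note"] = "made for illustration"
    g.attrs["process_id"] = 1234

with h5py.File(path, "r") as f:
    attrs = f["provenance"].attrs
    # Look up a single attribute by name
    note = attrs["creation_note"]
    # Check for a key, or fall back to a default if it is absent
    has_pid = "process_id" in attrs
    missing = attrs.get("no_such_key", "unknown")
    # List every attribute name attached to the group
    names = sorted(attrs.keys())
    # Numeric attributes come back as NumPy scalars
    pid = int(attrs["process_id"])
```

The attrs object is only usable while the file is open, so copy out any values you need, for example with dict(...), before closing it.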