HDF5 for efficient I/O

Dataset Details

A dataset object comprises data values and the metadata that describe them. A dataset has two fundamental parts: a header and a data array. The header contains information about the data array and the associated metadata. Typical header information includes the name of the object, its dimensionality, its number type, information about how the data is stored on disk, and other information that HDF5 can use to speed up data access or improve data integrity.

The header has four essential classes of information: name, datatype, dataspace, and storage layout. A name in HDF5 is just a sequence of ASCII characters, but you should choose one that is meaningful for the dataset. A datatype in HDF5 describes the individual data elements in a dataset and falls into one of two categories: atomic or compound. Datatypes can be quite complicated to define, so I focus only on the basics. Some predefined datatypes [11] cover the data you are likely to encounter.
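As a rough sketch of how these four pieces surface in practice, the snippet below creates a small dataset with h5py and reads back its name, datatype, dataspace, and layout. The file name, dataset name, and chunk shape are made up for illustration, and the in-memory core driver is used only so that nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store); all names are hypothetical.
with h5py.File("header_demo.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(200, dtype="float64").reshape(20, 10)
    # chunks=(10, 10) requests a chunked storage layout instead of a contiguous one.
    ds = f.create_dataset("temperature", data=data, chunks=(10, 10))

    header = {
        "name": ds.name,       # full path of the object inside the file
        "datatype": ds.dtype,  # NumPy view of the HDF5 datatype
        "shape": ds.shape,     # current dimensions of the dataspace
        "chunks": ds.chunks,   # chunk shape, or None for a contiguous layout
    }

for key, value in header.items():
    print(key, value)
```

Leaving out the chunks argument would give a contiguous layout, in which case ds.chunks is None.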

Atomic datatypes include integers, floating-point numbers, and strings. Each datatype has a set of properties. For example, the integer datatype properties are size, order (endianness), and sign (signed/unsigned). The float datatype properties are size, location of the exponent and mantissa, and location of the sign bit.
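The integer properties can be inspected from Python. The following is a minimal sketch, assuming h5py and NumPy are installed; the dataset names are hypothetical. NumPy type strings such as "<i4" encode byte order, sign, and size in one place, and h5py stores them as the matching HDF5 atomic types.

```python
import h5py
import numpy as np

with h5py.File("atomic_demo.h5", "w", driver="core", backing_store=False) as f:
    # "<i4": little-endian, signed, 4-byte integer
    ds_le = f.create_dataset("counts", data=np.array([1, -2, 3], dtype="<i4"))
    # ">u2": big-endian, unsigned, 2-byte integer
    ds_be = f.create_dataset("flags", data=np.array([1, 2, 3], dtype=">u2"))

    props = {}
    for ds in (ds_le, ds_be):
        dt = ds.dtype
        # itemsize is the size in bytes; kind is "i" for signed, "u" for unsigned
        props[ds.name] = (dt.itemsize, dt.kind)

    # Byte order as recorded in the file, read through h5py's low-level API
    order_be = ds_be.id.get_type().get_order()
    values_be = ds_be[...].tolist()

print(props)
print(order_be == h5py.h5t.ORDER_BE)
```

The values read back are the same regardless of the byte order on disk, because the conversion to the machine's representation happens when NumPy interprets the array.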

Compound datatypes are collections of several datatypes that are presented as a single unit. In C, this is similar to a struct. The various parts of a compound datatype are called members and may be of any datatype, including another compound datatype. One of the nicer features of HDF5 is that you can read individual members of a compound datatype without reading the whole type.
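A compound datatype and a partial read might look like the sketch below. It assumes h5py and NumPy; the member names (sensor_id, temperature, label) and the sample values are invented for illustration. A NumPy structured dtype plays the role of the C struct.

```python
import h5py
import numpy as np

# Structured dtype, comparable to a C struct; h5py maps it to an HDF5 compound type.
sample_t = np.dtype([("sensor_id", "i4"), ("temperature", "f8"), ("label", "S8")])
samples = np.array(
    [(1, 20.5, b"north"), (2, 21.0, b"south"), (3, 19.75, b"east")],
    dtype=sample_t,
)

with h5py.File("compound_demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("samples", data=samples)

    member_names = ds.dtype.names    # names of the compound members
    temperatures = ds["temperature"]  # read only this member for every element
    first_record = ds[0]              # read one whole element

print(member_names)
print(temperatures)
print(first_record)
```

Indexing the dataset with a member name returns a plain array of that member's values, so the other members never have to be unpacked into the result.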

The dataspace describes the layout of a dataset's data elements. It can consist of no elements (NULL), a single element (a scalar), or a simple array. The dataspace can be fixed or unlimited, which allows it to be extensible (i.e., it can grow larger).
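The three kinds of dataspace can be sketched in h5py as follows. The dataset names are hypothetical; h5py.Empty is the h5py way of asking for a null dataspace, and h5py reports the shape of such a dataset as None.

```python
import h5py

with h5py.File("dataspace_demo.h5", "w", driver="core", backing_store=False) as f:
    # Null dataspace: a datatype but no elements
    null_ds = f.create_dataset("nothing", data=h5py.Empty("f4"))
    # Scalar dataspace: exactly one element, shape ()
    scalar_ds = f.create_dataset("answer", data=42)
    # Simple dataspace: a regular N-dimensional array
    simple_ds = f.create_dataset("grid", shape=(3, 4), dtype="i4")

    shapes = {
        "null": null_ds.shape,
        "scalar": scalar_ds.shape,
        "simple": simple_ds.shape,
    }
    scalar_value = scalar_ds[()]  # scalars are read with an empty tuple index

print(shapes)
print(scalar_value)
```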

Dataspace properties include rank (number of dimensions), size (the current extent of each dimension), and maximum size (the extent to which each dimension may grow). The rank of the dataspace is fixed when the array is created; the maximum size sets how far each dimension can grow during the lifetime of the dataspace.

If you are not sure how large a dimension of your dataspace might become, you can set its maximum size to the predefined HDF5 constant H5S_UNLIMITED. A dataset with an unlimited dimension must use chunked storage.
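An extensible dataset can be sketched in h5py as shown here. In h5py, passing None in maxshape is how an unlimited dimension is requested; the dataset name and sizes are arbitrary choices for the example.

```python
import h5py

with h5py.File("extend_demo.h5", "w", driver="core", backing_store=False) as f:
    # Rank 2: the first dimension may grow without limit, the second is capped at 3.
    # h5py switches on chunked storage automatically when maxshape is given.
    ds = f.create_dataset("log", shape=(2, 3), maxshape=(None, 3), dtype="f8")
    ds[...] = 1.0
    before = ds.shape

    ds.resize((5, 3))  # grow along the unlimited dimension
    ds[2:] = 2.0       # fill the three new rows
    after = ds.shape

    rank = ds.ndim
    limits = ds.maxshape
    is_chunked = ds.chunks is not None
    total = float(ds[...].sum())

print(before, after, rank, limits, is_chunked, total)
```

The rank stays at 2 throughout; only the extent of the first dimension changes. Trying to resize the second dimension beyond 3 would fail, because its maximum size was fixed at creation.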

Attributes

One of the fundamental objects in HDF5 is an attribute, which is how you store metadata inside an HDF5 file. Attributes are optional and are associated with other HDF5 objects, such as groups, datasets, or named datatypes; they are not independent objects. As such, attributes are accessed by opening the object to which they are attached.

As the user, you define the attributes (make them meaningful), and you can delete or overwrite them as you see fit.

Attributes have two parts. The first is a name, and the second is a value. Typically, the value is a string that describes the data to which the attribute is attached. Attributes can be extremely useful in a data file. With them, you can describe the data, including information such as when the data was collected, who collected it, what applications or sensors were used in its creation, a description (with as much information as you can include), and so on. A lack of useful metadata is one of the biggest problems in HPC data today, and attributes can help alleviate the problem. You just have to use them.
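Defining, overwriting, and deleting attributes might look like the following sketch, which assumes h5py. Every attribute name and value here is made up for illustration. In h5py, attributes are reached through the attrs property of the object they are attached to, which behaves much like a dictionary.

```python
import h5py
import numpy as np

with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("temperature", data=np.array([20.5, 21.0, 19.75]))

    # Attach name/value pairs to the dataset
    ds.attrs["units"] = "degC"
    ds.attrs["collected_by"] = "J. Doe"
    ds.attrs["sensor"] = "probe-7"

    # Groups carry attributes too; f.attrs belongs to the root group
    f.attrs["description"] = "Example measurements for an attribute demo"

    ds.attrs["units"] = "K"  # overwrite an existing attribute
    del ds.attrs["sensor"]   # delete an attribute

    names = sorted(ds.attrs.keys())
    units = ds.attrs["units"]
    has_sensor = "sensor" in ds.attrs
    description = f.attrs["description"]

print(names, units, has_sensor)
print(description)
```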

HDF5 Basics

In this section, I present a quick introduction to HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics in practice. I'll start with Python because it is a widely used language, and the HDF5 Python library h5py [12] is easy to use and understand.
