HDF5 for efficient I/O
Dataset Details
A dataset object comprises data values and the metadata that describes them. A dataset has two fundamental parts: a header and a data array. The header contains information about the data array portion of the dataset and the associated metadata. Typical header information includes the name of the object, dimensionality, number type, information about how the data is stored on disk, and other information that HDF5 can use to speed up data access or improve data integrity.
The header has four essential classes of information: name, datatype, dataspace, and storage layout. A name in HDF5 is just a set of ASCII characters, but you should choose a name that is meaningful for the dataset. A datatype in HDF5 describes the individual data elements in a dataset and falls into one of two categories: atomic or compound. Datatypes can be quite complicated to define, so I focus only on the basics. Some predefined datatypes [11] can be used for the data you typically might encounter.
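As a minimal sketch of the four classes of header information, the following Python snippet creates a small dataset with h5py and reads back its name, datatype, dataspace shape, and storage layout. The file name header_demo.h5 and the dataset name temperature are made up for illustration, and the file is kept in memory (core driver without a backing store) so that nothing is written to disk.

```python
import h5py
import numpy as np


def describe_dataset():
    """Create a small dataset and return the information stored in its header."""
    # Assumption: an in-memory file is used purely for illustration.
    with h5py.File("header_demo.h5", "w", driver="core", backing_store=False) as f:
        data = np.arange(12, dtype="f8").reshape(3, 4)
        dset = f.create_dataset("temperature", data=data)
        return {
            "name": dset.name,          # full path of the object in the file
            "dtype": str(dset.dtype),   # datatype of each element
            "shape": dset.shape,        # current size of the dataspace
            "chunks": dset.chunks,      # None means contiguous storage layout
        }


info = describe_dataset()
print(info)
```

Because no chunking or maximum shape was requested, h5py stores the array contiguously and reports chunks as None.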
Atomic datatypes include integers, floating-point numbers, and strings. Each datatype has a set of properties. For example, the integer datatype properties are size, order (endianness), and sign (signed/unsigned). The float datatype properties are size, location of the exponent and mantissa, and location of the sign bit.
Compound datatypes refer to collections of several datatypes that are presented as a single unit. In C, this is similar to a struct. The various parts of a compound datatype are called members and may be of any datatype, including another compound datatype. One of the fancy features of HDF5 is that it is possible to read members from a compound datatype without reading the whole type.
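Reading a single member can be sketched in Python with h5py, which maps a NumPy structured dtype to an HDF5 compound datatype. The member names (time, sensor_id, value), the dataset name samples, and the sample values are hypothetical, and the file again lives only in memory.

```python
import h5py
import numpy as np

# Hypothetical record layout, comparable to a C struct with three members.
sample_t = np.dtype([("time", "f8"), ("sensor_id", "i4"), ("value", "f4")])


def read_one_member():
    """Store compound records, then read back only the 'value' member."""
    records = np.array(
        [(0.0, 1, 20.5), (1.0, 1, 21.0), (2.0, 2, 19.75)], dtype=sample_t
    )
    with h5py.File("compound_demo.h5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("samples", data=records)
        member_names = dset.dtype.names
        # Indexing by member name asks HDF5 for that member only,
        # not for the complete records.
        values = dset["value"]
    return member_names, values


member_names, values = read_one_member()
print(member_names, values)
```

The returned array holds only the value member as 32-bit floats; the time and sensor_id members are not part of the result.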
A dataspace describes the layout of a dataset's data elements and can consist of no elements (NULL), a single element (a scalar), or a simple array. The dimensions of a dataspace can be fixed or unlimited, which allows the dataset to be extensible (i.e., it can grow larger).
Dataspace properties include rank (number of dimensions), size (dimensions), and maximum size (size to which an array may grow). The dimensionality (rank) of the dataspace is fixed when the array is created, and the dataspace can include a maximum size to which each dimension can grow during its lifetime.
If you are not sure how large a dimension of your dataspace might become, you can always set its maximum size to the HDF5 predefined constant H5S_UNLIMITED.
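A small sketch of an extensible dataset in Python follows. In h5py, an unlimited dimension is requested by passing None in the maxshape argument; the dataset name log and the sizes are arbitrary choices for illustration, and the file is kept in memory.

```python
import h5py


def grow_dataset():
    """Create a 1D dataset with an unlimited dimension and extend it."""
    with h5py.File("extend_demo.h5", "w", driver="core", backing_store=False) as f:
        # The rank (1) is fixed at creation; None marks the dimension as unlimited.
        dset = f.create_dataset("log", shape=(4,), maxshape=(None,), dtype="i8")
        dset[:] = [0, 1, 2, 3]
        initial_shape = dset.shape

        # Grow the dataset from 4 to 8 elements and fill the new part.
        dset.resize((8,))
        dset[4:] = [4, 5, 6, 7]

        return initial_shape, dset.shape, dset.maxshape, dset[:].tolist()


initial_shape, final_shape, max_shape, contents = grow_dataset()
print(initial_shape, final_shape, max_shape, contents)
```

Extensible datasets require chunked storage, so h5py picks a chunk shape automatically when maxshape is given and no chunks argument is supplied.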
Attributes
One of the fundamental objects in HDF5 is an attribute, which is how you store metadata inside an HDF5 file. Attributes are not independent objects; they are attached to other HDF5 objects, such as groups, datasets, or named datatypes. As such, attributes are accessed by opening the object to which they are attached.
As the user, you define the attributes (make them meaningful), and you can delete and overwrite them as you see fit.
Attributes have two parts. The first is a name, and the second is a value. Classically, the value is a string that describes the data to which it is attached. Attributes can be extremely useful in a data file. Using attributes, you can describe the data, including information such as when the data was collected, who collected it, what applications or sensors were used in its creation, a description (with as much information as you can include), and so on. A lack of useful metadata is one of the biggest problems in HPC data today, and attributes can be used to help alleviate the problem. You just have to use them.
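To make this concrete, the following sketch attaches attributes to a file's root group and to a dataset with h5py, then overwrites one and deletes another. The attribute names and values (description, units, collected_by) are invented for illustration, and the file is kept in memory.

```python
import h5py
import numpy as np


def annotate_dataset():
    """Attach, overwrite, and delete attributes on a group and a dataset."""
    with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
        # Attributes on the root group describe the file as a whole.
        f.attrs["description"] = "example measurements"

        dset = f.create_dataset("temperature", data=np.zeros(5))
        # Attributes are reached through the object to which they are attached.
        dset.attrs["units"] = "kelvin"
        dset.attrs["collected_by"] = "example user"

        dset.attrs["units"] = "celsius"   # overwrite an existing attribute
        del dset.attrs["collected_by"]    # delete an attribute

        return dict(f.attrs), dict(dset.attrs)


file_attrs, dset_attrs = annotate_dataset()
print(file_attrs, dset_attrs)
```

Each attribute is a name-value pair, so the attrs interface behaves much like a Python dictionary.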
HDF5 Basics
In this section, I want to present a quick introduction to HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics in practice. I'll start with Python because it is a widely used language, and the HDF5 Python library h5py [12] is very easy to use and very easy to understand.