Data Pipeline

apax.data

apax.data.preprocessing.compute_nl(positions, box, r_max)[source]

Computes the neighbor list for a single structure. For periodic systems, positions are assumed to be in fractional coordinates.

Parameters:
  • positions (np.ndarray) – Positions of atoms.

  • box (np.ndarray) – Simulation box dimensions.

  • r_max (float) – Maximum interaction radius.

Returns:

Tuple containing neighbor indices array and offsets array.

Return type:

Tuple[np.ndarray, np.ndarray]
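To illustrate the shape of the returned data, here is a minimal brute-force sketch for a non-periodic structure. It is not the library's implementation: the function name `brute_force_nl_sketch`, the `(2, n_pairs)` index layout, and the all-zero offsets are assumptions made for illustration only.

```python
import numpy as np


def brute_force_nl_sketch(positions: np.ndarray, r_max: float):
    """Hypothetical O(N^2) neighbor search for a non-periodic structure."""
    # Pairwise displacement vectors and distances between all atoms.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs closer than r_max, excluding each atom's pairing with itself.
    mask = (dist < r_max) & ~np.eye(len(positions), dtype=bool)
    # Assumed layout: row 0 holds the central atom, row 1 its neighbor.
    idx = np.stack(np.nonzero(mask))
    # Without periodic images, every offset is zero (assumed shape: n_pairs x 3).
    offsets = np.zeros((idx.shape[1], 3))
    return idx, offsets


positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
idx, offsets = brute_force_nl_sketch(positions, r_max=1.5)
```

In this example only the first two atoms lie within `r_max` of each other, so the pair appears once in each direction and the third atom has no neighbors.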

apax.data.preprocessing.get_shrink_wrapped_cell(positions)[source]

Get the shrink-wrapped simulation cell based on atomic positions.

Parameters:

positions (np.ndarray) – Atomic positions.

Returns:

Tuple containing the shrink-wrapped cell matrix and origin.

Return type:

Tuple[np.ndarray, np.ndarray]
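The idea of a shrink-wrapped cell can be sketched as an axis-aligned bounding box around the atoms. The function below is a hypothetical illustration, not the library's code; in particular, the real function may add a margin or enforce a minimum cell size, which this sketch does not attempt.

```python
import numpy as np


def shrink_wrapped_cell_sketch(positions: np.ndarray):
    """Hypothetical bounding-box cell: diagonal cell matrix plus origin."""
    # The origin is the corner with the smallest coordinate along each axis.
    origin = positions.min(axis=0)
    # The extent along each axis becomes a diagonal entry of the cell matrix.
    extent = positions.max(axis=0) - origin
    cell = np.diag(extent)
    return cell, origin


positions = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.5, 1.0, -1.0]])
cell, origin = shrink_wrapped_cell_sketch(positions)
```

For these three atoms the origin is `(0, 0, -1)` and the box edges have lengths 1, 2 and 4.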

apax.data.preprocessing.prefetch_to_single_device(iterator, size: int, data_sharding=None)[source]

Prefetches elements of an iterator onto a single device. Inspired by Flax's prefetch_to_device: https://flax.readthedocs.io/en/latest/_modules/flax/jax_utils.html#prefetch_to_device
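The general prefetching pattern can be sketched with a small queue that is kept filled ahead of the consumer. This is an illustrative sketch only: `prefetch_sketch` and its `transfer` argument are hypothetical names, and the identity function stands in for an actual host-to-device transfer (in a JAX setting this would plausibly be something like `jax.device_put`).

```python
import collections
import itertools


def prefetch_sketch(iterator, size, transfer=lambda x: x):
    """Hypothetical prefetcher: keeps up to `size` transferred items queued."""
    iterator = iter(iterator)
    queue = collections.deque()

    def enqueue(n):
        # Pull up to n items from the source and start their transfer.
        for item in itertools.islice(iterator, n):
            queue.append(transfer(item))

    enqueue(size)  # Fill the buffer before yielding anything.
    while queue:
        yield queue.popleft()
        enqueue(1)  # Replace the item that was just consumed.


batches = list(prefetch_sketch(range(5), size=2))
scaled = list(prefetch_sketch([1, 2, 3], size=2, transfer=lambda x: x * 10))
```

Because transfers are started before the consumer asks for the item, an asynchronous transfer can overlap with computation on the previous batch.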

apax.data.initialization.load_data_files(data_config)[source]

Load data files for training and validation.

Parameters:

data_config (object) – Data configuration object.

Returns:

Tuple containing the lists of ase.Atoms objects for training and validation.

Return type:

Tuple

class apax.data.input_pipeline.CachedInMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')[source]

Dataset which pads everything (atoms, neighbors) to the largest system in the dataset. The neighbor list (NL) is computed on the fly during the first epoch and stored to disk using tf.data’s cache. Most performant option for datasets with samples of very similar size.

shuffle_and_batch(mesh=None)[source]

Shuffles and batches the inputs and labels. This function prepares the inputs and labels for the entire training run and prefetches the data.

Returns:

Iterator that returns inputs and labels of one batch in each step.

Return type:

ds

class apax.data.input_pipeline.InMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')[source]

Base class for all datasets which store data in memory.

init_input() tuple[Dict[str, Array], ndarray][source]

Returns first batch of inputs and labels to init the model.

steps_per_epoch() int[source]

Returns the number of steps per epoch, which depends on the number of samples and the batch size. The value is chosen so that all epochs have the same number of steps and all batches have the same size. To achieve this, samples that do not fill a complete batch are dropped in each epoch.
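The arithmetic described above can be sketched as a floor division. This is a hedged illustration rather than the library's code; `steps_per_epoch_sketch` is a hypothetical name, and the sketch assumes that only the incomplete final batch is dropped.

```python
def steps_per_epoch_sketch(n_data: int, batch_size: int) -> int:
    """Hypothetical helper: number of full batches that fit into the data."""
    # Floor division keeps only complete batches; the remainder is dropped.
    return n_data // batch_size


n_data, batch_size = 1003, 32
steps = steps_per_epoch_sketch(n_data, batch_size)
dropped = n_data - steps * batch_size  # Samples left out of each epoch.
```

With 1003 samples and a batch size of 32, this gives 31 steps per epoch and leaves 11 samples unused in each epoch.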

class apax.data.input_pipeline.OTFInMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')[source]

Dataset which pads everything (atoms, neighbors) to the largest system in the dataset. The neighbor list is computed on the fly and fed into a tf.data generator. Mostly for internal purposes.

shuffle_and_batch(mesh=None)[source]

Shuffles and batches the inputs and labels. This function prepares the inputs and labels for the entire training run and prefetches the data.

Returns:

Iterator that returns inputs and labels of one batch in each step.

Return type:

ds

class apax.data.input_pipeline.PerBatchPaddedDataset(atoms_list, cutoff, bs, n_epochs, num_workers: int | None = None, atom_padding: int = 10, nl_padding: int = 2000, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties=[], pre_shuffle=False)[source]

Dataset with padding that leverages multiprocessing and optimized buffering.

Per-atom and per-neighbor arrays are padded to the next multiple of a user-specified integer. This limits the compute wasted due to padding at the (negligible) cost of some recompilations. Since the padding occurs on a per-batch basis, it is the most performant option for datasets whose systems vary widely in size (e.g. MaterialsProject, SPICE).

Further, the neighbor list is computed on the fly in parallel on a side thread. This class does not use tf.data.

num_workers

Number of processes to use for preprocessing batches.

Type:

int

atom_padding

Pad extensive arrays (positions, etc.) to next multiple of this integer.

Type:

int

nl_padding

Pad neighborlist arrays to next multiple of this integer.

Type:

int
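The effect of per-batch padding can be sketched as follows. This is an assumption-laden illustration, not the class's actual code: `pad_batch_sketch` is a hypothetical helper that pads only position arrays, using zeros, to the next multiple of `atom_padding`.

```python
import numpy as np


def pad_batch_sketch(positions_list, atom_padding: int = 10) -> np.ndarray:
    """Hypothetical helper: pad one batch to a multiple of `atom_padding`."""
    # The largest system in *this batch* determines the padded size.
    n_max = max(len(pos) for pos in positions_list)
    n_padded = -(-n_max // atom_padding) * atom_padding  # ceiling to multiple
    batch = np.zeros((len(positions_list), n_padded, 3))
    for i, pos in enumerate(positions_list):
        batch[i, : len(pos)] = pos  # Real atoms first, zero padding after.
    return batch


small = pad_batch_sketch([np.ones((3, 3)), np.ones((7, 3))])
large = pad_batch_sketch([np.ones((12, 3)), np.ones((18, 3))])
```

Here the batch of small systems is padded to 10 atoms and the batch of larger systems to 20. Since batch shapes can only take multiples of `atom_padding`, the number of distinct shapes, and hence of recompilations, stays small.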

enqueue_batches()[source]

Function to enqueue batches on a side thread.

apax.data.input_pipeline.find_largest_system(inputs, r_max) tuple[int][source]

Finds the maximal number of atoms and neighbors.

Parameters:
  • inputs (dict) – Dictionary containing input data.

  • r_max (float) – Maximum interaction radius.

Returns:

Tuple containing the maximum number of atoms and neighbors.

Return type:

Tuple[int, int]

apax.data.input_pipeline.pad_nl(idx, offsets, max_neighbors)[source]

Pad the neighbor list arrays to the maximal number of neighbors occurring.

Parameters:
  • idx (np.ndarray) – Neighbor indices array.

  • offsets (np.ndarray) – Offset array.

  • max_neighbors (int) – Maximum number of neighbors.

Returns:

Tuple containing padded neighbor indices array and offsets array.

Return type:

Tuple[np.ndarray, np.ndarray]
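A minimal sketch of the padding step is shown below. The array layouts (`idx` with shape `(2, n_neighbors)`, `offsets` with shape `(n_neighbors, 3)`) and the use of zeros as fill values are assumptions for illustration; the function name `pad_nl_sketch` is hypothetical and the library's actual conventions may differ.

```python
import numpy as np


def pad_nl_sketch(idx: np.ndarray, offsets: np.ndarray, max_neighbors: int):
    """Hypothetical zero-padding of neighbor-list arrays to a fixed length."""
    n_pad = max_neighbors - idx.shape[1]
    # Append n_pad zero columns to the index array (assumed shape: 2 x n).
    idx_padded = np.pad(idx, ((0, 0), (0, n_pad)), constant_values=0)
    # Append n_pad zero rows to the offsets array (assumed shape: n x 3).
    offsets_padded = np.pad(offsets, ((0, n_pad), (0, 0)), constant_values=0)
    return idx_padded, offsets_padded


idx = np.array([[0, 1, 1], [1, 0, 2]])
offsets = np.zeros((3, 3))
idx_padded, offsets_padded = pad_nl_sketch(idx, offsets, max_neighbors=5)
```

Padding every structure to the same number of neighbors gives all samples identical array shapes, which is what allows them to be stacked into fixed-size batches.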

apax.data.input_pipeline.round_up_to_multiple(value, multiple)[source]

Rounds up the given integer value to the next multiple of multiple.

Parameters:
  • value (int) – The integer to round up.

  • multiple (int) – The multiple to round up to.

Returns:

The rounded-up value.

Return type:

int
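The rounding can be sketched with integer ceiling division. This is an illustrative re-implementation under the assumption that values which are already exact multiples are returned unchanged; `round_up_sketch` is a hypothetical name, not the library function.

```python
def round_up_sketch(value: int, multiple: int) -> int:
    """Hypothetical helper: smallest multiple of `multiple` that is >= value."""
    # -(-a // b) is ceiling division for integers; scale back by `multiple`.
    return -(-value // multiple) * multiple


padded_atoms = round_up_sketch(37, 10)
padded_exact = round_up_sketch(40, 10)
padded_neighbors = round_up_sketch(1, 2000)
```

For example, 37 atoms with a padding multiple of 10 are rounded up to 40, while 40 stays at 40.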

class apax.data.statistics.DatasetStats(elemental_shift: np.array = None, elemental_scale: float = None, n_species: int = 119)[source]