Data Pipeline
apax.data
- apax.data.preprocessing.compute_nl(positions, box, r_max)
Computes the neighbor list for a single structure. For periodic systems, positions are assumed to be in fractional coordinates.
- Parameters:
positions (np.ndarray) – Positions of atoms.
box (np.ndarray) – Simulation box dimensions.
r_max (float) – Maximum interaction radius.
- Returns:
Tuple containing neighbor indices array and offsets array.
- Return type:
Tuple[np.ndarray, np.ndarray]
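To illustrate the shape of the returned data, here is a minimal brute-force sketch for a non-periodic structure. It is not the apax implementation: the function name `brute_force_nl` is hypothetical, and the layout of the indices as a `(2, n_pairs)` array with `(n_pairs, 3)` offsets is an assumption made for this example. Periodic images are ignored, so all offsets are zero.

```python
import numpy as np


def brute_force_nl(positions, r_max):
    """Return (idx, offsets) for a non-periodic structure (illustrative only).

    idx has shape (2, n_pairs): row 0 holds central atoms, row 1 their neighbors.
    offsets has shape (n_pairs, 3) and is all zeros since no periodic images exist.
    """
    positions = np.asarray(positions, dtype=float)
    # Pairwise displacement vectors (n, n, 3) and distances (n, n).
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep pairs inside the cutoff, excluding self-pairs on the diagonal.
    mask = (dist < r_max) & ~np.eye(len(positions), dtype=bool)
    idx = np.stack(np.nonzero(mask)).astype(np.int32)
    offsets = np.zeros((idx.shape[1], 3))
    return idx, offsets


positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
idx, offsets = brute_force_nl(positions, r_max=2.0)
```

Only the first two atoms are within 2.0 of each other, so the pair appears once in each direction. The brute-force approach scales quadratically with the number of atoms and is meant for explanation, not production use.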
- apax.data.preprocessing.get_shrink_wrapped_cell(positions)
Get the shrink-wrapped simulation cell based on atomic positions.
- Parameters:
positions (np.ndarray) – Atomic positions.
- Returns:
Tuple containing the shrink-wrapped cell matrix and origin.
- Return type:
Tuple[np.ndarray, np.ndarray]
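One way to picture a shrink-wrapped cell is as the axis-aligned bounding box of the atoms. The sketch below assumes exactly that; the name `shrink_wrap` and the `margin` parameter are hypothetical, and the library function may treat degenerate (flat or linear) structures differently.

```python
import numpy as np


def shrink_wrap(positions, margin=0.0):
    """Axis-aligned bounding box of the atoms as (cell, origin).

    `margin` is a hypothetical knob that enlarges the box on every side,
    which avoids zero-length cell vectors for planar or linear structures.
    """
    positions = np.asarray(positions, dtype=float)
    lower = positions.min(axis=0)
    upper = positions.max(axis=0)
    origin = lower - margin
    # Diagonal cell matrix whose entries are the box edge lengths.
    cell = np.diag(upper - lower + 2.0 * margin)
    return cell, origin


positions = np.array([[1.0, 2.0, 3.0], [4.0, 2.5, 7.0]])
cell, origin = shrink_wrap(positions)
```

Here the box spans 3.0, 0.5, and 4.0 along x, y, and z, and its origin sits at the minimum corner `[1.0, 2.0, 3.0]`.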
- apax.data.preprocessing.prefetch_to_single_device(iterator, size: int, data_sharding=None)
Prefetches items from an iterator onto a single device. Inspired by https://flax.readthedocs.io/en/latest/_modules/flax/jax_utils.html#prefetch_to_device
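The general prefetching pattern can be sketched with a small queue that stays a fixed number of items ahead of the consumer. This is a framework-free illustration, not the apax or Flax code: `prefetch`, `transfer`, and `fake_transfer` are hypothetical names, and `transfer` stands in for whatever call places a batch on the device (with JAX, `jax.device_put` would be a natural candidate).

```python
import collections
import itertools


def prefetch(iterator, size, transfer=lambda batch: batch):
    """Yield items from `iterator` while keeping up to `size` items staged ahead."""
    iterator = iter(iterator)
    queue = collections.deque()

    def enqueue(n):
        # Stage up to n more items; `transfer` stands in for device placement.
        for item in itertools.islice(iterator, n):
            queue.append(transfer(item))

    enqueue(size)  # fill the buffer before yielding anything
    while queue:
        yield queue.popleft()
        enqueue(1)  # top the buffer back up after each item is consumed


staged = []


def fake_transfer(batch):
    # Hypothetical stand-in for a host-to-device copy; records what was staged.
    staged.append(batch)
    return batch * 10


stream = prefetch(range(5), size=2, transfer=fake_transfer)
first = next(stream)
staged_after_first = list(staged)  # two items were staged before the first yield
rest = list(stream)
```

With asynchronous device transfers, staging items ahead of time lets the copy of the next batch overlap with computation on the current one.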
- apax.data.initialization.load_data_files(data_config)
Load data files for training and validation.
- Parameters:
data_config (object) – Data configuration object.
- Returns:
Tuple containing the lists of ase.Atoms objects for training and validation.
- Return type:
Tuple
- class apax.data.input_pipeline.CachedInMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')
Dataset which pads everything (atoms, neighbors) to the largest system in the dataset. The neighbor list is computed on the fly during the first epoch and stored to disk using tf.data’s cache. Most performant option for datasets with samples of very similar size.
- class apax.data.input_pipeline.InMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')
Base class for all datasets which store data in memory.
- init_input() → tuple[Dict[str, Array], ndarray]
Returns the first batch of inputs and labels, used to initialize the model.
- steps_per_epoch() → int
Returns the number of steps per epoch, which depends on the number of data points and the batch size. The count is chosen so that every epoch has the same number of steps and every batch has the same size. To achieve this, some training data are dropped in each epoch.
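The arithmetic behind this can be sketched with floor division. This assumes that only full batches are kept, which is consistent with the description above but is not taken from the apax source; `full_batches_per_epoch` is a hypothetical helper name.

```python
def full_batches_per_epoch(n_data, batch_size):
    """Number of equally sized batches that fit into n_data samples."""
    # Floor division keeps only complete batches; the remainder is dropped.
    return n_data // batch_size


n_data, batch_size = 1050, 32
steps = full_batches_per_epoch(n_data, batch_size)
dropped = n_data - steps * batch_size
print(steps, dropped)  # prints "32 26"
```

With 1050 samples and a batch size of 32, each epoch runs 32 steps and leaves 26 samples unused. If the data are shuffled between epochs, a different subset is dropped each time.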
- class apax.data.input_pipeline.OTFInMemoryDataset(atoms_list, cutoff, bs, n_epochs, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties: list[tuple] = [], pre_shuffle=False, shuffle_buffer_size=1000, ignore_labels=False, cache_path='.')
Dataset which pads everything (atoms, neighbors) to the largest system in the dataset. The neighbor list is computed on the fly and fed into a tf.data generator. Mostly for internal purposes.
- class apax.data.input_pipeline.PerBatchPaddedDataset(atoms_list, cutoff, bs, n_epochs, num_workers: int | None = None, atom_padding: int = 10, nl_padding: int = 2000, pos_unit: str = 'Ang', energy_unit: str = 'eV', additional_properties=[], pre_shuffle=False)
Dataset with padding that leverages multiprocessing and optimized buffering.
Per-atom and per-neighbor arrays are padded to the next multiple of a user-specified integer. This limits the compute wasted on padding at the (negligible) cost of some recompilations. Since the padding occurs on a per-batch basis, it is the most performant option for datasets whose systems vary widely in size (e.g. MaterialsProject, SPICE).
In addition, the neighbor list is computed on the fly in parallel on a side thread. This dataset does not use tf.data.
- num_workers
Number of processes to use for preprocessing batches.
- Type:
int
- atom_padding
Pad extensive arrays (positions, etc.) to next multiple of this integer.
- Type:
int
- nl_padding
Pad neighbor list arrays to next multiple of this integer.
- Type:
int
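The effect of padding to the next multiple can be sketched with a small helper. The name `next_multiple` is hypothetical, and the choice to leave exact multiples unchanged is an assumption for this example. The default values `atom_padding=10` and `nl_padding=2000` are taken from the class signature above.

```python
def next_multiple(n, multiple):
    """Round n up to the nearest multiple of `multiple` (exact multiples are kept)."""
    # Ceiling division written with integer arithmetic only.
    return -(-n // multiple) * multiple


# Largest atom count in five hypothetical batches.
atoms_per_batch = [12, 18, 23, 37, 95]
padded = [next_multiple(n, 10) for n in atoms_per_batch]
distinct_shapes = sorted(set(padded))
print(padded)  # prints "[20, 20, 30, 40, 100]"
```

Five different batch sizes collapse to four padded shapes here. Since array shapes are static under JAX's `jit`, each new padded shape triggers a recompilation, so a larger multiple means fewer recompilations but more padded entries per batch.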
- apax.data.input_pipeline.find_largest_system(inputs, r_max) → tuple[int]
Finds the maximal number of atoms and neighbors.
- Parameters:
inputs (dict) – Dictionary containing input data.
r_max (float) – Maximum interaction radius.
- Returns:
Tuple containing the maximum number of atoms and neighbors.
- Return type:
Tuple[int]
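A rough sketch of the idea, for non-periodic structures only: scan every structure, count its atoms and its neighbor pairs within `r_max`, and keep the maxima. The dictionary key `"positions"`, the helper names, and the convention of counting directed pairs per structure are all assumptions made for this example, not the documented layout of `inputs`.

```python
import numpy as np


def count_pairs(positions, r_max):
    """Number of directed neighbor pairs within r_max (no periodic images)."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude self-pairs on the diagonal.
    mask = (dist < r_max) & ~np.eye(len(positions), dtype=bool)
    return int(mask.sum())


def largest_system(inputs, r_max):
    """Return (max_atoms, max_neighbors) over all structures in inputs."""
    structures = [np.asarray(p, dtype=float) for p in inputs["positions"]]
    max_atoms = max(len(p) for p in structures)
    max_neighbors = max(count_pairs(p, r_max) for p in structures)
    return max_atoms, max_neighbors


# "positions" is a hypothetical key holding one array per structure.
inputs = {
    "positions": [
        np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
        np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    ]
}
max_atoms, max_neighbors = largest_system(inputs, r_max=1.2)
```

The second structure has three atoms and two bonded pairs within the cutoff, counted once per direction, so the result is `(3, 4)`. These maxima are the sizes that every sample would be padded to.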
- apax.data.input_pipeline.pad_nl(idx, offsets, max_neighbors)
Pad the neighbor list arrays to the maximal number of neighbors occurring.
- Parameters:
idx (np.ndarray) – Neighbor indices array.
offsets (np.ndarray) – Offset array.
max_neighbors (int) – Maximum number of neighbors.
- Returns:
Tuple containing padded neighbor indices array and offsets array.
- Return type:
Tuple[np.ndarray, np.ndarray]
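Zero-padding of this kind can be sketched with `np.pad`. The array layout is an assumption for this example (`idx` of shape `(2, n_pairs)`, `offsets` of shape `(n_pairs, 3)`), and `pad_neighbor_list` is a hypothetical name rather than the library function.

```python
import numpy as np


def pad_neighbor_list(idx, offsets, max_neighbors):
    """Append zero entries so both arrays describe max_neighbors pairs."""
    n_pad = max_neighbors - idx.shape[1]
    if n_pad < 0:
        raise ValueError("max_neighbors is smaller than the number of pairs")
    # Pad the pair axis only: axis 1 for idx, axis 0 for offsets.
    idx_padded = np.pad(idx, ((0, 0), (0, n_pad)), mode="constant")
    offsets_padded = np.pad(offsets, ((0, n_pad), (0, 0)), mode="constant")
    return idx_padded, offsets_padded


idx = np.array([[0, 1], [1, 0]])
offsets = np.zeros((2, 3))
idx_padded, offsets_padded = pad_neighbor_list(idx, offsets, max_neighbors=5)
```

The padded entries point from atom 0 to atom 0. Downstream code has to mask or otherwise neutralize such pairs so that they do not contribute to the model output.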