Atompack: A Fast Storage Layer for Atomistic ML Training
We built Atompack as part of LeMaterial, where the same datasets need to move between curation, training, benchmarking, and public distribution. The goal is to write atomistic structures once, reopen them efficiently for training, and publish the same artifacts without converting through several intermediate formats.
The project combines:
- a Python API
- a Rust storage engine
- an append-only .atp format
- read-only mmap-backed access for serving static datasets
- batch ingestion paths for NumPy and ASE (Atomic Simulation Environment)
- Hugging Face upload and download support in the base package
Repository: https://github.com/LeMaterial/atompack.
Why We Built Atompack
Atomistic ML pipelines often start with tools that are a great fit for scientific workflows. But the requirements change once the dataset is feeding dataloaders and training loops: shuffled reads across many epochs become the workload that matters most.
This gets even more noticeable once datasets are distributed as large collections of shards. In practice, some training splits end up with thousands of files, for example more than 6,000 shards for the train split of OMAT24. On shared filesystems such as Lustre, many small files and random reads can create substantial metadata and I/O pressure. In many cases, that sharding pattern is partly an artifact of slow write paths and export workflows rather than something the training setup actually needs.
Atompack is aimed at that workload. The core storage unit is the whole molecule, with direct indexing into an immutable dataset snapshot. The main workflow is:
- write molecules or stacked array batches into an append-only .atp file
- flush() to publish a new committed trailing index
- reopen in read-only mode with Database.open(...)
- read by molecule index, convert to ASE when needed
Under the hood, the file layout is simple:
- two 4 KiB header slots
- a data region containing molecule records
- a trailing index written on flush()
That layout keeps appends straightforward while giving Atompack O(1) lookup through the committed index. For read-mostly datasets, Database.open(path) uses mmap-backed read-only mode by default.
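The layout described above can be illustrated with a toy file format. This is a minimal sketch, not Atompack's actual on-disk format: the header fields, the generation counter, the practice of alternating between the two slots, and the names ToyWriter and read_record are all assumptions made for illustration. It shows how a trailing index written at flush time gives constant-time lookup by record number, and why readers only ever see a committed snapshot.

```python
import os
import struct
import tempfile

SLOT = 4096                      # size of one header slot (4 KiB)
DATA_START = 2 * SLOT            # records start after the two header slots
HEADER = struct.Struct("<QQQ")   # toy fields: generation, index offset, record count
ENTRY = struct.Struct("<QQ")     # toy index entry: record offset, record length


class ToyWriter:
    """Hypothetical append-only writer: header slots, records, trailing index."""

    def __init__(self, path):
        self.path = path
        self.entries = []        # (offset, length) for every record written so far
        self.generation = 0
        self.end = DATA_START    # where the next append goes
        with open(path, "wb") as f:
            f.write(b"\0" * DATA_START)  # two empty header slots

    def append(self, payload):
        with open(self.path, "r+b") as f:
            f.seek(self.end)
            f.write(payload)
        self.entries.append((self.end, len(payload)))
        self.end += len(payload)

    def flush(self):
        """Write the index after the data, then publish it through a header slot."""
        index = b"".join(ENTRY.pack(off, n) for off, n in self.entries)
        self.generation += 1
        header = HEADER.pack(self.generation, self.end, len(self.entries))
        with open(self.path, "r+b") as f:
            f.seek(self.end)
            f.write(index)                         # 1) trailing index
            f.flush()
            os.fsync(f.fileno())
            f.seek((self.generation % 2) * SLOT)   # 2) alternate header slots
            f.write(header)
            f.flush()
            os.fsync(f.fileno())
        self.end += len(index)   # later appends land after the old index


def read_record(path, i):
    """Look up record i with one index read and one data read."""
    with open(path, "rb") as f:
        slots = []
        for s in (0, 1):
            f.seek(s * SLOT)
            slots.append(HEADER.unpack(f.read(HEADER.size)))
        _, index_offset, count = max(slots)        # highest generation wins
        if not 0 <= i < count:
            raise IndexError(i)
        f.seek(index_offset + i * ENTRY.size)
        offset, length = ENTRY.unpack(f.read(ENTRY.size))
        f.seek(offset)
        return f.read(length)


path = os.path.join(tempfile.mkdtemp(), "toy.pack")
writer = ToyWriter(path)
for payload in (b"H2O", b"CO2", b"CH4"):
    writer.append(payload)
writer.flush()
print(read_record(path, 1))
```

In this sketch, records appended after a flush stay invisible to readers until the next flush writes a new index and header, which mirrors the "committed snapshot" behavior described above.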
That focus also reflects the broader LeMaterial workflow. The project is not just about storing one dataset efficiently; it is about making large atomistic datasets easier to build, benchmark, publish, share, and reuse across a shared open-science ecosystem.
Try It From Hugging Face
Install from PyPI:
pip install atompack-db
The quickest way to try Atompack is to open one of the public datasets already packaged on the Hub:
import atompack
db = atompack.hub.open(
    repo_id="LeMaterial/Atompack",
    path_in_repo="lematbulk/pbe",
)
print(len(db))
mol = db[0]
print(mol.energy)
print(mol.positions.shape)
API for Dataset Workflows
Write a dataset and reopen it for reads:
import atompack
import numpy as np

# Create a dataset with your data...
# 32 molecules of 64 atoms each: positions have shape (32, 64, 3)
positions = np.random.rand(32, 64, 3).astype(np.float32)
atomic_numbers = np.full((32, 64), 6, dtype=np.uint8)  # all carbon (Z = 6)

db = atompack.Database("train.atp", overwrite=True)
db.add_arrays_batch(positions, atomic_numbers)
db.flush()  # publish the committed trailing index

# Reopen the file for reads
db = atompack.Database.open("train.atp")
for i in range(4):
    mol = db[i]
    print(i, len(mol), mol.positions.shape)

# ... or use existing datasets from Hugging Face
remote_db = atompack.hub.open(
    repo_id="LeMaterial/Atompack",
    path_in_repo="omat/train",
)
print(len(remote_db))
print(remote_db[0].energy)
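Because the opened database supports len() and integer indexing, as the examples above show, shuffled reads for training can be layered on top with very little code. The following is a minimal sketch, not part of Atompack: shuffled_batches is a hypothetical helper, and a plain Python list stands in for the database so the snippet runs without any dataset on disk.

```python
import random
from typing import Iterator, List, Sequence


def shuffled_batches(db: Sequence, batch_size: int, epoch: int, seed: int = 0) -> Iterator[List]:
    """Yield every record exactly once, in an order that depends on (seed, epoch)."""
    order = list(range(len(db)))
    random.Random(seed + epoch).shuffle(order)  # reproducible for a given epoch
    for start in range(0, len(order), batch_size):
        yield [db[i] for i in order[start:start + batch_size]]


# Stand-in for an opened database: anything with len() and db[i] works here.
records = [f"mol-{i}" for i in range(10)]

for epoch in range(2):
    for batch in shuffled_batches(records, batch_size=4, epoch=epoch):
        print(epoch, batch)
```

Deriving the shuffle from the epoch number keeps each epoch's order reproducible without storing the permutation.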
If your pipeline already uses ASE, you can ingest structures directly:
import atompack
from ase import Atoms
structures = [
    Atoms("H2O", positions=[[0, 0, 0], [1, 0, 0], [0, 1, 0]]),
    Atoms("CO2", positions=[[0, 0, 0], [1.16, 0, 0], [-1.16, 0, 0]]),
]
db = atompack.Database("ase_data.atp", overwrite=True)
atompack.add_ase_batch(db, structures, batch_size=256)
db.flush()
To upload a dataset to Hugging Face:
import atompack
atompack.hub.upload(
    "exports/omat/train",
    repo_id="org/atompack-demo",
    path_in_repo="omat/train",
)
Performance on Read-Heavy Workloads
The benchmarks show that Atompack performs well on read-heavy dataset serving. Write throughput is strong with the native batch APIs, and artifact size stays close to HDF5 SOA while remaining much smaller than the LMDB and ASE baselines used in this repository.
Benchmark setup: this slice uses synthetic fixed-size records with 64 atoms per molecule. The high-throughput read benchmark uses 1M generated molecules and reports read loops on local NVMe storage (Samsung 990 EVO Plus SSD). The random/shuffled number is the single-worker shuffled-read path.
- 646k mol/s on sequential reads
- 446k mol/s on the random/shuffled read path (single worker)
- 24.0x faster than HDF5 SOA on the random or shuffled path
- 2.81x faster than LMDB Packed on the random or shuffled path
- 3.82x faster than LMDB Pickle on the random or shuffled path
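Throughput figures of this kind come from timing a read loop over a fixed access order. The snippet below is a rough sketch of such a measurement, not the benchmark harness used for the numbers above: reads_per_second is a hypothetical helper, and an in-memory list stands in for a real dataset, so the rates it prints say nothing about any storage backend.

```python
import random
import time
from typing import Sequence


def reads_per_second(db: Sequence, order: Sequence[int]) -> float:
    """Time one pass over db in the given index order and return records per second."""
    start = time.perf_counter()
    for i in order:
        _ = db[i]  # the read being measured
    elapsed = time.perf_counter() - start
    return len(order) / elapsed if elapsed > 0 else float("inf")


# Stand-in dataset: replace with any object that supports len() and db[i].
records = [(i, i * 0.5) for i in range(100_000)]

sequential = list(range(len(records)))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are comparable

seq_rate = reads_per_second(records, sequential)
shuf_rate = reads_per_second(records, shuffled)
print(f"sequential: {seq_rate:,.0f} records/s")
print(f"shuffled:   {shuf_rate:,.0f} records/s")
```

On a real file-backed dataset, the gap between the two rates is what the random/shuffled comparison above is about. Page-cache state and storage hardware will strongly affect both numbers.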
Write throughput is strong as well. On the same 64-atom NVMe slice, Atompack reaches:
- 105,473 mol/s for builtin-field writes
- 77,193 mol/s when writing additional custom properties
Storage footprint stays near the compact end of the comparison set:
- HDF5 SOA: 0.96x Atompack size on builtins and 0.95x on the custom-property slice
- Atompack: 1.00x
- LMDB Packed: 2.34x builtins and 1.35x custom
- LMDB Pickle: 2.35x builtins and 1.35x custom
- ASE SQLite: 3.05x builtins and 2.08x custom
- ASE LMDB: 4.69x builtins and 2.69x custom
While Atompack is not always the absolute smallest representation, the main result is that it stays in the compact-storage regime while pairing that with much stronger read behavior.
Similar behaviours were observed on Lustre / NFS / GPFS filesystems.
Public Datasets on Hugging Face
We also provide some public datasets in the Atompack format on Hugging Face: https://huggingface.co/datasets/LeMaterial/Atompack. The main dataset paths currently exposed there include:
- lematbulk/pbe, from LeMat-Bulk
- matpes/pbe and matpes/r2scan, from MatPES
- mp_aloe, from MP-ALOE
- mptrj, from MPtrj
- omat/train and omat/val, from OMAT24
- oc20/s2ef_train_all, from Open Catalyst 2020 (OC20)
If you use any of these datasets, please cite the original dataset authors. The Atompack repository is a packing and serving layer, not the original source of the data.
When to Use Atompack
Atompack is a good fit when the storage layer itself has become a bottleneck: large datasets, random reads, many worker processes, and repeated conversion or publish steps. It is not trying to replace the rest of the scientific Python ecosystem. It focuses on atomistic ML workloads that need a faster and simpler path between dataset creation, dataset serving, and publication. When the bottleneck lies instead in graph construction, feature computation, or model training, existing tools are already a good fit. We built Atompack to fill that specific gap and hope it can support faster, more efficient training pipelines that push the state of the art in atomistic ML.
Additional Resources
Citations
If you use one of the packaged datasets or mentioned tools, please cite the original authors.
- MatPES: A Foundational Potential Energy Surface Dataset for Materials, Kaplan et al. (2025).
- MP-ALOE: MP-ALOE: an r2SCAN dataset for universal machine learning interatomic potentials, Kuner et al. (2025).
- MPtrj / CHGNet: CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling, Deng et al. (2023).
- Open Catalyst 2020 / OC20: Open Catalyst 2020 (OC20) Dataset and Community Challenges, Chanussot et al. (2021).
- OMat24: Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models, Barroso-Luque et al. (2024).
- LeMat-Bulk: LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases, Siron et al. (2025).
- ASE: The atomic simulation environment - a Python library for working with atoms, Larsen et al. (2017).
- HDF5: HDF5 Library, The HDF Group. Citation DOI: https://doi.org/10.5281/zenodo.17808558
- LMDB / MDB: MDB: A Memory-Mapped Database and Backend for OpenLDAP, Chu (2011).



