Matminer

Matminer is an open-source Python library that accelerates data-driven materials discovery. It consolidates ready-made datasets, automated data-retrieval utilities, and a comprehensive catalog of featurizers so researchers can move from raw compositions or structures to machine-learning-ready tables with minimal boilerplate code.

MatDaCs Tool Review: Matminer

Overview

Matminer packages together dataset access, feature engineering, and pandas utilities so that materials informatics projects can sprint from structures to machine-learning models. I exercised the latest 0.9.3 release on macOS (Apple M4 Pro) to evaluate day-one usability for MatDaCs authors.

What is Matminer?

Originally released by the HackingMaterials group at LBNL, Matminer is a BSD-licensed Python toolkit targeting tabular materials analytics. It ships with:

  • matminer.datasets: 40+ ready-to-use benchmark datasets with provenance metadata.
  • matminer.data_retrieval: authenticated clients for Materials Project, Citrination, MPDS, and more.
  • matminer.featurizers: 70+ descriptors for compositions, structures, sites, electronic DOS/bandstructures, and conversions between ASE/pymatgen/pandas objects.

Matminer relies on pandas DataFrames as its lingua franca, so it interoperates instantly with scikit-learn, PyTorch tabular backends, and visualization stacks such as matplotlib or plotly.
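
For instance, any catalog entry loads directly as a DataFrame (a minimal sketch; "dielectric_constant" is one entry from the catalog, and the printed columns vary by dataset):

# Sketch: every Matminer dataset arrives as an ordinary pandas DataFrame.
from matminer.datasets import load_dataset, get_available_datasets

print(get_available_datasets())            # browse the curated catalog
df = load_dataset("dielectric_constant")   # plain pandas from here on
print(df.shape, df.columns.tolist()[:5])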

Key Features

  • Curation-first datasets: get_available_datasets() lists dielectric tensors, experimental band gaps, thermoelectric transport, metallic-glass screens, and more, making literature replication straightforward.
  • Composable featurizers: Chain StrToComposition → ElementProperty → SiteStatsFingerprint to transform raw formulas/structures into dense feature vectors without bespoke scripts.
  • Retrieval helpers: Built-in MPDataRetrieval and CitrineDataRetrieval classes handle API throttling, query chunking, and data normalization.
  • Citation tracking: Every featurizer exposes citations() so you can auto-generate BibTeX blocks for MatDaCs writeups (see the sketch after this list).
  • Parallel-friendly design: Featurizers expose an n_jobs setting via set_n_jobs(), and featurize_dataframe fans work out across multiprocessing workers for multi-core throughput.
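
As a quick illustration of the citation and parallelism hooks (a minimal sketch; the preset and job count are arbitrary choices):

# Sketch: inspect citations and enable multi-core featurization.
from matminer.featurizers.composition import ElementProperty

ep = ElementProperty.from_preset("magpie")
print(ep.citations())   # BibTeX entries for the underlying data and method
ep.set_n_jobs(4)        # use four worker processes in featurize_dataframe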

Installation

python3 -m pip install --user matminer

This pulled pymatgen 2024.8.9, pandas 2.0, scikit-learn 1.3, and visualization extras (~60 MB total). No compilation was necessary on Apple silicon.

Examples in the Official Gallery

The Matminer website’s Examples section curates notebooks that span data retrieval, feature engineering, and visualization.

Example 1 · Predicting bulk modulus

Notebook machine_learning-nb/bulk_modulus.ipynb demonstrates how to load the elastic tensor dataset, convert formulas to compositions, attach Magpie statistics, and visualize model quality with FigRecipes. Reproducing the core pipeline locally (matminer_bulk_modulus_demo.py) gave a 1,181-row, 135-feature table and a RandomForestRegressor baseline of MAE = 14.5 GPa, R^2 = 0.916 for K_VRH, matching the accuracy envelope advertised in the docs.

# matminer_bulk_modulus_demo.py
from matminer.datasets.convenience_loaders import load_elastic_tensor
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.structure import DensityFeatures

# Load the elastic tensor dataset and keep only the columns the pipeline needs.
df = load_elastic_tensor()[["formula", "structure", "K_VRH"]].dropna()
# Parse formula strings into pymatgen Composition objects.
df = StrToComposition().featurize_dataframe(df, "formula")
# Attach Magpie elemental-property statistics (132 composition features).
df = ElementProperty.from_preset("magpie").featurize_dataframe(df, "composition")
# Add density, volume-per-atom, and packing-fraction structure features.
df = DensityFeatures().featurize_dataframe(df, "structure")
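
To reproduce the reported baseline, the pipeline continues with a plain scikit-learn fit (a minimal sketch; the hyperparameters and 80/20 split are illustrative choices, not necessarily the notebook's exact settings):

# Continuation sketch: fit a baseline model on the featurized table.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop(columns=["formula", "composition", "structure", "K_VRH"])
y = df["K_VRH"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print(f"MAE = {mean_absolute_error(y_test, pred):.1f} GPa, "
      f"R^2 = {r2_score(y_test, pred):.3f}")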

Example 2 · Experimental vs. computed band gaps

Notebook data_retrieval-nb/expt_vs_comp_bandgap.ipynb pairs CitrineDataRetrieval with MPDataRetrieval to pull experimental gaps, fetch MP’s calculated values, and generate parity plots. Even when API tokens are required, the example shows how Matminer harmonizes disparate sources into pandas tables, tags each column with BibTeX metadata, and then hands the cleaned data to FigRecipes. This workflow is ideal for MatDaCs articles that need to fact-check public databases against high-throughput calculations.
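
A minimal retrieval sketch (assuming a legacy Materials Project API key; the query criteria and property list are illustrative):

# Sketch: pull computed band gaps from the Materials Project into pandas.
from matminer.data_retrieval.retrieve_MP import MPDataRetrieval

mpdr = MPDataRetrieval(api_key="YOUR_MP_API_KEY")   # legacy MP API key
df_mp = mpdr.get_dataframe(
    criteria={"band_gap": {"$gt": 0.0}},            # MongoDB-style query
    properties=["pretty_formula", "band_gap"],
)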

Comparison with DScribe

  • Descriptor granularity: Matminer excels at global/tabular descriptors (composition statistics, averaged site fingerprints), whereas DScribe focuses on local atomic environments (SOAP, MBTR, ACSF). Combining them lets you couple global context with DScribe’s high-resolution kernels.
  • Dataset integration: Matminer offers curated datasets and API retrievers; DScribe purposely stays lean and assumes you already have structures. For MatDaCs, Matminer can be the staging ground and DScribe the local descriptor plug-in (see the sketch after this list).
  • Performance focus: DScribe’s descriptors are heavily optimized in C++ for 3D environments. Matminer’s bottlenecks are typically pandas operations; multi-core featurization is available but not as GPU-friendly. Choose according to whether you need per-site SOAP vectors (DScribe) or tabular feature sets for classical ML (Matminer).
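
To make the pairing concrete, here is a sketch of handing a structure from a Matminer dataset to DScribe (assuming DScribe 2.x argument names and an ASE installation; the SOAP hyperparameters are arbitrary):

# Sketch: per-site SOAP vectors for a structure loaded via Matminer.
from matminer.datasets import load_dataset
from pymatgen.io.ase import AseAtomsAdaptor
from dscribe.descriptors import SOAP

structure = load_dataset("elastic_tensor_2015")["structure"].iloc[0]
atoms = AseAtomsAdaptor.get_atoms(structure)        # pymatgen -> ASE
soap = SOAP(species=sorted(set(atoms.get_chemical_symbols())),
            r_cut=5.0, n_max=8, l_max=6)            # illustrative hyperparameters
site_vectors = soap.create(atoms)                   # shape: (n_sites, n_features)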

Application Areas

  • Rapid benchmarking of ML pipelines (e.g., MatBench tasks).
  • Feature screening for automated synthesis/design campaigns.
  • Teaching materials informatics—datasets + featurizers provide reproducible labs.
  • Preprocessing layer before feeding data to DScribe, MEGNet, or custom GNNs.

Hands-on Notes

  • The import chain triggers urllib3’s NotOpenSSLWarning because macOS ships LibreSSL; functionality remains unaffected.
  • StrToComposition and ElementProperty supply tqdm progress bars, which is handy on large datasets.
  • When running featurizers that spawn subprocesses, set the multiprocessing start method to fork (macOS) or execute from a .py file (matminer_bulk_modulus_demo.py, matminer_dielectric_demo.py) to avoid the <stdin> launch issue; see the sketch after this list.
  • Matminer’s featurizer defaults keep impute_nan=False. Consider enabling imputation before model fitting to avoid NaNs when elements lack tabulated properties.
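
For interactive macOS sessions, both notes above boil down to a short preamble (a minimal sketch; passing impute_nan=True to from_preset assumes a matminer version recent enough to expose that flag):

# Sketch: macOS-friendly setup for parallel featurization.
import multiprocessing as mp
from matminer.featurizers.composition import ElementProperty

if __name__ == "__main__":
    mp.set_start_method("fork", force=True)  # avoid the <stdin> spawn issue
    ep = ElementProperty.from_preset("magpie", impute_nan=True)  # assumed flag: fill gaps in elemental tables
    ep.set_n_jobs(4)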

Conclusion

Matminer remains a dependable backbone for MatDaCs contributors who need curated datasets plus feature pipelines under one roof. Pair it with DScribe when your workflow needs both global (composition/structure) and local (atomic environment) descriptors. The hands-on bulk-modulus example shows that you can achieve publishable baselines within minutes.

References

  • Matminer documentation: <https://hackingmaterials.lbl.gov/matminer>
  • Matminer examples gallery: <https://hackingmaterials.lbl.gov/matminer/index.html#examples>
  • Matminer GitHub repository: <https://github.com/hackingmaterials/matminer>
  • Ward et al., Comput. Mater. Sci. 152, 60–69 (2018)
  • Official notebook – bulk modulus regression: <https://nbviewer.jupyter.org/github/hackingmaterials/matminer_examples/blob/main/matminer_examples/machine_learning-nb/bulk_modulus.ipynb>
  • Official notebook – experimental vs. computed band gap: <https://nbviewer.jupyter.org/github/hackingmaterials/matminer_examples/blob/main/matminer_examples/data_retrieval-nb/expt_vs_comp_bandgap.ipynb>