MatDaCs tool review: CDVAE

Overview

CDVAE is a diffusion-based variational autoencoder for crystal structures. It targets materials generation, reconstruction, and property-conditioned design, with curated datasets (Perov-5, Carbon-24, MP-20) and training/evaluation utilities. I set up a local baseline on the Perov-5 dataset to validate the data pipeline and establish a simple reference model.

What is CDVAE?

CDVAE learns a latent representation of periodic structures and uses diffusion to generate realistic crystals while handling lattice and fractional coordinates. The project’s Hydra configs cover data preprocessing, model architecture, and training schedules, and the dataset package provides ready-to-run train/val/test splits for standard benchmarks.
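To make the lattice-plus-fractional-coordinate representation concrete: a periodic crystal is described by a 3×3 lattice matrix (rows are lattice vectors) plus per-atom fractional coordinates in [0, 1), and Cartesian positions are recovered as the product of the two. The toy sketch below (plain Python, my own illustration, not CDVAE code) shows the conversion for a cubic cell:

```python
# Toy illustration of the periodic representation CDVAE models:
# a lattice matrix L (rows = lattice vectors) plus fractional coordinates.
lattice = [
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 4.0],
]  # a 4 Angstrom cubic cell

frac_coords = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]  # corner + body-center motif

def frac_to_cart(frac, lattice):
    # cart_i = sum_j frac_j * L[j][i]
    return tuple(sum(f * lattice[j][i] for j, f in enumerate(frac))
                 for i in range(3))

cart = [frac_to_cart(f, lattice) for f in frac_coords]
print(cart)  # [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
```

Fractional coordinates are what make diffusion over atom positions well-defined under periodic boundary conditions, since displacements wrap modulo the unit cell.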

Key Features

  • Generative crystal modeling: diffusion + VAE backbone designed for periodic structure generation.
  • Property conditioning: optional property prediction heads for inverse design workflows.
  • Benchmark datasets: Perov-5 (composition-varying perovskites), Carbon-24 (composition-fixed carbon allotropes), MP-20 (general inorganic crystals).
  • Config-driven training: Hydra configs separate data, model, and trainer settings for reproducibility.
  • Evaluation scripts: metrics and visualization utilities for reconstruction, generation quality, and property optimization.

Installation

CDVAE ships with conda environment files (env.yml, env.cpu.yml) that pin CUDA/PyTorch/Lightning versions. The intended setup is:

conda env create -f env.yml

I recreated a fresh cdvae conda environment and installed dependencies via conda-forge/pytorch/pyg channels. The full training demo still could not complete on macOS arm64 because torch-sparse, torch-cluster, and torch-spline-conv are not available for this architecture via conda, and pip downloads are blocked by DNS in this environment. I therefore ran a lightweight baseline on the official Perov-5 dataset to validate data access and provide a comparison point.
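A quick diagnostic (my own sketch, not part of CDVAE) for checking which of the PyG companion packages actually resolved in a given environment, before launching a long training run:

```python
import importlib.util

def missing_pyg_extensions():
    """Return the PyG native-extension packages that are not importable."""
    required = ["torch_sparse", "torch_cluster", "torch_spline_conv"]
    return [name for name in required if importlib.util.find_spec(name) is None]

# On macOS arm64 with conda-only installs, all three typically show up here.
print(missing_pyg_extensions())
```

Running this inside the `cdvae` environment makes the failure mode explicit up front instead of surfacing mid-instantiation.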

Local Example (Perov-5 Baseline)

Goal: build a composition-only baseline on the official Perov-5 splits (train.csv, val.csv, test.csv) using elemental fractions as features. This validates the dataset files and yields a reference error for later CDVAE comparisons.

Minimal implementation

import pandas as pd
import numpy as np
from pymatgen.core import Composition
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Official Perov-5 splits (paths elided here).
train_df = pd.read_csv(".../perov_5/train.csv")
val_df = pd.read_csv(".../perov_5/val.csv")
test_df = pd.read_csv(".../perov_5/test.csv")

# Element vocabulary from the training split only; elements that appear
# solely in val/test are dropped from the feature vector.
elements = sorted({el for f in train_df["formula"]
                   for el in Composition(f).as_dict()})

def featurize_formula(series):
    """Map each formula to a vector of elemental fractions over `elements`."""
    rows = []
    for f in series.astype(str):
        comp = Composition(f).get_el_amt_dict()
        total = sum(comp.values())
        rows.append([comp.get(el, 0.0) / total for el in elements])
    return np.array(rows, dtype=float)

X_train = featurize_formula(train_df["formula"])
X_val = featurize_formula(val_df["formula"])
X_test = featurize_formula(test_df["formula"])

y_train = train_df["heat_ref"].values
y_val = val_df["heat_ref"].values
y_test = test_df["heat_ref"].values

model = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Validation and test metrics (reported in Results below).
val_pred = model.predict(X_val)
test_pred = model.predict(X_test)
val_mae, val_r2 = mean_absolute_error(y_val, val_pred), r2_score(y_val, val_pred)
test_mae, test_r2 = mean_absolute_error(y_test, test_pred), r2_score(y_test, test_pred)

Results

  • Target: heat_ref
  • Train/Val/Test sizes: 11,356 / 3,787 / 3,785
  • Features: 56 elemental fractions
  • Validation: MAE = 0.543, R² = -0.162
  • Test: MAE = 0.547, R² = -0.182

The negative R² indicates this baseline is intentionally weak (composition-only, no structure), which makes it a useful lower bound before running the full CDVAE model.
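As a sanity check on what a negative R² means: R² = 1 − SS_res/SS_tot, so any model whose squared error exceeds that of simply predicting the mean of the targets scores below zero. A small self-contained illustration (toy numbers, not the Perov-5 data):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
mean_pred = [2.5] * 4             # always predict the mean -> R^2 == 0
bad_pred = [4.0, 1.0, 4.0, 1.0]   # worse than the mean -> negative R^2

print(r2(y_true, mean_pred))  # 0.0
print(r2(y_true, bad_pred))   # -3.0
```

So an R² around −0.17 means the composition-only features carry slightly less signal for `heat_ref` than the global mean, which is the expected outcome when the target depends on structure.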

Official Demo Attempt (Perov-5)

Goal: run the official training entrypoint on the Perov-5 dataset with a short 3-epoch configuration.

Command used

PYTHONPATH="/Users/lihengyu/Desktop/Research_Project/On-campus 2025-2/CDVAE/cdvae" \
PROJECT_ROOT="/Users/lihengyu/Desktop/Research_Project/On-campus 2025-2/CDVAE/cdvae" \
HYDRA_JOBS="/Users/lihengyu/Desktop/Research_Project/On-campus 2025-2/CDVAE/outputs" \
WABDB_DIR="/Users/lihengyu/Desktop/Research_Project/On-campus 2025-2/CDVAE/wandb" \
conda run -n cdvae python "/Users/lihengyu/Desktop/Research_Project/On-campus 2025-2/CDVAE/cdvae/cdvae/run.py" \
  data=perov expname=perov_demo hydra.job.chdir=false \
  train.pl_trainer.gpus=0 train.pl_trainer.max_epochs=3

(Paths contain spaces, so the quoting above is required; WABDB_DIR is the variable name CDVAE itself reads.)

Observed behavior

  • The datamodule started preprocessing the Perov-5 training set (11,356 structures). This step completed in ~24 seconds on CPU.
  • Model instantiation failed because torch_geometric 2.6.1 on macOS arm64 no longer provides the torch_geometric.nn.acts module, and torch-sparse/torch-cluster/torch-spline-conv are unavailable on conda for arm64.

Error summary

ModuleNotFoundError: No module named 'torch_geometric.nn.acts'
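Newer torch_geometric releases removed the nn.acts module from which CDVAE imports swish. One workaround I have seen suggested (an assumption on my part, not an official fix) is to patch the import with a local definition, since swish is simply x · sigmoid(x). A scalar sketch of the function itself (in CDVAE it would be applied element-wise to torch tensors):

```python
import math

# Local stand-in for the removed torch_geometric.nn.acts.swish.
# Hypothetical shim for illustration; the real code operates on tensors.
def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x)"""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(swish(0.0))  # 0.0
```

The alternative is to pin an older torch_geometric release that still ships nn.acts, which is what the original environment files effectively do.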

Hands-on Notes

  • The official CDVAE conda environment failed to resolve because the pinned versions (pytorch=1.8.1, pymatgen=2020.12.31) are no longer available on current default channels.
  • The local machine is Apple Silicon (arm64). PyTorch is installed in the base environment, but MPS acceleration is not available (torch.backends.mps.is_available() == False), so training would run on CPU even if dependencies were satisfied.
  • The Perov-5 dataset files (train.csv, val.csv, test.csv) are accessible and can be used immediately for baselines or for CDVAE once the PyG stack (torch-sparse, torch-cluster, torch-spline-conv) is available.

Conclusion

CDVAE is a well-structured reference for crystal generation, offering datasets, configs, and evaluation tooling that are valuable for MatDaCs authors. In this macOS arm64 environment, the official demo could not complete due to missing PyG native extensions, but the Perov-5 baseline validates data access and provides a clear performance floor. Running the same command on Linux/x86_64 (or a macOS x86_64 conda environment under Rosetta, e.g. via CONDA_SUBDIR=osx-64) should allow the official training demo to complete.
