This repository contains the official implementation of ConfDENSE, proposed and used by Saha et al. The primary goal of ConfDENSE is to identify the molecular conformer that is most responsible for the odor profile exhibited by a molecule.
A conformer is a different three-dimensional arrangement of the same molecule that arises due to rotations around its chemical bonds. While the atoms and chemical composition remain unchanged, the spatial arrangement of the atoms differs. Since molecular interactions are inherently three-dimensional, different conformers can contribute differently to a molecule's odor profile. In addition to conformer discovery, we also investigate how effectively the ConfDENSE framework can be used for molecular odor prediction.
ConfDENSE consists of two main components.
The first component is the pretraining of a PointNet-based architecture. For each molecule in the dataset, we generate multiple conformers and sample points from their corresponding electron density distributions to construct point clouds (Refer to this). Each conformer-specific point cloud is then used to train the PointNet model to predict the odor labels of the parent molecule. In our experiments, we generate 100 conformers for each molecule and train the PointNet model on the resulting point clouds. For more details, refer to this.
Once the PointNet model has been trained, we move to the second component, called the Aggregator. The Aggregator is trained separately using the saved outputs of the PointNet model. For each molecule, it receives the predictions corresponding to its 100 conformers and learns to produce a final odor prediction for the molecule. The train, validation, and test splits used during PointNet training are retained for Aggregator training. For more details, refer to this.
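As a rough illustration of the data flow into the Aggregator, the sketch below groups conformer-level prediction vectors under their parent molecule. The flat `(molecule_key, conformer_index, prediction)` record layout and the names `group_by_molecule` and `records` are assumptions made for this example; they are not the storage format used by the repository.

```python
from collections import defaultdict


def group_by_molecule(records):
    """Collect conformer-level prediction vectors under their parent molecule.

    `records` is an iterable of (molecule_key, conformer_index, prediction)
    tuples. This flat layout is hypothetical and only serves the example.
    """
    grouped = defaultdict(dict)
    for molecule_key, conformer_index, prediction in records:
        grouped[molecule_key][conformer_index] = list(prediction)
    # Order each molecule's conformers by index so that position i is conformer i.
    return {
        key: [by_index[i] for i in sorted(by_index)]
        for key, by_index in grouped.items()
    }


# Toy input: two molecules, two odor labels, records in arbitrary order.
records = [
    ("mol_a", 1, [0.2, 0.7]),
    ("mol_b", 0, [0.9, 0.1]),
    ("mol_a", 0, [0.1, 0.8]),
]
per_molecule = group_by_molecule(records)
```

Because the Aggregator consumes one set of conformer predictions per molecule, grouping by molecule before splitting keeps every conformer of a molecule in the same train, validation, or test partition.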
Finally, to identify the "optimal" conformer of a given molecule, we explore two different approaches. In the second approach, we compute the cosine similarity between each conformer's PointNet prediction and the ground-truth odor profile of the molecule. The conformer whose prediction is most similar to the true odor profile is selected as the optimal conformer. Further details regarding both approaches can be found here.
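The cosine-similarity selection can be sketched in a few lines. This is a minimal illustration with toy numbers, not the repository's implementation; the function names `cosine_similarity` and `select_optimal_conformer` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Treat an all-zero vector as having no similarity.
    return dot / (norm_u * norm_v)


def select_optimal_conformer(conformer_predictions, true_labels):
    """Return the index of the most similar conformer and all similarity scores."""
    scores = [cosine_similarity(p, true_labels) for p in conformer_predictions]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy example: three conformers, three odor labels.
true_labels = [1, 0, 1]
conformer_predictions = [
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.5, 0.5, 0.5],
]
best_index, scores = select_optimal_conformer(conformer_predictions, true_labels)
```

Sorting the conformer indices by `scores` instead of taking the maximum gives a full ranking of conformers for a molecule.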
We use the standard molecular odor dataset, which we refer to throughout this repository as the GS-LF dataset. The dataset contains 4,983 molecules, where each molecule is annotated with one or more odor labels, making it a multi-label classification dataset.
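In a multi-label setting, each molecule's annotation is typically encoded as a binary vector with one entry per odor label. The sketch below shows one way to build such vectors; the molecule keys and odor descriptors are invented for illustration and are not taken from the GS-LF dataset.

```python
def build_label_vocabulary(label_sets):
    """Return a sorted list of every distinct label across all molecules."""
    return sorted({label for labels in label_sets for label in labels})


def multi_hot(labels, vocabulary):
    """Encode a molecule's labels as a 0/1 vector aligned with `vocabulary`."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical annotations: each molecule carries one or more odor labels.
annotations = {
    "mol_a": ["fruity", "sweet"],
    "mol_b": ["woody"],
    "mol_c": ["sweet", "floral", "woody"],
}
vocabulary = build_label_vocabulary(annotations.values())
targets = {key: multi_hot(labels, vocabulary) for key, labels in annotations.items()}
```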
For every molecule, we generate 100 conformers and sample points from the corresponding electron density distribution of each conformer. This process produces a point cloud representation for every conformer of every molecule. Consequently, each molecule is represented by 100 conformer-specific point clouds that capture its three-dimensional electron density structure.
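To make the idea of sampling a point cloud from a density concrete, the toy sketch below replaces the electron density with a weighted sum of isotropic Gaussians centered on atom positions. This stand-in, along with the names `sample_point_cloud`, `centers`, `weights`, and `sigma`, is an assumption for illustration only; it is not how the ConfDENSE point clouds are computed.

```python
import random


def sample_point_cloud(centers, weights, sigma, num_points, seed=0):
    """Draw 3D points from a mixture of isotropic Gaussians.

    Each Gaussian is centered on one entry of `centers` and chosen with
    probability proportional to the matching entry of `weights`.
    """
    rng = random.Random(seed)
    cloud = []
    for _ in range(num_points):
        # Pick a component, then perturb its center along each axis.
        cx, cy, cz = rng.choices(centers, weights=weights, k=1)[0]
        cloud.append(
            (rng.gauss(cx, sigma), rng.gauss(cy, sigma), rng.gauss(cz, sigma))
        )
    return cloud


# Toy "conformer": two atom centers with illustrative relative weights.
centers = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
weights = [8.0, 6.0]
cloud = sample_point_cloud(centers, weights, sigma=0.3, num_points=512)
```

Repeating this for each conformer of a molecule, with the conformer's own atom coordinates, yields one point cloud per conformer.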
For illustration, examples of point clouds generated from conformers of different molecules are shown below.
[Figure: point clouds sampled from conformers of three example molecules, captioned by SMILES: CCC(O)c1ccccc1, CC(C)(C)c1ccc(O)cc1, and CC(C)(C)c1ccc(O)cc1.]
The above illustration is generated from a small subset of the full dataset. Similar visualizations can be produced using the illustrator.ipynb notebook provided in this repository. The notebook operates on the sample dataset located in sample_shard_data. We provide this sample so that users can better understand the storage format and structure of the point cloud data used throughout the ConfDENSE framework.
**Note:** The complete dataset is substantially larger than the sample data included in this repository. Researchers interested in obtaining the full dataset may contact the corresponding author at pinaki@uk.hert.edu.
The outputs produced by the PointNet component are stored in the conf_data directory as:
- train_predictions.npz
- valid_predictions.npz
- test_predictions.npz
These files contain the conformer-level predictions that are subsequently used to train the Aggregator model.
The Aggregator architecture consists of three main components:
- Index Encoding – analogous to the positional encodings used in standard attention mechanisms, allowing the model to distinguish between different conformers.
- Set2Set Pooling – used to aggregate information across the set of conformer predictions.
- Multi-Layer Perceptron (MLP) – used to produce the final molecular odor prediction, followed by a sigmoid activation layer for multi-label classification.
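The three components can be illustrated with a heavily simplified, dependency-free sketch. It makes several assumptions that may differ from `ConformerAggregator`: the index encoding is the fixed sinusoidal form used for Transformer positions, the pooling is a single attention read with a fixed query (the core step of Set2Set, without its LSTM controller or repeated processing steps), and the MLP is reduced to one linear layer. All function and parameter names are hypothetical.

```python
import math


def index_encoding(index, dim):
    """Sinusoidal encoding of a conformer index, as used for Transformer positions."""
    enc = []
    for k in range(dim):
        angle = index / (10000 ** (2 * (k // 2) / dim))
        enc.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return enc


def softmax(scores):
    peak = max(scores)  # Subtract the maximum for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(items, query):
    """One attention read: softmax(query . item) weighted sum over the set."""
    weights = softmax([sum(q * x for q, x in zip(query, item)) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return pooled, weights


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aggregate(conformer_predictions, query, head_weights, head_bias):
    """Index-encode, pool, then apply a linear head with sigmoid outputs."""
    dim = len(conformer_predictions[0])
    encoded = [
        [p + e for p, e in zip(pred, index_encoding(i, dim))]
        for i, pred in enumerate(conformer_predictions)
    ]
    pooled, weights = attention_pool(encoded, query)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(head_weights, head_bias)
    ]
    return [sigmoid(z) for z in logits], weights


# Toy example: three conformers, two odor labels, identity linear head.
conformer_predictions = [[0.1, 0.8], [0.2, 0.7], [0.9, 0.1]]
probabilities, attention = aggregate(
    conformer_predictions,
    query=[1.0, 0.0],
    head_weights=[[1.0, 0.0], [0.0, 1.0]],
    head_bias=[0.0, 0.0],
)
```

The sigmoid produces an independent probability per label, which is what a multi-label task needs; a softmax over labels would instead force the labels to compete.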
The implementation of the Aggregator, ConformerAggregator, can be found in:
utils/AggregatorClasses.py
Training is performed using the .npz files described above. The complete training and evaluation pipeline is provided in the notebook:
analysis_and_kde.ipynb
By default, the notebook loads the pretrained Aggregator weights used in the paper. Users who wish to retrain the Aggregator from scratch can uncomment the following code in the corresponding block:
```python
# Train the model
# trained_model = train_model(
#     model,
#     train_loader,
#     counts_pos=counts_pos,
#     valid_loader=valid_loader,
#     learning_rate=CONFIG['learning_rate'],
#     num_epochs=CONFIG['num_epochs'],
#     gamma=CONFIG['gamma'],
#     step_size=CONFIG['step_size'],
#     patience=100
# )

# Save trained model
# trained_model.eval()
# torch.save(
#     trained_model.state_dict(),
#     "conformer_model_weights.pth"
# )
```

Users interested in modifying the training configuration can edit the CONFIG.yaml file. Each configuration parameter is documented within the file itself.
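The argument names in the commented call (`gamma`, `step_size`, `patience`) suggest a step-wise learning-rate decay combined with early stopping on the validation set. That reading is an assumption; `train_model` itself is not shown here. Under that assumption, the two mechanisms behave roughly as in this sketch, where `step_lr` and `EarlyStopping` are hypothetical helpers written for the example.

```python
def step_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate after multiplying by `gamma` once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, valid_loss):
        if valid_loss < self.best_loss:
            self.best_loss = valid_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy run: made-up validation losses stand in for real training epochs.
stopper = EarlyStopping(patience=2)
valid_losses = [0.9, 0.7, 0.71, 0.72, 0.6]
stopped_at = None
for epoch, loss in enumerate(valid_losses):
    lr = step_lr(initial_lr=1e-3, gamma=0.5, step_size=2, epoch=epoch)
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

With `patience=100`, as in the commented call, training would only stop early after 100 consecutive epochs without a new best validation loss.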
For hyperparameter search and experimentation, refer to:
hyper_aggregator.py
This script contains the code used for exploring different Aggregator configurations and training settings.
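For readers unfamiliar with hyperparameter search, the sketch below shows a plain grid search over a small configuration space. It is a generic illustration and does not reproduce the strategy in hyper_aggregator.py; the search space values, `toy_evaluate`, and the other names are hypothetical.

```python
import itertools


def grid(search_space):
    """Yield one configuration dict per combination of candidate values."""
    names = list(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def run_search(search_space, evaluate):
    """Return the configuration with the highest score from `evaluate`."""
    best_config, best_score = None, float("-inf")
    for config in grid(search_space):
        score = evaluate(config)
        if score > best_score:  # Ties keep the earliest configuration.
            best_config, best_score = config, score
    return best_config, best_score


search_space = {
    "learning_rate": [1e-2, 1e-3],
    "gamma": [0.5, 0.9],
    "step_size": [10, 20],
}


def toy_evaluate(config):
    # Stand-in for "train the Aggregator and return a validation metric".
    return -abs(config["learning_rate"] - 1e-3) - abs(config["gamma"] - 0.9)


best_config, best_score = run_search(search_space, toy_evaluate)
```

In practice, `evaluate` would train a model with the given configuration and return a validation metric, so the cost of the search grows with the product of the candidate list lengths.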
The conformer analysis pipeline is also provided in the analysis_and_kde.ipynb notebook.
The primary function responsible for performing conformer-level analysis is:
analyze_conformer_key(key, model)

Here, key corresponds to the value stored in the Index_ column of the GS-LF dataset located in the conf_data directory. This identifier uniquely specifies a molecule and allows the analysis pipeline to retrieve all associated conformers and predictions.
The function evaluates the conformer-level predictions produced by the PointNet model and compares them using the similarity-based approaches described earlier in this README. This enables the identification and ranking of conformers that are most representative of a molecule's odor profile.
The resulting analyses and their implications are discussed in detail in the paper.
| Name | Affiliation |
|---|---|
| Sarabeshwar Balaji | Indian Institute of Science Education and Research Bhopal (IISER Bhopal), India |
| Mrityunjay Sharma | CSIR-CSIO, Chandigarh, India |
| Aryan Amit Barsainyan | National Institute of Technology Karnataka Surathkal, Karnataka, India |
| Pinaki Saha | University of Hertfordshire, UH Biocomputation Group, United Kingdom |
| Ritesh Kumar | CSIR-CSIO, Chandigarh, India |
| Volker Steuber | University of Hertfordshire, UH Biocomputation Group, United Kingdom |
| Michael Schmuker | Helmholtz-Gemeinschaft, Berlin, Germany |
To cite this work, please use the following BibTeX entry:

```bibtex
@article{saha2026confdense,
  title={ConfDENSE: A conformer aware electron density based machine learning paradigm for navigating the odorant landscape},
  author={Saha, Pinaki and Balaji, Sarabeshwar and Sharma, Mrityunjay and Barsainyan, Aryan Amit and Kumar, Ritesh and Steuber, Volker and Schmuker, Michael},
  year={2026},
  publisher={ChemRxiv}
}
```

