Introduction
Welcome to the final lab of NANOx81 - Data Science in Materials Science. Unlike previous labs, this lab is deliberately more open-ended. You are encouraged to exercise your creativity in finding the best solutions.
Assessment criteria
Try to complete all questions, doing everything in your Jupyter notebook. Make generous use of code cells, text cells, etc. and write your notebook as though it is a lab report but with Python code incorporated. The easier you make it for your instructors to find the answers, the better.
At the end of the lab, please submit the NANOx81-lab3-<first_name>-<last_name>.ipynb file.
Just a reminder on our assessment criteria:
- Model performance: 30%
- Materials Science Insights: 30%
- Data Science Technique: 30%
- Programming Style: 10%
You should ensure that any notebook you submit for your labs can be executed completely without errors. The easiest way to do this is to do a “Restart and Run All” from the notebook, which will execute all cells in your Jupyter notebook.
Lab
Now that we are at the Final Lab, hints will be minimal. It is expected that your ML models are properly trained using best practices, your notebooks are clearly annotated and your code is Pythonic. All these will be accounted for in your final score.
Problem
In 2025, the Materials Virtual Lab, in collaboration with the Materials Project, released MatPES, a foundational potential energy surface dataset for materials. This dataset comprises ~400,000 structures computed with density functional theory using both the PBE and r2SCAN functionals. In this lab, we will use only the r2SCAN dataset. For foundation potentials, one of the key properties of interest is the cohesive_energy_per_atom, E_c.
A separate test set has already been created. You do not need to create a test set. Answer the following questions (points in brackets are indicative score allocations):
- (2 points) Download the data files from Kaggle. Create a Pandas DataFrame with the training data.
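As a starting point, a minimal sketch of loading the training data into a DataFrame. The column names and values below are hypothetical stand-ins for the Kaggle download — check the actual files for the real schema:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the Kaggle training file; the real
# file will have its own columns (at minimum an ID and cohesive_energy_per_atom).
csv_text = """matpes_id,formula,cohesive_energy_per_atom
matpes-0001,Fe2O3,-6.5
matpes-0002,NaCl,-3.2
"""

# In the lab, replace the StringIO with the downloaded file path, e.g.
# train = pd.read_csv("train.csv")
train = pd.read_csv(io.StringIO(csv_text))
print(train.shape)   # (n_rows, n_columns)
print(train.dtypes)
```

A quick `train.describe()` and `train.isna().sum()` at this stage is a cheap sanity check before any modeling.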
- (3 points) Plot histograms of E_c of the training data.
- (30 points) The instructions differ here for the undergraduate (NANO181) and graduate (NANO281) students:
  - NANO181: The MatPES dataset contains multiple entries with the same formula. Filter your data to keep only the lowest E_c for each unique formula. How many unique formulas are there? Generate composition-based features (similar to what you did in lab 2). While you can reuse the same element_properties.csv file from lab 2, you should try to get more features from pymatgen’s Element object. The features are documented here. Think carefully about what kind of features you should be using and what kind of processing you can apply (min, max, weighted average, etc.). Note that simply copying and pasting what you have from lab 2 will not result in good model performance or a good grade.
  - NANO281: Do not filter to keep the lowest E_c data for each formula. For the graduate version of the problem, you will use structural information as well as composition features. Generate composition-based and structure-based features. In addition to the NANO181 instructions above on compositional features, you will be required to extract structural features from the original MatPES dataset. Simple structural features may include the density, volume per atom, etc., but much more sophisticated structural features can be constructed based on your imagination and domain knowledge. The entire dataset is available on the MatPES website. You should download the r2SCAN file, which contains a gzipped JSON file with the structures.
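The filtering and composition-based featurization above can be sketched as follows. The toy data are made up, and the tiny property table (Pauling electronegativities) is shown only to illustrate the aggregation pattern — in the lab you would pull many more properties from element_properties.csv or pymatgen's Element, and parse formulas with pymatgen's Composition rather than hand-coding them:

```python
import pandas as pd

# Toy training data; the real data comes from the Kaggle download.
df = pd.DataFrame({
    "formula": ["NaCl", "NaCl", "MgO"],
    "cohesive_energy_per_atom": [-3.2, -3.0, -5.1],
})

# NANO181: keep only the lowest-E_c entry per formula via groupby + idxmin.
lowest = df.loc[df.groupby("formula")["cohesive_energy_per_atom"].idxmin()]
print(len(lowest["formula"].unique()))  # number of unique formulas

# Illustrative per-element property table (Pauling electronegativity X);
# in practice, build a much richer table of elemental properties.
props = {"Na": {"X": 0.93}, "Cl": {"X": 3.16}, "Mg": {"X": 1.31}, "O": {"X": 3.44}}

def weighted_avg_X(composition):
    """Atomic-fraction-weighted average of a property (here X).

    `composition` maps element symbol -> number of atoms, e.g. {"Na": 1, "Cl": 1}.
    min/max aggregations over the same table work analogously.
    """
    n = sum(composition.values())
    return sum(props[el]["X"] * amt / n for el, amt in composition.items())

print(weighted_avg_X({"Na": 1, "Cl": 1}))
```

The same pattern (one aggregation function per statistic, applied over each elemental property) generates the full feature matrix.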
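For the NANO281 structural features, a minimal sketch of reading a gzipped JSON file and computing a simple feature such as volume per atom. The record below is a made-up stand-in — check the actual MatPES r2SCAN JSON schema, and note that pymatgen's Structure.from_dict is the more robust route to structural quantities like density:

```python
import gzip
import json
import os
import tempfile

# Made-up minimal record; the real MatPES r2SCAN JSON has its own schema.
records = [{"matpes_id": "matpes-0001", "volume": 40.0, "nsites": 2}]

# Write and re-read a .json.gz file the same way you would read the download.
path = os.path.join(tempfile.mkdtemp(), "MatPES-r2SCAN.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(records, f)

with gzip.open(path, "rt", encoding="utf-8") as f:
    data = json.load(f)

# Simple structural feature: volume per atom (Angstrom^3 / atom).
vol_per_atom = [r["volume"] / r["nsites"] for r in data]
print(vol_per_atom)
```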
- (20 points) Train a simple linear-type model with shrinkage/regularization and/or feature transformations to predict E_c. Upload the predictions of your best model to the Kaggle site. Look at the file called nanox81_sample_submission.csv to understand the format of the file that needs to be submitted. You just need two columns: matpes_id and cohesive_energy_per_atom. You can make multiple uploads over the course of the lab. Report the training, validation and test MSE for your best model.
- (25 points) Train a tree-based or neural network model to predict E_c. Upload the predictions of your best model to Kaggle. Report the training, validation and test MSE for your best model.
- (20 points) Perform analyses of which features are important to the prediction of E_c for your best linear/tree/neural network model. Interpret the results using your materials science knowledge. Discuss which model performs better and why.
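The tree-based model and feature-importance questions can be sketched with scikit-learn. The synthetic data below are purely illustrative (feature 0 carries most of the signal by construction); in the lab you would use your featurized MatPES data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for featurized data: only features 0 and 1 matter.
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("val MSE:  ", mean_squared_error(y_val, model.predict(X_val)))

# Impurity-based importances; sklearn.inspection.permutation_importance is a
# good cross-check, since impurity importances can be biased.
print("importances:", model.feature_importances_)
```

Mapping the largest importances back to named elemental/structural features is where the materials science interpretation comes in.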
Expectations
- We expect to see you apply all that you have learnt in NANOx81. That includes proper use of cross-validation for model selection and hyperparameter tuning for each of your models (plots showing loss function/score against hyperparameters are expected). Note that we can only grade based on what we see in your submitted notebook. Simply sending in your models with no evidence of all these efforts will not result in a good grade, regardless of how well the models perform.
- We expect at least one Kaggle prediction upload per week from all students: one in week 1 and another in week 2. Uploading more frequently, e.g., at least once a day, is highly encouraged.
- Model performance is 30% of your grade. Since this is the final lab in a data science class, we will use statistical approaches to assign this grade. We will take the average MSE of all NANO181 models and use it to assign a score for this aspect.
- Since this is the final lab, we are expecting well-written code with clear documentation at every step. Explain your thinking behind every single thing you do. Use Jupyter’s markdown cells to present your thought processes. Make plots to show key analyses.
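The cross-validation and hyperparameter-tuning expectation above can be sketched as a cross-validated sweep over the Ridge regularization strength. The data are synthetic placeholders; in your notebook, the (alpha, CV MSE) pairs printed here are exactly what the expected score-vs-hyperparameter plot should show (e.g. with plt.semilogx):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in features/targets (illustrative only).
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=150)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
mean_mse = []
for alpha in alphas:
    # 5-fold cross-validated MSE for this regularization strength.
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y, cv=5, scoring="neg_mean_squared_error"
    )
    mean_mse.append(-scores.mean())

for a, m in zip(alphas, mean_mse):
    print(f"alpha={a:>6}: CV MSE={m:.4f}")

best_alpha = alphas[int(np.argmin(mean_mse))]
print("best alpha:", best_alpha)
```

After selecting best_alpha by cross-validation, refit on the full training split before generating test predictions.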