Introduction
Welcome to Lab 2 of NANOx81 - Data Science in Materials Science. By now, you should have a basic knowledge of the Python data science stack and know how to manipulate materials data. In this lab, it is time to put that knowledge to use. We will solve real materials research problems, using both theoretically computed and experimental data, with data science techniques.
Getting started
If you have not already done so, please follow the setup instructions to set up your computer. You may alternatively use Google Colab to do this lab, in which case you can skip directly to step 3.
1. Activate your NANOx81 virtual environment:
conda activate NANOx81
2. Start a Jupyter notebook:
jupyter notebook
3. Create a Python 3 notebook and rename it NANOx81-lab2-<first_name>-<last_name>.
Assessment criteria
Try to complete all questions, doing everything in your Jupyter notebook. Make generous use of code cells, text cells, etc. and write your notebook as though it is a lab report but with Python code incorporated. The easier you make it for your instructors to find the answers, the better.
At the end of the lab, please submit the NANOx81-lab2-<first_name>-<last_name>.ipynb file via Canvas.
Just a reminder on our assessment criteria:
- Model performance: 30%
- Materials Science Insights: 30%
- Data Science Technique: 30%
- Programming Style: 10%
You should ensure that any notebook you submit for your labs can be executed completely without errors. The easiest way to do this is to do a “Restart and Run All” from the notebook, which will execute all cells in your Jupyter notebook.
Lab
Download data2022.csv, which contains some data for all binary bromides and iodides in the Materials Project.
Q1 - Exploratory data analysis (7 points)
Load data2022.csv into the variable orig_data using pandas.read_csv with the na_filter=False option, and perform the following analysis.
- How many materials are there in this dataset? (1 point)
- How many elements are there in this data set? (1 point)
- How many unique formulae are there? (1 point)
- Count the number of materials where each element is present. Sort this count. Create a barplot showing the number of materials with the top 10 most common elements in this data set. (4 points)
Hint: When dealing with formulae, you may use pymatgen.core.Composition to speed up the process. For example, the following code snippet shows the use of Composition to process a formula. For more usage examples, visit https://matgenb.materialsvirtuallab.org/2013/01/01/Basic-functionality.html
from pymatgen.core import Composition
comp = Composition('Al2O3')
print(comp.elements) # this will give you the elements
print(comp.to_data_dict['unit_cell_composition']) # this will give you the element-stoichiometry dictionary.
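For instance, here is a minimal sketch of the element-counting and barplot in Q1.4, assuming the formula column in orig_data is named formula (an assumption; check orig_data.columns for the actual name):

from collections import Counter

import matplotlib.pyplot as plt
from pymatgen.core import Composition

# Count the number of materials in which each element appears.
# Note: "formula" is an assumed column name; adjust to match the actual file.
element_counts = Counter()
for formula in orig_data["formula"]:
    for el in Composition(formula).elements:
        element_counts[str(el)] += 1

# Keep the 10 most common elements and plot them.
elements, counts = zip(*element_counts.most_common(10))
plt.bar(elements, counts)
plt.xlabel("Element")
plt.ylabel("Number of materials")
plt.show()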
Q2 - Data cleaning and feature computations (24 points)
About 80% of the effort in ML modeling is in data processing. The goal is to develop ML models that predict the formation energy per atom and band gap of a material from its formula. To do that, we will first convert each formula into a numeric vector (descriptor) for model input. Note that after each filtering step below, you should use the filtered data for all subsequent steps.
- From Q1.3, we note that the number of materials is greater than the number of formulae, i.e., there can be more than one polymorph per formula. We do not expect compositional features to be able to predict multiple values of the same property for the same formula. Filter the data to remove duplicate formulae, keeping only the row with the lowest formation energy per atom for each formula. How many materials are left? (2 points)
- Positive formation energies are often a sign that a calculation is poorly converged. Filter the data to remove rows with positive formation energies as well. How many materials are left? (2 points)
- Download and load the element property data file into the variable element_data using pandas.read_csv with index_col=0. How many NaN (Not a Number) values are there in each column? (1 point)
- Compute the mean value of each column, ignoring the NaNs. For each column, fill the NaNs with the mean value of that column. This is a common data imputation technique. (2 points)
- Compute the composition-averaged AtomicRadius for all materials and store the results in the variable atomic_radius. For example, the composition-averaged AtomicRadius for Li2O can be computed as (2 * 1.45 + 0.6) / 3, where 1.45 is the AtomicRadius for Li and 0.6 is the AtomicRadius for O. Hint: Read the pandas documentation on indexing. E.g., element_data.loc["Fe"]["AtomicRadius"] gives you the atomic radius of Fe. (5 points)
- Compute the composition-averaged properties for all properties in element_data and for all materials. Store the results in the variable average_properties, which should have dimensions (n, 11), where n is the number of materials and 11 is the number of properties. (5 points)
- Similarly, compute the maximum and minimum of each property for all materials, and store them in the variables max_properties and min_properties, respectively. Both should have dimensions (n, 11). (5 points)
- Concatenate average_properties, max_properties and min_properties, and store the result in the variable design_matrix, which should have dimensions (n, 33). (2 points) A sketch of one way to carry out these filtering and feature computation steps is given after this list.
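As referenced above, here is a minimal sketch of the filtering and feature computations. The column name formula is an assumption (check your DataFrame's columns for the actual name); formation_energy_per_atom is the target column referenced in Q3:

import numpy as np
from pymatgen.core import Composition

# Keep only the lowest-formation-energy polymorph for each formula,
# then drop rows with positive formation energies.
# Note: "formula" is an assumed column name; adjust to match the actual file.
data = orig_data.sort_values("formation_energy_per_atom").drop_duplicates("formula")
data = data[data["formation_energy_per_atom"] <= 0]

# Impute NaNs in element_data with the column means.
element_data = element_data.fillna(element_data.mean())

avg_rows, max_rows, min_rows = [], [], []
for formula in data["formula"]:
    comp = Composition(formula)
    rows = element_data.loc[[str(el) for el in comp.elements]].values  # (k, 11)
    fracs = np.array([comp.get_atomic_fraction(el) for el in comp.elements])
    avg_rows.append(fracs @ rows)      # composition-weighted average
    max_rows.append(rows.max(axis=0))  # element-wise maximum
    min_rows.append(rows.min(axis=0))  # element-wise minimum

average_properties = np.array(avg_rows)  # (n, 11)
max_properties = np.array(max_rows)      # (n, 11)
min_properties = np.array(min_rows)      # (n, 11)
design_matrix = np.hstack([average_properties, max_properties, min_properties])  # (n, 33)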
Q3 - Regression and classification modeling (39 points)
We are going to use band_gap and formation_energy_per_atom in data as the targets. To make sure the results are reproducible, set random_state=42 in all cases where random sampling is involved, e.g., train_test_split, shuffling, etc.
- Split the data (design_matrix as X, and targets as y) into training and test sets in the ratio 90%:10%. Store the training data in the variables train_X and train_y and the test data in the variables test_X and test_y. (2 points)
- Compute the mean and standard deviation of the columns in train_X. Both should be length-33 vectors. Use them to normalize train_X and test_X so that each column has a mean of 0 and a standard deviation of 1. Store the normalized design matrices in norm_train_X and norm_test_X. (4 points) A sketch of one way to do this is given below.
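A minimal sketch of the split and normalization, assuming targets is the (n, 2) array of the two target columns (how you construct it is up to you). Note that only training-set statistics are used in the normalization, to avoid leaking test-set information:

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    design_matrix, targets, test_size=0.1, random_state=42
)

# Normalize using the training-set mean and standard deviation only.
mean = train_X.mean(axis=0)  # length-33 vector
std = train_X.std(axis=0)    # length-33 vector
norm_train_X = (train_X - mean) / std
norm_test_X = (test_X - mean) / std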
Note that from here on, all model training and validation should be done with the training split. The test split is used only to calculate the final model performance assessment in part 5; it is not used for any other purpose. You should use proper ML best practices, such as cross-validation, in fitting your models. In all cases, the loss function you use should be the mean squared error. You need to figure out where and how to specify the loss function in the training.
- Train a simple linear regression model to predict formation_energy_per_atom. What is the CV score of your model? (4 points)
- Train a Ridge regression model and a LASSO regression model for formation_energy_per_atom. You need to search for an optimal value of alpha for each model. To help you, try the following ranges of alpha: Ridge (0.1-10), LASSO (0.0001-0.01). You have to figure out how best to sample the range of alphas: too dense a sampling will result in very slow searches, and too sparse a sampling will result in non-optimal models. What are the CV scores of your best Ridge and LASSO models? (10 points) A sketch of one possible search is given after this list.
- What are the test MAE and RMSE of the best model (among all the models you have fitted so far)? (2 points)
- Which features do not contribute to the LASSO prediction? (4 points)
- Let's define band_gap < 0.001 as metallic and band_gap >= 0.001 as nonmetallic. Construct linear discriminant analysis, quadratic discriminant analysis, and logistic regression models on the training data and report the accuracy of the models on the test data. (11 points)
- What are the problems with using only compositions to predict material properties? (2 points)
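As an illustration for the Ridge and LASSO question, here is a minimal sketch of a cross-validated alpha search for Ridge; the LASSO search is analogous over its own alpha range. The grid density, the fold count, and the name train_y_fe (the formation-energy column of train_y) are illustrative assumptions, not prescribed values. Note that Ridge and LASSO minimize the mean squared error by construction:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Log-spaced grid over the suggested Ridge range (0.1-10); the density is a choice.
param_grid = {"alpha": np.logspace(-1, 1, 21)}
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(norm_train_X, train_y_fe)  # train_y_fe: assumed name for the formation-energy targets
print(search.best_params_, -search.best_score_)  # best alpha and its CV MSE

For the classification question, the relevant scikit-learn classes are LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis in sklearn.discriminant_analysis, and LogisticRegression in sklearn.linear_model.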
Q4 - Clustering (30 points)
In this problem, we will be looking at catalyst clusters. The image file catalyst.png below is extracted from a figure shared on figshare by Gomez-Bolivar et al. (Front. Microbiol., 20 June 2019, DOI: 10.3389/fmicb.2019.01276). It is an energy dispersive X-ray (EDX) microanalysis of Pd/Ru bimetallic nanoparticle catalysts synthesized by Escherichia coli. For this whole exercise, it is recommended that you use the hot colormap in matplotlib.
- Read in the image as a numpy array using the imread function in matplotlib.pyplot. Show the image in your Jupyter notebook using imshow. What are the dimensions of the array? (1 point)
- Plot the distribution of the values in the numpy array representing the image (Hint: you need to flatten the array first). Note that for png images, the values in the numpy array are between 0 and 1 and represent the intensity levels. (1 point)
- Measured images have a variety of levels. Sometimes we want to label each pixel with one of a set of pre-specified levels, e.g., 0 representing the background and fixed values representing certain features. This is known as vector quantization. Here, we will quantize the image using K-means. We know for a fact that there are two elements (Pd and Ru) in the system. Using K-means, quantize the image such that there are three levels: 0 = background, and 1 and 2 = Pd or Ru. Ensure that 0 corresponds to the background (this should be the cluster with the largest number of data points) and that the non-zero levels correspond to the elements. Plot the quantized image (Hint: you may need to reshape your predicted array back to the original image dimensions), which should look like a slightly modified version of the image in Q4.1. (6 points) A sketch of this quantization step is given after this list.
- For the purposes of this exercise, we will not attempt to distinguish between the elements. Any value in the numpy array that is > 0 is considered a catalyst particle. Use K-means clustering to identify clusters of metal particles (you will need to figure out what a good value of K is). Hint: the way to do this is to consider the 2D coordinates of the non-zero data points and use them in your K-means fitting. An example is shown in your lecture notebooks on unsupervised learning. After you have obtained your cluster labels, you will need to remap your coordinate-label pairs into pixel values. Plot your clustered image, ensuring that each cluster has a different color. Comment on how you chose your value of K. (10 points)
- Finally, we will use a density-based clustering method called DBSCAN. As in part 4, any value in the numpy array that is > 0 is considered a catalyst particle. Use DBSCAN clustering to identify clusters of metal particles (you will need to figure out what a good value of eps is). Plot your clustered image, ensuring that each cluster has a different color. Comment on how you chose your value of eps. (10 points)
- Discuss the differences between the K-means and DBSCAN results, and which method is more appropriate for the purpose we are using it for. (2 points)
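As referenced in Q4.3, here is a minimal sketch of the vector quantization step, assuming catalyst.png is in your working directory. The remapping of cluster labels to the 0/1/2 level convention is shown; deciding which non-zero level is which element is left to you:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

img = plt.imread("catalyst.png")  # png values are floats in [0, 1]
# Treat each pixel as one sample (grayscale -> 1 feature, RGB(A) -> 3-4 features).
pixels = img.reshape(-1, 1) if img.ndim == 2 else img.reshape(-1, img.shape[-1])

# Quantize the pixel values into 3 levels with K-means.
km = KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(pixels)

# Relabel so that the largest cluster (the background) becomes level 0.
order = np.argsort(-np.bincount(labels))  # cluster indices, largest first
levels = np.zeros_like(labels)
for new_level, cluster in enumerate(order):
    levels[labels == cluster] = new_level

plt.imshow(levels.reshape(img.shape[:2]), cmap="hot")
plt.show()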