The data are normalized and columns describing the composition of the molecules are added. For more information, see the initialization code of the MoleculeSklearnDataset class.
%% Cell type:code id: tags:
``` python
path="../data/data.csv"
dm=MoleculeSklearnDataset(path)
```
%% Output
/home/moi/.local/lib/python3.8/site-packages/pandas/core/nanops.py:892: RuntimeWarning: overflow encountered in square
sqr = _ensure_numeric((avg - values) ** 2)
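%% Cell type:markdown id: tags:
As a purely hypothetical illustration (the actual initialization code of MoleculeSklearnDataset may differ), composition columns could be derived from the SMILES strings with RDKit along these lines:
%% Cell type:code id: tags:
``` python
# Hypothetical sketch: count the atoms of each element in a SMILES string.
# The real MoleculeSklearnDataset implementation may proceed differently.
from collections import Counter

from rdkit import Chem  # assumes RDKit is installed

def atom_composition(smiles):
    """Return a dict mapping element symbol -> number of atoms."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))  # make hydrogens explicit
    return dict(Counter(atom.GetSymbol() for atom in mol.GetAtoms()))

atom_composition("CCO")  # ethanol -> {'C': 2, 'O': 1, 'H': 6}
```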
%% Cell type:markdown id: tags:
### Getting the different data sets
%% Cell type:markdown id: tags:
We create three data sets:

- one that does not contain the SMILES data of the molecules
- one that contains only the SMILES data of the molecules
- one that contains all the data
%% Cell type:code id: tags:
``` python
X_no_smiles = dm.getDataNoSmiles()
X_only_smiles = dm.getDataWithOnlySmiles()
X_with_smiles = dm.getDataWithSmiles()
Xs = [X_no_smiles, X_only_smiles, X_with_smiles]
data_model = ["X without SMILES", "X with only SMILES", "X with SMILES"]
y = dm.getY()
```
%% Cell type:markdown id: tags:
### Cross-validation
%% Cell type:markdown id: tags:
We create a function that takes a data set and an integer k as input and splits the set into k folds. It then performs cross-validation on these folds and returns the linear regression model with the best (lowest) MSE.
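A minimal sketch of what such a function could look like is shown below; it assumes scikit-learn's KFold, LinearRegression and mean_squared_error, and the function name and signature are illustrative rather than the project's actual implementation.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validate_linear_regression(X, y, k):
    """Split the data into k folds, fit one LinearRegression per fold
    and return the fitted model with the lowest validation MSE."""
    X, y = np.asarray(X), np.asarray(y)
    best_model, best_mse = None, float("inf")
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
        if mse < best_mse:
            best_model, best_mse = model, mse
    return best_model, best_mse
```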
Linear regression tries to find a linear relationship between two variables by fitting a linear equation.
If such a relationship exists, the equation allows us to estimate the desired variable from the other.
The general linear equation is as follows: \[ Y = w^{T}X + b \]
In order to find the coefficients $w$ and the intercept $b$, we must minimize the residual
sum of squares between $Y$ and $w^{T}X$ as follows: \[\min_{w}\sum_{i=1}^{n}(y_{i}- w^{T}x_{i})^{2}\]
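Setting the gradient of this sum with respect to $w$ to zero (with the intercept $b$ absorbed into $w$ by appending a constant column to $X$) yields the closed-form solution \[ \hat{w} = (X^{T}X)^{-1}X^{T}y \] provided that $X^{T}X$ is invertible.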
This approach to linear regression is called Ordinary Least Squares (OLS).
For this model, we used the Sklearn library, which provides an implementation of linear regression through the LinearRegression class.
We also used Sklearn's KFold class to perform the cross-validation: it splits the data set into several subsets (folds) on which the cross-validation is run.
Before that, we split the data set into two parts so as to keep an unused test set on which to evaluate the best model found during the cross-validation.
\subsection{Support Vector Regression}
Support Vector Regression (SVR) is an adaptation of the Support Vector Machine (SVM) classification model to regression.
In order to understand the SVR model, one must first understand the SVM model.
SVM is a model for binary classification: it looks for the hyperplane that best separates the data, i.e. the separating hyperplane with the largest margin, while its constraints ensure that every data point is classified correctly.
It can be shown that maximizing the size of the margin is equivalent to minimizing the norm of the hyperplane's normal vector.
More formally, SVM consists in solving the following minimization problem:
$$\min_{w}\frac{1}{2}\|w\|^{2}$$
with the constraints:
$$ y_{i}(\vec{w}\cdot\vec{x_{i}}- b)\geq1$$
where:\\
$y_{i}\in\{-1,1\}$ is the binary label of the $i^{th}$ data point\\
$\vec{w}$ is the normal vector of the hyperplane\\
$\vec{x_{i}}$ is the $i^{th}$ data point\\
$b$ is the intercept\\
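Indeed, the distance between the two margin hyperplanes $\vec{w}\cdot\vec{x}-b=1$ and $\vec{w}\cdot\vec{x}-b=-1$ is $\frac{2}{\|w\|}$, which is why maximizing the margin amounts to minimizing $\|w\|$.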
This problem can be expressed using Lagrange multipliers $\lambda_{i}$ as follows:
$$ L( w, b, \lambda)=\frac{1}{2} w^{T} w -\sum_{i=1}^{N}\lambda_{i}(y_{i}(w^{T}x_{i}-b)-1)$$
By setting the derivatives of $L$ with respect to $w$ and $b$ to zero and substituting the results back, this problem can be expressed in its dual form:
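$$ \max_{\lambda}\sum_{i=1}^{N}\lambda_{i}-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_{i}\lambda_{j}y_{i}y_{j}\,x_{i}^{T}x_{j} $$
with the constraints $\lambda_{i}\geq0$ and $\sum_{i=1}^{N}\lambda_{i}y_{i}=0$. In this dual form, the data only appear through the dot products $x_{i}^{T}x_{j}$.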
This dual form allows us to introduce the kernel trick. In its original form, SVM can only classify linearly separable data. However, thanks to the so-called kernel trick,
the model can be adapted to data that are not linearly separable: the data are implicitly mapped into a higher-dimensional space
in which they become linearly separable.
Mathematically, the kernel trick makes it possible to compute the dot product between two transformed vectors without ever
transforming the vectors explicitly, by replacing each dot product $x_{i}^{T}x_{j}$ with a kernel function $K(x_{i},x_{j})$. This turns the previous dual problem into the following:
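$$ \max_{\lambda}\sum_{i=1}^{N}\lambda_{i}-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_{i}\lambda_{j}y_{i}y_{j}\,K(x_{i},x_{j}) $$
with the same constraints $\lambda_{i}\geq0$ and $\sum_{i=1}^{N}\lambda_{i}y_{i}=0$. A common choice is the RBF kernel $K(x_{i},x_{j})=\exp(-\gamma\|x_{i}-x_{j}\|^{2})$, whose implicit feature space is infinite-dimensional.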
Unlike SVM, SVR seeks to predict the continuous values of a dependent variable. Moreover, for SVR the size of the margin is a hyperparameter:
the model performs the regression by trying to minimize the number of data points falling outside the margin as well as their distance to it.
Another hyperparameter weights the importance of the data points falling outside the margin.
The objective function of SVR is as follows: $$\min_{w}\frac{1}{2}\|w\|^{2}+ C \sum_{i=1}^{n}|\xi_{i}| $$
And the constraints of the problem are: $$ |y_{i}- w^{T}x_{i}| \leq\epsilon+ |\xi_{i}| $$
where:\\
$y_{i}$ is the continuous value that we are trying to predict\\
$\vec{w}$ is the normal vector of the hyperplane\\
$\vec{x_{i}}$ is the $i^{th}$ data point\\
$\epsilon$ is the size of the margin\\
$\xi_{i}$ is the distance of the $i^{th}$ data point outside the margin\\
$C$ is the weight given to the distances of the data points falling outside the margin
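Assuming, for illustration, that scikit-learn's SVR implementation is used, the two hyperparameters above correspond to its \texttt{epsilon} and \texttt{C} arguments, as in the following minimal sketch (\texttt{X\_train}, \texttt{y\_train}, \texttt{X\_test} and \texttt{y\_test} are placeholder names):
\begin{verbatim}
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# epsilon: width of the margin (tube); C: weight of points outside it
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)  # placeholder training data
test_mse = mean_squared_error(y_test, svr.predict(X_test))
\end{verbatim}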