1. Implement a linear model

ŷ(w, x) = w0 + w1x1 + ... + wpxp

2. Use of real data

3. Shrinkage

4. Subset selection

5. ElasticNet penalty surface

1.

ŷ(w, x) = w0 + w1x1 + ... + wpxp
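As a minimal sanity check, the model above can be evaluated directly with NumPy; the weights and input values here are made up purely for illustration.

```python
import numpy as np

# Sketch of the linear model y_hat(w, x) = w0 + w1*x1 + ... + wp*xp
# with illustrative (made-up) weights and a single sample.
w0 = 0.5                          # intercept
w = np.array([1.0, -2.0, 0.25])   # coefficients w1..wp
x = np.array([2.0, 1.0, 4.0])     # feature values x1..xp

y_hat = w0 + w @ x                # 0.5 + 2.0 - 2.0 + 1.0 = 1.5
```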

2.

The dataset itself is loaded from the provided URL, but the column descriptors still need to be attached. I load the descriptor file the same way, search it for the @attribute <att-name> <att-type> substring, and name the columns from the matches. This part could, of course, have been done manually as well.
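The descriptor parsing can be sketched with a regular expression; the inline snippet below stands in for the downloaded descriptor file, and the attribute names are illustrative.

```python
import re

# Sketch: extract column names from "@attribute <att-name> <att-type>" lines.
def parse_attribute_names(text):
    """Return the attribute names found in @attribute lines."""
    return re.findall(r"@attribute\s+(\S+)\s+\S+", text)

# Small inline example instead of the real downloaded descriptor file:
sample = """@attribute state numeric
@attribute communityname string
@attribute population numeric"""

columns = parse_attribute_names(sample)
```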

Several columns contain a large number of missing values, and it seems reasonable to drop them. I passed a threshold of half the dataset length to the .dropna method and then checked the remaining rows that still contained NaN values.

Only one row with a NaN remained, so I filled its missing value with the column mean. Moving on to the data types: all columns except one are numeric. That one contains community names, and almost every row has a distinct name (~1.8k unique values out of ~1.9k rows), so I drop that column as well, since one-hot encoding it would not provide any additional information.

I also drop the fold column, since it contains no predictive value and is only meant for reproducible cross-validation splits.
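The cleaning steps above can be sketched on a tiny made-up frame; the column names and values here are stand-ins, and the real notebook applies the same pandas calls to the downloaded dataset.

```python
import numpy as np
import pandas as pd

# Illustrative frame mimicking the situations described in the text.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # mostly NaN -> dropped by threshold
    "numeric_feat":   [1.0, 2.0, np.nan, 4.0],        # one NaN -> fill with column mean
    "communityname":  ["a", "b", "c", "d"],           # near-unique strings -> drop
    "fold":           [1, 2, 3, 4],                   # CV fold id -> drop
})

# Keep only columns with at least half the rows non-missing.
df = df.dropna(axis=1, thresh=len(df) // 2)
# Fill the single remaining NaN with the column mean.
df["numeric_feat"] = df["numeric_feat"].fillna(df["numeric_feat"].mean())
# Drop the free-text name column and the fold column.
df = df.drop(columns=["communityname", "fold"])
```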

With cross-validation for linear regression we can check how much the training and test scores vary across splits.
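A sketch of that check with scikit-learn; synthetic data stands in for the cleaned crime dataset here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the notebook uses the cleaned real dataset instead.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# return_train_score=True lets us compare the spread of train vs. test R^2.
cv = cross_validate(LinearRegression(), X, y, cv=5, return_train_score=True)
train_scores = cv["train_score"]
test_scores = cv["test_score"]
```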

3.

Here I chose alpha = 0.00628, where 8 coefficients are nonzero. Their names are visualized on the plot. Per capita white population, per capita illegal activity, etc. seem to be good descriptors of the predicted value.
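Inspecting which coefficients survive at a given alpha can be sketched as below; the alpha and the synthetic data are illustrative, not the notebook's exact values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Stand-in data with only a few truly informative features.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives many coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
nonzero_idx = np.flatnonzero(lasso.coef_)  # indices of the surviving features
```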

Doing the same for ElasticNet [ EXTRA ]
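The same idea for ElasticNet can be sketched with ElasticNetCV, which cross-validates over both alpha and the L1/L2 mixing ratio; the data and candidate ratios here are again illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, as in the Lasso sketch.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Cross-validate over alpha and a few candidate l1_ratio values.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
best_alpha, best_ratio = enet.alpha_, enet.l1_ratio_
```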

4.

To me the Lasso model seems best, and not just by raw scores: the Ridge model scored worse while ElasticNet scored high, which suggests that Lasso's L1 norm penalty suits this dataset better than Ridge's L2 penalty. This is essentially the result you get in the next exercise as well.

5.

We can see here that our ElasticNet performs best when the weight penalties vanish. :)
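Computing such a score surface over an (alpha, l1_ratio) grid can be sketched as follows; as alpha approaches zero the penalty vanishes and ElasticNet approaches plain least squares. The grid ranges and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Stand-in data; the notebook evaluates the surface on the real dataset.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-4, 0, 5)        # penalty strengths, small -> near-OLS
ratios = np.linspace(0.1, 0.9, 5)     # L1/L2 mixing ratios

# Mean cross-validated R^2 for every (alpha, l1_ratio) pair on the grid.
surface = np.array([
    [cross_val_score(ElasticNet(alpha=a, l1_ratio=r, max_iter=5000),
                     X, y, cv=3).mean()
     for r in ratios]
    for a in alphas
])
```

The resulting 2-D array can then be shown as a heatmap or contour plot of the penalty surface.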