ECE 219

Large-Scale Data Mining: Models and Algorithms

Winter 2025

Project 4: Regression Analysis and Define Your Own Task!

Due on Mar 14, 2025, 11:59 pm

1 Introduction

Regression analysis is a statistical procedure for estimating the relationship between a target variable and a set of features that jointly inform about the target. In this project, we explore specific-to-regression feature engineering methods and model selection that jointly improve the performance of regression. You will conduct different experiments and identify the relative significance of the different options.

2 Datasets

Carry out the steps in Section 3 on either one of the following datasets.

2.1 Dataset 1: Diamond Characteristics

Valentine’s Day might be over, but we are still interested in building a bot to predict the price and characteristics of diamonds. A synthetic diamonds dataset can be downloaded from this link. It contains information about 150,000 round-cut diamonds, each described by 14 variables (features) that specify the properties of the sample. Below we describe some of these features:

• carat: weight of the diamond

• cut: quality of the cut

• clarity: measured diamond clarity

• length: measured length in mm

• width: measured width in mm

• depth: measured depth in mm

• depth percent: diamond’s total height divided by its total width

• table percent: width of top of diamond relative to widest point

• girdle min: refers to the thinnest part of the girdle

• girdle max: refers to the thickest part of the girdle

In addition to these features, there is the target variable, i.e., what we would like to predict:

• price: price in US dollars

2.2 Dataset 2: Wine Quality Dataset

Perhaps your interests lean more towards the nuances of wine tasting than the fascination with diamonds. You can access the dataset through this link. The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. This dataset comprises 4,898 instances for white wine and 1,599 instances for red wine, with 13 features for each instance.

3 Required Steps

In this section, we describe the setup you need to follow. Follow these steps to process either of the datasets in Section 2.

3.1 Before Training

Before training an algorithm, it’s always essential to inspect the data. This provides intuition about the quality and quantity of the data and suggests ideas for extracting features for downstream ML applications. The following sections address these steps.

3.1.1 Handling Categorical Features

A categorical feature is a feature that can take on one of a limited number of possible values. If a dataset contains categorical features, a preprocessing step needs to be carried out to convert the categorical variables into numbers so that they are ready for training.

One method for numerical encoding of categorical features is to assign a scalar to each value. For instance, if we have a “Quality” feature with values {Poor, Fair, Typical, Good, Excellent}, we might replace them with the numbers 1 through 5. If there is no numerical meaning behind the categorical values (e.g. {Cat, Dog}), one has to perform “one-hot encoding” instead.
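The two encodings described above can be sketched as follows. This is a minimal illustration on a made-up toy table; the column names `Quality` and `Animal` are assumptions for the example and are not columns of either project dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "Quality": ["Poor", "Good", "Excellent", "Fair", "Typical"],
    "Animal": ["Cat", "Dog", "Dog", "Cat", "Cat"],
})

# Ordered categories: map each level to its rank (1 through 5).
quality_order = ["Poor", "Fair", "Typical", "Good", "Excellent"]
quality_rank = {level: rank for rank, level in enumerate(quality_order, start=1)}
df["Quality"] = df["Quality"].map(quality_rank)

# Unordered categories: one 0/1 indicator column per level (one-hot encoding).
df = pd.get_dummies(df, columns=["Animal"], dtype=int)

print(df)
```

The scalar mapping keeps the ordering between levels, while the one-hot columns avoid implying any order between `Cat` and `Dog`.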

3.1.2 Data Inspection

The first step for data analysis is to take a close look at the dataset.

• Plot a heatmap of the Pearson correlation matrix of the dataset columns. Report which features have the highest absolute correlation with the target variable. In the context of either dataset, describe what the correlation patterns suggest. Question 1.1

• Plot the histogram of numerical features. What preprocessing can be done if the distribution of a feature has high skewness? Question 1.2

• Construct and inspect the box plot of categorical features vs target variable. What do you find? Question 1.3

• For the Diamonds dataset, plot the counts by color, cut and clarity; or, for the wine quality dataset, plot the histogram of quality scores. Question 1.4
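As a rough sketch of the numerical side of this inspection, the snippet below computes a Pearson correlation matrix, ranks features by absolute correlation with the target, and measures skewness before and after a log transform. The data is synthetic and the column `noise_feature` is hypothetical; the log transform is shown as one common option for strictly positive, right-skewed features, not as the required answer.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; replace with the real dataset columns.
rng = np.random.default_rng(0)
n = 500
carat = rng.lognormal(mean=0.0, sigma=0.5, size=n)   # right-skewed feature
noise_feature = rng.normal(size=n)                   # unrelated to the target
price = 1000.0 * carat + rng.normal(scale=50.0, size=n)
df = pd.DataFrame({"carat": carat, "noise_feature": noise_feature, "price": price})

# Pearson correlation matrix; seaborn.heatmap(corr, annot=True) would plot it.
corr = df.corr(method="pearson")

# Rank features by absolute correlation with the target.
target_corr = corr["price"].drop("price").abs().sort_values(ascending=False)
print(target_corr)

# Sample skewness of one feature before and after a log transform.
skew_before = df["carat"].skew()
skew_after = np.log1p(df["carat"]).skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```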

3.1.3 Standardization

Standardization of datasets is a common requirement for many machine learning estimators; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Standardize feature columns and prepare them for training. Question 2.1
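One way to standardize is scikit-learn's `StandardScaler`. The sketch below uses synthetic features on very different scales and assumes a simple train/test split; fitting the scaler on the training part only is a design choice that keeps test statistics out of preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(5.0, 2.0, 200), rng.normal(3000.0, 800.0, 200)])
X_test = np.column_stack([rng.normal(5.0, 2.0, 50), rng.normal(3000.0, 800.0, 50)])

# Fit the scaler on the training split only, then reuse its statistics
# on the test split so no test information leaks into preprocessing.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```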

3.1.4 Feature Selection

• The sklearn.feature_selection.mutual_info_regression function returns the estimated mutual information between each feature and the label. Mutual information (MI) between two random variables is a non-negative value which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

• The sklearn.feature_selection.f_regression function provides F-scores, which are a way of comparing the significance of the improvement of a model with respect to the addition of new variables.

You **may** use these functions to select features that yield better regression results (especially in the classical models). Describe how this step qualitatively affects the performance of your models in terms of test RMSE. Is it true for all model types? Also list the two features for either dataset that have the lowest MI w.r.t. the target. Question 2.2
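A minimal sketch of calling both scoring functions is shown below. The data is synthetic and the feature names `x0`, `x1`, `x2` are hypothetical: the target depends on `x0` linearly and on `x1` through a square, while `x2` is noise, which illustrates that MI can pick up a non-linear dependence that a univariate linear F-test tends to miss.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic data: linear dependence on x0, quadratic dependence on x1,
# and no dependence on x2. Names are illustrative only.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

# Estimated mutual information between each feature and the target.
mi = mutual_info_regression(X, y, random_state=0)

# F statistics and p-values from univariate linear regression tests.
f_scores, p_values = f_regression(X, y)

for name, m, f in zip(["x0", "x1", "x2"], mi, f_scores):
    print(f"{name}: MI={m:.3f}, F={f:.1f}")
```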

From this point on, you are free to use any combination of features, as long as the performance of the regression model is on par with (or only slightly worse than) that of the Neural Network model.

3.2 Training

Once the data is prepared, we would like to train multiple algorithms and compare their performance using average RMSE from 10-fold cross-validation (please refer to part 3.3).

3.3 Evaluation

Perform 10-fold cross-validation and measure average RMSE errors for training and validation sets.

For the random forest model, measure “Out-of-Bag Error” (OOB) as well.
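The evaluation protocol above could be sketched with scikit-learn's `cross_validate`, which can return training scores alongside validation scores. The data and the choice of `LinearRegression` as the estimator are placeholders for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=300)

# 10-fold CV; scikit-learn reports the negated RMSE so that larger is better.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)

# Flip the sign back and average over the 10 folds.
train_rmse = -scores["train_score"].mean()
val_rmse = -scores["test_score"].mean()
print(f"train RMSE: {train_rmse:.3f}, validation RMSE: {val_rmse:.3f}")
```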

3.3.1 Linear Regression

What is the objective function? Train three models: (a) ordinary least squares (linear regression without regularization), (b) Lasso and (c) Ridge regression, and answer the following questions.

• Explain how each regularization scheme affects the learned parameter set. Question 4.1

• Report your choice of the best regularization scheme along with the optimal penalty parameter and explain how you computed it. Question 4.2

• Does feature standardization play a role in improving the model performance (in the cases with ridge regularization)? Justify your answer. Question 4.3

• Some linear regression packages return p-values for different features. What is the meaning of these p-values and how can you infer the most significant features? A qualitative reasoning is sufficient. Question 4.4
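The three models can be compared along the following lines. This is a sketch on synthetic data in which only two of six features matter; the fixed penalties (`alpha=0.1`, `alpha=10.0`) and the grid of candidate penalties are hypothetical values chosen for illustration, and the helper `make_model` is not part of any library.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data where only the first two of six features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

def make_model(regressor):
    """Standardize inside the pipeline so each CV fold is scaled on its own training part."""
    return Pipeline([("scale", StandardScaler()), ("reg", regressor)])

# (a) OLS, (b) Lasso, (c) Ridge, with fixed hypothetical penalties.
ols = make_model(LinearRegression()).fit(X, y)
lasso = make_model(Lasso(alpha=0.1)).fit(X, y)
ridge = make_model(Ridge(alpha=10.0)).fit(X, y)
for name, model in [("ols", ols), ("lasso", lasso), ("ridge", ridge)]:
    print(name, np.round(model.named_steps["reg"].coef_, 3))

# Choose the ridge penalty by 10-fold CV over a small, hypothetical grid.
grid = {"reg__alpha": np.logspace(-3, 3, 7)}
search = GridSearchCV(make_model(Ridge()), grid, cv=10,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_alpha = search.best_params_["reg__alpha"]
best_rmse = -search.best_score_
print(f"best ridge alpha: {best_alpha:g}, CV RMSE: {best_rmse:.3f}")
```

Printing the coefficient vectors side by side is a simple way to see how each penalty changes the learned parameters.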

3.3.2 Polynomial Regression

Perform polynomial regression by crafting products of the features you selected in part 3.1.4 up to a certain degree (max degree 6) and applying ridge regression on the compound features. You can use the scikit-learn library to build such features. Avoid overfitting by proper regularization. Answer the following:

• What are the most salient features? Why? Question 5.1

• What degree of polynomial is best? How did you find the optimal degree? What does a very high-order polynomial imply about the fit on the training data? What about its performance on testing data? Question 5.2

3.3.3 Neural Network

You will train a multi-layer perceptron (fully connected neural network). You can simply use the sklearn implementation:

• Adjust your network size (number of hidden neurons and depth), and weight decay as regularization. Find a good hyper-parameter set systematically (no more than 20 experiments in total). Question 6.1

• How does the performance generally compare with linear regression? Why? Question 6.2

• What activation function did you use for the output and why? You may use none. Question 6.3

• What is the risk of increasing the depth of the network too far? Question 6.4
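A small systematic search with scikit-learn's `MLPRegressor` might look like the sketch below, where `hidden_layer_sizes` controls width and depth and `alpha` is the L2 penalty that plays the role of weight decay. The data, the grid values, and the use of 3 folds (chosen only to keep the example fast) are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=400)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(max_iter=2000, random_state=0)),
])

# 2 architectures x 3 weight-decay values = 6 experiments,
# well inside a 20-experiment budget.
grid = {
    "mlp__hidden_layer_sizes": [(32,), (32, 32)],
    "mlp__alpha": [1e-4, 1e-2, 1.0],
}
search = GridSearchCV(pipe, grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_rmse = -search.best_score_
print(search.best_params_, f"CV RMSE: {best_rmse:.3f}")

# The fitted regressor records which output activation it uses.
out_activation = search.best_estimator_.named_steps["mlp"].out_activation_
print("output activation:", out_activation)
```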

3.3.4 Random Forest

Train a random forest regression model on your chosen dataset and answer the following:

• Random forests have the following hyper-parameters:

– Maximum number of features;

– Number of trees;

– Depth of each tree;

Explain how these hyper-parameters affect the overall performance. Describe if and how each hyper-parameter results in a regularization effect during training. Question 7.1

• How do random forests create a highly non-linear decision boundary despite the fact that all we do at each layer is apply a threshold on a feature? Question 7.2

• Randomly pick a tree in your random forest model (with maximum depth of 4) and plot its structure. Which feature is selected for branching at the root node? What can you infer about the importance of this feature as opposed to others? Do the important features correspond to what you got in part 3.3.1? Question 7.3

• Measure “Out-of-Bag Error” (OOB). Explain what the OOB error and the R² score mean. Question 7.4
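The random forest steps above could be sketched with scikit-learn's `RandomForestRegressor` as follows. The data is synthetic and the hyper-parameter values are hypothetical. Note that for a regressor, `oob_score_` is an R² value computed on out-of-bag predictions, so reporting `1 - oob_score_` as the OOB error is one convention among several.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which feature 0 dominates the target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 4))
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + rng.normal(scale=0.2, size=500)

forest = RandomForestRegressor(
    n_estimators=200,   # number of trees
    max_features=0.5,   # fraction of features considered at each split
    max_depth=4,        # depth of each tree
    oob_score=True,     # score each sample with trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

# R^2 on out-of-bag predictions.
oob_r2 = forest.oob_score_
print(f"OOB R^2: {oob_r2:.3f}, 1 - OOB R^2: {1.0 - oob_r2:.3f}")

# Pick one tree at random and read off the feature used at its root node.
# sklearn.tree.plot_tree(tree) would draw the full structure.
tree_index = int(rng.integers(len(forest.estimators_)))
tree = forest.estimators_[tree_index]
root_feature = int(tree.tree_.feature[0])
print(f"tree {tree_index} splits on feature {root_feature} at the root")
print("importances:", np.round(forest.feature_importances_, 3))
```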

3.3.5 LightGBM, CatBoost and Bayesian Optimization

Boosted tree methods have shown advantages when dealing with tabular data, and recent advances make these algorithms scalable to large-scale data and enable natural treatment of (high-cardinality) categorical features. Two of the most successful examples are LightGBM and CatBoost.

Both algorithms have many hyperparameters that influence their performance. This results in a large search space of hyperparameters, making the tuning of the hyperparameters hard with naive random search and grid search. Therefore, one may want to utilize “smarter” hyperparameter search schemes. We specifically explore one of them: Bayesian optimization.

In this part, pick either one of the datasets and apply LightGBM OR CatBoost. If you do both, we will only look at the first one.

• Read the documentation of LightGBM OR CatBoost and determine the important hyperparameters along with a search space for the tuning of these parameters (keep the search space small). Question 8.1

• Apply Bayesian optimization using skopt.BayesSearchCV from scikit-optimize to find the ideal hyperparameter combination in your search space. Keep your search space small enough to finish running on a single Google Colab instance within 60 minutes. Report the best hyperparameter set found and the corresponding RMSE. Question 8.2

• Qualitatively interpret the effect of the hyperparameters using the Bayesian optimization results: Which of them helps with performance? Which helps with regularization (shrinks the generalization gap)? Which affects the fitting efficiency? Question 8.3