HW1 (Due date: 11/12).
Upload your answers in a Word document to Canvas with your name in the filename. For code
questions, paste your code into Word along with the corresponding results (print the needed outputs and
figures) and add brief explanations. Alternatively, you may submit a Jupyter notebook with
comments. For hand-calculation problems, you may write the answers in Word, or write them on
paper, take a picture, and paste the picture into Word.
1. Gradient accumulation for a 2-layer neural network with 2-dimensional input (5 points).
x = [11; 5]; W1 = [1.1, 2.1, 4; 0.8, 3.2, 3.3] (a 2x3 matrix); b1 = -1.3; W2 = [1/8; 1/6; 1/7] (a 3x1
matrix); b2 = 0.1. The network computes y = W2^T (W1^T x + b1) + b2. The observation is y* = 20.
Use squared loss L = (y - y*)^2. Calculate:
(a) what is dL/dW1 (hint: you should have a matrix of 6 values here)
(b) with a learning rate of 0.01, if you update the weights, what are the new weights?
(c) after the update, what is the new loss?
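You can check your hand calculation with a short NumPy sketch. The manual backward pass below assumes the linear form given above (no activation function):

```python
import numpy as np

x = np.array([[11.0], [5.0]])               # 2x1 input
W1 = np.array([[1.1, 2.1, 4.0],
               [0.8, 3.2, 3.3]])            # 2x3
b1 = -1.3
W2 = np.array([[1 / 8], [1 / 6], [1 / 7]])  # 3x1
b2 = 0.1
y_star = 20.0

# forward pass: y = W2^T (W1^T x + b1) + b2
h = W1.T @ x + b1                           # 3x1 hidden vector
y = (W2.T @ h + b2).item()
L = (y - y_star) ** 2                       # squared loss

# backward pass by the chain rule:
# dL/dy = 2(y - y*) and dy/dW1[i, j] = x[i] * W2[j],
# so dL/dW1 = 2(y - y*) * x @ W2^T  (a 2x3 matrix, same shape as W1)
dL_dy = 2 * (y - y_star)
dL_dW1 = dL_dy * (x @ W2.T)

# gradient-descent update with learning rate 0.01
W1_new = W1 - 0.01 * dL_dW1
```

A finite-difference check on any single entry of W1 is a quick way to confirm the gradient before trusting the update.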
2. Clustering and dimensionality reduction. From the CAMELS dataset (attributes.csv), we can
extract the following attributes: (i) aridity: annual potential evapotranspiration (PET) divided by
precipitation; (ii) precipitation seasonality index; (iii) fraction of precipitation falling as snow.
(a) Define the distance as Euclidean distance of the above three indices. Run a k-means
clustering of the CAMELS basins. How many clusters should you set? Show the
total_sum_of_squared_distance vs k plot to justify your choice.
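A sketch of the elbow plot for part (a), using scikit-learn's KMeans. The random array below is a stand-in for the three indices read from attributes.csv, so the sketch runs on its own; loading the real columns is left as a comment:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Real data: read the three indices from attributes.csv, e.g. with pandas.
# Synthetic stand-in so the sketch is self-contained:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# standardize so no single index dominates the Euclidean distance
X = StandardScaler().fit_transform(X)

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)         # total sum of squared distances

# Elbow plot to justify the choice of k:
# plt.plot(list(ks), inertias, marker="o"); plt.xlabel("k")
```

The `inertia_` attribute is exactly the total sum of squared distances to the nearest cluster center, so plotting it against k gives the curve the problem asks for.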
(b) There are 17 attributes in attributes.csv; use principal component analysis to find the first
two principal components. Make a scatter plot of the basins in 2D with PC-1 and PC-2 as the axes.
Better yet, use colors to indicate which cluster they belong to.
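Part (b) can be sketched with scikit-learn's PCA. Again a random array stands in for the 17 attributes, and the k-means labels from part (a) would supply the colors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X17 = rng.normal(size=(100, 17))         # stand-in for the 17 attributes

Z = StandardScaler().fit_transform(X17)  # PCA is scale-sensitive
pca = PCA(n_components=2)
pcs = pca.fit_transform(Z)               # (n_basins, 2): PC-1 and PC-2 scores

# Scatter plot colored by the cluster labels from part (a):
# plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="tab10")
print(pca.explained_variance_ratio_)     # variance captured by PC-1 and PC-2
```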
3. Boosting and feature importance. Still working with the CAMELS dataset, extract annual
average runoff from runoff_mm.csv (we did this in hw1). Together with the 17 attributes, you
have 18 attributes. Normalize the attributes first.
(a) Write a loop: in each iteration, predict one of the attributes using xgboost with the remaining 17
attributes as inputs. You can predict all 18 attributes with this loop. Which attribute has the
highest predictability?
(b) For the most predictable attribute you found in (a), use permutation_importance to rank the
feature importance.
4. Neural network training. Use 80% of the basins as train and 20% as test. Report both the train
and the test metrics.
(a) For the problem of predicting annual average runoff_mm using the other 17 attributes, write
a PyTorch code to train a 2-layer neural network.
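A minimal PyTorch sketch for (a). Random tensors stand in for the normalized attributes and runoff target, and the hidden width of 32 is an arbitrary choice:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# stand-ins: real inputs are the 17 normalized attributes, target is runoff_mm
X = torch.randn(200, 17)
y = X[:, :3].sum(dim=1, keepdim=True) + 0.1 * torch.randn(200, 1)

n_train = int(0.8 * len(X))              # 80/20 basin split
Xtr, Xte, ytr, yte = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# 2-layer network: one hidden layer plus a linear output layer
model = nn.Sequential(nn.Linear(17, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(200):                 # full-batch training
    opt.zero_grad()
    loss = loss_fn(model(Xtr), ytr)
    loss.backward()
    opt.step()

with torch.no_grad():
    train_mse = loss_fn(model(Xtr), ytr).item()
    test_mse = loss_fn(model(Xte), yte).item()
```

Reporting both `train_mse` and `test_mse` satisfies the "report both metrics" requirement; with real data, shuffle the basins before splitting.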
(b) Write a two-layer MLP as an autoencoder for the 17 catchment attributes (not including
runoff) with a hidden size of 4 or 6. What reconstruction error do you get for these two
setups?
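The autoencoder in (b) reuses the same two-layer pattern with the bottleneck as the hidden size; running it at both widths lets you compare reconstruction errors. Random data again stands in for the 17 attributes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(200, 17)                 # stand-in for normalized catchment attributes

def train_autoencoder(hidden, epochs=300):
    # two-layer MLP autoencoder: 17 -> hidden -> 17, trained to reproduce its input
    model = nn.Sequential(nn.Linear(17, hidden), nn.ReLU(), nn.Linear(hidden, 17))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)      # reconstruction error (MSE)
        loss.backward()
        opt.step()
    return loss.item()

errs = {h: train_autoencoder(h) for h in (4, 6)}  # compare the two bottlenecks
```

On real, correlated catchment attributes the wider bottleneck (6) would typically reconstruct better than the narrower one (4); the gap between the two errors is what the question asks you to discuss.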