ECON 128 Machine Learning Final Project
Background

In order to eliminate poverty, it is imperative to be able to identify households suffering from poverty and target them with assistance. However, the identification of households in poverty relies on data from consumption surveys that is difficult, expensive, and time-consuming to collect.

Therefore, recent efforts have focused on “rapid surveys” that rely on a limited number of poverty identifiers serving as effective proxies for a household’s poverty status.

Objective

The World Bank has asked you to identify the most important variables that determine a household’s poverty status to help them reduce the cost associated with compiling data to predict poverty.

Data

The data provided for analysis consist of household responses to a World Bank consumption survey. Each observation has a unique household id identifying the household whose survey responses it records. Further, each household is labeled as in or out of poverty through the Poor indicator variable. A sample of the data is as follows:

Notice that all of the variable names are encoded as random character strings but correspond to actual survey questions. Categorical variables may reflect questions such as whether your household has items such as bar soap, cooking oil, matches, and salt. Numeric variables often reflect questions such as “How many working cell phones in total does your household own?” or “How many separate rooms do the members of your household occupy?” The project does not ask you to determine the real meaning of the variables you select; rather, you should identify the variables, in their encoded state, that best predict poverty.

Two datasets in the format pictured above are supplied, one for model training and one for model testing. No external data beyond what is provided should be used for modeling.

Error Metric

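When evaluating your model’s ability to predict a household’s poverty status, you should use the logloss error metric. We define the logloss metric through the following formula, given here in its standard binary form:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\,\right]$$

where N is the number of households, y_i is 1 if household i is poor and 0 otherwise, and p_i is the predicted probability that household i is poor.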

The logloss metric takes any value from 0 to positive infinity, and a model scoring 0 is a perfect classifier. Also, notice how the logloss error function operates: it rewards a model that confidently classifies a household correctly and punishes a model that is confident in a wrong classification. For example, a model that predicts a high probability of poverty for a household that is actually poor will receive a lower logloss score than a model that predicts a high probability of poverty for a household that is not poor.
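The behavior described above can be checked numerically. The following is a minimal sketch, assuming the standard binary definition of logloss (the negative mean over households of y·log(p) + (1 − y)·log(1 − p)); the function name `log_loss`, the clipping constant `eps`, and the example probabilities are illustrative choices and not part of the assignment.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss.

    y_true: 0/1 labels (1 = poor); p_pred: predicted probabilities of being poor.
    eps is an assumed clipping constant that keeps log() away from 0.
    """
    if len(y_true) != len(p_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that a prediction of exactly 0 or 1 never produces log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Illustrative probabilities only: the model says "90% likely poor" in both cases.
confident_right = log_loss([1, 1], [0.9, 0.9])  # households really are poor
confident_wrong = log_loss([0, 0], [0.9, 0.9])  # households are not poor
```

Here `confident_right` equals -log(0.9) and `confident_wrong` equals -log(0.1), so the confidently wrong model receives the larger (worse) score, matching the description in the text.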

Deliverables

Assuming that the World Bank has contracted you for this project, compile a report that communicates your problem-solving approach and works through the aspects of a data science project. Essentially, your report should contain the following elements:

1)Problem Description
Demonstrate your understanding of what the World Bank wants you to accomplish and give an overview of your solution plan.
2)Description of the data used for analysis.
Review some summary statistics of your data. For example, what is the distribution of poor households in your data set?
Upon reviewing the data, are there any problems that may present themselves for modeling? For example, if the distribution of poor households is heavily skewed, how might this affect the model? Similarly, are there variables with missing data? Do we need to impute data?
Describe any data cleaning performed. For example, were any new features or transformations of the data created? If missing data was found, how was it imputed?
3)Methods
Describe any models you are using with a brief explanation of how they work. If chosen models involve hyperparameters, explain what the hyperparameters are and how you plan to select the hyperparameters.
Focus on why you selected your models. Are there any advantages that you feel your model/approach has over other models/approaches? Any disadvantages?
4)Code
Provide the code that executes the proposed methods. Ensure that there are sufficient comments such that someone unfamiliar with your code can understand what you are doing.
5)Results and Conclusion
How well did your model perform on the test data?
What variables should the World Bank focus on to effectively predict poverty?
Justify any tradeoff in performance, as measured by your chosen error metric, for a model that uses only a sub-selection of variables. Is the sacrifice in model performance worth it for the reduced-variable model?
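The data checks under item 2 and the tradeoff question under item 5 can be sketched as follows. This is only an illustration using the Python standard library: the mini-sample, the column name `rooms`, the helper names, the logloss figures in `candidate_models`, and the tolerance of 0.02 are all made up for the example and do not come from the supplied datasets. Median imputation is just one possible choice, not a method the assignment prescribes.

```python
from statistics import median

# Hypothetical mini-sample; "rooms" stands in for one encoded numeric variable.
households = [
    {"id": 1, "Poor": True,  "rooms": 1},
    {"id": 2, "Poor": False, "rooms": 3},
    {"id": 3, "Poor": False, "rooms": None},  # missing survey answer
    {"id": 4, "Poor": True,  "rooms": 2},
    {"id": 5, "Poor": False, "rooms": 4},
]

def poor_share(rows):
    """Fraction of households labeled poor (checks for class imbalance)."""
    return sum(1 for r in rows if r["Poor"]) / len(rows)

def missing_share(rows, column):
    """Fraction of households with no recorded value in the given column."""
    return sum(1 for r in rows if r[column] is None) / len(rows)

def impute_median(rows, column):
    """Return copies of rows with missing values replaced by the observed median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in rows]

def smallest_model_within(results, tolerance):
    """results: (number_of_variables, test_logloss) pairs.

    Returns the pair with the fewest variables whose logloss is within
    `tolerance` of the best logloss among all candidates.
    """
    best = min(loss for _, loss in results)
    eligible = [(n, loss) for n, loss in results if loss <= best + tolerance]
    return min(eligible)

# Made-up scores for a full model and two reduced-variable models.
candidate_models = [(300, 0.30), (50, 0.31), (10, 0.36)]
chosen = smallest_model_within(candidate_models, tolerance=0.02)

filled = impute_median(households, "rooms")
```

With these made-up numbers, `chosen` is the 50-variable model, since its logloss is within 0.02 of the full model while the 10-variable model is not. In a real analysis, any imputation value would normally be computed from the training set only and then applied unchanged to the test set, and the acceptable tolerance is a judgment the report has to justify.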

This project has been adapted from a competition on the data science website: DrivenData.