Preliminary Information
In your report do not just replicate the process followed during the workshops! The objective of the
workshops is to introduce you to the different techniques discussed during the lectures, and not to give
you a roadmap on how to answer the coursework.
Assessment: In the coursework you will be assessed based on:
1. Your ability to correctly use the tools that we covered in the course.
2. Your ability to draw the correct conclusions from these tools.
3. Your ability to address the questions posed in the coursework based on an intelligent interpretation
of the evidence provided in the previous two steps. (Consult the CRISP-DM process described in
Chapter 1 of Guide to Intelligent Data Analysis.)
You will not be assessed on your capability to use R or any other software. For this reason don’t
include screenshots from any software or any other information about commands you used, or options
you set, or how to draw a figure etc. You will simply be wasting valuable space.
You are free to use any software you like to do the coursework. However, the fact that the software
you chose does not offer a particular capability which we covered in the workshops is not an
acceptable excuse for failing to complete a task.
Page limits: Your report must be submitted as a PDF file that does not exceed 12 pages, with a
typeface of at least 11 points. This limit is strict and it includes appendices (which I strongly recommend
that you don’t use). If your report exceeds the page limit I will simply stop reading at the end of page
12 and not take into account anything from the remaining pages in the assessment.
Plagiarism: This is an individual piece of assessment, and you should ensure that your report reflects
your own work exclusively.
All reports go through automated software to detect plagiarism from a variety of sources (including
past and current students’ reports as well as online resources, conference and journal publications etc.).
The consequences of plagiarism are very serious.
Problem Description/ Project Objectives
A bank wants you to develop a credit scoring model to classify applications for unsecured loans. You have
been provided with a sample of observations which contain information about past bank customers. (The
dataset provided to each student is unique.) The description of the variables in this dataset is provided in
the next section.
The bank is primarily interested in understanding the main factors that influence repayment
behaviour, so that it can exploit this knowledge to improve future decisions. The bank faces a trade-off:
on the one hand, accepting applicants for loans lets it retain its market share and increase its profit through
interest payments; on the other hand, it incurs losses when it gives loans to customers who default on
their debt. The bank managers are interested in the following questions:
1. What is the best way for the bank to use a statistical model to achieve the following goals:
   – Accept the maximum number of good customers, provided that at least 85% of bad customers are
     correctly identified.
   – Accept at least 70% of good customers while rejecting as many bad customers as possible.
2. If the previous two goals were not specified, which statistical model would you recommend, and why?
   Compare this model to the ones recommended in the previous question, and discuss similarities and
   differences.
3. How many and which are the most important variables that determine the repayment behaviour of
   mortgage customers? (Do these differ depending on the objective and/or the classification method
   used?)
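To make the two goals in the first question concrete, the following minimal sketch shows how each one can be read as a constrained choice of cut-off on a model's score. It assumes, purely for illustration, that the model outputs a score estimating the probability that BAD=1 and that an applicant is rejected when the score is at or above the cut-off. The function names and the toy scores are hypothetical and are not part of the coursework data.

```python
def rates_at_threshold(scores, labels, threshold):
    """Reject when score >= threshold (score = assumed estimate of P(BAD=1)).

    Returns (share of bad customers rejected, share of good customers accepted).
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    bad_rejected = sum(s >= threshold for s in bad) / len(bad)
    good_accepted = sum(s < threshold for s in good) / len(good)
    return bad_rejected, good_accepted


def best_threshold(scores, labels, constrain, minimum):
    """Scan candidate cut-offs; keep those meeting the constraint and
    return (threshold, value of the other rate) with the other rate maximised.

    constrain is "bad_rejected" or "good_accepted" (hypothetical labels).
    """
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        bad_rej, good_acc = rates_at_threshold(scores, labels, t)
        if constrain == "bad_rejected":
            constrained, objective = bad_rej, good_acc
        else:
            constrained, objective = good_acc, bad_rej
        if constrained < minimum:
            continue  # this cut-off violates the stated goal
        if best is None or objective > best[1]:
            best = (t, objective)
    return best


# Made-up scores and outcomes (1 = bad, 0 = good), for illustration only.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

# Goal 1: identify at least 85% of bad customers, accept as many good as possible.
goal1 = best_threshold(scores, labels, "bad_rejected", 0.85)
# Goal 2: accept at least 70% of good customers, reject as many bad as possible.
goal2 = best_threshold(scores, labels, "good_accepted", 0.70)
print(goal1, goal2)
```

On this toy sample the two goals lead to different cut-offs, which is the kind of trade-off the questions ask you to discuss. In practice the cut-off should be chosen on data that was not used to fit the model.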
Data Description
You are provided with a sample of observations which contain information about past bank customers. The
dataset provided to each student is unique. The main variables in this dataset are described in Table 1.
The class variable (i.e. the variable we want to predict) is called BAD. In addition to the variables described
in the table, the dataset you were provided with contains 9 more variables. Each of these is named M_
followed by the name of one of the main variables: for example, M_MORTDUE or M_DEBTINC. All the M_ variables are
binary (i.e. take values in {0, 1}). They were created because the original dataset contained a large number
of missing values. For each variable that had missing values in the original data (e.g. MORTDUE), the missing
values were replaced, and a binary variable (M_MORTDUE) was created that indicates whether the value of
the variable was missing in the original dataset (M_MORTDUE=1) or not (M_MORTDUE=0). In other words, the
value of a variable like DEBTINC is the actual, observed value when M_DEBTINC=0. When M_DEBTINC=1, the
value of DEBTINC has been predicted (and therefore does not correspond to the actual value of this variable
for that customer). You don’t know which method was used to replace these missing values.
Name      Type        Description
BAD       Binary      1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
LOAN      Continuous  Amount of the loan request
MORTDUE   Continuous  Amount due on existing mortgage
VALUE     Continuous  Value of current property
REASON    Nominal     Not Provided; DebtCon = debt consolidation; HomeImp = home improvement
JOB       Nominal     Occupational categories
YOJ       Continuous  Years at present job
DEROG     Continuous  Number of major derogatory reports
DEBTINC   Continuous  Debt-to-income ratio
CLAGE     Continuous  Age of oldest credit line in months
NINQ      Continuous  Number of recent credit inquiries
CLNO      Continuous  Number of credit lines
DELINQ    Continuous  Number of delinquent credit lines
Table 1: Description of main variables in training dataset
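The relationship between a main variable and its missing-value indicator can be illustrated with a small sketch. The brief does not say which method was used to replace the missing values, so the median fill below is only an assumption made for the example, and the raw values are made up. The function name is hypothetical.

```python
from statistics import median


def impute_with_indicator(values):
    """Return (filled values, indicator) for a list where None marks a missing value.

    Median imputation is an assumption for this sketch only; the method
    actually used for the coursework data is not stated.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]  # 1 = was missing
    return filled, indicator


# Made-up debt-to-income values with two missing entries.
raw_debtinc = [34.8, None, 29.1, 41.3, None]
debtinc, m_debtinc = impute_with_indicator(raw_debtinc)
print(debtinc)    # entries flagged 1 in m_debtinc are not observed values
print(m_debtinc)
```

Whatever the replacement method, the point is the same: where the indicator equals 1, the value of the main variable is a substitute and should not be read as an observation for that customer.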
Tasks
Exploratory Data Analysis (40 marks).
In particular, consider each variable and answer the following questions:
– Does this variable appear to be important for the task at hand? (After discussing each variable
separately provide a ranking of the importance of all explanatory variables.) Support your claims
with appropriate visualisations that document whether and how important each variable is.
– Are different variables related, and which variables convey information similar to that provided
in other variable(s)?
– Do you find evidence of “outliers” or other issues with data quality (e.g. incorrect observations)?
– For which variables is the fact that specific values were missing in the original dataset informative,
and what are the implications of this?
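One simple way to start on the last question above is to compare the default rate between customers whose value was originally missing and those whose value was observed. The sketch below assumes the class variable and a missing-value indicator are available as two lists of 0/1 values. The function name and the toy data are hypothetical and say nothing about the actual dataset.

```python
def default_rate_by_flag(bad, flag):
    """Return {flag value: (number of customers, default rate)}."""
    groups = {}
    for y, m in zip(bad, flag):
        n, defaults = groups.get(m, (0, 0))
        groups[m] = (n + 1, defaults + y)
    return {m: (n, d / n) for m, (n, d) in groups.items()}


# Made-up outcomes (1 = bad) and indicator values (1 = originally missing).
bad = [0, 1, 0, 0, 1, 1, 0, 1]
m_debtinc = [0, 1, 0, 0, 1, 0, 0, 1]

rates = default_rate_by_flag(bad, m_debtinc)
for flag_value, (n, rate) in sorted(rates.items()):
    print(f"indicator={flag_value}: n={n}, default rate={rate:.2f}")
```

A large difference between the two groups would suggest that missingness itself carries information. With small groups, the difference should be judged against sampling variability before drawing conclusions.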
Statistical Modelling (60 marks).
– What is the appropriate performance measure for this application and why? Relate this to the
project objectives.
– For the two types of classifiers (logistic regression and decision trees), discuss the different settings you
used and why you considered these important. (Also consider the choice of variable selection method
as part of this question.)
– For each classification method develop one or a few candidate models that you think are promising
before providing a final recommendation of the most appropriate model (for each question in the
project objectives section). You do not need to discuss every model you tried in detail, but
you must include the results for the important steps in the process that led you to the final
recommendations. I am particularly interested in understanding the steps you followed and the
justification for these. (Refer to the CRISP-DM process discussed during the lectures and
in Chapter 1 of the Guide to Intelligent Data Analysis).
– Comment on the generalisation performance of the model(s) you recommend for each type of
classifier.
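As an illustration of what commenting on generalisation performance can involve, the sketch below computes the area under the ROC curve (AUC), one possible threshold-free measure, on the data used to fit a model and on held-out data, and reports the gap. AUC is used here only as an example; which measure is appropriate for the project objectives is for you to argue. The scores are made up and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5).

    Equals the share of (bad, good) pairs in which the bad customer
    receives the higher score.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores from a hypothetical model: fitting data vs held-out data.
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
train_labels = [1, 1, 1, 0, 0, 0]
test_scores = [0.9, 0.6, 0.7, 0.3, 0.2, 0.4]
test_labels = [1, 1, 0, 0, 0, 1]

train_auc = auc(train_scores, train_labels)
test_auc = auc(test_scores, test_labels)
print(f"train AUC={train_auc:.3f}, test AUC={test_auc:.3f}, "
      f"gap={train_auc - test_auc:.3f}")
```

A model that looks much better on the fitting data than on held-out data is likely overfitting, which is one reason to revisit earlier modelling decisions.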
The coursework requires you to write a report explaining your findings. This means that you need to
explain each figure, table or number you include in the report. In other words, including a relevant figure
without explaining what conclusions follow from it will get you no marks.
You do not need to write an executive summary, or include a cover page or a table of contents.
You do need to include at the end of your coursework a Conclusions section which summarises your
findings and clearly answers the questions posed in the project objectives section. In this section I
would also recommend discussing the relative advantages and limitations of the two types of classifiers
for the problem at hand.
Report Assessment
Your coursework will not be evaluated by the quality of the final model alone, or by whether you got a
particular answer right. You will be primarily assessed by whether you are able to correctly justify the steps
you took to complete the assignment. In other words, your report needs to document that you are able to
intelligently analyse the provided data, that you draw correct conclusions from what you observe, and that
these conclusions lead you either to the next logical step of the data mining process, or to the revision of
decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered in
the first lectures, and in particular to the feedback loops.)
Therefore, don’t simply present the conclusions or results of your analysis and expect to get a high mark.
Reports that don’t document the steps followed and the reasons why these were chosen will receive minimal
marks, even if the final answer is sensible. Explain your reasoning clearly and in good English. Don’t
provide a list of bullet points, unstructured sentences etc. Similarly, don’t include figures or any
other output from R that you don’t comment on or explain in the text. I will not assume that you know
how to interpret these correctly.