AD699: Data Mining for Business Analytics
Fall 2018
Quiz 2
30 OCT 2018
QUIZ #2: Question Bank
1. Which of the following best describes the purpose of using the naive Bayes approach, rather than the exact Bayesian classifier, to determine the probability that a particular record will belong to a particular class?
a. Unlike other types of classification algorithms, naive Bayes allows us to make classification cutoffs on a sliding scale; that is, we can use naive Bayes to determine whether a record is likely to belong to a particular class, even if we don’t use .50 as the cutoff point for the decision.
b. With full conditional probability, we must rely on the use of each predictor independently; with naive Bayes, however, we can achieve much more complexity.
c. A naive Bayesian algorithm can perform dimension reduction on large datasets, whereas the exact Bayesian classifier leaves the data with its original number of dimensions, regardless of the format.
d. The exact Bayesian approach requires that we have records in our training set that contain the same predictors as the record we’re trying to classify; with naive Bayes, we can assess probability of class membership for a new record, even if our model has never seen a record like it in the past.
2. There seem to be some strange words at the bottom of this e-mail. Assume that the sender placed them there with the intention of committing “Bayesian poisoning.” What was the sender’s intent?
a. The sender knows that Yahoo mail does not use a smoothing constant, so a misspelled word such as “kyng” or “pharmaceuticcal” will outsmart the naive Bayesian filter.
b. The sender is taking advantage of the fact that certain proper nouns, such as “Buckingham” and “Harwich”, cannot be considered spam.
c. The extra words placed at the bottom of the e-mail are designed to distract the reader -- by focusing on the precise meaning of those phrases, the reader may not notice the other content of the message, and will therefore be more likely to click on the link.
d. If the sender placed several spammy words in the body and subject line of the e-mail, he was hoping to dilute their impact on a naive Bayesian model by including these extra words that would either be unknown to the model, or even considered less likely to be spammy.
3. Your friend is building a k-nearest neighbors model. She has 150 records in her training set. Not sure where to start with a k value, she decides to start with 5, and then to keep incrementing by 10 more (k=5, then k=15, k=25, k=35, and so on). What would you expect to happen to her model as the k values keep increasing?
a. The model will use linear optimization to identify the 150 most likely outcomes (because there are k-2 degrees of freedom).
b. The model will lose its statistical significance unless an exhaustive search method is used.
c. The model will use the classification outcome of the record that is statistically closest to any new record that we test in order to classify the new record.
d. The higher the k value she uses, the more likely that her model will predict that a new record belongs to the same class as the dominant class in the training set.
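The effect of a growing k can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the 1-D training set (100 records of class "A", 50 of class "B"), the query point, and the `knn_predict` helper are all hypothetical and chosen only to total 150 records.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training records closest to `query` (1-D distance)."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set of 150 records: class "A" dominates (100 vs. 50).
train = [(float(i), "A") for i in range(100)] + [(200.0 + i, "B") for i in range(50)]

# The query sits in the middle of the "B" cluster.
query = 225.0
for k in (5, 55, 105, 145):
    print(k, knn_predict(train, query, k))
```

With small k the vote is decided by nearby "B" records; once k exceeds twice the size of the minority class, the dominant class wins the vote no matter where the query lies.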
4. You are working on a multiple linear regression model, when a mysterious voice shouts from out of nowhere: “Remember to use the principle of parsimony, to avoid multicollinearity, and to maintain accuracy as best you can.”
Which of the following would this mysterious speaker be most likely to recommend?
a. Consider using fewer inputs in your model. When considering which inputs to use, keep ones that are highly correlated with the outcome variable, and highly correlated with other inputs.
b. Consider using more inputs in your model. When considering which inputs to drop, get rid of any that are uncorrelated with the outcome variable and also uncorrelated with other inputs.
c. Consider using fewer inputs in your model. When considering which inputs to use, keep ones that are highly correlated with the outcome variable, but not correlated with other inputs.
d. Consider using more inputs in your model. When considering which inputs to drop, get rid of any that are correlated with the outcome variable, but not correlated with other inputs.
5. A Naive Bayes model is assessing the following sentence: “Get meds from Canadian pharmacies discreetly shipped to your door no prescription needed!”
Which of the following statements is true?
a. The model will look for suspicious multiple-word strings, such as “discreetly shipped” and “no prescription.”
b. The model will assess each of the words in the sentence independently (i.e. regardless of what other words are in the sentence).
c. The model will be rendered ineffective by the word “Canadian.”
d. The model will make a classification estimate for this sentence as either “spam” or “ham” without factoring in the overall percentage of e-mails in the training set that are either spam or ham.
6. The Boston Crime Movies Club is hard at work once again, trying to capture data about its members. Based on the information shown below, what is the Hamming distance between Hall and Oates?

|  | The Departed | Black Mass | Boondock Saints | The Town | Mystic River |
| --- | --- | --- | --- | --- | --- |
| Hall | 7 | 8 | 9 | 5 | 10 |
| Oates | 5 | 6 | 9 | 7 | 8 |
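As a rough sketch of the calculation, Hamming distance can be computed as the number of positions at which two equal-length records differ. The `hamming_distance` helper below is a hypothetical name, and treating each movie rating as one position is an assumption about how the question intends the measure to be applied.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Ratings in table order: The Departed, Black Mass, Boondock Saints,
# The Town, Mystic River.
hall = [7, 8, 9, 5, 10]
oates = [5, 6, 9, 7, 8]

print(hamming_distance(hall, oates))
```

Note that the size of each difference is ignored; only whether the two values match matters.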
7. A local sports memorabilia company needs our help. They have some data regarding local sports fans’ buying habits, but they want to see if we can help them to understand the conditional probabilities based on their data. They surveyed 800 local fans, and found that 712 buy Red Sox gear (hats, shirts, bumper stickers, etc.). They found that 456 people surveyed sometimes buy Bruins gear, and that just 20 fans out of the whole group never buy Red Sox gear or Bruins gear. Given that a fan buys Bruins gear, what is the chance that he never buys Red Sox gear?
8. Using the same information provided in the question above, what is the probability that a local sports fan never buys Bruins gear?
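One way to organize the survey counts for the two questions above is inclusion-exclusion followed by a conditional probability. The sketch below assumes that "sometimes buy Bruins gear" means the fan counts as a Bruins buyer; the variable names are illustrative only.

```python
total = 800      # fans surveyed
red_sox = 712    # buy Red Sox gear
bruins = 456     # buy Bruins gear (assumed to include "sometimes")
neither = 20     # buy neither

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
either = total - neither
both = red_sox + bruins - either
bruins_only = bruins - both

# P(no Red Sox | Bruins) = P(Bruins and no Red Sox) / P(Bruins)
p_no_red_sox_given_bruins = bruins_only / bruins

# P(never buys Bruins gear)
p_never_bruins = (total - bruins) / total

print(both, bruins_only)
print(round(p_no_red_sox_given_bruins, 4), p_never_bruins)
```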
9. On your way home from a grueling day of classes, you walk to the T platform at Kenmore, and then start to zone out a little bit and daydream about home. Suddenly, you notice two young men arguing, with voices escalating in volume, so you take your earbuds off and begin to listen in:
Person A: “I’m tired of it, man. I’m tired of you, and I’m tired of you, and I’m tired of your ways!!”
Person B: “Dude, you’re redundant. Did you realize you just said ‘I’m tired of you’ twice in a row? But whatever -- just listen to me for a second -- the students in AD699 took a survey that asked about their tastes, habits, and opinions. They ranked their level of agreement with a series of 10 statements on a continuous scale from 1 to 5, with 5 being ‘strongly agree’ and 1 being ‘strongly disagree.’ The results shown below, which list Marco’s closest neighbors as measured by Euclidean distance, indicate that Beth, Cindy, and Dennis all answered each of the questions the same way.”
Person A: “Yeah, I heard you say that. You just keep saying that over and over again. But the similarity among Beth, Cindy, and Dennis doesn’t necessarily mean their answers were all the same.”
| Student | Neighbor | Distance |
| --- | --- | --- |
| Marco | Alex | 2.125 |
| Marco | Beth | 3.177 |
| Marco | Cindy | 3.177 |
| Marco | Dennis | 3.177 |
| Marco | Elaine | 4.695 |
| Marco | Fredro | 7.812 |
Who is right here -- Person A or Person B?
a. Person B is right. The way Euclidean distance is calculated, it guarantees that any two records that have the exact same Euclidean distance to another record must have the same values as one another (in other words, Beth, Cindy, and Dennis must have all answered each question the same way).
b. Person A is right. Even when the Euclidean distance between two records is the same, there is no guarantee that the individual variable values for each record’s attributes are the same.
c. Person A is right, because the data was reported on a continuous scale (if it had been reported on a discrete scale, Person B would be right).
d. Person B is right, because he’s taking the score correlations into account, which Person A is neglecting to do.
d. Person B is right, because he’s taking the score correlations into account, which Person A is neglecting to do.
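The relationship between equal distances and equal records can be checked with a small sketch. The survey here is hypothetical and cut down to two questions for readability; the answer values are invented and are not taken from the AD699 data.

```python
import math

# Hypothetical two-question survey on the 1-to-5 scale.
marco = (3.0, 3.0)
neighbors = {
    "Beth": (3.0, 5.0),
    "Cindy": (5.0, 3.0),
    "Dennis": (1.0, 3.0),
}

# Euclidean distance from Marco to each neighbor.
distances = {name: math.dist(marco, answers) for name, answers in neighbors.items()}

for name, d in distances.items():
    print(name, neighbors[name], d)
```

In this made-up example all three neighbors are the same distance from Marco even though no two of them gave the same answers, because every point on a circle (or hypersphere) around Marco is equally far from him.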
10. Imagine that AD699 has five new teams of students, as shown below. Each team was asked to rank lectures 1 through 5 on a scale from 1 to 10. A correlation table based on those rankings is shown below. Which pair of teams shown below has the greatest correlation distance?
11. You want to build a model that aims to predict whether someone might become a purchaser of rare diamonds. Why would it be appropriate for you to use oversampling to build your training set?
a. If oversampling is not performed, then we will not be able to identify any diamond purchasers, thus labeling the entire process ineffective.
b. By oversampling, we can be sure that the model includes exactly the same proportion of probable diamond purchasers as exists in the overall population.
c. If we oversample, we know that our model will be well-protected against the possible influence of outliers.
d. Large diamond purchasers are rare. If we want to identify the things that make them unique, it will be harder to do so if we have to study them in a way that’s proportional to their representation across the population.
12. Imagine that the proportion of all e-mails in our training data that are spam is .56, and the proportion of e-mails that are ham is .44. An e-mail comes through our Bayesian filter and it contains only four words: “Get Russian bride fast.” These four words were all contained in our training data. The probabilities that each of these words would appear in either type of e-mail are shown below:
|  | SPAM | HAM |
| --- | --- | --- |
| Get | .003 | .001 |
| Russian | .09 | .04 |
| bride | .095 | .042 |
| fast | .001 | .002 |
What is the naive Bayesian probability that this e-mail is spam?
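A minimal sketch of the naive Bayes calculation, using the priors and word probabilities given in the question: for each class, multiply the class prior by the per-word conditional probabilities, then normalize across the two classes. The dictionary layout and variable names are illustrative choices, not part of the original quiz.

```python
import math

priors = {"spam": 0.56, "ham": 0.44}

# P(word | class), taken from the table in the question.
likelihoods = {
    "spam": {"Get": 0.003, "Russian": 0.09, "bride": 0.095, "fast": 0.001},
    "ham": {"Get": 0.001, "Russian": 0.04, "bride": 0.042, "fast": 0.002},
}

words = ["Get", "Russian", "bride", "fast"]

# Naive Bayes numerator per class: prior times the product of word likelihoods,
# treating the words as conditionally independent given the class.
scores = {
    label: priors[label] * math.prod(likelihoods[label][w] for w in words)
    for label in priors
}

# Normalize so the two class probabilities sum to 1.
p_spam = scores["spam"] / sum(scores.values())
print(round(p_spam, 4))
```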
13. The chart below shows the summary information for a simple linear regression model with iris petal width as the outcome variable and iris petal length as the input variable, with both measured in centimeters. For every 100 cm increase in petal length, how much should we expect the petal width to change?
14. While taking the T home from class, you overhear someone speaking very loudly. He yells out, to no one in particular, “Sure, this naive Bayes calculation doesn’t give us true probabilities. But really, who cares? It still gets the job done.” Your stop is next, and you get off the Green Line before hearing the rest of this guy’s ramblings. But assume he is right. Why would he say this?
a. Probability is not at all relevant in a naive Bayes model, so he’s right in a very general sense.
b. A naive worldview can still lead to good results -- for instance, it could lead to a safer, risk-averse mentality when building a predictive model.
c. By working in a comprehensive way, a naive Bayes model cuts out the noise factor normally associated with dimension reduction.
d. For a classification task, knowing the true probabilities isn’t as important as knowing the relative probabilities -- after all, we just need to place records in our test data set into the correct groups.
15. You are building a model that predicts whether someone prefers to eat steak one of five ways -- Well, Medium Well, Medium, Medium Rare, or Rare. Can you use a lift chart to analyze your results?
a. You can keep the data exactly as you have it, with all five classification groups, but you first have to place the likelihood of someone’s membership in each group in rank order.
b. No, because lift charts cannot be used for classification tasks -- only for regression tasks and unsupervised learning tasks.
c. Not right away. You would have to first reduce the categories to just two (for instance, you could have one ‘important’ category and label the rest as ‘other’ and then build the lift chart).
d. It depends on the cutoff value used for your classifiers. Assuming you have a cutoff value of at least .50, then you will have no problems with this.