QnA - Performance Measure of Models

Collection of questions and answers on performance measure of models

Which is more important to you: model accuracy or model performance?

Let's answer this with respect to classification problems. Model performance is more important. Accuracy alone is misleading when the dataset is imbalanced (when one class heavily outnumbers the other). Accuracy also assigns equal weight to every label, which is a disadvantage on imbalanced datasets because mistakes on the rare class barely move the score.

Classification model performance can be evaluated with metrics such as Log-Loss, Accuracy, AUC (Area Under the Curve), and precision and recall (commonly used in information retrieval, e.g. by search engines).

Can you cite some examples where a false positive is more important than a false negative?

Consider a model where 1 (positive) means that a mail is Spam and 0 (negative) means that the mail is not Spam. If False Positives are high, important mails will go to the Spam folder, and it may become difficult to retrieve such a mail from the huge chunk of mails in the Spam folder. A False Negative, on the other hand, means that a spam mail lands in the Primary mailbox.

Now, it is not difficult to mark a mail in the Primary mailbox as Spam. But, as mentioned earlier, it is very difficult to retrieve a mail from the Spam folder. Hence, in cases like this, a False Positive is more important than a False Negative.

Can you cite some examples where a false negative is more important than a false positive?

In Cancer diagnosis, let 1 (positive) denote positive for Cancer and 0 (negative) denote negative for Cancer. A False Negative would mean that a patient who has Cancer has been diagnosed as negative for Cancer. This situation is very dangerous: because the ML model reported a negative result, the patient will not be subjected to follow-up investigation.

On the other hand, a False Positive is not as dangerous as a False Negative. If the patient does not have Cancer but the ML model shows positive, the patient will be subjected to further follow-up investigation, which can then rule the disease out.

Can you cite some examples where both false positive and false negatives are equally important?

Consider posting articles in a blog. If an article is read by more than the average number of readers of my blog, it is positive. Else, it is negative.
A false positive would mean that the article is reported to have more readers than the average for my blog, when in truth fewer than the average number of readers have read it. Here, the false positive gives me a wrong motivation, but the same motivation ensures that I keep writing. Writing helps me stay in practice.
A false negative would mean that the article is reported to have done no better than my other articles, when in truth it garnered more readers than the average readership of my blog. Here, the false negative prompts me to introspect on the quality of my writing and ultimately helps me improve.

What is the most frequent metric to assess model accuracy for classification problems?

The answer to this question is very domain specific. For an overall idea, we can say that a confusion matrix is better than simple accuracy because it reports more quantities (TP, FP, TN, FN) than a single number. The ROC curve can be even more helpful because its area integrates over the whole range of TPR/FPR tradeoffs, i.e. over every decision threshold. Log-loss is another metric, and among the ones listed here it is the only one that uses the probabilistic score directly.

Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?

A ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) for a binary classifier as its discrimination threshold is varied. Unlike raw accuracy, which is computed at a single threshold, AUROC summarises the classifier over all thresholds and has a direct probabilistic interpretation.

The area equals the probability that a randomly chosen positive example ranks above (is deemed to have a higher probability of being positive than) a randomly chosen negative example.
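
The ranking interpretation above can be checked directly on a small example. The following is a minimal sketch, not a library implementation: the function name `auc_by_ranking` and the toy labels and scores are made up for illustration, and ties are assumed to count as half a win.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc_by_ranking(y_true, scores))
```

The double loop is quadratic in the number of points, which is fine for illustrating the idea but too slow for large test sets.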

What is Accuracy?

Accuracy can be defined as:
(Number of correctly classified points)/(Total number of points)

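Accuracy has two main drawbacks:
1) Imbalanced data: a dumb model that always predicts the majority class can get a very high accuracy, so accuracy should not be relied on as the measure for an imbalanced dataset.
2) Accuracy cannot use probabilistic scores: it only looks at the final predicted label, not at how confident the model was.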
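
To see the imbalanced-data problem concretely, here is a small sketch. The 95/5 class split and the always-negative "model" are assumptions chosen purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "dumb" model that ignores its input and always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
```

The model never finds a single positive point, yet it scores 95% accuracy.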

Explain about Confusion matrix, TPR, FPR, FNR, TNR?

A confusion matrix is a square matrix of counts, with predicted class labels along one axis and actual class labels along the other. The dimension of the square is equal to the number of class labels. A confusion matrix does not consider probabilistic scores.

A good model will have high TNR and TPR. The elements on the principal diagonal of the confusion matrix will be high for a good model.
Important parameters related to the Confusion Matrix:
TPR: True Positive Rate
FPR: False Positive Rate
FNR: False Negative Rate
TNR: True Negative Rate
TP: Number of true positive points
FP: Number of false positive points
TN: Number of true negative points
FN: Number of false negative points
P: Total actual positive points (P = TP + FN)
N: Total actual negative points (N = TN + FP)
TPR = TP/P; TNR = TN/N; FPR = FP/N; FNR = FN/P
(Figure omitted. Source: blog.revolutionanalytics.com/)
Therefore, with TPR, TNR, FPR and FNR, we get better insight into the model than with accuracy alone. It is up to the domain to decide which among TPR, TNR, FPR and FNR is more important.
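
The four rates defined above can be computed from raw labels in a few lines. This is only a sketch: the helper names `confusion_counts` and `rates` and the toy data are invented for illustration, and the code assumes binary 0/1 labels with at least one actual positive and one actual negative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """TPR = TP/P, FNR = FN/P, TNR = TN/N, FPR = FP/N."""
    p = tp + fn  # total actual positives
    n = tn + fp  # total actual negatives
    return {"TPR": tp / p, "FNR": fn / p, "TNR": tn / n, "FPR": fp / n}

# Hypothetical test set with 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)          # (3, 1, 5, 1) as (TP, FP, TN, FN)
print(rates(*counts))
```

Note that TPR + FNR = 1 and TNR + FPR = 1, so in practice reporting one rate from each pair is enough.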

What do you understand about Precision & recall, F1-score?

Precision and recall are often used in information retrieval problems. They are defined with respect to the positive class/label of a dataset. Precision is TP/(TP+FP). It means: of all the points predicted to be positive, what fraction are actually positive?

Recall is nothing but the True Positive Rate (TPR), i.e. TP/(TP+FN). It means: out of all the actual positive points, how many are correctly predicted to be positive?

We want precision to be high, which means that few points are wrongly predicted to be positive. We also want recall to be high, which means that most of the actual positive points are correctly detected as positive.

Precision(Pr) and Recall(R) are combined in F1-Score.
$$F1Score = 2*\frac{Pr*R}{Pr+R}$$
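
As a quick worked example of the three formulas, here is a sketch with made-up counts. The function name `precision_recall_f1` is hypothetical, and the code assumes TP + FP, TP + FN and Pr + R are all non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2*Pr*R/(Pr+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision)  # 0.8
print(recall)     # 0.5
print(f1)         # roughly 0.615
```

F1 is the harmonic mean of precision and recall, so it lands closer to the lower of the two (0.5 here) than a plain average would.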

What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are binary classification metrics. The ROC curve is a plot of TPR against FPR as the decision threshold is varied. An AUC score integrates over the whole range of TPR/FPR tradeoffs, while the F1 score takes one specific precision and recall pair, i.e. a single threshold. The area under the ROC curve lies between 0 and 1. 1 signifies a very good model; 0 means terrible (the model ranks every negative above every positive).

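Some properties of AUC:
1. If we have imbalanced data, AUC can look optimistic even for a weak model, because with many negatives a large number of false positives still gives a small FPR.
2. AUC does not care about the actual score assigned to a data point; it depends only on how the scores rank the points.
3. AUC of a random model is 0.5.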
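
The construction of the curve and the last two properties can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: `roc_points` and `auc_trapezoid` are hypothetical helper names, the labels and scores are toy data, and a point is predicted positive when its score is greater than or equal to the threshold.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low."""
    p = sum(y_true)            # total actual positives
    n = len(y_true) - p        # total actual negatives
    points = [(0.0, 0.0)]      # threshold above every score: nothing is positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n, tp / p))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_trapezoid(roc_points(y_true, scores))

# Property 2: squaring the scores changes their values but not their order.
auc_squared = auc_trapezoid(roc_points(y_true, [s * s for s in scores]))

# A model that gives every point the same score cannot rank anything.
auc_constant = auc_trapezoid(roc_points(y_true, [0.5] * 6))

print(auc, auc_squared, auc_constant)
```

The first two values agree (8/9 for this data), and the constant-score model gets 0.5, the same as random guessing.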

What is Log-loss and how does it help to improve performance?

Given a test set, log-loss is defined as:
$$LogLoss = -\frac{1}{n}\sum_{i=1}^{n}\left\{y_i\log(P_i)+(1-y_i)\log(1-P_i)\right\}$$
Here y_i is the true label (0 or 1) of the i-th point and P_i is the predicted probability that the i-th point belongs to the positive class.

The log-loss value is small when P_i is large for a point of the positive class/label. Likewise, the log-loss value is small when P_i is small for a point of the negative class/label. Log-loss can lie between 0 and infinity, and 0 is the best case. Log-loss takes into consideration the actual probabilistic values.

Log-loss is the average of the negative log of the probability assigned to the correct class label. Log-loss can be extended to multi-class labels.
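
The binary formula can be tried out with a short sketch. The clipping constant `eps` is an assumption added here so that log(0) never occurs; it is not part of the definition above, and the example probabilities are made up.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-probability of the correct label (binary case)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0]

# Confident and correct: high P for the positive point, low P for the negative one.
good = log_loss(y_true, [0.9, 0.1])

# Confident and wrong: the probabilities are swapped.
bad = log_loss(y_true, [0.1, 0.9])

print(good)  # roughly 0.105
print(bad)   # roughly 2.303
```

Both sets of predictions would give an accuracy of either 1 or 0 at a 0.5 threshold; log-loss additionally shows how heavily a confident mistake is penalised.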

Explain about R-Squared/ Coefficient of determination

The coefficient of determination is a performance measure for models whose predicted values are real numbers (regression). Let the actual value be y_i and the predicted value be y'_i; then we can calculate the error as e_i = y_i - y'_i

Now, we define the Total Sum of Squares, SS_total, as
$$SS_{total} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
where,
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$$
is the average value of the actual y_i in the test data.

In the simplest regression model, given a query point, we can return the mean of all the other outputs as its output. For example, to predict the height of one person in a group of 10, we can calculate the mean height of the other 9 persons and assign it as the height of the person under consideration.

The total sum of squares is the sum of squared errors of a simple mean model. Now we define the Residual Sum of Squares, SS_res, as
$$SS_{res} = \sum_{i=1}^{n}(y_i - y'_i)^2$$
where,
y'_i is the value predicted by the model.

SS_total is for a simple mean model, whereas SS_res is for the model under evaluation. Now, we can define R² as:
$$R^2 = 1-\frac{SS_{res}}{SS_{total}}$$

Case 1: SS_res = 0. This happens when every predicted value is exactly the same as the actual value, i.e. every error e_i = 0. In this case R² = 1, which means that our model is phenomenal.

Case 2: 0 < SS_res < SS_total. In this case, R² will be between 0 and 1.

Case 3: SS_res = SS_total. Then R² is 0, which means our model is no better than a simple mean model.

Case 4: SS_res > SS_total. Then R² becomes negative, which means our model is worse than a simple mean model.
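
The cases above can be reproduced with a few lines of code. This is a minimal sketch with made-up numbers; `r_squared` is a hypothetical helper name, and it assumes the actual values are not all identical (otherwise SS_total would be 0).

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
    return 1 - ss_res / ss_total

# Hypothetical actual values; their mean is 3.0 and SS_total is 10.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0, 5.0])     # SS_res = 0
decent = r_squared(y_true, [1.5, 2.0, 3.0, 4.0, 4.5])      # SS_res < SS_total
mean_model = r_squared(y_true, [3.0, 3.0, 3.0, 3.0, 3.0])  # SS_res = SS_total
reversed_ = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])   # SS_res > SS_total

print(perfect)     # 1.0
print(decent)      # between 0 and 1
print(mean_model)  # 0.0
print(reversed_)   # -3.0
```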

Explain about Median absolute deviation (MAD)? Importance of MAD?

The errors e_i and the sums of squares can suffer from outlier points, i.e. if one error is very large, its square dominates the sum and our entire R² can be thrown off. R² is not very robust to outliers.

Now, the error e_i is a random variable. Instead of its mean, we can take its median, i.e. median(e_i) = central value of the errors.
Median Absolute Deviation, MAD(e_i) = median(|e_i - median(e_i)|)
The median is a robust alternative to the mean, and MAD is a robust alternative to the standard deviation.
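
A small sketch makes the robustness visible. The error values are invented for illustration, and `median_absolute_deviation` is a hypothetical helper built on Python's standard `statistics` module.

```python
from statistics import median, pstdev

def median_absolute_deviation(errors):
    """MAD = median(|e_i - median(e)|)."""
    center = median(errors)
    return median([abs(e - center) for e in errors])

# Hypothetical errors, without and with a single outlier.
clean = [-2, -1, 0, 1, 2]
with_outlier = [-2, -1, 0, 1, 100]

print(median_absolute_deviation(clean))         # 1
print(median_absolute_deviation(with_outlier))  # 1

# The standard deviation, in contrast, is blown up by the single outlier.
print(pstdev(clean), pstdev(with_outlier))
```

One extreme error leaves the MAD unchanged, while the standard deviation grows from about 1.4 to about 40.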

[mathjax]