Here are a few questions and answers related to Classification Algorithms In Various Situations

What are imbalanced and balanced datasets?

If a dataset has an unequal number of positive and negative data points, then the dataset is imbalanced. A balanced dataset has roughly equal numbers of positive and negative labels.
For example: in a dataset of patients who do or do not have cancer, there will be very many negative data points (patients without cancer) and far fewer positive data points (patients with cancer).

K-NN results could be biased if the dataset is heavily imbalanced.
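As a rough illustration, here is a minimal sketch (assuming scikit-learn is available, with a synthetic 95:5 dataset) of how a K-NN majority vote tends to drift toward the majority class:

```python
# Sketch: on a heavily imbalanced (roughly 95:5) synthetic dataset, the k-NN
# majority vote tends to favour the majority class. Assumes scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
pred = knn.predict(X_te)
print("actual positives   :", (y_te == 1).sum())
print("predicted positives:", (pred == 1).sum())  # typically fewer than the actual count
```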

Define Multi-class classification?

The MNIST dataset has 10 classes/labels, so MNIST is a multi-class dataset.
In a c-class classifier, a query point Xq may belong to any of the c classes. With 7-NN, Xq is assigned to the class that the majority of its 7 nearest neighbors belong to.
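A minimal sketch of multi-class 7-NN, assuming scikit-learn and using its small `digits` dataset (10 classes) as a stand-in for MNIST:

```python
# Sketch: 7-NN on a 10-class dataset; each query point is assigned the label
# held by the majority of its 7 nearest neighbours. Assumes scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)          # 10 classes, a small MNIST-like dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=7).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
print("predicted class of first test point:", clf.predict(X_te[:1])[0])
```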

Explain Impact of Outliers?

In KNN, when K=1, an outlier can easily impact our model, because a single outlier changes the decision surface. In comparison, K=5 is less prone to such errors than K=1. So, if we get the same accuracy on Dtest for both K=5 and K=1, we should prefer K=5.
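A toy sketch of this effect, assuming scikit-learn and a made-up 1-D dataset with one mislabelled point:

```python
# Sketch: with a single mislabelled outlier in the training data, K=1 memorises
# it while K=5 votes it away. Toy 1-D data; assumes scikit-learn and numpy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0], [1.1], [1.2], [5.0], [5.1], [5.2], [1.15]])
y_train = np.array([0,     0,     0,     1,     1,     1,     1])  # last point is an outlier

X_test = np.array([[1.14]])          # query very close to the outlier
for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k} predicts", clf.predict(X_test)[0])
# K=1 follows the outlier (predicts 1); K=5 follows the local majority (predicts 0).
```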

What is Local Outlier Factor?

The objective of the Local Outlier Factor is to detect outliers in data. It is inspired by KNN. The simple idea behind it: for every point (including any potential outlier), find the mean distance to its k nearest neighbors, then sort all the mean distances. If any mean distance is exceptionally high, the corresponding point is very likely an outlier.
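A minimal sketch of this mean-distance idea, assuming numpy and scikit-learn, with a synthetic cluster plus one far-away point:

```python
# Sketch of the simple "mean k-NN distance" heuristic: points with an
# exceptionally high mean distance to their k nearest neighbours are flagged.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),    # dense cluster
               [[8.0, 8.0]]])                      # one far-away point

k = 5
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
mean_knn_dist = dists[:, 1:].mean(axis=1)          # drop the distance to self
print("most suspicious point:", np.argmax(mean_knn_dist))   # index 50, the outlier
```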

What are k-distance(A) and N(A)?

k-distance(A) is the distance from a point A to its k-th nearest neighbor. N(A) denotes the neighborhood of A: the set of all points that belong to the k nearest neighbors of A.

Define reachability-distance(A, B)?

Mathematically, it is defined as:
reachability-distance(A, B) = max(k-distance(B), dist(A,B))
dist(A,B) is the actual distance between A and B
Note that, if A is in the neighborhood of B then:
reachability-distance(A, B) = k-distance(B)
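A tiny worked sketch of these definitions, in plain Python with made-up 1-D points and k = 2:

```python
# Worked example of k-distance, N(.) and reachability-distance with k = 2 on
# five 1-D points; the point names and values are purely illustrative.
pts = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 3.0, "E": 10.0}
k = 2

def knn(name):
    """Return the k nearest neighbours of a point (excluding itself)."""
    others = sorted((abs(pts[name] - v), n) for n, v in pts.items() if n != name)
    return others[:k]

def k_distance(name):
    return knn(name)[-1][0]            # distance to the k-th nearest neighbour

def reach_dist(a, b):
    return max(k_distance(b), abs(pts[a] - pts[b]))

print("k-distance(B):", k_distance("B"))                      # neighbours A and C -> 1.0
print("N(B):", [n for _, n in knn("B")])                      # ['A', 'C']
print("reachability-distance(A, B):", reach_dist("A", "B"))   # max(1.0, 1.0) = 1.0
print("reachability-distance(E, B):", reach_dist("E", "B"))   # max(1.0, 9.0) = 9.0
```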

What is Local-reachability-density(A) or LRD(A)?

Local-reachability-density(A), or LRD(A), is the inverse of the average reachability-distance of A from its neighbors.
Mathematically:
$$LRD(A) = \frac{1}{\left(\sum_{B\in N(A)}^{}{\frac{\text{reachability-distance}(A,B)}{\|N(A)\|}}\right)}$$

Define LOF(A)

The Local Outlier Factor of A, or LOF(A), is a quantity that is large when LRD(A) is small but the LRD of A's neighboring points is large. In other words, LOF(A) is large when the density of points around A is small but the density of points around A's neighbors is large.

When LOF(A) is large, we can conclude that A is an outlier.
Mathematically:
$$LOF(A) = \frac{\sum_{B\in N(A)}^{}LRD(B)}{\|N(A)\|} \times \frac{1}{LRD(A)}$$
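This chain of definitions is implemented by scikit-learn's `LocalOutlierFactor`; a minimal sketch, assuming scikit-learn and a synthetic cluster with one obvious outlier:

```python
# Sketch: a point far from a dense cluster gets an LOF value well above 1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense cluster
               [[10.0, 10.0]]])                    # obvious outlier

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)                                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_             # convert back to LOF values
print("LOF of the far-away point:", round(scores[-1], 2))   # much larger than ~1
print("LOF of a cluster point   :", round(scores[0], 2))    # close to 1
```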

Impact of Scale & Column standardization?

If the scales of the feature columns are different, this distorts the Euclidean distance measure, which is central to algorithms like KNN. Thus column standardization is done to bring all features onto the same scale: each column is shifted to mean 0 and scaled to unit standard deviation.
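A minimal sketch of column standardization, assuming scikit-learn and made-up age/salary columns:

```python
# Sketch: standardising columns so that a feature measured in large units
# (salary) does not dominate the Euclidean distance over a small-unit one (age).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40_000.0],      # columns: age (years), salary (currency)
              [30, 42_000.0],
              [27, 90_000.0]])

X_std = StandardScaler().fit_transform(X)   # each column -> mean 0, std 1
print(X_std.round(2))
# Before scaling, distances are driven almost entirely by the salary column;
# after scaling, both features contribute comparably.
```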

What is Interpretability?

Suppose an ML model outputs whether a patient will survive cancer or not. A professional doctor cannot trust this model blindly, nor is the doctor trained to understand the internals of an ML model.

So, in addition to the YES/NO output, the ML model should also give a reasonable justification for why that particular output occurred. The reasoning helps the doctor analyze the result. Such models are called interpretable models. KNN is interpretable if K is small: the K nearest neighbors themselves serve as the justification.
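A minimal sketch of this kind of justification, assuming scikit-learn and using its breast-cancer dataset purely for illustration: the labels of the K nearest neighbors are returned alongside the prediction.

```python
# Sketch: with a small K, the neighbours themselves act as the explanation,
# e.g. "7 similar past patients, most of whom had this outcome".
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=7).fit(X, y)

query = X[:1]                                   # an example query point
pred = knn.predict(query)[0]
_, idx = knn.kneighbors(query)                  # indices of the 7 nearest neighbours
print("prediction:", pred)
print("labels of the 7 supporting neighbours:", y[idx[0]])
```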

How to handle categorical and numerical features?

Suppose we have a categorical feature called Hair Color. Each hair color needs to be converted into a number so that a machine learning model can compute over it.
One-hot encoding: creates a binary vector whose size equals the number of distinct values. If the number of distinct values for a categorical feature is large, then one-hot encoding can create large, sparse vectors.
Ordinal features
For ordinal features, you can assign a number (respecting the logical ordering) to each category, and that works fine. For example, 'very-good' could be assigned the value 5 and 'very-bad' the value 1.
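A minimal sketch of both encodings, assuming pandas; the column names and category values are illustrative:

```python
# Sketch: one-hot encoding a nominal feature and mapping an ordinal feature to
# ordered numbers.
import pandas as pd

df = pd.DataFrame({
    "hair_color": ["black", "brown", "red", "black"],
    "rating":     ["very-bad", "good", "very-good", "good"],
})

# One-hot: one binary column per distinct hair colour.
one_hot = pd.get_dummies(df["hair_color"], prefix="hair")

# Ordinal: preserve the logical ordering with a manual mapping.
order = {"very-bad": 1, "bad": 2, "average": 3, "good": 4, "very-good": 5}
df["rating_num"] = df["rating"].map(order)

print(pd.concat([one_hot, df["rating_num"]], axis=1))
```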

How to handle missing values by imputation?

There are various ways to handle missing values, and which method to apply depends on the domain. Some of the ways are:

1. Take all the non-missing values and put their mean/median/mode in the missing-value positions.
2. Imputation based on class label. Suppose we have the class label and a known feature f1 for a data point. We can estimate the unknown feature f2 from the class label and f1.
3. Create a missing-value indicator feature. Missing values are sometimes a source of information. Impute the missing values using technique 1 or 2, then create an additional binary feature in which missing values are represented as 1 and non-missing values as 0.
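A minimal sketch of techniques 1 and 3 combined, assuming scikit-learn's `SimpleImputer` with its missing-indicator option:

```python
# Sketch: mean imputation plus an extra binary "was missing" column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0,  10.0],
              [2.0, np.nan],
              [3.0,  30.0]])

imp = SimpleImputer(strategy="mean", add_indicator=True)
X_imp = imp.fit_transform(X)
print(X_imp)
# The NaN in column 1 is replaced by that column's mean (20.0); the appended
# indicator column is 1 for the row where the value was missing.
```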

What is Bias-Variance tradeoff?

In ML theory, we calculate a generalization error: the error on future unseen data. It is calculated as the sum of bias² + variance + irreducible error.

Bias error occurs due to simplifying assumptions made by a model. High bias implies underfitting.

Variance measures how much a model changes as the training data changes. If small changes in the training dataset cause the decision surface, and hence the model, to change a lot, the variance is high, which results in overfitting.

In KNN, as K increases, variance reduces (and bias grows). Our target should be to reduce the generalization error, which means reducing both bias and variance: we reduce bias by not underfitting and reduce variance by not overfitting. There is always a trade-off between underfitting and overfitting, and hence the bias-variance tradeoff.
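A rough sketch of this tradeoff for KNN, assuming scikit-learn and using its breast-cancer dataset purely for illustration: training accuracy versus cross-validated accuracy as K grows.

```python
# Sketch: small K -> low bias / high variance (train accuracy >> CV accuracy);
# very large K -> high bias (both drop). A good K sits somewhere in between.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_acc = cross_val_score(knn, X, y, cv=5).mean()
    train_acc = knn.fit(X, y).score(X, y)
    print(f"K={k:<3} train={train_acc:.3f}  cv={cv_acc:.3f}")
```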
