Jekyll2020-02-24T20:15:07+00:00/feed.xmlRishiraj AdhikaryRishiraj is a Ph.D. student at the Sustainability Lab, Computer Science Engineering, IIT Gandhinagar, Gujarat. He is working under Prof. Nipun BatraHow to download and process climatic data from climate data store (CDS)?2020-01-01T16:45:00+00:002020-01-01T16:45:00+00:00/2020/01/01/download-ERA5-CDS-data<p>In this notebook, I process temperature data for the Gandhinagar region of Gujarat. The Copernicus website gives a straightforward choice: either download the data from the website itself, or use its API. Any dataset can be obtained from the <a href="https://cds.climate.copernicus.eu/#!/home">Climate Data Store</a>. Both routes work and take approximately the same time. The requirements below apply when using the API. These datasets are also known as the ERA5 dataset.</p>
<p>Requirements (Windows with an Anaconda virtual environment)</p>
<ul>
<li>Create an account <a href="https://cds.climate.copernicus.eu">here</a></li>
<li>Install cdsapi via conda forge</li>
<li>Install netCDF4 via conda forge</li>
<li>Go to the directory <code>C:\Users\Username</code> and create a file named <i>.cdsapirc</i></li>
<li>Copy the URL and key <a href="https://cds.climate.copernicus.eu/api-how-to">from here</a> and paste them into the file created above:</li>
</ul>
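<p>The <i>.cdsapirc</i> file is a two-line plain-text configuration: a URL and a key. The values below are placeholders; copy your actual URL and UID:API-KEY pair from the page linked above, since the endpoint may change over time.</p>

```
url: https://cds.climate.copernicus.eu/api/v2
key: UID:API-KEY
```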
<p>Open the file containing the data. I downloaded the file from the ERA5 website in NetCDF format, choosing <u>2m temperature</u> for June 2019 to November 2019. Use xarray and pandas to open the data, convert it to a pandas dataframe, and save it as a CSV.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">xarray</span> <span class="k">as</span> <span class="n">xr</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span> <span class="o">=</span> <span class="n">xr</span><span class="o">.</span><span class="n">open_dataset</span><span class="p">(</span><span class="s">'D:/temperature.nc'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><xarray.Dataset>
Dimensions: (latitude: 1801, longitude: 3600, time: 2929)
Coordinates:
* longitude (longitude) float32 0.0 0.1 0.2 0.3 ... 359.6 359.7 359.8 359.9
* latitude (latitude) float32 90.0 89.9 89.8 89.7 ... -89.8 -89.9 -90.0
* time (time) datetime64[ns] 2019-06-01 ... 2019-10-01
Data variables:
t2m (time, latitude, longitude) float32 ...
Attributes:
Conventions: CF-1.6
history: 2019-12-30 07:17:58 GMT by grib_to_netcdf-2.15.0: /opt/ecmw...
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span><span class="o">.</span><span class="n">t2m</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.collections.QuadMesh at 0x19fce396d68>
</code></pre></div></div>
<p><img src="/images/output_5_1.png" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ds</span><span class="o">.</span><span class="n">t2m</span><span class="p">[</span><span class="mi">1000</span><span class="p">]</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><matplotlib.collections.QuadMesh at 0x19fd03fc630>
</code></pre></div></div>
<p><img src="/images/output_6_1.png" /></p>
<p>The index into t2m is its first dimension, which is time. Below, the latitude and longitude nearest to Gandhinagar, Gujarat are selected, which filters the huge xarray dataset down to that single coordinate.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">newds</span><span class="o">=</span><span class="n">ds</span><span class="o">.</span><span class="n">sel</span><span class="p">(</span><span class="n">longitude</span><span class="o">=</span><span class="mf">72.63</span><span class="p">,</span> <span class="n">latitude</span><span class="o">=</span><span class="mf">23.21</span><span class="p">,</span> <span class="n">method</span><span class="o">=</span><span class="s">'nearest'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Convert to pandas dataframe and save it
</span><span class="n">df</span> <span class="o">=</span> <span class="n">newds</span><span class="o">.</span><span class="n">to_dataframe</span><span class="p">()</span>
<span class="c1">#convert kelvin to celcius
</span><span class="n">df</span><span class="p">[</span><span class="s">'t2m'</span><span class="p">]</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'t2m'</span><span class="p">]</span><span class="o">-</span><span class="mf">273.15</span>
<span class="c1">#save the dataframe for future use
</span><span class="n">x</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'temperature-gandhinagar.csv'</span><span class="p">)</span>
</code></pre></div></div>
<p>Below is the code to retrieve data from the Copernicus website via the API. This code is automatically generated by the same website.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cdsapi</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">cdsapi</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">c</span><span class="o">.</span><span class="n">retrieve</span><span class="p">(</span>
<span class="s">'reanalysis-era5-single-levels'</span><span class="p">,</span>
<span class="p">{</span>
<span class="s">'product_type'</span><span class="p">:</span> <span class="s">'reanalysis'</span><span class="p">,</span>
<span class="s">'variable'</span><span class="p">:</span> <span class="p">[</span>
<span class="s">'2m_dewpoint_temperature'</span><span class="p">,</span> <span class="s">'2m_temperature'</span><span class="p">,</span> <span class="s">'total_precipitation'</span><span class="p">,</span>
<span class="p">],</span>
<span class="s">'year'</span><span class="p">:</span> <span class="p">[</span>
<span class="s">'2018'</span><span class="p">,</span> <span class="s">'2019'</span><span class="p">,</span>
<span class="p">],</span>
<span class="s">'month'</span><span class="p">:</span> <span class="p">[</span>
<span class="s">'01'</span><span class="p">,</span> <span class="s">'02'</span><span class="p">,</span> <span class="s">'03'</span><span class="p">,</span>
<span class="s">'04'</span><span class="p">,</span> <span class="s">'05'</span><span class="p">,</span> <span class="s">'06'</span><span class="p">,</span>
<span class="s">'07'</span><span class="p">,</span> <span class="s">'08'</span><span class="p">,</span> <span class="s">'09'</span><span class="p">,</span>
<span class="s">'10'</span><span class="p">,</span> <span class="s">'11'</span><span class="p">,</span> <span class="s">'12'</span><span class="p">,</span>
<span class="p">],</span>
<span class="s">'day'</span><span class="p">:</span> <span class="p">[</span>
<span class="s">'01'</span><span class="p">,</span> <span class="s">'02'</span><span class="p">,</span> <span class="s">'03'</span><span class="p">,</span>
<span class="s">'04'</span><span class="p">,</span> <span class="s">'05'</span><span class="p">,</span> <span class="s">'06'</span><span class="p">,</span>
<span class="s">'07'</span><span class="p">,</span> <span class="s">'08'</span><span class="p">,</span> <span class="s">'09'</span><span class="p">,</span>
<span class="s">'10'</span><span class="p">,</span> <span class="s">'11'</span><span class="p">,</span> <span class="s">'12'</span><span class="p">,</span>
<span class="s">'13'</span><span class="p">,</span> <span class="s">'14'</span><span class="p">,</span> <span class="s">'15'</span><span class="p">,</span>
<span class="s">'16'</span><span class="p">,</span> <span class="s">'17'</span><span class="p">,</span> <span class="s">'18'</span><span class="p">,</span>
<span class="s">'19'</span><span class="p">,</span> <span class="s">'20'</span><span class="p">,</span> <span class="s">'21'</span><span class="p">,</span>
<span class="s">'22'</span><span class="p">,</span> <span class="s">'23'</span><span class="p">,</span> <span class="s">'24'</span><span class="p">,</span>
<span class="s">'25'</span><span class="p">,</span> <span class="s">'26'</span><span class="p">,</span> <span class="s">'27'</span><span class="p">,</span>
<span class="s">'28'</span><span class="p">,</span> <span class="s">'29'</span><span class="p">,</span> <span class="s">'30'</span><span class="p">,</span>
<span class="s">'31'</span><span class="p">,</span>
<span class="p">],</span>
<span class="s">'time'</span><span class="p">:</span> <span class="p">[</span>
<span class="s">'00:00'</span><span class="p">,</span> <span class="s">'01:00'</span><span class="p">,</span> <span class="s">'02:00'</span><span class="p">,</span>
<span class="s">'03:00'</span><span class="p">,</span> <span class="s">'04:00'</span><span class="p">,</span> <span class="s">'05:00'</span><span class="p">,</span>
<span class="s">'06:00'</span><span class="p">,</span> <span class="s">'07:00'</span><span class="p">,</span> <span class="s">'08:00'</span><span class="p">,</span>
<span class="s">'09:00'</span><span class="p">,</span> <span class="s">'10:00'</span><span class="p">,</span> <span class="s">'11:00'</span><span class="p">,</span>
<span class="s">'12:00'</span><span class="p">,</span> <span class="s">'13:00'</span><span class="p">,</span> <span class="s">'14:00'</span><span class="p">,</span>
<span class="s">'15:00'</span><span class="p">,</span> <span class="s">'16:00'</span><span class="p">,</span> <span class="s">'17:00'</span><span class="p">,</span>
<span class="s">'18:00'</span><span class="p">,</span> <span class="s">'19:00'</span><span class="p">,</span> <span class="s">'20:00'</span><span class="p">,</span>
<span class="s">'21:00'</span><span class="p">,</span> <span class="s">'22:00'</span><span class="p">,</span> <span class="s">'23:00'</span><span class="p">,</span>
<span class="p">],</span>
<span class="s">'format'</span><span class="p">:</span> <span class="s">'netcdf'</span><span class="p">,</span>
<span class="p">},</span>
<span class="s">'temperature.nc'</span><span class="p">)</span>
</code></pre></div></div>Why did I decide to do a Ph.D. and not continue with a job?2019-05-23T11:47:39+00:002019-05-23T11:47:39+00:00/2019/05/23/why-phd-and-not-a-job<style>
p{text-align: justify;}
li{text-align: justify;}
</style>
<p><strong>TLDR</strong>:
I love the fusion of <strong>Electronics+Computer Science</strong>. Also, just like Chris Olah says, <em>“I want to understand things clearly”</em></p>
<p>I was recently shortlisted for Ph.D. admission at IIT Gandhinagar after clearing the written test and interview. Before this, I was working as a JRF in the same institute under <a href="https://nipunbatra.github.io/" target="_blank">Prof. Nipun Batra</a> since March 2019. From August 2018 to January 2019, I was a Research and Teaching Assistant at IIIT Sricity. Meeting and working under <a href="http://vvtesh.co.in" target="_blank">Prof. Venkatesh Vinayakarao</a> was the most fabulous thing that happened there.</p>
<p>After my BTech in ECE at Gauhati University, I primarily worked for about 3.5 years in multiple organizations in the web design and development domain (<em>I used to eat JavaScript</em>). I was even into freelancing. After 3.5 years, I was done with those jobs. I loved the 9-5 (or 9 to 6.30) culture but eventually realized that I did not want to continue that way.</p>
<p>A government job was another option worth trying. But state government jobs are something I feel uneasy about, and the central government seems too good for a person like me. ;)</p>
<p>So, a Ph.D. seemed to be what I wanted to do. I enjoy the essence of understanding complicated-looking concepts; that mindful satisfaction is hard to explain. I think the following points are in favor of doing a Ph.D.</p>
<ol>
<li>
<p>You never retire in the job you do after doing a Ph.D. Why?</p>
</li>
<li>
<p>You do a Ph.D. to specialize in a domain where work is fun, and that is what you aspire to do in your entire life.</p>
</li>
<li>
<p>Ph.D. is an excellent way to <em>satisfy the innate hunger</em> of understanding things clearly and explain those <em>things</em> in the most straightforward way to someone else.</p>
</li>
</ol>
<p>So, that is it. I hope my journey will be fulfilling. I know it will be full of challenges and pressure, but anything worth having is difficult to have, and it is climbing that mountain that is important.</p>Python Programming Bootcamp at ADTU, Guwahati2019-02-02T11:47:39+00:002019-02-02T11:47:39+00:00/events/2019/02/02/python-programming-bootcamp-adtu-guwahati
<p style="text-align:justify">A Python programming bootcamp was conducted at Assam Downtown University on 31st January and 1st February 2019. The topics ranged from basic Python syntax to data structures. In the concluding session, various Python projects based on Information Retrieval and Data Science were showcased.</p>
<ul class="wp-block-gallery columns-3 is-cropped">
<li class=""><figure><img src="/images/IMG-20190202-WA0010.jpg" alt="" data-id="" /></figure></li>
<li class=""><figure><img src="/images/IMG-20190202-WA0009.jpg" /></figure></li>
<li class=""><figure><img src="/images/IMG-20190202-WA0008.jpg" alt="" data-id="" /></figure></li>
<li class=""><figure><img src="/images/IMG-20190202-WA0007.jpg" alt="" data-id="" class="" /></figure></li>
<li class=""><figure><img src="/images/IMG-20190202-WA0001.jpg" alt="" data-id="" class="" /></figure></li>
</ul>QnA - Performance Measure of Models2018-06-28T11:47:39+00:002018-06-28T11:47:39+00:00/2018/06/28/qna-performence-measure-models<p>Collection of questions and answers on performance measure of models<!--more--></p>
<p><strong>Which is more important to you: model accuracy or model performance?</strong></p>
<blockquote>
<p>Let's answer this with respect to classification problems. Model performance is more important. Model accuracy cannot be relied on with an imbalanced dataset (where one class far outnumbers the other). Accuracy also assigns equal weight to every label, which is a disadvantage on imbalanced datasets.<br /><br />Classification model performance can be evaluated with metrics such as log-loss, accuracy, AUC (Area Under Curve), and precision/recall (commonly used by search engines).</p>
</blockquote>
<p> </p>
<p><strong>Can you cite some examples where a false positive is more important than a false negative?</strong></p>
<blockquote>
<p>Consider a model where 1 (positive) means that a mail is spam and 0 (negative) means that it is not. If false positives are high, important mails will land in the Spam folder, and it may become difficult to retrieve them from the huge pile of mails there. A false negative merely means that a spam mail lands in the Primary mailbox.</p>
<p>It is easy to mark a mail in the Primary mailbox as spam. But, as mentioned earlier, it is very difficult to retrieve a mail from the Spam folder. Hence, in cases like this, a false positive is more important than a false negative.</p>
</blockquote>
<p> </p>
<p><strong>Can you cite some examples where a false negative is more important than a false positive?</strong></p>
<blockquote>
<p>In cancer diagnosis, let 1 (positive) denote positive for cancer and 0 (negative) denote negative. A false negative means a patient who has cancer is diagnosed as negative. This is very dangerous: the patient will not be subjected to follow-up investigation.<br /><br />On the other hand, a false positive is not as dangerous as a false negative. Even if the patient does not have cancer, the model will show positive and the patient will simply be subjected to further follow-up investigation.</p>
</blockquote>
<p> </p>
<p><strong>Can you cite some examples where both false positive and false negatives are equally important?</strong></p>
<blockquote>
<p>Consider posting articles on a blog. If an article is read by more than the average number of readers of my blog, it is positive; else, negative.<br />A false positive would mean that the model says more readers than average have read this article, when in truth fewer than average have. Here the false positive gives me a wrong motivation, but that same motivation ensures that I keep writing, and writing keeps me in practice.<br />A false negative would mean that the model says the article did no better than the others, when in truth it garnered more readers than my blog's average readership. Here the false negative makes me introspect on the quality of my writing and ultimately helps me improve.</p>
</blockquote>
<p> </p>
<p><strong>What is the most frequent metric to assess model accuracy for classification problems?</strong></p>
<blockquote>
<p>The answer to this question is very domain-specific. For an overall idea, we can say that a confusion matrix is better than simple accuracy because it provides more output parameters. The ROC curve can prove even more helpful because it integrates over the whole range of precision/recall trade-offs. Log-loss is another metric, and it is the only one that considers the probabilistic score directly.</p>
</blockquote>
<p> </p>
<p><strong>Why is Area Under ROC Curve (AUROC) better than raw accuracy as an out-of-sample evaluation metric?</strong></p>
<blockquote>
<p>A ROC curve plots the true positive rate (sensitivity) vs. the false positive rate (1 − specificity) for a binary classifier as its discrimination threshold is varied. An AUROC has many interpretations compared to raw accuracy; see this <a href="https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it" target="_blank" rel="noopener">beautiful explanation</a> of the confusion matrix.<br /><br />The area equals the probability that a randomly chosen positive example ranks above (is deemed to have a higher probability of being positive than) a randomly chosen negative example.</p>
</blockquote>
<p> </p>
<p><strong>What is Accuracy?</strong></p>
<blockquote>
<p>Accuracy can be defined as:<br /><code>(Number of correctly classified points)/(Total number of points)</code></p>
<p>1) Imbalanced data: a dumb model can get a very high accuracy, so never use accuracy as a measure on an imbalanced dataset.<br />2) Accuracy cannot use a probabilistic score.</p>
</blockquote>
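<p>The definition above can be sketched in a few lines of plain Python (the function name and toy labels are illustrative, not from any library). It also demonstrates the imbalanced-data pitfall: a dumb majority-class model scores 95% accuracy while detecting no positives at all.</p>

```python
def accuracy(y_true, y_pred):
    # Number of correctly classified points / total number of points
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Imbalanced data: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_dumb = [0] * 100          # a dumb model that always predicts negative
print(accuracy(y_true, y_dumb))  # 0.95, despite catching zero positives
```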
<p> </p>
<p><strong>Explain about Confusion matrix, TPR, FPR, FNR, TNR?</strong></p>
<blockquote>
<p>A confusion matrix is a square matrix of predicted vs. actual class labels; its dimension equals the number of class labels. The confusion matrix does not consider probabilistic scores.</p>
<p>A good model will have high TNR and TPR, i.e. the elements on the principal diagonal of the matrix will be high.<br />Important parameters related to the confusion matrix:<br />TPR: True Positive Rate<br />FPR: False Positive Rate<br />FNR: False Negative Rate<br />TNR: True Negative Rate<br />TP: Number of true positive points<br />FP: Number of false positive points<br />TN: Number of true negative points<br />FN: Number of false negative points<br />P: Total actual positive points<br />N: Total actual negative points<br />TPR = TP/P; TNR = TN/N; FPR = FP/N; FNR = FN/P</p>
<figure style="text-align:center"><a href="#"><img src="/images/revolution-analytics-300x279.png" alt="Confusion matrix" width="300" height="279" /></a><figcaption>Source: blog.revolutionanalytics.com</figcaption></figure>
<p>Therefore, with TPR, TNR, FPR, and FNR we get better insight into the data than with accuracy alone. It is up to the domain to decide which among TPR, TNR, FPR, and FNR is more important.</p>
</blockquote>
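<p>The counts and rates listed above can be computed directly from binary label lists; a minimal sketch (function and variable names are my own, not from a specific library):</p>

```python
def confusion_rates(y_true, y_pred):
    # Count the four cells of the binary confusion matrix
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    P, N = tp + fn, tn + fp  # total actual positives / negatives
    return {"TPR": tp / P, "TNR": tn / N, "FPR": fp / N, "FNR": fn / P}

rates = confusion_rates([1, 1, 1, 1, 0, 0, 0, 0],
                        [1, 1, 1, 0, 0, 0, 1, 0])
print(rates)  # {'TPR': 0.75, 'TNR': 0.75, 'FPR': 0.25, 'FNR': 0.25}
```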
<p> </p>
<p><strong>What do you understand about Precision &amp; Recall, and F1-score?</strong></p>
<blockquote>
<p>Precision and recall are often used in information retrieval problems. They are defined relative to the positive class/label of a dataset. <strong>Precision</strong> is <code>TP/(TP+FP)</code>: of all the points predicted to be positive, what fraction are actually positive.</p>
<p>Recall is nothing but the True Positive Rate (TPR): of all the actual positive labels, how many are correctly predicted to be positive.</p>
<p>We want precision to be high, meaning fewer points are wrongly predicted to be positive. We also want recall to be high, meaning that, of all the actual positive points, more are rightly detected as positive.</p>
<p>Precision (Pr) and Recall (R) are combined in the F1-score.<br />$$F1 = 2\cdot\frac{Pr \cdot R}{Pr+R}$$</p>
</blockquote>
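<p>Putting the three definitions together in plain Python (a sketch; names are illustrative). On the toy labels below, precision, recall, and F1 all come out to 2/3, since TP=2, FP=1, FN=1.</p>

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    pr = tp / (tp + fp)         # of predicted positives, fraction correct
    r = tp / (tp + fn)          # of actual positives, fraction found
    f1 = 2 * pr * r / (pr + r)  # harmonic mean of precision and recall
    return pr, r, f1

print(precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```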
<p> </p>
<p><strong>What is the ROC Curve and what is AUC (a.k.a. AUROC)</strong></p>
<blockquote>
<p>The Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are <strong>binary classification</strong> metrics. The ROC curve is a plot of TPR against FPR. An AUC score integrates over the whole range of precision/recall trade-offs, while the F1-score takes one specific precision/recall pair, which can be viewed as a sample or an average. The area under the ROC curve lies between 0 and 1: 1 signifies a very good model, 0 a terrible one.</p>
<p>1. If we have imbalanced data, AUC can be high even for a dumb model.<br />2. AUC does not care about the actual scores assigned to data points, only their ranking.<br />3. The AUC of a random model is 0.5.</p>
</blockquote>
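<p>The probabilistic interpretation of AUC (a random positive ranking above a random negative) gives a direct way to compute it without plotting anything. A hedged sketch in plain Python, O(n²) for readability, counting ties as half:</p>

```python
import random

def auc(y_true, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # with ties counted as 1/2
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# A model that assigns random scores hovers around 0.5
random.seed(0)
y = [0, 1] * 50
print(auc(y, [random.random() for _ in y]))
```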
<p> </p>
<p><strong>What is log-loss, and how does it help measure performance?</strong></p>
<blockquote>
<p>Given a test set, log-loss is defined as:<br />$$-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(P_i)+(1-y_i)\log(1-P_i)\right]$$<br /><code>y<sub>i</sub></code> is the label of the i-th point and <code>P<sub>i</sub></code> is its probabilistic score for the positive class.</p>
<p>Log-loss is small when P<sub>i</sub> is large for a positive class/label, and small when P<sub>i</sub> is small for a negative class/label. The log-loss value lies between 0 and infinity, with 0 being the best case. Log-loss takes the actual probabilistic values into consideration.</p>
<p>Log-loss is the average of the negative log of the probability of the correct class label. Log-loss can be extended to multi-class labels.</p>
</blockquote>
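<p>The formula translates directly into Python; a small sketch (the function name is mine) showing that confident, correct probabilities give a small loss while confident mistakes blow the loss up:</p>

```python
from math import log

def log_loss(y_true, probs):
    # Average negative log-probability of the correct class label
    n = len(y_true)
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for y, p in zip(y_true, probs)) / n

print(log_loss([1, 0], [0.9, 0.1]))  # ~0.105: confident and correct
print(log_loss([1, 0], [0.1, 0.9]))  # ~2.303: confident and wrong
```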
<p><strong>Explain R-squared / the coefficient of determination</strong></p>
<blockquote>
<p>The coefficient of determination is a performance measure for models whose predicted values can be any real number (regression). Let the actual value be <code>y<sub>i</sub></code> and the predicted value be <code>y'<sub>i</sub></code>; then the <strong>error</strong> is <code>e<sub>i</sub> = y<sub>i</sub> - y'<sub>i</sub></code>.</p>
<p>Now we define the <strong>Total Sum of Squares</strong>:<br />$$SS_{total} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$<br />where<br />$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$$ is the average of the actual <code>y<sub>i</sub></code> in the test data.</p>
<p>In the simplest regression model, given a query point, we can return the mean of all the other outputs as its prediction. For example, to predict the height of one person among 10, we can take the mean height of the other 9 persons and assign it to the person under consideration.</p>
<p>The total sum of squares is the sum of squared errors of this simple mean model. Now we define the <strong>Residual Sum of Squares</strong>:<br />$$SS_{res} = \sum_{i=1}^{n}(y_i - y'_i)^2$$<br />where <code>y'<sub>i</sub></code> is the predicted value.</p>
<p><code>SS<sub>total</sub></code> is for the simple mean model, whereas <code>SS<sub>res</sub></code> is for the model under consideration. Now we can define <code>R<sup>2</sup></code> as:<br />$$R^2 = 1-\frac{SS_{res}}{SS_{total}}$$</p>
<p> </p>
<p><strong>Case 1:</strong> <code>SS<sub>res</sub> = 0</code>. This happens when the predicted values are exactly the same as the actual values, i.e. every error <code>e<sub>i</sub> = 0</code>. In this case <code>R<sup>2</sup> = 1</code>, which means that our model is phenomenal.</p>
<p> </p>
<p><strong>Case 2:</strong> When <code>SS<sub>res</sub> < SS<sub>total</sub></code>. In this case, <code>R<sup>2</sup></code> will be between 0 and 1.</p>
<p> </p>
<p><strong>Case 3:</strong> <code>SS<sub>res</sub> = SS<sub>total</sub></code>. Then <code>R<sup>2</sup></code> is 0, which means our model is no better than the simple mean model.</p>
<p> </p>
<p><strong>Case 4:</strong> <code>SS<sub>res</sub> > SS<sub>total</sub></code>. Then <code>R<sup>2</sup></code> becomes negative, which means our model is worse than the simple mean model.</p>
</blockquote>
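<p>The cases above can be checked with a short implementation (a sketch; names are illustrative): perfect predictions give 1, predicting the mean gives 0, and a worse-than-mean model goes negative.</p>

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)             # mean model
    ss_res = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))  # our model
    return 1 - ss_res / ss_total

print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0: perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0: same as the mean model
print(r_squared([1, 2, 3], [3, 3, 3]))  # -1.5: worse than the mean model
```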
<p> </p>
<p><strong>Explain Median Absolute Deviation (MAD). Why is MAD important?</strong></p>
<blockquote>
<p>The errors <code>e<sub>i</sub></code> and <code>SS</code> can suffer from outlier points: if one point is very large, the entire <code>R<sup>2</sup></code> can go for a toss. <code>R<sup>2</sup></code> is not very robust to outliers.</p>
<p>Now, the error <code>e<sub>i</sub></code> is a random variable. We can take the median of the <code>e<sub>i</sub></code>, i.e. <code>median(e<sub>i</sub>) = central value of the errors</code>.<br />The Median Absolute Deviation is <code>MAD(e<sub>i</sub>) = median(|e<sub>i</sub> - median(e<sub>i</sub>)|)</code>.<br />The median is a robust measure of central tendency, and MAD is a robust measure of spread (analogous to the standard deviation).</p>
</blockquote>
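<p>With Python's standard library, MAD is just two medians. The toy errors below show the robustness claim: replacing one error with a huge outlier leaves the MAD unchanged, while it would wreck a squared-error measure.</p>

```python
from statistics import median

def mad(errors):
    # Median absolute deviation from the median of the errors
    m = median(errors)
    return median(abs(e - m) for e in errors)

print(mad([1, 2, 3, 4, 5]))    # 1
print(mad([1, 2, 3, 4, 500]))  # 1: the outlier barely matters
```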
<p> </p>
median(ei) = central value of errors.Median Absolute Deviation, MAD(ei) = Median(ei - median(ei))Median is a robust measure of mean, and MAD is a robust measure of standard-deviation. [mathjax]QnA - Classification Algorithms In Various Situations2018-06-27T11:47:39+00:002018-06-27T11:47:39+00:00/2018/06/27/qna-classification-algo-in-various-situations<p>Here are a few questions and answers related to Classification Algorithms In Various Situations<!--more--></p>
<p><strong>What is Imbalanced and balanced dataset</strong></p>
<blockquote>
<p>If a dataset has unequal numbers of positive and negative data points, then the dataset is imbalanced; a balanced dataset has roughly equal positive and negative labels.<br />For example: in a dataset of patients who do or do not have cancer, there will be many more negative data points (patients without cancer) than positive data points (patients with cancer).<br /><br />k-NN results could be biased if the dataset is heavily imbalanced.</p>
</blockquote>
<p> </p>
<p><strong>Define Multi-class classification?</strong></p>
<blockquote>
<p>The MNIST dataset has 10 classes/labels, so MNIST is a multi-class dataset.<br />In a c-class classifier, a query point Xq may belong to any of the c classes. For 7-NN, Xq is assigned to the class to which the majority of its 7 nearest neighbors belong.</p>
</blockquote>
<p> </p>
<p><strong>Explain Impact of Outliers?</strong></p>
<blockquote>
<p>In k-NN, when K=1, an outlier can easily impact our model, because the decision surface changes around the outlier. By comparison, K=5 is less prone to such errors than K=1. So, if we get the same accuracy on Dtest for K=5 and K=1, we should prefer K=5.</p>
</blockquote>
<p> </p>
<p><strong>What is Local Outlier Factor?</strong></p>
<blockquote>
<p>The objective of the local outlier factor is to detect outliers in data. It is inspired by k-NN. For every point (including a potential outlier), find the mean distance to its k nearest neighbors, then sort all these mean distances. If any mean distance is exceptionally high, the corresponding point is very likely an outlier.</p>
</blockquote>
<p> </p>
<p><strong>What is k-distance (A), N(A)</strong></p>
<blockquote>
<p>k-distance of a point A is the distance to the k-th nearest neighbor of A from A. <strong>N(A) denotes neighborhood of A</strong>. It is set of all points that belong to the KNN of A.</p>
</blockquote>
<p> </p>
<p><strong>Define reachability-distance(A, B)?</strong></p>
<blockquote>
<p>Mathematically, it is defined as:<br /><code>reachability-distance(A, B) = max(k-distance(B), dist(A,B))</code><br /><code>dist(A,B)</code> is the actual distance between A and B<br />Note that, if A is in the neighborhood of B then:<br /><code>reachability-distance(A, B) = k-distance(B)</code></p>
</blockquote>
<p> </p>
<p><strong>What is Local-reachability-density(A) or LRD(A)?</strong></p>
<blockquote>
<p>Local-reachability-density(A) is the inverse of the average reachability distance of A from its neighbors.<br />Mathematically:<br />$$LRD(A) = \frac{1}{\sum_{B\in N(A)}^{}{\frac{reachability-distance(A,B)}{\|N(A)\|}}}$$</p>
</blockquote>
<p> </p>
<p><strong>Define LOF(A)</strong></p>
<blockquote>
<p>Local Outlier Factor of A, or LOF(A) is a quantity that is large when LRD(A) is small but LRD of neighborhood points of A is large. In other words, LOF(A) is large when density of points around A is small but density of points around the neighborhood of A is large.</p>
<p>When LOF(A) is large then we can conclude that A is an outlier.<br />Mathematically:<br />$$LOF(A) = \frac{\sum_{B\in N(A)}^{}LRD(B)}{\|N(A)\|} * \frac{1}{LRD(A)}$$</p>
</blockquote>
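<p>The pieces above (k-distance, N(A), reachability-distance, LRD, LOF) fit together in a few lines of NumPy. This is an illustrative toy implementation of the formulas above, not a library API (for real use, scikit-learn ships a <code>LocalOutlierFactor</code> estimator):</p>

```python
import numpy as np

def lof(X, k):
    """Local Outlier Factor for every row of X (toy implementation)."""
    # pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    order = np.argsort(D, axis=1)
    neigh = order[:, 1:k + 1]                    # N(A): k nearest neighbours (column 0 is the point itself)
    kdist = D[np.arange(len(X)), order[:, k]]    # k-distance(A)
    # reachability-distance(A, B) = max(k-distance(B), dist(A, B))
    reach = np.maximum(kdist[neigh], D[np.arange(len(X))[:, None], neigh])
    lrd = 1.0 / reach.mean(axis=1)               # local reachability density
    # LOF(A) = average LRD of A's neighbours divided by LRD(A)
    return lrd[neigh].mean(axis=1) / lrd
```

<p>On a toy dataset such as four points clustered near 0 and one point far away, the isolated point gets a LOF far above 1 while the clustered points score close to 1.</p>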
<p> </p>
<p><strong>Impact of Scale & Column standardization?</strong></p>
<blockquote>
<p>If the scales of the feature columns differ, the Euclidean distance measure is distorted, since the feature with the largest scale dominates. Euclidean distance is central to algorithms like k-NN, so column standardization is done to bring all features onto the same scale.</p>
</blockquote>
<p> </p>
<p><strong>What is Interpretability?</strong></p>
<blockquote>
<p>Suppose an ML model outputs whether a patient will survive cancer or not. A professional doctor cannot trust this model blindly, nor is the doctor trained to understand an ML model.<br /><br />So, in addition to a YES/NO output, the model should also give a reasonable justification for why a particular output occurred. The reasoning helps the doctor analyze the result. Such models are called interpretable models. k-NN is interpretable if K is small.</p>
</blockquote>
<p> </p>
<p><strong>How to handle categorical and numerical features?</strong></p>
<blockquote>
<p>Suppose we have a categorical feature called <strong>Hair Color</strong>. Each hair color value needs to be converted into a number so that a machine learning model can compute over it.<br /><strong>One-hot encoding</strong>: creates a binary vector whose size equals the number of distinct values. If the number of distinct values for a categorical feature is large, one-hot encoding can create large, sparse vectors.<br /><strong>Ordinal features</strong>: here you can assign numbers (in a logical ordering) to each category of the feature and they work fine. For example, 'very-good' could have the numeric value 5, and 'very-bad' the numeric value 1.</p>
</blockquote>
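<p>A short pandas sketch of both encodings, using a hypothetical <code>hair_color</code> (nominal) and <code>quality</code> (ordinal) column:</p>

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["black", "brown", "red", "black"],
                   "quality":    ["very-good", "very-bad", "good", "good"]})

# nominal feature -> one-hot encoding: one binary column per distinct value
onehot = pd.get_dummies(df["hair_color"], prefix="hair")

# ordinal feature -> explicit numeric mapping that preserves the ordering
order = {"very-bad": 1, "bad": 2, "average": 3, "good": 4, "very-good": 5}
df["quality_num"] = df["quality"].map(order)
```

<p>With 3 distinct hair colors this gives a 3-column binary matrix; with thousands of distinct values the same call would produce the large, sparse vectors mentioned above.</p>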
<p> </p>
<p><strong>How to handle missing values by imputation?</strong></p>
<blockquote>
<p>There are various ways to handle missing values, and the domain dictates which method to apply. Some of the ways are:<br /><br />1. Take all the non-missing values and put their <strong>mean/median/mode</strong> in the missing value positions.<br />2. Imputation based on class label. Suppose we have the class label and a known feature f1 for a data point. We can estimate the unknown feature f2 from the class label and f1.<br />3. Create a missing-value feature. Missing values are sometimes themselves a source of information. Impute the missing values using technique 1 or 2, then create an additional binary feature in which missing values are represented as 1 and non-missing values as 0.</p>
</blockquote>
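<p>Techniques 1 and 3 can be sketched in pandas (the <code>age</code> column and its values are made up for illustration):</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 35.0]})

# technique 3: record where the value was missing *before* imputing
df["age_missing"] = df["age"].isna().astype(int)

# technique 1: mean imputation (mean is computed over non-missing values)
df["age"] = df["age"].fillna(df["age"].mean())
```
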
<p> </p>
<p><strong>What is Bias-Variance tradeoff?</strong></p>
<blockquote>
<p>In theory of ML, we calculate a generalization error. It is the error on future unseen data. It is calculated as a sum of bias<sup>2</sup>+variance+irreducible-error.</p>
<p>Bias error occurs due to simplifying assumptions about a model. High bias error implies <strong>underfitting</strong>.</p>
<p>Variance measures how much a model changes as the training data changes. If small changes in the training dataset cause the decision surface, and thus the model, to change a lot, the variance is high, resulting in <strong>overfitting</strong>.</p>
<p>In k-NN, as K increases, variance reduces. Our target is to reduce the generalization error, and this can be done by reducing bias and variance: we reduce bias by preventing underfitting and reduce variance by not overfitting. There is always a trade-off between underfitting and overfitting, and thus the bias-variance tradeoff.</p>
</blockquote>
<p>[mathjax]</p>QnA - K-Nearest Neighbor2018-06-20T11:47:39+00:002018-06-20T11:47:39+00:00/2018/06/20/qna-knn<p>This post is a quick revision guide on k-NN. The answers are neither of advanced level nor of the layman level. Some questions require proper answer and will be updated as I better my understanding<!--more--></p>
<p><strong>What is k-nearest neighbors?</strong></p>
<blockquote>
<p>k-NN is a classification technique. It tells us which class a query point belongs to.</p>
</blockquote>
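<p>A minimal sketch of the idea, assuming Euclidean distance and a plain majority vote (the helper name <code>knn_predict</code> is just for illustration):</p>

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # brute force: one distance per training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
    # majority vote among the k neighbours' labels
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```
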
<p> </p>
<p><strong>When might k-nearest neighbors fail?</strong></p>
<blockquote>
<p>It fails when the training data is mixed, that is, when there is no way to distinguish between the classes. See the figure below: there is no clear way to distinguish between the positive and negative class, hence we cannot say which class a query point belongs to.<br /><a href="#"><img class="" src="/images/Screenshot-from-2018-06-14-20-29-49-300x251.png" alt="" width="300" height="251" /></a><br /><br /><br />k-NN also fails when a query point is equidistant from both the positive and the negative class.</p>
</blockquote>
<p> </p>
<p><strong>Define Distance measures: Euclidean(L2) , Manhattan(L1), Minkowski, Hamming?</strong></p>
<blockquote>
<p>The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance.<br /><a href="https://en.wikipedia.org/wiki/Minkowski_distance" target="_blank" rel="noopener">Find the equation here in wiki</a>:<br />In the equation, p=2 gives the Euclidean distance (the L2 norm) and p=1 gives the Manhattan distance (the L1 norm).<br /><br /><br />Hamming distance is the number of locations/dimensions at which two binary vectors differ.</p>
</blockquote>
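<p>The three measures can be written in a few lines of NumPy (illustrative helpers, not a library API):</p>

```python
import numpy as np

def minkowski(a, b, p):
    # p=2 -> Euclidean (L2), p=1 -> Manhattan (L1)
    return float((np.abs(np.asarray(a) - np.asarray(b)) ** p).sum() ** (1.0 / p))

def hamming(a, b):
    # number of positions where two binary vectors differ
    return int((np.asarray(a) != np.asarray(b)).sum())
```
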
<p> </p>
<p><strong>What is Cosine Distance & Cosine Similarity?</strong></p>
<blockquote>
<p>The cosine of the angle between two vectors is the measure of cosine similarity. The higher the similarity, the smaller the distance. Mathematically, Cosine Distance = 1 - Cosine Similarity.</p>
</blockquote>
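<p>A small NumPy sketch of the relationship:</p>

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between a and b
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    # distance = 1 - similarity
    return 1.0 - cosine_similarity(a, b)
```

<p>Parallel vectors have similarity 1 (distance 0); orthogonal vectors have similarity 0 (distance 1).</p>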
<p> </p>
<p><strong>How to measure the effectiveness of k-NN?</strong></p>
<blockquote>
<p>Given a dataset with class labels, we divide the dataset into D<sub>train</sub> and D<sub>test</sub>. We choose the value of <strong>K</strong> on D<sub>train</sub>. With the chosen <strong>K</strong> we predict the class labels of D<sub>test</sub>. To measure effectiveness, we calculate <strong>accuracy</strong> on D<sub>test</sub> as:<br /><br />Accuracy = (Number of points in Dtest for which k-NN predicts the correct class label)/(Number of points in Dtest)</p>
</blockquote>
<p> </p>
<p><strong>What are the limitations of k-NN?</strong></p>
<blockquote>
<p>The biggest limitation of k-NN is its high time and space complexity. If each data point has <strong>d</strong> dimensions and there are <strong>n</strong> points in Dtrain, then finding the nearest neighbors of a single query point takes O(nd) time.<br /><br />The value of d can be very large depending on the data type. For example, in the bag-of-words representation of the Amazon Review dataset, d = 100K; for tf-idf, it is 300K.</p>
</blockquote>
<p> </p>
<p><strong>How to handle Overfitting and Underfitting in KNN</strong></p>
<blockquote>
<p>For K=1 we will overfit, and for K=n we will underfit. In fact, for K=n, every query point is assigned to the majority class, so the model is effectively lazy: it ignores the structure of the data. Overfitted decision surfaces are non-smooth and less robust (prone to noise). A well-balanced model has a smooth decision surface that neither underfits nor overfits.</p>
</blockquote>
<p> </p>
<p><strong>Why is there a need for cross validation?</strong></p>
<blockquote>
<p>The available dataset D is divided into D-train and D-test. While D-train is used to train the model, D-test is used to evaluate the right value of K. The right K corresponds to the highest accuracy on D-test.<br />Accuracy on D-test = (Points for which k-NN gives the correct class label)/(Total number of points in D-test)<br /><br />This accuracy can be calculated only on D-test. Hence, we calculate the accuracy for different values of K and then decide on the right value of K. But we cannot claim that this K will also give the highest accuracy on future data, because D-test has now been used for model selection. For this problem we have the concept of <strong>cross validation</strong>: use D-cv instead of D-test to get the right value of K. So the dataset is divided into three disjoint sets: D-train, D-cv and D-test.<br /><br />While D-train is used for training the model, D-cv is used to get the right value of K. D-test now stands in for the unseen (future) data.</p>
</blockquote>
<p> </p>
<p><strong>What is K-fold cross validation?</strong></p>
<blockquote>
<p>In K-fold cross validation, D-train itself is used to choose the value of K instead of a separate D-cv. The K in K-fold is not the same K as the K in k-NN.</p>
<p>In K-fold cross validation, D-train is divided into K folds; in each round, K-1 folds are used for training and the remaining fold for cross validation. We calculate the accuracy for each of the K choices of validation fold. This is repeated for different values of K (of k-NN), and we choose the K (of k-NN) with the best accuracy, found either by majority vote or by averaging over the folds.<br /><br />Advantage: we have used the entire D-train to choose K; we have not touched D-cv or D-test, so D-test remains unseen data. <strong>Rule of thumb:</strong> use 10-fold cross validation.</p>
</blockquote>
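<p>A minimal sketch of how the folds can be generated, assuming a simple shuffled split (the function name is illustrative; scikit-learn provides <code>KFold</code> for real use):</p>

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds."""
    idx = np.random.RandomState(seed).permutation(n)  # shuffle once
    folds = np.array_split(idx, k)                    # k roughly equal folds
    for i in range(k):
        val = folds[i]                                # 1 fold for validation
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # k-1 folds for training
        yield train, val
```

<p>Each point appears in exactly one validation fold, so over the k rounds every training point gets validated once.</p>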
<p> </p>
<p><strong>What is Time based splitting?</strong></p>
<blockquote>
<p>For time-based splitting, the data should have a time feature (say, a timestamp). Whenever time is available and the data changes over time, time-based splitting is preferable to random splitting. We divide the dataset into D-train, D-cv and D-test on the basis of time: first make sure the dataset is sorted by time, then use, for example, the first 60% of the data as D-train, the next 15% as D-cv and the next 15% as D-test.</p>
</blockquote>
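<p>A short pandas sketch of such a split, using the 60/15/15 fractions from above (note these leave the last 10% of the timeline unused, exactly as in the text's numbers; the function name is illustrative):</p>

```python
import pandas as pd

def time_based_split(df, time_col="timestamp"):
    # sort by time first, then slice contiguous chunks
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    train = df.iloc[: int(0.60 * n)]               # earliest 60%
    cv    = df.iloc[int(0.60 * n): int(0.75 * n)]  # next 15%
    test  = df.iloc[int(0.75 * n): int(0.90 * n)]  # next 15%
    return train, cv, test
```
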
<p> </p>
<p><strong>Explain k-NN for regression?</strong></p>
<blockquote>
<p>In classification, we use a majority vote among the k neighbors to get the label. In regression, the k neighbors give us numerous numeric values, so we use either their mean or their median as the predicted value.</p>
</blockquote>
<p> </p>
<p><strong>What is weighted k-NN ?</strong></p>
<blockquote>
<p>After we evaluate the distances between the nearest neighbors and the query point, we give higher weight to closer neighbors (lower weight to farther ones) and then take the class label with the highest total weight. The weight can be derived from the distance by a formula; the simplest is <em>weight = 1/distance</em> (other formulas are possible).<br /><br />As a result, a query point that would have been labelled +ve under a plain majority vote can end up labelled -ve under weighted k-NN, and vice versa.</p>
</blockquote>
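<p>A sketch of the weighted vote using <em>weight = 1/distance</em> (the function name is illustrative):</p>

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)  # weight = 1/distance
    return max(votes, key=votes.get)
```

<p>With one very close "a" neighbour and two distant "b" neighbours, a plain 3-NN majority vote says "b", but the weighted vote says "a" because the close neighbour's weight dominates.</p>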
<p> </p>
<p><strong>What is Locality sensitive Hashing (LSH)?</strong></p>
<blockquote>
<p>This is similar to hashing in data structures. In locality sensitive hashing, given a data point x, we compute a hash function on x. The hash function is designed so that x and its near neighbors all end up in the same bucket of the hash table.<br />Given a query point, we compute its hash and see which bucket it lands in. By looking at the points in that bucket, we can decide which class/label the query point belongs to.</p>
</blockquote>
<p> </p>
<p><strong>LSH for cosine similarity?</strong></p>
<blockquote>
<p>In LSH for cosine similarity, vectors which are cosine similar will go inside the same bucket in the hash table. [help me in writing this in detail]</p>
</blockquote>
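<p>One classical construction for this is random-hyperplane hashing (the SimHash idea; this sketch is my own illustration, not from the original post): each random hyperplane contributes one bit, the sign of the projection onto it, and vectors with a small angle between them agree on most bits and therefore tend to land in the same bucket:</p>

```python
import numpy as np

def hyperplane_hash(X, n_planes=8, seed=0):
    """Bucket id for each row of X based on signs of random projections."""
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_planes)  # one random hyperplane per bit
    bits = (X @ planes) >= 0                  # sign of each projection
    # pack the sign pattern into an integer bucket id
    return (bits * (1 << np.arange(n_planes))).sum(axis=1)
```

<p>A vector and any positive multiple of it are cosine-identical, so they always share a bucket, while a vector and its negation flip every bit and land in different buckets.</p>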
<p> </p>
<p><strong>LSH for euclidean distance?</strong></p>
<blockquote>
<p>In LSH for euclidean distance, vectors which are closer in distance will go inside the same bucket in the hash table. [help me in writing this in detail]</p>
</blockquote>
<hr />Prove that in a heap, the leaves starts from index ⌊n/2⌋+1,⌊n/2⌋+2,…,n2018-04-03T11:47:39+00:002018-04-03T11:47:39+00:00/2018/04/03/starting-index-of-heap<p>We need to prove that in an array representation of heap, the leaves starts at index <strong>⌊n/2⌋+1</strong>, <strong>⌊n/2⌋+2</strong> . . .and goes till <strong>n</strong>, n being the last leaf of the heap.<br /><!--more--></p>
<p> </p>
<p>So, to prove it, first we recall that the LEFT child and the RIGHT child of a node in a heap are given by:<br /> </p>
<pre>[code language="C"]
int LEFT(int i){
    return 2*i;
}
[/code]</pre>
<p> </p>
<pre>[code language="C"]
int RIGHT(int i){
    return 2*i + 1;
}
[/code]</pre>
<p> </p>
<hr />
<p> </p>
<p> </p>
<p>Secondly, we also recall that a heap is an almost complete binary tree, meaning we do not fill the right child of any node without filling the left child first.</p>
<p> </p>
<hr />
<p> </p>
<p> </p>
<p>Third, note the mathematical fact that <strong>⌊n/2⌋ > ( n/2 - 1 )</strong></p>
<hr />
<p>Now we are good to go. Let us take the index of the first leaf node, i.e. <strong>⌊n/2⌋+1</strong>. For this node, we will attempt to find its left child:</p>
<p> <br /><code><br />
LEFT(⌊n/2⌋+1) = <span style="color: red;">2</span>(⌊n/2⌋+1)<br />
</code><br /> </p>
<p>Now, since ⌊n/2⌋ > n/2 - 1:<br /><code><br />
<span style="color: red;">2</span>(⌊n/2⌋+1) > 2( n/2 - 1 + 1 ) = n - 2 + 2 = n<br />
</code></p>
<p> </p>
<p>Therefore:<br /><code><br />
LEFT(⌊n/2⌋+1) > n<br />
</code><br />which is greater than the number of elements in the heap, so the node at index ⌊n/2⌋+1 has no left child; and since a heap is almost complete, it has no right child either. Therefore ⌊n/2⌋+1 is a leaf, and so is every index after it, up to n.</p>
<p> </p>Program Code For Recursive Insertion Sort2018-02-25T11:47:39+00:002018-02-25T11:47:39+00:00/2018/02/25/code-recursive-insertion-sort<p>Iterative insertion sort is very common. In this post, the recursive insertion sort is given. <!--more--></p>
<pre>[code language="C"]
void recursiveInsertionSort(int *a, int k, int key){
if(k&lt;0){ //base case: the index has run past the start of the array
a[0] = key;
return;
}
if(a[k] &gt; key){
a[k+1] = a[k];
recursiveInsertionSort(a, k-1, key);
return;
}
else{
a[k+1] = key;
return;
}
}
[/code]</pre>
<p> </p>
<p>I would encourage you to write the driver program yourself by understanding the recursiveInsertionSort() function implementation. A hint is given below.</p>
<p> </p>
<pre>[code language="C"]
...
for(i=1; i&lt;n; i++){ //i is the index of array | a[i] is the Key
k = i - 1;
recursiveInsertionSort(a, k, a[i]); //args: array | index | key
}
...
[/code]</pre>Swap all pairwise nodes in a linked list2018-02-20T11:47:39+00:002018-02-20T11:47:39+00:00/2018/02/20/code-swap-pairwise-nodes-in-linkedlist<p>In this lesson, I will show you an iterative program to swap all pairwise nodes of a linked list.<!--more--></p>
<p><strong>Are we swapping the pairwise data of the nodes or the complete nodes?</strong><br /> </p>
<p>We will be swapping the nodes themselves; as a result, the data gets swapped too. Swapping only the data does not make much sense.</p>
<p> </p>
<p>For example, if the linked list is 1-2-3-4-5-6 then pairwise swapping all nodes will give us the linked list 2-1-4-3-6-5.</p>
<p> </p>
<p>If the linked list is 1-2-3-4-5 then pairwise swapping all nodes will give us the linked list 2-1-4-3-5.</p>
<p> </p>
<p><img class="" src="/images/pairwise-swap-LL.png" alt="" width="463" height="261" /></p>
<p> </p>
<p>Observe the representation above: the addresses and the data of the nodes.</p>
<p> </p>
<p>The following program code requires an understanding of the <strong>insertAfter()</strong> and <strong>printList()</strong> functions.</p>
<p> </p>
<pre>[code language="C"]
//Pairwise swap all nodes of a linked list.
#include&lt;stdio.h&gt;
#include&lt;stdlib.h&gt;
struct Node{
int data;
struct Node *next;
};
void printList(struct Node*);
struct Node* insertAfter(struct Node**, int); //returns address of the new node
void pairWiseSwap(struct Node**);
int main()
{
struct Node *head;
struct Node *endMarker;
head = (struct Node*)malloc(sizeof(struct Node));
head-&gt;data = 1;
head-&gt;next = NULL;
endMarker = insertAfter(&amp;head, 2);
endMarker = insertAfter(&amp;endMarker, 3);
endMarker = insertAfter(&amp;endMarker, 4);
endMarker = insertAfter(&amp;endMarker, 5);
endMarker = insertAfter(&amp;endMarker, 6);
pairWiseSwap(&amp;head);
printList(head);
return 0;
}
void pairWiseSwap(struct Node **head){
struct Node *t1, *t2, *beforeT1 = NULL;
int count = 0;
//init: t1 and t2 point at the first pair
t1 = *head;
t2 = (*head)-&gt;next;
while(t1 &amp;&amp; t2){
t1-&gt;next = t2-&gt;next; //t1 skips over t2
t2-&gt;next = t1;       //t2 points back to t1: pair swapped
if(count == 0)
(*head) = t2; //first pair: t2 becomes the new head
else{
beforeT1-&gt;next = t2; //link the previous pair to the swapped one
}
beforeT1 = t2;
if(t2-&gt;next == NULL)
t1 = NULL;
else
t1 = t1-&gt;next; //advance to the next pair
if(t1 == NULL)
t2 = NULL;
else
t2 = t1-&gt;next;
count++;
beforeT1 = beforeT1-&gt;next; //beforeT1 now trails just behind the next pair
}
}
void printList(struct Node* node){
int count=0;
while(node!=NULL){
printf("\t%d", node-&gt;data);
node = node-&gt;next;
count++;
}
printf("\nTotal nodes printed=\t%d\n", count);
}
struct Node* insertAfter(struct Node **node, int data){
if( (*node)==NULL) {
printf("Node does not exist\n");
return NULL;
}
struct Node *new = (struct Node*)malloc(sizeof(struct Node));
new-&gt;data = data;
new-&gt;next = (*node)-&gt;next;
(*node)-&gt;next = new;
return new;
}
[/code]</pre>Evaluating highest value of int in C language2018-02-17T11:47:39+00:002018-02-17T11:47:39+00:00/2018/02/17/evaluating-highest-value-in-c<p>In this lesson, we will see how we can evaluate the highest/lowest value of an <strong>int</strong> in C language. Unlike in C++ and Java, this is not as straightforward as it may seem. The discussion below assumes a preliminary knowledge of 2's complement arithmetic, which will not be covered here. The general idea is this. <!--more--></p>
<p> </p>
<p>Numbers in modern computer arithmetic are represented in 2's complement form. If a computer system is n bits wide, then the highest representable number is $$2^{n-1}-1$$ and the lowest number is $$-2^{n-1}$$</p>
<p> </p>
<p>Generally (though not always), a modern <strong>int</strong> is 32 bits wide. So the highest positive number that <strong>int</strong> can represent or hold is $$2^{32-1}-1$$ Similarly, the lowest negative number representable in <strong>int32</strong> is $$-2^{32-1}$$</p>
<p> </p>
<p>Now, let's focus on representing this <em>highest positive number</em> in C. That should be as simple as:</p>
<pre>[code language="C"]
int INT_MAX = (1&lt;&lt;31)-1;
[/code]</pre>
<p>But this code triggers a <em>-Woverflow</em> warning. So what went wrong?</p>
<p>To evaluate (1&lt;&lt;31)-1, this is what the computer does: it takes the 1 and left-shifts it by 31 bits.</p>
<pre>[code language="C"]
00000000 00000000 00000000 00000001
[/code]</pre>
<p> </p>
<p>on left shifting by 31 bits, we get</p>
<pre>[code language="C"]
10000000 00000000 00000000 00000000
[/code]</pre>
<p>and now we subtract 1 from it, which is the same as adding -1. -1 in 2's complement form is</p>
<pre>[code language="C"]
11111111 11111111 11111111 11111111
[/code]</pre>
<p>and now the following addition takes place</p>
<pre>[code language="C"]
10000000 00000000 00000000 00000000
11111111 11111111 11111111 11111111
[/code]</pre>
<p> </p>
<p>Now, clearly the addition above will result in a 33-bit number, hence an <strong>overflow</strong>.</p>
<p> </p>
<p>Now, to evaluate the expression (1&lt;&lt;31)-1 correctly in C, we have to do the shift in a data type wider than 32 bits. For this, we use the <strong>long</strong> data type.</p>
<p> </p>
<pre>[code language="C"]
long x = 1;
long INT_MAX = (x&lt;&lt;(sizeof(int)*8-1))-1; //shift by 31 bits, then subtract 1
printf("%ld\n", INT_MAX); //outputs 2147483647 where long is 64 bits wide
[/code]</pre>
<p> </p>