## Learning Algorithm Evaluation


**Algorithm evaluation: Outline**

- Why? Overfitting
- How? Train/test vs. cross-validation
- What? Evaluation measures
- Who wins? Statistical significance

**Introduction**

- A model should perform well on unseen data drawn from the same distribution.

**Classification accuracy**

- A basic performance measure:
  - Success: the instance's class is predicted correctly
  - Error: the instance's class is predicted incorrectly
- Error rate: #errors / #instances
- Accuracy: #successes / #instances
- Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?

**Evaluation**

Rule #1: Never evaluate on training data!

**Train and test**

- Step 1: Randomly split the data into a training set and a test set (e.g. 2/3–1/3). The test set is also known as the holdout set.
- Step 2: Train the model on the training data.
- Step 3: Evaluate the model on the test data.
- Quiz: Can I retry with other parameter settings?

**Evaluation**

Rule #1: Never evaluate on training data!
Rule #2: Never train on test data! (That includes parameter setting and feature selection.)

**Train and test**

- Step 4: Optimize parameters on a separate validation set; keep the test set for the final evaluation.

**Test data leakage**

- Never use test data to create the classifier.
- This can be tricky: e.g. social network data.
- The proper procedure uses three sets:
  - Training set: train the models
  - Validation set: optimize algorithm parameters
  - Test set: evaluate the final model

**Making the most of the data**

- Once evaluation is complete, all the data can be used to build the final classifier.
- Trade-off: model performance vs. accuracy of the performance estimate:
  - More training data → better model (but diminishing returns)
  - More test data → more accurate error estimate

**Train and test**

- Step 5: Build the final model on ALL the data (more data, better model).

**k-fold cross-validation**

- Split the data (stratified) into k folds.
- Use k − 1 folds for training and 1 fold for testing.
- Repeat k times, so that each fold is used for testing once.
- Average the results.

**Cross-validation**

- Standard method: stratified ten-fold cross-validation.
- Why 10? Enough to reduce sampling bias; experimentally determined.

**Leave-one-out cross-validation**

- A particular form of cross-validation: #folds = #instances.
- With n instances, build the classifier n times.
- Makes the best use of the data; no sampling bias.
- Computationally expensive.

**ROC analysis**

- ROC stands for "Receiver Operating Characteristic".
- From signal processing: the trade-off between hit rate and false alarm rate over a noisy channel.
- Compute the FPR and TPR and plot them in ROC space.
- Every classifier is a point in ROC space.
- For probabilistic algorithms:
  - Collect many points by varying the prediction threshold, or
  - make the algorithm cost-sensitive and vary the costs (see below).

**Confusion matrix**

|             | actual +            | actual −            |
|-------------|---------------------|---------------------|
| predicted + | TP (true positive)  | FP (false positive) |
| predicted − | FN (false negative) | TN (true negative)  |

- TP rate (sensitivity): TP / (TP + FN)
- FP rate (fall-out): FP / (FP + TN)

**ROC space**

(Figure: OneR, J48, and J48 with fitted parameters plotted as points in ROC space.)

**ROC curves**

- Change the prediction threshold t: predict + when P(+) > t.
- Each threshold yields one (FPR, TPR) point; varying t traces a curve.
- The Area Under the Curve (AUC) summarizes the curve; in the example, AUC = 0.75.

**ROC curves: alternative method**

- An easier, but less intuitive construction:
  - Rank the instances by predicted probability.
  - Start the curve at (0, 0) and move down the ranked list.
  - If the instance is positive, move up; if negative, move right.
- A jagged curve results from a single set of test data; use cross-validation to obtain a smoother curve.

**Method selection**

- Overall: use the method with the largest area under the ROC curve (AUROC).
- If you aim to cover just 40% of the true positives in a sample: use method A.
- For a large sample: use method B.
- In between: choose between A and B with appropriate probabilities.

**ROC space and costs**

(Figure: ROC space under equal costs vs. skewed costs.)

**Different costs**

- In practice, FP and FN errors incur different costs. Examples:
  - Medical diagnostic tests: does X have leukemia?
  - Loan decisions: approve a mortgage for X?
  - Promotional mailing: will X buy the product?
- Add a cost matrix to the evaluation that weighs the TP, FP, … counts.

**Comparing data mining schemes**

- Which of two learning algorithms performs better?
- Note: this is domain dependent!
- Obvious way: compare the 10-fold CV estimates.
- Problem: variance in the estimates.
- The variance can be reduced using repeated CV.
- However, we still do not know whether the results are reliable.

**Significance tests**

- Significance tests tell us how confident we can be that there really is a difference.
- Null hypothesis: there is no "real" difference.
- Alternative hypothesis: there is a difference.
- A significance test measures how much evidence there is in favor of rejecting the null hypothesis.
- E.g., given 10 cross-validation scores for each algorithm: is B better than A?

(Figure: score distributions of algorithms A and B, with their means marked.)

**Paired t-test**

- Student's t-test tells whether the means of two samples (e.g., 10 cross-validation scores) are significantly different.
- Use a *paired* t-test when the individual samples are paired, i.e., they use the same randomization: the same CV folds are used for both algorithms.
- William Gosset (born 1876 in Canterbury; died 1937 in Beaconsfield, England) worked as a chemist at the Guinness brewery in Dublin from 1899. He invented the t-test to handle small samples for quality control in brewing, and published under the name "Student".

**Performing the test**

- Fix a significance level α.
- A significant difference at the α% level implies a (100 − α)% chance that there really is a difference.
- In scientific work, α is 5% or smaller (>95% certainty).
- Divide α by two (two-tailed test).
- Look up the z-value corresponding to α/2.
- If t ≤ −z or t ≥ z, the difference is significant and the null hypothesis can be rejected.
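The holdout procedure (Steps 1–3) and the accuracy/error-rate definitions above can be sketched in plain Python. This is a minimal illustration; the function names, the 2/3–1/3 split, and the fixed seed are choices made here, not part of the slides:

```python
import random

def train_test_split(data, test_fraction=1/3, seed=42):
    """Randomly split data into a training and a test set (e.g. 2/3-1/3)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, leave data intact
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test/holdout)

def accuracy(predictions, labels):
    """Accuracy = #successes / #instances; the error rate is 1 - accuracy."""
    successes = sum(p == y for p, y in zip(predictions, labels))
    return successes / len(labels)
```

For the quiz above: 50 examples with 10 misclassified give accuracy 40/50 = 0.8 and error rate 10/50 = 0.2.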
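The stratified k-fold procedure can likewise be sketched: shuffle the indices of each class separately and deal them round-robin over the folds, so every fold preserves the class proportions. A hypothetical helper, not the slides' own code:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each class is spread evenly over k folds."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)               # group instance indices by class
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)          # deal each class round-robin
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test
```

Each of the k iterations trains on k − 1 folds and tests on the remaining one; averaging the k test scores gives the cross-validation estimate.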
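The "alternative method" for ROC curves described above (rank by probability, move up for a positive, right for a negative) translates directly into code; the trapezoidal AUC is a standard way to compute the area under the resulting points. Function names are illustrative:

```python
def roc_points(scores, labels):
    """ROC curve by ranking: start at (0,0), walk down the probability-ranked
    list; a positive moves the curve up, a negative moves it right."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1        # positive: move up
        else:
            fp += 1        # negative: move right
        points.append((fp / n_neg, tp / n_pos))   # (FPR, TPR)
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A classifier that ranks every positive above every negative reaches (0, 1) before moving right, giving AUC = 1.0; a random ranking hovers around the diagonal, AUC ≈ 0.5.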
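The paired t-test on two sets of per-fold CV scores reduces to a one-sample t statistic on the per-fold differences. A minimal sketch (the scores below are made-up illustrative numbers; 2.262 is the standard two-tailed 5% critical value for df = 9):

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """t statistic for paired samples, e.g. per-fold CV scores of two
    algorithms evaluated on the same folds (same randomization)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference / standard error of the mean difference
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

With 10 folds there are df = 9 degrees of freedom; if |t| exceeds the critical value 2.262, the difference is significant at the 5% level and the null hypothesis (no "real" difference) can be rejected.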