## PHY 403: Modern Statistics and Exploration of Large Datasets

Spring 2015

Bausch and Lomb 315: MW 9:30 - 10:50

PHY 403 is a graduate-level lecture course on probability and statistics. The class also includes a strong data analysis and numerical methods component centered around bi-weekly homework sets.

### Location and Office Hours

| Instructor | TA |
| --- | --- |
| Segev BenZvi | Brian Coopersmith |
| B&L 405 | B&L 373 |
| Tu 11-12, Th 2-3 | W 12:45-1:45 |

### Textbooks

There are two concise (and inexpensive) textbooks used in this course:

| Data Analysis: A Bayesian Tutorial | Statistical Data Analysis |
| --- | --- |
| D. S. Sivia and John Skilling | Glen Cowan |
| ISBN-10: 0198568320 | ISBN-10: 0198501552 |

The following books are not required but are great references. You may find them on reserve at POA or available as electronic references on the River Campus Library website:

- *Statistics for Nuclear and Particle Physicists* by Louis Lyons. This is a very good statistics book written for non-statisticians specializing in particle physics.
- *Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences* by Roger Barlow. Another book for high-energy physicists with excellent discussions of systematic uncertainties and the "frequentist" approach to inference.
- *Numerical Recipes: The Art of Scientific Computing* by William Press et al. Detailed descriptions of many numerical techniques, and a pretty decent math textbook in its own right.
- *Probability Theory: The Logic of Science* by E. T. Jaynes. An excellent (if beastly long) resource on the philosophy of inference.

### Syllabus

**Probability**

Interpretations of Probability. Frequentist statistics, Bayes' Theorem, Principle of Maximum Entropy.

**Basic Statistics**

Random variables, discrete and continuous probability distributions, cumulative distributions. Mean, variance, and covariance. Central Limit Theorem. Method of moments.

**Common Probability Distributions**

Gaussian, Binomial, Poisson, Exponential, Lognormal, Chi-square, Power Law, Cauchy (Breit-Wigner).

**Monte Carlo Methods**

Random number generation, transformation of PDFs, acceptance-rejection technique.

**Bayesian Statistics**

Likelihoods, priors, and posteriors. Nuisance parameters, systematic uncertainties, and marginalization. Numerical methods: Markov Chain Monte Carlo.

**Random and Systematic Uncertainties**

Error bars and error propagation. Correlations and the "error matrix." Non-Gaussian uncertainties. Techniques for managing systematic uncertainties.

**Parameter Estimation**

Maximum likelihood technique. Least squares regression. Minimization techniques. "Robust" alternatives to the least squares method.

**Hypothesis Testing**

- Frequentist approach: significance and power, Neyman-Pearson tests, statistical trials, likelihood ratio tests.
- Bayesian approach: posterior odds, the Bayes Factor, the Ockham Factor.

**Interval Estimation**

Confidence intervals (frequentist) and credible intervals (Bayesian). Lower and upper limits. The Feldman-Cousins ranking method.

**Classification**

Multivariate techniques. Data classifiers and machine learning. Decision trees and boosting.

**Nonparametric Methods**

Rank-order statistics. KS tests. Sign test and k-sample test. Contingency tables. Gaussian processes.

**Time Series Analysis and Correlations**

Power spectra, periodograms, autocorrelation, cross-correlation. Detection of clusters in data.

### Grading

| Component | Weight |
| --- | --- |
| Homework | 45% |
| Class Participation | 10% |
| Midterm | 15% |
| Final Project | 30% |

Homework is assigned bi-weekly and has a significant
programming component. Assignments are due **Friday at 5
pm** two weeks after they are assigned. You can use any programming
language you like (including Mathematica and R), but support from the TA
and instructor is limited to Python and ROOT.

You may discuss the problems informally with your classmates but you must complete the homework on your own. Printouts of source code and plots are required to receive full credit.

The final project is a data analysis project of your choice, either reflecting your current work or your reanalysis of a previous result. You will present your results in a 20-minute presentation at the end of the semester (April 20-29).

### Lecture Notes

**Lecture 1 (Jan. 14): Course Intro**

Different interpretations of probability. Sum rule, product rule, Bayes' Theorem, Law of Total Probability.

Reading: Sivia Ch. 1; Cowan 1.1, 1.2

**Lecture 2 (Jan. 21): Programming Primer**

Basics of programming in Python. NumPy and Matplotlib extensions.

**Lecture 3 (Jan. 26): Basic Statistics**

PDFs and summary statistics: mean, mode, median; variance, covariance, correlation; histograms.

Reading: Cowan 1.3-1.5
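Since the homework has a programming component, here is the flavor of these summary-statistic computations in Python with NumPy (the dataset below is simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)                      # seeded for reproducibility
x = rng.normal(loc=10.0, scale=2.0, size=10_000)     # toy Gaussian sample
y = 3.0 * x + rng.normal(scale=1.0, size=x.size)     # a variable correlated with x

print(np.mean(x), np.median(x))    # location estimates, both near 10
print(np.var(x, ddof=1))           # unbiased sample variance, near 4
print(np.cov(x, y))                # 2x2 sample covariance matrix
print(np.corrcoef(x, y)[0, 1])     # correlation coefficient, close to 1
```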

**Lecture 4 (Jan. 28): Common Probability Distributions**

Binomial, negative binomial, multinomial, Gaussian, Poisson, Gamma, Exponential, Chi-square, Cauchy, Landau.

Reading: Cowan Ch. 2

**Lecture 5 (Feb. 2): Monte Carlo Methods**

Pseudo-random number generators. Simulating data from arbitrary PDFs.

Reading: Cowan Ch. 3
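A minimal sketch of the acceptance-rejection technique from this lecture, using a toy target PDF f(x) = 2x on [0, 1] and a uniform proposal (the PDF and the seed are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_triangular(n):
    """Draw n samples from f(x) = 2x on [0, 1] by acceptance-rejection.

    Proposal: uniform g(x) = 1 on [0, 1]; envelope constant c = 2 bounds f/g.
    """
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, 1.0)   # propose from g
        u = rng.uniform(0.0, 2.0)   # uniform height under the envelope c*g
        if u < 2.0 * x:             # accept with probability f(x) / (c * g(x))
            out.append(x)
    return np.array(out)

samples = sample_triangular(5000)
print(samples.mean())  # expect ~2/3, the mean of f
```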

**Lecture 6 (Feb. 4): Model Selection, Parameter Estimation**

Odds ratio. Statistical trials. Marginalization and systematic uncertainties.

Reading: Sivia Ch. 2, 3

**Lecture 7 (Feb. 9): Choosing Priors and Maximum Entropy**

Principle of Indifference, uniform and Jeffreys priors, Principle of Maximum Entropy.

Reading: Sivia Ch. 5

**Lecture 8 (Feb. 11): PDF Estimators**

Best estimators of a PDF: Bayesian and frequentist approaches. Quadratic approximations. Efficiency, consistency, and bias.

Reading: Sivia Ch. 2; Cowan Ch. 5

**Lecture 9 (Feb. 16): Estimators with Correlations**

Correlations between parameters: quadratic approximation in 2D, the Hessian matrix, the covariance matrix.

Reading: Sivia Ch. 3

**Lecture 10 (Feb. 18): Minimization Techniques: Maximum Likelihood, Least Squares**

Numerical methods and intro to maximum likelihood.

Reading: Sivia Ch. 3; Cowan Ch. 6

**Lecture 11 (Feb. 23): Maximum Likelihood and Least Squares II**

Properties of ML estimators. Variances and the Minimum Variance Bound. Goodness of fit.

Reading: Sivia Ch. 3; Cowan Ch. 6; NR in C Ch. 15
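For a feel of how maximum likelihood looks numerically, here is a sketch that fits a Gaussian mean and width by minimizing the negative log-likelihood with SciPy's general-purpose minimizer (the dataset is simulated for illustration; in the homework you may use whatever minimizer your toolchain provides):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=1.5, size=2000)  # toy dataset

def nll(params):
    """Negative log-likelihood of a Gaussian model (constants dropped)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf  # keep the minimizer out of the unphysical region
    return 0.5 * np.sum(((data - mu) / sigma) ** 2) + data.size * np.log(sigma)

result = minimize(nll, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)  # should be close to the true 5.0 and 1.5
```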

**Lecture 12 (Feb. 25): Propagation of Uncertainties**

Error propagation formula. Covariance matrix and correlations. Asymmetric error bars. Bayesian approach with the complete PDF.

Reading: Sivia Ch. 3; Cowan Ch. 1, 7
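The first-order error propagation formula can always be cross-checked with a quick Monte Carlo. This sketch does so for a made-up product measurement z = x·y with independent Gaussian uncertainties:

```python
import numpy as np

# Toy measurement: z = x * y, x and y independent with Gaussian uncertainties.
x, sx = 10.0, 0.2
y, sy = 4.0, 0.1

# First-order propagation: sz^2 = (dz/dx)^2 sx^2 + (dz/dy)^2 sy^2
sz = np.sqrt((y * sx) ** 2 + (x * sy) ** 2)

# Monte Carlo cross-check: simulate many (x, y) pairs and look at std(z)
rng = np.random.default_rng(2)
z = rng.normal(x, sx, 100_000) * rng.normal(y, sy, 100_000)
print(sz, z.std())  # the two estimates should agree closely here
```

Because the relative uncertainties are small, the first-order formula and the Monte Carlo agree well; for large relative errors or strongly nonlinear functions they would not.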

**Lecture 13 (Mar. 2): Systematic Uncertainties**

Systematic uncertainties and experimental design. How and when to assign systematic uncertainties.

**Lecture 14 (Mar. 4): Methods for Propagating Systematics**

Producing an error budget. The shift method. The covariance method. The pull method. Using Monte Carlo.

Reading: Barlow Ch. 4.4

**Lecture 15 (Mar. 16): Model Selection and Hypothesis Testing**

Posterior odds. Classical hypothesis testing: Type I and Type II errors. Using p-values.

Reading: Sivia 4.1-4.2; Cowan 4.1-4.4

**Lecture 16 (Mar. 18): Likelihood Ratio Testing**

Statistical significance and power. The Neyman-Pearson Lemma. Wilks' Theorem.

Reading: Cowan Ch. 4

**Lecture 17 (Mar. 23): Sampling from PDFs: Markov Chain Monte Carlo**

Sampling from high-dimensional PDFs with MCMC. The Metropolis-Hastings algorithm. The Principle of Detailed Balance. Practical details (burn-in, efficiency). Parallel tempering.
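A bare-bones random-walk Metropolis sampler, targeting a standard Gaussian so the right answer is known in advance (the step size and burn-in length are illustrative choices, not tuned recommendations):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_post(theta):
    """Toy log-posterior: a standard Gaussian, up to a constant."""
    return -0.5 * theta ** 2

def metropolis(n_steps, step=1.0, theta0=0.0):
    """Random-walk Metropolis sampler for a 1D log-density."""
    chain = np.empty(n_steps)
    theta, lp = theta0, log_post(theta0)
    for i in range(n_steps):
        prop = theta + rng.normal(scale=step)     # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            theta, lp = prop, lp_prop
        chain[i] = theta                          # rejected steps repeat theta
    return chain

chain = metropolis(20_000)[5_000:]  # discard burn-in
print(chain.mean(), chain.std())    # expect ~0 and ~1 for the target
```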

**Lecture 18 (Mar. 25): Sampling from PDFs: Nested Sampling**

Evaluating full posterior distributions. Likelihood ordering and Lebesgue integration. Multimodal PDFs.

Reading: Sivia Ch. 9

**Lecture 19 (Mar. 30): Confidence Intervals**

Credible intervals and confidence intervals. Upper and lower limits. Confidence belts, coverage, and "flip-flopping" between central intervals and limits.

Reading: Cowan Ch. 9

**Lecture 20 (Apr. 6): Unfolding**

Removing an instrumental response from data. Forward folding vs. unfolding. The variance problem. Regularization.

Reading: Cowan Ch. 11

**Lecture 21 (Apr. 8): Spectral Analysis**

Nyquist-Shannon Sampling Theorem. Fourier analysis and power spectral density. Schuster and Lomb-Scargle periodograms (with Bayesian derivations).
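As a small illustration of the Schuster periodogram, this sketch recovers an injected 5 Hz sinusoid from noisy, evenly sampled data using NumPy's FFT (the signal and noise levels are made up):

```python
import numpy as np

# Toy time series: a 5 Hz sinusoid plus white noise, evenly sampled.
fs = 100.0                     # sampling frequency (Hz)
t = np.arange(0, 10, 1 / fs)   # 10 s of data
rng = np.random.default_rng(4)
x = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=t.size)

# Schuster periodogram: squared modulus of the discrete Fourier transform
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
power = np.abs(np.fft.rfft(x)) ** 2 / t.size

peak = freqs[np.argmax(power)]
print(peak)  # the peak should sit at the injected 5 Hz signal
```

For unevenly sampled data, where the FFT no longer applies directly, the Lomb-Scargle periodogram covered in this lecture is the standard generalization.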

**Lecture 22 (Apr. 13): Measurement and Bias**

Bandwagon effects in experimental results. Confirmation bias: data selection and stopping criteria. Blind analyses.

The homework assignments are available at my.rochester.edu.

### Additional Bibliography

In addition to the course texts and books on reserve, I also used online materials as resources for these lectures, including lecture notes from similar courses. In the interest of giving credit where it's due, here are some of the best resources out there:

- *Theory of Measurement*, Scott Oser, University of British Columbia: Excellent lecture notes from a course similar to PHY 403. Oser's examples are very clear, and many of the case studies presented in my notes are taken verbatim from this course.
- PSU Summer Schools in Statistics for Astronomers: An annual multi-day summer school on statistics and data analysis aimed at students beginning their PhD research. The guest lectures are given by leaders in the field and are worth reading.
- Proceedings of the PHYSTAT Workshops: A series of workshops conducted by researchers in particle physics, astrophysics, cosmology, and statistics about common problems of inference in these fields. Highly informative (though technical) discussions of common pitfalls in Bayesian and frequentist methods.

### Usage

Anyone who comes across this material and wishes to use it for their own courses is free to do so without requesting my permission. However, please cite S. BenZvi, Dept. of Physics and Astronomy, University of Rochester, 2015.