PHY 403: Modern Statistics and the Exploration of Large Datasets
This is my second time teaching this course on probability and statistics for graduate students and advanced undergraduates. The class includes a significant data analysis and numerical methods component centered around the Python programming language. (Mathematica, R, and ROOT are also allowed.)
Time and Location
MW 10:25 - 11:40 am
Bausch and Lomb 208
Data Analysis: A Bayesian Tutorial
D.S. Sivia and John Skilling
Statistical Data Analysis
Glen Cowan
The following books are not required but are great references. You may find them on reserve at POA or available as an online electronic reference on the River Campus Library website:
- Statistics for Nuclear and Particle Physicists by Louis Lyons. This is a very good statistics book written for non-statisticians specializing in particle physics.
- Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences by Roger Barlow. Another book for high-energy physicists with excellent discussions of systematic uncertainties and the "frequentist" approach to inference.
- Numerical Recipes: The Art of Scientific Computing by William Press et al. Detailed descriptions of many numerical techniques, and a pretty decent math textbook in its own right.
- Probability Theory: The Logic of Science by E. T. Jaynes. An excellent (if beastly long) resource on the philosophy of inference.
Interpretations of Probability. Frequentist statistics, Bayes' Theorem, Principle of Maximum Entropy.
- Basic Statistics
Random variables, discrete and continuous probability distributions, cumulative distributions. Mean, variance, and covariance. Central Limit Theorem. Method of moments.
- Common Probability Distributions
Gaussian, Binomial, Poisson, Exponential, Lognormal, Chi-square, Power Law, Cauchy (Breit-Wigner)
- Monte Carlo Methods
Random number generation, transformation of PDFs, acceptance-rejection technique.
- Bayesian Statistics
Likelihoods, priors, and posteriors. Nuisance parameters, systematic uncertainties, and marginalization. Numerical methods: Markov Chain Monte Carlo.
- Random and Systematic Uncertainties
Error bars and error propagation. Correlations and the "error matrix." Non-Gaussian uncertainties. Techniques for managing systematic uncertainties.
- Parameter Estimation
Maximum likelihood technique. Least squares regression. Minimization techniques. "Robust" alternatives to the least squares method.
- Hypothesis Testing
- Frequentist approach: significance and power, Neyman-Pearson tests, statistical trials, likelihood ratio tests.
- Bayesian approach: posterior odds, the Bayes Factor, the Ockham Factor.
- Interval Estimation
Confidence intervals (frequentist) and credible intervals (Bayesian). Lower and upper limits. The Feldman-Cousins ranking method.
Multivariate techniques. Data classifiers and machine learning. Decision trees and boosting.
- Nonparametric Methods
Rank-order statistics. KS tests. Sign test and k-sample test. Contingency tables. Gaussian processes.
- Time Series Analysis and Correlations
Power spectra, periodograms, autocorrelation, cross-correlation. Detection of clusters in data.
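To give a flavor of the numerical work behind the Monte Carlo Methods topic above, here is a minimal Python sketch of the acceptance-rejection technique. It is an illustration only, not course material: the function name `sample_accept_reject`, the example density, and the seed are all assumptions chosen for this example.

```python
import numpy as np

def sample_accept_reject(pdf, x_min, x_max, pdf_max, n, rng=None):
    """Draw n samples from pdf on [x_min, x_max] by acceptance-rejection.

    pdf_max must be an upper bound on pdf over the interval.
    """
    rng = np.random.default_rng() if rng is None else rng
    samples = []
    while len(samples) < n:
        # Propose points uniformly inside the bounding box.
        x = rng.uniform(x_min, x_max, size=n)
        u = rng.uniform(0.0, pdf_max, size=n)
        # Keep only the proposals that fall under the curve.
        samples.extend(x[u < pdf(x)])
    return np.array(samples[:n])

# Hypothetical example density: p(x) = (3/8)(1 + x^2) on [-1, 1].
# It is normalized on that interval and peaks at 3/4 when x = +/-1.
def target_pdf(x):
    return 0.375 * (1.0 + x**2)

rng = np.random.default_rng(seed=403)
draws = sample_accept_reject(target_pdf, -1.0, 1.0, 0.75, 10000, rng)
print(draws.mean(), draws.var())
```

For this density the exact mean is 0 and the exact variance is 0.4, so the printed values should land close to those numbers. The efficiency of the method is the ratio of the area under the PDF to the area of the bounding box, here 1/1.5, or about 67%.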
Homework is assigned every two weeks and has a significant programming component. Each assignment is due on Friday at 5 pm, two weeks after it is assigned. You can use any programming language you like (including Mathematica and R), but support from the TA and instructor is limited to Python and ROOT.
You may discuss the problems informally with your classmates but you must complete the homework on your own. Printouts of source code and plots are required to receive full credit.
The final project can be a data analysis of your choice, either reflecting your current work or re-analyzing a previous result. You will present your results in a 20-minute presentation at the end of the semester (April 20-29).
|Different interpretations of probability: propensity, frequency, and degree of belief.|
|Reading: Sivia Ch. 1; Cowan Ch. 1.1, 1.2|
|Basics of programming in Python: Arithmetic operators, variables, conditionals and loops, functions, and importing modules. Intro to NumPy and Matplotlib.|
|3||Basics of Probability and Summary Statistics|
|Rules of probability: Sum Rule, Product Rule, Bayes' Theorem, Law of Total Probability.|
|PDFs and summary statistics: mean, mode, median; variance, covariance, correlation; histograms.|
|Reading: Sivia Ch. 1, Cowan Ch. 1.1-1.5|
|4||Common Probability Distributions I|
|Binomial, negative binomial, multinomial, Gaussian, Poisson, Gamma, Exponential, Chi-square, Cauchy, Landau.|
|Reading: Cowan Ch. 2, Numerical Recipes Ch. 7|
|5||Common Probability Distributions II|
|PDFs and probability mass functions (PMFs); more about the chi-square test and p-values; transformation rules for PDFs; probability generating functions for discrete random variables.|
|Reading: Cowan Ch. 2, Sivia Ch. 3.6|
|6||Monte Carlo Methods|
|Simulation and random number generation. Pseudo-random uniform number generators. Sampling from arbitrary PDFs using transformation/inversion and acceptance/rejection methods.|
|Reading: Cowan Ch. 3|
|7||Model Selection and Parameter Estimation|
|Parameter estimation in the Bayesian framework: posterior distributions; the role and effect of priors; marginalization of "nuisance" parameters; model comparison using posterior odds ratios; a quantitative version of Ockham's Razor.|
|Reading: Sivia Ch. 2, 3|
|Choosing priors: the Principle of Indifference; uniform and Jeffreys priors. Estimators of parameters: maximizing the posterior; reliability of estimators; bias and mean squared error; consistency and efficiency of estimators. Case studies: Gaussian and binomial estimators.|
|Reading: Sivia Ch. 2; Cowan Ch. 5|
|9||Parameter Estimation, Correlation, and Error Bars|
|Correlations between parameters: quadratic approximation in 2D, the Hessian matrix, the covariance matrix. The Student's t-distribution and the chi-square distribution.|
|Reading: Sivia Ch. 3.2, 3.3|
|10||Minimization Techniques: Maximum Likelihood and Least Squares I|
|Function minimization: common issues; grid search and steepest descent; Newton's Method; simplex method; simulated annealing. The method of maximum likelihood and its connection to least squares regression.|
|Reading: Numerical Recipes Ch. 10|
|11||Maximum Likelihood and Least Squares II|
|Properties of ML estimators. Variances and the Minimum Variance Bound. The chi-square statistic and goodness of fit.|
|Reading: Sivia Ch. 3; Cowan Ch. 6; Numerical Recipes Ch. 15|
|12||Propagation of Uncertainties|
|The classic error propagation formula and its limitations. Using the covariance matrix. Asymmetric error bars. A fully Bayesian approach with the complete PDF.|
|Reading: Sivia Ch. 3.6; Cowan Ch. 7.6|
|Systematic uncertainties vs. mistakes ("errors") in data taking. Systematics and experimental design. How and when to assign systematic uncertainties.|
|Reading: see papers by Roger Barlow referenced in the slides.|
|14||Bayesian Model Selection and Hypothesis Testing|
|Hypothesis testing; posterior odds; the Ockham Factor revisited. Comparing several models with free parameters. Hypothesis testing vs. parameter estimation.|
|Reading: Sivia Ch. 4.1-4.2; Cowan Ch. 4.1-4.4|
|15||Classical Hypothesis Testing: The Likelihood Ratio Test|
|Type I and Type II errors. Statistical significance and power in model selection. The Neyman-Pearson lemma. Using p-values and the Neyman-Pearson test in model selection. The likelihood ratio test and Wilks' Theorem.|
|Reading: Cowan Ch. 4|
|16||Credible Intervals and Confidence Intervals|
|Summarizing the range of values of a parameter. Bayesian credible intervals. Classical confidence intervals: Neyman intervals and confidence belts; central intervals and lower/upper limits; frequentist coverage and the "flip-flopping" problem. Feldman-Cousins frequentist intervals. Confusing confidence intervals with posterior probabilities.|
|Reading: Cowan Ch. 9|
|17||Instrument Response and Unfolding|
|Accounting for instrumental efficiency and resolution. Forward folding and unfolding. Regularization techniques for unfolding. Balancing of variance and bias: figures of merit based on MSE, log-likelihood, and chi-square statistics.|
|Reading: Cowan Ch. 11|
|18||Sampling from PDFs: Markov Chain Monte Carlo|
|The Metropolis-Hastings algorithm. Sampling from multi-dimensional PDFs with MCMC. The Principle of Detailed Balance. Practical details: burn-in and efficiency. Parallel tempering.|
|Reading: Information Theory, Inference, and Learning Algorithms, Ch. 29|
|19||Sampling from PDFs: Nested Sampling|
|Evaluating full posterior distributions. Likelihood ordering and Lebesgue integration. Sampling from strongly multimodal PDFs.|
|Reading: Sivia Ch. 9|
|Analysis of signals in the time domain: signal sampling and the Nyquist-Shannon Sampling Theorem. Analysis of signals in the frequency domain: Fourier analysis and power spectral density. Windowing and apodization. Bayesian insight into the power spectrum: Schuster and Lomb-Scargle periodograms.|
|Reading: Numerical Recipes in C Ch. 13|
|21||The Principle of Maximum Entropy|
|Revisiting the Principle of Indifference. Choosing maximally non-committal PDFs in the presence of missing information. The Shannon-Jaynes Entropy and the derivation of common statistical distributions using the Principle of Maximum Entropy.|
|Reading: Sivia Ch. 5, 6.2|
|22||Measurement and Bias|
|Bandwagon effects in experimental results. Confirmation bias: data selection and stopping criteria. Blind analyses.|
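As an illustration of the Metropolis-Hastings algorithm listed under lecture 18, here is a minimal Python sketch of a one-dimensional random-walk sampler. It is not taken from the lecture notes: the function name `metropolis`, the Gaussian proposal, the target density, and the burn-in length are all assumptions made for this example.

```python
import numpy as np

def metropolis(log_prob, x0, n_steps, step_size, rng=None):
    """Random-walk Metropolis sampler for a 1D target given its log-density."""
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_steps)
    x, logp = x0, log_prob(x0)
    n_accept = 0
    for i in range(n_steps):
        # Symmetric Gaussian proposal centered on the current point.
        x_new = x + step_size * rng.normal()
        logp_new = log_prob(x_new)
        # With a symmetric proposal the Hastings ratio reduces to
        # p(x_new) / p(x), compared here in log space.
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = x_new, logp_new
            n_accept += 1
        # Record the current state whether or not the move was accepted.
        chain[i] = x
    return chain, n_accept / n_steps

# Hypothetical target: a unit Gaussian, specified up to a constant.
def log_gauss(x):
    return -0.5 * x**2

rng = np.random.default_rng(seed=403)
chain, acc = metropolis(log_gauss, x0=5.0, n_steps=20000, step_size=2.0, rng=rng)
burned = chain[2000:]  # discard an assumed burn-in period
print(acc, burned.mean(), burned.std())
```

The chain deliberately starts far from the mode (`x0=5.0`) so that the effect of burn-in is visible in the first few hundred steps. After the burn-in samples are dropped, the sample mean and standard deviation should be close to 0 and 1.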
The homework assignments are available at my.rochester.edu.
In addition to the course texts and books on reserve, I also used online materials as resources for these lectures, including lecture notes from similar courses. In the interest of giving credit where it's due, here are some of the best resources out there:
- Theory of Measurement, Scott Oser, University of British Columbia:
Excellent lecture notes from a course similar to PHY 403. Oser's examples are very clear and many of the case studies presented in my notes are taken verbatim from this course.
- PSU Summer Schools in Statistics for Astronomers:
An annual multi-day summer school on statistics and data analysis aimed at students beginning their PhD research. The guest lectures are given by leaders in the field and are worth reading.
- Proceedings of the PHYSTAT Workshops: A series of workshops conducted by researchers in particle physics, astrophysics, cosmology, and statistics about common problems of inference in these fields. Highly informative (though technical) discussions of common pitfalls in Bayesian and frequentist methods.
Anyone who comes across this material and wishes to use it for their own courses is free to do so without requesting my permission. However, please cite S. BenZvi, Dept. of Physics and Astronomy, University of Rochester, 2016.