
Power analysis

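Power = $1-\beta$, where $\beta$ is the probability of a Type II error. Power analysis is used to determine the required sample size: \(n = \frac{(z_{1-\beta} + z_{1-\alpha})^2\sigma^2}{(\mu_0 - \mu_1)^2}\) (for a one-sided test; a two-sided test uses $z_{1-\alpha/2}$ instead). Typical choices are $\alpha = 0.05$ and power $1-\beta = 0.80$, i.e. $\beta = 0.20$.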
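A minimal sketch of the sample-size formula above, using only the Python standard library. The function name `sample_size` and the example values of `sigma` and `delta` (standing for $|\mu_0 - \mu_1|$) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Sample size for a one-sided z-test that should detect a mean shift `delta`.

    sigma : population standard deviation (assumed known)
    delta : |mu_0 - mu_1|, the difference to be detected
    """
    z = NormalDist()                 # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)   # z_{1-alpha}
    z_beta = z.inv_cdf(power)        # z_{1-beta}, because power = 1 - beta
    n = (z_beta + z_alpha) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)              # round up to a whole number of samples


print(sample_size(sigma=1.0, delta=0.5))  # 25
```

Halving `delta` roughly quadruples the required sample size, because $n$ scales as $1/(\mu_0 - \mu_1)^2$.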


Estimation: use sample statistics ($\overline{M},\ s$) to infer the population parameters ($\mu,\ \sigma$). Estimation and hypothesis testing are complementary.


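In estimation, $\dfrac{\overline{X} - \mu}{s/\sqrt{n}}$ does not follow a normal distribution but a Student's t-distribution with $n-1$ degrees of freedom, so we should use $t_{\alpha/2}$ rather than $Z_{\alpha/2}$. The Student's t-distribution applies when the population standard deviation is unknown. In effect, the t-distribution is a correction for the degrees of freedom of the sample standard deviation: estimating $\sigma$ by $s$ adds uncertainty, which widens the tails.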
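As a worked form of this statement, here is the two-sided $100(1-\alpha)\%$ confidence interval for $\mu$, assuming approximately normal data. The subscript $n-1$ on the quantile is added notation that makes the degrees of freedom explicit.

$$\overline{X} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \overline{X} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

As $n$ grows, $t_{\alpha/2,\,n-1}$ approaches $Z_{\alpha/2}$, so the two intervals differ noticeably only for small samples.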





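There are two ways to investigate: experiment and observation. Observational data cannot support causal claims; they can only show correlation. Laboratory experiments, in contrast, can establish causal relationships. Note that astronomical observations cannot establish a causal link, because we cannot control the variables. We only see the outcomes, but we can use Bayesian methods (with a suitable probabilistic model) to infer which causes produced those outcomes.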




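If an individual is surveyed with probability \(p\), then it represents \(1/p\) individuals of the population, i.e. its weight is \(1/p\).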
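A small sketch of how the \(1/p\) weights are typically used, assuming each sampled value comes with a known inclusion probability. The function names and the toy numbers are hypothetical.

```python
def weighted_total(values, probs):
    """Estimate a population total: each sampled value counts 1/p times."""
    if len(values) != len(probs):
        raise ValueError("values and probs must have the same length")
    return sum(y / p for y, p in zip(values, probs))


def weighted_mean(values, probs):
    """Estimate a population mean using the weights 1/p."""
    weights = [1.0 / p for p in probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Toy data: the third individual was unlikely to be surveyed (p = 0.1),
# so it stands in for ten individuals of the population.
values = [10.0, 12.0, 30.0]
probs = [0.5, 0.5, 0.1]
print(weighted_total(values, probs))
print(weighted_mean(values, probs))
```

The unweighted mean of these three values is about 17.3, while the weighted mean is pulled toward the under-sampled individual.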




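The EM algorithm (Expectation-Maximization) in data mining: iteratively fill the non-response (missing) data into the contingency table. The algorithm is valid only if the non-response is random ("missing at random").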
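A minimal sketch of this idea for a two-way contingency table, assuming the simplest missing-data pattern: some respondents answered only the row question, and the non-response is random. The function name, the data layout, and the toy counts are hypothetical, and every row is assumed to have at least one fully classified count.

```python
def em_contingency(full, row_only, n_iter=200, tol=1e-10):
    """EM estimate of cell probabilities for a partially classified table.

    full[i][j]  : counts with both row i and column j observed
    row_only[i] : counts with row i observed but the column missing
    Returns (cell probabilities, expected completed counts).
    """
    n_rows, n_cols = len(full), len(full[0])
    n_full = sum(sum(row) for row in full)
    total = n_full + sum(row_only)
    # Start from the fully classified cases only.
    p = [[full[i][j] / n_full for j in range(n_cols)] for i in range(n_rows)]
    filled = [list(row) for row in full]
    for _ in range(n_iter):
        # E-step: spread each row-only count over the columns in proportion
        # to the current conditional probabilities P(column | row).
        filled = []
        for i in range(n_rows):
            row_p = sum(p[i])
            filled.append([full[i][j] + row_only[i] * p[i][j] / row_p
                           for j in range(n_cols)])
        # M-step: re-estimate the cell probabilities from the completed table.
        new_p = [[filled[i][j] / total for j in range(n_cols)]
                 for i in range(n_rows)]
        change = max(abs(new_p[i][j] - p[i][j])
                     for i in range(n_rows) for j in range(n_cols))
        p = new_p
        if change < tol:
            break
    return p, filled


probs, completed = em_contingency([[20, 10], [10, 30]], [30, 40])
print(completed)
```

For this simple pattern the iteration converges almost immediately. With missing values in both rows and columns, the same two steps apply but need more iterations.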



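Randomized experiments: patients are assigned to groups at random, so that all factors are distributed similarly in the two groups. The randomized controlled trial is the gold standard. Randomized controls and historical controls can give completely different results, so the efficacy of a drug must be measured with a randomized controlled trial. For experiments on humans, an ethics committee decides whether the study is ethical.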
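A minimal sketch of simple random allocation into two groups. The function name `randomize` and the integer patient IDs are hypothetical, and real trials often use more elaborate schemes such as blocked or stratified randomization.

```python
import random


def randomize(patient_ids, seed=None):
    """Randomly split patients into a treatment group and a control group."""
    rng = random.Random(seed)      # a seed makes the allocation reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


treatment, control = randomize(range(20), seed=42)
print(sorted(treatment))
print(sorted(control))
```

Because the allocation ignores every patient attribute, known and unknown confounders are balanced between the groups on average.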

Median absolute deviation: $$\text{MAD} = \text{median}\left( \left| X_i - \text{median}(X) \right| \right).$$
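A short sketch of the MAD with the standard library, showing why it is robust to outliers. The function name `mad` and the toy data are illustrative.

```python
from statistics import median


def mad(data):
    """Median absolute deviation: median of |x_i - median(x)|."""
    data = list(data)
    center = median(data)
    return median(abs(x - center) for x in data)


# The outlier 100 barely affects the result: the median is 3 and the
# absolute deviations are [2, 1, 0, 1, 97], whose median is 1.
print(mad([1, 2, 3, 4, 100]))  # 1
```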

Gaussian Process

Gaussian Process for Machine Learning

David Hogg’s paper on Statistics

Data analysis recipes: Fitting a model to data

Data analysis recipes: Probability calculus for inference

Plot with Python

Using Python to do Data Analysis

Bayesian Statistics

Frequentism vs. Bayesianism, by Jake VanderPlas

Akaike Information Criterion

emcee: Seriously Kick-Ass MCMC tool

  • emcee is a Python module that implements a very cool MCMC sampling algorithm called an ensemble sampler. To sample the parameter space more efficiently, many samplers (called walkers) run in parallel, and each walker's proposed move is based on the current positions of the other walkers. emcee is available from this website: http://dan.iel.fm/emcee/current/. Some examples for emcee: http://dfm.io/emcee/current/user/line/
  • The most up-to-date emcee documentation is at https://emcee.readthedocs.io/en/latest/tutorials/line/. It is much prettier and easier to understand than before. To install the newest version, clone the GitHub repository, run pip uninstall emcee, and then run python setup.py install.
  • If you google "emcee example", you will find a lot of good tutorials and examples. If you need an example of a complex astrophysical application, my personal recommendation is the prospector SED fitting code by Ben Johnson: https://github.com/bd-j/prospector
  • emcee employs an affine-invariant Markov chain Monte Carlo (MCMC) ensemble sampler. A Metropolis-Hastings sampler and the Parallel-Tempered Ensemble Sampler (PTMCMC) can also be found in emcee (at least in the 2.x releases). PTMCMC is useful if you expect your distribution to be multimodal.
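To make the ensemble idea concrete, here is a pure-Python sketch of the affine-invariant "stretch move" of Goodman & Weare (2010), on which emcee's default move is based. This is not emcee's code: the function names, the toy Gaussian target, and all settings are assumptions chosen for illustration.

```python
import math
import random


def log_prob(theta, mean=(1.0, -2.0)):
    """Hypothetical target: unit-variance 2-D Gaussian centred on `mean`."""
    return -0.5 * sum((t - m) ** 2 for t, m in zip(theta, mean))


def stretch_move_sampler(log_prob, walkers, n_steps, a=2.0, seed=0):
    """Run an ensemble of walkers with the stretch move; return the chain."""
    rng = random.Random(seed)
    walkers = [list(w) for w in walkers]
    n_walkers, ndim = len(walkers), len(walkers[0])
    logp = [log_prob(w) for w in walkers]
    chain = []
    for _ in range(n_steps):
        for k in range(n_walkers):
            # Pick a different walker j uniformly at random.
            j = rng.randrange(n_walkers - 1)
            if j >= k:
                j += 1
            # Draw the stretch factor z from g(z) ~ 1/sqrt(z) on [1/a, a].
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            # Propose a point on the line through walkers j and k.
            proposal = [xj + z * (xk - xj)
                        for xk, xj in zip(walkers[k], walkers[j])]
            logp_new = log_prob(proposal)
            # Acceptance probability: min(1, z^(ndim-1) * p(new) / p(old)).
            log_accept = (ndim - 1) * math.log(z) + logp_new - logp[k]
            if math.log(1.0 - rng.random()) < log_accept:
                walkers[k], logp[k] = proposal, logp_new
        chain.append([list(w) for w in walkers])
    return chain


init_rng = random.Random(1)
start = [[init_rng.uniform(-1, 1), init_rng.uniform(-1, 1)] for _ in range(20)]
chain = stretch_move_sampler(log_prob, start, n_steps=2000)
samples = [w for step in chain[500:] for w in step]   # drop burn-in steps
posterior_mean = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
print(posterior_mean)
```

The posterior mean should come out close to the target centre (1, -2). In practice you would use emcee itself, which vectorizes these updates and provides tools for parallelization and convergence checks.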

Related to today’s discussion on MCMC:

  1. About the autocorrelation time in emcee: https://emcee.readthedocs.io/en/latest/tutorials/autocorr/ In emcee, autocorr.py deals with this. More on how to use it to check convergence in a real application: https://emcee.readthedocs.io/en/latest/tutorials/monitor/
  2. The dynesty Dynamic Nested Sampling code is here: https://dynesty.readthedocs.io/en/latest/ ; prospector has an example of its application here: https://github.com/bd-j/prospector/blob/master/prospect/fitting/nested.py
  3. About the pickle issue: a general description of "picklable" objects can be found here: https://docs.python.org/3/library/pickle.html#pickle-picklable In real applications, though, this can get tricky.
  4. About the time spent on each evaluation of the likelihood: if you think there might be room for improvement, I always use cProfile to profile the time spent on each function call. It is very easy to use.
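For the last point, a minimal sketch of profiling repeated likelihood calls with cProfile and pstats from the standard library. The Gaussian `log_likelihood` and the fake data are hypothetical stand-ins for your own model.

```python
import cProfile
import io
import math
import pstats


def log_likelihood(theta, data):
    """Hypothetical Gaussian log-likelihood (up to an additive constant)."""
    mu, sigma = theta
    return -sum(0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma) for x in data)


def profile_likelihood(n_calls=200):
    """Profile repeated likelihood calls and return the report as text."""
    data = [0.01 * i for i in range(1000)]
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        log_likelihood((5.0, 2.0), data)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


report = profile_likelihood()
print(report)
```

For a whole script, running `python -m cProfile -s cumulative your_script.py` gives the same kind of table without touching the code (`your_script.py` is a placeholder name).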