Ideas page for Google Summer of Code
Information for Students
RooStats is a set of statistical tools primarily used for high-energy physics. We are pushing the limits of big data and statistical modeling, so this experience is almost guaranteed to give you a unique perspective -- and potentially the gratification of contributing to the discovery of new fundamental particles and forces.
Our code development is almost entirely in C++, though some projects can be done in Python. The C++ constructs are not particularly advanced; instead, the emphasis is on a clean mapping between the code and statistical concepts, and on implementing statistical algorithms that work on statistical models of arbitrary complexity.
Below are projects that range from hard-core computational statistics to unit testing and code profiling. We also have some projects related to GUIs and more human-friendly interfaces to statistical tools.
Many of these project ideas are rather vague. If one seems interesting, you are encouraged to contact
<roostats-development@cern.ch> or kyle dot cranmer at nyu edu.
Projects
Graphical Models, Bayesian Belief Networks, Gibbs sampler
Brief explanation: Graphical models represent the conditional dependence structure of several random variables. There are efficient sampling algorithms available when one has a graphical model. Unlike generic Markov Chain Monte Carlo methods such as Metropolis-Hastings, and several other techniques that only see an N-dimensional function, these algorithms are able to take advantage of the relational structure of the variables. Our statistical models are growing in complexity and hitting a computational wall -- the hope is that these algorithms may be a breakthrough.
Note: the mentor is in contact with a few statisticians who can help navigate the literature on graphical models.
Expected results: A C++ class structure for finding and storing the graphical model based on our current representation of a probability density function. A C++ class to implement one or more of the sampling algorithms based on the graphical model, analogous to the Metropolis-Hastings class:
http://root.cern.ch/root/html532/RooStats__MetropolisHastings.html
Knowledge Prerequisite: C++, basic statistics, basic graph theory
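To make the idea of exploiting conditional structure concrete, here is a minimal Gibbs sampler sketch. It is written in Python for brevity (an actual RooStats class would be in C++), and the toy model, a standard bivariate normal with correlation rho, as well as the function name gibbs_bivariate_normal are assumptions chosen for illustration, not part of RooStats. Each step draws one variable from its exact conditional distribution given the other, instead of proposing a move in the full N-dimensional space.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal with correlation rho.

    Hypothetical toy model: the two conditionals are known in closed form,
    x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    rng = random.Random(seed)
    cond_sigma = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        # Update each variable in turn from its conditional distribution.
        x = rng.gauss(rho * y, cond_sigma)
        y = rng.gauss(rho * x, cond_sigma)
        if step >= burn_in:
            samples.append((x, y))
    return samples
```

In a larger graphical model the same loop would visit every node and condition only on that node's neighbours in the graph, which is where the structure pays off.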
A unit testing suite & benchmarking statistical algorithms
Brief explanation: Fairly self-explanatory. Unit testing and benchmarking are currently neglected.
Expected results: A unit testing suite that can run in a nightly build system. Some stress-tests that can be used to benchmark the various statistical algorithms.
Knowledge Prerequisite: C++, unit testing methodology
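As a rough sketch of what a unit test for a statistical algorithm might look like, the example below checks a toy random-walk Metropolis sampler against the known mean and variance of its target. It is written in Python for brevity; the sampler metropolis_standard_normal and the tolerances are hypothetical and stand in for a real RooStats algorithm. A fixed seed keeps the test reproducible, which matters for a nightly build.

```python
import math
import random
import unittest


def metropolis_standard_normal(n_samples, step=1.0, seed=0):
    """Toy random-walk Metropolis sampler targeting a standard normal."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Log acceptance ratio for the target density exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


class TestMetropolis(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Run the chain once and share it between the test methods.
        cls.chain = metropolis_standard_normal(50000, step=2.0, seed=1)
        cls.mean = sum(cls.chain) / len(cls.chain)

    def test_mean_is_near_zero(self):
        self.assertAlmostEqual(self.mean, 0.0, delta=0.1)

    def test_variance_is_near_one(self):
        var = sum((x - self.mean) ** 2 for x in self.chain) / len(self.chain)
        self.assertAlmostEqual(var, 1.0, delta=0.15)
```

Saved in a file, tests of this kind can be run with "python -m unittest", which a nightly build can invoke and check for a non-zero exit code.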
Importance sampling and the Neyman Construction
Brief explanation:
Expected results:
Knowledge Prerequisite:
Profiling CPU, Memory, etc.
Brief explanation: Our code has been optimized in some contexts, but more dedicated profiling is much needed.
Expected results: Profiling of various statistical algorithms on problems of different complexity. The complexity can be characterized in terms of number of variables, number of entries in the data, number of iterations of particular operations, etc. Ideally, the leading inefficiencies are tracked down and specific code optimizations are suggested and/or implemented.
Knowledge Prerequisite: Valgrind, Cachegrind, and/or equivalent debugging and profiling tools.
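The kind of scaling study described above could start from a small timing harness like the following sketch. It is in Python for brevity and uses a hypothetical toy workload (a Gaussian negative log-likelihood) in place of a real RooStats algorithm; the function names are assumptions. It records the best wall-clock time as a function of the number of entries in the data.

```python
import math
import random
import time


def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of data under a Gaussian (toy workload)."""
    norm = math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return sum(0.5 * ((x - mu) / sigma) ** 2 + norm for x in data)


def benchmark_nll(sizes, repeats=3, seed=0):
    """Return a list of (n_entries, best wall-clock seconds) pairs."""
    rng = random.Random(seed)
    results = []
    for n in sizes:
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            gaussian_nll(data, 0.0, 1.0)
            # Keep the fastest repeat to reduce noise from other processes.
            best = min(best, time.perf_counter() - start)
        results.append((n, best))
    return results
```

Calling benchmark_nll([1000, 10000, 100000]) gives a first look at how the cost grows with dataset size; a tool such as Valgrind would then be used to locate where the time is actually spent.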
Interactive interfaces
Brief explanation: Most of the user's interaction with RooStats consists of writing small programs that produce figures. There are a few places where a GUI would be genuinely useful. One is interacting with a statistical model. A simple first example can be seen here:
http://www.youtube.com/watch?v=2AkfPq2c9II&context=C37a75aeADOEgsToPDskKdAzDZxWkvEvtHtKZVAC1v
There are more possibilities for using a GUI to construct, edit, and combine statistical models.
Expected results: A GUI to do something useful.
Knowledge Prerequisite: GUIs, C++
Draw your PDF
Brief explanation: The idea here is that instead of referring to standard probability density functions (e.g. Gaussian, exponential, gamma, beta, log-normal), one simply makes a sketch of the PDF. This sketch is converted into a PDF in the RooFit/RooStats framework. This could be used for quick, iPad-like studies, utilizing the Monte Carlo sampling and fitting functionality common to RooFit PDFs.
Expected results: A GUI that can convert a sketch into a PDF
Knowledge Prerequisite: GUIs, C++
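One possible way to turn a sketch into a usable PDF is shown below, assuming the GUI delivers the drawing as a list of (x, height) points. The points are joined by straight lines, normalized to unit area, and sampled by inverting the piecewise-quadratic CDF. This is a Python sketch for brevity; the class name SketchPdf and the piecewise-linear choice are assumptions, not an existing RooFit class.

```python
import bisect
import math
import random


class SketchPdf:
    """Piecewise-linear density built from hand-drawn (x, height) points."""

    def __init__(self, points):
        pts = sorted(points)
        self.xs = [p[0] for p in pts]
        if len(pts) < 2 or any(b <= a for a, b in zip(self.xs, self.xs[1:])):
            raise ValueError("need at least two points with distinct x")
        heights = [max(p[1], 0.0) for p in pts]
        # Trapezoid area of each segment; the sum normalizes the density.
        areas = [0.5 * (heights[i] + heights[i + 1]) * (self.xs[i + 1] - self.xs[i])
                 for i in range(len(pts) - 1)]
        total = sum(areas)
        if total <= 0.0:
            raise ValueError("sketch encloses no area")
        self.ys = [h / total for h in heights]
        self.cdf = [0.0]
        for a in areas:
            self.cdf.append(self.cdf[-1] + a / total)

    def density(self, x):
        if x < self.xs[0] or x > self.xs[-1]:
            return 0.0
        i = min(bisect.bisect_right(self.xs, x) - 1, len(self.xs) - 2)
        t = (x - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ys[i] + t * (self.ys[i + 1] - self.ys[i])

    def sample(self, rng):
        u = rng.random()
        i = min(bisect.bisect_right(self.cdf, u) - 1, len(self.xs) - 2)
        width = self.xs[i + 1] - self.xs[i]
        y0 = self.ys[i]
        slope = (self.ys[i + 1] - y0) / width
        r = max(u - self.cdf[i], 0.0)
        # Solve y0*d + 0.5*slope*d^2 = r for d in a numerically stable form.
        disc = max(y0 * y0 + 2.0 * slope * r, 0.0)
        denom = y0 + math.sqrt(disc)
        d = 2.0 * r / denom if denom > 0.0 else 0.0
        return self.xs[i] + min(max(d, 0.0), width)
```

For example, SketchPdf([(0, 0), (1, 2), (2, 0)]) describes a triangular density on [0, 2], and repeated calls to sample(random.Random(0)) generate toy data from it. A smoother interpolation, such as a spline, may be preferable for fitting.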
Parallelizing Markov Chains & Implementation of convergence measures
Brief explanation: We have a Markov Chain implementation, but we do not have the ability to run several chains in parallel and check the convergence of the chains.
Expected results: Modifications to existing code to support running different chains in parallel. Implementation of standard MCMC convergence measures.
Knowledge Prerequisite: C++, basic statistics
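One standard convergence measure is the Gelman-Rubin potential scale reduction factor (R-hat), which compares the variance between several chains to the variance within them. The sketch below is in Python for brevity and shows the basic form of the statistic for a scalar quantity. The helper ar1_chain is a hypothetical stand-in for a real Markov chain, and the chains are generated one after another here; since they are independent, each could be dispatched to its own process or machine.

```python
import math
import random


def ar1_chain(n_steps, start, seed, a=0.9):
    """Toy Markov chain with stationary distribution N(0, 1) (stand-in)."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n_steps):
        x = a * x + math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length scalar chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B: between-chain variance; W: mean of the within-chain variances.
    b = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)
```

Starting the chains from dispersed points, for example start values of -5, -2, 2 and 5, and requiring R-hat close to 1 (a threshold of 1.1 is a common rule of thumb) gives a simple automated convergence check.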
--
KyleCranmer - 05-Mar-2012