Statistical Software Resources on the Web

Collected by Jim Linnemann
Michigan State University Physics

Last updated September 2005
Completeness or Authoritativeness isn't even a goal--just useful pointers!
If I'm missing a good link (your collection, for example), or a link has died, please email me!

A word on selection

I sampled links on most top-level pages, and included pages that I thought practicing High Energy Physicists and Astrophysicists would find useful, that didn't trap me on their web page, and that had a reasonable proportion of live links. The web is wonderful, but ephemeral; I'm sure you'll find broken links that worked when I tried them. The opinions are my own; a bit on statistics also slipped in among the software.

High Energy Physics

There is an evolving plan for a HEP statistical software repository; your comments (and your collaboration's) would be welcome.
Particle Data Group Statistics Summary describes statistical methods (theory) on which there is consensus in HEP
Glen Cowan's statistical resources page (Royal Holloway physics); go up a link for some software associated with his book.
There are some statistical routines in Root (an interactive data analysis framework), and in cernlib, clhep and Fermilab’s Zoom
FreeHep points to other HEP analysis software (including JAS, Java Analysis Studio), but does not have a specific statistics section
CDF statistics committee: a Tevatron experiment's statistics page, mostly methods discussion
A simple version of the D0 experiment's Bayesian limit calculator
Babar statistics working group: a SLAC experiment's statistics page, with methods and a few applets
Geant statistical packages, Maria Grazia Pia, HEP, INFN Genova, C++ library
Fermilab Advanced Analysis Group
TerraFerMA, Sherry Towers: a root-compatible package combining several classifiers and helping select candidate variables for multidimensional analysis.
gnu gsl (gnu scientific library) contains random number generators, as well as some histogramming, ntuples, moments for weighted events, and autocorrelation calculations.
sourceforge.net: a broad repository of open source software. You can browse or search by name without subscribing. You could troll about in the scientific/engineering section and find, for example, roofit.
The Computer Physics Communications program library contains a few items of interest; it requires a subscription to the journal.
Cedar is beginning a HEP archive.
 
A glossary to help translate from statistics-speak to physics-speak is on the site of the Durham conference on statistical techniques in particle physics; see also the PHYSTAT 2003 conference; both have links to other useful resources and earlier workshops in the series.
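The gnu gsl entry above mentions moments for weighted events. As a rough illustration of what that means, here is a minimal Python sketch of a weighted mean and variance. It is not the gsl interface: the function name weighted_moments is made up for this example, and the variance is normalized simply by the sum of weights, which may differ from the bias-corrected estimator a library provides.

```python
def weighted_moments(values, weights):
    """Return (mean, variance) of values, each event carrying a weight.

    Illustrative only: the variance is normalized by the sum of weights
    (no small-sample bias correction).
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("sum of weights must be nonzero")
    # First moment: weighted average of the values.
    mean = sum(w * x for x, w in zip(values, weights)) / total
    # Second central moment: weighted average squared deviation.
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, variance


if __name__ == "__main__":
    # Three events; the last one counts double.
    print(weighted_moments([1.0, 2.0, 3.0], [1.0, 1.0, 2.0]))
```

With unit weights this reduces to the ordinary mean and (population) variance, which is a quick sanity check on any weighted routine.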

Astrophysics

Statcodes Eric Feigelson et al., Penn State: big collection, with commentary; see also his Astrostatistics book. Look here--much broader than astrophysics! Includes link to web-based VOSTAT (Virtual Observatory Statistics) project, largely implemented in R (see below).
Statistical Resources Eric Hooper, Harvard
StatPy: Python interfaces to statistical software, Tom Loredo, Cornell; see also his
Bayesian Inference in the Physical Sciences (Software Section) see especially the ominously-named BUGS (heavily used by statisticians), and BAYESPACK
Astrostatistics, Barry Madore, Caltech
Mutual translation glossaries for astronomers and statisticians
 

Statistics

http://lib.stat.cmu.edu/   Carnegie Mellon’s StatLib: a key resource
http://www.galaxy.gmu.edu/papers/astr1.html    George Mason Statistics
http://members.aol.com/johnp71/javasta2.html  free software and interactive pages from John Pezzullo (retired, Georgetown Statistics)
http://my.execpc.com/~helberg/statframes.html Clay Helberg of SPSS
http://www.stat.ufl.edu/vlib/statistics.html Use your browser to search for Resources to get to the good stuff
Journal of Statistical Software: code in many programming languages.
http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml BUGS Markov Chain Monte Carlo (MCMC) package
http://www.isds.duke.edu/sites/   Duke Statistics
http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do Matlab contributions.
From national labs:
http://gams.nist.gov/ see Class L for a mixture of commercial and academic software
NIST/SEMATECH e-Handbook of Statistical Methods (Engineering Statistics reference, but not much on multidimensional data, and little software under Tools and Aids)

Statistical Computations in Java on the Web

Some nice things, some trivia, and many broken links. Gives a feel for the strengths and limitations of web-interfaced statistics. See Pezzullo's page above, and one derived from it which contains much useful reference material. Many of the working links are powered by http://home.clara.net/sisa/ and http://graphpad.com/

Multivariate Analysis and Statistical Learning

Useful buzzwords to search on in bold; "statistics" will get you more data than methods.

R The R project for Statistical Computing: gnu implementation of the S language
Graphics, statistical algorithms, and a huge repository (CRAN) of R packages. Extensive online documentation. Published books include Introductory Statistics with R by Dalgaard; Programming with Data: A Guide to the S Language by Chambers; Modern Applied Statistics with S-PLUS by Venables & Ripley; and others. There's a very good R tutorial; here's another tutorial, but without graphics.
http://www.ggobi.org/ GGobi visualization package for multidimensional data.
Includes dynamic graphics such as arrays of scatterplots, brushing techniques (highlighting groups of objects in one plot and having the same objects highlighted in the other plots), parallel coordinate plots, and grand tours. Interfaces exist to R and Python front ends, and database back ends. I've skimped on Perl here and elsewhere, but often where you find Python interfaces, you'll also find Perl, though Ruby not as often.
http://www.omegahat.org/ The Omega project for Statistical Computing.
Interfaces between R, Python, XML, Java, databases, and other goodies. At this point, aimed more at developers than users.
Jerry Friedman (High Energy Physicist turned Statistician) has software for a number of multivariate techniques on the web; don't miss his book below.
http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman. Site includes R/S+ Code
The best multivariate analysis and Statistical Learning textbook I know of; web site includes software. From a modern and sophisticated computational statistics viewpoint, but quite readable. Compares methods from trees to neural nets, kernel methods, and support vector machines, though nothing on genetic algorithms. You can even learn the meaning of useful things like bootstrapping and boosting and other post-1960's statistical jargon!
http://ai.fri.uni-lj.si/~aleks/orng/ Data mining in Python: support vector machines, logistic regression, clustering, by Aleks Jakulin
http://magix.fri.uni-lj.si/orange/ Orange
a massive toolkit, including visualization, feature selection, and many evaluation tools, such as calibration curves and ROC (Receiver Operating Characteristics = efficiency for signal vs. fraction of background: true positives vs. false positives). Practically all major algorithms from machine learning. Python is a popular interface to this library.
http://www.pitt.edu/~csna/software.html Multivariate Analysis Software
http://www.ll.mit.edu/IST/lnknet/ Classification Software collection--easy to compare methods
http://www.kdnuggets.com/software/classification.html mixture of commercial and academic software links
http://www.ph.tn.tudelft.nl/PRInfo/software.html Machine Learning Resources online
http://www.ncrg.aston.ac.uk/NN/software.html Neural Network Software list, including SNNS popular in Babar
http://home.comcast.net/~tom.fawcett/public_html/ROCCH/ and http://gim.unmc.edu/dxtests/ROC1.htm
ROC curves and a critique of using "best accuracy" on test data sets as a comparison criterion across algorithms (and implicitly, perhaps, as a training objective?). Makes the point that external criteria define the best efficiency point to select, and that often no single algorithm dominates at all efficiencies. It is obvious here that there is a considerable gap between the machine learning and statistics communities, which Elements of Statistical Learning by Hastie et al. tries to bridge.
Data mining is a wide, but rather commercial, field; mostly software you'd rather sell than buy.
http://www.cs.waikato.ac.nz/ml/weka/ is one Java toolkit
http://www.togaware.com/datamining/ contains a mix of free and commercial resources
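
For readers new to ROC curves (signal efficiency vs. fraction of background accepted, as described under Orange and the ROC links above), here is a minimal Python sketch of how such a curve can be built from classifier scores. It is an illustration only, not code from any of the packages listed; the function names roc_points and area_under_curve, and the convention that label 1 means signal and 0 means background, are assumptions made for this example.

```python
def roc_points(scores, labels):
    """Return ROC points as (background fraction, signal efficiency).

    Sweeps a cut on the classifier score from high to low; events with
    tied scores are accepted together. Assumes label 1 = signal and
    label 0 = background.
    """
    n_sig = sum(1 for y in labels if y == 1)
    n_bkg = len(labels) - n_sig
    if n_sig == 0 or n_bkg == 0:
        raise ValueError("need at least one signal and one background event")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]  # cut above every score: nothing accepted
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Lower the cut to this score and accept every event at it.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_bkg, tp / n_sig))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    labels = [1, 1, 0, 1, 0, 0]
    curve = roc_points(scores, labels)
    for bkg_frac, sig_eff in curve:
        print(f"background {bkg_frac:.2f}  signal {sig_eff:.2f}")
    print("area under curve:", area_under_curve(curve))
```

The area under the curve is one common single-number summary, but as the critique above notes, the operating point you actually want depends on external criteria, so comparing whole curves is usually more informative than comparing one number.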

Acknowledgements

Google rankings, and Glen Cowan's and Eric Feigelson's pages got me started. The following people (and some others I've forgotten) have provided me with several useful links as well as some excellent suggestions which I have unaccountably ignored.

Tom Loredo (Cornell Astronomy); Rene Brun (CERN); Paul Padley (Rice); Jim Kowalkowski (Fermilab); John Rice (Berkeley Statistics); Louis Lyons (Oxford); Ilya Narsky (Caltech)