Statistical Software Resources on the Web
Collected by Jim Linnemann
Michigan State University Physics
Last updated October 2010
Completeness or Authoritativeness isn't even a goal--just useful
pointers!
If I'm missing a good link (your collection, for example), or a link died, please
email it to me!
A word on selection
I sampled links on most top-level pages, and included pages I
thought practicing High Energy Physicists and Astrophysicists would find useful;
that didn't trap me on their web page; and had a reasonable proportion of live
links. The web is wonderful, but ephemeral; I'm sure you'll find links that
weren't broken when I tried. The opinions are my own; a bit on statistics also
slipped in among the software.
High Energy Physics
- The phystat.org site is a repository for HEP statistical
software, with pointers to the phystat conference series; see StatPatternRecognition there for well-tuned multivariate algorithms.
- Particle
Data Group Statistics Summary describes statistical methods (theory) on
which there is consensus in HEP
- Glen Cowan's
statistical resources page (Royal Holloway physics); go up a link for some
software associated with his book.
- There are some statistical
routines in ROOT (an interactive data
analysis framework). See also RooStats and TMVA for more useful software in the ROOT framework. There are also some in cernlib,
clhep and
Fermilab’s Zoom.
- FreeHep
points to other HEP analysis software (including JAS, Java Analysis Studio),
but does not have a specific statistics section
- CDF
statistics committee: a Tevatron experiment's statistics page, mostly methods
discussion
- A
simple version of the D0 experiment's Bayesian limit calculator
- Babar
statistics working group: a SLAC experiment's statistics page, with methods
and a few applets
- Geant statistical
packages, Maria Grazia Pia, HEP, INFN Genova, C++ library
- Fermilab Advanced Analysis
Group
- GNU GSL (GNU Scientific Library)
contains random number generators, as well as some histogramming, ntuples,
moments for weighted events, and autocorrelation calculations.
- sourceforge.net a broad repository
of open source software. Basic browsing or search by name without subscribing.
You could troll about in the scientific/engineering section and find, for
example, roofit.
- The Computer Physics Communications program
library contains a few items of interest; it requires a subscription to
the journal.
- Cedar is beginning a HEP archive.
- A glossary
to help translate from statistics-speak to physics-speak (from one of the phystat conferences).
Astrophysics
- Statcodes, Eric Feigelson et
al., Penn State: big collection, with commentary; see also his Astrostatistics
book. Look here--much broader than astrophysics! Includes
link to web-based VOSTAT (Virtual Observatory Statistics) project, largely
implemented in R (see below).
- StatPy: Python
interfaces to statistical software, Tom Loredo, Cornell; see also his
- Bayesian
Inference in the Physical Sciences (Software Section) see especially the
ominously-named BUGS
(heavily used by statisticians), and BAYESPACK
- Astrostatistics,
Barry Madore, Caltech
- Mutual translation glossaries
for astronomers and statisticians, and software, and other goodies from statistician David van Dyk.
Statistics
- http://lib.stat.cmu.edu/
Carnegie Mellon’s StatLib: a key resource
- Free software and interactive pages from John Pezzullo (retired, Georgetown
Statistics)
- Statistics on the Web
from Clay Helberg of SPSS
- http://www.stat.ufl.edu/vlib/statistics.html
Use your browser to search for Resources to get to the good stuff
- Journal of Statistical Software:
software in many programming languages.
- http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
BUGS Markov Chain Monte Carlo package
- http://www.isds.duke.edu/sites/
Duke Statistics
- http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do
Matlab contributions.
- From national labs:
- http://gams.nist.gov/ see Class L for
a mixture of commercial and academic software
- NIST/SEMATECH
e-Handbook of Statistical Methods (Engineering Statistics reference, but
not much on multidimensional data, and little software under Tools and Aids)
- There is a wiki list of statistical software
- And finally, a handy statistics glossary or two.
Statistical Computations in Java on the Web
Some nice things, some trivia, and many broken links. Gives a feel for
the strengths and limitations of web-interfaced statistics. Pezzullo's page above is the best starting place. Many Java
links are powered by SISA
and GraphPad.
Multivariate Analysis and Statistical Learning
Useful buzzwords to search on in bold; "statistics"
will get you more data than methods. Try wiki as well as search engines.
- R: The R Project for Statistical
Computing, the GNU implementation of the S language
- Graphics, statistical algorithms, and a huge repository (CRAN)
of R packages. Extensive online documentation. Published books include Introductory
Statistics with R by Dalgaard; and Programming with Data: A Guide to the S
Language by Chambers; Modern Applied Statistics with S-PLUS, by Venables &
Ripley, and others; here's a very good R
tutorial ; for more, search for "tutorial using R"
- http://www.ggobi.org/ GGobi visualization
package for multidimensional data.
- Includes dynamic graphics such as arrays of scatterplots, brushing techniques
(highlighting groups of objects in one dimension and having them
highlighted in the other dimensions); parallel coordinate plots, and grand tours.
Interfaces exist to R and Python front ends, and database back ends. I've
skimped on Perl here and elsewhere, but often where you find Python
interfaces, you'll also find Perl, and sometimes Ruby.
- http://www.omegahat.org/ The Omega
project for Statistical Computing.
- Interfaces between R, Python, XML, Java, databases, and other goodies.
At this point, aimed more at developers than users.
- Jerry Friedman (High
Energy Physicist turned Statistician) has software for a number of multivariate
techniques on the web; don't miss his book below.
- http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman.
Site includes R/S+ code
- The best multivariate analysis and Statistical
Learning textbook I know of; web site includes software. From a modern
and sophisticated computational statistics viewpoint, but quite readable.
Compares methods from trees to neural nets,
kernel methods, and support vector machines,
though nothing on genetic algorithms. You can even learn the meaning of useful
things like bootstrapping and boosting and other post-1960's statistical jargon!
- http://magix.fri.uni-lj.si/orange/
Orange
- a massive toolkit, including visualization, feature selection, and
many evaluation tools, including calibration curves and ROC
(Receiver Operating Characteristic: efficiency for signal vs. fraction of
background accepted, i.e. true positives vs. false positives). Practically all major algorithms
from machine learning. Python is a popular interface to this library.
- http://www.pitt.edu/~csna/software.html
Multivariate Analysis Software (some older items, but useful)
- libsvm, SVMlight, and PRTools are popular Pattern Recognition software (thanks: MSU Computer Science)
- support-vector-machines.org has more SVM software and information
- lnknet
MIT Classification Software collection--easy to compare methods
- http://www.kdnuggets.com/software/classification.html
mixture of commercial and academic software links
- Machine Learning Resources online ; see also Google Directory | Computers | Artificial Intelligence
- Neural Network Software list; see also SNNS,
popular in Babar
- http://home.comcast.net/~tom.fawcett/public_html/ROCCH/
and http://gim.unmc.edu/dxtests/ROC1.htm
- ROC curves and a critique of using "best accuracy"
on test data sets as a comparison criterion across algorithms (and implicitly,
perhaps, as a training objective?). Makes the point that external criteria
define the best efficiency point to select, and that often no single algorithm
dominates at all efficiencies. It is obvious here that there is a considerable gap
between the machine learning and statistics communities, which Elements
of Statistical Learning by Hastie et al. tries to bridge.
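As a rough illustration of the ROC idea described in this list (signal efficiency vs. fraction of background accepted), here is a minimal Python sketch. The function name roc_points and the toy scores and labels are made up for this example; they do not come from any of the packages linked above.

```python
def roc_points(scores, labels):
    """Return (background_fraction, signal_efficiency) pairs, one per cut.

    scores: classifier outputs, higher means more signal-like.
    labels: 1 for signal, 0 for background.
    """
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    points = [(0.0, 0.0)]  # a cut above every score accepts nothing
    # Sweep the cut from the highest score downward.
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / n_bkg, tp / n_sig))
    return points


# Made-up toy data, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for bkg, eff in roc_points(scores, labels):
    print(f"background fraction {bkg:.2f}  signal efficiency {eff:.2f}")
```

Which point on the curve to operate at is not decided by the curve itself; as the critique above says, that depends on external criteria such as how costly the accepted background is.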
Data mining is a wide, but rather commercial, field; lots of
software you'd rather sell than buy.
- Weka is one Java toolkit
- mloss is a sortable machine learning repository of free resources; also try wiki machine learning
- http://www.togaware.com/datamining/
contains a mix of free and commercial resources
Acknowledgements
Google rankings, and Glen Cowan's and Eric Feigelson's pages got me started.
The following people (and some others I've forgotten) have provided me with
several useful links as well as some excellent suggestions which I have unaccountably
ignored.
Tom Loredo (Cornell Astronomy); Rene Brun (CERN); Paul Padley (Rice); Jim Kowalkowski
(Fermilab); John Rice (Berkeley Statistics); Louis Lyons (Oxford); Ilya Narsky
(Matlab); Deb Davis (statistics teacher)