Software and databases

Unless stated otherwise, the files shown below are © Miguel Á. Carreira-Perpiñán. They are free for non-commercial use. I make them available in the hope that they will be useful, but without any warranty.

GitHub repository

Some of the software produced by my students and me is at https://github.com/UCMerced-ML.

Matlab code

For implementations of my own algorithms, see my publications page.

Machine learning algorithms and models

Algorithms based on the method of auxiliary coordinates (MAC):
- Original MAC paper (reference: Carreira-Perpiñán and Wang, AISTATS 2014), with some illustrative examples.
- Low-dimensional SVM (reference: Wang and Carreira-Perpiñán, AAAI 2014): learning nonlinear, low-dimensional features for a linear SVM.
- Binary autoencoder (reference: Carreira-Perpiñán and Raziperchikolaei, CVPR 2015): learning a binary autoencoder (for fast approximate information retrieval using binary hashing).
- Affinity-based binary embedding (reference: Carreira-Perpiñán and Raziperchikolaei, arXiv 2015): learning a binary embedding with affinity-based loss functions (for fast approximate information retrieval using binary hashing).
Computation of affinity or similarity functions:
- Entropic affinities (references: Hinton and Roweis, NIPS 2002; Vladymyrov and Carreira-Perpiñán, ICML 2013): constructs Gaussian affinities with an adaptive bandwidth for each data point, so that each point has a fixed effective number of neighbours K. This should give improved results in problems such as nonlinear embeddings, spectral clustering, etc. Our code is practically as efficient as using a global bandwidth.
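The idea behind entropic affinities can be sketched in a few lines of plain Python. This is only an illustration, not the Matlab code offered above: for a single data point, it searches for the Gaussian precision beta at which the perplexity (the exponential of the entropy) of the normalised affinities equals K. The function names, the bisection bounds and the tolerance are assumptions of this sketch, and plain bisection is used for simplicity rather than speed.

```python
import math

def affinities(sqdists, beta):
    """Normalised Gaussian affinities p_j proportional to exp(-beta * d_j^2)."""
    dmin = min(sqdists)  # shift for numerical stability; cancels on normalising
    w = [math.exp(-beta * (d - dmin)) for d in sqdists]
    s = sum(w)
    return [x / s for x in w]

def perplexity(p):
    """exp(entropy) of a distribution: the effective number of neighbours."""
    return math.exp(-sum(x * math.log(x) for x in p if x > 0.0))

def entropic_beta(sqdists, K, tol=1e-6, max_iter=200):
    """Bisection on log(beta) so that the perplexity equals K (1 < K < N)."""
    lo, hi = math.log(1e-12), math.log(1e12)  # assumed search interval
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        perp = perplexity(affinities(sqdists, math.exp(mid)))
        if abs(perp - K) < tol:
            break
        if perp > K:
            lo = mid  # too many effective neighbours: narrow the Gaussian
        else:
            hi = mid  # too few effective neighbours: widen the Gaussian
    return math.exp(mid)

# Squared distances from one point to the other points (made-up values).
sqdists = [0.5, 1.0, 2.0, 4.0, 9.0]
beta = entropic_beta(sqdists, K=2)
p = affinities(sqdists, beta)
```

Repeating this search for every point gives each one its own bandwidth, which is what makes the effective number of neighbours the same across the dataset.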
Dimensionality reduction algorithms:
- Elastic embedding (EE) (reference: Carreira-Perpiñán, ICML 2010). We have three codes:
  - A fast code using the spectral direction (reference: Vladymyrov and Carreira-Perpiñán, ICML 2012). This computes exact gradients; use it for small datasets (up to thousands of points).
  - A fast code using fast multipole methods (reference: Vladymyrov and Carreira-Perpiñán, AISTATS 2014). This computes approximate gradients; use it for large datasets (millions of points). It includes both fast multipole methods, which are O(N), and the Barnes-Hut algorithm, which is O(N log N).
  - A slower code using a fixed-point iteration (reference: Carreira-Perpiñán, ICML 2010). This is provided for reference; you should probably use the faster code (spectral direction).
- Dimensionality reduction by unsupervised regression (DRUR) (reference: Carreira-Perpiñán and Lu, CVPR 2008, 2010).
- lapeig.m: Laplacian eigenmaps (reference: Belkin and Niyogi, Neural Comp. 2003).
- Locally Linear Landmarks (LLL): fast, approximate Laplacian eigenmaps for large-scale manifold learning (reference: Vladymyrov and Carreira-Perpiñán, ECML/PKDD 2013).
- Variational Nyström (VN): fast, approximate Laplacian eigenmaps for large-scale manifold learning (reference: Vladymyrov and Carreira-Perpiñán, ICML 2016).
- Stochastic neighbor embedding (SNE) and t-SNE (references: Hinton and Roweis, NIPS 2002; van der Maaten and Hinton, JMLR 2008).
  For small datasets (thousands of points), this code is the fastest one available for SNE and t-SNE as far as I know. It uses the spectral direction (reference: Vladymyrov and Carreira-Perpiñán, ICML 2012). For large datasets, use the fast multipole implementation above.
Clustering algorithms:
- Laplacian K-modes (reference: Wang and Carreira-Perpiñán, arXiv 2014).
- K-modes (reference: Carreira-Perpiñán and Wang, arXiv 2013).
- Gaussian blurring mean shift (GBMS) (reference: Carreira-Perpiñán, ICML 2006).
- Constrained spectral clustering through affinity propagation (reference: Lu and Carreira-Perpiñán, CVPR 2008).
Semi-supervised learning algorithms:
- Laplacian assignments (LASS) (reference: Carreira-Perpiñán and Wang, AAAI 2014): learns soft assignments (or probability distributions) of items into categories, given item-item and item-category sparse (dis)similarity matrices.
Probabilistic models:
- Factor analysis: maximum likelihood estimate with an EM algorithm, projection to factor space (scores), goodness-of-fit test (number of factors to use), varimax rotation, etc.
- Principal component analysis: probabilistic and non-probabilistic PCA.
- Mixture of multivariate Bernoulli distributions: maximum likelihood estimate with an EM algorithm, sampling, etc.
Gaussian mixtures:
- Gaussian mixture Matlab tools contains a set of Matlab functions to perform some common operations with Gaussian mixtures (GMs): computing the density, gradient and Hessian at given points; computing the moments (mean, covariance); finding all the modes (with a fixed-point, mean-shift iteration); sampling from the GM; learning the GM parameters with an EM algorithm; learning a Gaussian classifier; reconstructing missing values in a dataset using a GM; and finding the parameters of a conditional or marginal GM, e.g. for p(x3|x2,x4) from a GM with pdf p(x1,x2,x3,x4,x5). In particular, this allows finding the modes, gradient, etc. of conditional or marginal GMs, with isotropic, diagonal or full covariance matrices.
  Here is an application of these tools to reconstructing missing data in a database of articulatory speech (Wisconsin X-ray Microbeam Database): code (reference: Qin and Carreira-Perpiñán, Interspeech 2010).
- Mode-finding in Gaussian mixtures (including the calculation of error bars, gradient and Hessian). See my page on Gaussian mixtures. This code assumes isotropic covariance matrices; for more general covariance matrices and many other useful functions, see the Gaussian mixture Matlab tools on this page.
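As an illustration of the fixed-point, mean-shift iteration mentioned above, here is a minimal Python sketch for the simplest case: a Gaussian mixture whose components all share one isotropic covariance sigma^2 I. In that case a stationary point satisfies x = sum_m p(m|x) mu_m, and the iteration simply applies this update repeatedly. The function name, the argument layout and the stopping rule are assumptions of the sketch; it is not the Matlab code listed here.

```python
import math

def mean_shift_mode(x, means, weights, sigma, tol=1e-10, max_iter=1000):
    """Iterate x <- sum_m p(m|x) mu_m until the update is smaller than tol."""
    for _ in range(max_iter):
        # Log of the unnormalised posteriors:
        # p(m|x) is proportional to pi_m * exp(-||x - mu_m||^2 / (2 sigma^2)).
        logp = [math.log(w)
                - sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / (2.0 * sigma ** 2)
                for w, mu in zip(weights, means)]
        top = max(logp)  # subtract the maximum to avoid underflow
        post = [math.exp(l - top) for l in logp]
        total = sum(post)
        new_x = [sum(p * mu[d] for p, mu in zip(post, means)) / total
                 for d in range(len(x))]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x

# Two well-separated components (made-up parameters).
means = [[0.0, 0.0], [4.0, 0.0]]
weights = [0.5, 0.5]
mode_a = mean_shift_mode([0.5, 0.2], means, weights, sigma=0.5)
mode_b = mean_shift_mode([3.5, -0.3], means, weights, sigma=0.5)
```

Each starting point converges to a nearby mode, so a search for all the modes needs several starting points; using every component mean as a start is one common choice, stated here as an assumption rather than as a description of the released code.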
Generalised elastic nets (GEN): these extend the original elastic net of Durbin and Willshaw to arbitrary differential operators. This generalised elastic nets web page contains figures and animations for travelling salesman problems (TSP) and visual cortical map simulations of V1, all created with the Generalised Elastic Net Matlab Toolbox (reference: Carreira-Perpiñán and Goodhill, 2003).
Other functions:
- sqdist.m: matrix of squared Euclidean distances between two datasets (needed by many of the functions below).
- nnsqdist.m: k nearest neighbours (indices and distances) of a dataset.
- knn.m: k-nearest neighbour classifier.
- lagdist.m: lagged distances of a vector time series.
- roc.m: ROC curve for a binary classifier.
- confumat.m: confusion matrix for a K-class classifier.
- recall1.m: recall@R measure in information retrieval: the average rate of queries for which the 1-nearest neighbour is ranked in the top R positions.
- kmeans.m: k-means algorithm, with various choices for the initialisation (including kmeans++).
- kmeans1.m: a generalisation of the k-means algorithm to use weighted centroids and a (partially) fixed codebook.
- procrustes.m: Procrustes alignment of two datasets. Useful to compare the results of different dimensionality reduction algorithms.
- rrr.m and rrrtrain.m: reduced-rank regression (linear regression where the coefficient matrix is constrained to be low-rank).
- gaussaff.m: Gaussian affinity matrix of a dataset based on various types of graphs: full, k-nearest-neighbour (symmetric, nonsymmetric, mutual), epsilon-ball. Returns a sparse matrix if possible.
- seglsq.m: segmented least squares problem (using dynamic programming), i.e., fit several lines to a 2D dataset.
- econncomp.m: connected components of a point set using the Euclidean distance.
- conncomp.m: connected components of a graph.
- sample_dirichlet.m: sampling from a Dirichlet distribution.
- sw.m: Swendsen-Wang Monte Carlo sampling.
- varimax.m: varimax rotation.
- imgsqd.m and imgsqd2.m: transform an image file (greyscale or colour) into a dataset with one feature vector per pixel and compute the sparse matrix of squared Euclidean distances between pixel feature vectors (require rgb2lab.m, rgb2luv.m, rgb2xyz.m). The difference between imgsqd.m and imgsqd2.m is in how boundary pixels are treated: in imgsqd.m their square neighbourhood is clipped, so they have fewer neighbours than interior pixels, while in imgsqd2.m the neighbourhood is shifted inside, so each pixel has the same number of neighbours.
- rgb2lab.m, rgb2luv.m, rgb2xyz.m: colour space transformations.
- map50.m: create a 50% saturation colormap, with unobtrusive colours as in political maps, useful to plot the regions created by clustering or classification algorithms with 2D datasets.
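To make the recall@R measure listed above concrete, here is a small Python sketch. The input format is hypothetical and is not the interface of recall1.m: true_nn[i] is the index of the true nearest neighbour of query i, and rankings[i] is the list of database indices retrieved for query i, best first (for example, sorted by Hamming distance between binary codes).

```python
def recall_at_r(true_nn, rankings, R):
    """Fraction of queries whose true nearest neighbour is in the top R retrieved items."""
    hits = sum(1 for nn, ranked in zip(true_nn, rankings) if nn in ranked[:R])
    return hits / len(true_nn)

# Three queries over a database of five items (made-up rankings).
true_nn = [3, 1, 4]
rankings = [[3, 0, 1],   # true neighbour 3 is ranked first
            [0, 2, 1],   # true neighbour 1 is ranked third
            [2, 0, 1]]   # true neighbour 4 is not retrieved
r1 = recall_at_r(true_nn, rankings, R=1)
r3 = recall_at_r(true_nn, rankings, R=3)
```

Plotting this value against R gives the recall curve commonly used to compare hashing methods.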

Numerical optimisation

The following functions are part of my course materials for EECS260 Optimization:

numgradhess.m: evaluates numerically (with a finite-difference approximation) the gradient and Hessian of a function at a list of points, and optionally compares these gradients and Hessians with user-provided ones (useful to confirm the latter are correct). See the example function Fquad.m.
AC.m: compares pairs of arrays numerically using the infinity-norm (useful to verify that two different computations are essentially the same numerically).
convseq.m: estimates numerically the order of convergence given a sequence of vectors.
fcontours.m: contour plot of a 2D function with equality and inequality constraints.
plotseq.m: plots a sequence of points in 2D (useful in combination with fcontours.m).
Osteepdesc.m: steepest descent minimisation.
linesearch.m: backtracking line search.
Example functions: Fquad.m (quadratic), Frosenbrock.m (Rosenbrock function in n dimensions), Fsines.m (sines in 2D).
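The ideas behind three of these functions (checking a gradient by finite differences, backtracking line search and steepest descent) can be sketched together in plain Python. This is an illustration under assumptions, not a port of the Matlab files: all names, the central-difference step h, and the line search constants rho and c are choices made for this sketch.

```python
def rosenbrock(x):
    """Rosenbrock function in n dimensions."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rosenbrock_grad(x):
    """Analytical gradient of the Rosenbrock function."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        g[i] += -400.0 * x[i] * (x[i + 1] - x[i] ** 2) - 2.0 * (1.0 - x[i])
        g[i + 1] += 200.0 * (x[i + 1] - x[i] ** 2)
    return g

def numgrad(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def backtracking(f, x, fx, g, p, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient-decrease (Armijo) condition holds."""
    slope = sum(gi * pi for gi, pi in zip(g, p))  # directional derivative, < 0
    while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

def steepest_descent(f, grad, x, tol=1e-6, max_iter=50000):
    """Move along -gradient with a backtracking step until the gradient is small."""
    for k in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break
        p = [-gi for gi in g]
        alpha = backtracking(f, x, f(x), g, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x, k

# Gradient check: the largest entrywise gap should be tiny if the formula is right.
x0 = [-1.2, 1.0, 0.5]
gap = max(abs(a - b) for a, b in zip(numgrad(rosenbrock, x0), rosenbrock_grad(x0)))

# Steepest descent on a simple quadratic with minimiser at the origin.
quad = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
quad_grad = lambda x: [2.0 * x[0], 20.0 * x[1]]
xmin, iters = steepest_descent(quad, quad_grad, [3.0, -2.0])
```

The same steepest_descent call also accepts rosenbrock and rosenbrock_grad, but expect it to need many more iterations there, since steepest descent is slow along the curved valley of that function.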

Some more functions:

proxbqp.m: solves a collection of proximal bound-constrained quadratic programs each of the form min(x'.A.x - 2b'.x + μ.|x-v|²) s.t. l ≤ x ≤ u. It uses an efficient ADMM algorithm described here.
SimplexProj.m: computes the projection of a vector onto the simplex (the standard, regular or probability simplex), fully vectorised and efficient (O(n log n), where n is the dimension of the vector). The algorithm is briefly described here.
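For readers who want to see what projecting onto the probability simplex involves, the following Python sketch uses a standard sort-based algorithm with the same O(n log n) cost. It handles one vector at a time and is not claimed to match SimplexProj.m line for line; the function name is an assumption of the sketch.

```python
def simplex_proj(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)  # the sort dominates the cost: O(n log n)
    cumsum = 0.0
    tau = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0.0:
            tau = t  # keep the threshold for the largest j that passes the test
    # Shift every entry by the threshold and clip at zero.
    return [max(vi - tau, 0.0) for vi in v]

inside = simplex_proj([0.2, 0.3, 0.5])   # already on the simplex
corner = simplex_proj([2.0, 0.0])        # projects onto a vertex
centre = simplex_proj([0.5, 0.5, 0.5])   # projects onto the barycentre
```

A vector that already lies on the simplex is returned unchanged up to rounding, which is a quick sanity check for any implementation.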

(Articulatory) speech processing

Articulatory databases: the following tools allow you to read data files from the MOCHA and XRMB databases into Matlab, plot animations of the vocal tract (pellet locations and outlines of the tongue and palate) and speech (waveform, spectrogram, energy, pitch, and phoneme labels if available), save the visualisation as a movie file, produce scatterplots and temporal plots of the articulator traces, and do various other things:
- MOCHAtools: for the MOCHA-TIMIT database. Example displays for utterance fsew0_001: animation and scatterplot.
- XRMBtools: for the Wisconsin X-ray Microbeam (XRMB) database. Example displays for utterance jw11_tp105: animation and scatterplot.
Both of them require you to install our speech analysis functions. This code was mostly written by my PhD student Chao Qin. We also have a Java interface for this code written by Jimmy Yih and Mark Crompton.
Electropalatography: tools for imaging EPG frames.

Databases

A database of ear images I created for my 1995 MSc thesis on ear biometrics and ear identification.
Two databases (Corel subset and Sowerby, in gzipped Matlab format) of labelled images used in our CVPR 2004 paper "Multiscale conditional random fields for image labeling". The Corel subset was labelled by us; the Sowerby database is © BAE Systems.
The COIL-20 dataset and COIL-100 dataset, both in Matlab format, provided here with permission for the convenience of Matlab users. These datasets contain images of objects from a varying viewpoint. You can view all the images in a single file here: coil20.png and coil100.png.
The original datasets are Copyright © 1996 Computer Vision Laboratory, Columbia University, and are described here:
- COIL-20: Columbia Object Image Library. S. A. Nene, S. K. Nayar and H. Murase, "Columbia Object Image Library (COIL-20)", Technical Report CUCS-005-96, February 1996.
- COIL-100: Columbia Object Image Library. S. A. Nene, S. K. Nayar and H. Murase, "Columbia Object Image Library (COIL-100)", Technical Report CUCS-006-96, February 1996.
MNISTrotated7: a database created by Weiran Wang and myself consisting of noisy, rotated digit-7 images from the MNIST dataset, as well as their corresponding skeleton shape. It was used in our AISTATS 2012 paper. You can view some sample images here: MNISTrotated7.gif.

Java applets

An illustration of several mean-shift algorithms for image segmentation: Gaussian blurring mean shift (GBMS) and accelerated mean-shift (MS1) (references: Carreira-Perpiñán, ICML 2006 and CVPR 2006).
A standalone Java application to display animations of the vocal tract and speech, written by Jimmy Yih and Mark Crompton. This is an interface to our articulatory speech processing Matlab code. At present it requires having Matlab installed on your machine and is mostly intended for illustration purposes.

Weka implementations

Weka is a suite of machine learning software for data mining tasks written in Java. We have implemented the following algorithms in Weka:

Binary autoencoders: useful for fast information retrieval using binary hashing (reference: Carreira-Perpiñán and Raziperchikolaei, CVPR 2015).
Low-dimensional support vector machines: useful to learn nonlinear features (not necessarily low-dimensional) for a linear SVM and construct a fast nonlinear classifier (reference: Wang and Carreira-Perpiñán, AAAI 2014).

LaTeX

MACPcv: a LaTeX2e class file to typeset a personalised curriculum vitae.
MACPremark: a LaTeX2e package to add remarks to a draft LaTeX file, such as "I need to complete this section and add a bibliographic reference" or "Extend this proof to the complex numbers".
MACPpreprint: a LaTeX2e package to add remarks to a preprint LaTeX file without modifying its layout, as if the text of the remarks had been typed over the paper once printed. This is useful to put online copies of papers in your web page which are exactly like the camera-ready ones published in proceedings or journals, but over which some text has been typed, such as a copyright notice, bibliographical information or page numbers.
MACParrow: a simple LaTeX2e package to add horizontal curved stretchable arrows under and over a formula, as are common in commutative diagrams. There are other packages that do this and much more, like xypic, but if you just need over- and underarrows, this one is very compact and simple to use.

Emacs Lisp

www-command.el package: allows automatic sending of commands to certain WWW sites from inside XEmacs. It currently includes:
- Lookup of words in the Anaya dictionaries for Spanish and English (as well as the Merriam-Webster dictionary and thesaurus)
- Lookup of terms in Eric Weisstein's World of Mathematics and Britannica.com
- Search for keywords in the Google and Altavista search engines
- Validation of HTML files with the W3C HTML Validation Service
and it is straightforward to include others. This package is a minimal modification of Tomasz J. Cholewo's webster-www package, which comes with the standard distribution of XEmacs 20.x.

Configuration files for various programs and useful scripts

If you want to use them, be sure to read through them and adapt them to your local configuration; for example, you will need to change some directory names, the email address, check for the existence of certain programs in your system, etc.

.emacs, for the Emacs or XEmacs editors (requires the www-command.el file)
init.el and custom.el, internally loaded by my .emacs for the XEmacs editor
.bibtex, internally loaded by my .emacs for use with XEmacs' BibTeX mode
.vm, for VM, the XEmacs mail reader
.ctwmrc, for the CTWM window manager
llpdf, a shellscript to pass options to ps2pdf so it uses lossless image compression and embeds all fonts, including fonts in Matlab figures
bitmap-eps, a shellscript to convert an EPS file with lots of objects into a bitmapped EPS file, which will generate smaller PDF files when included in a LaTeX document

You can get all the files as MACP-config-files.tar.gz.

Miguel A. Carreira-Perpinan

Last modified: Mon May 18 20:20:23 PDT 2020
