Publications
Believe it or not, this is still the definitive list of my publications!
General Machine Learning

T. Krüger, D. Panknin, M. Braun,
Fast Cross-Validation via Sequential Analysis,
NIPS BigLearning Workshop, 2011
(abstract)
With the increasing size of today's data sets, finding the right
parameter configuration via cross-validation can be an extremely
time-consuming task. In this paper we propose an improved
cross-validation procedure which uses non-parametric testing
coupled with sequential analysis to determine the best parameter
set on linearly increasing subsets of the data. By eliminating
underperforming candidates quickly and keeping promising candidates
as long as possible, the method speeds up the computation while
preserving the capability of the full cross-validation. The
experimental evaluation shows that our method reduces the
computation time by a factor of up to 70 compared to a full
cross-validation with a negligible impact on the accuracy.
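The elimination idea can be sketched in a few lines. This is a toy illustration of testing candidates on linearly growing subsets and dropping clear losers, not the authors' actual non-parametric sequential test; the `evaluate` callback, `margin`, and step schedule are made-up placeholders:

```python
def fast_cv(candidates, evaluate, n_total, n_steps=5, margin=0.05):
    """Successive elimination on linearly increasing subsets.

    candidates: hashable parameter configurations
    evaluate(cfg, n): cross-validation error estimate of cfg on n samples
    """
    active = list(candidates)
    for step in range(1, n_steps + 1):
        n = n_total * step // n_steps  # linearly growing subset size
        errors = {cfg: evaluate(cfg, n) for cfg in active}
        best = min(errors.values())
        # drop candidates that are clearly worse than the current best
        active = [cfg for cfg in active if errors[cfg] <= best + margin]
        if len(active) == 1:
            break
    # final decision among the survivors on the full data
    return min(active, key=lambda cfg: evaluate(cfg, n_total))
```

Underperforming configurations are only ever evaluated on small subsets, which is where the speed-up comes from.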
Social Media Analysis
Analysis of Twitter data; Impact Analysis; Big Data; Real-Time
Stream Mining.

M. Braun, M. Jugel, K.R. Müller,
Real-time social media analysis with TWIMPACT,
Demonstration at NIPS 2011.
(abstract)
Social media analysis has attracted quite some interest
recently. Typical questions are identifying current trends, rating
the impact or influence of users, summarizing what people are
saying on some topic or brand, and so on. One challenge is the
enormous amount of data you need to process. Current approaches
often need to filter the data first and then do the actual
analysis after the event. However, ideally you would want to be
able to look into the stream of social media updates in real time
as the event unfolds, not a few days or weeks later when you have
finished processing your data. We will showcase the real-time
social media analysis engine developed by TWIMPACT in cooperation
with the TU Berlin. You will be able to look at the current trends
on Twitter in real time, and also browse the last year in a
pre-analyzed form, so that you can zoom in on this year's historic
events such as the revolutions in Egypt and Libya.
(Deep) Neural Networks
Studying how representations are built in (deep) neural networks
using the techniques developed for kernel methods.

G. Montavon, M. Braun, K.R. Müller,
Deep Boltzmann Machines as Feed-Forward Hierarchies,
International Conference on Artificial Intelligence and Statistics
(AISTATS), 2012
(abstract)
The deep Boltzmann machine is a powerful model that extracts the
hierarchical structure of observed data. While inference is
typically slow due to its undirected nature, we argue that the
emerging feature hierarchy is still explicit enough to be
traversed in a feed-forward fashion. The claim is corroborated by
training a set of deep neural networks on real data and measuring
the evolution of the representation layer after layer. The
analysis reveals that the deep Boltzmann machine produces a
feed-forward hierarchy of increasingly invariant representations
that clearly surpasses the layer-wise approach.

G. Montavon, M. Braun, K.R. Müller,
Importance of Cross-Layer Cooperation for Learning Deep Feature Hierarchies,
NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011
(abstract)
A common property of hierarchical models of the brain is their
capacity to integrate bottom-up and top-down information in order
to distill the task-relevant information from the sensory
noise. In this paper, we argue that such cooperation between upper
and lower layers is not only useful at prediction time but also at
learning time in order to build a successful feature
hierarchy. The claim is corroborated by training a set of deep
networks on real data and measuring the evolution of the
representation layer after layer. The analysis reveals that
cross-layer cooperation enables the emergence of a sequence of
increasingly invariant representations.

G. Montavon, M. Braun, K.R. Müller,
Kernel Analysis of Deep Networks,
Journal of Machine Learning Research (JMLR), 12, pp. 2563-2581, 2011
(abstract)
When training deep networks it is common knowledge that an
efficient and well generalizing representation of the problem is
formed. In this paper we aim to elucidate what makes the emerging
representation successful. We analyze the layer-wise evolution of
the representation in a deep network by building a sequence of
deeper and deeper kernels that subsume the mapping performed by
more and more layers of the deep network and measuring how these
increasingly complex kernels fit the learning problem. We observe
that deep networks create increasingly better representations of
the learning problem and that the structure of the deep network
controls how fast the representation of the task is formed layer
after layer.

G. Montavon, M. Braun, K.R. Müller,
Layer-wise analysis of deep networks with Gaussian kernels,
Advances in Neural Information Processing Systems 23 (NIPS
2010), pp. 1678-1686, 2010
(abstract)
Deep networks can potentially express a learning problem more
efficiently than local learning machines. While deep networks
outperform local learning machines on some problems, it is still
unclear how their nice representation emerges from their complex
structure. We present an analysis based on Gaussian kernels that
measures how the representation of the learning problem evolves
layer after layer as the deep network builds higher-level abstract
representations of the input. We use this analysis to show
empirically that deep networks build progressively better
representations of the learning problem and that the best
representations are obtained when the deep network discriminates
only in the last layers.
Machine Learning Open Source Software

S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou,
G. Holmes, Y. LeCun, K.R. Müller, F. Pereira, C. E. Rasmussen,
G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston,
R. Williamson,
The Need for Open Source Software in Machine Learning,
Journal of Machine Learning Research, 8(Oct):2443-2466, 2007
(abstract)
Open source tools have recently reached a level of maturity
which makes them suitable for building large-scale real-world
systems. At the same time, the field of machine learning has
developed a large body of powerful learning algorithms for diverse
applications. However, the true potential of these methods is not
realized, since existing implementations are not openly shared,
resulting in software with low usability and weak
interoperability. We argue that this situation can be
significantly improved by increasing incentives for researchers to
publish their software under an open source model. Additionally,
we outline the problems authors are faced with when trying to
publish algorithmic implementations of machine learning
methods. We believe that a resource of peer reviewed software
accompanied by short articles would be highly valuable to both the
machine learning and the general scientific community.
Kernel Methods
Eigenstructure of the Kernel Matrix; Kernel PLS; Kernel Methods

N. Krämer, M. Sugiyama, M.L. Braun,
Lanczos Approximations for the Speedup of Kernel Partial Least Squares Regression,
JMLR Workshop and Conference Proceedings, Volume 5: AISTATS 2009
(abstract)
The runtime for Kernel Partial Least Squares (KPLS) to compute the
fit is quadratic in the number of examples. However, the necessity
of obtaining sensitivity measures as degrees of freedom for model
selection or confidence intervals for more detailed analysis
requires cubic runtime, and thus constitutes a computational
bottleneck in real-world data analysis. We propose a novel
algorithm for KPLS which not only computes (a) the fit, but also
(b) its approximate degrees of freedom and (c) error bars in
quadratic runtime. The algorithm exploits a close connection
between Kernel PLS and the Lanczos algorithm for approximating the
eigenvalues of symmetric matrices, and uses this approximation to
compute the trace of powers of the kernel matrix in quadratic runtime.
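The flavor of the quadratic-time trick can be illustrated with a randomized (Hutchinson-style) trace estimator. This is not the authors' Lanczos construction, only a sketch of how trace(K^p) can be obtained from matrix-vector products alone; `n_probes` and the probe distribution are illustrative choices:

```python
import numpy as np

def trace_of_powers(K, p, n_probes=30, seed=0):
    """Randomized estimate of trace(K^p) using only matrix-vector
    products: O(p * n_probes * n^2) work instead of the cubic cost
    of diagonalizing K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        v = z.copy()
        for _ in range(p):
            v = K @ v                         # p matrix-vector products
        total += z @ v                        # unbiased estimate of trace(K^p)
    return total / n_probes
```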

M.L. Braun, J. Buhmann, K.R. Müller,
On Relevant Dimensions in Kernel Feature Spaces,
JMLR 9(Aug):1875-1908, 2008. (software)
(more information)
(abstract)
We show that the relevant information of a supervised learning
problem is contained up to negligible error in a finite number of
leading kernel PCA components if the kernel matches the underlying
learning problem in the sense that it can asymptotically represent
the function to be learned and is sufficiently smooth. Thus,
kernels do not only transform data sets such that good
generalization can be achieved using only linear discriminant
functions, but this transformation is also performed in a manner
which makes economical use of feature space dimensions. In the
best case, kernels provide efficient implicit representations of
the data for supervised learning problems. Practically, we propose
an algorithm which enables us to recover the number of leading
kernel PCA components relevant for good classification. Our
algorithm can therefore be applied (1) to analyze the interplay of
data set and kernel in a geometric fashion, (2) to aid in model
selection, and (3) to denoise in feature space in order to yield
better classification results.
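A minimal sketch of the kind of analysis the paper proposes: expand the label vector in the kernel eigenbasis and count the leading coefficients that stand out. The fixed relative threshold below is a crude stand-in for the paper's statistical estimate, and all parameter names are invented for illustration:

```python
import numpy as np

def label_spectrum(K, y):
    """Squared coefficients of the label vector y in the eigenbasis of
    the kernel matrix K, ordered by decreasing eigenvalue. If the kernel
    matches the task, the relevant information sits in the leading
    coefficients."""
    w, U = np.linalg.eigh(K)
    order = np.argsort(w)[::-1]
    return (U[:, order].T @ y) ** 2

def estimate_cutoff(K, y, rel=0.01):
    """Crude cutoff: number of coefficients that clearly stand out,
    here taken as those above a fixed fraction of the largest one."""
    s = label_spectrum(K, y)
    return int((s > rel * s.max()).sum())
```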

N. Krämer, M.L. Braun,
Kernelizing PLS, Degrees of Freedom, and Efficient Model Selection,
ICML 2007
(abstract)
Kernelizing partial least squares (PLS), an algorithm which has
been particularly popular in chemometrics, leads to kernel PLS
which has several interesting properties, including a subcubic
runtime for learning, and an iterative construction of directions
which are relevant for predicting the outputs. We show that the
kernelization of PLS introduces interesting properties not found
in ordinary PLS, giving novel insights into the workings of kernel
PLS and the connections to kernel ridge regression and conjugate
gradient descent methods. Furthermore, we show how to correctly
define the degrees of freedom for kernel PLS and how to
efficiently compute an unbiased estimate. Finally, we address the
practical problem of model selection. We demonstrate how to use
the degrees of freedom estimate to perform effective model
selection, and discuss how to implement cross-validation schemes
efficiently.

M. L. Braun,
Accurate bounds for the eigenvalues of the kernel matrix,
Journal of Machine Learning Research, 7(Nov):2303-2328, 2006
(abstract)
The eigenvalues of the kernel matrix play an important role in a
number of kernel methods, in particular, in kernel principal
component analysis. It is well known that the eigenvalues of the
kernel matrix converge as the number of samples tends to
infinity. We derive probabilistic finite sample size bounds on the
approximation error of individual eigenvalues which have the
important property that the bounds scale with the eigenvalue under
consideration, reflecting the actual behavior of the approximation
errors as predicted by asymptotic results and observed in
numerical simulations. Such scaling bounds have so far only been
known for tail sums of eigenvalues. Asymptotically, the bounds
presented here have a slower than stochastic rate, but the number
of sample points necessary to make this disadvantage noticeable is
often unrealistically large. Therefore, under practical
conditions, and for all but the largest few eigenvalues, the
bounds presented here form a significant improvement over existing
non-scaling bounds.
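The convergence the bounds describe is easy to observe numerically. A small sketch, assuming an RBF kernel on one-dimensional data; the 1/n scaling makes the kernel-matrix eigenvalues comparable across sample sizes:

```python
import numpy as np

def scaled_kernel_eigenvalues(x, gamma=1.0):
    """Eigenvalues of the RBF kernel matrix K_ij = exp(-gamma (x_i - x_j)^2),
    scaled by 1/n; under this scaling they converge as the sample grows."""
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    return np.sort(np.linalg.eigvalsh(K / len(x)))[::-1]

rng = np.random.default_rng(1)
e_small = scaled_kernel_eigenvalues(rng.uniform(size=200))
e_large = scaled_kernel_eigenvalues(rng.uniform(size=800))
# the leading eigenvalues agree across sample sizes, and the spectrum
# decays fast, so small eigenvalues carry much smaller absolute errors
# than large ones -- the behavior the scaling bounds capture
```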

M.L. Braun, J. Buhmann, K.R. Müller,
Denoising and Dimension Reduction in Feature Space,
Advances in Neural Information Processing Systems 19 (NIPS 2006),
B. Schölkopf, J. Platt, and T. Hoffman (eds.), MIT Press,
Cambridge, MA, pp. 185-192, 2007
(abstract)
We show that the relevant information about a classification
problem in feature space is contained up to negligible error in a
finite number of leading kernel PCA components if the kernel
matches the underlying learning problem. Thus, kernels not only
transform data sets such that good generalization can be achieved
even by linear discriminant functions, but this transformation is
also performed in a manner which makes economic use of feature
space dimensions. In the best case, kernels provide efficient
implicit representations of the data to perform
classification. Practically, we propose an algorithm which enables
us to recover the subspace and dimensionality relevant for good
classification. Our algorithm can therefore be applied (1) to
analyze the interplay of data set and kernel in a geometric
fashion, (2) to help in model selection, and (3) to denoise in
feature space in order to yield better classification
results.

M.L. Braun, T. Lange, J. Buhmann,
Model Selection in Kernel Methods based on a Spectral Analysis of Label Information,
DAGM 2006, LNCS 4174, pp. 344-353
(abstract)
We propose a novel method for addressing the model selection
problem in the context of kernel methods. In contrast to existing
methods which rely on holdout testing or try to compensate for
the optimism of the generalization error, our method is based on a
structural analysis of the label information using the
eigenstructure of the kernel matrix. In this setting, the label
vector can be transformed into a representation in which the
smooth information is easily discernible from the noise. This
permits estimating a cutoff dimension such that the leading
coefficients in that representation contain the learnable
information, discarding the noise. Based on this cutoff
dimension, the regularization parameter is estimated for kernel
ridge regression.

M. L. Braun,
Spectral Properties of the Kernel Matrix and their Relation to Kernel Methods in Machine Learning,
Ph.D. thesis, University of Bonn, 2005. Published electronically.
(abstract)
Machine learning is an area of research concerned with the
construction of algorithms which are able to learn from
examples. Among such algorithms, so-called kernel methods form an
important family of algorithms which have proven to be powerful
and versatile for a large number of problem areas. Central to
these approaches is the kernel matrix which summarizes the
information contained in the training examples. The goal of this
thesis is to analyze machine learning kernel methods based on
properties of the kernel matrix. The algorithms considered are
kernel principal component analysis and kernel ridge
regression. This thesis is divided into two parts: a theoretical
part devoted to studying the spectral properties of the kernel
matrix, and an application part which analyzes the kernel
principal component analysis method and kernel-based regression
based on these theoretical results.
In the theoretical part, convergence properties of the
eigenvalues and eigenvectors of the kernel matrix are studied. We
derive accurate bounds on the approximation error which have the
important property that the error bounds scale with the magnitude
of the eigenvalue, predicting correctly that the approximation
error of small eigenvalues is much smaller than that of large
eigenvalues. In this respect, the results improve significantly on
existing results. A similar result is proven for scalar products
with eigenvectors of the kernel matrix. It is shown that the
scalar products with eigenvectors corresponding to small
eigenvalues are small a priori independently of the degree of
approximation.
In the application part, we discuss the following topics. For
kernel principal component analysis, we show that the estimated
eigenvalues approximate the true principal values with high
precision. Next, we discuss the general setting of kernel-based
regression and show that the relevant information of the labels is
contained in the first few coefficients of the label vector in the
basis of eigenvectors of the kernel matrix, such that the
information and the noise can be divided much more easily in this
representation. Finally, we show that kernel ridge regression
works by suppressing all but the leading coefficients, thereby
extracting the relevant information of the label vectors. This
interpretation suggests an estimate of the number of relevant
coefficients in order to perform model selection. In an
experimental evaluation, this approach proves to perform
competitively with state-of-the-art methods.
Brain Computer Interface
Detecting conditions of high mental and auditory workload based on
EEG recordings under real-world conditions.

C. Sannelli, M. Braun, and K.R. Müller,
Improving BCI performance by task-related trial pruning,
Neural Networks, Volume 22, Issue 9, November 2009, Pages
1295-1304, Special Issue on Brain-Machine Interface
(abstract)
Noise in electroencephalography (EEG) data is a ubiquitous
problem that limits the performance of brain-computer interfaces
(BCI). While typical EEG artifacts are usually removed by trial
rejection or by filtering, noise induced in the data by the subject's
failure to produce the required mental state is very harmful. Such
"noise" effects are rather common, especially for naive subjects
in their training phase, and standard artifact removal methods
inevitably fail on them. In this paper, we present a novel method
which aims to detect such defective trials by taking the intended
task into account, using Relevant Dimensionality Estimation (RDE), a
new machine learning method for denoising in feature space. In this
manner, our method effectively "cleans" the training data and thus
allows better BCI classification. Preliminary results on a
data set of 43 naive subjects show a significant improvement for 74%
of the subjects.

C. Sannelli, M. Braun, M. Tangermann, and K.R. Müller,
Estimating Noise and Dimensionality in BCI Data Sets: Towards BCI Illiteracy Comprehension,
In Proceedings of the 4th International Brain-Computer Interface
Workshop and Training Course 2008, Verlag der Technischen
Universität Graz, 2008 (won the young scientists best talk award)

J. Kohlmorgen, G. Dornhege, M.L. Braun, B. Blankertz,
K.R. Müller, G. Curio, K. Hagemann, A. Bruns, M. Schrauf,
W.E. Kincses,
Improving Human Performance in a Real Operating Environment through Real-Time Mental Workload Detection,
in Towards Brain-Computer Interfacing, G. Dornhege,
J. del R. Millán, T. Hinterberger, D. McFarland,
K.R. Müller (eds.), MIT Press, 2007
Stability Based Model Selection
A general framework for estimating the right number of clusters
based on the stability of the clustering solutions.

T. Lange, V. Roth, M. Braun and J. Buhmann,
Stability-Based Validation of Clustering Solutions,
Neural Computation 16:6, June 2004.
(abstract)
Data clustering describes a set of frequently employed techniques
in exploratory data analysis to extract "natural" group structure
in data. Such groupings need to be validated to separate the
signal in the data from spurious structure. In this context,
finding an appropriate number of clusters is a particularly
important model selection question. We introduce a measure of
cluster stability to assess the validity of a cluster model. This
stability measure quantifies the reproducibility of clustering
solutions on a second sample, and it can be interpreted as a
classification risk with regard to class labels produced by a
clustering algorithm. The preferred number of clusters is
determined by minimizing this classification risk as a function of
the number of clusters. Convincing results are achieved on
simulated as well as gene expression data sets. Comparisons to
other methods demonstrate the competitive performance of our
method and its suitability as a general validation tool for
clustering solutions in realworld problems.
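The stability procedure can be sketched as follows. This toy version assumes plain k-means as the clustering algorithm and uses a greedy label matching (the paper treats general clustering algorithms, with the matching done optimally); every parameter here is an illustrative choice:

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain k-means with a k-means++-style init (spread-out centroids)."""
    C = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    C = np.array(C)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

def matched_disagreement(pred, lab, k):
    """Disagreement after greedily matching cluster labels (a simple
    stand-in for an optimal label matching)."""
    conf = np.zeros((k, k))
    for p, q in zip(pred, lab):
        conf[p, q] += 1
    agree = 0.0
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(conf), conf.shape)
        agree += conf[i, j]
        conf[i, :] = -1
        conf[:, j] = -1
    return 1.0 - agree / len(pred)

def instability(X, k, n_reps=10, seed=0):
    """Cluster one half of the data, transfer the solution to the other
    half via nearest centroid, and measure the disagreement with that
    half's own clustering; the average is the instability of k."""
    rng = np.random.default_rng(seed)
    n = len(X)
    risks = []
    for _ in range(n_reps):
        idx = rng.permutation(n)
        a, b = idx[: n // 2], idx[n // 2:]
        Ca, _ = kmeans(X[a], k, rng)
        _, lb = kmeans(X[b], k, rng)
        pred = np.argmin(((X[b][:, None] - Ca[None]) ** 2).sum(-1), axis=1)
        risks.append(matched_disagreement(pred, lb, k))
    return float(np.mean(risks))
```

Minimizing this instability over k then recovers the preferred number of clusters.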

T. Lange, M. Braun, V. Roth, and J. Buhmann,
Stability-Based Model Selection,
NIPS 2002

V. Roth, M. Braun, T. Lange, and J. Buhmann,
Stability-Based Model Order Selection in Clustering with Applications to Gene Expression Data,
ICANN 2002, pp. 607-612, Lecture Notes in Computer Science
2415, José R. Dorronsoro (ed.)

V. Roth, T. Lange, M. Braun, J. Buhmann,
A Resampling Approach to Cluster Validation,
COMPSTAT 2002
Noisy TSP
Applying methods from statistical physics to robustly sample good
solutions for combinatorial optimization problems like the
travelling salesman problem.

M. Braun, J. Buhmann,
The Noisy Euclidean Traveling Salesman Problem and Learning,
in: T. G. Dietterich, S. Becker, Z. Ghahramani (eds.), Advances in
Neural Information Processing Systems 14, pp. 251-258, MIT Press
(abstract)
We consider noisy Euclidean traveling salesman problems in the
plane, which are random combinatorial problems with underlying
structure. Gibbs sampling is used to compute average
trajectories, which estimate the underlying structure common to
all instances. This procedure requires identifying the exact
relationship between permutations and tours. In a learning
setting, the average trajectory is used as a model to construct
solutions to new instances sampled from the same
source. Experimental results show that the average trajectory can
in fact estimate the underlying structure and that overfitting
effects occur if the trajectory adapts too closely to a single
instance.
Retina Implant
From March 1997 to September 1999, I worked as a lab assistant in the
Retina Implant project (which was a registered project of
EXPO 2000) with Michael Becker. The goal was to construct a retinal
prosthesis for people with macular degeneration.

M. Becker, M. Braun, R. Eckmiller,
Retina Implant Adjustment with Reinforcement Learning,
Proc. ICASSP'98, IEEE Press, 1998, Vol. II, pp. 1181-1184

M. Becker, M. Braun, R. Eckmiller,
Retina Encoder Inversion for Retina Implant Simulation,
L. Niklasson et al. (eds.), Proc. ICANN'98, Springer, 1998, pp. 791-796