Publications
General Machine Learning

T. Krüger, D. Panknin, M. Braun,
Fast CrossValidation via Sequential Analysis,
NIPS BigLearning Workshop, 2011
(abstract)
With the increasing size of today’s data sets, ﬁnding the right
parameter conﬁguration via crossvalidation can be an extremely
timeconsuming task. In this paperwe propose an improved
crossvalidation procedure which uses nonparametrictesting
coupled with sequential analysis to determine the best parameter
set on linearly increasing subsets of the data. By eliminating
underperforming candidatesquickly and keeping promising candidates
as long as possible the method speedsup the computation while
preserving the capability of the full crossvalidation.The
experimental evaluation shows that our method reduces the
computation timeby a factor of up to 70 compared to a full
crossvalidation with a negligible impacton the accuracy.
Social Media Analysis
Analysis of Twitter data; Impact Analysis; Big Data, RealTime;
Stream Mining.

M. Braun, M. Jugel, K.R. Müller,
Realtime social media analysis with TWIMPACT,
Demonstration at NIPS 2011.
(abstract)
Social media analysis has attracted quite some interest
recently. Typical questions are identifying current trends, rating
the impact or influence of users, summarizing what people are
saying on some topic or brand, and so on. One challenge is the
enormous amount of data you need to process. Current approaches
often need to filter the data first and then do the actual
analysis after the event. However, ideally you would want to be
able to look into the stream of social media updates in realtime
as the event unfolds, and not a few days or weeks later when
you've processed your data. We will showcase the realtime social
media analysis engine developed by TWIMPACT in cooperation with
the TU Berlin. You will be able to look at the current trends on
Twitter in realtime, in addition to being able to have a look at
the last year which we will bring in a preanalyzed form, such
that you can zoom in on the historic events of this year like the
revolutions in Egypt, or Libya, and so on.
(Deep) Neural Networks
Studying how representations are built in (deep) neural networks
using the techniques developed for kernel methods.

G. Montavon, M. Braun, K.R. Müller,
Deep Boltzmann Machines as FeedForward Hierarchies,
International Conference on Artificial Intelligence and Statistics
(AISTATS), 2012
(abstract)
The deep Boltzmann machine is a powerful model that extracts the
hierarchical structure of observed data. While inference is
typically slow due to its undirected nature, we argue that the
emerging feature hierarchy is still explicit enough to be
traversed in a feedforward fashion. The claim is corroborated by
training a set of deep neural networks on real data and measuring
the evolution of the representation layer after layer. The
analysis reveals that the deep Boltzmann machine produces a
feedforward hierarchy of increasingly invariant representations
that clearly surpasses the layerwise approach.

G. Montavon, M. Braun, K.R. Müller,
Importance of CrossLayer Cooperation for Learning Deep Feature Hierarchies,
NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011
(abstract)
A common property of hierarchical models of the brain is their
capacity to integrate bottomup and topdown information in order
to distill the taskrelevant information from the sensory
noise. In this paper, we argue that such cooperation between upper
and lower layers is not only useful at prediction time but also at
learning time in order to build a successful feature
hierarchy. The claim is corroborated by training a set of deep
networks on real data and measuring the evolution of the
representation layer after layer. The analysis reveals that
crosslayer cooperation enables the emergence of a sequence of
increasingly invariant representations.

G. Montavon, M. Braun, K.R. Müller,
Kernel Analysis of Deep Networks,
Journal of Machine Learning Research (JMLR), 12, pp. 25632581, 2011
(abstract)
When training deep networks it is common knowledge that an
efficient and well generalizing representation of the problem is
formed. In this paper we aim to elucidate what makes the emerging
representation successful. We analyze the layerwise evolution of
the representation in a deep network by building a sequence of
deeper and deeper kernels that subsume the mapping performed by
more and more layers of the deep network and measuring how these
increasingly complex kernels fit the learning problem. We observe
that deep networks create increasingly better representations of
the learning problem and that the structure of the deep network
controls how fast the representation of the task is formed layer
after layer.

G. Montavon, M. Braun, K.R. Müller,
Layerwise analysis of deep networks with Gaussian kernels,
Advances in Neural Information Processing Systems 23 (NIPS
2010), pp. 16781686, 2010
(abstract)
Deep networks can potentially express a learning problem more
efficiently than local learning machines. While deep networks
outperform local learning machines on some problems, it is still
unclear how their nice representation emerges from their complex
structure. We present an analysis based on Gaussian kernels that
measures how the representation of the learning problem evolves
layer after layer as the deep network builds higherlevel abstract
representations of the input. We use this analysis to show
empirically that deep networks build progressively bet ter
representations of the learning problem and that the best
representations are obtained when the deep network discriminates
only in the last layers.
Machine Learning Open Source Software

S. Sonnenburg, M. L. Braun, C. S. Ong, S. Bengio, L. Bottou,
G. Holmes, Y. LeCun, K.R. Müller, F. Pereira, C. E. Rasmussen,
G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston,
R. Williamson
,
The Need for Open Source Software in Machine Learning,
Journal of Machine Learning Research, 8(Oct):24432466, 2007
(abstract)
Open source tools have recently reached a level of maturity
which makes them suitable for building largescale realworld
systems. At the same time, the field of machine learning has
developed a large body of powerful learning algorithms for diverse
applications. However, the true potential of these methods is not
used, since existing implementations are not openly shared,
resulting in software with low usability, and weak
interoperability. We argue that this situation can be
significantly improved by increasing incentives for researchers to
publish their software under an open source model. Additionally,
we outline the problems authors are faced with when trying to
publish algorithmic implementations of machine learning
methods. We believe that a resource of peer reviewed software
accompanied by short articles would be highly valuable to both the
machine learning and the general scientific community.
Kernel Methods
Eigenstructure of the Kernel Matrix; Kernel PLS; Kernel Methods

N. Krämer, M. Sugiyama, M.L. Braun,
Lanczos Approximations for the Speedup of Kernel Partial
Least Squares Regression
,
JMLR Workshop and Conference Proceedings, Volume 5: AISTATS 2009
(abstract)
The runtime for Kernel Partial Least Squares (KPLS) to compute the
fit is quadratic in the number of examples. However, the necessity
of obtaining sensitivity measures as degrees of freedom for model
selection or confidence intervals for more detailed analysis
requires cubic runtime, and thus constitutes a computational
bottleneck in realworld data analysis. We propose a novel
algorithm for KPLS which not only computes (a) the fit, but also
(b) its approximate degrees of freedom and (c) error bars in
quadratic runtime. The algorithm exploits a close connection
between Kernel PLS and the Lanczos algorithm for approximating the
eigenvalues of symmetric matrices, and uses this approximation to
compute the trace of powers of the kernel matrix in quadratic runtime.

M.L. Braun, J. Buhmann, K.R. Müller,
On Relevant Dimensions in Kernel Feature Spaces,
JMLR 9(Aug):18751908, 2008. (software)
(more information)
(abstract)
We show that the relevant information of a supervised learning
problem is contained up to negligible error in a finite number of
leading kernel PCA components if the kernel matches the underlying
learning problem in the sense that it can asymptotically represent
the function to be learned and is sufficiently smooth. Thus,
kernels do not only transform data sets such that good
generalization can be achieved using only linear discriminant
functions, but this transformation is also performed in a manner
which makes economical use of feature space dimensions. In the
best case, kernels provide efficient implicit representations of
the data for supervised learning problems. Practically, we propose
an algorithm which enables us to recover the number of leading
kernel PCA components relevant for good classification. Our
algorithm can therefore be applied (1) to analyze the interplay of
data set and kernel in a geometric fashion, (2) to aid in model
selection, and (3) to denoise in feature space in order to yield
better classification results.

N. Krämer, M.L. Braun,
Kernelizing PLS, Degress of Freedom, and Efficient Model Selection,
ICML 2007
(abstract)
Kernelizing partial least squares (PLS), an algorithm which has
been particularly popular in chemometrics, leads to kernel PLS
which has several interesting properties, including a subcubic
runtime for learning, and an iterative construction of directions
which are relevant for predicting the outputs. We show that the
kernelization of PLS introduces interesting properties not found
in ordinary PLS, giving novel insights into the workings of kernel
PLS and the connections to kernel ridge regression and conjugate
gradient descent methods. Furthermore, we show how to correctly
define the degrees of freedom for kernel PLS and how to
efficiently compute an unbiased estimate. Finally, we address the
practical problem of model selection. We demonstrate how to use
the degrees of freedom estimate to perform effective model
selection, and discuss how to implement crossvalidation schemes
efficiently.

M. L. Braun,
Accurate bounds for the eigenvalues of the kernel matrix,
Journal of Machine Learning Research, Vol. 7(Nov) 23032328, 2006
(abstract)
The eigenvalues of the kernel matrix play an important role in a
number of kernel methods, in particular, in kernel principal
component analysis. It is well known that the eigenvalues of the
kernel matrix converge as the number of samples tends to
infinity. We derive probabilistic finite sample size bounds on the
approximation error of individual eigenvalues which have the
important property that the bounds scale with the eigenvalue under
consideration, reflecting the actual behavior of the approximation
errors as predicted by asymptotic results and observed in
numerical simulations. Such scaling bounds have so far only been
known for tail sums of eigenvalues. Asymptotically, the bounds
presented here have a slower than stochastic rate, but the number
of sample points necessary to make this disadvantage noticeable is
often unrealistically large. Therefore, under practical
conditions, and for all but the largest few eigenvalues, the
bounds presented here form a significant improvement over existing
nonscaling bounds.

M.L. Braun, J. Buhmann, K.R. Müller,
Denoising and Dimension Reduction in Feature Space,
Advances in Neural Information Processing Systems 19 (NIPS 2006),
B. Schölkopf, J. Platt, and T. Hoffman (eds.), MIT Press,
Cambridge, MA, 185192, 2007
(abstract)
We show that the relevant information about a classification
problem in feature space is contained up to negligible error in a
finite number of leading kernel PCA components if the kernel
matches the underlying learning problem. Thus, ker nels not only
transform data sets such that good generalization can be achieved
even by linear discriminant functions, but this transformation is
also performed in a manner which makes economic use of feature
space dimensions. In the best case, kernels provide efficient
implicit representations of the data to perform
classification. Practically, we propose an algorithm which enables
us to recover the subspace and dimensionality relevant for good
classification. Our algorithm can therefore be applied (1) to
analyze the interplay of data set and kernel in a geo metric
fashion, (2) to help in model selection, and to (3) denoise in
feature space in order to yield better classification
results.

M.L. Braun, T. Lange, J. Buhmann,
Model Selection in Kernel Methods based on a Spectral Analysis of
Label Information
,
DAGM 2006, LNCS 4174, pp. 344353
(abstract)
We propose a novel method for addressing the model selection
problem in the context of kernel methods. In contrast to existing
methods which rely on holdout testing or try to compensate for
the optimism of the generalization error, our method is based on a
structural analysis of the label information using the
eigenstructure of the kernel matrix. In this setting, the label
vector can be transformed into a representation in which the
smooth information is easily discernible from the noise. This
permits to estimate a cutoff dimension such that the leading
coefficients in that representation contains the learnable
information, discarding the noise. Based on this cutoff
dimension, the regularization parameter is estimated for kernel
ridge regression.

M. L. Braun,
Spectral Properties of the Kernel Matrix and their Relation
to Kernel Methods in Machine Learning
,
Ph.D. thesis, University of Bonn, 2005. Published electronically.
(abstract)
Machine learning is an area of research concerned with the
construction of algorithms which are able to learn from
examples. Among such algorithms, socalled kernel methods form an
important family of algorithms which have proven to be powerful
and versatile for a large number of problem areas. Central to
these approaches is the kernel matrix which summarizes the
information contained in the training examples. The goal of this
thesis is to analyze machine learning kernel methods based on
properties of the kernel matrix. The algorithms considered are
kernel principal component analysis and kernel ridge
regression. This thesis is divided into two parts: a theoretical
part devoted to studying the spectral properties of the kernel
matrix, and an application part which analyzes the kernel
principal component analysis method and kernel based regression
based on these theoretical results.
In the theoretical part, convergence properties of the
eigenvalues and eigenvectors of the kernel matrix are studied. We
derive accurate bounds on the approximation error which have the
important property that the error bounds scale with the magnitude
of the eigenvalue, predicting correctly that the approximation
error of small eigenvalues is much smaller than that of large
eigenvalues. In this respect, the results improve significantly on
existing results. A similar result is proven for scalar products
with eigenvectors of the kernel matrix. It is shown that the
scalar products with eigenvectors corresponding to small
eigenvalues are small a priori independently of the degree of
approximation.
In the application part, we discuss the following topics. For
kernel principal component analysis, we show that the estimated
eigenvalues approximate the true principal values with high
precision. Next, we discuss the general setting of kernel based
regression and show that the relevant information of the labels is
contained in the first few coefficients of the label vector in the
basis of eigenvectors of the kernel matrix, such that the
information and the noise can be divided much more easily in this
representation. Finally, we show that kernel ridge regression
works by suppressing all but the leading coefficients, thereby
extracting the relevant information of the label vectors. This
interpretation suggests an estimate of the number of relevant
coefficients in order to perform model selection. In an
experimental evaluation, this approach proves to perform
competitively to stateoftheart methods.
Brain Computer Interface
Detecting conditions of high mental and auditory workload based on
EEG recordings under realworld conditions.

C. Sannelli, M. Braun, and K.R. Müller,
Improving BCI performance by taskrelated trial pruning,
Neural Networks, Volume 22, Issue 9, November 2009, Pages
12951304, Special Issue on BrainMachine Interface
(abstract)
Noise in electroencephalography data (EEG) is an ubiquitous
problem that limits the performance of brain computer interfaces
(BCI). While typical EEG artifacts are usually removed by trial
rejection or by filtering, noise induced in the data by the subject's
failure to produce the required mental state is very harmful. Such
"noise" effects are rather common, especially for naive subjects
in their training phase and, thus, standard artifact removal methods
would inevitably fail. In this paper, we present a novel method
which aims to detect such defected trials taking into account the
intended task by use of Relevant Dimensionality Estimation (RDE), a
new machine learning method for denoising in feature space. In this
manner, our method effectively "cleans" the training data and thus
allows better BCI classification. Preliminary results conducted on a
data set of 43 naive subjects show a significant improvement for 74%
of the subjects.

C. Sannelli, M. Braun, M. Tangermann, and K.R. Müller,
Estimating Noise and Dimensionality in BCI Data Sets: Towards BCI
Illiteracy Comprehension
,
In Proceedings of the 4th International BrainComputer Interface
Workshop and Training Course 2008. Verlag der Technischen
Universität Graz, 2008 (won the young scientists best talk award)

J. Kohlmorgen, G. Dornhege, M.L. Braun, B. Blankertz,
K.R. Müller, G. Curio, K. Hagemann, A. Bruns, M. Schrauf,
W.E. Kincses
,
Improving Human Performance in a Real Operating Environment
through RealTime Mental Workload Detection
,
in Towards BrainComputer Interfacing, G. Dornhege,
J. del R. Millián, T. Hinterberger, D. McFarland,
K.R. Müller (Eds), MIT Press, 2007
Stability Based Model Selection
A general framework for estimating the right number of clusters
based on the stability of the clustering solutions.

T. Lange, V. Roth, M. Braun and J. Buhmann,
StabilityBased Validation of Clustering Solutions,
Neural Computation 16:6, June 2004.
(abstract)
Data clustering describes a set of frequently employed techniques
in exploratory data analysis to extract "natural" group structure
in data. Such groupings need to be validated to separate the
signal in the data from spurious structure. In this context,
finding an appropriate number of clusters is a particularly
important model selection question. We introduce a measure of
cluster stability to assess the validity of a cluster model. This
stability measure quantifies the reproducibility of clustering
solutions on a second sample, and it can be interpreted as a
classification risk with regard to class labels produced by a
clustering algorithm. The preferred number of clusters is
determined by minimizing this classification risk as a function of
the number of clusters. Convincing results are achieved on
simulated as well as gene expression data sets. Comparisons to
other methods demonstrate the competitive performance of our
method and its suitability as a general validation tool for
clustering solutions in realworld problems.

T. Lange, M. Braun, V. Roth, and J. Buhmann,
StabilityBased Model Selection,
NIPS 2002

V. Roth, M. Braun, T. Lange, and J. Buhmann,
StabilityBased Model Order Selection in Clustering with
Applications to Gene Expression Data
,
ICANN 2002, pp. 607612, Lecture Notes in Computer Science
2415, José R. Dorronsoro (Ed.)

V. Roth, T. Lange, M. Braun, J. Buhmann,
A Resampling Approach to Cluster Validation,
COMPSTAT 2002
Noisy TSP
Applying methods from statistical physics to robustly sample good
solutions for combinatorial optimization problems like the
travelling salesman problem.

M. Braun, J. Buhmann,
The Noisy Euclidian Traveling Salesman Problem and Learning,
in: T. G. Diettrich, S. Becker, Z. Ghahramani (Eds.), Advances in
Information Processing Systems 14, pp. 251258, MIT Press
(abstract)
We consider noisy Euclidean traveling salesman problems in the
plane, whcih are random combinatorial problems with underlying
structure. Gibbs sampling is sused to compute average
trajectories, which estimate the underlying structure common to
all instances. This procedure requires identifying the exact
relationship between permutations and tours. In a learning
setting, the average trajectory is used as a model to construct
solutions to new instances sampled from the same
source. Experimental results show that the average trajectory can
in fact estimate the underlying structure and that overfitting
effects occur if the trajectory adapts too closely to a single
instance.
Retina Implant
From Mar. 1997 to Sept. 1999, I worked as a lab assitant in the Retina
Implant project (which was a registered project of the
EXPO2000) with Michael Becker. The goal was to construct a retinal
prosthesis for people with macula
degeneration.

M. Becker, M. Braun, R. Eckmiller,
Retina Implant Adjustment with Reinforcement Learning,
Proc. ICASSP98, IEEE Press, 1998, Vol. II, pp. 11811184

M. Becker, M. Braun, R. Eckmiller,
Retina Encoder Inversion for Retina Implant Simulation,
Niklasson L. et al (eds), Proc. ICANN98, Springer 1998, pp. 791796