Scott Pesme


Who am I

I am second year PhD student at EPFL in the Theory of Machine Learning group under the supervision of Nicolas Flammarion.

I graduated from École Polytechnique in 2019 and have a master degree from ENS Paris-Saclay in mathematics, vision and machine learning (MVA). I did my master thesis in convex optimisation at EPFL in the MLO group under the supervision of Aymeric Dieuleveut and Martin Jaggi.


Research interests

My main research interests are at the intersection between optimisation and statistics. Here is a selection of research topics I am interested in:

  • Optimisation methods in machine learning (convex and more recently non-convex)

  • Stochastic Differential Equations and how they can help understand optimisation algorithms


  • S. Pesme, L. Pillaud-Vivien, N. Flammarion. Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity.
    [arxiv, pdf], Preprint, 2021. [Show Abstract]

    Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.

  • S. Pesme, N. Flammarion. Online Robust Regression via SGD on the l1 loss.
    [poster, arxiv, pdf], NeurIPS, 2020. [Show Abstract]

    Abstract: We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner, one data point after the other. More specifically, for a true parameter $\theta^*$, we consider the corrupted Gaussian linear model $y = \langle x, \theta^* \rangle + \varepsilon + b$ where the adversarial noise $b$ can take any value with probability $\eta$ and equals zero otherwise. We consider this adversary to be oblivious (i.e., $b$ independent of the data) since this is the only contamination model under which consistency is possible. Current algorithms rely on having the whole data at hand in order to identify and remove the outliers. In contrast, we show in this work that stochastic gradient descent on the $\ell_1$ loss converges to the true parameter vector at a $\tilde{O}(1/(1− \eta )^2 n)$ rate which is independent of the values of the contaminated measurements. Our proof relies on the elegant smoothing of the non-smooth $\ell_1$ loss by the Gaussian data and a classical non-asymptotic analysis of Polyak-Ruppert averaged SGD. In addition, we provide experimental evidence of the efficiency of this simple and highly scalable algorithm.

  • S. Pesme, A. Dieuleveut, N. Flammarion.
    On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent.
    [arxiv, pdf], ICML, 2020. [Show Abstract]

    Abstract: Constant step-size Stochastic Gradient Descent exhibits two phases: a transient phase during which iterates make fast progress towards the optimum, followed by a stationary phase during which iterates oscillate around the optimal point. In this paper, we show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates. We analyse the classical statistical test proposed by Pflug (1983), based on the inner product between consecutive stochastic gradients. Even in the simple case where the objective function is quadratic we show that this test cannot lead to an adequate convergence diagnostic. We then propose a novel and simple statistical procedure that accurately detects stationarity and we provide experimental results showing state-of-the-art performance on synthetic and real-world datasets.

You can have a look at my Google scholar webpage, though you will not find any additional paper.