Ruben Ohana

Biography

I am currently a Research Fellow at the Center for Computational Mathematics of the Flatiron Institute (Simons Foundation) in New York City. My current research interests are the optimization of large deep learning models (e.g., transformers and diffusion models), optimal transport, and machine learning for solving scientific problems. I am part of the Polymathic AI initiative, which aims to build foundation models for science.

In 2022, I obtained my PhD in machine learning from the École Normale Supérieure, supervised by Florent Krzakala (EPFL, formerly at ENS), Alessandro Rudi (INRIA - DIENS, Sierra Team) and Laurent Daudet (LightOn). In parallel with my PhD, I was a research scientist at the startup LightOn.

My PhD research focused on random features for kernel approximation, high-dimensional probability, random matrices, and alternative training methods for deep learning, with applications in differential privacy and adversarial robustness. I also explored new machine learning applications of the Optical Processing Units developed by LightOn.

During my PhD, I did a four-month internship at the Criteo AI Lab under the supervision of Liva Ralaivola and Alain Rakotomamonjy, where we built a PAC-Bayesian framework for Sliced-Wasserstein distances.

Before starting my PhD, I graduated with an engineering degree in Physics from ESPCI Paris, an MSc in Condensed Matter from the Master ICFP at the École Normale Supérieure, and an MSc in Statistics/Machine Learning from the Master in Mathematics at Sorbonne University.

During my studies, I did an internship at the NTT Basic Research Laboratories in Japan, where I worked experimentally and theoretically on the quantum spin Hall effect in InAs/GaSb double quantum wells, under the supervision of Hiroshi Irie. I did my first master's thesis at LIGO (the gravitational-wave observatory) at the Massachusetts Institute of Technology, where I studied the generation of the lasers to be integrated in the next upgrade of the interferometer, under the supervision of Peter Fritschel. I did my second master's thesis in the Quantum Information Group at LIP6, Sorbonne University, where I studied contextuality in quantum information networks, under the supervision of Damian Markham.

Contact

  • E-mail: rohana [at] flatironinstitute.org

  • 162 5th Avenue, New York City, NY, United States

Publications and Preprints

  • Removing Dust from CMB Observations with Diffusion Models. D. Heurtel-Depeiges, B. Burkhart, R. Ohana*, B. Régaldo-Saint Blancard*. Oral @ NeurIPS, ML and the Physical Sciences Workshop 2023. [arXiv]

    Abstract: In cosmology, the quest for primordial B-modes in cosmic microwave background (CMB) observations has highlighted the critical need for a refined model of the Galactic dust foreground. We investigate diffusion-based modeling of the dust foreground and its relevance for component separation. Under the assumption of a Gaussian CMB with known cosmology (or covariance matrix), we show that diffusion models can be trained on examples of dust emission maps such that their sampling process directly coincides with posterior sampling in the context of component separation. We illustrate this on simulated mixtures of dust emission and CMB. We show that common summary statistics (power spectrum, Minkowski functionals) of the components are well recovered by this process. We also introduce a model conditioned on the CMB cosmology that outperforms models trained using a single cosmology on component separation. Such a model will be used in future work for diffusion-based cosmological inference.

  • Multiple Physics Pretraining for Physical Surrogate Models. Polymathic AI. Best Paper Award & Oral @ NeurIPS, AI4Science Workshop 2023. [arXiv]

    Abstract: We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling. MPP involves training large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks. In order to learn effectively in this setting, we introduce a shared embedding and normalization strategy that projects the fields of multiple systems into a single shared embedding space. We validate the efficacy of our approach on both pretraining and downstream tasks over a broad fluid mechanics-oriented benchmark. We show that a single MPP-pretrained transformer is able to match or outperform task-specific baselines on all pretraining sub-tasks without the need for finetuning. For downstream tasks, we demonstrate that finetuning MPP-trained models results in more accurate predictions across multiple time-steps on new physics compared to training from scratch or finetuning pretrained video foundation models. We open-source our code and model weights trained at multiple scales for reproducibility and community experimentation.
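
    A minimal PyTorch sketch of the shared-embedding idea (field names, the RMS normalization, and module names are illustrative assumptions, not the paper's exact architecture): each physical field is normalized per sample and projected by its own learned linear map into one shared embedding space, so heterogeneous systems can feed a single backbone.

    ```python
    import torch
    import torch.nn as nn

    class SharedFieldEmbedding(nn.Module):
        """Illustrative sketch: per-field normalization + per-field linear
        projection into a single shared embedding space."""
        def __init__(self, field_names, embed_dim):
            super().__init__()
            self.proj = nn.ModuleDict({f: nn.Linear(1, embed_dim) for f in field_names})

        def forward(self, fields):
            # fields: dict mapping field name -> tensor of shape (batch, n_points)
            out = 0.0
            for name, x in fields.items():
                rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt().clamp_min(1e-6)
                out = out + self.proj[name]((x / rms).unsqueeze(-1))
            return out  # (batch, n_points, embed_dim)

    emb = SharedFieldEmbedding(["density", "velocity_x"], embed_dim=32)
    tokens = emb({"density": torch.randn(4, 100), "velocity_x": torch.randn(4, 100)})
    ```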

  • xVal: A Continuous Number Encoding for Large Language Models. Polymathic AI. NeurIPS, AI4Science Workshop 2023. [arXiv]

    Abstract: Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.
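
    As a rough illustration of the encoding (the vocabulary and names below are toy assumptions), every number in the input shares a single [NUM] token whose embedding vector is multiplied by the numerical value; a separate regression head would then read values back out at [NUM] positions.

    ```python
    import torch
    import torch.nn as nn

    vocab = {"[NUM]": 0, "T=": 1, "rho=": 2}  # toy vocabulary
    embed = nn.Embedding(len(vocab), 16)

    def xval_encode(token_ids, values):
        # values holds the numeric multiplier per position (1.0 for ordinary text)
        e = embed(torch.tensor(token_ids))             # (seq_len, dim)
        return e * torch.tensor(values).unsqueeze(-1)  # scale embedding by the value

    # "T= 0.73 rho= 1.4": text tokens keep scale 1.0, numbers reuse [NUM]
    x = xval_encode([vocab["T="], vocab["[NUM]"], vocab["rho="], vocab["[NUM]"]],
                    [1.0, 0.73, 1.0, 1.4])
    ```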

  • AstroCLIP: Cross-Modal Pre-Training for Astronomical Foundation Models. Polymathic AI. NeurIPS, AI4Science Workshop 2023. [arXiv]

    Abstract: We present AstroCLIP, a strategy to facilitate the construction of astronomical foundation models that bridge the gap between diverse observational modalities. We demonstrate that a cross-modal contrastive learning approach between images and optical spectra of galaxies yields highly informative embeddings of both modalities. In particular, we apply our method on multi-band images and optical spectra from the Dark Energy Spectroscopic Instrument (DESI), and show that: (1) these embeddings are well-aligned between modalities and can be used for accurate cross-modal searches, and (2) these embeddings encode valuable physical information about the galaxies -- in particular redshift and stellar mass -- that can be used to achieve competitive zero- and few-shot predictions without further finetuning. Additionally, in the process of developing our approach, we also construct a novel, transformer-based model and pretraining approach for processing galaxy spectra.
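
    The cross-modal objective is CLIP-style contrastive learning; a generic sketch of such a loss (not the AstroCLIP code itself) pairs each galaxy's image embedding with its spectrum embedding and treats all other pairs in the batch as negatives:

    ```python
    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, spec_emb, temperature=0.07):
        img = F.normalize(img_emb, dim=-1)    # (batch, dim)
        spec = F.normalize(spec_emb, dim=-1)
        logits = img @ spec.T / temperature   # cosine similarity matrix
        labels = torch.arange(len(img))       # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.T, labels))
    ```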

  • MoMo: Momentum Models for Adaptive Learning Rates. F. Schaipp, R. Ohana, M. Eickenberg, A. Defazio, R. M. Gower. [arXiv]

    Abstract: Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new adaptive learning rates that can be used with any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (Stochastic gradient descent with momentum). MoMo uses momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. We then approximately minimize this model at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam - which is Adam with our new model-based adaptive learning rate. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound that are incorporated into our model. Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, ImageNet, recommender systems on the Criteo dataset, and a transformer model on the translation task IWSLT14.
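
    A heavily simplified sketch of the model-based step (illustrative only, not the exact MoMo update): maintain momentum averages of losses and gradients, then take a Polyak-style step toward the known lower bound, capped by a maximal learning rate.

    ```python
    import numpy as np

    def momo_like_step(x, loss_fn, grad_fn, state, beta=0.9, lr_max=1.0, f_star=0.0):
        f, g = loss_fn(x), grad_fn(x)
        state["d"] = beta * state.get("d", g) + (1 - beta) * g          # momentum gradient
        state["f_bar"] = beta * state.get("f_bar", f) + (1 - beta) * f  # momentum loss
        # Truncated model: the step never targets a loss below the bound f_star.
        tau = min(lr_max, max(0.0, state["f_bar"] - f_star)
                  / (np.dot(state["d"], state["d"]) + 1e-12))
        return x - tau * state["d"]
    ```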

  • Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. R. Ohana*, K. Nadjahi*, A. Rakotomamonjy, L. Ralaivola. ICML 2023. [arXiv]

    Abstract: The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
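
    For reference, the standard (uniform-slice) SW distance is straightforward to estimate by Monte Carlo: project both samples onto random directions and average the one-dimensional Wasserstein costs, which reduce to sorted differences for equal sample sizes. The paper's adaptive variant replaces the uniform slice distribution with a learned one. A minimal NumPy sketch:

    ```python
    import numpy as np

    def sliced_wasserstein(X, Y, n_slices=200, p=2, seed=0):
        # Assumes len(X) == len(Y); slices drawn uniformly on the sphere.
        rng = np.random.default_rng(seed)
        theta = rng.normal(size=(n_slices, X.shape[1]))
        theta /= np.linalg.norm(theta, axis=1, keepdims=True)
        xs = np.sort(X @ theta.T, axis=0)  # sorted 1D projections
        ys = np.sort(Y @ theta.T, axis=0)
        return (np.abs(xs - ys) ** p).mean() ** (1 / p)

    X, Y = np.random.randn(500, 10), np.random.randn(500, 10) + 1.0
    print(sliced_wasserstein(X, Y))
    ```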

  • Linear Optical Random Projections Without Holography. R. Ohana, D. Hesslow, D. Brunner, S. Gigan, K. Müller. Optics Express. [arXiv]

    Abstract: We introduce a novel method to perform linear optical random projections without the need for holography. Our method consists of a computationally trivial combination of multiple intensity measurements to mitigate the information loss usually associated with the absolute-square non-linearity imposed by optical intensity measurements. Both experimental and numerical findings demonstrate that the resulting matrix consists of real-valued, independent, and identically distributed (i.i.d.) Gaussian random entries. Our optical setup is simple and robust, as it does not require interference between two beams. We demonstrate the practical applicability of our method by performing dimensionality reduction on high-dimensional data, a common task in randomized numerical linear algebra with relevant applications in machine learning.
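
    One natural way to realize such a combination (the anchor-vector scheme below is our illustrative reconstruction, not necessarily the paper's exact protocol) uses the polarization identity: measuring the intensities of x + r, x, and r for a fixed reference input r yields a quantity that is exactly linear in x.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 128, 2000
    A = (rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))) / np.sqrt(2)
    intensity = lambda v: np.abs(A @ v) ** 2       # what the camera measures: |Av|^2

    x, r = rng.normal(size=n), rng.normal(size=n)  # input and fixed reference
    # |A(x+r)|^2 - |Ax|^2 - |Ar|^2 = 2 Re[(Ax) conj(Ar)], linear in x for fixed r
    y = 0.5 * (intensity(x + r) - intensity(x) - intensity(r))
    B = np.real(np.conj(A @ r)[:, None] * A)       # effective real projection matrix
    assert np.allclose(y, B @ x)
    ```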

  • Complex-to-Real Random Features for Polynomial Kernels. J. Wacker, R. Ohana, M. Filippone. AISTATS 2023. [AISTATS]

    Abstract: Kernel methods are ubiquitous in statistical modeling due to their theoretical guarantees as well as their competitive empirical performance. Polynomial kernels are of particular importance as their feature maps model the interactions between the dimensions of the input data. However, the construction time of explicit feature maps scales exponentially with the polynomial degree and a naive application of the kernel trick does not scale to large datasets. In this work, we propose Complex-to-Real (CtR) random features for polynomial kernels that leverage intermediate complex random projections and can yield kernel estimates with much lower variances than their real-valued analogs. The resulting features are real-valued, simple to construct and have the following advantages over the state-of-the-art: 1) shorter construction times, 2) lower kernel approximation errors for commonly used degrees, 3) they enable us to obtain a closed-form expression for their variance.
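
    To convey the flavor of complex random features (a baseline sketch; the paper's CtR construction builds real-valued features with further variance reduction), here is an unbiased estimator of the homogeneous polynomial kernel (x^T y)^p from products of complex Gaussian projections:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, p, D = 20, 3, 50000
    x = rng.normal(size=d); x /= np.linalg.norm(x)
    y = x + 0.3 * rng.normal(size=d); y /= np.linalg.norm(y)

    # Complex Gaussian projections with E[w w^H] = I, so E[<w,x> conj(<w,y>)] = x^T y
    W = (rng.normal(size=(D, p, d)) + 1j * rng.normal(size=(D, p, d))) / np.sqrt(2)
    z = lambda v: np.prod(W @ v, axis=1)  # z(v) = prod_k <w_k, v>
    k_hat = np.mean(np.real(z(x) * np.conj(z(y))))
    print(k_hat, (x @ y) ** p)            # estimate vs exact kernel
    ```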

  • Photonic Differential Privacy with Direct Feedback Alignment. R. Ohana*, H.J. Ruiz*, J. Launay*, A. Cappelli, I. Poli, L. Ralaivola, A. Rakotomamonjy. NeurIPS 2021. [arXiv]

    Abstract: Optical Processing Units (OPUs) -- low-power photonic chips dedicated to large scale random projections -- have been used in previous work to train deep neural networks using Direct Feedback Alignment (DFA), an effective alternative to backpropagation. Here, we demonstrate how to leverage the intrinsic noise of optical random projections to build a differentially private DFA mechanism, making OPUs a solution of choice for private-by-design training. We provide a theoretical analysis of our adaptive privacy mechanism, carefully measuring how the noise of optical random projections propagates in the process and gives rise to provable Differential Privacy. Finally, we conduct experiments demonstrating the ability of our learning procedure to achieve solid end-task performance.

  • ROPUST: Improving Robustness through Fine-tuning with Photonic Processors and Synthetic Gradients. A. Cappelli, J. Launay, L. Meunier, R. Ohana, I. Poli. Preprint, 2021. [arXiv]

    Abstract: Robustness to adversarial attacks is typically obtained through expensive adversarial training with Projected Gradient Descent. Here we introduce ROPUST, a remarkably simple and efficient method to leverage robust pre-trained models and further increase their robustness, at no cost in natural accuracy. Our technique relies on the use of an Optical Processing Unit (OPU), a photonic co-processor, and a fine-tuning step performed with Direct Feedback Alignment, a synthetic gradient training scheme. We test our method on nine different models against four attacks in RobustBench, consistently improving over state-of-the-art performance. We perform an ablation study on the single components of our defense, showing that robustness arises from parameter obfuscation and the alternative training method. We also introduce phase retrieval attacks, specifically designed to increase the threat level of attackers against our own defense. We show that even with state-of-the-art phase retrieval techniques, ROPUST remains an effective defense.

  • Adversarial Robustness by Design through Analog Computing and Synthetic Gradients. R. Ohana*, A. Cappelli*, J. Launay, L. Meunier, I. Poli, F. Krzakala. ICASSP 2022. [arXiv, Github]

    Abstract: We propose a new defense mechanism against adversarial attacks inspired by an optical co-processor, providing robustness without compromising natural accuracy in both white-box and black-box settings. This hardware co-processor performs a nonlinear fixed random transformation, where the parameters are unknown and impossible to retrieve with sufficient precision for large enough dimensions. In the white-box setting, our defense works by obfuscating the parameters of the random projection. Unlike other defenses relying on obfuscated gradients, we find we are unable to build a reliable backward differentiable approximation for obfuscated parameters. Moreover, while our model reaches a good natural accuracy with a hybrid backpropagation - synthetic gradient method, the same approach is suboptimal if employed to generate adversarial examples. We find the combination of a random projection and binarization in the optical system also improves robustness against various types of black-box attacks. Finally, our hybrid training method builds robust features against transfer attacks. We demonstrate our approach on a VGG-like architecture, placing the defense on top of the convolutional features, on CIFAR-10 and CIFAR-100.

  • Photonic co-processors in HPC: using LightOn OPUs for Randomized Numerical Linear Algebra. D. Hesslow, A. Cappelli, I. Carron, L. Daudet, R. Lafargue, K. Müller, R. Ohana, G. Pariente, I. Poli. Hot Chips 2021. [arXiv]

    Abstract: Randomized Numerical Linear Algebra (RandNLA) is a powerful class of methods, widely used in High Performance Computing (HPC). RandNLA provides approximate solutions to linear algebra functions applied to large signals, at reduced computational costs. However, the randomization step for dimensionality reduction may itself become the computational bottleneck on traditional hardware. Leveraging near constant-time linear random projections delivered by LightOn Optical Processing Units we show that randomization can be significantly accelerated, at negligible precision loss, in a wide range of important RandNLA algorithms, such as RandSVD or trace estimators.
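
    As an example of where the optical projection slots in, here is textbook randomized SVD (Halko et al.); the sketch A @ Omega is exactly the large random projection an OPU can compute in near constant time.

    ```python
    import numpy as np

    def rand_svd(A, rank, n_oversample=10, seed=0):
        rng = np.random.default_rng(seed)
        Omega = rng.normal(size=(A.shape[1], rank + n_oversample))  # random sketch
        Q, _ = np.linalg.qr(A @ Omega)  # orthonormal basis for the approximate range
        U_small, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return (Q @ U_small)[:, :rank], S[:rank], Vt[:rank]

    A = np.random.randn(1000, 40) @ np.random.randn(40, 500)  # low-rank test matrix
    U, S, Vt = rand_svd(A, rank=50)
    print(np.linalg.norm(A - (U * S) @ Vt) / np.linalg.norm(A))  # ~ machine precision
    ```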

  • The dynamics of learning with feedback alignment. M. Refinetti*, S. d'Ascoli*, R. Ohana, S. Goldt. ICML 2021. [arXiv, Github, Twitter thread]

    Abstract: Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to the ubiquitous backpropagation algorithm for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory for the success of DFA. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorize process occurs sequentially from the bottom layers of the network to the top.
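
    For readers unfamiliar with DFA, here is a minimal NumPy version on a two-layer network (a toy sketch, not the paper's experimental code): the output error reaches the hidden layer through a fixed random matrix B instead of the transpose of the forward weights.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, lr = 20, 64, 2, 0.05
    W1 = 0.1 * rng.normal(size=(n_in, n_hid))
    W2 = 0.1 * rng.normal(size=(n_hid, n_out))
    B = 0.1 * rng.normal(size=(n_out, n_hid))     # fixed random feedback weights

    X = rng.normal(size=(256, n_in))
    Y = np.eye(n_out)[(X[:, 0] > 0).astype(int)]  # toy binary task

    for _ in range(500):
        h = np.tanh(X @ W1)                       # forward pass
        e = h @ W2 - Y                            # output error (squared loss)
        W2 -= lr * h.T @ e / len(X)               # standard update at the top layer
        delta_h = (e @ B) * (1 - h ** 2)          # DFA: random projection of the error
        W1 -= lr * X.T @ delta_h / len(X)
    ```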

  • Experimental Approach to Demonstrating Contextuality for Qudits. A. Sohbi, R. Ohana, I. Zaquine, E. Diamanti, D. Markham. Physical Review A, 2020. [arXiv]

    Abstract: We propose a method to experimentally demonstrate contextuality with a family of tests for qudits. The experiment we propose uses a qudit encoded in the path of a single photon and its temporal degrees of freedom. We consider the impact of noise on the effectiveness of these tests, taking the approach of ontologically faithful non-contextuality. In this approach, imperfections in the experimental set up must be taken into account in any faithful ontological (classical) model, which limits how much the statistics can deviate within different contexts. In this way we bound the precision of the experimental setup under which ontologically faithful non-contextual models can be refuted. We further consider the noise tolerance through different types of decoherence models on different types of encodings of qudits. We quantify the effect of the decoherence on the required precision for the experimental setup in order to demonstrate contextuality in this broader sense.

  • Reservoir Computing meets Recurrent Kernels and Structured Transforms. R. Ohana*, J. Dong*, M. Rafayelyan, F. Krzakala. Oral @ NeurIPS 2020. [NeurIPS, Oral (starts at 46:30), arXiv, Github, Twitter thread]

    Abstract: Reservoir Computing is a class of simple yet efficient Recurrent Neural Networks where internal weights are fixed at random and only a linear output layer is trained. In the large size limit, such random neural networks have a deep connection with kernel methods. Our contributions are threefold: a) We rigorously establish the recurrent kernel limit of Reservoir Computing and prove its convergence. b) We test our models on chaotic time series prediction, a classic but challenging benchmark in Reservoir Computing, and show how the Recurrent Kernel is competitive and computationally efficient when the number of data points remains moderate. c) When the number of samples is too large, we leverage the success of structured Random Features for kernel approximation by introducing Structured Reservoir Computing. The two proposed methods, Recurrent Kernel and Structured Reservoir Computing, turn out to be much faster and more memory-efficient than conventional Reservoir Computing.
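
    A minimal echo-state reservoir in NumPy (a toy sketch of the setting studied in the paper): the recurrent and input weights stay fixed at random and only a ridge-regression readout is trained; in the large-reservoir limit this model converges to the corresponding recurrent kernel.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 500, 2000
    u = np.sin(0.1 * np.arange(T + 1)) + 0.1 * rng.normal(size=T + 1)  # toy series

    W = 0.9 * rng.normal(size=(N, N)) / np.sqrt(N)  # spectral radius ~0.9 for stability
    W_in = rng.normal(size=N)

    states, x = np.zeros((T, N)), np.zeros(N)
    for t in range(T):
        x = np.tanh(W @ x + W_in * u[t])            # fixed random reservoir update
        states[t] = x

    ridge = 1e-6                                    # only the linear readout is trained
    w_out = np.linalg.solve(states.T @ states + ridge * np.eye(N), states.T @ u[1:])
    pred = states @ w_out                           # one-step-ahead prediction
    ```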

  • Kernel Computations from large-scale random features obtained by Optical Processing Units. R. Ohana, J. Wacker, J. Dong, S. Marmin, F. Krzakala, M. Filippone, L. Daudet. ICASSP 2020. [ICASSP, arXiv, Github]

    Abstract: Approximating kernel functions with random features (RFs) has been a successful application of random projections for nonparametric estimation. However, performing random projections presents computational challenges for large-scale problems. Recently, a new optical hardware called Optical Processing Unit (OPU) has been developed for fast and energy-efficient computation of large-scale RFs in the analog domain. More specifically, the OPU performs the multiplication of input vectors by a large random matrix with complex-valued i.i.d. Gaussian entries, followed by the application of an element-wise squared absolute value operation - this last nonlinearity being intrinsic to the sensing process. In this paper, we show that this operation results in a dot-product kernel that has connections to the polynomial kernel, and we extend this computation to arbitrary powers of the feature map. Experiments demonstrate that the OPU kernel and its RF approximation achieve competitive performance in applications using kernel ridge regression and transfer learning for image classification. Crucially, thanks to the use of the OPU, these results are obtained with time and energy savings.
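
    The OPU's feature map is easy to simulate numerically: with a complex Gaussian matrix W, the intensity features |Wx|^2 give a dot-product kernel whose closed form for real inputs is k(x, y) = ||x||^2 ||y||^2 + (x^T y)^2, which the sketch below checks by Monte Carlo.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, D = 30, 50000
    x, y = rng.normal(size=d), rng.normal(size=d)

    W = (rng.normal(size=(D, d)) + 1j * rng.normal(size=(D, d))) / np.sqrt(2)
    phi = lambda v: np.abs(W @ v) ** 2 / np.sqrt(D)  # simulated intensity features
    print(phi(x) @ phi(y))                           # random-feature estimate
    print((x @ x) * (y @ y) + (x @ y) ** 2)          # closed-form kernel
    ```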

  • Impact of epitaxial strain on the topological-nontopological phase diagram and semimetallic behavior of InAs/GaSb composite quantum wells. H. Irie, T. Akiho, F. Couedo, R. Ohana, S. Suzuki, H. Onomitsu, K. Muraki. Physical Review B, 2020. [Phys. Rev. B, arXiv]

    Abstract: We study the influence of epitaxial strain on the electronic properties of InAs/GaSb composite quantum wells (CQWs), host structures for quantum spin Hall insulators, by transport measurements and eight-band k⋅p calculations. Using different substrates and buffer layer structures for crystal growth, we prepare two types of samples with vastly different strain conditions. CQWs with a nearly strain-free GaSb layer exhibit a resistance peak at the charge neutrality point that reflects the opening of a topological gap in the band-inverted regime. In contrast, for CQWs with 0.50% biaxial tensile strain in the GaSb layer, semimetallic behavior indicating a gap closure is found for the same degree of band inversion. Additionally, with the tensile strain, the boundary between the topological and nontopological regimes is located at a larger InAs thickness. Eight-band k⋅p calculations reveal that tensile strain in GaSb not only shifts the phase boundary but also significantly modifies the band structure, which can result in the closure of an indirect gap and make the system semimetallic even in the topological regime. Our results thus provide a global picture of the topological-nontopological phase diagram as a function of layer thicknesses and strain.

Patent

  • Method and System for machine learning using optical data. I. Poli, J. Launay, K. Müller, G. Pariente, I. Carron, L. Daudet, R. Ohana, D. Hesslow. US Patent, 2021. [Patent]

    Abstract: A system may include an optical source and an adjustable spatial light modulator coupled to the optical source. The system may further include a medium coupled to the adjustable spatial light modulator, and an optical detector coupled to the medium. The optical detector may obtain various optical signals that are transmitted through the medium at various predetermined spatial light modulations using the adjustable spatial light modulator. The system may further include a controller coupled to the optical detector and the adjustable spatial light modulator. The controller may train an electronic model using various synthetic gradients based on the optical signals.

PhD Manuscript

  • Leveraging (physical) randomness in machine learning algorithms. R. Ohana, PhD Thesis. [Manuscript]

    Abstract: In this thesis, we will leverage the use of randomness in multiple aspects of machine learning. We will start by showing the link between reservoir computing and recurrent kernels through the lens of random features, and introduce structured transforms to improve the computational complexity of reservoir computing. We will then show how optical computing can help scale up random features for kernel approximation at a low energy cost. We will continue by showing how to combine the Optical Processing Unit with training methods alternative to backpropagation, such as Direct Feedback Alignment, to train networks that are adversarially robust from the start, or to improve the robustness of state-of-the-art defenses. We will also optically train a neural network and show how the experimental noise yields differential privacy. We will finish by using PAC-Bayesian bounds to optimize, in a theoretically grounded way, the distribution of random projections of Sliced-Wasserstein distances.

Miscellaneous

  • Reviewer for ALT 2020, NeurIPS 2021-2022, ICML 2022-2023, WACV 2022, Nature Communications, and the Journal of Machine Learning Research (JMLR).
  • Talks: oral at NeurIPS 2020, Paris Machine Learning meetup 2020, Roscoff meeting 2019, invited speaker at Dauphine University, Golosino meetings at ENS 2019-2020, invited speaker at Criteo, and a talk at Supercomputing 23. I also organised an internal workshop on building GPT models from scratch at the Flatiron Institute.