(This day is organized with the support of the Auvergne-Rhône-Alpes region/ARC6)
|9h00 - 9h30||Welcome|
|9h30 – 10h||Deep Learning and Reinforcement Learning: introduction and challenges
Jilles Dibangoye (INRIA, CITI, INSA - Lyon),
Christian Wolf (INRIA, CITI, LIRIS, INSA - Lyon)|
|10h – 10h45||Invited Speaker|
FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis Jakob Verbeek (INRIA, Grenoble)
|10h45 – 11h30||Invited Speaker|
Webly supervised learning of deep CNNs Hervé Le Borgne (CEA – LIST, Saclay)
|11h30 – 12h45||PhD student session (posters)|
Three Dimensional Deep Learning Approach for Remote Sensing Image Classification Amina Ben Hamida (LISTIC, Annecy)
Learning to recognize touch gestures: recurrent vs. convolutional features and dynamic sampling Quentin Debard (LIRIS, Lyon)
Coupled Ensemble of Neural Networks Anuvabh Dutt (LIG, Grenoble)
Deep Learning for Spatio-Temporal Modeling of Dynamic Spontaneous Emotions Dawood Al Chanti (GIPSA Lab, Grenoble)
Towards integrating spatial localization in convolutional neural networks for brain image segmentation Pierre-Antoine Ganaye (CREATIS, Lyon)
The Use of a Compositional Pattern Producing Network to Incorporate Spatial Localization during Medical Volume Segmentation Matthieu Martin (CREATIS, Lyon)
Unsupervised state representation learning with robotic priors: a robustness benchmark Antonin Raffin (Inria FLOWERS team, Paris)
|12h45 – 14h15||Buffet|
|14h15 – 15h00||Invited Speaker|
Recent theoretical insights on hierarchical reinforcement learning Alessandro Lazaric (Facebook AI Research)
|15h00 – 15h45||Recent advances on deep learning for medical image analysis Carole Lartizien (CREATIS, Lyon)|
|15h45 – 17h00||PhD student session (oral presentations)|
Autonomous learning of sensori-motor mappings for manipulation robotics François De La Bourdonnaye (Institut Pascal, Clermont-Ferrand)
A deep reinforcement learning approach for early classification of time series Coralie Martinez (GIPSA Lab, Bio Mérieux, FEMTO)
Deep Reinforcement Learning for Audio-Visual Gaze Control Stéphane Lathuilière (INRIA Perception, Grenoble)
The ever-growing content of Web images is probably the next important data source for scaling up deep neural networks, which already surpass humans in image classification tasks. Because deep networks are hungry for labeled data, they are limited in their ability to extract valuable information from Web images, which are abundant and cheap. There have been efforts to train neural networks such as autoencoders in unsupervised or semi-supervised settings. Nonetheless, their performance remains below that of supervised methods, partly because the loss functions used in unsupervised methods fail to guide the network to learn discriminative features and to ignore unnecessary details. We present some state-of-the-art methods that train convolutional neural networks (CNNs) to classify images with a limited number of manually annotated images, if any. The approach is named "webly supervised learning" since the annotated images result from queries to a web search engine, which dramatically reduces the annotation cost. The annotations are nonetheless noisy, in the sense that some annotations do not actually correspond to the assumed concept. We propose our own method, which consists in training convolutional networks in a supervised setting using webly labeled data. These data result from the collection of a large number of Web images downloaded from Flickr and Bing, further cleaned by different approaches. Our experiments are conducted at several data scales, with different choices of network architecture, and with different data preprocessing techniques. The effectiveness of our approach is shown by the good generalization of the learned representations on six publicly available datasets, both for image classification and fine-grained classification.
Convolutional neural networks (CNNs) have massively impacted visual recognition in 2D images, and are now ubiquitous in state-of-the-art approaches. I will present recent work on two issues: (i) CNNs do not easily extend to data that are not represented by regular grids, such as 3D shape meshes or other graph-structured data, to which traditional local convolution operators do not directly apply. (ii) It is not clear how to design CNN architectures, or how to search over the space of architectures. To address the first problem, we propose a novel graph-convolution operator to establish correspondences between filter weights and graph neighborhoods with arbitrary connectivity. The key novelty of our approach is that these correspondences are dynamically computed from features learned by the network, rather than relying on predefined static coordinates over the graph as in previous work. We obtain excellent experimental results that significantly improve over previous state-of-the-art shape correspondence results. This shows that our approach can learn effective shape representations from raw input coordinates, without relying on shape descriptors. Regarding the second problem, instead of aiming to select a single optimal architecture, we propose a "fabric" that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern. The only hyper-parameters of a fabric are the number of channels and layers. While individual architectures can be recovered as paths, the fabric in addition ensembles all embedded architectures together, sharing their weights where their paths overlap. Parameters can be learned using standard methods based on back-propagation, at a cost that scales linearly in the fabric size.
We present benchmark results competitive with the state of the art for image classification on MNIST and CIFAR10, and for semantic segmentation on the Part Labels dataset.
Recent theoretical insights on hierarchical reinforcement learning
Hierarchical reinforcement learning (HRL) aims at reducing the learning complexity by decomposing a given task into simpler subtasks. One of the most popular models of HRL is the option framework, which integrates options (i.e., macro-actions) into the standard Markov decision process (MDP) model. Despite good empirical results and a growing number of techniques to "discover" options, the understanding of how options may improve (or worsen) the learning complexity is still limited. In this talk, I will review the option framework and a few techniques for option discovery, and discuss recent theoretical findings that may provide more solid guidance for constructing options that provably improve the learning performance.
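As an illustration of what an option is, the sketch below represents a macro-action by the three components commonly used in the option framework: an initiation set, an internal policy and a termination condition. It is only a minimal sketch under stated assumptions: the 6-state chain environment, the names `Option`, `chain_step` and `run_option`, the discount factor and the deterministic termination test are all hypothetical choices made for this example and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple

State = int
Action = int


@dataclass
class Option:
    """A macro-action: where it may start, how it acts, when it stops."""
    initiation_set: Set[State]
    policy: Callable[[State], Action]
    terminates: Callable[[State], bool]  # assumption: deterministic termination


def chain_step(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical 6-state chain: actions +1/-1 move right/left.

    A reward of 1.0 is given when state 5 is entered.
    """
    next_state = max(0, min(5, state + action))
    reward = 1.0 if next_state == 5 and state != 5 else 0.0
    return next_state, reward


def run_option(option, state, step, gamma=0.9, max_steps=100):
    """Run an option until it terminates.

    Returns (final state, discounted return, number of primitive steps).
    """
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    total, discount, duration = 0.0, 1.0, 0
    while duration < max_steps:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        duration += 1
        if option.terminates(state):
            break
    return state, total, duration


# A "go to the right end" option, usable from every state except the goal.
go_right = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,
    terminates=lambda s: s == 5,
)

final_state, option_return, duration = run_option(go_right, 0, chain_step)
```

From the point of view of a higher-level learner, the whole call to `run_option` behaves like a single action that lasts `duration` primitive steps, which is why a well-chosen option can shorten the effective horizon of a task and a poorly chosen one can lengthen it.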
Deep learning-based methods have been pervading the medical image analysis community over the last three years. This results from the pivotal developments achieved in computer vision, especially regarding the ability of deep convolutional architectures to combine feature extraction and model learning into a unified framework, process very large training sets, transfer learned features between different databases, and analyse multimodal data. In this talk, we will briefly review the recent advances of deep learning methods for medical image analysis and present ongoing research in our lab in the domains of medical image segmentation, image and signal reconstruction (inverse problems), and unsupervised feature learning for anomaly detection. Emphasis will be put on the challenges that each of our specific applications raises.
Recently, a variety of approaches has been enriching the field of Remote Sensing (RS) image processing and analysis. Unfortunately, existing methods remain limited when faced with the rich spatio-spectral content of today’s large datasets. It is appealing to resort to Deep Learning (DL) based approaches at this stage, given their ability to offer accurate semantic interpretation of the data. However, the specificity introduced by the coexistence of spectral and spatial content in RS datasets widens the scope of the challenges in adapting DL methods to these contexts. Therefore, the aim of this paper is firstly to explore the performance of DL architectures for RS hyperspectral dataset classification and secondly to introduce a new three-dimensional DL approach that enables joint processing of spectral and spatial information. A set of three-dimensional schemes is proposed and evaluated. Experimental results based on well-known hyperspectral datasets demonstrate that the proposed method is able to achieve a better classification rate than state-of-the-art methods with lower computational costs.
We propose a fully automatic method for learning gestures on large touch devices in a potentially multi-user context. The goal is to learn general models capable of adapting to different gestures, user styles and hardware variations (e.g. device sizes, sampling frequencies and regularities). Based on deep neural networks, our method features a novel dynamic sampling and temporal normalization component, transforming variable-length gestures into fixed-length representations while preserving finger/surface contact transitions, that is, the topology of the signal. This sequential representation is then processed with a convolutional model capable, unlike recurrent networks, of learning hierarchical representations with different levels of abstraction. To demonstrate the interest of the proposed method, we introduce a new touch gesture dataset with 6591 gestures performed by 27 people, which is, to our knowledge, the first of its kind: a publicly available multi-touch gesture dataset for interaction. We also tested our method on a standard dataset for symbolic touch gesture recognition, the MMG dataset, outperforming the state of the art and reporting close-to-perfect performance. https://arxiv.org/abs/1802.09901
We present coupled ensembles of neural networks, a reconfiguration of existing neural network models into parallel branches. We empirically show that this modification leads to results on CIFAR and SVHN that are competitive with the state of the art, with a greatly reduced parameter count. Additionally, for a fixed parameter or training-time budget, coupled ensembles are significantly better than single-branch models. Preliminary results on ImageNet are also promising.
Facial expressions involve dynamic morphological changes in a face, conveying information about the expresser’s feelings. The discrimination of an expression depends to a great extent on the perception of these morphological patterns, which contribute significantly to expression recognition. In this paper, we aim at modeling human dynamic emotional behavior by taking into consideration the visual content of the face and its evolution through time. A 3D Convolutional Neural Network (3D-CNN) is used to learn and extract early local spatiotemporal features. The 3D-CNN is designed to capture subtle and fast spatiotemporal changes that may occur on the face. Then a Convolutional Long Short-Term Memory (ConvLSTM) network is designed to learn to interpret semantic information by taking into account longer spatiotemporal dependencies. The ConvLSTM network helps to consider the global visual saliency of the expression, that is, to locate and learn features in space and time that stand out from their local neighbors in order to signify distinctive facial expression features along the entire sequence. Non-invariant representations based on aggregating global spatiotemporal features at increasingly fine resolutions are then computed using a weighted Spatial Pyramid Pooling layer. The proposed model is validated and its performance is estimated on four distinct databases of acted and spontaneous facial expressions.
We focus on the segmentation of brain Magnetic Resonance Images (MRI) into cerebral structures using convolutional neural networks (CNNs). Different ways to introduce spatial constraints into the network are proposed, to further reduce prediction inconsistencies. A patch-based CNN architecture was trained, making use of multiple scales to gather contextual information. Spatial constraints were introduced within the CNN through a distance-to-landmarks feature or through the integration of a probability atlas.
Over the past few years, deep learning methods have greatly improved the state of the art in several biomedical imaging tasks. In particular, in medical volume segmentation problems, which can be seen as classification problems, convolutional neural networks (CNNs) have achieved very good performance in terms of accuracy and segmentation time. Nevertheless, the semantic segmentation of subvolumes does not allow the use of absolute spatial localization within volumes. This is a major drawback in the medical imaging field because the structures we are interested in segmenting usually have a precise location, so it often results in spatially obvious misclassifications. To reduce these misclassifications, we propose a neural network architecture able to provide absolute spatial information to the network. To do so, we combined a CNN with a Compositional Pattern Producing Network (CPPN) that generates spatial feature maps. To evaluate the performance of our network, we segmented two preterm neonate cerebral structures in 3D ultrasonographic reconstructed volumes. Our database was composed of 15 manually segmented volumes of size 256×256×256 voxels. The network achieved good segmentation results (Dice > 0.8) and the use of a CPPN reduced the mean absolute distance and the Hausdorff distance between the segmentations and the references.
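For readers unfamiliar with the Dice score quoted above, here is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), computed on two binary masks. The function name `dice_coefficient`, the flat-list representation of the masks and the convention for two empty masks are assumptions made for this example; the toy masks are not data from the study.

```python
def dice_coefficient(prediction, reference):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(prediction) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as foreground in both masks.
    intersection = sum(1 for p, r in zip(prediction, reference) if p and r)
    # Total foreground size of the two masks.
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 8-voxel masks: 3 voxels overlap, each mask has 4 foreground voxels.
predicted_mask = [0, 1, 1, 1, 0, 0, 1, 0]
reference_mask = [0, 1, 1, 0, 0, 0, 1, 1]
score = dice_coefficient(predicted_mask, reference_mask)  # 2 * 3 / (4 + 4) = 0.75
```

A 3D volume can be passed to the same function after flattening it to a single sequence of voxels.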
Our understanding of the world depends highly on our capacity to produce intuitive and simplified representations which can be easily used to solve problems. We reproduce this simplification process using a neural network to build a low-dimensional state representation of the world from images acquired by a robot. We learn in an unsupervised way, using prior knowledge about the world expressed as loss functions called robotic priors, and extend this approach to richer, higher-dimensional images to learn a 3D representation of the hand position of a robot from RGB images. We propose a quantitative evaluation of the learned representation using nearest neighbors in the state space, which allows us to assess its quality and to show both the potential and the limitations of robotic priors in realistic environments. We increase the image size and add distractors and domain randomization, all crucial components for achieving transfer learning to real robots. Finally, we also contribute a new prior to improve the robustness of the representation. The applications of such low-dimensional state representations range from easing reinforcement learning (RL) and knowledge transfer across tasks to facilitating learning from raw data with more efficient and compact high-level representations. The results show that the robotic prior approach is able to extract a high-level representation, such as the 3D position of an arm, and organize it into a compact and coherent space of states in a challenging dataset. https://arxiv.org/abs/1802.04181
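One way such a nearest-neighbor evaluation can work is sketched below: for every sample, find its k nearest neighbors in the learned state space and measure how far apart those neighbors are in a ground-truth space (for instance the true 3D hand position). This is a hedged illustration, not necessarily the exact metric of the paper; the function name `knn_mse`, the brute-force search and the toy data are assumptions made for this example.

```python
def knn_mse(learned_states, true_states, k=1):
    """Mean squared ground-truth distance to the k nearest learned-space neighbours.

    Lower values mean that points close together in the learned
    representation are also close in the ground-truth space.
    """
    n = len(learned_states)
    if n != len(true_states):
        raise ValueError("both lists must describe the same samples")
    if n <= k:
        raise ValueError("need more samples than neighbours")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = 0.0
    for i in range(n):
        # Brute-force neighbour search in the learned space (excluding i itself).
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(learned_states[i], learned_states[j]),
        )[:k]
        total += sum(sq_dist(true_states[i], true_states[j]) for j in neighbours) / k
    return total / n


# Toy 1-D ground truth, with one coherent and one scrambled learned representation.
truth = [(0,), (1,), (2,), (3,)]
coherent = [(0,), (10,), (20,), (30,)]   # preserves the ordering of the truth
scrambled = [(0,), (30,), (10,), (20,)]  # neighbours no longer match the truth

coherent_score = knn_mse(coherent, truth, k=1)
scrambled_score = knn_mse(scrambled, truth, k=1)
```

The coherent representation scores lower than the scrambled one, which is the behaviour expected of a quality measure for state representations.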
The objective is to make a robot learn an object reaching task with weak supervision. The rationale is that using little human supervision helps our system to be more autonomous. The task is learned using deep reinforcement learning and a stage-wise procedure of simpler tasks that require weak supervision. In particular, the robot learns to fixate objects and its own end-effector with a binocular robot camera system. These skills make it possible to estimate the distance between the end-effector and an object. From this distance, an informative reward signal can be computed, allowing the complete learning of the reaching task.
In many real-world applications, ranging from predictive maintenance to personalized medicine, early classification of time series data is of paramount importance for supporting decision makers. We address this challenging task with a novel approach based on reinforcement learning. We introduce an early classifier agent, an end-to-end reinforcement learning agent (deep Q-network, DQN) able to perform early classification efficiently. We formulate the early classification problem in a reinforcement learning framework: we introduce a suitable set of states and actions, and we define a specific reward function which aims at finding a compromise between earliness and classification accuracy. While most existing solutions do not explicitly take time into account in the final decision, this solution allows the user to set this trade-off in a more flexible way.
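To make the earliness/accuracy compromise concrete, the sketch below shows one possible reward of this kind: the agent either waits for one more time step at a small cost or commits to a class label and is rewarded or penalized for it. This is not the reward function of the paper, whose exact form is not given here; the `WAIT` action id, the `time_penalty` parameter and the +1/-1 values are assumptions made for this example.

```python
WAIT = -1  # hypothetical action id meaning "observe one more time step"


def early_classification_reward(action, true_label, t, series_length,
                                time_penalty=0.1):
    """Reward for a single decision taken at time step t (0-based)."""
    if action == WAIT:
        if t == series_length - 1:
            # No data left: waiting is treated like a wrong answer (assumption).
            return -1.0
        return -time_penalty
    return 1.0 if action == true_label else -1.0


def episode_return(actions, true_label, series_length, time_penalty=0.1):
    """Sum the rewards until the first classification or the end of the series."""
    total = 0.0
    for t, action in enumerate(actions):
        total += early_classification_reward(
            action, true_label, t, series_length, time_penalty
        )
        if action != WAIT or t == series_length - 1:
            break
    return total


# Same correct decision, taken after 1 or after 5 waiting steps.
early = episode_return([WAIT, 2], true_label=2, series_length=10)
late = episode_return([WAIT] * 5 + [2], true_label=2, series_length=10)
wrong = episode_return([3], true_label=2, series_length=10)
```

In this sketch, raising `time_penalty` pushes the agent towards earlier decisions and lowering it favours accuracy, which is the kind of user-controlled trade-off the abstract describes.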
We address the problem of audio-visual gaze control in the specific context of human-robot interaction, namely how controlled robot motions are combined with visual and acoustic observations in order to direct the robot head towards targets of interest. The paper makes the following contributions: (i) a novel audio-visual fusion framework that is well suited for controlling the gaze of a robotic head; (ii) a reinforcement learning (RL) formulation of gaze control, using a reward function based on the available temporal sequence of camera and microphone observations; and (iii) several deep architectures that allow experimenting with early and late fusion of audio and visual data. We introduce a simulated environment that enables us to learn the proposed deep RL model without the need to spend hours on tedious interaction. By thoroughly experimenting on a publicly available dataset and on a real robot, we provide empirical evidence that our method achieves state-of-the-art performance.
|Dawood||AL CHANTI||GIPSA-LAB, Grenoble-INP|
|Mohan Kumar||Arunachalam||IIIT Hyderabad (Student)|
|Youakim||Badr||INSA Lyon / LIRIS|
|Kiran||BANGALORE RAVI||AKKA technologies|
|Amina||Ben Hamida||LISTIC, Université Savoie Mont Blanc|
|Ivan||Castillo Camacho||GIPSA LAB - Grenoble INP|
|Sicheng||DAI||ENS de Lyon|
|Luc||Damas||LISTIC - USMB|
|François||de La Bourdonnaye||Institut Pascal|
|Lucien||Del Bosque||Laboratoire Hubert Curien|
|Christophe||Ducottet||Laboratoire Hubert Curien Saint-Etienne|
|Rémy||Emonet||Laboratoire Hubert Curien Saint-Etienne|
|Longkai||GUO||INSA DE LYON|
|Mikaël||Jacquemont||LAPP / LISTIC|
|Patrick||LAMBERT||LISTIC Université Savoie Mont Blanc|
|Tien-Nam||LE||ENS de Lyon|
|Sébastien||Lerique||ENS Lyon / Inria|
|Navneet||Madhu Kumar||TU Munich|
|Serge||Miguet||LIRIS - Lyon 2|
|Samba Ndojh||Ndiaye||LIRIS - Université Lyon 1|
|Juan Felipe||Perez Juste||CREATIS|
|Antonin||Raffin||Inria FLOWERS team, Paris|
|David||Urban||GIPSA-lab / IKOS RA|
|Christian||WOLF||INRIA, CITI, LIRIS, INSA-Lyon|
|Chen||Wu||Laboratoire Hubert Curien Saint-Etienne|
|Yongzhe||YAN||Institut pascal - LIRIS|
|Heng||Zhang||Laboratoire Hubert Curien Saint-Etienne|
Logistical constraints impose a limited number of registrations.
We have reached this limit.
We regret to inform you that registration is closed.