ROBOVIS 2025 Abstracts


Area 1 - Computer Vision

Full Papers
Paper Nr: 18
Title:

Cut-and-Splat: Leveraging Gaussian Splatting for Synthetic Data Generation

Authors:

Bram Vanherle, Brent Zoomers, Jeroen Put, Frank Van Reeth and Nick Michiels

Abstract: Generating synthetic images is a useful method for cheaply obtaining labeled data for training computer vision models. However, obtaining accurate 3D models of relevant objects is necessary, and the resulting images often have a gap in realism due to challenges in simulating lighting effects and camera artifacts. We propose using the novel view synthesis method called Gaussian Splatting to address these challenges. We have developed a synthetic data pipeline for generating high-quality context-aware instance segmentation training data for specific objects. This process is fully automated, requiring only a video of the target object. We train a Gaussian Splatting model of the target object and automatically extract the object from the video. Leveraging Gaussian Splatting, we then render the object on a random background image, and monocular depth estimation is employed to place the object in a believable pose. We introduce a novel dataset to validate our approach and show superior performance over other data generation approaches, such as Cut-and-Paste and Diffusion model-based generation.
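
The depth-aware placement can be pictured with a toy compositing routine; this is our illustrative sketch (the anchor choice, reference depth, and nearest-neighbor resizing are assumptions, not the paper's pipeline):

```python
import numpy as np

def nn_resize(img, h, w):
    """Nearest-neighbor resize, kept dependency-free for the sketch."""
    ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
    return img[ys][:, xs]

def composite(background, depth, obj_rgba, anchor_xy, ref_depth=2.0):
    """Scale the rendered object by the background depth at the anchor point
    and alpha-blend it onto the background around that point."""
    x, y = anchor_xy
    scale = ref_depth / max(float(depth[y, x]), 1e-3)       # nearer -> larger
    h = max(1, int(obj_rgba.shape[0] * scale))
    w = max(1, int(obj_rgba.shape[1] * scale))
    obj = nn_resize(obj_rgba, h, w)
    y0, x0 = max(0, y - h // 2), max(0, x - w // 2)
    patch = background[y0:y0 + h, x0:x0 + w].astype(float)
    obj = obj[:patch.shape[0], :patch.shape[1]].astype(float)
    alpha = obj[..., 3:4] / 255.0                           # object alpha channel
    blended = alpha * obj[..., :3] + (1 - alpha) * patch
    background[y0:y0 + h, x0:x0 + w] = blended.astype(np.uint8)
    return background
```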

Paper Nr: 33
Title:

Silent Speech Interface Based on Surface Electromyography Using Arduino and Ensemble Learning

Authors:

William Kyle Deveza, Joseph Bryan Ibarra and Marloun Sejera

Abstract: A silent speech interface (SSI) allows people to interact with machines. It addresses the use of machines or devices in noisy environments or by people incapable of producing audible speech. In this study, an SSI was built by classifying spectrogram images of the surface electromyography signals recorded at the corners of the mouths of seven participants. The signals were digitized by an Arduino board connected to a PC via Bluetooth Low Energy, and were filtered and amplified as part of preprocessing. The short-time Fourier transform was then applied to convert the signals to the time-frequency domain, from which the spectrogram images were created. Afterward, the images were preprocessed using ResNetV2. Classification was performed by four ensembles of four deep learning base models: ResNet50V2, VGG16, CNN-LSTM, and MLP. The accuracy scores of the base models were 80.00 %, 70.00 %, 81.67 %, and 71.67 %. The soft-voting ensemble, neural network stacked ensemble, linear SVC stacked ensemble, and SVC stacked ensemble achieved accuracy scores of 90.00 %, 81.67 %, 91.67 %, and 93.33 %, respectively. The results indicate that the proposed silent speech interface is viable and could be improved through certain recommended actions.
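
As a point of reference, soft voting — the simplest of the four ensembles listed above — amounts to averaging the base models' softmax outputs; this is a minimal illustration, not the authors' implementation:

```python
# Soft voting: average the per-class probabilities of the base models and
# pick the class with the highest mean probability.
import numpy as np

def soft_vote(prob_sets):
    """prob_sets: list of (n_samples, n_classes) softmax outputs, one per model."""
    mean_probs = np.mean(np.stack(prob_sets, axis=0), axis=0)
    return np.argmax(mean_probs, axis=1)

# Stand-in predictions for 4 base models, 60 samples, 5 classes.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5), size=60) for _ in range(4)]
labels = soft_vote(probs)
```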

Paper Nr: 36
Title:

Evaluating Pose Awareness and 3D Consistency in Semantic Matching

Authors:

Paolo Sebeto, Jean-Baptiste Weibel, Christian Hartl-Nesic and Markus Vincze

Abstract: Semantic matching is increasingly adopted in robotics as a flexible approach for generalizing point transfer across different manipulation tasks, creating a need for suitable benchmark datasets and a reassessment of existing evaluation metrics. Point correspondences used for manipulation require not only pixel-level precision but also accuracy in 3D space. Current methods and evaluation metrics suffer from a bias toward the pixel position relative to the image at the expense of accurate 3D localization. Orientation differences of objects between source and target images severely affect matching results. In this paper, we introduce a novel evaluation procedure for assessing pose-aware and 3D-consistent semantic matching, supported by a synthetic dataset, SemanticHouseCat3D, that includes rich annotations of household objects. Our evaluation features a new orientation-based assessment that bins point-matching metrics according to changes in object orientation and precisely measures the 3D accuracy of estimated image points after their re-projection. Using SemanticHouseCat3D, we conduct an exhaustive evaluation of state-of-the-art methods and investigate the influence of full-image context on performance compared to techniques like object cropping and segmentation. Our results indicate that while foundation models are reliable and adaptable across different domains, there is a critical need for improving semantic matching methods in terms of pose awareness and feature localization. The dataset is available at https://sites.google.com/view/semantichousecat3d/.
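
The orientation-binned assessment can be pictured as follows — a hedged sketch in which the angles, bin edges, and per-pair success flags are stand-ins rather than the benchmark's actual protocol:

```python
# Bin a point-matching success metric by the relative object orientation
# between the source and target views.
import numpy as np

def bin_by_orientation(angles_deg, correct, bin_edges=(0, 30, 60, 90, 120, 150, 180)):
    """angles_deg: per-pair rotation difference; correct: per-pair success flag."""
    angles_deg = np.asarray(angles_deg)
    correct = np.asarray(correct, dtype=float)
    results = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (angles_deg >= lo) & (angles_deg < hi)
        results[f"{lo}-{hi} deg"] = correct[mask].mean() if mask.any() else float("nan")
    return results
```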

Paper Nr: 42
Title:

Learn Where I Can Walk: Auto-Labeling of Walked Areas Using Monocular Camera Trajectory

Authors:

Helmut Engelhardt, Matthias Kalenberg, Jörg Franke and Sina Martin

Abstract: This paper presents Learn Where I Can Walk (LWICW), a novel auto-labeling approach for segmenting walked areas using a trajectory estimated from monocular camera image sequences, aimed at training supervised segmentation models for the navigation of visually impaired people. The proposed method uses images sourced from Mapillary, a collaborative platform for sharing street-level images. The approach involves extracting the walked path of the camera operator from the camera poses, filtering occluded walking-path poses using Depth Anything V2, and applying the Segment Anything Model 2 (SAM 2) for segmentation. The LWICW auto-labels are validated against a manually labeled dataset from Mapillary and compared to the state-of-the-art zero-shot segmentation model Grounded SAM 2. The LWICW method achieves an overall mean Intersection over Union (mIoU) of 93.9 % and a mean F1 score (mF1) of 96.6 %, improvements of +1.6 percentage points in mIoU and +1.0 percentage points in mF1 over the Grounded SAM 2 approach.
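
For readers unfamiliar with the reported metrics, the following minimal sketch computes IoU and F1 for one binary walked-area mask (the means are taken over the evaluation set):

```python
import numpy as np

def iou_and_f1(pred, gt):
    """pred, gt: boolean (H, W) masks of the walked area."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + 1e-9)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-9)
    return iou, f1
```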

Paper Nr: 49
Title:

Entropy-Guided Self-Regulated Learning Without Forgetting for Distribution-Shift Continual Learning with Blurred Task Boundaries

Authors:

Rui Yang, Liming Chen, Matthieu Grard and Emmanuel Dellandréa

Abstract: Continual Learning (CL) aims to endow machines with the human-like ability to continuously acquire novel knowledge while retaining previously learned experiences. Recent research on CL has focused on Domain-Incremental Learning (DIL) or Class-Incremental Learning (CIL) with well-defined task boundaries. However, for real-life applications, e.g., waste sorting, robotic grasping, etc., the model needs to be constantly updated to fit new data, and there is usually an overlap between new and old data. Thus, task boundaries may not be well-defined, and a smoother scenario is needed. In this paper, we propose a more general scenario, namely Distribution-Shift Incremental Learning (DS-IL), which enables soft task boundaries with possible mixtures of data distributions over tasks and thereby subsumes the two previous CL scenarios, DIL and CIL, as special cases. Moreover, given the increasing importance of data privacy in real-life applications and, incidentally, data storage efficiency, we further introduce an entropy-guided self-regulated distillation process without memory, which leverages data similarities between tasks with soft boundaries. Evaluated on a variety of datasets, our proposed method outperforms or matches state-of-the-art continual learning methods.
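
One plausible form of entropy-guided distillation — sketched here as an assumption, since the abstract does not spell out the loss — weights the per-sample distillation term by the previous model's predictive certainty:

```python
# Distill more strongly on samples where the previous (teacher) model is
# confident (low entropy), i.e., samples likely drawn from old distributions.
import torch
import torch.nn.functional as F

def entropy_guided_distillation(student_logits, teacher_logits, T=2.0):
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    kl = (p_t * (p_t.clamp_min(1e-8).log() - log_p_s)).sum(dim=1)   # per-sample KL
    ent = -(p_t * p_t.clamp_min(1e-8).log()).sum(dim=1)             # teacher entropy
    max_ent = torch.log(torch.tensor(float(student_logits.size(1))))
    weight = 1.0 - ent / max_ent                                    # in [0, 1]
    return (weight * kl).mean()
```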

Short Papers
Paper Nr: 13
Title:

Predicting Sources of High Vertical Acceleration on the Road by Preceding Vehicle Observation

Authors:

Petr Jahoda, Jan Cech, Jan Svancar and Tomas Hanis

Abstract: We propose a novel method that detects sources of high vertical acceleration on the road by visually tracking a preceding vehicle. The method is general, predicting any kind of road anomaly, such as potholes, bumps, debris, etc., unlike direct observation methods that rely on training visual detectors for those cases. The method works in poor visibility or when the preceding vehicle occludes the road anomaly. The approach is validated on our dataset, which includes anomalies collected in both controlled settings and real-world scenarios captured in normal traffic conditions. The experiment confirms a strong correlation between the signal measured by a visual tracker, which estimates the vertical displacement of the preceding vehicle, and the IMU signal from the ego vehicle. We demonstrate that our system detects road surface anomalies with high accuracy, achieving an AUC of 0.969. The method is computationally cheap and runs in real time on consumer hardware.
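
Correlating the tracker signal with the ego IMU must account for the travel-time offset between the two vehicles passing the same anomaly; a hedged sketch (our illustration, not the paper's code) of a lag-searched correlation:

```python
import numpy as np

def lagged_correlation(tracker_disp, imu_accel, max_lag):
    """Best Pearson correlation over candidate lags, in samples;
    both inputs are equal-length, equally sampled 1D signals."""
    best_lag, best_r = 0, -1.0
    for lag in range(max_lag + 1):
        a = tracker_disp[: len(tracker_disp) - lag]
        b = imu_accel[lag : lag + len(a)]
        r = np.corrcoef(a, b)[0, 1]
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r
```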

Paper Nr: 14
Title:

Deep Learning-Based Object Recognition for Automated Dissolution Monitoring

Authors:

Simon-Johannes Burgdorf, Md Rezwanul Karim and Kerstin Thurow

Abstract: Chemical and biological processes often involve the dissolution of substances in a solvent. In this work, we describe an automated approach to monitoring such dissolution processes to increase the efficiency of the overall process. A Faster-RCNN model with a ResNet50 backbone was trained to monitor and automate the dissolution of particles in various solutions. A dataset was created containing four different solutions that show the dissolution process at various stages. The results demonstrate that the model is capable of detecting medium to large particles, especially those with no or minimal overlap with other particles, and that it can also monitor colored solutions, such as a potassium permanganate solution. By expanding the dataset, the detection performance for smaller particles can be improved, leading to an overall enhancement of the model's capabilities. Automating this process is a crucial step toward fully autonomous laboratories.
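
Setting up the named detector is standard in torchvision; the sketch below is a generic configuration (the class count and input are placeholders, not the paper's setup):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + "particle" (assumed labeling scheme)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

model.eval()
with torch.no_grad():
    # Each output dict holds "boxes", "labels", and "scores" for one image.
    detections = model([torch.rand(3, 480, 640)])
```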

Paper Nr: 17
Title:

Automating 3D Dataset Generation with Neural Radiance Fields

Authors:

Paul Schulz, Thorsten Hempel and Ayoub Al-Hamadi

Abstract: 3D detection is a critical task for understanding spatial characteristics of the environment and is used in a variety of applications including robotics, augmented reality, and image retrieval. Training performant detection models requires diverse, precisely annotated, and large-scale datasets whose creation is complex and expensive. Hence, there are only a few public 3D datasets, which are additionally limited in their range of classes. In this work, we propose a pipeline for the automatic generation of 3D datasets for arbitrary objects. By utilizing the universal 3D representation and rendering capabilities of Radiance Fields, our pipeline generates high-quality 3D models for arbitrary objects. These 3D models serve as input for a synthetic dataset generator. Our pipeline is fast, easy to use, and has a high degree of automation. Our experiments demonstrate that 3D pose estimation networks trained with our generated datasets achieve strong performance in typical application scenarios.

Paper Nr: 32
Title:

Statistically Consistent Total Least-Squares Estimation of Object Scales

Authors:

Arne Hasselbring and Udo Frese

Abstract: Estimating object poses is a fundamental problem in computer vision in general, as well as for robotic manipulation in particular. Most approaches require a known 3D model of the object. One step towards a more general formulation is to estimate the object's width, height, and depth together with the pose, e.g., to consider a generic box, cylinder, or plate instead of one with known dimensions. This paper investigates the last stage of such a pipeline, namely least-squares estimation of pose and scales from point correspondences aggregated into a fixed-size matrix. To this end, it encapsulates the scaled SO(3) manifold in a so-called ⊞-operator and derives a Gauss-Newton based optimizer with an initial guess on that manifold. We find that the resulting estimator is strongly biased towards small scales. This is due to the structure of the least-squares loss: noise in the recognized object points is multiplied by the transformation matrix to be estimated, violating the additive noise assumption. This has no effect in the prevalent use of this loss for pose estimation but affects the scale. We propose a solution to this bias based on an approximation of total least-squares that preserves the advantage of a fixed-size representation and show that it provides relatively consistent uncertainty estimates.
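
A ⊞-operator of the kind named above can be pictured as follows; this is our own minimal sketch of an update on scaled rotations, not necessarily the paper's parameterization:

```python
# boxplus on the scaled-rotation manifold: compose the rotation with an
# exponential-map increment and update the per-axis scales in log space.
import numpy as np
from scipy.spatial.transform import Rotation

def boxplus(R, scales, delta):
    """delta[:3]: axis-angle rotation increment; delta[3:]: log-scale increment."""
    R_new = R @ Rotation.from_rotvec(delta[:3]).as_matrix()
    scales_new = scales * np.exp(delta[3:])
    return R_new, scales_new

# The scaled transform maps model points X (3 x N) as R @ np.diag(s) @ X.
R0, s0 = np.eye(3), np.ones(3)
R1, s1 = boxplus(R0, s0, np.array([0.0, 0.1, 0.0, 0.05, 0.0, -0.02]))
```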

Paper Nr: 34
Title:

PENet: A Joint Panoptic Edge Detection Network

Authors:

Yang Zhou and Giuseppe Loianno

Abstract: In recent years, compact and efficient scene understanding representations have gained popularity for increasing the situational awareness and autonomy of robotic systems. In this work, we introduce the concept of panoptic edge segmentation and propose PENet, a novel detection network that combines semantic edge detection and instance-level perception into a compact panoptic edge representation. This is obtained through a joint network trained by multi-task learning that concurrently predicts semantic edges, instance centers, and an offset flow map without bounding box predictions, exploiting the cross-task correlations among the tasks. The proposed approach extends semantic edge detection to panoptic edge detection, which encapsulates both category-aware and instance-aware segmentation. We validate the proposed panoptic edge segmentation method and demonstrate its effectiveness on the real-world Cityscapes dataset.
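
One way the predicted centers and offset flow can be combined — a hedged sketch of assumed post-processing, not the authors' exact procedure — is to let each edge pixel vote for the center its offset points to:

```python
import numpy as np

def assign_edges_to_instances(edge_yx, offsets, centers_yx):
    """edge_yx: (M, 2) edge pixel coords; offsets: (M, 2) predicted flow to the
    instance center; centers_yx: (K, 2) detected instance centers."""
    voted = edge_yx + offsets                                 # where each pixel points
    d = np.linalg.norm(voted[:, None, :] - centers_yx[None, :, :], axis=2)
    return np.argmin(d, axis=1)                               # instance id per pixel
```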

Paper Nr: 39
Title:

Scene-Aware Prediction of Diverse Human Movement Goals

Authors:

Qiaoyue Yang, Amadeus Weber, Magnus Jung, Ayoub Al-Hamadi and Sven Wachsmuth

Abstract: Anticipating human behaviour helps autonomous systems plan proactively. Human behaviour can be stochastic due to varying goals. A person's goals typically guide their movement and can therefore help predict human trajectories and motion in the long term. To infer human movement intentions, the environmental context plays a significant role, in addition to the social cues expressed by the individual. Previous work on human goal prediction either requires semantic knowledge of the scene or only tackles interactions with objects. In this paper, we propose a novel multi-goal prediction method using a generative model to address the stochasticity of human movement. It leverages the current RGB scene and the human pose to predict diverse potential future goals of human movement based on the Conditional Variational Autoencoder (CVAE). Our results demonstrate that our approach is capable of generating multiple movement goals in the scene via sampling in the latent space of the CVAE and exhibits generalization across scenarios in the GTA-IM and PROX datasets. Code is publicly available at https://github.com/Q-Y-Yang/DiverseGoalsPrediction.
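
The diversity mechanism is the latent sampling: conditioned on one scene/pose encoding, each latent draw decodes to a different goal. A compact illustration (not the released model; dimensions and layers are placeholders):

```python
import torch
import torch.nn as nn

class GoalDecoder(nn.Module):
    def __init__(self, latent_dim=32, cond_dim=256, goal_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, goal_dim),
        )

    def forward(self, z, c):
        return self.net(torch.cat([z, c], dim=-1))

decoder = GoalDecoder()
c = torch.randn(1, 256)                                       # scene/pose encoding
goals = [decoder(torch.randn(1, 32), c) for _ in range(10)]   # 10 diverse goals
```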

Paper Nr: 40
Title:

Explainable Detection of Logical and Structural Anomalies Based on Multimodal Large Language Models

Authors:

Noeko Fujii and Tetsuya Sakai

Abstract: Large Language Models are rapidly evolving in the field of Natural Language Processing as well as Vision and Language. In particular, GPT-4o integrates images and text and is capable of effective Visual Question Answering. However, GPT-4o is limited in its ability to accurately detect the quantity (count or volume), position, and size of objects in a given image, which hinders its practical application to industrial Anomaly Detection (AD). In order to improve the accuracy and interpretability of AD, this study proposes a new pipeline that utilizes GPT-4o to improve both logical and structural AD accuracies and to output natural language explanations of the detected anomalies. Our system first performs a first-pass logical AD by leveraging the results of the MM-Grounding-DINO object detection model and the SAM2 object segmentation model. It then constructs a multimodal prompt to make GPT-4o perform AD accompanied by natural language explanations of the nature of the anomalies. Experimental results on the MVTec LOCO AD dataset show that our system outperforms existing models in the logical AD task, although it performs less well in the structural AD task. Moreover, to the best of our knowledge, our system is the first to achieve explainable AD that can handle both structural and logical anomalies.

Paper Nr: 43
Title:

Material Classification Using Visio-Tactile Sensor for Haptic Feedback Generation

Authors:

Md Golam Rabby Shuvo, Sonya Coleman, Dermot Kerr and Justin Quinn

Abstract: The flexibility, versatility, and enhanced perception of visio-tactile sensors could be beneficial for advanced robotic systems and other applications requiring precise haptic feedback. In this paper, we present a comprehensive framework that combines material classification and haptic feedback through the use of GelSight sensors. The study includes the creation of a diverse material dataset consisting of 13 material classes across 42 distinct indoor and outdoor items, each with multiple video samples captured over different regions and pressing conditions by human-held GelSight Mini sensors. We introduce a method for detecting pressing events from recorded video samples and extracting key frames that capture important material features. We employ both traditional and deep learning-based feature extraction techniques to model material characteristics. These features are then used to classify materials with high accuracy through supervised learning methods at different image resolutions. For the traditional approach, the Histogram of Oriented Gradients (HOG) feature descriptor combined with an SVM gives 95.41% accuracy on the new image dataset. After fine-tuning five pre-trained models for transfer learning on our dataset, the DenseNet121 (96.42%), InceptionV3 (96.32%), and VGG16 (95.77%) models show promising results, and the ensemble of these three fine-tuned models provides the highest accuracy of 97.63%. The results demonstrate the feasibility of using GelSight-based material classification for haptic feedback, with potential implications for virtual reality, robotic manipulation, and human-computer interaction applications.
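
The traditional branch is a textbook pipeline; a minimal sketch with illustrative parameters (not the paper's exact settings):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(images):
    """images: iterable of same-size 2D grayscale arrays (key frames)."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

rng = np.random.default_rng(0)
X_train = hog_features(rng.random((20, 64, 64)))   # stand-in key frames
y_train = rng.integers(0, 13, size=20)             # 13 material classes
clf = SVC(kernel="rbf").fit(X_train, y_train)
```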

Paper Nr: 37
Title:

Detection, Characterization, and Localization of Oranges and Their Stems Using RGB-D Camera and Image Processing

Authors:

Alaeddin Bani Milhim, Colton Cunningham and The Leo Nguyen

Abstract: This paper presents an advanced image processing technique for the detection, characterization, and localization of orange fruits and their stems using an RGB-D camera, addressing a critical challenge in agricultural automation. Additionally, this work investigates the performance of an RGB-D camera for orange fruit and stem characterization, alongside the Super-Resolution Convolutional Neural Network (SRCNN) technique to enhance image quality for improved detection accuracy. The proposed method captures images of orange fruits, converts them into HSV color space, isolates the orange fruit using binary masking, identifies the largest visible fruit, detects its stem, characterizes both the fruits and their stems, and precisely localizes the fruit and stem features using pinhole camera modeling. The effectiveness of the super-resolution approach was evaluated by comparing the results of a higher-resolution camera, an RGB-D camera, and an RGB-D camera with super-resolution enhancement. The super-resolution approach was found to significantly improve stem detection and localization, with the proposed image processing method achieving localization results within a maximum absolute error of 4.5° and 1.7% deviation from measured values. These findings demonstrate promising potential for high-accuracy localization of orange fruits and stems, facilitating precise gripping and cutting actions in automated harvesting systems.
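
The HSV masking step can be sketched as below; the hue bounds are illustrative guesses for orange, not the paper's calibrated values:

```python
import cv2
import numpy as np

def orange_mask(bgr):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([5, 100, 100])    # assumed lower HSV bound for orange
    upper = np.array([25, 255, 255])   # assumed upper HSV bound
    return cv2.inRange(hsv, lower, upper)

def largest_fruit(mask):
    """Return the contour of the largest masked region (the most visible fruit)."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea) if contours else None
```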

Paper Nr: 55
Title:

Improving Stability and Precision of Bird Tracking in Stereo Vision Systems

Authors:

Grzegorz Madejski, Aleksy Stocki, Dawid Gradolewski, Włodzimierz Kaoka and Wlodek J. Kulesza

Abstract: Wind energy offers a sustainable solution for reducing carbon emissions but presents risks to bird populations, particularly through potential collisions with turbine blades. The Bird Protection System (BPS), employing stereo vision, aims to mitigate these risks by detecting and tracking birds and estimating their distance from turbines. The precision of distance estimation remains challenging due to quantization uncertainty and environmental factors such as lighting, background complexity, and asynchronous camera frames, which can lead to unstable and imprecise measurements. This paper explores methods to enhance the accuracy and reliability of bird tracking systems by addressing one of the key technical challenges: precise disparity estimation. We investigate three methods: one based on the object's bounding box center, one on the object's center of gravity, and an image alignment technique that uses cross-correlation. The methods use an object's image extracted from the background and resized to enable subpixel refinement. Our findings show that the center-of-gravity and cross-correlation methods with resizing significantly enhance tracking stability and precision, and the former is also computationally efficient, rendering it useful for real-time applications such as the BPS.
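
Two of the compared disparity estimates can be illustrated as follows (our simplified sketch; the paper's subpixel refinement on resized crops is not reproduced):

```python
import numpy as np
from scipy.signal import correlate

def center_of_gravity_disparity(mask_left, mask_right):
    """Horizontal disparity between the segmented bird's centroids."""
    return np.nonzero(mask_left)[1].mean() - np.nonzero(mask_right)[1].mean()

def cross_correlation_disparity(crop_left, crop_right):
    """1D example on column-intensity profiles of equally sized crops."""
    a = crop_left.mean(axis=0) - crop_left.mean()
    b = crop_right.mean(axis=0) - crop_right.mean()
    corr = correlate(a, b, mode="full")
    return np.argmax(corr) - (len(b) - 1)   # shift, in pixels
```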

Area 2 - Intelligent Systems

Full Papers
Paper Nr: 29
Title:

Towards a Reliable Multimodal AI Monitoring System for Pain Detection and Quantification

Authors:

Huibin Wang, Sören Nienaber, Laslo Dinges, Magnus Jung and Ayoub Al-Hamadi

Abstract: Accurate and robust pain intensity detection has significant implications for patient monitoring and rehabilitation, especially in personalized treatment and management. Benefiting from the complementarity of multiple modalities, multimodal fusion-based methods for pain intensity classification have garnered widespread attention. In this study, we propose a novel Bi-Modal Fusion framework based on Electrodermal Activity (EDA) and Electromyography (EMG) for pain classification. This framework combines an LSTM with an attention module in a unified block to learn complex dynamic features from biosignals, effectively capturing both global and local patterns. Meanwhile, focal loss is used as the loss function to mitigate the impact of class imbalance during model training. Through extensive experiments, our method achieves an average accuracy improvement of 3.31% over the state of the art across 11 sub-datasets of a novel multimodal pain dataset, with a notable improvement of 7.20% on the Reduced Electrical Tonic sub-dataset (RETD). Our research not only validates the effectiveness of the proposed method but also highlights its robustness across different modalities and sub-datasets. These findings lay a solid foundation for our long-term goal of developing an accurate and robust clinical multimodal AI monitoring system for pain detection and quantification.
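
Focal loss down-weights easy, well-classified samples so training focuses on the hard ones; a standard multi-class form (the gamma and alpha values are the common defaults, not necessarily the paper's):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets, reduction="none")  # per-sample cross-entropy
    p_t = torch.exp(-ce)                               # prob. of the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()
```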

Paper Nr: 51
Title:

A Flexible Output Distribution for Regression-Based Probabilistic Long-Term Human Trajectory Prediction

Authors:

Ronny Hug, Stefan Becker, Wolfgang Hübner, Michael Arens and Jürgen Beyerer

Abstract: Probabilistic models for sequential data are the basis for a variety of applications concerned with processing temporally ordered information. The predominant approach in this domain is given by recurrent neural networks, implementing either a transformative approach (e.g. Variational Autoencoders or Generative Adversarial Networks) or a regression-based approach, i.e. variations of Mixture Density Networks (MDN). While these approaches effectively approximate complex probability distributions over full trajectories, their respective output distributions fall short in terms of post-hoc inference capabilities. To overcome this limitation, we build on an MDN variant that parameterizes (mixtures of) probabilistic Bézier curves (N-Curves), allowing us to establish a connection to the framework of Gaussian processes. For this, we show that N-Curves are a special case of non-stationary Gaussian processes (denoted N-GP) and then derive corresponding mean and kernel functions for different modalities. Then, we propose the use of this MDN variant as a data-dependent generator for N-GP prior distributions, resulting in a probabilistic trajectory prediction model that inherently supports post-hoc Bayesian inference. We show the advantages granted by this combined prediction model in the context of long-term human trajectory prediction.
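
The core object can be sketched in a few lines: with independent Gaussian control points, the curve value at parameter t is Gaussian with Bernstein-weighted moments (a 1D illustration of the idea, not the paper's multivariate construction):

```python
import numpy as np
from scipy.special import comb

def bernstein(n, i, t):
    return comb(n, i) * t**i * (1 - t)**(n - i)

def n_curve_moments(mu_ctrl, var_ctrl, t):
    """mu_ctrl, var_ctrl: (n+1,) means/variances of independent control points."""
    n = len(mu_ctrl) - 1
    w = np.array([bernstein(n, i, t) for i in range(n + 1)])
    return w @ mu_ctrl, (w**2) @ var_ctrl   # mean and variance at t

mu, var = n_curve_moments(np.array([0.0, 1.0, 0.5]), np.full(3, 0.01), t=0.5)
```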

Short Papers
Paper Nr: 20
Title:

Toward Truly Intelligent Autonomous Systems: A Taxonomy of LLM Integration for Everyday Automation

Authors:

Magnus Jung, Thorsten Hempel, Basheer Al-Tawil, Qiaoyue Yang, Sven Wachsmuth and Ayoub Al-Hamadi

Abstract: With the rapid development of large language models (LLMs), their integration into autonomous systems has become essential. This integration significantly increases the flexibility and adaptability of the system. In this paper, we propose a categorisation of LLM integration into three levels: open-loop, closed-loop and fully autonomous systems driven by robotic LLMs. They are analysed through existing literature, real experiments with the humanoid robot TIAGo, and simulations with models such as ChatGPT-4 and Vicuna 13b-v1.5-16k. We demonstrate the potential of LLMs to enhance the flexibility and adaptability of autonomous systems, particularly in dynamic environments where conventional finite state machines may prove inadequate. Closed-loop systems, in particular, show a strong potential to respond to unexpected situations with human-like problem solving capabilities. Integrating LLMs with autonomous systems enables new real-world applications by enhancing their ability to adapt, reason and respond intelligently in dynamic environments.

Paper Nr: 46
Title:

Active Closure: Symbolic Active Contour for Spatial Automated Reasoning

Authors:

J. I. Olszewska

Abstract: Reducing the gap between quantitative visual data and qualitative spatial information such as qualitative spatial relations (QSR) is crucial for many intelligent and autonomous systems (AS) requiring the automated analysis of complex visual scenes containing multiple objects of interest. Hence, our paper proposes to directly relate symbolic spatial knowledge to computer-vision concepts, in particular to a new active contour concept. Active contours are deformable curves that evolve under forces computed from geometric and photometric properties of visual objects in order to delineate these target objects' shapes. Active contours can not only be applied to generate the quantitative visual data related to the extracted objects of interest; the computation of their centroids can also define the centers of reference necessary for determining our spatial directional and projective relations among the objects of interest. Furthermore, this paper introduces the use of active contours to intrinsically define the closure of objects of interest, useful for our spatial topological relations, leading to the active closure concept. The presented approach for qualitative spatial reasoning based on active contours has been successfully validated on geography-related imagery, while being reliable, explainable, and sustainable.

Area 3 - Robotics

Full Papers
Paper Nr: 26
Title:

Error Modification of Robot Motion Generation by LLM based on Parts Function and Physical Features of Robot

Authors:

Takahiro Suzuki and Manabu Hashimoto

Abstract: We propose a method for generating robot motions based on simple commands given by humans. Large language models (LLMs) are generic models that can be used to generate robot motion procedures for various tasks. However, they often output errors, such as specifying inappropriate procedures or tools, or selecting tools that are difficult for robots to grasp. For example, an LLM may suggest using a spoon or a whisk when scooping hot water. In this study, we address these problems by setting the function of tools, such as "scoop" or "stir," and by utilizing the robot's physical features. We also generate a robot motion trajectory based on a motion template. A comparison between the proposed method and a method that does not take the robot's physical features into account confirmed that our method had a higher task success rate and was able to select tools that were easier for the robot to operate.

Paper Nr: 28
Title:

Hybrid Neural Network-Based Indoor Localisation System for Mobile Robots Using CSI Data in a Robotics Simulator

Authors:

Javier Ballesteros-Jerez, Jesus Martínez Gómez, Ismael García-Varea, Luis Orozco-Barbosa and Manuel Castillo-Cara

Abstract: We present a hybrid neural network model for inferring the position of mobile robots using Channel State Information (CSI) data from a Massive MIMO system. By leveraging an existing CSI dataset, our approach integrates a Convolutional Neural Network (CNN) with a Multilayer Perceptron (MLP) to form a Hybrid Neural Network (HyNN) that estimates 2D robot positions. CSI readings are converted into synthetic images using the TINTO tool. The localisation solution is integrated with a robotics simulator and the Robot Operating System (ROS), which facilitates its evaluation through heterogeneous test cases and the adoption of state estimators like Kalman filters. Our contributions illustrate the potential of our HyNN model in achieving precise indoor localisation and navigation for mobile robots in complex environments. The study follows, and proposes, a generalisable procedure applicable beyond the specific use case studied, making it adaptable to different scenarios and datasets.
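
A hedged PyTorch sketch of the hybrid CNN + MLP idea over CSI-as-image inputs; the layer sizes and input resolution are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

class HyNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(                # image branch (CSI rendered by TINTO)
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
        )
        self.mlp = nn.Sequential(nn.Linear(32 * 16, 64), nn.ReLU(),
                                 nn.Linear(64, 2))   # 2D (x, y) position

    def forward(self, img):
        return self.mlp(self.cnn(img))

pos = HyNN()(torch.rand(1, 3, 32, 32))   # -> tensor of shape (1, 2)
```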

Short Papers
Paper Nr: 23
Title:

Robot Vision System for Retail Shelf Monitoring

Authors:

Abhishek V. Latha, Mohammad Rahimipour and Adel Merabet

Abstract: In this paper, automated retail shelf monitoring, including tag verification and on-shelf availability checking, is proposed using a robot vision system. The mobile robot platform with camera vision extracts images of the products and tags on the shelves. The tag images are processed using computer vision tools such as YOLOv5 and EasyOCR to detect the tags and extract all information to be compared with a database for compliance. The extracted images of the products are used to reconstruct the entire shelving unit; YOLOv5 is used for object detection, and the MiDaS depth estimation model is used to build the depth map image of the shelving unit. This combined method, based on object detection and depth estimation, is used to check on-shelf availability of the products. Experimentation is conducted using a small shelving unit. The proposed procedure can verify products, tags, and availability, and ensures timely shelf monitoring by the robot vision system.
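
The tag-reading step maps naturally onto the off-the-shelf EasyOCR API; a brief sketch (the file name and confidence threshold are placeholders):

```python
import easyocr

reader = easyocr.Reader(["en"])             # loads detection + recognition models
results = reader.readtext("tag_crop.jpg")   # list of (bbox, text, confidence)
for bbox, text, conf in results:
    if conf > 0.5:
        print(text)   # compare against the product database entry (not shown)
```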

Paper Nr: 24
Title:

Integrating Vision-Based AI and Large Language Models for Real-Time Aquaculture Net Pens Inspection

Authors:

Waseem Akram, Muhayy Ud Din, Lakmal Seneviratne and Irfan Hussain

Abstract: This paper presents a novel approach for real-time aquaculture net pen monitoring by integrating vision-based AI with large language models. Traditional monitoring methods, which rely heavily on manual inspection and semi-autonomous systems, are often labor-intensive and inefficient. This work uses advanced AI techniques, combining a YOLO-based deep learning model for detecting net defects such as biofouling, vegetation, and net holes with the large language model ChatGPT-4 to interpret and summarize inspection results in real time. The proposed approach provides real-time aqua-net inspection that enhances the accuracy and speed of net inspections while minimizing human intervention. Experimental results demonstrate significant improvements in detection precision, with a mAP score of 0.9701 on our custom aqua-net dataset, operational efficiency, and automated report generation, highlighting the potential of this integrated approach to transform aquaculture management and promote sustainability.

Paper Nr: 35
Title:

Norm-Based Stability Conditions for Neutral Systems with Discrete Delays

Authors:

Ozlem Faydasicok and Sabri Arik

Abstract: This research article deals with stability problems for neutral systems possessing discrete time delay terms in the state variables and discrete neutral delay terms in the time derivatives of the state variables. By analysing suitable Lyapunov functional candidates, we derive a set of new norm-based conditions that determine the global stability of neutral systems involving discrete delay terms. The proposed norm-based global asymptotic stability conditions impose constraints on the norms of the constant system matrices independently of the delay parameters. We give a numerical example of a neutral system to demonstrate the feasibility of the proposed stability criteria. Since deriving sufficient stability criteria for linear neutral systems including discrete delay terms is a difficult task, the stability conditions proposed in this research article can be considered a beneficial contribution to the topic of stability of neutral systems.
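
For orientation, the class of systems described can be written in the standard textbook form below; the paper's exact system and conditions may differ in detail:

```latex
% Linear neutral system with a discrete state delay \tau and neutral delay h:
\dot{x}(t) - C\,\dot{x}(t-h) = A\,x(t) + B\,x(t-\tau), \qquad h,\ \tau > 0.
% Delay-independent, norm-based conditions typically combine a requirement
% such as \|C\| < 1 with inequalities on \|A\| and \|B\|.
```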

Paper Nr: 45
Title:

Pipeline Inspection: A Case Study for a Human Cognition-Inspired Condition Management System

Authors:

Hariom Dhungana

Abstract: Subsea pipelines play a critical role in the modern oil and gas industry by transporting oil and gas production. However, over time, factors like corrosion or deformation can cause degradation, potentially leading to significant economic and environmental harm if not promptly addressed. As a result, regular inspection of subsea pipelines is essential to prevent catastrophic events. In an ongoing project, a human cognition-inspired condition management framework has been proposed. The system is designed to leverage various data sources and integrate them with analytical models and knowledge-based systems to assist in equipment diagnosis and recommend optimized operation and maintenance strategies. To demonstrate experimental verification of the different stages of condition monitoring and the overlapping components of the framework, we sought an integrated dataset containing various observations, such as equipment, environmental, and operational data. For a comprehensive case study, we selected the Subpipe inspection dataset, showcasing pipeline localization, fault detection, and fault diagnostic tests performed on the dataset.

Paper Nr: 47
Title:

Synth It like KITTI: Synthetic Data Generation for Object Detection in Driving Scenarios

Authors:

Richard Marcus, Christian Vogel, Inga Jatzkowski, Niklas Knoop and Marc Stamminger

Abstract: Training systems virtually for use in the real world is an important factor in advancing autonomous driving. Yet, progress on transferability between domains in practice remains limited. We focus specifically on 3D object detection on LiDAR point clouds and propose a dataset generation pipeline based on the CARLA simulation. Utilizing domain randomization strategies and careful modeling, we are able to train an object detector on the processed synthetic data and demonstrate strong generalization capabilities to the KITTI dataset. We show that data distributions similar to real-world datasets can be achieved. Furthermore, we compare different virtual sensor variants to gather insights into which sensor attributes may be responsible for the prevalent domain gap.

Paper Nr: 53
Title:

Is an Object-Centric Representation Beneficial for Robotic Manipulation?

Authors:

Alexandre Chapin, Emmanuel Dellandrea and Liming Chen

Abstract: Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning structured representations of images and videos. It has repeatedly been presented as a potential way to improve the data efficiency and generalization capabilities of agents learned on downstream tasks. However, most existing work only evaluates such models on scene decomposition, without any notion of reasoning over the learnt representation. Robotic manipulation tasks generally involve multi-object environments with potential inter-object interaction, and we argue that they are therefore a very interesting playground to truly evaluate the potential of existing object-centric work. We present a new framework built on top of existing OCR models to evaluate them on robotic tasks. To do so, we create several robotic manipulation tasks in simulated environments involving multiple objects (several distractors, the robot, ...) and a high level of randomization (object positions, colors, shapes, background, initial position, ...). We then evaluate one classical object-centric method over several generalization scenarios and compare its results against several state-of-the-art holistic representations. Our results show that existing methods are prone to failure in difficult scenarios involving complex scene structures, and that object-centric methods help overcome these problems.

Paper Nr: 54
Title:

Outdoor Robot Geo-Localization Using Vision Techniques

Authors:

Hasinarivo Ramanana, Jean-Pierre Jessel, Tahiry Filamatra Andriamarozakaniaina and Hagamalala Santatra Bernardin

Abstract: Outdoor geolocation generally relies on a smartphone's GPS (Global Positioning System) for positioning. However, GPS encounters difficulties when the sky is cloudy or in dense urban areas between tall buildings. This article proposes an alternative geolocation method for cases of GPS unavailability or failure, using vision for self-localization. The method is based on a MobileNet neural network model, followed by regression, with logo detection using YOLO to help localize the user. We compare our method with that used by Nilwong et al., as well as with AlexNet and ResNet50. The dataset consists of images captured on a campus with a smartphone and used to train the model. The results are obtained using an Android application, which compares the position predicted from the input images with the actual position measured by GPS in good conditions (clear sky). The results show that our model can replace GPS for locating a pedestrian in an urban environment.
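
A hedged sketch of a MobileNet backbone with a 2D regression head, using torchvision's MobileNetV2 as a stand-in (the paper's exact variant and head are not specified here):

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights="DEFAULT")
backbone.classifier = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(backbone.last_channel, 2),   # regress a 2D position estimate
)
coords = backbone(torch.rand(1, 3, 224, 224))   # -> tensor of shape (1, 2)
```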

Paper Nr: 19
Title:

LiCAR: Pseudo-RGB LiDAR Image for CAR Segmentation

Authors:

Ignacio de Loyola Páez Ubieta, Edison P. Velasco-Sánchez and Santiago T. Puente

Abstract: With the advancement of computing resources, an increasing number of Neural Networks (NNs) for image detection and segmentation are appearing. However, these methods usually accept as input an RGB 2D image. On the other hand, Light Detection And Ranging (LiDAR) sensors with many layers provide images that are similar to those obtained from a traditional low-resolution RGB camera. Following this principle, a new dataset for segmenting cars in pseudo-RGB images has been generated. This dataset combines the information given by the LiDAR sensor into a Spherical Range Image (SRI), concretely the reflectivity, near-infrared, and signal intensity 2D images. These images are then fed into instance segmentation NNs, which segment the cars appearing in them, achieving Bounding Box (BB) and mask precisions of 88% and 81.5%, respectively, with You Only Look Once (YOLO)v8 large. Using this segmentation NN, several trackers were applied to follow each segmented car instance along a video feed, showing strong performance in real-world experiments.
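
Building the pseudo-RGB input amounts to stacking the three LiDAR channels as one image; a one-step sketch (the per-channel normalization is our assumption):

```python
import numpy as np

def pseudo_rgb(reflectivity, near_ir, signal):
    """Each input: (H, W) float array from the LiDAR's spherical projection."""
    def norm(c):
        return (c - c.min()) / (np.ptp(c) + 1e-8)
    return np.stack([norm(reflectivity), norm(near_ir), norm(signal)], axis=-1)
```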

Paper Nr: 25
Title:

Shadow-Robust Autonomous Navigation with Traversability Judgement Using Stereo Camera

Authors:

Motonobu Omori, Kota Hayashi, Hiroshi Yoshitake and Motoki Shino

Abstract: There are high expectations for autonomous personal mobility vehicles (PMVs) in outdoor environments to support the daily lives of older people. These autonomous PMVs are required to adhere to traffic rules and exhibit safe obstacle avoidance. Existing methods extract regions recommended for travel based on stereo camera images with semantic segmentation and navigate through them to meet the requirements. Currently, there is a problem where the recommended area is incorrectly extracted due to changes in surface brightness caused by shadows, leading to failures in autonomous navigation. Therefore, this study focused on 3D point clouds obtained with a stereo camera and proposed a robust autonomous navigation method that addresses shadows. Finally, the effectiveness of the proposed method was evaluated through autonomous navigation experiments conducted in environments that include shadows.

Paper Nr: 30
Title:

Automatic Comparison of Ecuadorian Facial Features with Marquardt Beauty Mask

Authors:

Jesús Ormaza, Paulina Morillo and Diego Vallejo-Huanga

Abstract: Surgical specialists use facial analysis tools to evaluate human beauty, particularly in the context of reconstructive procedures. Among the tools used to perform facial analysis is the Marquardt Beauty Mask, a mask developed by Stephen Marquardt that represents the perfect proportions a face should have. This article shows the implementation of a web tool that uses computer vision to extract the differences between the proportions of the mask, considered an ideal model of beauty, and a facial photograph of a person. In addition, we performed a statistical analysis of the differences found between the features of 432 facial photographs of Ecuadorians and the mask. According to the results, the measures calculated on the faces differ considerably from those in the Marquardt Beauty mask. These differences show that Ecuadorians have faces that are wider and longer than the mask. Therefore, the widespread use of tools built for a single archetype of people could be inaccurate in evaluating the harmony and symmetry of a face.

Paper Nr: 31
Title:

Online Collaborative UAV Path Planning for Mapping and Spraying Missions

Authors:

Ali Moltajaei Farid and Malek Mouhoub

Abstract: Mapping and spraying play a crucial role in precision agriculture. In some technologies, companies employ an offline approach to plan paths for a swarm of UAVs. However, in real-world scenarios, unforeseen challenges often arise during missions, necessitating adaptive path adjustments. Currently, human intervention remains common to address these issues as they occur. In this regard, greater autonomy will likely be essential. To address these dynamic and uncertain situations, this paper introduces a novel online planner designed to enable UAVs to make slight adjustments to their pre-planned paths in real time until issues are resolved. This proposed approach facilitates autonomous online path planning that seamlessly integrates both pre-planned and dynamically adjusted routes. Primary planning is carried out offline, with the online planner activated to handle unexpected challenges as they arise. Depending on the circumstances, UAVs can collaborate to resolve issues while adhering to specified constraints to achieve their objectives.

Paper Nr: 48
Title:

NAUTICAL: Navigation Aid using U-Net and Theta* with Integrated Collision Avoidance and Landmarking

Authors:

Yashwardhan Deshmukh, Martin J.-D. Otis and Salick Diagne

Abstract: Technologies related to decision support systems in intelligent vessels have reached a high level of maturity in recent years. Meanwhile, autonomous and unmanned vessels have been extensively studied alongside autonomous vehicles, using path planning, trajectory generation, localization, and logistic optimization. Numerous technologies have been developed for path planning and obstacle avoidance strategies on seas and rivers, but the challenge remains when it comes to avoiding marine mammals, especially under the constraint of optimal trajectories that reduce energy consumption and time. This paper proposes a planner based on both neural-network-driven and deterministic approaches. An Attention U-Net is used for semantic segmentation of terrestrial and aquatic areas, followed by the implementation of an artificial potential field to represent the map. Skeletonization and weighted dilation are then carried out to generate an optimal cost map for Theta* path planning. The proposed Navigation Aid System includes parameters for the minimum distance from the shore and the maximization of the distance to cetaceans, under the constraint of minimizing the total distance traveled by the vessel. The pilot has the option to add waypoints and force the algorithm to reach these locations with minimal traveled distance. The results show an optimal vessel path for the segmented map, given the constraints.
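
The skeletonization-and-dilation step can be sketched as follows — a hedged illustration of one plausible cost-map construction, not the paper's exact weighting:

```python
# Skeletonize the navigable-water mask, then let cost grow with the distance
# to the skeleton so the planner (e.g., Theta*) prefers the waterway centerline.
import numpy as np
from skimage.morphology import skeletonize
from scipy.ndimage import distance_transform_edt

def waterway_cost_map(water_mask):
    """water_mask: boolean (H, W), True where water was segmented."""
    skeleton = skeletonize(water_mask)
    dist_to_skeleton = distance_transform_edt(~skeleton)
    return np.where(water_mask, dist_to_skeleton, np.inf)   # land is impassable
```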

Paper Nr: 50
Title:

Objects Detection on the Water Surface Using Satellite Imagery, Drones and Vessel-Based Imaging Applied for Logistics

Authors:

Iheb Ben Salah, Martin J.-D. Otis, Remy Rahem and Mahassen Ardhaoui

Abstract: Automation of the logistics process in vessel transport could be improved using satellite and drone imagery. For example, finding the location of a point of interest on the surface of the water, such as a human, a buoy, a small ship without AIS, or a cetacean, could be used for collision avoidance and search-and-rescue applications. The location could then be used for path planning and logistics. A wide variety of objects can be found on the surface of water, yet there are no studies on the localization and differentiation of these objects. This study suggests integrating six datasets from different sources, such as drone and satellite imagery, to validate the concept of mixing different data sources while maintaining adequate detection performance for different objects on the surface of the water. YOLOv8s, YOLOv8m, YOLOv10m, and Detectron2 Faster R-CNN (R50-FPN) models were used to validate the feasibility of detecting four object classes under different conditions: cetaceans, small boats, buoys, and humans, with water as background. This project will be integrated into an overall system that improves maritime safety and operations.