Publications

My work has appeared in the following venues:

Full Papers:

Conferences:

FSE: '25
ICSE: '22 '23 '24 '25
IROS: '24
ICRA: '21 '24
SEA^2: '19

Journals:

Science of Computer Programming: 2024

Short Papers:

FSE-DEMO: '25
ICSE-DEMO: '25
SE4ADS: '25
ICSE-SEET: '22

You can also find my articles on my Google Scholar profile.

Scene Flow Specifications: Encoding and Monitoring Rich Temporal Safety Properties of Autonomous Systems

Published in 2025 ACM International Conference on the Foundations of Software Engineering, 2025

To ensure the safety of autonomous systems, it is imperative for them to abide by their safety properties. The specification of such safety properties is challenging because of the gap between the input sensor space (e.g., pixels, point clouds) and the semantic space over which safety properties are specified (e.g., people, vehicles, road). Recent work utilized scene graphs to overcome portions of that gap, enabling the specification and synthesis of monitors targeting many safe driving properties for autonomous vehicles. However, scene graphs are not rich enough to express the many driving properties that include temporal elements (e.g., when two vehicles enter an intersection at the same time, the vehicle on the left shall yield…), fundamentally limiting the types of specifications that can be monitored. In this work, we characterize the expressiveness required to specify a large body of driving properties, identify property types that cannot be specified with current approaches, which we name scene flow properties, and construct an enhanced domain-specific language that utilizes symbolic entities across time to enable the encoding of the rich temporal properties required for autonomous system safety. In analyzing a set of 114 specifications, we find that our approach can successfully encode 110 (96%) specifications as compared to 87 (76%) under prior approaches, an improvement of 20 percentage points. We implement the specifications in the form of a runtime monitoring framework to check the compliance of 3 state-of-the-art autonomous vehicles, finding that they violated scene flow properties over 40 times in 30 test executions, including 34 violations for failing to yield properly at intersections. Empirical results demonstrate the implementation is suitably efficient for runtime monitoring applications.

Recommended citation:
Trey Woodlief, Felipe Toledo, Matthew Dwyer, and Sebastian Elbaum. 2025. Scene Flow Specifications: Encoding and Monitoring Rich Temporal Safety Properties of Autonomous Systems. Proc. ACM Softw. Eng. 2, FSE, Article FSE112 (July 2025), 24 pages. https://doi.org/10.1145/3729382

Download here or Github

Steering the Future: A Catalog of Failures in Deep Learning-Enabled Robotic Navigation Systems

Published in 2025 ACM International Conference on the Foundations of Software Engineering: Demonstrations, 2025

Failure catalogs have proven to be key instruments driving the evolution and assessment of program analysis techniques. However, such infrastructure does not support the development of techniques for the large number of emerging robotic systems. Developing such a catalog is costly and challenging because it requires access to the full physical system and the presentation of a diverse set of failures. We have started to tackle this challenge, building Defects4DeepNav, a growing catalog of over 100 failures from a commercial open source robot operating in the real world navigated by a learned component, with a diverse set of failures arising from each of 15 navigation components. This paper introduces Defects4DeepNav, including a diverse set of failures, full sensor data for failing and non-failing behavior, tools to analyze the data, and illustrations of its potential use cases and extensions.

Recommended citation:
Coming Soon

Github

Closing the Gap between Sensor Inputs and Driving Properties: A Scene Graph Generator for CARLA

Published in 47th International Conference on Software Engineering: Demonstrations (ICSE-DEMO'25), 2025

The software engineering community has increasingly taken up the task of assuring safety in autonomous driving systems by applying software engineering principles to create techniques to develop, validate, and verify these systems. However, developing and analyzing these techniques requires extensive sensor data sets and execution infrastructure with the relevant features and known semantics for the task at hand. While the community has invested substantial effort in gathering and cultivating large-scale data sets and developing simulation infrastructure with varying features, semantic understanding of this data has remained out of reach, relying on limited, manually-crafted data sets or bespoke simulation environments to ensure the desired semantics are met. To address this, we developed CarlaSGG, a plugin for the widely used ADS simulator CARLA that extracts relevant ground-truth spatial and semantic information from the simulator state at runtime in the form of scene graphs, enabling online and post-hoc automatic reasoning about the semantics of the scenario and associated sensor data. The tool has been successfully deployed in multiple previous software engineering approach evaluations, which we describe to demonstrate the utility of the tool. The precision of the semantic information captured in the scene graph can be adjusted by the client application to suit the needs of the implementation. We provide a detailed description of the tool’s design, capabilities, and configurations, with additional documentation available accompanying the tool’s online source: https://github.com/less-lab-uva/carla_scene_graphs.

Recommended citation:
Coming Soon

Github

Realism Constructs for ADS Simulation Testing

Published in 1st International Workshop on Software Engineering for Autonomous Driving Systems (SE4ADS 2025), 2025

As autonomous driving systems (ADSs) continue to expand into the public sphere, so too must our efforts to sufficiently validate their safety. Given the wide array of scenarios over which ADSs must operate and the inherent dangers in these scenarios, developers often rely on simulation testing to exercise the system. However, the well-documented simulation-reality gap limits the transfer of results from simulation testing to real world operation, hindering the ability to build sufficient assurance cases based on validation in simulation alone. This is a fundamental issue in the construct validity of simulation-based methods for validation of ADS systems. Recent efforts have sought to decrease the simulation-reality gap through improved simulation fidelity and developing methods for generating synthetic data from real data. However, these efforts do not come with a method to reason about the construct validity achieved by these improvements. Current methods to measure the distance between simulation and reality for ADS validation are insufficient for the task as they provide no basis on which to judge the validity of the simulated tests. For simulation testing to provide utility, we require methods to reason about this construct validity; i.e., whether and how much a given test or technique will yield failures that transfer to real-world deployment, or miss failures because of the lack of fidelity. We describe the continuing challenges in this domain, provide outlines of what is required of a solution, and set directions for future work in the community to this end.

Recommended citation:
Coming Soon

A Differential Testing Framework to Identify Critical AV Failures Leveraging Arbitrary Inputs

Published in 47th International Conference on Software Engineering (ICSE 2025), 2025

The proliferation of autonomous vehicles (AVs) has made their failures increasingly evident. Testing efforts aimed at identifying the inputs leading to those failures are challenged by the input’s long-tail distribution, whose area under the curve is dominated by rare scenarios. We hypothesize that leveraging emerging open-access datasets can accelerate the exploration of long-tail inputs. Having access to diverse inputs, however, is not sufficient to expose failures; an effective test also requires an oracle to distinguish between correct and incorrect behaviors. Current datasets lack such oracles and developing them is notoriously difficult. In response, we propose DIFFTEST4AV, a differential testing framework designed to address the unique challenges of testing AV systems: 1) for any given input, many outputs may be considered acceptable, 2) the long-tail contains an insurmountable number of inputs to explore, and 3) the AV’s continuous execution loop requires failures to persist in order to affect the system. DIFFTEST4AV integrates statistical analysis to identify meaningful behavioral variations, judges their importance in terms of the severity of these differences, and incorporates sequential analysis to detect persistent errors indicative of potential system-level failures. Our study on 5 versions of the commercially-available, road-deployed comma.ai OpenPilot system, using 3 available image datasets, demonstrates the capabilities of the framework to detect high-severity, high-confidence, long-running test failures.

Recommended citation:
Coming Soon

Download here or Github

The SGSM framework: Enabling the specification and monitor synthesis of safe driving properties through scene graphs

Published in Science of Computer Programming, 2024

As autonomous vehicles (AVs) become mainstream, assuring that they operate in accordance with safe driving properties becomes paramount. The ability to specify and monitor driving properties is at the center of such assurance. Yet, the mismatch between the semantic space over which typical driving properties are asserted (e.g., vehicles, pedestrians) and the sensed inputs of AVs (e.g., images, point clouds) poses a significant assurance gap. Related efforts bypass this gap by either assuming that data at the right semantic level is available, or they develop bespoke methods for capturing such data. Our recent Scene Graph Safety Monitoring (SGSM) framework addresses this challenge by extracting scene graphs (SGs) from sensor inputs to capture the entities related to the AV, specifying driving properties using a domain-specific language that enables building propositions over those graphs and composing them through temporal logic, and synthesizing monitors to detect property violations. Through this paper we further explain, formalize, analyze, and extend the SGSM framework, producing SGSM++. This extension is significant in that it incorporates the ability for the framework to encode the semantics of resetting a property violation, enabling the framework to count the quantity and duration of violations. We implemented SGSM++ to monitor for violations of 9 properties of 3 AVs from the CARLA Autonomous Driving Leaderboard, confirming the viability of the framework, which found that the AVs violated 71% of properties during at least one test including almost 1400 unique violations over 30 total test executions, with violations lasting up to 9.25 minutes. Artifact available at https://github.com/less-lab-uva/ExtendingSGSM.

Recommended citation:
Trey Woodlief, Felipe Toledo, Sebastian Elbaum, Matthew B. Dwyer, The SGSM framework: Enabling the specification and monitor synthesis of safe driving properties through scene graphs, Science of Computer Programming, Volume 242, 2025, 103252, ISSN 0167-6423, https://doi.org/10.1016/j.scico.2024.103252.

Download here or Github

ODD-diLLMma: Driving Automation System ODD Compliance Checking using LLMs

Published in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'24), 2024

Although Driving Automation Systems (DASs) are rapidly becoming more advanced and ubiquitous, they are still confined to specific Operational Design Domains (ODDs) over which the system must be trained and validated. Yet, each DAS has a bespoke and often informally defined ODD, which makes it intractable to manually judge whether a dataset satisfies a DAS’s ODD. This results in inadequate data leaking into the training and testing processes, weakening them, and causes large amounts of collected data to go unused given the inability to check their ODD compliance. This presents a dilemma: How do we cost-effectively determine if existing sensor data complies with a DAS’s ODD? To address this challenge, we start by reviewing the ODD specifications of 10 commercial DASs to understand current practices in ODD documentation. Next, we present ODD-diLLMma, an automated method that leverages Large Language Models (LLMs) to analyze existing datasets with respect to the natural language specifications of ODDs. Our evaluation of ODD-diLLMma examines its utility in analyzing inputs from 3 real-world datasets. Our empirical findings show that ODD-diLLMma significantly enhances the efficiency of detecting ODD compliance, showing improvements of up to 147% over a human baseline. Further, our analysis highlights the strengths and limitations of employing LLMs to support ODD-diLLMma, underscoring their potential to effectively address the challenges of ODD compliance detection.

Recommended citation:
Hildebrandt, Carl, Trey Woodlief, and Sebastian Elbaum. "ODD-diLLMma: Driving Automation System ODD Compliance Checking using LLMs." 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024.

Download here or Github

Specifying and Monitoring Safe Driving Properties with Scene Graphs

Published in 2024 IEEE International Conference on Robotics and Automation, 2024

With the proliferation of autonomous vehicles (AVs) comes the need to ensure they abide by safe driving properties. Specifying and monitoring such properties, however, is challenging because of the mismatch between the semantic space over which typical driving properties are asserted (e.g., vehicles, pedestrians, intersections) and the sensed inputs of AVs. Existing efforts either assume such semantic data to be available or develop bespoke methods for capturing it. Instead, this work introduces a framework that can extract scene graphs from sensor inputs to capture the entities related to the AV, and a domain-specific language that enables building propositions over those graphs and composing them through temporal logic. We implemented the framework to monitor for specification violations of 3 top AVs from the CARLA Autonomous Driving Leaderboard, and found that on average the AVs violated 71% of properties during at least one test.

Recommended citation:
Toledo, Felipe, Trey Woodlief, Sebastian Elbaum, and Matthew B. Dwyer. "Specifying and Monitoring Safe Driving Properties with Scene Graphs." In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 15577-15584. IEEE, 2024.

Download here or Github

S³C: Spatial Semantic Scene Coverage for Autonomous Vehicles

Published in 46th International Conference on Software Engineering (ICSE 2024), 2024

Autonomous vehicles (AVs) must be able to operate in a wide range of scenarios including those in the long tail distribution that include rare but safety-critical events. The collection of sensor input and expected output datasets from such scenarios is crucial for the development and testing of such systems. Yet, approaches to quantify the extent to which a dataset covers test specifications that capture critical scenarios remain limited in their ability to discriminate between inputs that lead to distinct behaviors, and to render interpretations that are relevant to AV domain experts. To address this challenge, we introduce S³C, a framework that abstracts sensor inputs to coverage domains that account for the spatial semantics of a scene. The approach leverages scene graphs to produce a sensor-independent abstraction of the AV environment that is interpretable and discriminating. We provide an implementation of the approach and a study for camera-based autonomous vehicles operating in simulation. The findings show that S³C outperforms existing techniques in discriminating among classes of inputs that cause failures, and offers spatial interpretations that can explain to what extent a dataset covers a test specification. Further exploration of S³C with open datasets complements the study findings, revealing the potential and shortcomings of deploying the approach in the wild.

Recommended citation:
Trey Woodlief, Felipe Toledo, Sebastian Elbaum, and Matthew B. Dwyer. 2024. S3C: Spatial Semantic Scene Coverage for Autonomous Vehicles. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24), April 14–20, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3597503.3639178

Download here or Github

Generating Realistic and Diverse Tests for LiDAR-Based Perception Systems

Published in 45th International Conference on Software Engineering (ICSE 2023), 2023

Autonomous systems rely on a perception component to interpret their surroundings, and when misinterpretations occur, they can and have led to serious and fatal system-level failures. Yet, existing methods for testing perception software remain limited in both their capacity to efficiently generate test data that translates to real-world performance and in their diversity to capture the long tail of rare but safety-critical scenarios. These limitations are particularly evident for perception systems based on LiDAR sensors, which have emerged as a crucial component in modern autonomous systems due to their ability to provide a 3D scan of the world and operate in all lighting conditions. To address these limitations, we introduce a novel approach for testing LiDAR-based perception systems by leveraging existing real-world data as a basis to generate realistic and diverse test cases through mutations that preserve realism invariants while generating inputs rarely found in existing data sets, and automatically crafting oracles that identify potentially safety-critical issues in perception performance. We implemented our approach to assess its ability to identify perception failures, generating over 50,000 test inputs for five state-of-the-art LiDAR-based perception systems. We found that it efficiently generated test cases that yield errors in perception that could result in real consequences if these systems were deployed and does so at a low rate of false positives.

Recommended citation:
Garrett Christian, Trey Woodlief, and Sebastian Elbaum. 2023. Generating Realistic and Diverse Tests for LiDAR-Based Perception Systems. In 45th International Conference on Software Engineering (ICSE ’23), May 17–19, 2023, Melbourne, VIC, AU. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1109/ICSE48619.2023.00217

Download here or Github

Semantic Image Fuzzing of AI Perception Systems

Published in 44th International Conference on Software Engineering (ICSE 2022), 2022

Perception systems enable autonomous systems to interpret raw sensor readings of the physical world. Testing of perception systems aims to reveal misinterpretations that could cause system failures. Current testing methods, however, are inadequate. The cost of human interpretation and annotation of real-world input data is high, so manual test suites tend to be small. The simulation-reality gap reduces the validity of test results based on simulated worlds. And methods for synthesizing test inputs do not provide corresponding expected interpretations. To address these limitations, we developed semSensFuzz, a new approach to fuzz testing of perception systems based on semantic mutation of test cases that pair real-world sensor readings with their ground-truth interpretations. We implemented our approach to assess its feasibility and potential to improve software testing for perception systems. We used it to generate 150,000 semantically mutated image inputs for five state-of-the-art perception systems. We found that it synthesized tests with novel and subjectively realistic image inputs, and that it discovered inputs that revealed significant inconsistencies between the specified and computed interpretations. We also found that it produced such test cases at a cost that was very low compared to that of manual semantic annotation of real-world images.

Recommended citation:
Trey Woodlief, Sebastian Elbaum, and Kevin Sullivan. 2022. Semantic Image Fuzzing of AI Perception Systems. In 44th International Conference on Software Engineering (ICSE ’22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510003.3510212

Download here or Github

Preparing Software Engineers to Develop Robot Systems

Published in 44th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET ’22), 2022

Robotics is a rapidly expanding field that needs software engineers. Most of our undergraduates, however, are not equipped to manage the unique challenges associated with the development of software for modern robots. In this work we introduce a course we have designed and delivered to better prepare students to develop software for robot systems. The course is unique in that it emphasizes the distinctive challenges of software development for robots paired with the software engineering techniques that may help manage those challenges, it provides many opportunities for experiential learning across the robotics and software engineering interface, and it lowers the barriers for learning how to build such systems. In this work we describe the principles and innovations of the course, its content and delivery, and finish with the lessons we have learned.

Recommended citation:
Carl Hildebrandt, Meriel von Stein, Trey Woodlief, and Sebastian Elbaum. 2022. Preparing Software Engineers to Develop Robot Systems. In 44th International Conference on Software Engineering: Software Engineering Education and Training (ICSE-SEET ’22), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510456.3514161

Download here

Fuzzing Mobile Robot Environments for Fast Automated Crash Detection

Published in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021

Testing mobile robots is difficult and expensive, and many faults go undetected. In this work we explore whether fuzzing, an automated test input generation technique, can more quickly find failure inducing inputs in mobile robots. We developed a simple fuzzing adaptation, BASE-FUZZ, and one specialized for fuzzing mobile robots, PHYS-FUZZ. PHYS-FUZZ is unique in that it accounts for physical attributes such as the robot dimensions, estimated trajectories, and time to impact measures to guide the test input generation process. The results of evaluating PHYS-FUZZ suggest that it has the potential to speed up the discovery of input scenarios that reveal failures, finding 56.5% more than uniform random input selection and 7.0% more than BASE-FUZZ during 7 days of testing.

Recommended citation:
T. Woodlief, S. Elbaum and K. Sullivan, "Fuzzing Mobile Robot Environments for Fast Automated Crash Detection," 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 5417-5423, doi: 10.1109/ICRA48506.2021.9561627.

Download here

Faster Biclique Mining in Near-Bipartite Graphs

Published in International Symposium on Experimental Algorithms, 2019

Identifying dense bipartite subgraphs is a common graph data mining task. Many applications focus on the enumeration of all maximal bicliques (MBs), though sometimes the stricter variant of maximal induced bicliques (MIBs) is of interest. Recent work of Kloster et al. introduced a MIB-enumeration approach designed for “near-bipartite” graphs, where the runtime is parameterized by the size k of an odd cycle transversal (OCT), a vertex set whose deletion results in a bipartite graph. Their algorithm was shown to outperform the previously best known algorithm even when k was logarithmic in |V|. In this paper, we introduce two new algorithms optimized for near-bipartite graphs: one which enumerates MIBs in time O(M_I|V||E|k), and another based on the approach of Alexe et al. which enumerates MBs in time O(M_B|V||E|k), where M_I and M_B denote the number of MIBs and MBs in the graph, respectively. We implement all of our algorithms in open-source C++ code and experimentally verify that the OCT-based approaches are faster in practice than the previously existing algorithms on graphs with a wide variety of sizes, densities, and OCT decompositions.

Recommended citation:
Sullivan, Blair D., Andrew van der Poel, and Trey Woodlief. "Faster Biclique Mining in Near-Bipartite Graphs." International Symposium on Experimental Algorithms. Springer, Cham, 2019.

Download here
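
Illustration: the abstract of the biclique mining paper defines an odd cycle transversal (OCT) as a vertex set whose deletion results in a bipartite graph. The sketch below only checks that definition on a small example using a standard BFS 2-coloring. It is not the paper's algorithm and is unrelated to its C++ implementation; the function names `is_bipartite` and `is_oct` and the adjacency-dictionary representation are assumptions made for illustration.

```python
from collections import deque


def is_bipartite(adj, removed=frozenset()):
    """Return True if the graph, after deleting `removed`, can be 2-colored.

    `adj` maps each vertex to a list of its neighbors (undirected graph).
    """
    color = {}
    for start in adj:
        if start in removed or start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in removed:
                    continue
                if v not in color:
                    # Give the neighbor the opposite color.
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Two adjacent vertices share a color: an odd cycle exists.
                    return False
    return True


def is_oct(adj, candidate):
    """Check whether `candidate` is an odd cycle transversal of the graph."""
    return is_bipartite(adj, frozenset(candidate))


# Hypothetical example graph: a triangle 0-1-2 with a pendant edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```

On this example graph, deleting any single triangle vertex (such as `{2}`) leaves a bipartite graph, so an OCT of size k = 1 exists, while `{3}` is not an OCT because the triangle survives.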