Optimizing Vision-Language Model for Road Crossing Intention Estimation

  • Roy Uziel
  • Oded Bialer

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    Abstract

    Identifying a pedestrian's intention to cross the road is crucial for autonomous driving, as it alerts the system to stop or slow down. However, determining crossing intention from video is challenging because it requires extracting complex high-level semantics. This paper introduces ClipCross, a novel classification framework optimized to extract high-level semantic features using the vision-language model CLIP for determining crossing intention. Existing CLIP-based methods perform poorly in this task, as CLIP's image and text encoders fail to capture the nuanced semantic distinctions between crossing and non-crossing intention images. ClipCross addresses this by optimizing a set of CLIP text embeddings to extract high-level semantic features, which a multi-layer perceptron uses to distinguish between crossing and non-crossing intentions. ClipCross achieves state-of-the-art performance on the crossing intention estimation benchmark datasets PIE, PSI, and JAAD.
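
    The pipeline described in the abstract (a set of optimized text embeddings producing semantic features that feed a multi-layer perceptron) can be illustrated with a minimal forward-pass sketch. Everything here is an assumption made for illustration and is not taken from the paper: the features are assumed to be cosine similarities between a single image embedding and each text embedding, the dimensions are toy values, the random vectors stand in for real CLIP outputs, and the names `semantic_features` and `mlp_probability` are hypothetical. Training of the text embeddings and any handling of video frames are omitted.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_features(image_emb, text_embs):
    """One feature per text embedding: its similarity to the image embedding.

    Assumption: similarity is how the text embeddings "extract" features;
    the paper may use a different mechanism.
    """
    return [cosine(image_emb, t) for t in text_embs]


def mlp_probability(features, w1, b1, w2, b2):
    """One-hidden-layer MLP with ReLU, followed by a sigmoid output."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes and random stand-ins (hypothetical, not real CLIP outputs).
rng = random.Random(0)
DIM, NUM_PROMPTS, HIDDEN = 8, 4, 3

image_emb = [rng.gauss(0, 1) for _ in range(DIM)]
# In the described method these embeddings would be optimized during training.
text_embs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PROMPTS)]

w1 = [[rng.gauss(0, 1) for _ in range(NUM_PROMPTS)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [rng.gauss(0, 1) for _ in range(HIDDEN)]
b2 = 0.0

features = semantic_features(image_emb, text_embs)
p_cross = mlp_probability(features, w1, b1, w2, b2)
print(f"features: {len(features)}, P(crossing) = {p_cross:.3f}")
```

    In a trainable version of this sketch, the text embeddings and the MLP weights would both be updated by gradient descent on a crossing/non-crossing classification loss.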

    Original language: English
    Title of host publication: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
    Publisher: Institute of Electrical and Electronics Engineers
    Pages: 1702-1712
    Number of pages: 11
    ISBN (Electronic): 9798331510831
    State: Published - 1 Jan 2025
    Event: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 - Tucson, United States
    Duration: 28 Feb 2025 → 4 Mar 2025

    Publication series

    Name: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025

    Conference

    Conference: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
    Country/Territory: United States
    City: Tucson
    Period: 28/02/25 → 4/03/25

    Keywords

    • autonomous driving
    • crossing intention
    • crossing prediction
    • scene understanding

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Computer Science Applications
    • Computer Vision and Pattern Recognition
    • Human-Computer Interaction
    • Modeling and Simulation
    • Radiology, Nuclear Medicine and Imaging
