ihsumlee
7 July 2023
Papers we should take into account: CVPR
What are intra-frame and inter-frame coding: Intra-frame coding does not reference any other image information, so it can also be described as still-image coding. Inter-frame coding, by contrast, must reference other image data.
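To make the distinction concrete, here is a minimal sketch on toy one-dimensional "frames". The function names (`encode_intra`, `encode_inter`, `decode_inter`) are made up for illustration, and real codecs add motion compensation, transforms, and entropy coding on top of this idea.

```python
def encode_intra(frame):
    """Intra-frame coding: the frame is stored using only its own pixels."""
    return list(frame)


def encode_inter(frame, reference):
    """Inter-frame coding: store only the per-pixel difference from a reference frame."""
    return [cur - ref for cur, ref in zip(frame, reference)]


def decode_inter(residual, reference):
    """Rebuild the frame by adding the residual back onto the reference frame."""
    return [res + ref for res, ref in zip(residual, reference)]


# Two toy frames (flattened pixel rows) that differ in a single pixel.
frame0 = [10, 10, 200, 200, 10, 10]
frame1 = [10, 10, 200, 200, 10, 12]

intra_code = encode_intra(frame0)            # needs nothing but frame0 itself
inter_code = encode_inter(frame1, frame0)    # mostly zeros, so it compresses well
reconstructed = decode_inter(inter_code, frame0)

print(inter_code)
print(reconstructed == frame1)
```

The inter-coded frame cannot be decoded without `frame0`, which is exactly the dependency on other image data described above.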
From [1]:
- This paper notes that early approaches modeled the RVOS (referring video object segmentation) task as a sequence prediction problem and paid little attention to the temporal relationships between different frames.
- While it may be acceptable for such conversion to handle the descriptions of static properties such as the appearance and color of the objects, this approach may lose the perception of target objects for language descriptions expressing temporal variations of objects due to the lack of video-level multi-modal understanding.
- What they propose: a Semantic Integration Module (SIM) that efficiently aggregates intra-frame and inter-frame information. With a global view of the video content, SIM can facilitate the understanding of temporal variations as well as alignment across different modalities and granularities. They also introduce visual-linguistic contrastive learning to provide semantic supervision and guide the establishment of a video-level multi-modal joint space.
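As a rough illustration of what contrastive learning between video and text embeddings can look like, below is a generic, one-directional InfoNCE-style loss in plain Python. This is not the loss from [1]: the function name `info_nce`, the temperature value, and the toy embeddings are all assumptions made for this sketch.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(video_embs, text_embs, temperature=0.1):
    """Mean cross-entropy of matching video i to text i among all texts in the batch."""
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    total = 0.0
    for i, v in enumerate(videos):
        # Cosine similarity to every text, scaled by the temperature.
        logits = [dot(v, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(videos)


# Toy batch: two video-level embeddings and two sentence embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # text i describes video i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # descriptions deliberately mismatched

loss_aligned = info_nce(video, text_aligned)
loss_swapped = info_nce(video, text_swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this kind pulls matching video and text embeddings together and pushes mismatched pairs apart, which is the general mechanism behind building a joint multi-modal space.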
Visual odometry (VO) challenges
- Although indoor robot localization has been implemented successfully, robot localization in outdoor environments remains a challenging problem. Many factors (e.g., uneven terrain, direct sunlight, shadows, and dynamic changes in the environment caused by wind and sunlight) make localization difficult in outdoor environments (Takahashi 2007). The main challenges in VO systems relate to computational cost and to lighting and imaging conditions (Gonzalez et al. 2013; Nagatani et al. 2010; Nourani-Vatani and Borges 2011; Yu et al. 2011).
- For VO to work efficiently, the environment needs sufficient illumination and a static scene with enough texture to allow apparent motion to be extracted (Scaramuzza and Fraundorfer 2011). In areas with a smooth, low-textured floor surface, directional sunlight and lighting conditions become a major concern because they lead to non-uniform scene lighting. Moreover, shadows from static or dynamic objects, or from the vehicle itself, can disturb the calculation of pixel displacement and thus result in erroneous displacement estimation (Gonzalez et al. 2012; Nourani-Vatani and Borges 2011).
- Monocular vision systems suffer from scale uncertainty (Kitt et al. 2011; Cumani 2011; Zhang et al. 2014). If the surface is uneven, the image scale will fluctuate, and the image scaling factor will be difficult to estimate. According to Kitt et al. (2011), estimation of the scaling factor may become erroneous when a large change in the road slope occurs, which may lead to incorrect estimation of the resulting trajectory.
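The scale uncertainty of monocular systems can be checked with a small worked example: if the whole scene and the camera translation are scaled by the same factor, a pinhole camera produces the same image, so the images alone cannot reveal the true scale. The sketch below assumes an ideal pinhole camera with no rotation; the helper names `project` and `scale` and the numbers are made up for illustration.

```python
def project(point, camera_position, focal_length=1.0):
    """Pinhole projection of a 3-D point seen from a translated, non-rotated camera."""
    x, y, z = (p - c for p, c in zip(point, camera_position))
    return (focal_length * x / z, focal_length * y / z)


def scale(vec, s):
    """Multiply every component of a 3-D vector by s."""
    return tuple(s * v for v in vec)


# A few scene points in front of the camera and a small camera motion.
points = [(1.0, 2.0, 10.0), (-3.0, 0.5, 8.0), (0.0, -1.0, 12.0)]
camera_move = (0.5, 0.0, 1.0)
s = 3.0  # arbitrary global scale factor

original = [project(p, camera_move) for p in points]
scaled = [project(scale(p, s), scale(camera_move, s)) for p in points]

# Largest difference between the two sets of image coordinates.
max_diff = max(
    abs(a - b) for o, sc in zip(original, scaled) for a, b in zip(o, sc)
)
print(max_diff < 1e-9)
```

Because both configurations yield the same image coordinates, a monocular VO system needs extra information (for example a known camera height or object size) to fix the scale, and errors in that information carry over into the estimated trajectory.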
[1] SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
[2] Language as queries for referring video object segmentation