Publications
For the complete list, please see my Google Scholar Profile.
2024
- Eigen-Cluster VIS: Improving Weakly-supervised Video Instance Segmentation by Leveraging Spatio-temporal Consistency. Farnoosh Arefi, Amir M Mansourian, and Shohreh Kasaei. arXiv preprint arXiv:2408.16661, 2024.
The performance of Video Instance Segmentation (VIS) methods has improved significantly with the advent of transformer networks. However, these networks often face challenges in training due to the high annotation cost. To address this, unsupervised and weakly-supervised methods have been developed to reduce the dependency on annotations. This work introduces a novel weakly-supervised method called Eigen-Cluster VIS that, without requiring any mask annotations, achieves competitive accuracy compared to other VIS approaches. This method is based on two key innovations: a Temporal Eigenvalue Loss (TEL) and a clip-level Quality Cluster Coefficient (QCC). The TEL ensures temporal coherence by leveraging the eigenvalues of the Laplacian matrix derived from graph adjacency matrices. By minimizing the mean absolute error (MAE) between the eigenvalues of adjacent frames, this loss function promotes smooth transitions and stable segmentation boundaries over time, reducing temporal discontinuities and improving overall segmentation quality. The QCC employs the K-means method to ensure the quality of spatio-temporal clusters without relying on ground truth masks. Using the Davies-Bouldin score, the QCC provides an unsupervised measure of feature discrimination, allowing the model to self-evaluate and adapt to varying object distributions, enhancing robustness during the testing phase. These enhancements are computationally efficient and straightforward, offering significant performance gains without additional annotated data. The proposed Eigen-Cluster VIS method is evaluated on the YouTube-VIS 2019/2021 and OVIS datasets, demonstrating that it effectively narrows the performance gap between the fully-supervised and weakly-supervised VIS approaches.
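As a rough illustration of the Temporal Eigenvalue Loss idea described in the abstract above, the following Python sketch computes the mean absolute error between the Laplacian eigenvalues of consecutive frames. It is not the paper's implementation. The unnormalized Laplacian L = D - A, the list-of-lists matrices, the function names, and the small Jacobi routine (used here only so the sketch has no dependencies, in place of a library eigen-solver) are all assumptions made for illustration.

```python
import math

def laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - A (assumed form) of a symmetric adjacency matrix."""
    n = len(adjacency)
    return [[(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j]
             for j in range(n)] for i in range(n)]

def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations, sorted ascending."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry, kept within |theta| <= pi/4.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def temporal_eigenvalue_loss(adjacencies):
    """MAE between Laplacian spectra of adjacent frames, averaged over frame pairs.

    Assumes at least two frames and the same number of graph nodes in every frame.
    """
    spectra = [symmetric_eigenvalues(laplacian(a)) for a in adjacencies]
    pair_losses = [sum(abs(x - y) for x, y in zip(s, t)) / len(s)
                   for s, t in zip(spectra, spectra[1:])]
    return sum(pair_losses) / len(pair_losses)

# Two toy "frames": a 3-node path graph followed by a 3-node triangle.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
loss = temporal_eigenvalue_loss([path, triangle])
```

In this toy case the spectra are (0, 1, 3) and (0, 3, 3), so the loss is 2/3. Identical consecutive graphs give a loss of zero, which is the behaviour the abstract associates with temporal coherence.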
- SoccerNet game state reconstruction: End-to-end athlete tracking and identification on a minimap. Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir M Mansourian, Xin Zhou, Shohreh Kasaei, and others. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Tracking and identifying athletes on the pitch holds a central role in collecting essential insights from the game, such as estimating the total distance covered by players or understanding team tactics. This tracking and identification process is crucial for reconstructing the game state, defined by the athletes’ positions and identities on a 2D top-view of the pitch (i.e., a minimap). However, reconstructing the game state from videos captured by a single camera is challenging. It requires understanding the position of the athletes and the viewpoint of the camera to localize and identify players within the field. In this work, we formalize the task of Game State Reconstruction and introduce SoccerNet-GSR, a novel Game State Reconstruction dataset focusing on football videos. SoccerNet-GSR is composed of 200 video sequences of 30 seconds, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch with their respective role, team, and jersey number. Furthermore, we introduce GS-HOTA, a novel metric to evaluate game state reconstruction methods. Finally, we propose and release an end-to-end baseline for game state reconstruction, bootstrapping the research on this task. Our experiments show that GSR is a challenging novel task, which opens the field for future research.
- Attention-guided Feature Distillation for Semantic Segmentation. Amir M Mansourian, Arya Jalali, Rozhan Ahmadi, and Shohreh Kasaei. arXiv preprint arXiv:2403.05451, 2024.
In contrast to existing complex methodologies commonly employed for distilling knowledge from a teacher to a student, the proposed method showcases the efficacy of a simple yet powerful method for utilizing refined feature maps to transfer attention. The proposed method has proven to be effective in distilling rich information, outperforming existing methods in semantic segmentation as a dense prediction task. The proposed Attention-guided Feature Distillation (AttnFD) method employs the Convolutional Block Attention Module (CBAM), which refines feature maps by taking into account both channel-specific and spatial information content. By only using the Mean Squared Error (MSE) loss function between the refined feature maps of the teacher and the student, AttnFD demonstrates outstanding performance in semantic segmentation, achieving state-of-the-art results in terms of mean Intersection over Union (mIoU) on the Pascal VOC 2012 and Cityscapes datasets.
- Deep Spectral Improvement for Unsupervised Image Instance Segmentation. Farnoosh Arefi, Amir M Mansourian, and Shohreh Kasaei. PLOS ONE, 2024.
Recently, there has been growing interest in deep spectral methods for image localization and segmentation, influenced by traditional spectral segmentation approaches. These methods reframe the image decomposition process as a graph partitioning task by extracting features using self-supervised learning and utilizing the Laplacian of the affinity matrix to obtain eigensegments. However, instance segmentation has received less attention compared to other tasks within the context of deep spectral methods. This paper addresses the fact that not all channels of the feature map extracted from a self-supervised backbone contain sufficient information for instance segmentation purposes. In fact, some channels are noisy and hinder the accuracy of the task. To overcome this issue, this paper proposes two channel reduction modules: Noise Channel Reduction (NCR) and Deviation-based Channel Reduction (DCR). The NCR retains channels with lower entropy, as they are less likely to be noisy, while DCR prunes channels with low standard deviation, as they lack sufficient information for effective instance segmentation. Furthermore, the paper demonstrates that the dot product, commonly used in deep spectral methods, is not suitable for instance segmentation due to its sensitivity to feature map values, potentially leading to incorrect instance segments. To address this issue, a new similarity metric called Bray-Curtis over Chebyshev (BoC) is proposed. It takes into account the distribution of features in addition to their values, providing a more robust similarity measure for instance segmentation. Quantitative and qualitative results on the YouTube-VIS 2019 dataset highlight the improvements achieved by the proposed channel reduction methods and the use of BoC instead of the conventional dot product for creating the affinity matrix. These improvements are observed in terms of mean Intersection over Union (mIoU) and extracted instance segments, demonstrating enhanced instance segmentation performance.
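The two channel reduction criteria mentioned in the abstract above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's code. The histogram-based entropy estimate, the `bins`, `keep`, and `min_std` parameters, and the function names are hypothetical choices, and the sketch does not attempt to show how the paper combines the two modules.

```python
import math
import statistics

def channel_entropy(values, bins=8):
    """Shannon entropy (bits) of a histogram of one channel's activations (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant channel has a single occupied bin
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def ncr_keep(channels, keep):
    """NCR-style selection: indices of the `keep` lowest-entropy channels."""
    order = sorted(range(len(channels)), key=lambda i: channel_entropy(channels[i]))
    return sorted(order[:keep])

def dcr_keep(channels, min_std):
    """DCR-style selection: indices of channels whose standard deviation is at least `min_std`."""
    return [i for i, ch in enumerate(channels) if statistics.pstdev(ch) >= min_std]

# Three flattened toy channels: sparse, spread out, and constant.
channels = [
    [0.0, 0.0, 0.0, 1.0],
    [0.1, 0.4, 0.6, 0.9],
    [0.5, 0.5, 0.5, 0.5],
]
low_entropy = ncr_keep(channels, keep=2)
informative = dcr_keep(channels, min_std=0.1)
```

On this toy input the entropy criterion keeps the sparse and constant channels, while the deviation criterion drops the constant one, which hints at why the two criteria are complementary.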
- Rethinking RAFT for Efficient Optical Flow. Navid Eslami, Farnoosh Arefi, Amir M Mansourian, and Shohreh Kasaei. International Conference on Machine Vision and Image Processing (MVIP), 2024.
Despite significant progress in deep learning-based optical flow methods, accurately estimating large displacements and repetitive patterns remains a challenge. The limitations of local features and similarity search patterns used in these algorithms contribute to this issue. Additionally, some existing methods suffer from slow runtime and excessive graphics memory consumption. To address these problems, this paper proposes a novel approach based on the RAFT framework. The proposed Attention-based Feature Localization (AFL) approach incorporates the attention mechanism to handle global feature extraction and address repetitive patterns. It introduces an operator for matching pixels with corresponding counterparts in the second frame and assigning accurate flow values. Furthermore, an Amorphous Lookup Operator (ALO) is proposed to enhance convergence speed and improve RAFT's ability to handle large displacements by reducing data redundancy in its search operator and expanding the search space for similarity extraction. The proposed method, Efficient RAFT (Ef-RAFT), achieves significant improvements of 10% on the Sintel dataset and 5% on the KITTI dataset over RAFT. Remarkably, these enhancements are attained with a modest 33% reduction in speed and a mere 13% increase in memory usage.
2023
- Multi-task Learning for Joint Re-identification, Team Affiliation, and Role Classification for Sports Visual Tracking. Amir M Mansourian, Vladimir Somers, Christophe De Vleeschouwer, and Shohreh Kasaei. In Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023.
Effective tracking and re-identification of players is essential for analyzing soccer videos. But it is a challenging task due to the non-linear motion of players, the similarity in appearance of players from the same team, and frequent occlusions. Therefore, the ability to extract meaningful embeddings to represent players is crucial in developing an effective tracking and re-identification system. In this paper, a multi-purpose part-based person representation method, called PRTreID, is proposed that performs three tasks of role classification, team affiliation, and re-identification, simultaneously. In contrast to available literature, a single network is trained with multi-task supervision to solve all three tasks, jointly. The proposed joint method is computationally efficient due to the shared backbone. Also, the multi-task learning leads to richer and more discriminative representations, as demonstrated by both quantitative and qualitative results. To demonstrate the effectiveness of PRTreID, it is integrated with a state-of-the-art tracking method, using a part-based post-processing module to handle long-term tracking. The proposed tracking method outperforms all existing tracking methods on the challenging SoccerNet tracking dataset.
- SoccerNet 2023 Challenges Results. Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliege, Jan Held, Carlos Hinojosa, Amir M Mansourian, and others. Sports Engineering, 2023.
The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches.
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation. Amir M Mansourian, Rozhan Ahmadi, and Shohreh Kasaei. arXiv preprint arXiv:2308.04243, 2023.
In recent years, deep neural networks have achieved remarkable accuracy in computer vision tasks. With inference time being a crucial factor, particularly in dense prediction tasks such as semantic segmentation, knowledge distillation has emerged as a successful technique for improving the accuracy of lightweight student networks. The existing methods often neglect the information in channels and among different classes. To overcome these limitations, this paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation. The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs. This is followed by calculating inter-class similarity matrices for distillation using KL divergence between distributions of each pair of classes. To further improve the effectiveness of the proposed method, an Adaptive Loss Weighting (ALW) training strategy is proposed. Unlike existing methods, the ALW strategy gradually reduces the influence of the teacher network towards the end of the training process to account for errors in the teacher’s predictions. Extensive experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method in terms of mIoU and pixel accuracy. The proposed method outperforms most existing knowledge distillation methods as demonstrated by both quantitative and qualitative evaluations.
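The inter-class similarity idea in the abstract above can be illustrated with a small Python sketch: each class channel of the network output is turned into a distribution, pairwise KL divergences between classes form a matrix, and the teacher and student matrices are compared. This is a simplified reading of the abstract rather than the paper's implementation. The spatial softmax used to form each intra-class distribution, the mean squared error between the two matrices, and all function names are assumptions.

```python
import math

def spatial_softmax(scores):
    """Turn one class channel (flattened over pixels) into a distribution (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def inter_class_similarity(logits):
    """Matrix of KL divergences between every pair of class distributions.

    `logits` is a list with one flattened score map per class.
    """
    dists = [spatial_softmax(channel) for channel in logits]
    return [[kl_divergence(p, q) for q in dists] for p in dists]

def icsd_loss(teacher_logits, student_logits):
    """Mean squared difference between teacher and student matrices (assumed comparison)."""
    t = inter_class_similarity(teacher_logits)
    s = inter_class_similarity(student_logits)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Toy outputs with 2 classes and 3 pixels.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mismatch = icsd_loss(teacher, student)
match = icsd_loss(teacher, teacher)
```

A student whose class distributions relate to each other exactly as the teacher's do gets zero loss, and any difference in those pairwise relations is penalized. The Adaptive Loss Weighting schedule is not sketched here because the abstract does not specify its form.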
- An Efficient Knowledge Distillation Architecture for Real-time Semantic Segmentation. Amir M Mansourian, Nader Karimi, and Shohreh Kasaei. AUT Journal of Modeling and Simulation, 2023.
In recent years, Convolutional Neural Networks (CNNs) have made significant strides in the field of segmentation, particularly in semantic segmentation where both accuracy and efficiency are crucial. However, despite their high accuracy, these deep networks are not practical for real-time use due to their low inference speed. This issue has prompted researchers to explore various techniques to improve the efficiency of CNNs. One such technique is knowledge distillation, which involves transferring knowledge from a larger, cumbersome (teacher) model to a smaller, more compact (student) model. This paper proposes a simple yet efficient approach to address the issue of low inference speed in CNNs using knowledge distillation. The proposed method involves distilling knowledge from the feature maps of the teacher model to guide the learning of the student model. The approach uses a straightforward technique known as pixel-wise distillation to transfer the feature maps of the last convolution layer of the teacher model to the student model. Additionally, a pair-wise distillation technique is used to transfer pair-wise similarities of the intermediate layers. To validate the effectiveness of the proposed method, extensive experiments were conducted on the Pascal VOC 2012 dataset using a state-of-the-art DeepLabV3+ segmentation network with different backbone architectures. The results showed that the proposed method achieved a balance between mean Intersection over Union (mIoU) and training time.