Understanding and Coding DETR
Transformers took most fields of deep learning by storm. While the NLP community embraced them almost immediately after the Attention Is All You Need paper was published in 2017 and the BERT model followed in 2018, it took longer for them to shake up the computer vision community. That happened only in 2020, when the DEtection TRansformer (DETR) work showed competitive results on object detection tasks.
DETR introduces two main new concepts: 1) object queries and 2) framing detection as a set prediction problem. Together they enable a distinct training paradigm that frees object detection networks from anchor-based predictions and non-maximum suppression. Most importantly, they inspired later state-of-the-art works for segmentation tasks such as Mask2Former (2022) and OneFormer (2023).
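To make these two ideas concrete, here is a minimal sketch (not the official DETR code): learned object queries are decoded into a fixed set of predictions, which are then matched one-to-one to the ground truth with the Hungarian algorithm. The shapes, toy cost function, and random inputs are simplifying assumptions for illustration; the real DETR cost also includes a generalized IoU term.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

num_queries, hidden_dim, num_classes = 100, 256, 91

# 1) Object queries: learned embeddings that the transformer decoder turns
#    into one prediction slot each -- no anchors involved.
object_queries = nn.Embedding(num_queries, hidden_dim)

decoder_layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
box_head = nn.Linear(hidden_dim, 4)                  # (cx, cy, w, h) in [0, 1]

# Stand-in for the encoder output (e.g., a flattened CNN feature map) of one image.
memory = torch.randn(1, 600, hidden_dim)
hs = decoder(object_queries.weight.unsqueeze(0), memory)  # (1, 100, 256)

pred_logits = class_head(hs)[0]          # (100, num_classes + 1)
pred_boxes = box_head(hs)[0].sigmoid()   # (100, 4)

# 2) Set prediction: match the 100 predictions to the ground-truth objects
#    one-to-one with the Hungarian algorithm, so no NMS is needed.
tgt_labels = torch.tensor([17, 65])                  # two ground-truth objects
tgt_boxes = torch.rand(2, 4)

prob = pred_logits.softmax(-1)
cost_class = -prob[:, tgt_labels]                    # (100, 2)
cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (100, 2)
cost = cost_bbox + cost_class                        # toy cost; DETR adds a GIoU term

row_ind, col_ind = linear_sum_assignment(cost.detach().numpy())
print(list(zip(row_ind, col_ind)))  # query i is assigned to ground-truth object j
```

The matched pairs receive classification and box regression losses, while all unmatched queries are trained to predict "no object"; this one-to-one assignment is what removes the need for non-maximum suppression at inference time.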
I recently wrote a Medium article that walks step by step through the DETR architecture and its training paradigm, demonstrating them didactically in code and training the solution on the COCO dataset. If this interests you, please check it out here.