TY - GEN
T1 - Globally-Robust Instance Identification and Locally-Accurate Keypoint Alignment for Multi-Person Pose Estimation
AU - Tian, Fangzheng
AU - Kim, Sungchan
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/10/27
Y1 - 2023/10/27
AB - Scenes with a large number of human instances are characterized by significant overlap of the instances with similar appearance, occlusion, and scale variation. We propose GRAPE, a novel method that leverages both Globally Robust human instance identification and locally Accurate keypoint alignment for 2D Pose Estimation. GRAPE predicts instance center and keypoint heatmaps, as global identifications of instance location and scale, and keypoint offset vectors from instance centers, as representations of accurate local keypoint positions. We use Transformer to jointly learn the global and local contexts, which allows us to robustly detect instance centers even in difficult cases such as crowded scenes, and align instance offset vectors with relevant keypoint heatmaps, resulting in refined final poses. GRAPE also predicts keypoint visibility, which is crucial for estimating centers of partially visible instances in crowded scenes. We demonstrate that GRAPE achieves state-of-the-art performance on the CrowdPose, OCHuman, and COCO datasets. The benefit of GRAPE is more apparent on crowded scenes (CrowdPose and OCHuman), where our model significantly outperforms previous methods, especially on hard examples.
KW - crowded scene
KW - human pose estimation
KW - single-stage
KW - transformer
UR - https://www.scopus.com/pages/publications/85179553745
U2 - 10.1145/3581783.3612525
DO - 10.1145/3581783.3612525
M3 - Conference paper
AN - SCOPUS:85179553745
T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
SP - 4816
EP - 4827
BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 31st ACM International Conference on Multimedia, MM 2023
Y2 - 29 October 2023 through 3 November 2023
ER -