IUML: Inception U-Net based multi-task learning for density level classification and crowd density estimation


  Nowadays, image-based people counting is an essential technique for public safety management. However, the task remains extremely challenging because of scale variations arising from differences in scene congestion, viewpoint, image size, and density level. In this paper, we propose a CNN-based framework for people counting and crowd density map estimation that addresses these scale problems. First, we introduce an encoder-decoder architecture composed of Inception modules to learn multi-scale feature representations. To adapt to image resolution, the network is trained with a multi-loss setting over density maps at different resolutions. Second, we apply multi-task learning to learn joint features for the density map estimation task and the density level classification task, which improves feature generality across scenes. Finally, by adopting the U-Net architecture, the encoder and decoder features are fused to generate high-resolution density maps. The efficacy of the proposed method is evaluated in extensive experiments that quantify counting performance with multiple evaluation criteria.
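The multi-loss and multi-task ideas described above can be sketched as a single training objective. The following is a minimal, framework-free illustration and not the code from this repository: the function names, the factor-2 sum pooling, the use of mean squared error and cross-entropy, and the `cls_weight` value are all assumptions made for the example.

```python
import math


def downsample_sum(density, factor=2):
    """Downsample a 2-D density map by summing factor x factor blocks.

    Summing (instead of averaging) keeps the total people count unchanged.
    Assumes height and width are divisible by `factor`.
    """
    h, w = len(density), len(density[0])
    return [
        [
            sum(density[y + dy][x + dx]
                for dy in range(factor) for dx in range(factor))
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def mse(pred, target):
    """Mean squared error between two 2-D maps of equal shape."""
    n = len(pred) * len(pred[0])
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multi_task_loss(pred_maps, gt_map, level_logits, level_label,
                    cls_weight=0.1):
    """Sum of density losses over resolutions plus a weighted class loss.

    pred_maps: predicted density maps, finest first, each assumed to be
               half the resolution of the previous one.
    cls_weight: hypothetical weight balancing the two tasks.
    """
    density_loss = 0.0
    target = gt_map
    for i, pred in enumerate(pred_maps):
        if i > 0:
            target = downsample_sum(target)  # match the coarser prediction
        density_loss += mse(pred, target)
    return density_loss + cls_weight * cross_entropy(level_logits, level_label)
```

In this sketch, each coarser ground-truth map is derived from the finer one, so every resolution integrates to the same count, and the classification term acts as an auxiliary signal on the shared features.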



Scale variation caused by the varying distance between the camera and each person: the apparent object size is inversely proportional to the distance from the camera.


For bird's-eye images taken by UAVs, the camera height introduces further scale variation.


Density variation: Even within the same scene, crowd density levels can differ greatly.


Image resolution variation: Image resolution is also a critical issue for people counting. The lower the image resolution, the smaller the objects appear.
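The distance and resolution effects listed above can be illustrated with the standard pinhole-camera approximation, in which the apparent height in pixels is the focal length (in pixels) times the real height divided by the distance. The snippet below is only a rough sketch: the function names and the numeric values in the usage lines are hypothetical, and real scenes also involve perspective distortion and occlusion that this ignores.

```python
def projected_height_px(person_height_m, distance_m, focal_length_px):
    """Approximate on-image height of a person under the pinhole model."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    return focal_length_px * person_height_m / distance_m


def rescale_focal_length(focal_length_px, scale):
    """Resizing an image by `scale` scales the focal length in pixels too."""
    return focal_length_px * scale


# Hypothetical values: a 1.7 m person, a camera with f = 1000 px.
near = projected_height_px(1.7, 10.0, 1000.0)   # person 10 m away
far = projected_height_px(1.7, 20.0, 1000.0)    # same person 20 m away
# Same scene after downscaling the image to half its resolution.
half_res = projected_height_px(1.7, 10.0, rescale_focal_length(1000.0, 0.5))
```

Doubling the distance or halving the resolution both halve the apparent size in this model, which is why a single fixed receptive field struggles across scenes.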

Proposed Method



ShanghaiTech dataset


UCF_CC_50 dataset


Source Code

  The code is available at https://github.com/SuHuynh/IUML-Crowd-Counting