Image-based people counting is a challenging task due to the large scale variation caused by the varying distance between the camera and each person, especially in congested scenes. To handle this problem, previous methods focus on building complicated models and rely on labeling sophisticated density maps to learn the scale variation implicitly. Data pre-processing is often time-consuming, and these deep models are difficult to train due to the lack of training data. In this paper, we therefore propose an alternative and novel approach to crowd counting that handles the scale variation problem by leveraging an auxiliary depth estimation dataset. Using separate crowd and depth datasets, we train a unified network for two tasks, crowd density map estimation and depth estimation, at the same time. By introducing the auxiliary depth estimation task, we show that the scale problem caused by distance can be well solved and that the labeling cost can be reduced. The efficacy of our method is demonstrated in extensive experiments under multiple evaluation criteria.



The scale variation problem is caused by the varying distance between the camera and each person: the size of an object in the image is inversely proportional to its distance from the camera.
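The inverse relationship between image size and distance follows directly from the pinhole-camera model. The sketch below illustrates it; the focal length and person height are illustrative assumptions, not values from the paper.

```python
# Pinhole-camera model: the projected height of a person shrinks
# inversely with the distance Z from the camera (h_img = f * H / Z).
# All numeric values here are illustrative assumptions.

def apparent_height(focal_length_px, person_height_m, distance_m):
    """Projected height (in pixels) of an object at a given distance."""
    return focal_length_px * person_height_m / distance_m

near = apparent_height(focal_length_px=1000.0, person_height_m=1.7, distance_m=5.0)
far = apparent_height(focal_length_px=1000.0, person_height_m=1.7, distance_m=50.0)
# A person 10x farther from the camera appears 10x smaller in the image.
print(near / far)  # 10.0
```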

Proposed Method



The dataset information:


ShanghaiTech dataset

Data description:

The ShanghaiTech dataset has two parts. Part_A contains 482 images collected from the Internet; Part_B contains 716 street-view images taken in Shanghai. Both parts are further divided into training and testing sets. The training set of Part_A contains 300 images, whereas that of Part_B includes 400 images; the remaining images are used for testing. The average number of individuals per image in the testing sets of Part_A and Part_B is 433 and 124, respectively.
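Ground truth for both parts is provided as head annotations; the density maps mentioned in the abstract are commonly generated by placing a Gaussian at each head location so that the map integrates to the person count. A minimal sketch of this standard practice follows, assuming a fixed kernel width rather than the geometry-adaptive kernels used in some pipelines.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, height, width, sigma=4.0):
    """Place a unit impulse at each annotated head position and blur it
    with a Gaussian; the resulting map integrates to the person count."""
    dmap = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        dmap[int(y), int(x)] += 1.0
    return gaussian_filter(dmap, sigma=sigma, mode="constant")

heads = [(30, 40), (100, 80), (55, 55)]   # hypothetical (x, y) annotations
dmap = density_map(heads, height=192, width=256)
# Integrating the density map recovers the ground-truth count.
print(round(dmap.sum()))  # 3
```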

Estimation results:


UCF_CC_50 dataset

Data description:

The UCF_CC_50 dataset was introduced by Idrees et al. [13] and is regarded as a more challenging dataset for crowd counting. It consists of 50 grayscale images taken from various scenarios such as concerts, protests, marathons, and stadiums. The number of people in an image varies from 94 to 4543; on average, there are 1280 individuals per image.

Estimation results:


Depth Estimation on KITTI dataset

Data description:

To train the depth estimation branch, we use the KITTI depth completion dataset [9], which consists of 85,898 training samples. To balance the training samples between the two tasks, we select only 27,000 training images. Each image is randomly cropped and resized to 640x192. For better training, we also apply the colorization algorithm of [10] to smooth the sparse depth maps.
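The random-crop-and-resize step can be sketched as follows. This is a minimal NumPy version with nearest-neighbor sampling standing in for the interpolation a real pipeline would use; the crop window size is an assumption, while the 640x192 output size comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, crop_h, crop_w, out_h=192, out_w=640):
    """Take a random crop_h x crop_w window from the image, then resize
    it to out_h x out_w with nearest-neighbor index sampling."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    crop = img[top:top + crop_h, left:left + crop_w]
    rows = np.arange(out_h) * crop_h // out_h
    cols = np.arange(out_w) * crop_w // out_w
    return crop[rows][:, cols]

img = rng.random((375, 1242))  # a typical KITTI frame resolution
out = random_crop_resize(img, crop_h=320, crop_w=1024)
print(out.shape)  # (192, 640)
```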

Estimation results:


Explanation: why is the estimated depth map not sharp (blurry)?

(1) A VGG network alone cannot generate high-quality depth maps.

(2) The L2 loss forces the model to attend to objects far away from the camera, which benefits crowd counting. However, the L2 loss makes the depth estimation blurry.

(3) Overall loss = Loss_density + lambda * Loss_depth. Because our goal is to generate an accurate crowd density map and count, lambda was set to 0.1, so the model focuses much more on the crowd counting task.
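The weighted objective above can be sketched as follows, using plain NumPy and hypothetical predicted/ground-truth maps in place of actual network outputs.

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared (L2) error between a predicted and a ground-truth map."""
    return np.mean((pred - target) ** 2)

def overall_loss(pred_density, gt_density, pred_depth, gt_depth, lam=0.1):
    """Overall loss = Loss_density + lambda * Loss_depth; lambda = 0.1
    keeps the model focused on the crowd counting task."""
    return l2_loss(pred_density, gt_density) + lam * l2_loss(pred_depth, gt_depth)

rng = np.random.default_rng(0)
gt_density = rng.random((1, 96, 128))
gt_depth = rng.random((1, 96, 128))
# Offset predictions by a known amount so the loss is easy to verify:
# density term = 0.1^2 = 0.01, depth term = 0.1 * 0.2^2 = 0.004.
loss = overall_loss(gt_density + 0.1, gt_density, gt_depth + 0.2, gt_depth)
print(round(loss, 6))  # 0.014
```

With lambda = 0.1 the depth term contributes only a fraction of the gradient, which is consistent with point (3): depth estimation is an auxiliary task, so its loss is deliberately down-weighted.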

Source Code

  The code will be published soon.


[9] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant CNNs,” arXiv preprint arXiv:1708.06500, 2017.

[10] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” ACM transactions on graphics (tog), vol. 23, no. 3, 2004.

[13] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547-2554, 2013.