Early vs Late Fusion in Multimodal Convolutional Neural Networks

Gadzicki, Konrad; Khamsehashari, Razieh; Zetzsche, Christoph

doi:10.23919/FUSION45008.2020.9190246

Abstract:

Combining machine learning in neural networks with multimodal fusion strategies offers an interesting potential for classification tasks but the optimum fusion strategies for many applications have yet to be determined. Here we address this issue in the context of human activity recognition, making use of a state-of-the-art convolutional network architecture (Inception I3D) and a huge dataset (NTU RGB+D). As modalities we consider RGB video, optical flow, and skeleton data. We determine whether the fusion of different modalities can provide an advantage as compared to uni-modal approaches, and whether a more complex early fusion strategy can outperform the simpler late fusion strategy by making use of statistical correlations between the different modalities. Our results show a clear performance improvement by multi-modal fusion and a substantial advantage of an early fusion strategy.

PDF URL: https://doi.org/10.23919/FUSION45008.2020.9190246

Reference:

Early vs Late Fusion in Multimodal Convolutional Neural Networks (Konrad Gadzicki, Razieh Khamsehashari, Christoph Zetzsche), In 23rd International Conference on Information Fusion (FUSION), IEEE, 2020.

Bibtex Entry:

@inproceedings{Gadzicki2020Fusion,
	author = {Gadzicki, Konrad and Khamsehashari, Razieh and Zetzsche, Christoph},
	booktitle={23rd International Conference on Information Fusion (FUSION)},
	title={Early vs Late Fusion in Multimodal Convolutional Neural Networks}, 
	year={2020},  
	pages={1-6},
	publisher={IEEE},
	abstract={Combining machine learning in neural networks with multimodal fusion strategies offers an interesting potential for classification tasks but the optimum fusion strategies for many applications have yet to be determined. Here we address this issue in the context of human activity recognition, making use of a state-of-the-art convolutional network architecture (Inception I3D) and a huge dataset (NTU RGB+D). As modalities we consider RGB video, optical flow, and skeleton data. We determine whether the fusion of different modalities can provide an advantage as compared to uni-modal approaches, and whether a more complex early fusion strategy can outperform the simpler late fusion strategy by making use of statistical correlations between the different modalities. Our results show a clear performance improvement by multi-modal fusion and a substantial advantage of an early fusion strategy.},
	doi = {10.23919/FUSION45008.2020.9190246},
	url={https://doi.org/10.23919/FUSION45008.2020.9190246},
	keywords = {EASE-H3},
}