A Dual-Channel Audio Generation Method Based on Multimodal Perception
Citation: GUAN Li1, YIN Kang1, FAN Mengjia1, XUE Kun1, XIE Kai2. A Dual-Channel Audio Generation Method Based on Multimodal Perception [J]. 计算技术与自动化 (Computing Technology and Automation), 2022, (4): 157-165.
Author affiliations
GUAN Li1, YIN Kang1, FAN Mengjia1, XUE Kun1, XIE Kai2 (1. State Grid Beijing Electric Power Company, Beijing 100031, China; 2. NR Electric Co., Ltd., Nanjing 211102, Jiangsu, China)
Abstract: Most existing videos contain only mono audio and lack the sense of space that dual-channel audio provides. To address this problem, this paper proposes a dual-channel audio generation method based on multimodal perception. Building on an analysis of the visual information in a video, it fuses the video's spatial information with the audio content and automatically adds spatialization features to the original mono audio, generating dual-channel audio that is closer to a real listening experience. We first encode the mono video with an improved audio-video fusion analysis network built on an encoder-decoder structure, then fuse the video and audio features at multiple scales and analyze the two modalities jointly, so that the generated dual-channel audio carries spatial information absent from the original mono audio; finally, the network outputs the dual-channel audio corresponding to the video. Experimental results on public datasets show that the proposed method outperforms existing models in dual-channel audio generation, with improvements in both the STFT distance and the ENV distance metrics.
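The abstract does not give implementation details. In comparable mono-to-binaural systems, the mono track is treated as the sum of the two channels and the network predicts a left-minus-right difference signal, from which both channels are recovered. A minimal numpy sketch of that reconstruction step, under that assumption and with all names hypothetical:

```python
import numpy as np

def reconstruct_channels(mono, predicted_diff):
    """Recover left/right channels from a mono mixture and a predicted
    difference signal, assuming mono = left + right and
    predicted_diff ~ left - right (a common formulation, not
    necessarily the one used in this paper)."""
    left = (mono + predicted_diff) / 2.0
    right = (mono - predicted_diff) / 2.0
    return left, right

# Toy example: a 440 Hz source panned toward the left channel.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
left_true = 0.8 * np.sin(2 * np.pi * 440 * t)
right_true = 0.2 * np.sin(2 * np.pi * 440 * t)
mono = left_true + right_true
diff = left_true - right_true  # stand-in for the network's prediction
left, right = reconstruct_channels(mono, diff)
```

With a perfect difference prediction the reconstruction is exact; in practice the quality of the output depends entirely on how well the fusion network estimates the difference from the visual cues.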
Keywords: audio generation; convolutional neural network; multimodal
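The two reported metrics, STFT distance and ENV distance, compare the generated channels to ground truth in the complex spectrogram domain and in the amplitude-envelope domain, respectively. A minimal numpy sketch under commonly used definitions (the frame and hop sizes are assumptions; the paper does not specify them):

```python
import numpy as np

def stft(x, n_fft=512, hop=160):
    """Windowed short-time Fourier transform (frames x freq bins)."""
    win = np.hanning(n_fft)
    idx = range(0, len(x) - n_fft + 1, hop)
    frames = np.stack([x[i:i + n_fft] * win for i in idx])
    return np.fft.rfft(frames, axis=1)

def envelope(x):
    """Amplitude envelope via the analytic signal (Hilbert transform)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spec * h))

def stft_distance(pred_l, pred_r, gt_l, gt_r):
    """Euclidean distance between predicted and ground-truth STFTs,
    summed over the two channels."""
    return (np.linalg.norm(stft(pred_l) - stft(gt_l))
            + np.linalg.norm(stft(pred_r) - stft(gt_r)))

def env_distance(pred_l, pred_r, gt_l, gt_r):
    """Euclidean distance between amplitude envelopes, summed over
    the two channels."""
    return (np.linalg.norm(envelope(pred_l) - envelope(gt_l))
            + np.linalg.norm(envelope(pred_r) - envelope(gt_r)))
```

Both distances are zero for a perfect prediction and grow with spectral or envelope error, so lower values indicate better dual-channel generation.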
 