<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>MDVC - Vladimir Iashin</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap" rel="stylesheet">
<link rel="stylesheet" href="css/style_phd_times.css">
<link rel="icon" href="./favicon.ico" type="image/x-icon">
</head>
<body>
<ul class="breadcrumb">
<a class="bread_crumb" href="index.html">Vladimir Iashin</a> / Multi-modal Dense Video Captioning
</ul>
<h1> Multi-modal Dense Video Captioning </h1>
<!-- Authors -->
<div class="container_authors">
<div class="author_affiliation">
<div class="author_name">
<a class="social" href="https://v-iashin.github.io/"> Vladimir Iashin </a>
</div>
<div class="affiliation"> Tampere University </div>
</div>
<div class="author_affiliation">
<div class="author_name">
<a class="social" href="https://esa.rahtu.fi/"> Esa Rahtu </a>
</div>
<div class="affiliation"> Tampere University </div>
</div>
</div>
<!-- Conference -->
<div class="conference">
<a class="social" href="https://mul-workshop.github.io/"> Workshop on Multimodal Learning 2020 (CVPR Workshop) </a>
</div>
<!-- Links -->
<div class="code_and_links">
<div style="padding: 1em 0 1em 0;" class="div_link"> <a class="social" href="https://github.com/v-iashin/MDVC"> Code </a> </div>
<div style="padding: 1em 0 1em 0;" class="div_link"> <a class="social" href="http://openaccess.thecvf.com/content_CVPRW_2020/html/w56/Iashin_Multi-Modal_Dense_Video_Captioning_CVPRW_2020_paper.html"> Paper </a> </div>
<div style="padding: 1em 0 1em 0;" class="div_link"> <a class="social" href="https://www.youtube.com/watch?v=0Vmx_gzP1bM"> Presentation </a> </div>
</div>
<!-- Teaser -->
<div class="section">
<div class="section_content">
<div class="img background_beige_rounded">
<img src="./images/mdvc/Introduction.svg" alt="Teaser MDVC">
</div>
<div class="caption">
Figure. An example video with ground-truth captions and the predictions of our Multi-modal Dense Video Captioning module (Ours).
(<a class="intext" href="https://www.youtube.com/embed/PLqTX6ij52U?rel=0">Link to the video on YouTube</a>)
</div>
</div>
</div>
<!-- Abstract -->
<div class="section">
<div class="section_name"> Abstract </div>
<div class="section_content">
Dense video captioning is the task of localizing interesting events in an untrimmed video and producing a textual description (caption) for each localized event. Most previous works on dense video captioning rely solely on visual information and completely ignore the audio track. However, audio, and speech in particular, are vital cues for a human observer in understanding an environment. In this paper, we present a new dense video captioning approach that is able to utilize any number of modalities for event description. Specifically, we show how the audio and speech modalities may improve a dense video captioning model. We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track. We formulate the captioning task as a machine translation problem and utilize the recently proposed Transformer architecture to convert multi-modal input data into textual descriptions. We demonstrate the performance of our model on the ActivityNet Captions dataset. The ablation studies indicate a considerable contribution from the audio and speech components, suggesting that these modalities contain substantial information complementary to the video frames. Furthermore, we provide an in-depth analysis of the ActivityNet Captions results by leveraging the category tags obtained from the original YouTube videos.
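<p class="p_on_project_pages">
As a rough illustration only (not the code from the repository linked above, and assuming word-level ASR timestamps are already available), the subtitle-like speech input for a single event proposal could be assembled as follows:
</p>
<pre class="code_box background_beige_rounded"><code># Minimal sketch: select the ASR words that temporally overlap an event proposal
# so that speech enters the captioning model as a subtitle-like text stream.

def speech_for_event(asr_words, seg_start, seg_end):
    """asr_words: list of (word, word_start, word_end), times in seconds."""
    selected = []
    for word, w_start, w_end in asr_words:
        # keep a word if its time span intersects the proposal segment
        if min(w_end, seg_end) - max(w_start, seg_start) > 0:
            selected.append(word)
    return " ".join(selected)

# Hypothetical example: a proposal spanning 4.0 s to 9.5 s
asr = [("add", 3.9, 4.2), ("the", 4.2, 4.4), ("flour", 4.4, 5.0), ("slowly", 9.0, 9.6)]
print(speech_for_event(asr, 4.0, 9.5))  # "add the flour slowly"</code></pre>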
</div>
</div>
<!-- Our Framework -->
<div class="our_framework">
<div class="section_name"> Our Framework </div>
<div class="section_content">
<div class="img background_beige_rounded">
<img src="./images/mdvc/MDVC.svg" alt="Our Framework">
</div>
<div class="caption">
Figure. The proposed Multi-modal Dense Video Captioning (MDVC) framework. Given an input consisting of several modalities, namely audio, speech, and visual, each modality is processed by its own feature transformer (middle) to produce an internal representation. The representations are then fused in the multi-modal generator (right), which outputs a distribution over the vocabulary.
</div>
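<p class="p_on_project_pages">
For concreteness, the fusion step could look roughly like the PyTorch sketch below. The concatenation-based fusion and the single projection layer are illustrative assumptions, not the released implementation (see the Code link above for that).
</p>
<pre class="code_box background_beige_rounded"><code># Illustrative sketch of a multi-modal generator: fuse per-modality decoder
# states and map the result to a distribution over the vocabulary.
import torch
import torch.nn as nn

class MultiModalGenerator(nn.Module):
    def __init__(self, d_audio, d_speech, d_visual, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_audio + d_speech + d_visual, vocab_size)

    def forward(self, audio_state, speech_state, visual_state):
        # each state: (batch, caption_len, d_modality) from its feature transformer
        fused = torch.cat([audio_state, speech_state, visual_state], dim=-1)
        # log-probabilities over the vocabulary for the next caption word
        return self.proj(fused).log_softmax(dim=-1)</code></pre>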
<br>
<div class="img background_beige_rounded">
<img src="./images/mdvc/Transformer_individual.svg" alt="Feature Transformer">
</div>
<div class="caption">
Figure. The feature transformer architecture, which consists of an encoder (bottom) and a decoder (top). The encoder takes pre-processed, position-encoded features (e.g. I3D features for the visual modality) and outputs an internal representation. The decoder, in turn, is conditioned on both the position-encoded caption generated so far and the output of the encoder. Finally, the decoder outputs its internal representation.
</div>
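<p class="p_on_project_pages">
A simplified sketch of one such feature transformer is shown below; the hyper-parameters are placeholders and the positional encoding is assumed to be applied beforehand, so this is an illustration rather than the released code.
</p>
<pre class="code_box background_beige_rounded"><code># Illustrative sketch of a single feature transformer (one per modality).
import torch
import torch.nn as nn

class FeatureTransformer(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, n_layers=2, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, features, caption_prefix):
        # features: (batch, time, d_model) position-encoded modality features, e.g. I3D
        # caption_prefix: (batch, caption_len) token ids of the caption generated so far
        memory = self.encoder(features)
        tgt = self.embed(caption_prefix)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        # the decoder attends to its own (masked) prefix and to the encoder output
        return self.decoder(tgt, memory, tgt_mask=causal)</code></pre>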
</div>
</div>
<!-- BibTex -->
<div class="section">
<div class="section_name"> Citation </div>
<div class="section_content">
<pre class="code_box background_beige_rounded"><code>@InProceedings{MDVC_Iashin_2020,
author = {Iashin, Vladimir and Rahtu, Esa},
title = {Multi-Modal Dense Video Captioning},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}</code></pre>
</div>
</div>
<div style="padding-top: 2em;" class="our_framework">
<div class="section_name"> Acknowledgements </div>
<div class="section_content">
<p class="p_on_project_pages">
Funding for this research was provided by the Academy of Finland projects 327910 and 324346.
The authors acknowledge CSC — IT Center for Science, Finland, for computational resources.
</p>
</div>
</div>
<div class="our_framework">
<div class="section_name">
Our New Work on This Topic
</div>
<div class="section_content" style="text-align: center;">
<div class="row">
<div class="column_left">
<a href="bmt.html">
<img onmouseover="this.src='images/enc_dec.png'" onmouseout="this.src='images/bmt_enc_dec_novel.png'"
src="images/bmt_enc_dec_novel.png" alt="Bi-modal Transformer encoder-decoder" />
</a>
</div>
<div class="column_right">
<p class="project_description">
<i><b>Vladimir Iashin</b> and Esa Rahtu.</i> <br>
<span style="color: #ad662b;">A Better Use of Audio-Visual Cues:</span> <br>
<span style="color: #ad662b;">Dense Video Captioning with Bi-modal Transformer.</span> <br>
In <i>British Machine Vision Conference</i> (BMVC), 2020
</p>
<ul class="proj_buttons">
<li class="proj_buttons">
<a href="bmt.html" class="proj_buttons"> Project Page </a>
</li>
<li class="proj_buttons">
<a href="https://github.com/v-iashin/BMT" class="proj_buttons"> Code </a>
</li>
<li class="proj_buttons">
<a href="https://arxiv.org/abs/2005.08271" class="proj_buttons"> Paper </a>
</li>
<li class="proj_buttons">
<a href="https://www.youtube.com/watch?v=C4zYVIqGDVQ" class="proj_buttons"> Presentation </a>
</li>
</ul>
</div>
</div>
</div>
</div>
<!-- Prefetching the hidden images for snappier toggles -->
<img src="./images/enc_dec.png" style="display: none;">
</body>
</html>