<!doctype html>
<html lang="en">
<head>
<title>V-IRL: Grounding Virtual Intelligence in Real Life</title>
<link rel="icon" type="image/x-icon" href="static/img/icons/earth_icon.png">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:url" content="https://virl-platform.github.io/" />
<meta property="og:image" content="https://virl-platform.github.io/static/img/preview.png" />
<meta property="og:title" content="V-IRL: Grounding Virtual Intelligence in Real Life" />
<meta property="og:description" content="An open-source framework for embodied agent and open-world computer vision research. Develop practical agents and test foundation models grounded with real street view imagery from around the world." />
<meta name="twitter:url" content="https://virl-platform.github.io/" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://virl-platform.github.io/static/img/preview.png" />
<meta name="twitter:title" content="V-IRL: Grounding Virtual Intelligence in Real Life" />
<meta name="twitter:description" content="An open-source framework for embodied agent and open-world computer vision research. Develop practical agents and test foundation models grounded with real street view imagery from around the world." />
<script src="./static/js/distill_template.v2.js"></script>
<script src="https://d3js.org/d3.v5.min.js"></script>
<script src="https://d3js.org/d3-collection.v1.min.js"></script>
<script src="https://rawgit.com/nstrayer/slid3r/master/dist/slid3r.js"></script>
<script defer="" src="./static/js/hider.js"></script>
<script src="./static/js/image_interact.js"></script>
<script src="./static/js/switch_videos.js"></script>
<link rel="stylesheet" href="./static/css/style.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/katex.min.css" integrity="sha384-yFRtMMDnQtDRO8rLpMIKrtPCD5jdktao2TV19YiZYWMDkUR5GQZR/NOVTdquEx1j" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/katex.min.js" integrity="sha384-9Nhn55MVVN0/4OFx7EE5kpFBPsEMZxKTCnA+4fqDmg12eCTqGi6+BB2LjY8brQxJ" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/katex@0.10.2/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<!-- medium zoom https://github.com/francoischalifour/medium-zoom -->
<script src="https://cdn.jsdelivr.net/npm/jquery@3.7.1/dist/jquery.min.js"></script> <!-- jquery -->
<script defer src="./static/js/medium-zoom.min.js"></script>
<script defer src="./static/js/zoom.js"></script>
</head>
<body>
<div class="header-wrapper">
<div class="header-container" id="header-container">
<div class="header-content">
<h1 style="margin-top: 0px">V-<i>IRL</i>: Grounding Virtual Intelligence in Real Life</h1>
<p style="color: #FFF7D4">
An open-source framework for
<em><strong style="color: #ffe099">embodied agent</strong></em>
and
<em><strong style="color: #ffe099">open-world computer vision</strong></em>
research.
Develop practical agents and test foundation models in virtual real-world cities across the globe, grounded with <em><strong>real</strong></em> geospatial data and street view imagery.
</p>
<div class="button-container">
<a href="https://arxiv.org/abs/2402.03310" class="button paper-link" target="_blank">
<span class="icon is-small">
<i class="ai ai-arxiv"></i>
</span>
arXiv
</a>
<a href="./static/V-IRL.pdf" class="button paper-link" target="_blank">
<span class="icon is-small">
<i class="fas fa-file-pdf"></i>
</span>
<span>pdf</span>
</a>
<a href="https://github.com/VIRL-Platform/VIRL" class="button" target="_blank">
<span class="icon is-small">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</div>
</div>
<div class="header-image">
<img src="static/img/teaser_img_v3.jpg" alt="Teaser Image" class="teaser-image">
</div>
</div>
</div>
<d-article>
<div class="byline">
<div class="byline-container">
<div class="byline-column">
<h3>Authors</h3>
<p><a href="https://jihanyang.github.io/" class="author-link" target="_blank">Jihan Yang</a> <sup>△</sup></p>
<p><a href="https://dingry.github.io/" class="author-link" target="_blank">Runyu Ding</a> <sup>△</sup></p>
<p><a href="https://ellisbrown.github.io/" class="author-link" target="_blank">Ellis Brown</a> <sup>▲</sup></p>
<p><a href="https://xjqi.github.io/" class="author-link" target="_blank">Xiaojuan Qi</a> <sup>△</sup></p>
<p><a href="https://www.sainingxie.com/" class="author-link" target="_blank">Saining Xie</a> <sup>▲</sup></p>
</div>
<div class="byline-column">
<h3>Affiliations</h3>
<p>
<sup>△</sup>
<a href="https://www.eee.hku.hk/" class="affiliation-link" target="_blank">University of Hong Kong</a>
</p>
<p>
<sup>▲</sup>
<a href="https://cs.nyu.edu/home/index.html" class="affiliation-link" target="_blank">New York University</a>
</p>
</div>
<div class="byline-column">
<h3>Date</h3>
<p>
Feb 5<sup>th</sup>, 2024
</p>
</div>
</div>
</div>
<div class="nav-bar" id="nav-bar">
<a class="nav-link" href="#top" style="opacity: 0.7">
<div style="margin: 8px 0px; text-align: center">
<span style="font-size: 30px;">🔝</span>
</div>
</a>
<hr style="display: block; margin: auto;">
<div class="geo-color">
<a class="nav-link" href="#geo">
<img class="virl-tag" src="static/img/tags/geo.png">
</a>
<a class="nav-link" href="#peng"><img src="static/img/avatars/courier.png" title="Peng: visiting student">
</a>
</div>
<hr style="display: block; margin: auto;">
<div class="llm-color">
<a class="nav-link" href="#language">
<img class="virl-tag" src="static/img/tags/lm.png">
</a>
<a class="nav-link" href="#aria"><img src="static/img/avatars/recommender.png" title="Aria: place recommender">
</a>
<a class="nav-link" href="#vivek"><img src="static/img/avatars/real_estate.png" title="Vivek: estate agent">
</a>
</div>
<hr style="display: block; margin: auto;">
<div class="cv-color">
<a class="nav-link" href="#vision">
<img class="virl-tag" src="static/img/tags/cv.png">
</a>
<a class="nav-link" href="#rx-399"><img src="static/img/avatars/robot.png" title="RX-399: urban assistance robot">
</a>
<a class="nav-link" href="#imani"><img src="static/img/avatars/urban_planner.png" title="Imani: urban planner">
</a>
<a class="nav-link" href="#hiro"><img src="static/img/avatars/explorer.png" title="Hiro: explorer">
</a>
</div>
<hr style="display: block; margin: auto;">
<div class="col-color">
<a class="nav-link" href="#collaboration">
<img class="virl-tag" src="static/img/tags/col.png">
</a>
<a class="nav-link" href="#ling">
<img src="static/img/avatars/tourist.png" title="Ling: tourist">
</a>
<a class="nav-link" href="#diego">
<img src="static/img/avatars/concierge.png" title="Diego: expert concierge">
</a>
</div>
<hr style="display: block; margin: auto;">
<div id="nav-bar-system">
<a class="nav-link" href="#system"><img src="static/img/icons/system.png" title="System fundamentals"></a>
</div>
<hr style="display: block; margin: auto;">
<div id="nav-bar-benchmark">
<a class="nav-link" href="#benchmark"><img src="static/img/icons/benchmark.png" title="V-IRL Benchmark"></a>
</div>
</div>
<div class="l-page video-container" style="margin-left: 4%; margin-bottom: 20px">
<iframe width="560" height="315" src="https://www.youtube.com/embed/F8OYtifxfe8?si=mgddGW5uih500O_m" title="YouTube video player" frameborder="0" allow="autoplay; encrypted-media; picture-in-picture" allowfullscreen></iframe>
<figcaption style="text-align: center">(Best viewed in 4K)</figcaption>
</div>
<p class="text abstract">
There's a massive gap between the text-centric digital environments of current AI agents and the sensory-rich world we humans inhabit.
To develop agents that can operate flexibly and reliably in real-world settings, we must bridge this gap and embody agents in an environment that <em>necessitates</em> the nuanced perceptual understanding required in the real world.
Naturally, this problem has long been studied in robotics, with agents embodied physically in the world; however, the physical constraints and cost of real hardware prohibit scaling up agents and testing them in diverse environments beyond the lab.
<br><br>
To address this challenge, we introduce <strong>V-<i>IRL</i></strong>,
a <em>scalable</em> platform enabling agents to interact
with a <em>virtual facsimile</em> of the real world.
Leveraging mapping, geospatial, and street view imagery APIs (see <a href="#system">§System Fundamentals</a>), V-<i>IRL</i>
embeds agents in real cities across the Earth.
To showcase the capabilities our platform enables, in <a href="#agent-exemplars">§Agent Exemplars</a>, we use V-<i>IRL</i> to instantiate a series of agents
that solve various practical tasks, grounded with its sensory-rich perceptual and descriptive data.
<br><br>
Our platform also functions as a vast testbed for measuring progress in
open-world computer vision and embodied AI with unprecedented scale and diversity—providing structured access to
<em>hundreds of billions of images</em> spanning the entire globe.
<d-footnote>
Google Street View alone had more than 220 billion images as of May 2022, and numerous other sources of imagery and data can be incorporated to enrich the environment.
<a href="https://blog.google/products/maps/street-view-15-new-features/" target="_blank">https://blog.google/products/maps/street-view-15-new-features/</a>
</d-footnote>
In <a href="#benchmark">§V-<i>IRL</i> Benchmark</a>, we use V-<i>IRL</i> to construct an initial benchmark of "open-world" vision models on a <em>truly open-world</em> distribution.
</p>
<div class="l-page teaser-video">
<video autoplay loop muted style="width:100%" preload="auto" playsinline>
<source src="static/video/teaser_gif.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<hr>
<div id="agent-exemplars" class="agent-block">
<h1 class="text">V-<i>IRL</i> Agent Exemplars</h1>
<p class="text">
To demonstrate the versatility of the <strong>V-<i>IRL</i></strong> platform, we use it to instantiate several exemplar agents virtually in real cities around the globe and engage them in various practical tasks.
For illustration, we give V-<i>IRL</i> agents character metadata, including an 8-bit avatar, a name, a short bio, and an intention they are trying to accomplish.
For a deeper dive into V-<i>IRL</i>'s components and the capabilities they enable, see <a href="#system">§System Fundamentals</a>.
</p><p class="text">
Each subsequent agent and its task are designed to reveal a new capability of the platform.
Throughout, tags and correspondingly colored sections highlight the specific V-<i>IRL</i> capabilities being employed:
</p>
<ul class="text">
<li><img class="virl-tag" src="static/img/tags/geo.png"> Action & Geolocation/Mapping capabilities: <a href="#geo" class="geo-color">§Earthbound Agents</a></li>
<li><img class="virl-tag" src="static/img/tags/lm.png"> Reasoning & Language Models: <a href="#language" class="llm-color">§Language-Driven Agents</a></li>
<li><img class="virl-tag" src="static/img/tags/cv.png"> Perception & Computer Vision: <a href="#vision" class="cv-color">§Visually Grounded Agents</a></li>
<li><img class="virl-tag" src="static/img/tags/col.png"> Agent-{Agent, Human} Collaboration: <a href="#collaboration" class="col-color">§Collaborative Agents</a></li>
</ul>
<p class="click-hint" style="width: 85%;"><strong><img src="static/img/icons/click.gif" style="width: 35px">
Hover over each avatar to see more info. Click to jump to its section.
</strong></p>
<div class="avatar-row figure">
<div class="avatar" onmouseover="showTakeaway('takeaway-peng')">
<a href="#peng">
<img src="static/img/avatars/courier.png" alt="Route optimizer Peng">
</a>
<figcaption>Peng</figcaption>
</div>
<div class="avatar">
<a href="#aria" onmouseover="showTakeaway('takeaway-aria')">
<img src="static/img/avatars/recommender.png" alt="Place recommender">
</a>
<figcaption>Aria</figcaption>
</div>
<div class="avatar">
<a href="#vivek" onmouseover="showTakeaway('takeaway-vivek')">
<img src="static/img/avatars/real_estate.png" alt="Estate recommender">
</a>
<figcaption>Vivek</figcaption>
</div>
<div class="avatar">
<a href="#rx-399" onmouseover="showTakeaway('takeaway-rx-399')">
<img src="static/img/avatars/robot.png" alt="Robot RX-399">
</a>
<figcaption>RX-399</figcaption>
</div>
<div class="avatar">
<a href="#imani" onmouseover="showTakeaway('takeaway-imani')">
<img src="static/img/avatars/urban_planner.png" alt="Urban planner">
</a>
<figcaption>Imani</figcaption>
</div>
<div class="avatar">
<a href="#hiro" onmouseover="showTakeaway('takeaway-hiro')">
<img src="static/img/avatars/explorer.png" alt="Intentional explorer">
</a>
<figcaption>Hiro</figcaption>
</div>
<div class="avatar">
<a href="#ling" onmouseover="showTakeaway('takeaway-ling')">
<img src="static/img/avatars/tourist.png" alt="Tourist">
</a>
<figcaption>Ling</figcaption>
</div>
<div class="avatar">
<a href="#diego" onmouseover="showTakeaway('takeaway-diego')">
<img src="static/img/avatars/concierge.png" alt="Concierge">
</a>
<figcaption>Diego</figcaption>
</div>
</div>
<div class="exemplar-takeaways">
<div class="takeaway-card" id="takeaway-peng">
<div class="takeaway-head">
<span>Peng: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> instantiates agents with real geospatial information, and enables useful tasks like route optimization.
</p>
</div>
<div class="takeaway-card" id="takeaway-aria">
<div class="takeaway-head">
<span>Aria: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> exposes rich real-world information to agents that they can use for real-world tasks.
</p>
</div>
<div class="takeaway-card" id="takeaway-vivek">
<div class="takeaway-head">
<span>Vivek: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
</div>
</div>
<p class="takeaway-content">
Grounded in geographic coordinates, V-<i>IRL</i> agents can leverage arbitrary real-world information via APIs.
</p>
</div>
<div class="takeaway-card" id="takeaway-rx-399">
<div class="takeaway-head">
<span>RX-399: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can use perceptual input to understand and interact with their environment.
</p>
</div>
<div class="takeaway-card" id="takeaway-imani">
<div class="takeaway-head">
<span>Imani: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> enables realistic open-world applications requiring vast geospatial and first-person visual information.
</p>
</div>
<div class="takeaway-card" id="takeaway-hiro">
<div class="takeaway-head">
<span>Hiro: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can utilize visual detectors, VLMs and LLMs to iteratively perceive, decide, and interact in the environment.
</p>
</div>
<div class="takeaway-card" id="takeaway-ling">
<div class="takeaway-head">
<span>Ling: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
<img src="static/img/tags/cv.png">
<img src="static/img/tags/col.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can collaborate to solve complex tasks that are beyond their individual expertise.
</p>
</div>
<div class="takeaway-card" id="takeaway-diego">
<div class="takeaway-head">
<span>Diego: Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
<img src="static/img/tags/cv.png">
<img src="static/img/tags/col.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can collaborate with users to solve complex tasks that require understanding the user's internal state.
</p>
</div>
</div>
</div>
<div id="geo">
<h2 class="text"><img class="virl-tag" src="static/img/tags/geo.png"><br>Earthbound Agents</h2>
<div class="l-screen grey-overlay"></div>
<p class="text">
Agents using the V-<i>IRL</i> platform inhabit virtual representations of real cities around the globe. At the core of this representation are <em>geographic coordinates</em> corresponding to points on the Earth's surface.
</p>
<div>
<d-figure>
<img src="static/img/Latitude_and_Longitude_of_the_Earth.svg" alt="Latitude and Longitude of the Earth" data-zoomable style="max-width: 100%;">
<figcaption style="text-align: center">Geographic Coordinates: Latitude and Longitude of the Earth
<d-footnote>Figure source: <a href="https://commons.wikimedia.org/wiki/File:Latitude_and_Longitude_of_the_Earth.svg" target="_blank">Wikimedia Commons</a></d-footnote>
</figcaption>
</d-figure>
</div>
<p class="text">
With these geographic coordinates as a link between digital media and the real world,
V-<i>IRL</i> agents <em>ground</em> themselves in the world using APIs for maps, <em>real</em> street view imagery, information about nearby destinations, and much more.
</p>
</div>
<div id="peng" class="agent-block geo">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/courier.png"> Peng: Visiting Student</h3>
<video class="auto-video l-page" muted autoplay preload="auto" playsinline>
<source src="static/video/story/peng.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
Peng needs to visit several locations throughout NYC to get documents signed for registration as a visiting student...
Leveraging Geolocation & Mapping capabilities, Peng saves 7 minutes by walking along the shortest path rather than visiting the waypoints in the order given.
</p>
<div>
<d-figure>
<figure>
<img data-zoomable src="static/img/courier.jpg" alt="Route optimizer figure">
<figcaption style="text-align: center">Finding the shortest path for Peng to travel to five places.</figcaption>
</figure>
</d-figure>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> instantiates agents with real geospatial information, and enables useful tasks like route optimization.
</p>
</div>
</div>
<div id="language">
<h2 class="text"><img class="virl-tag" src="static/img/tags/lm.png"><br>Language-Driven Agents</h2>
<div class="l-screen grey-overlay"></div>
<p class="text">
To tackle more complex tasks, we follow the pattern of language-driven agents <d-cite key="xi2023rise"></d-cite>. LLMs enable agents to reason, plan, and use external tools & APIs.
</p>
</div>
<!-- place recommender agent -->
<div id="aria" class="agent-block language">
<h3 class="text" style="margin-top: 40px;"><img style="width: 35px" src="static/img/avatars/recommender.png"> Aria: Place Recommender</h3>
<div class="l-screen grey-overlay"></div>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/aria.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
Aria searches for possible restaurants nearby.
She then uses GPT-4 to synthesize public reviews into final recommendations. As Peng is new to the city and originally from Sichuan, she recommends the spicy Chinese joint <em>Chow House 粤德轩</em> to give him a taste of home.
</p>
<!-- TODO: optimize mobile view -->
<div class="interactive-image">
<div id="image-text-container">
<p x="165" y="-12.5" text-anchor="left" style="font-weight: 700; font-size: 15px; font-family: sans-serif;">Click to check different candidate places:</p>
<div class="button-bar">
<!-- Buttons to interact with the image and text -->
<button onclick="changeContent('Place1')">Chow House</button>
<button onclick="changeContent('Place2')">Kwa Food Fried Skewers</button>
<button onclick="changeContent('Place3')">Tartinery Cafe</button>
<button onclick="changeContent('Place4')">Sushi Zo</button>
<button onclick="changeContent('Place5')">Dos Toros Taqueria</button>
</div>
<div class="image-review-container" style="margin-bottom: 0px;">
<img data-zoomable style="width: 55%;" src="static/img/place_recommend/place1.jpg" alt="place illustration">
<div class="text-box">
<p style="font-size: 14px; margin-bottom: 0px;"><strong>Example Place Review:</strong></p>
<blockquote class="place-review">Well done, Chow House. Authentic cuisine, expertly prepared. Twice Cooked pork, done the right way. You can see the leeks from the photo with none of that Americanized garbage many places dump into this otherwise elegant dish... Big shrimp. Big flavor. Peanuts. Initially, I was scared of the deep fried dried red pepper, but they turned out really tasty and crunchy, not as spicy as I had originally feared. Pork fried rice, expertly cooked. All in all, if you live downtown and like authentic Sichuan food, this is your place. Bravo! (rating: 5)</blockquote>
<blockquote class="place-review">Amazing Asian grill! Great taste, big variety, good prices... They have plenty of options vegan and non-vegan. The skewers are super affordable, fun to eat and taste great. Prices range from 50cents - 4$ per skewer, depending in what skewers, with most being around 2$. They deep fry the skewers right there for you and add salt and other spices (let them know if you don't like your food too salty). Highly recommended:).(rating: 5)</blockquote>
<blockquote class="place-review">I was so impressed with my brunch. I tried the French toast, my boyfriend got the Eggs Benedict and our friend got the burger. The quality of the food is good and it takes a short amount of time to receive it. The staff was sweet and helpful. I will definitely come back. (rating: 5)</blockquote>
<blockquote class="place-review">You must have to love sushi if you plan on dining here. It’s $299/person and omakase only... The sushi is flown in daily from Japan. It is definitely a dining experience for taste, texture and art of the making and presentation of each sushi. With a few glasses of wine and bottled water as well as tip was about $850. It’s quite expensive but I feel that the experience and quality of the food is worth it. It’s perfect for a special occasion. The sushi was the freshest sushi I’ve ever ate in my life. (rating: 5)</blockquote>
<blockquote class="place-review">Very sad burrito for the price point. Very small small for the price point as you can see it is no bigger than my friends arm. There were things missing from the burrito such as pico and guac. Each bite I took I regretted. As a vegan I would say it’s better for your wallet and your stomach to go to chipotle. (rating: 2)</blockquote>
</div>
</div>
<p style="text-align: left; font-size: 15px; margin-bottom: 0px"><strong>Agent Consideration:</strong></p>
<blockquote style="font-size:14px">Chow House is a highly recommended Sichuan restaurant, which aligns with Peng's background as he grew up in Sichuan. The restaurant offers authentic Sichuan food, which Peng might be familiar with and enjoy. The restaurant also has good seating, decoration, and friendly service, which would make for a pleasant dining experience. However, some dishes received mixed reviews, which is why the rating is not a perfect 10.</blockquote>
</div>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> exposes rich real-world information to agents that they can use for real-world tasks.
</p>
</div>
</div>
<!-- Real estate agent -->
<div id="vivek" class="agent-block language">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/real_estate.png"> Vivek: Estate Agent</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/vivek.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
Vivek uses real estate APIs to find potential apartments in Peng's desired regions and price range.
For each candidate, he researches its proximity to the places Peng cares about. Synthesizing these factors, Vivek provides a holistic rating and accompanying reasoning using GPT-4.
His top recommendation is a cost-effective one-bedroom apartment for $1986/mo, which is close to a supermarket, two bus stations, and a gym.
</p>
<div>
<d-figure>
<figure>
<img data-zoomable src="static/img/agent_estate.jpg" alt="Estate recommender">
<figcaption style="text-align: center">A subset of the candidate estates.</figcaption>
</figure>
</d-figure>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
</div>
</div>
<p class="takeaway-content">
Grounded in geographic coordinates, V-<i>IRL</i> agents can leverage arbitrary real-world information via APIs.
</p>
</div>
</div>
<!-- Visual agents -->
<div id="vision">
<h2 class="text"><img class="virl-tag" src="static/img/tags/cv.png"><br>Visually Grounded Agents</h2>
<div class="l-screen grey-overlay"></div>
<p class="text">
Although language-driven agents can address some real-world tasks using external tools, their sole reliance on text-based information limits their usefulness for tasks that require <em>visual grounding</em>.
In contrast, <em>real sensory input</em> is integral to many daily human activities—allowing a deep connection to and understanding of the
real world around us.
Agents can leverage street view imagery through the V-<i>IRL</i> platform to <em>visually ground</em> themselves in the real world—opening up a wide range of <em>perception-driven tasks</em>.
</p>
</div>
<!-- RX-399 -->
<div id="rx-399" class="agent-block vision">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/robot.png"> RX-399: Urban Assistance Robot</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/rx399.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
RX-399 navigates along pre-defined city routes, tagging all trash bins using its open-world detector and geolocation module, as depicted in the following figure and videos.
</p>
<div id="slider-img-rx399" class="slider-img-container">
<div class="my-slides">
<img data-zoomable src="static/img/rx-399/rx-399_clean_ny.jpg" style="width:100%">
</div>
<div class="my-slides">
<img data-zoomable src="static/img/rx-399/rx-399_clean_hk.jpg" style="width:100%">
</div>
<a class="prev" onclick="plusSlides('slider-img-rx399', -1)">❮</a>
<a class="next" onclick="plusSlides('slider-img-rx399', 1)">❯</a>
<figcaption id="caption" style="margin-bottom: 10px; text-align: center"></figcaption>
<div class="row">
<div class="column">
<img class="demo cursor" src="static/img/rx-399/rx-399_clean_ny.jpg" style="width:100%" onclick="currentSlide('slider-img-rx399', 1)" alt="Portions of RX-399's system records in New York City.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/rx-399/rx-399_clean_hk.jpg" style="width:100%" onclick="currentSlide('slider-img-rx399', 2)" alt="Portions of RX-399's system records in Hong Kong">
</div>
</div>
</div>
<div class="l-body">
<div id="RX399video1Container" class="video-container">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/rx-399_ny.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<div id="RX399video2Container" class="video-container" style="display:none;">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/rx-399_hk.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<!-- Preview Images in a Flex Container -->
<div class="preview-container">
<text x="165" y="-12.5" text-anchor="middle" style="font-weight: 700; font-size: 15px; font-family: sans-serif;">Switch between the NYC and HK recordings:</text>
<img id="RX399video1Preview" class="preview" src="static/img/previews/video_rx-399_preview.jpg" alt="Preview image of RX-399 NYC" onclick="switchVideo('RX399', 'video1Container', 'video1Preview')">
<img id="RX399video2Preview" class="preview" src="static/img/previews/video_rx-399_hk_preview.jpg" alt="Preview image of RX-399 HK" onclick="switchVideo('RX399','video2Container', 'video2Preview')">
</div>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can use perceptual input to understand and interact with their environment.
</p>
</div>
</div>
<!-- Urban Planner -->
<div id="imani" class="agent-block vision">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/urban_planner.png"> Imani: Urban Planner</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/imani.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
Imani defines routes spanning Central Park and the objects of interest for RX-399, which traverses the routes and records all detected instances.
After RX-399 finishes its routes, Imani analyzes the data RX-399 collected at different levels of detail.
</p>
<div class="img-magnifier-container">
<img data-zoomable id="urban_planner_img" style="width: 100%" src="static/img/urban_planner.jpg" alt="Urban Planner agent visualization">
<figcaption>Imani's visualization of trash bins, fire hydrants, and park benches in NYC's Central Park using data collected by RX-399. The coarsest level shows the general distributions of trash bins, hydrants, and benches in the park.
Imani can also zoom in to specific regions, where lighter colors represent positions with more unique instances identified.</figcaption>
</div>
<aside class="counting-table">
<figure style="width: 300px">
<table style="margin-bottom: 5px">
<tr>
<th style="font-size: 13px;">Category</th>
<th style="font-size: 13px;">Trash bin</th>
<th style="font-size: 13px;">Hydrant</th>
<th style="font-size: 13px;">Bench*</th>
</tr>
<tr>
<td style="font-size: 13px;">Count</td>
<td style="font-size: 13px;">1059</td>
<td style="font-size: 13px;">727</td>
<td style="font-size: 13px;">1015</td>
</tr>
</table>
<figcaption class="table-caption">
Table 1: RX-399's counting report. *Note: contiguous benches are counted as one instance.
</figcaption>
</figure>
</aside>
<div class="l-body">
<div id="UrbanPlannervideo1Container" class="video-container">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/urban_planner.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<div id="UrbanPlannervideo2Container" class="video-container" style="display:none;">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/urban_planner_play.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<!-- Preview Images in a Flex Container -->
<div class="preview-container">
<text x="165" y="-12.5" text-anchor="middle" style="font-weight: 700; font-size: 15px; font-family: sans-serif;">Switch between the data collection and heatmap distribution videos:</text>
<img id="UrbanPlannervideo1Preview" class="preview" src="static/img/previews/video_urban_plan_collect_preview.jpg" alt="Preview image of urban planner exploration" onclick="switchVideo('UrbanPlanner', 'video1Container', 'video1Preview')">
<img id="UrbanPlannervideo2Preview" class="preview" src="static/img/previews/video_urban_plan_play_preview.jpg" alt="Preview image of urban planner checking" onclick="switchVideo('UrbanPlanner','video2Container', 'video2Preview')">
</div>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> enables realistic open-world applications requiring vast geospatial and first-person visual information.
</p>
</div>
</div>
<!-- Intentional explorer -->
<div id="hiro" class="agent-block vision">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/explorer.png"> Hiro: Seasoned Traveler (Intentional Explorer)</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/hiro.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
Driven by his intention, Hiro uses open-world detection to find restaurants, VQA to choose which road to take, and place reviews with an LLM to decide whether a place suits his purpose.
</p>
<div id="slider-img-explorer" class="slider-img-container">
<div class="my-slides">
<div class="numbertext">Milestone 1 / 5</div>
<img src="static/img/intentional_explorer/intentional_explorer_split_1.jpg" style="width:100%">
</div>
<div class="my-slides">
<div class="numbertext">Milestone 2 / 5</div>
<img src="static/img/intentional_explorer/intentional_explorer_split_2.jpg" style="width:100%">
</div>
<div class="my-slides">
<div class="numbertext">Milestone 3 / 5</div>
<img src="static/img/intentional_explorer/intentional_explorer_split_3.jpg" style="width:100%">
</div>
<div class="my-slides">
<div class="numbertext">Milestone 4 / 5</div>
<img src="static/img/intentional_explorer/intentional_explorer_split_4.jpg" style="width:100%">
</div>
<div class="my-slides">
<div class="numbertext">Milestone 5 / 5</div>
<img src="static/img/intentional_explorer/intentional_explorer_split_5.jpg" style="width:100%">
</div>
<a class="prev" onclick="plusSlides('slider-img-explorer', -1, 'explorer-aside')">❮</a>
<a class="next" onclick="plusSlides('slider-img-explorer', 1, 'explorer-aside')">❯</a>
<figcaption id="caption" style="margin-bottom: 10px; text-align: center">Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.</figcaption>
<div class="row">
<div class="column">
<img class="demo cursor" src="static/img/intentional_explorer/intentional_explorer_split_1.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-explorer', 1, 'explorer-aside')" alt="Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/intentional_explorer/intentional_explorer_split_2.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-explorer', 2, 'explorer-aside')" alt="Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/intentional_explorer/intentional_explorer_split_3.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-explorer', 3, 'explorer-aside')" alt="Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/intentional_explorer/intentional_explorer_split_4.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-explorer', 4, 'explorer-aside')" alt="Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/intentional_explorer/intentional_explorer_split_5.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-explorer', 5, 'explorer-aside')" alt="Visualization of Hiro's lunch exploration in HK. The detailed procedure is shown in the following video.">
</div>
</div>
</div>
<aside class="explorer-aside">
Starting at the user-defined location <img src="static/img/icons/start_icon.jpg" class="inline-tag" draggable="false" style="height: 16px;">, Hiro walks down the street to find a place that can fulfil his intention: "<i>Hiro is hungry and looking for a place where he can explore great local food. He cannot handle spicy food.</i>"</aside>
<aside class="explorer-aside">
<p>When he reaches the first intersection, thanks to the interactive and sensor-rich environment, he adjusts his pose to fetch real street views for each possible path at the crossroads. He then passes these views to <em>VQA</em> and <em>decides</em> to turn left:</p>
<blockquote style="font-size: 12px">
<img src="static/img/icons/star.png" class="inline-tag" draggable="false" style="height: 16px;"> Road 1 has this homey, residential feel to it, which usually means cozy, family-run spots serving up the real-deal local food. It's got that quiet, laid-back dining scene that lets you really soak in the experience, take your time, and eat like one of the locals.
</blockquote>
</aside>
<aside class="explorer-aside">
<p>Then, after exploring for a block, he encounters the second intersection where he <em>looks around</em> and <em>decides</em> to turn right:</p>
<blockquote style="font-size: 12px">
<img src="static/img/icons/star.png" class="inline-tag" draggable="false" style="height: 16px;"> I'm leaning towards taking Road 2. It looks promising with all those signs pointing to eateries and local food joints. That's got to mean there's a good selection of local dishes to try out. And with more places to choose from, I bet I'll have a better shot at finding something that isn't too spicy. That's a big deal for me. Road 1 just doesn't seem to cut it; it's more of a residential vibe and doesn't really shout 'food' like Road 2 does.
</blockquote>
</aside>
<aside class="explorer-aside">
<p>After a few steps, Hiro finds <i>"A One Chinese Noodles 阿一豬扒酸辣米線"</i>
using his <em>open-world detector</em>, and looks up
its information and reviews using our <i>real-world environment</i>, which <i>connects street views to places</i>. Hiro <i>decides</i> to pass on it because:</p>
<blockquote style="font-size: 12px">
<img src="static/img/icons/star.png" class="inline-tag" draggable="false" style="height: 16px;"> Hmm, spicy food is a no-go for me, and this place seems to be all about pork chop noodles. That might be tricky with my dietary needs. I should probably keep looking for something that fits what I can eat.
</blockquote>
</aside>
<aside class="explorer-aside">
<p>
Finally, at the end of this street block <img src="static/img/icons/end_icon.jpg" class="inline-tag" draggable="false" style="height: 16px;">, Hiro discovers another lunch spot called <i>Xintianfa 新天發</i>. He decides to dine there after <em>reading</em> <em>online reviews</em> praising its authentic cuisine and diverse menu:
</p>
<blockquote>
Even though opinions vary, Xintianfa presents an array of local cuisine that beckons to my desire for authentic culinary experiences. My seasoned traveler's spirit thrives on novelty and the thrill of discovery, so a restaurant with a diverse menu naturally draws me in. Additionally, the absence of any emphasis on spicy fare is a relief, given my inability to tolerate heat in my meals.
</blockquote>
</aside>
<p class="video-container">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/intentional_explorer.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</p>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
<img src="static/img/tags/cv.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can utilize visual detectors, VLMs and LLMs to iteratively perceive, decide, and interact in the environment.
</p>
</div>
</div>
<!-- Collaborative agents -->
<div id="collaboration">
<h2 class="text">Collaborative Agents<br><img class="virl-tag" src="static/img/tags/col.png"></h2>
<div class="l-screen grey-overlay"></div>
<p class="text">
Humans often work together to solve complex real-world tasks. This collaboration promotes efficiency and effectiveness by decomposing a complex task into simpler sub-tasks, allowing each to be handled by an expert in its domain.
</p>
</div>
<div id="ling" class="agent-block collaboration">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/tourist.png"> Ling: Tourist</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/ling.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
After obtaining route descriptions from Locals, Ling starts her journey. Grounded in our embodied platform, Ling can adjust her pose and identify visual landmarks along the streets using open-world recognition and her map. Recognizing these landmarks helps GPT-4 make correct decisions about where to turn, move forward, and stop. Concrete examples are shown in the following figure and videos.
</p>
<div id="slider-img-tourist" class="slider-img-container">
<div class="my-slides">
<img data-zoomable src="static/img/tourist/tourist_nyc_1.jpg">
</div>
<div class="my-slides">
<img data-zoomable src="static/img/tourist/tourist_nyc_2.jpg">
</div>
<div class="my-slides">
<img data-zoomable src="static/img/tourist/tourist_sf.jpg">
</div>
<div class="my-slides">
<img data-zoomable src="static/img/tourist/tourist_hk.jpg">
</div>
<a class="prev" onclick="plusSlides('slider-img-tourist', -1, 'tourist-aside')">❮</a>
<a class="next" onclick="plusSlides('slider-img-tourist', 1, 'tourist-aside')">❯</a>
<figcaption id="caption" style="margin-bottom: 10px; text-align: center"></figcaption>
<div class="row">
<div class="column">
<img class="demo cursor" src="static/img/tourist/tourist_nyc_1.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-tourist', 1, 'tourist-aside')" alt="Ling and Local collaboration examples in New York City.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/tourist/tourist_nyc_2.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-tourist', 2, 'tourist-aside')" alt="Another Ling and Local collaboration example in New York City.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/tourist/tourist_sf.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-tourist', 3, 'tourist-aside')" alt="Ling and Local collaboration examples in San Francisco.">
</div>
<div class="column">
<img class="demo cursor" src="static/img/tourist/tourist_hk.jpg" style="height: 60px; width: auto" onclick="currentSlide('slider-img-tourist', 4, 'tourist-aside')" alt="Ling and Local collaboration examples in Hong Kong.">
</div>
</div>
</div>
<aside class="tourist-aside">
Ling successfully finds a nearby gift store by following the route description from a Local agent.
</aside>
<aside class="tourist-aside">
Ling successfully finds a good burger spot by following the route description from a Local agent.
</aside>
<aside class="tourist-aside">
Ling passes by the destination because only the wall of the Apple store is visible from her viewpoint. Fortunately, she can ask another Local agent nearby to start another round of navigation, which eventually leads her to the destination. Ling's first and second attempts are shown in red and green trajectories, respectively.
</aside>
<aside class="tourist-aside">
Ling mistakes another restaurant for her destination on her first attempt. She can then ask another Local agent nearby to start another round of navigation, which eventually leads her to the destination. Ling's first and second attempts are shown in red and green trajectories, respectively.
</aside>
<div class="l-body">
<div id="Touristvideo1Container" class="video-container">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/tourist_sf.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<div id="Touristvideo2Container" class="video-container" style="display:none;">
<video class="video-music" controls preload="metadata" playsinline>
<source src="static/video/tourist_hk.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>
<!-- Preview Images in a Flex Container -->
<div class="preview-container">
<text x="165" y="-12.5" text-anchor="middle" style="font-weight: 700; font-size: 15px; font-family: sans-serif;">Switch videos between SF and HK journeys:</text>
<img id="Touristvideo1Preview" class="preview" src="static/img/previews/video_toursit_sf_preview.jpg" alt="Preview image of tourist-local SF" onclick="switchVideo('Tourist', 'video1Container', 'video1Preview')">
<img id="Touristvideo2Preview" class="preview" src="static/img/previews/video_tourist_hk_preview.jpg" alt="Preview image of tourist-local HK" onclick="switchVideo('Tourist','video2Container', 'video2Preview')">
</div>
</div>
<div class="takeaway-card">
<div class="takeaway-head">
<span>Takeaway</span>
<div class="takeaway-tags">
<img src="static/img/tags/geo.png">
<img src="static/img/tags/lm.png">
<img src="static/img/tags/cv.png">
<img src="static/img/tags/col.png">
</div>
</div>
<p class="takeaway-content">
V-<i>IRL</i> agents can collaborate to solve complex tasks that are beyond their individual expertise.
</p>
</div>
</div>
<div id="diego" class="agent-block collaboration">
<div class="l-screen grey-overlay"></div>
<h3 class="text"><img style="width: 35px" src="static/img/avatars/concierge.png"> Diego: Expert Concierge</h3>
<video class="auto-video" muted autoplay preload="auto" playsinline>
<source src="static/video/story/diego.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<p class="text">
As depicted in the following figure, Diego's itinerary is tailored to your needs. Diego not only considers your physical and mental interoception status and your budget for each activity, but also anticipates how your status and costs will change as you follow each event.
He takes into account <em>real</em> travel times from the V-<i>IRL</i> platform and selects suitable dining options by collaborating with a separate
restaurant recommendation agent.
</p>
<d-figure class="l-page">
<figure>
<video id="diego-plan-video" playsinline autoplay loop muted>
<source src="static/video/interactive_concierge_gif.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
<figcaption id="diego-plan-video-cap"><em>The Perfect Day Itinerary</em>: Crafted by Diego, our iterative concierge agent, this schedule is meticulously tailored, accounting for your mental and physical well-being and budget variations as your day unfolds.</figcaption>
</figure>
</d-figure>
<p class="text">
You can intervene in Diego's planning process by adjusting your interoception status or giving Diego verbal feedback.
In response, Diego promptly revises his original plan to accommodate your demands and re-estimates your state changes after the revision (see the following figures).
</p>
<div>
<img src="static/img/diego_intervent/diego_origin.jpg" class="diego-revise-img">
</div>
<div id="slider-img-diego-revise-state" class="slider-img-container">
<div class="my-slides">
<img src="static/img/diego_intervent/adjust_state_1.jpg" class="diego-revise-img">
</div>
<div class="my-slides">
<img src="static/img/diego_intervent/adjust_state_2.jpg" class="diego-revise-img">
</div>
<div class="my-slides">
<img src="static/img/diego_intervent/adjust_state_3.jpg" class="diego-revise-img">
</div>
<a class="prev" id="prev-diego-revise-1" onclick="plusSlides('slider-img-diego-revise-state', -1)">❮</a>
<a class="next" id="next-diego-revise-2" onclick="plusSlides('slider-img-diego-revise-state', 1)">❯</a>
</div>
<div id="slider-img-diego-revise-verbal" class="slider-img-container">
<div class="my-slides">
<img src="static/img/diego_intervent/verbal_1.jpg" class="diego-revise-img">
</div>
<div class="my-slides">
<img src="static/img/diego_intervent/verbal_2.jpg" class="diego-revise-img">
</div>
<div class="my-slides">
<img src="static/img/diego_intervent/verbal_3.jpg" class="diego-revise-img">
</div>
<a class="prev" id="prev-diego-revise-2" onclick="plusSlides('slider-img-diego-revise-verbal', -1)">❮</a>
<a class="next" id="next-diego-revise-2" onclick="plusSlides('slider-img-diego-revise-verbal', 1)">❯</a>
</div>
<figcaption style="width: 100%; text-align: center; margin-bottom: 20px;">Diego adapts his original plan in response to the user's intervention.</figcaption>
<p class="text">
Behind Diego's proficiency in developing itineraries is his iterative planning pipeline.
The process begins with Diego creating an initial draft plan for the first activity using <i>GPT-4</i>, taking into account the user's biography, requirements, and previous activities in working memory. This draft is then refined by <i>hierarchical coordination</i> (real geospatial and place information), <i>interoceptive estimation</i> (activity cost and its effect on human states), and a <i>supervisor</i> (human interoception, budget, and potential intervention).
</p>