Researchers! On 19 December 2024, a preprint was published that focuses on "evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation." The 4DS-j model presented there achieves significantly better monocular depth estimation results than DINOv2 ViT-g, making it a better backbone than DINOv2 for specialised video depth estimation models, which in turn can be the basis for better 2D to 3D video conversion! Please try the 4DS-j backbone instead of DINOv2 ViT-g in your future breakthrough video depth estimation models; a minimal backbone-swap sketch is shown right below, and a special ranking showing the capabilities of 4DS-j follows:
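If you want to experiment with such a backbone swap, here is a minimal sketch, assuming a frozen backbone with a small trainable depth head on top. Only the DINOv2 ViT-g/14 loading uses the real, public torch.hub entry point; `DepthProbe`, `build_dinov2_probe` and `build_4ds_probe` are illustrative names I made up, and the 4DS-j loader is left as a placeholder to be filled in from the official release (this is not code from the preprint).

```python
# Minimal sketch of a backbone-swap experiment: a frozen ViT backbone with a small
# trainable depth head. Assumptions are marked; this is NOT code from the 4DS paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthProbe(nn.Module):
    """Frozen backbone + lightweight convolutional head that regresses a dense depth map.

    The backbone is expected to expose a DINOv2-style
    `forward_features(x)["x_norm_patchtokens"]` of shape (B, N, C).
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, patch_size: int = 14):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():  # freeze the backbone; only the head is trained
            p.requires_grad = False
        self.patch_size = patch_size
        self.head = nn.Sequential(  # illustrative head, not the probe used in the paper
            nn.Conv2d(feat_dim, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape  # H and W must be multiples of the patch size
        gh, gw = h // self.patch_size, w // self.patch_size
        with torch.no_grad():
            tokens = self.backbone.forward_features(x)["x_norm_patchtokens"]  # (B, N, C)
        feats = tokens.transpose(1, 2).reshape(b, -1, gh, gw)  # (B, C, gh, gw)
        depth = self.head(feats)  # (B, 1, gh, gw)
        return F.interpolate(depth, size=(h, w), mode="bilinear", align_corners=False)


def build_dinov2_probe() -> DepthProbe:
    # DINOv2 ViT-g/14 via its public torch.hub entry point (embed dim 1536, patch size 14).
    vitg = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
    return DepthProbe(vitg, feat_dim=1536, patch_size=14)


def build_4ds_probe() -> DepthProbe:
    # Placeholder: load the 4DS-j checkpoint from its official release, wrap it so it
    # exposes the same patch-token interface as above, then reuse DepthProbe unchanged.
    raise NotImplementedError("Plug in the 4DS-j backbone from the official release here.")
```

The point of the shared `DepthProbe` wrapper is that swapping backbones then only means swapping the loader and the feature dimension, while the depth head and training loop stay the same.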
Due to the number of recently released models, which I am unable to add to the rankings immediately, I have decided to add a waiting list of new models:
- ScanNet (170 frames): TAE<=2.2
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): OPW<=0.04
- ScanNet++ (98 video clips with 32 frames each): TAE
- NYU-Depth V2: OPW<=0.37
- Bonn RGB-D Dynamic (5 video clips with 110 frames each): AbsRel<=0.078
- NYU-Depth V2: AbsRel<=0.045 (relative depth)
- NYU-Depth V2: AbsRel<=0.051 (metric depth)
- NYU-Depth V2 (640×480): AbsRel<=0.058 (old layout, no longer up to date)
- UnrealStereo4K (3840×2160): AbsRel<=0.04 (old layout, no longer up to date)
- Appendix 1: Rules for qualifying models for the rankings (to do)
- Appendix 2: Metrics selection for the rankings (to do)
- Appendix 3: List of all research papers from the above rankings
📝 Note: There are no quantitative comparison results for StereoCrafter yet, so this ranking is based on my own perceptual judgement of the qualitative comparisons shown in Figure 7: one output frame (right view) is compared with one input frame (left view) from the video clip 22_dogskateboarder, and one output frame (right view) is compared with one input frame (left view) from the video clip scooter-black.
RK | Model Links: Venue Repository | Rank ↓ (human perceptual judgment)
---|---|---
1 | StereoCrafter | 1
2-3 | Immersity AI | 2-3
2-3 | Owl3D | 2-3
4 | Deep3D | 4
RK | Model Links: Venue Repository | OPW ↓ {Input fr.} BA
---|---|---
1 | Buffer Anytime (DA V2) | 0.028 {MF}
2 | DepthCrafter | 0.029 {MF}
3 | ChronoDepth | 0.035 {MF}
RK | Model Links: Venue Repository | TAE ↓ {Input fr.} DAV
---|---|---
1 | Depth Any Video | 2.1 {MF}
2 | DepthCrafter | 2.2 {MF}
3 | ChronoDepth | 2.3 {MF}
4 | NVDS | 3.7 {4}
RK | Model Links: Venue Repository | OPW ↓ {Input fr.} FD | OPW ↓ {Input fr.} NVDS+ | OPW ↓ {Input fr.} NVDS
---|---|---|---|---
1 | FutureDepth | 0.303 {4} | - | -
2 | NVDS+ | - | 0.339 {4} | -
3 | NVDS | 0.364 {4} | - | 0.364 {4}
📝 Note: This ranking is based on data from Table 4. The example result 3:0:2 (first cell from the left in the first row) means that Depth Pro has a better F-score than UniDepth-V on 3 datasets, the same F-score on no dataset, and a worse F-score on 2 datasets.
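For clarity, the better:equal:worse triple can be computed with a few lines of Python. This helper and the scores in the example are illustrative placeholders, not code or results from the paper.

```python
def wtl_triple(scores_a: list[float], scores_b: list[float]) -> str:
    """Count datasets where model A has a better, equal, or worse F-score than model B."""
    better = sum(a > b for a, b in zip(scores_a, scores_b))
    equal = sum(a == b for a, b in zip(scores_a, scores_b))
    worse = sum(a < b for a, b in zip(scores_a, scores_b))
    return f"{better}:{equal}:{worse}"


# Dummy per-dataset F-scores (placeholders only): A wins on 3, ties on 0, loses on 2.
print(wtl_triple([0.9, 0.8, 0.7, 0.6, 0.5], [0.8, 0.7, 0.6, 0.7, 0.6]))  # -> "3:0:2"
```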
📝 Note: This ranking will temporarily not be updated (see Figure 4).
RK | Model Links: Venue Repository | AbsRel ↓ {Input fr.} MonST3R | AbsRel ↓ {Input fr.} DC
---|---|---|---
1 | MonST3R | 0.063 {MF} | -
2 | DepthCrafter | 0.075 {MF} | 0.075 {MF}
3 | Depth Anything | - | 0.078 {1}
RK | Model Links: Venue Repository | AbsRel ↓ {Input fr.} M3D v2 | AbsRel ↓ {Input fr.} GRIN
---|---|---|---
1 | Metric3D v2 ViT-giant | 0.045 {1} | -
2 | GRIN_FT_NI | - | 0.051 {1}
RK | Model | AbsRel ↓ {Input fr.} | Training dataset | Official repository | Practical model | VapourSynth
---|---|---|---|---|---|---
1 | ZoeDepth +PFR=128 ENH: | 0.0388 {1} | ENH: UnrealStereo4K | ENH: | - | -
2025:
Method | Abbr. | Paper | Venue (Alt link) | Official repository
---|---|---|---|---
Video Depth Anything | VDA | Video Depth Anything: Consistent Depth Estimation for Super-Long Videos | |
2024 and older: