2021 July

POSTECH-CVLab · Aug 3, 2021 · d79a570
1 parent cca7fe0
commit d79a570
Showing 72 changed files with 1,961 additions and 1 deletion.
diff --git a/Archive/2021/07/README.md b/Archive/2021/07/README.md
@@ -0,0 +1,67 @@
+# 2021 July
+## Full List of papers
+
+* Baking Neural Radiance Fields for Real-Time View Synthesis ([Link](./summary/1.md))
+* NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis ([Link](./summary/2.md))
+* A Random CNN Sees Objects: One Inductive Bias of CNN and Its Applications ([Link](./summary/3.md))
+* pixelNeRF: Neural Radiance Fields from One or Few Images ([Link](./summary/4.md))
+* Hypercorrelation Squeeze for Few-Shot Segmentation ([Link](./summary/5.md))
+* Dense Contrastive Learning for Self-Supervised Visual Pre-Training ([Link](./summary/6.md))
+* Unsupervised Learning of Dense Visual Representations ([Link](./summary/7.md))
+* Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation ([Link](./summary/8.md))
+* Few-shot Image Generation via Cross-domain Correspondence ([Link](./summary/9.md))
+* EfficientDet: Scalable and Efficient Object Detection ([Link](./summary/10.md))
+* Depth-supervised NeRF: Fewer Views and Faster Training for Free ([Link](./summary/11.md))
+* Occupancy Networks - Learning 3D Reconstruction in Function Space ([Link](./summary/12.md))
+* Rethinking and Improving the Robustness of Image Style Transfer ([Link1](./summary/13.md), [Link2](./summary/33.md))
+* RepVGG: Making VGG-style ConvNets Great Again ([Link](./summary/14.md))
+* ViTGAN: Training GANs with Vision Transformers ([Link](./summary/15.md))
+* GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields ([Link1](./summary/16.md), [Link2](./summary/25.md), [Link3](./summary/67.md))
+* KiloNeRF ([Link](./summary/17.md))
+* TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up ([Link](./summary/18.md))
+* Per-Pixel Classification is Not All You Need for Semantic Segmentation ([Link](./summary/19.md))
+* Residual Networks Behave Like Ensembles of Relatively Shallow Networks ([Link](./summary/20.md))
+* D-NeRF: Neural Radiance Fields for Dynamic Scenes ([Link](./summary/21.md))
+* SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud based Place Recognition ([Link](./summary/22.md))
+* MCL-GAN: Generative Adversarial Networks with Multiple Specialized Discriminators ([Link](./summary/23.md))
+* NeRF--: Neural Radiance Fields Without Known Camera Parameters ([Link](./summary/24.md))
+* Stereo Radiance Fields ([Link](./summary/26.md))
+* Weakly-Supervised Physically Unconstrained Gaze Estimation ([Link](./summary/27.md))
+* Convolutional Occupancy Networks ([Link](./summary/28.md))
+* Generative Multi-Adversarial Networks ([Link](./summary/29.md))
+* GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis ([Link1](./summary/30.md), [Link2](./summary/31.md))
+* Editing Conditional Radiance Fields ([Link1](./summary/32.md), [Link2](./summary/42.md))
+* Swapping Autoencoder for Deep Image Manipulation ([Link](./summary/35.md))
+* DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network ([Link](./summary/36.md))
+* Robust Neural Routing Through Space Partitions for Camera Relocalization in Dynamic Indoor Environments ([Link](./summary/37.md))
+* Channel Pruning for Accelerating Very Deep Neural Networks ([Link](./summary/38.md))
+* CorrNet3D: Unsupervised End-to-end Learning of Dense Correspondence for 3D Point Clouds ([Link](./summary/39.md))
+* On the Continuity of Rotation Representations in Neural Networks ([Link](./summary/40.md))
+* Anycost GANs for Interactive Image Synthesis and Editing ([Link](./summary/41.md))
+* Indoor Visual Localization with Dense Matching and View Synthesis ([Link](./summary/43.md))
+* NeRF Research Directions ([Link](./summary/44.md))
+* Taming Transformers for High-Resolution Image Synthesis ([Link](./summary/45.md))
+* ShaRF: Shape-conditioned Radiance Fields from a Single View ([Link](./summary/46.md))
+* Learning Deep Features for Discriminative Localization ([Link](./summary/47.md))
+* cGANs with Auxiliary Discriminative Classifier ([Link](./summary/48.md))
+* Nerfies: Deformable Neural Radiance Fields ([Link](./summary/49.md))
+* Correlated Input-Dependent Label Noise in Large-scale Image Classification ([Link](./summary/50.md))
+* Playable Video Generation ([Link](./summary/51.md))
+* Few-shot Image Generation via Cross-domain Correspondence ([Link](./summary/52.md))
+* A Simple Framework for Contrastive Learning of Visual Representations ([Link](./summary/53.md))
+* On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation ([Link](./summary/54.md))
+* MinkLoc++: Lidar and Monocular Image Fusion for Place Recognition ([Link](./summary/55.md))
+* KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control ([Link](./summary/56.md))
+* High-Fidelity Neural Human Motion Transfer from Monocular Video ([Link](./summary/57.md))
+* Big Self-Supervised Models are Strong Semi-Supervised Learners ([Link](./summary/58.md))
+* NeuralRecon: Real-Time Coherent 3D Reconstruction From Monocular Video ([Link](./summary/59.md))
+* Quantifying Attention Flow in Transformers ([Link](./summary/60.md))
+* Repurposing GANs for One-Shot Semantic Part Segmentation ([Link](./summary/61.md))
+* PIC-Net: Point Cloud and Image Collaboration Network for Large-Scale Place Recognition ([Link](./summary/62.md))
+* Repurposing GANs for One-shot Semantic Part Segmentation ([Link](./summary/63.md))
+* Training Generative Adversarial Networks in One Stage ([Link](./summary/64.md))
+* Transformer Interpretability Beyond Attention Visualization ([Link](./summary/65.md))
+* GNeRF: GAN-based Neural Radiance Field without Posed Camera ([Link](./summary/66.md))
+* Exploring Simple Siamese Representation Learning ([Link](./summary/68.md))
+* Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers ([Link](./summary/69.md))
+* NeX: Real-time View Synthesis with Neural Basis Expansion ([Link](./summary/70.md))
diff --git a/Archive/2021/07/summary/1.md b/Archive/2021/07/summary/1.md
@@ -0,0 +1,11 @@
+# Baking Neural Radiance Fields for Real-Time View Synthesis 
+### Peter Hedman et al., Google Research
+#### Summarized by Jinoh Cho
+
+* At inference, the original NeRF has to pass each ray through the MLP hundreds of times (once per point sampled along the ray) to produce one pixel of the novel-view image, so rendering a single frame takes roughly one minute. This makes it unusable for real-time applications such as VR and AR, and this paper sets out to fix that.
+* The paper proposes three changes. First, the view-dependent effect, one of NeRF's main strengths, is computed once per ray rather than at every sample point along the ray. Second, whereas NeRF computes opacity and color for each point at inference time, here they are precomputed and stored to cut rendering time; the paper also proposes a way to store the values compactly as 8-bit integers. Third, because rendering time and storage size depend on the sparsity of the opacity, an opacity regularization loss is added.
+* Looking at the first change in more detail: the original NeRF takes each sample point and the viewing direction as input and outputs opacity and color, so a large MLP has to be evaluated hundreds of times per ray at render time. The proposed method instead takes only the coordinates as input, without the viewing direction, and outputs density, color, and a feature vector; these values are precomputed, stored in a dedicated data structure, and accumulated along each ray. At render time, the accumulated feature vector and the ray's viewing direction are passed through an MLP just once per ray to restore the view-dependent appearance. This is what makes the large difference.
+* I have not yet fully understood the SNeRG data structure, so that part will be updated later.
+
+
+
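The per-ray deferred shading described in the summary above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: `composite_ray`, `toy_view_mlp`, and `render_pixel` are hypothetical names, the fixed step size is an assumption, and the "MLP" is a toy stand-in that only shows where the single view-dependent call happens.

```python
import math

def composite_ray(densities, diffuse_rgbs, features, step):
    """Alpha-composite per-sample diffuse colour and feature vectors along one ray."""
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    feat = [0.0] * len(features[0])
    for sigma, colour, feature in zip(densities, diffuse_rgbs, features):
        alpha = 1.0 - math.exp(-sigma * step)      # opacity of this sample
        weight = transmittance * alpha             # contribution to the pixel
        rgb = [a + weight * b for a, b in zip(rgb, colour)]
        feat = [a + weight * b for a, b in zip(feat, feature)]
        transmittance *= 1.0 - alpha
    return rgb, feat

def toy_view_mlp(feat, view_dir):
    """Hypothetical stand-in for the small view-dependent network (assumption)."""
    s = sum(feat) * max(0.0, view_dir[2])
    return [0.1 * s] * 3

def render_pixel(densities, diffuse_rgbs, features, view_dir, step=0.1):
    """Accumulate first, then evaluate the view-dependent term once per ray."""
    rgb, feat = composite_ray(densities, diffuse_rgbs, features, step)
    specular = toy_view_mlp(feat, view_dir)        # one call per ray, not per sample
    return [d + s for d, s in zip(rgb, specular)]
```

The point of the design is that the expensive per-sample quantities need no viewing direction, so they can be baked ahead of time, while the view-dependent part runs only once per pixel.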
diff --git a/Archive/2021/07/summary/10.md b/Archive/2021/07/summary/10.md
@@ -0,0 +1,23 @@
+# EfficientDet: Scalable and Efficient Object Detection
+### Mingxing Tan et al., Google Research (Brain Team) - CVPR2020
+#### Summarized by Seungwook Kim
+
+```
+Task: Object Detection (Evaluated on Semantic Segmentation as well)
+
+Main novelty/Contribution: Weighted BiFPN to fuse multi-scale features from level 3 to 7 (vs. FPN, PANet, ...)
+--> vs. FPN: uses bidirectional pathways (top-down + bottom-up)
+--> vs. PANet: repeated blocks, more efficient, learnable weights per fusion
+
+Side contribution: Scalable models ranging from small (D0) to large (D7). Accurate results with good memory/latency tradeoff.
+ 
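+Its novelty does not depart far from EfficientNet; like EfficientNet, it proposes compound scaling.
+Unlike a classification model, a detection model also needs a bounding-box prediction head, so there are more places to apply compound scaling, but this does not seem like a particularly new contribution.
+ 
+I am not sure whether this is a common direction these days, but the paper changes the FPN-derived feature fusion method to be weighted and bidirectional, which improves accuracy, and it fuses layers selectively rather than fusing all of them with each other, which also saves memory and time.
+ 
+In short, the paper's final contribution is a scalable (memory/latency) and accurate detection model; applied to semantic segmentation without major changes to the architecture, it also achieved SoTA results at the time.
+ 
+I do not follow recent detection research closely, but the paper should be understandable after reading only EfficientNet, so it seems approachable even for newcomers to detection.
+The writing is not perfect, but it is easy to read.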
+```
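The "learnable weights per fusion" mentioned in the summary above can be illustrated with a small sketch. It assumes the normalised-weight form (each weight passed through ReLU, then divided by the sum plus a small epsilon); the function name, the flat-list feature maps, and the `eps` value are choices made for this example only.

```python
def fast_normalized_fusion(feature_maps, raw_weights, eps=1e-4):
    """Fuse same-shaped feature maps using non-negative, normalised weights."""
    weights = [max(0.0, w) for w in raw_weights]   # ReLU keeps each weight >= 0
    total = sum(weights) + eps                     # eps avoids division by zero
    fused = [0.0] * len(feature_maps[0])
    for w, fmap in zip(weights, feature_maps):
        for i, value in enumerate(fmap):
            fused[i] += (w / total) * value
    return fused

# Two inputs of one fusion node, already resized to the same resolution.
top_down = [1.0, 1.0]
same_level = [3.0, 3.0]
fused = fast_normalized_fusion([top_down, same_level], [1.0, 1.0])
```

With equal weights the result is close to the plain average; during training the raw weights would be learned so that each fusion node can favour the more useful input.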
diff --git a/Archive/2021/07/summary/11.md b/Archive/2021/07/summary/11.md
@@ -0,0 +1,13 @@
+# Depth-supervised NeRF: Fewer Views and Faster Training for Free
+### Kangle Deng, Andrew Liu, Jun-Yan Zhu, Deva Ramanan - arXiv
+#### Summarized by Jinoh Cho
+
+```
+NeRF can generate high-quality images when it is given enough input images annotated with accurate camera poses. Without enough images, however, it is very hard to train. To train from fewer images and to speed up NeRF optimization, this paper introduces a depth supervision loss that can be obtained almost for free.
+ 
+Why is it free? NeRF training data needs accurately annotated images, so an SfM module (COLMAP) is typically used to obtain accurate camera poses. The same SfM module also produces sparse 3D points that can be used for depth supervision. They therefore come at almost no extra cost as a by-product of generating the training data, which is why the supervision is called free.
+ 
+The ground-truth distance from each 3D keypoint to the camera can be computed with the projection matrix, and depth is rendered in the same way that NeRF renders color. An L2 loss between the two provides the additional depth supervision.
+ 
+While reading the paper, I was not convinced why depth supervision should help learn a continuous radiance field, but the experiments do appear to show a considerable improvement. Some of the equations also look slightly off; I may have misunderstood them, and I am discussing this with Yoonwoo, a fellow student.
+```
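As a rough sketch of the depth supervision described in the summary above: depth is rendered with the same compositing weights NeRF uses for color and compared with the keypoint depths through a squared error. This follows the summary's L2 description rather than the paper's exact formulation, and `render_depth` and `depth_loss` are hypothetical names.

```python
import math

def render_depth(densities, t_values):
    """Expected termination distance of one ray from NeRF-style compositing weights."""
    transmittance = 1.0
    depth = 0.0
    for i, sigma in enumerate(densities):
        # Distance to the next sample; the last interval is treated as very long.
        delta = t_values[i + 1] - t_values[i] if i + 1 < len(t_values) else 1e10
        alpha = 1.0 - math.exp(-sigma * delta)
        depth += transmittance * alpha * t_values[i]
        transmittance *= 1.0 - alpha
    return depth

def depth_loss(rendered_depths, keypoint_depths):
    """Mean squared error over the rays that pass through SfM keypoints."""
    pairs = list(zip(rendered_depths, keypoint_depths))
    return sum((r - k) ** 2 for r, k in pairs) / len(pairs)

# A ray whose density is concentrated at t = 2.0 renders a depth of about 2.0.
d = render_depth([0.0, 1e9, 0.0], [1.0, 2.0, 3.0])
loss = depth_loss([d], [2.0])
```

Only rays that hit a reconstructed keypoint have a target, so in practice this term would be added to the usual color loss on a sparse subset of rays.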
diff --git a/Archive/2021/07/summary/12.md b/Archive/2021/07/summary/12.md
@@ -0,0 +1,15 @@
+# Occupancy Networks - Learning 3D Reconstruction in Function Space
+### Author information
+#### Summarized by Jinoh Cho
+
+```
+Before this paper, there was no representation of 3D geometry that was both computationally and memory efficient. With a voxel representation, memory complexity grows cubically as the resolution increases. Point clouds and meshes are likewise limited in the number of points or vertices.
+
+This paper therefore proposes a new representation that describes 3D geometry through a continuous 3D mapping.
+
+The method is very simple. Given 3D sample points and the input used for reconstruction (an input image, a low-resolution voxel grid, a noisy point cloud, etc.), a network is trained to decide whether points sampled at arbitrary locations in 3D space are occupied.
+ 
+Whether each point in 3D space is occupied or unoccupied can be supervised straightforwardly with a cross-entropy loss.
+
+For the i-th input image (x_i) in a batch, K points are sampled and denoted p_ij, j = 1, ..., K, and the true occupancy at point p_ij is denoted o_ij. The training loss can then be designed as in the paper.
+```
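The loss described in the summary above can be written down in a few lines. This sketch assumes the mini-batch form in which binary cross-entropy is summed over the K points of each input and averaged over the batch; the network itself is omitted, so `pred_probs[i][j]` stands for its predicted occupancy probability at p_ij given x_i. The function name and the clipping constant are assumptions.

```python
import math

def occupancy_loss(pred_probs, true_occ, eps=1e-7):
    """Sum of per-point binary cross-entropy, averaged over the batch."""
    total = 0.0
    for probs_i, occ_i in zip(pred_probs, true_occ):   # loop over inputs x_i
        for p, o in zip(probs_i, occ_i):               # loop over points p_ij
            p = min(max(p, eps), 1.0 - eps)            # clip to avoid log(0)
            total += -(o * math.log(p) + (1 - o) * math.log(1.0 - p))
    return total / len(pred_probs)

# Batch of 2 inputs with K = 3 sampled points each.
loss = occupancy_loss([[0.9, 0.2, 0.7], [0.1, 0.4, 0.8]],
                      [[1, 0, 1], [0, 0, 1]])
```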
diff --git a/Archive/2021/07/summary/13.md b/Archive/2021/07/summary/13.md
@@ -0,0 +1,9 @@
+# Rethinking and Improving the Robustness of Image Style Transfer
+### Pei Wang et al., UC San Diego - CVPR 2021 [best paper candidate]
+#### Summarized by Woohyeon Sim
+
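+* **The paper identifies why ResNet performs worse than VGG at style transfer and fixes it.** As a result, stylization quality of ResNet-family architectures improves substantially and becomes comparable to VGG, not only with pre-trained weights but also with random weights, showing that the feature representation also matters for style transfer. The proposed method is also compatible with other style transfer losses.
+* **Cause analysis 1 (invalid, X): ResNet features lack robustness**, so style transfer works poorly ⇒ It is true that stylization quality rises as robustness increases, but VGG works well even with random weights, so robustness is not the direct cause.
+* **Cause analysis 2 (valid, O): the Gram matrix computed from ResNet has low entropy, so the style patterns learned to match it inevitably have low diversity** ⇒ Empirically, with residual connections both the feature maps and the Gram matrices are observed to have small entropy (large peaks). Training to match them can therefore be dominated by particular style patterns or become highly sensitive to outliers. Matching styles can also be explained from a distillation viewpoint: in distillation, hard targets converge more slowly and give worse results than soft targets.
+* **Solution - Stylization With Activation smoothinG (SWAG).** From the distillation viewpoint of cause analysis 2, soft targets are better than hard targets, so the paper proposes smoothing peaky activations to turn them into soft targets. Concretely, it simply applies a softmax to the features when computing the loss. This reportedly shrinks large values and enlarges small ones, which increases entropy. Note that no softmax layer is inserted into the network, so its representation power is unchanged. Other smoothing techniques, such as a nested softmax or multiplying by 0.1, perform similarly, so the authors chose the simple softmax.
+* **Overall**: In the experiments, simply switching to the proposed loss gives large improvements across all networks and training methods. However, neither the analysis nor the proposed method contains much that is new, so it is unclear to me why this was a best paper candidate.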
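To make the "softmax only inside the loss" idea from the summary above concrete, here is a minimal sketch of a Gram-matrix style loss computed on smoothed activations. It is an illustration under assumptions, not the paper's code: the softmax is taken over all activations of a layer (the axis is a guess), features are lists of channels, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def smooth(features):
    """Apply softmax over all activations of a layer (assumed axis), keep the shape."""
    flat = softmax([v for channel in features for v in channel])
    n = len(features[0])
    return [flat[i * n:(i + 1) * n] for i in range(len(features))]

def gram(features):
    """Channel-by-channel inner products; features is a list of C flat channels."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]

def smoothed_style_loss(feat_generated, feat_style):
    """Squared Gram-matrix distance on softmax-smoothed activations."""
    g1, g2 = gram(smooth(feat_generated)), gram(smooth(feat_style))
    return sum((a - b) ** 2 for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2))
```

The network that produces `feat_generated` and `feat_style` is left untouched; only the quantities fed to the loss are smoothed, which matches the summary's remark that representation power is preserved.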
diff --git a/Archive/2021/07/summary/14.md b/Archive/2021/07/summary/14.md
@@ -0,0 +1,46 @@
+# RepVGG: Making VGG-style ConvNets Great Again
+### Xiaohan Ding et al., BNRist, Tsinghua U, HKUST - CVPR 2021
+#### Summarized by Seungwook Kim
+
+```
+Task: Image Classification (downstream task: Semantic segmentation)
+
+Main novelty/Contribution: 
+VGG-style networks have a simpler design.
+Because they consist only of 3x3 convs and ReLU operators,
+their FLOPs may be relatively high, but once memory access cost (MAC) and parallelism are taken into account (Winograd convolution), they can actually run faster.
+ 
+Many recent networks, however, include residual connections and, to match channel dims for them, 1x1 convolutions.
+Such multi-branch designs clearly help accuracy, but at inference they are slower and have lower memory utilization than networks made only of plain 3x3 convs.
+ 
+This paper gets both the accuracy benefit of multi-branch training and the inference-time speed benefit of a plain convolutional network.
+It does so with structural re-parameterization.
+ 
+At train time, the network is actually trained with residual connections, 1x1 convolutions, and batch normalization,
+and at inference time the learned weights of the 1x1 convolutions, batch normalization, etc. are converted through structural re-parameterization
+into a network that produces the same outputs but consists only of 3x3 convs and ReLU.
+(The figure in the paper makes this easy to understand.)
+ 
+It outperforms models of similar or larger size, and the most notable point is its speed.
+A plain convolutional network can be fast despite relatively high FLOPs thanks to the acceleration that recent hardware supports.
+(This acceleration is reportedly hard to exploit for multi-branch networks.)
+It also performs well on the downstream task of semantic segmentation.
+
+```
+
+Overall:
+
+
+
+* FLOPs are not the most important factor for speed (MAC and degree of parallelism also matter)
+
+
+
+
+* A network trained with 1x1 convs, BN, and residual connections can easily be converted
+	by structural re-parameterization into a network built only from 3x3 convs
+	(same accuracy, faster) --> code is publicly available
+
+* However, in settings with limited hardware acceleration, such as mobile, there may be little gain
+
+* Most impressive is that a variety of multi-branch architectures can be converted, at inference time only, into a simple convolution-based network that can be accelerated
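The conversion described in the summary above can be checked numerically on a toy case. The sketch below is a single-channel simplification in plain Python (the real networks are multi-channel): each branch's BatchNorm is folded into its kernel, the 1x1 and identity branches are padded to 3x3, and the three kernels are summed. All function names and the `(gamma, beta, mean, var, eps)` tuple layout are assumptions made for this example.

```python
import math

def fuse_bn(kernel, bn):
    """Fold an inference-time BatchNorm into a 3x3 kernel; returns (kernel, bias)."""
    gamma, beta, mean, var, eps = bn
    scale = gamma / math.sqrt(var + eps)
    return [[w * scale for w in row] for row in kernel], beta - mean * scale

def pad_1x1(w):
    """Embed a 1x1 weight at the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]]

def reparameterize(k3, bn3, w1, bn1, bn_id):
    """Merge 3x3, 1x1 and identity branches (each followed by BN) into one conv."""
    branches = [fuse_bn(k3, bn3), fuse_bn(pad_1x1(w1), bn1), fuse_bn(pad_1x1(1.0), bn_id)]
    kernel = [[sum(k[r][c] for k, _ in branches) for c in range(3)] for r in range(3)]
    return kernel, sum(b for _, b in branches)

def conv3x3(image, kernel, bias=0.0):
    """Naive single-channel 3x3 convolution with zero padding of 1."""
    h, w = len(image), len(image[0])
    out = [[bias] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for r in range(3):
                for c in range(3):
                    yy, xx = y + r - 1, x + c - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[r][c] * image[yy][xx]
    return out

def bn_apply(feature, bn):
    """Inference-time BatchNorm with fixed statistics."""
    gamma, beta, mean, var, eps = bn
    return [[gamma * (v - mean) / math.sqrt(var + eps) + beta for v in row] for row in feature]

def multi_branch(image, k3, bn3, w1, bn1, bn_id):
    """Training-time form: three parallel branches summed (before the ReLU)."""
    a = bn_apply(conv3x3(image, k3), bn3)
    b = bn_apply([[w1 * v for v in row] for row in image], bn1)
    c = bn_apply(image, bn_id)
    return [[p + q + s for p, q, s in zip(ra, rb, rc)] for ra, rb, rc in zip(a, b, c)]
```

Calling `reparameterize` and then `conv3x3` with the returned kernel and bias should reproduce the output of `multi_branch` up to floating-point error, which is the sense in which the two networks are equivalent.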
diff --git a/Archive/2021/07/summary/15.md b/Archive/2021/07/summary/15.md
@@ -0,0 +1,13 @@
+# ViTGAN: Training GANs with Vision Transformers
+### Author Information
+#### Summarized by Minguk Kang
+
+```
+This paper comes from Google Brain. It designs the generator by combining ViT with an implicit neural representation, and it appears to be the best-performing Transformer-based GAN so far.
+ 
+More specifically, in the ViTGAN generator the latent variable z is transformed by a mapping network f into a style vector w, which is injected through self-modulation before self-attention. This idea is influenced by StyleGAN. Finally, a Fourier embedding is used to produce patch vectors, which are then reshaped. Unlike the earlier TransGAN, this architecture has no upsampling and is a pure Transformer-based GAN, which seems meaningful.
+ 
+Performance was measured using 50,000 images and is reported to beat the existing DiffAugGAN, but considering that DiffAugGAN was evaluated using 10,000 images, the comparison is not fair. Also, there are no ImageNet results, so it seems premature to say that Transformers have beaten CNN-based GANs.
+ 
+Personally, I think the paper is clearly written and easy to read, but it has many limitations.
+```
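The self-modulation step mentioned in the summary above can be sketched as a layer normalization whose scale and shift are computed from the style vector w. This is a minimal illustration under assumptions: the two projections are plain linear maps, a single token is processed, and every name here is hypothetical rather than taken from the paper's code.

```python
import math

def linear(matrix, vec):
    """Matrix-vector product; each row of the matrix yields one output element."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def self_modulated_layernorm(h, w, gamma_proj, beta_proj, eps=1e-5):
    """Normalise token h, then scale and shift it with values derived from w."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    std = math.sqrt(var + eps)
    gamma = linear(gamma_proj, w)   # per-dimension scale from the style vector
    beta = linear(beta_proj, w)     # per-dimension shift from the style vector
    return [g * (x - mean) / std + b for g, x, b in zip(gamma, h, beta)]

# One 3-dimensional token modulated by a 1-dimensional style vector.
out = self_modulated_layernorm([1.0, 2.0, 3.0], [1.0],
                               [[1.0], [1.0], [1.0]], [[0.0], [0.0], [0.0]])
```

With a scale of one and a shift of zero this reduces to ordinary layer normalization; changing w changes the scale and shift of every token, which is how the style vector would influence the generator.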