With the widespread use of virtual reality applications, 3D scene generation has become a challenging new research frontier. 3D scenes have highly complex structures, and generation must ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularization, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting framework for crossmodal scene generation, which creates diverse, high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes via incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that imposes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Finally, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
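To make the depth regularization idea concrete, below is a minimal numpy sketch of a multi-level depth loss of the kind the abstract describes: an accuracy term against a monocular depth prior plus a smoothness term, evaluated over a pyramid of scales. The function name, the subsampling pyramid, and the weighting `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def multi_level_depth_reg(pred, prior, levels=3, lam=0.1):
    """Illustrative multi-level depth regularization (a sketch, not
    BloomScene's exact loss).

    pred, prior: 2D arrays of rendered depth and monocular depth prior.
    At each pyramid level (here: simple strided subsampling) we combine
    an L1 accuracy term against the prior with a first-order smoothness
    term on the predicted depth.
    """
    total = 0.0
    for lvl in range(levels):
        s = 2 ** lvl
        p = pred[::s, ::s]
        q = prior[::s, ::s]
        # Depth accuracy: agreement with the depth prior at this scale.
        accuracy = np.abs(p - q).mean()
        # Smoothness: penalize large first-order depth gradients.
        smooth = (np.abs(np.diff(p, axis=0)).mean()
                  + np.abs(np.diff(p, axis=1)).mean())
        total += accuracy + lam * smooth
    return total / levels
```

A perfectly flat depth map that matches its prior incurs zero loss, while noisy or inconsistent depth is penalized at every scale, which is the multi-level accuracy-plus-smoothness behavior the abstract attributes to the regularization mechanism.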