HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections

Internet image collections containing photos captured by crowds of photographers show promise for enabling digital exploration of large-scale tourist landmarks. However, prior works focus primarily on geometric reconstruction and visualization, neglecting the key role of language in providing a semantic interface for navigation and fine-grained understanding. In constrained 3D domains, recent methods have leveraged vision-and-language models as a strong prior of 2D visual semantics. While these models display an excellent understanding of broad visual semantics, they struggle with unconstrained photo collections depicting such tourist landmarks, as they lack expert knowledge of the architectural domain. In this work, we present a localization system that connects neural representations of scenes depicting large-scale landmarks with text describing a semantic region within the scene, by harnessing the power of SOTA vision-and-language models with adaptations for understanding landmark scene semantics. To bolster such models with fine-grained knowledge, we leverage large-scale Internet data containing images of similar landmarks along with weakly-related textual information. Our approach is built upon the premise that images physically grounded in space can provide a powerful supervision signal for localizing new concepts, whose semantics may be unlocked from Internet textual metadata with large language models. We use correspondences between views of scenes to bootstrap spatial understanding of these semantics, providing guidance for 3D-compatible segmentation that ultimately lifts to a volumetric scene representation. Our results show that HaLo-NeRF can accurately localize a variety of semantic concepts related to architectural landmarks, surpassing the results of other 3D models as well as strong 2D segmentation baselines. Our project page is at https://tau-vailab.github.io/HaLo-NeRF/.


Introduction
Our world is filled with incredible buildings and monuments that contain a rich variety of architectural details. Such intricately designed human structures have attracted the interest of tourists and scholars alike. Consider, for instance, the Notre-Dame Cathedral pictured above. This monument is visited annually by over 10 million people from all around the world. While Notre-Dame's facade is impressive at a glance, its complex architecture and history contain details which the untrained eye may miss. Its structure includes features such as portals, towers, and columns, as well as more esoteric items like the rose window and tympanum. Tourists often avail themselves of guidebooks or knowledgeable tour guides in order to fully appreciate the grand architecture and history of such landmarks. But what if it were possible to explore and understand such sites without needing to hire a tour guide or even to physically travel to the location?
The emergence of neural radiance fields presents new possibilities for creating and exploring virtual worlds that contain such large-scale monuments, without the potential burden of traveling. Prior work, including NeRF-W [MBRS*21] and Ha-NeRF [CZL*22], has demonstrated that photo-realistic images with independent control of viewpoint and illumination can be readily rendered from unstructured imagery for sites such as the Notre-Dame Cathedral. However, these neural techniques lack the high-level semantics embodied within the scene; such semantic understanding is crucial for exploring a new place, as it is for the traveling tourist.
Recent progress in language-driven 3D scene understanding has leveraged strong two-dimensional priors provided by modern vision-and-language (V&L) representations [HCJW22, CLW*22, CGT*22, KMS22, KKG*23]. However, while existing pretrained vision-and-language models (VLMs) show broad semantic understanding, architectural images use a specialized vocabulary of terms (such as the minaret and rose window depicted in Figure 1) that is not well encapsulated by these models out of the box. Therefore, we propose an approach for performing semantic adaptation of VLMs by leveraging Internet collections of landmark images and textual metadata. Inter-view coverage of a scene provides richer information than collections of unrelated imagery, as observed in prior work utilizing collections capturing physically grounded in-the-wild images [WZHS20, IMK20, WAESS21]. Our key insight is that modern foundation models allow for extracting a powerful supervision signal from multi-modal data depicting large-scale tourist scenes.
To unlock the relevant semantic categories from the noisy Internet textual metadata accompanying images, we leverage the rich knowledge of large language models (LLMs). We then localize this image-level semantic understanding to pixel-level probabilities by leveraging the 3D-consistent nature of our image data. In particular, by bootstrapping with inter-view image correspondences, we fine-tune an image segmentation model both to learn these specific concepts and to localize them reliably within scenes, providing a 3D-compatible segmentation.
We demonstrate the applicability of our approach for connecting low-level neural representations depicting such real-world tourist landmarks with higher-level semantic understanding. Specifically, we present a text-driven localization technique that is supervised on our image segmentation maps, which augments the recently proposed Ha-NeRF neural representation [CZL*22] with a localization head that predicts volumetric probabilities for a target text prompt. By presenting the user with a visual halo marking the region of interest, our approach provides an intuitive interface for interacting with virtual 3D environments depicting architectural landmarks. HaLo-NeRF (Ha-NeRF + Localization halo) allows the user to "zoom in" to the region containing the text prompt and view it from various viewpoints and across different appearances, yielding a substantially more engaging experience compared to today's common practice of browsing thumbnails returned by an image search.
To quantitatively evaluate our method, we introduce HolyScenes, a new benchmark dataset composed of six places of worship annotated with ground-truth segmentations for multiple semantic concepts. We evaluate our approach qualitatively and quantitatively, including comparisons to existing 2D and 3D techniques. Our results show that HaLo-NeRF allows for localizing a wide array of elements belonging to structures reconstructed in the wild, capturing the unique semantics of our use case and significantly surpassing the performance of alternative methods.
Explicitly stated, our key contributions are:
• A novel approach for performing semantic adaptation of VLMs which leverages inter-view coverage of scenes in multiple modalities (namely textual metadata and geometric correspondences between views) to bootstrap spatial understanding of domain-specific semantics;
• A system enabling text-driven 3D localization of large-scale scenes captured in-the-wild;
• Results over diverse scenes and semantic regions, and a benchmark dataset for rigorously evaluating the performance of our system, as well as facilitating future work linking Internet collections with semantic understanding.

Text-driven image segmentation. Methods such as LSeg and CLIPSeg [LE22], which we compare to in Section 5, aim for general open-vocabulary image segmentation and can achieve impressive performance over a broad set of visual concepts. However, they lack expert knowledge specific to culturally significant architecture (as we show in our comparisons). In this work, we incorporate domain-specific knowledge to adapt an image segmentation model conditioned on free text to our setting; we do this by leveraging weak image-level text supervision and pixel-level supervision obtained from multi-view correspondences. Additionally, we later lift this semantic understanding to volumetric probabilities over a neural representation of the scene.
Language-grounded scene understanding and exploration. Prior works in this area generally assume strong supervision from existing semantically annotated 3D data, consisting of common standalone objects. By contrast, we tackle the challenging real-world scenario of a photo collection in the wild, aiming to localize semantic regions in large-scale scenes while lacking annotated ground-truth 3D segmentation data for training. To overcome this lack of strong ground-truth data, our method distills both semantic and spatial information from large-scale Internet image collections with textual metadata, and fuses this knowledge together into a neural volumetric field.
The problem of visualizing and exploring large-scale 3D scenes depicting tourist landmarks captured in-the-wild has been explored by several prior works predating the current deep-learning-dominated era [SSS06, SGSS08, RMBB*13]. Exactly a decade ago, Russell et al. [RMBB*13] proposed 3D Wikipedia for annotating isolated 3D reconstructions of famous tourist sites using reference text via image-text co-occurrence statistics. Our work, in contrast, does not assume access to text describing the landmarks of interest and instead leverages weakly-related textual information of similar landmarks. More recently, Wu et al. [WAESS21] also addressed the problem of connecting 3D-augmented Internet image collections to semantics. However, like most prior work, they focused on learning a small set of predefined semantic categories, associated with isolated points in space. By contrast, we operate in the more challenging setting of open-vocabulary semantic understanding, aiming to associate these semantics with volumetric probabilities.
NeRF-based semantic representations. Recent research efforts have aimed to augment neural radiance fields (NeRF) [MST*20] with semantic information for segmentation and editing [TZFR23]. One approach is to add a classification branch that assigns each pixel a semantic label, complementing the color branch of a vanilla NeRF [ZLLD21, KGY*22, SPB*22, FZC*22]. A general drawback of these categorical methods is the confinement of the segmentation to a pre-determined set of classes.
To enable open-vocabulary segmentation, an alternative approach predicts an entire feature vector for each 3D point [TLLV22, KMS22, FWJ*22, KKG*23]; these feature vectors can then be probed with the embedding of a semantic query such as free text or an image patch. While these techniques allow for more flexibility than categorical methods, they perform an ambitious task (regressing high-dimensional feature vectors in 3D space) and are usually demonstrated in controlled capture settings (e.g. with images of constant illumination).
To reduce the complexity of 3D localization for unconstrained large-scale scenes captured in the wild, we adopt a hybrid approach. Specifically, our semantic neural field is optimized over a single text prompt at a time, rather than learning general semantic features which could match arbitrary queries. This enables open-vocabulary segmentation, significantly outperforming alternative methods in our setting.

Method
An overview of the proposed system is presented in Figure 2. Our goal is to perform text-driven neural 3D localization for landmark scenes captured by collections of Internet photos. In other words, given this collection of images and a text prompt describing a semantic concept in the scene (for example, windows or spires), we would like to know where it is located in 3D space. These images are in the wild, meaning that they may be taken in different seasons, times of day, viewpoints, and distances from the landmark, and may include transient occlusions.

Figure 3: The full image metadata (Input), including the filename, caption, and WikiCategories, is used for extracting distilled semantic pseudo-labels (Output) with an LLM. Note that the associated images (depicted with corresponding colors) are not used as inputs for the computation of their pseudo-labels.
In order to localize unique architectural features of landmarks in 3D space, we leverage the power of modern foundation models for visual and textual understanding. Despite progress in general multimodal understanding, modern VLMs struggle to localize fine-grained semantic concepts on architectural landmarks, as we show extensively in our results. The architectural domain uses a specialized vocabulary, with terms such as pediment and tympanum being rare in general usage; furthermore, terms such as portal may have a particular domain-specific meaning in architecture (referring primarily to doors) in contrast to their general usage (meaning any kind of opening).
To address these challenges, we design a three-stage system: the offline stages of LLM-based semantic concept distillation (Section 3.1) and semantic adaptation of VLMs (Section 3.2), followed by the online stage of 3D localization (Section 3.3). In the offline stages of our method, we learn relevant semantic concepts using textual metadata as guidance by distilling it via an LLM, and subsequently locate these concepts in space by leveraging inter-view correspondences. The resulting fine-tuned image segmentation model is then used in the online stage to supervise the learning of volumetric probabilities, associating regions in 3D space with the probability of depicting the target text prompt.

Training Data
The training data for learning the unique semantics of such landmarks is provided by the WikiScenes dataset [WAESS21], consisting of images capturing nearly one hundred cathedrals. We augment these with images capturing 734 mosques, using their data scraping procedure*. We also remove all landmarks used in our HolyScenes benchmark (described in Section 4) from this training data to prevent data leakage. The rich data captured in both textual and visual modalities in this dataset, along with large-scale coverage of a diverse set of scenes, provides the needed supervision for our system.

LLM-Based Semantic Concept Distillation
In order to associate images with relevant semantic categories for training, we use their accompanying textual metadata as weak supervision. As seen in Figure 3, this metadata is highly informative but also noisy, often containing many irrelevant details as well as having diverse formatting and multilingual contents. Prior work has shown that such data can be distilled into categorical labels that provide a supervision signal [WAESS21]; however, this loses the long tail of uncommon and esoteric categories which we are interested in capturing. Therefore, we leverage the power of instruction-tuned large language models (LLMs) for distilling concise, open-ended semantic pseudo-labels from image metadata using an instruction alone (i.e. zero-shot, with no ground-truth supervision). In particular, we use the encoder-decoder LLM Flan-T5 [CHL*22], which performs well on tasks requiring short answers and is publicly available (allowing for reproducibility of our results). To construct a prompt for this model, we concatenate together the image's filename, caption, and WikiCategories (i.e., a hierarchy of named categories provided in Wikimedia Commons) into a single description string; we prepend this description with the instruction: "What architectural feature of ⟨BUILDING⟩ is described in the following image? Write "unknown" if it is not specified." In this prompt template, the building's name is inserted in ⟨BUILDING⟩ (e.g. Cologne Cathedral). We then generate a pseudo-label using beam search decoding, and lightly process these outputs with standard textual cleanup techniques. Out of ∼101K images with metadata in our train split of WikiScenes, this produces ∼58K items with non-empty pseudo-labels (those passing filtering heuristics), consisting of 4,031 unique values. Details on text generation settings, textual cleanup heuristics, and further statistics on the distribution of pseudo-labels are provided in the supplementary material.
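As a concrete illustration, the prompt construction and output cleanup described above can be sketched as follows. The prompt wording follows the text; the cleanup heuristics and helper names are illustrative rather than the exact rules used (the generated prompt would be passed to Flan-T5 with beam search decoding, e.g. via a `generate` call with `num_beams` set).

```python
def build_prompt(building, filename, caption, wiki_categories):
    """Concatenate image metadata into the zero-shot instruction prompt."""
    description = " ".join([filename, caption] + list(wiki_categories))
    return (
        f"What architectural feature of {building} is described in the "
        'following image? Write "unknown" if it is not specified. '
        + description
    )

def clean_pseudo_label(generated):
    """Light textual cleanup of the LLM output (illustrative heuristics):
    normalize case, trim punctuation, and drop 'unknown' / empty answers."""
    label = generated.strip().strip(".").lower()
    return "" if label in ("", "unknown") else label
```

Images whose cleaned pseudo-label comes back empty are simply excluded from the ∼58K retained training items.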
Qualitatively, we observe that this procedure succeeds in producing concise English pseudo-labels regardless of distractor details and multilingual data. This matches the excellent performance of LLMs such as Flan-T5 on similar tasks such as text summarization and translation. Several examples of the metadata and our generated pseudo-labels are provided in Figure 3, and a quantitative analysis of pseudo-label quality is given in our ablation study (Section 5.4).

Semantic Adaptation of V&L Models
After assigning textual pseudo-labels to training images as described in Section 3.1, we use them as supervision for cross-modal understanding, learning image-level and pixel-level semantics. As we show below in Section 5, existing V&L models lack the requisite domain knowledge out of the box, struggling to understand architectural terms or to localize them in images depicting large portions of buildings. We therefore adapt pretrained models to our setting, using image-pseudolabel pairs to learn image-level semantics and weak supervision from pairwise image correspondences to bootstrap pixel-level semantic understanding. We outline the training procedures of these models here; see the supplementary material for further details.
To learn image-level semantics of the unique architectural concepts in our images, we fine-tune the popular foundation model CLIP [RKH*21], a dual encoder model pretrained with a contrastive text-image matching objective. This model encodes images and texts in a shared semantic space, with cross-modal similarity reflected by the cosine distance between embeddings. Although CLIP has impressive zero-shot performance on many classification and retrieval tasks, it may be fine-tuned on text-image pairs to adapt it to particular semantic domains. We fine-tune with the standard contrastive learning objective using our pairs of pseudo-labels and images, and denote the resulting refined model by CLIP FT. In addition to being used for further stages in our VLM adaptation pipeline, CLIP FT serves to retrieve relevant terminology for users who may not be familiar with architectural terms, as we show in our evaluations (Section 5.3).
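For reference, the standard symmetric contrastive (InfoNCE-style) objective used for such fine-tuning can be sketched as follows, given a batch of matched image and pseudo-label embeddings. This is a minimal NumPy sketch under simplified assumptions (fixed temperature, one positive per row); the actual training configuration is detailed in the supplementary material.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over scaled cosine similarities; the i-th
    image matches the i-th pseudo-label, so targets lie on the diagonal."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = np.arange(len(logits))               # matching pairs on diagonal

    def xent(l):
        # stable log-softmax per row, then negative log-likelihood of diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each image embedding toward its pseudo-label embedding while pushing it away from the other labels in the batch.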
To apply our textual pseudo-labels and image-level semantics to concept localization, we build on the recent segmentation model CLIPSeg [LE22], which allows for zero-shot text-conditioned image segmentation. CLIPSeg uses image and text features from a CLIP backbone along with additional fusion layers in an added decoder component; trained on text-supervised segmentation data, it shows impressive open-vocabulary understanding on general text prompts. While pretrained CLIPSeg fails to adequately understand architectural concepts or to localize them (as we show in Section 5.4), it shows a basic understanding of some concepts along with a tendency to attend to salient objects (as we further illustrate in the supplementary material), which we exploit to bootstrap understanding in our setting.
Our key observation is that large and complex images are composed of subregions with different semantics (e.g. the region around a window or portal of a building), and pretrained CLIPSeg predictions on these zoomed-in regions are closer to the ground truth than predictions on the entire building facade. To find such pairs of zoomed-out and zoomed-in images, we use two types of geometric connections: multi-view geometric correspondences (i.e. between images) and image crops (i.e. within images). Using these paired images and our pseudo-label data, we use predictions on zoomed-in views as supervision to refine segmentation on zoomed-out views.
For training across multiple images, we use a feature matching model [SSW*21] to find robust geometric correspondences between image pairs, and CLIP FT to select pairs where the semantic concept (given by a pseudo-label) is more salient in the zoomed-in view relative to the zoomed-out view; for training within the same image, we use CLIP FT to select relevant crops. We use pretrained CLIPSeg to segment the salient region in the zoomed-in or cropped image, and then fine-tune CLIPSeg to produce this result in the relevant image when zoomed out; we denote the resulting trained model by CLIPSeg FT. During training we freeze CLIPSeg's encoders, training its decoder module alone with loss functions optimizing for the following:

Geometric correspondence supervision losses. As described above, we use predictions on zoomed-in images to supervise segmentation of zoomed-out views. We thus define loss terms Lcorresp and Lcrop, the cross-entropy losses of these predictions calculated on the region with supervision targets, for correspondence-based and crop-based data respectively. In other words, Lcorresp encourages predictions on zoomed-out images to match predictions on corresponding zoomed-in views, as seen in Figure 4; Lcrop is similar but uses predictions on a crop of the zoomed-out view rather than a distinct image with a corresponding zoomed-in view.
Multi-resolution consistency. To encourage consistent predictions across resolutions and to encourage our model to attend to relevant details in all areas of the image, we use a multi-resolution consistency loss Lconsistency calculated as follows. Selecting a random crop of an image from the correspondence-based dataset, we calculate the cross-entropy loss between our model's prediction cropped to this region and CLIPSeg (pretrained, without fine-tuning) applied within this cropped region. To attend to more relevant crops, we pick the random crop by sampling two crops from the given image and using the one with higher CLIP FT similarity to the textual pseudo-label.
Regularization. We add the regularization loss Lreg, calculated as the average binary entropy of our model's outputs. This encourages confident outputs (probabilities close to 0 or 1).
These losses are summed together with equal weighting; further training settings, hyperparameters, and data augmentation are detailed in the supplementary material.
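A minimal sketch of the combined objective, with all four terms computed over probability maps, is shown below. Function and argument names are illustrative; the exact masking of supervised regions and the data pipeline are detailed in the supplementary material.

```python
import numpy as np

EPS = 1e-6  # numerical floor to keep logarithms finite

def bce(pred, target):
    """Pixel-wise binary cross-entropy, averaged over the supervised region."""
    pred = np.clip(pred, EPS, 1 - EPS)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def binary_entropy(pred):
    """Average binary entropy of predictions; low when outputs are confident."""
    pred = np.clip(pred, EPS, 1 - EPS)
    return -(pred * np.log(pred) + (1 - pred) * np.log(1 - pred)).mean()

def total_loss(pred_corresp_region, target_zoom_in,
               pred_crop_region, target_crop,
               pred_consistency, target_consistency,
               full_pred):
    """Equal-weight sum of the four terms described above: correspondence-
    and crop-based supervision, multi-resolution consistency, and the
    entropy regularizer over the full prediction map."""
    l_corresp = bce(pred_corresp_region, target_zoom_in)
    l_crop = bce(pred_crop_region, target_crop)
    l_consistency = bce(pred_consistency, target_consistency)
    l_reg = binary_entropy(full_pred)
    return l_corresp + l_crop + l_consistency + l_reg
```

Here the supervision targets are themselves CLIPSeg probability maps (from the zoomed-in view, the crop, and the pretrained model, respectively), so each term is a soft cross-entropy between two segmentation maps.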
We illustrate this fine-tuning process over corresponding image pairs in Figure 4. As illustrated in the figure, the leftmost images (i.e., zoom-ins) determine the supervision signal. Note that while we only supervise learning in the corresponding region in each training sample, the refined model (denoted as CLIPSeg FT) correctly extrapolates this knowledge to the rest of the zoomed-out image. Figure 5 illustrates the effect of this fine-tuning on segmentation of new landmarks (unseen during training); we see that our fine-tuning gives CLIPSeg FT knowledge of various semantic categories that the original pretrained CLIPSeg struggles to localize. We proceed to use this model to produce 2D segmentations that may be lifted to a 3D representation.

Text-Driven Neural 3D Localization
In this section, we describe our approach for performing 3D localization over a neural representation of the scene, using the semantic understanding obtained in the previous offline training stages. The input to our 3D localization framework is an Internet image collection of a new (unseen) landmark and a target text prompt. We augment the Ha-NeRF representation with a segmentation MLP head, added on top of a shared backbone (see the supplementary material for additional details). To learn the volumetric probabilities of a given target text prompt, we freeze the shared backbone and optimize only the segmentation MLP head.
To provide supervision for semantic predictions, we use the 2D segmentation map predictions of CLIPSeg FT (described in Section 3.2) on each input view. While these semantically adapted 2D segmentation maps are calculated for each view separately, HaLo-NeRF learns a 3D model which aggregates these predictions while enforcing 3D consistency. We use a binary cross-entropy loss to optimize the semantic volumetric probabilities, comparing them to the 2D segmentation maps over sampled rays [ZLLD21]. This yields a representation of the semantic concept's location in space. Novel rendered views along with estimated probabilities are shown in Figures 1 and 2 and in the accompanying videos.
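As a sketch of this online stage, the probability that a ray depicts the target concept can be composited with the standard volume-rendering weights, using densities from the frozen backbone and logits from the segmentation head. This is a minimal illustration under simplified assumptions; variable names are ours, not the paper's.

```python
import numpy as np

def render_semantic_probability(sigmas, semantic_logits, deltas):
    """Composite per-sample semantic probabilities along one ray.
    sigmas: densities from the frozen backbone at each ray sample;
    semantic_logits: outputs of the segmentation MLP head at those samples;
    deltas: distances between consecutive samples along the ray."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # transmittance: probability the ray reaches each sample unoccluded
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * transmittance
    probs = 1.0 / (1.0 + np.exp(-semantic_logits))  # sigmoid of head outputs
    return float((weights * probs).sum())
```

The rendered value for each ray is then compared, via binary cross-entropy, against the CLIPSeg FT prediction at the corresponding pixel, so only the segmentation head receives gradients.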

The HolyScenes Benchmark
To evaluate our method, we need Internet photo collections covering scenes, paired with ground-truth segmentation maps. As we are not aware of any such existing datasets, we introduce the HolyScenes benchmark, assembled from multiple datasets (WikiScenes [WAESS21], IMC-PT 2020 [Yi20], and MegaDepth [LS18]) along with additional data collected using the data scraping procedure of Wu et al. We enrich these scene images with ground-truth segmentation annotations. Our dataset includes 6,305 images associated with 3D structure-from-motion reconstructions and ground-truth segmentations for multiple semantic categories.
We select six landmarks, exemplified in Figure 6, each captured by a large number of publicly-available Internet images. We associate these landmarks with the following semantic categories: portal, window, spire, tower, dome, and minaret. Each landmark is associated with a subset of these categories, according to its architectural structure (e.g., minaret is only associated with the two mosques in our benchmark).
We produce ground-truth segmentation maps to evaluate our method using manual labelling combined with correspondence-guided propagation. For each semantic concept, we first manually segment several images from different landmarks. We then propagate these segmentation maps to overlapping images, and manually filter these propagated masks (removing, for instance, occluded images). Additional details about our benchmark are provided in the supplementary material.
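To illustrate the correspondence-guided propagation step, the following minimal sketch warps a manually labelled mask into an overlapping view using a least-squares affine transform fit to keypoint correspondences. This is an illustrative simplification (a richer geometric model could equally be used), and, as noted above, propagated masks are manually filtered afterwards.

```python
import numpy as np

def fit_affine(src_pts, dst_pts):
    """Least-squares 2D affine transform mapping src keypoints to dst keypoints."""
    A = np.hstack([src_pts, np.ones((len(src_pts), 1))])  # (N, 3) homogeneous
    M, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)       # solve A @ M ~ dst
    return M.T  # (2, 3)

def propagate_mask(mask, src_pts, dst_pts, out_shape):
    """Warp a labelled binary mask into an overlapping view by inverse-mapping
    each target pixel through the affine fit to the correspondences."""
    M = fit_affine(np.asarray(dst_pts, float), np.asarray(src_pts, float))
    H, W = out_shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    sx, sy = (M @ coords).round().astype(int)  # target pixel -> source pixel
    valid = (sx >= 0) & (sx < mask.shape[1]) & (sy >= 0) & (sy < mask.shape[0])
    out = np.zeros(H * W, dtype=mask.dtype)
    out[valid] = mask[sy[valid], sx[valid]]
    return out.reshape(H, W)
```

With identity correspondences the mask is reproduced exactly; with translated correspondences it is shifted accordingly, mimicking how a labelled region follows the matched keypoints into a new view.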

Results and Evaluation
In this section, we evaluate the performance of HaLo-NeRF on the HolyScenes benchmark, and compare our method to recent works on text-guided semantic segmentation and neural localization techniques. We also validate each component of our system with ablation studies, namely our LLM-based concept distillation, VLM semantic adaptation, and 3D localization. Finally, we discuss limitations of our approach. In the supplementary material, we provide experimental details as well as additional experiments, such as an evaluation of the effect of CLIPSeg fine-tuning on general and architectural term understanding evaluated on external datasets.

Baselines
We compare our method to text-driven image segmentation methods, as well as 3D NeRF segmentation techniques. As HolyScenes consists of paired images and view-consistent segmentation maps, it can be used to evaluate both 2D and 3D segmentation methods: in the former case, by directly segmenting images and evaluating on their ground-truth (GT) annotations; in the latter case, by rendering 2D segmentation masks from views corresponding to each GT annotation. The existing 3D methods we compare to, DFF [KMS22] and LERF [KKG*23], assume controlled capture settings, i.e. images with constant illumination or a single camera model. To provide a fair comparison, we replace the NeRF backbones used by DFF and LERF (vanilla NeRF and Nerfacto respectively) with Ha-NeRF, as used in our model, keeping the remaining architecture of these models unchanged. In the supplementary material, we also report results over the unmodified DFF and LERF implementations using constant-illumination images rendered from Google Earth.
In addition to these existing 3D methods, we compare to the baseline approach of lifting 2D CLIPSeg (pretrained, not fine-tuned) predictions to a 3D representation with Ha-NeRF augmented with a localization head (as detailed in Section 3.3). This baseline, denoted as HaLo-NeRF-, provides a reference point for evaluating the relative contributions of our optimization-based approach (rather than learning a feature field which may be probed with various textual inputs, as done by competing methods) and of our 2D segmentation fine-tuning.

Quantitative Evaluation
As stated in Section 5.1, our benchmark allows us to evaluate segmentation quality for both 2D and 3D segmentation methods, in the latter case by projecting 3D predictions onto 2D views with ground-truth segmentation maps. We perform our evaluation using pixel-wise metrics relative to ground-truth segmentations. Since we are interested in the quality of the model's soft probability predictions, we use average precision (AP) as our selected metric, as it is threshold-independent.
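A minimal sketch of this threshold-free AP computation over a predicted probability map and a binary ground-truth mask is shown below (illustrative; the exact protocol for averaging over images and landmarks follows the description above).

```python
import numpy as np

def average_precision(pred_probs, gt_mask):
    """Threshold-independent AP over flattened pixel predictions: rank pixels
    by predicted probability, then average precision at each positive pixel
    (i.e. at each recall step)."""
    scores = np.asarray(pred_probs).ravel()
    labels = np.asarray(gt_mask).ravel().astype(bool)
    order = np.argsort(-scores)          # highest-confidence pixels first
    labels = labels[order]
    tp = np.cumsum(labels)               # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    return precision[labels].mean()
```

For example, a predictor that ranks every ground-truth pixel above every background pixel scores an AP of 1.0 regardless of the absolute probability values.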
In Table 1 we report the AP per semantic category (averaged over landmarks), as well as the overall mean AP (mAP) across categories. We report results for 2D image segmentation models on top, and 3D segmentation methods underneath. In addition to reporting 3D localization results for our full proposed system, we also report the results of our intermediate 2D segmentation component (CLIPSeg FT).
As seen in the table, CLIPSeg FT (our fine-tuned segmentation model, as defined in Section 3.2) outperforms other 2D methods, showing better knowledge of architectural concepts and their localization. In addition to free-text guided methods (LSeg and CLIPSeg), we also outperform the ToB model (which was trained on WikiScenes), consistent with the low recall scores reported by Wu et al. [WAESS21]. LSeg also struggles in our free-text setting, where semantic categories strongly deviate from its training data; CLIPSeg shows better zero-shot understanding of our concepts out of the box, but still has a significant performance gap relative to CLIPSeg FT.
In the 3D localization setting, we also see that our method strongly outperforms prior methods over all landmarks and semantic categories. HaLo-NeRF adds 3D consistency over CLIPSeg FT image segmentations, further boosting performance by fusing predictions from multi-view inputs into a 3D representation which enforces consistency across observations. We also find an overall performance boost relative to the baseline approach using HaLo-NeRF without CLIPSeg fine-tuning. This gap is particularly evident for unique architectural terms such as portal and minaret.
Regarding the gap between our performance and that of the competing 3D methods (DFF, LERF), we consider multiple contributing factors. In addition to our enhanced understanding of domain-specific semantic categories and their positioning, the designs of these models differ from HaLo-NeRF in ways which may impact performance. DFF is built upon LSeg as its 2D backbone; hence, its performance gap on our benchmark follows logically from the poor performance of LSeg in this setting (as seen in the reported 2D results for LSeg), consistent with the observation of Kobayashi et al. [KMS22] that DFF inherits bias towards in-distribution semantic categories from LSeg (e.g. for traffic scenes). LERF, like DFF, regresses a full semantic 3D feature field which may then be probed for arbitrary text prompts. By contrast, HaLo-NeRF optimizes for the more modest task of localizing a particular concept in space, likely more feasible in this challenging setting. The significant improvement provided by performing per-concept optimization is also supported by the relatively stronger performance of the baseline model shown in Table 1, which performs this optimization using pretrained (not fine-tuned) CLIPSeg segmentation maps as inputs.

Qualitative Results
Sample results of our method are provided in Figures 6-11. As seen in Figure 6, HaLo-NeRF segments regions across various landmarks and succeeds in differentiating between fine-grained architectural concepts. Figure 7 compares these results to alternate 3D localization methods. As seen there, alternative methods fail to reliably distinguish between the different semantic concepts, tending to segment the entire building facade rather than identifying the areas of interest. With LERF, this tendency is often accompanied by higher probabilities in coarsely accurate regions, as seen by the roughly highlighted windows in the middle row. Figure 8 shows a qualitative comparison of HaLo-NeRF with and without CLIPSeg fine-tuning over additional semantic concepts beyond those from our benchmark. As seen there, our fine-tuning procedure is needed to learn reliable localization of such concepts which may be lifted to 3D.
We include demonstrations of the generality of our method. Besides noting that our test set includes the synagogue category, which was not seen in training (see the results for the Hurva Synagogue shown in Figure 6), we test our model in the more general case of (non-religious) architectural landmarks. Figure 11 shows results on various famous landmarks captured in the IMC-PT 2020 dataset [Yi20] (namely, Brandenburg Gate, Palace of Westminster, The Louvre Museum, Park Güell, The Statue of Liberty, Las Vegas, The Trevi Fountain, The Pantheon, and Buckingham Palace). As seen there, HaLo-NeRF localizes unique scene elements such as the quadriga, the torch, and the Eiffel Tower in the Brandenburg Gate, the Statue of Liberty, and Las Vegas, respectively. In addition, HaLo-NeRF localizes common semantic concepts, such as clock, glass, and text in the Palace of Westminster, The Louvre Museum, and The Pantheon, respectively. Furthermore, while we focus mostly on outdoor scenes, Figure 9 shows that our method can also localize semantic concepts over reconstructions capturing indoor scenes.
Understanding that users may not be familiar with fine-grained or esoteric architectural terminology, we anticipate the use of CLIP FT (our fine-tuned CLIP model, as defined in Section 3.2) for retrieving relevant terminology. In particular, CLIP FT may be applied to any selected view to retrieve relevant terms to which the user may then apply HaLo-NeRF. We demonstrate this qualitatively in Figure 10, which shows the top terms retrieved by CLIP FT on test images. In the supplementary material, we also report a quantitative evaluation over all architectural terms found at least 10 times in the training data. This evaluation further demonstrates that CLIP FT can retrieve relevant terms over these Internet images (significantly outperforming pretrained CLIP at this task).
Figure 11 further illustrates the utility of our method for intuitive exploration of scenes. By retrieving scene images having maximal overlap with localization predictions, the user may focus automatically on the text-specified region of interest, allowing for exploration of the relevant semantic regions of the scene in question. This is complementary to exploration over the optimized neural representation, as illustrated in Figures 1-2 and in the accompanying videos.

Ablation Studies
We proceed to evaluate the contribution of multiple components of our system, LLM-based concept distillation and VLM semantic adaptation, to provide motivation for the design of our full system.

Table 2: Ablation studies, evaluating the effect of design choices on the fine-tuning process of CLIPSeg FT. "Baseline" denotes using the CLIPSeg segmentation model without fine-tuning. We report AP and mAP metrics over the HolyScenes benchmark as in Table 1. Best results are highlighted in bold.

LLM-based Concept Distillation. In order to evaluate the quality of our LLM-generated pseudo-labels and their necessity, we manually review a random subset of 100 items (with non-empty pseudo-labels), evaluating their factual correctness and comparing them to two metadata-based baselines: whether the correct architectural feature is present in the image's caption, and whether it could be inferred from the last WikiCategory listed in the metadata for the corresponding image (see Section 3.1 for an explanation of this metadata). These baselines serve as upper bounds for architectural feature inference using the most informative metadata fields by themselves (and assuming the ability to extract useful labels from them). We find 89% of pseudo-labels to be factually correct, while only 43% of captions contain information implying the correct architectural feature, and 81% of the last WikiCategories describe said features. We conclude that our pseudo-labels are more informative than the baseline of using the last WikiCategory, and significantly more so than inferring the architectural feature from the image caption. Furthermore, using either of the latter alone would still require summarizing the text to extract a usable label, along with translating a large number of results into English.
To further study the effect of our LLM component on pseudo-labels, we provide ablations on LLM sizes and prompts in the supplementary material, finding that smaller models underperform ours while the best-performing prompts show similar results. There we also provide statistics on the distribution of our pseudo-labels, showing that they cover a diverse set of categories with a long tail of esoteric items.
VLM Semantic Adaptation Evaluation. To strengthen the motivation behind our design choices for CLIPSeg FT, we provide an ablation study of the segmentation fine-tuning in Table 2. We see that each element of our training design provides a boost in overall performance, together significantly outperforming the 2D baseline segmentation model. In particular, we see the key role of our correspondence-based data augmentation, without which the fine-tuning procedure significantly degrades due to a lack of grounding in the precise geometry of our scenes (both relative to full fine-tuning and relative to the original segmentation model). These results complement Figure 5, which shows a qualitative comparison of the CLIPSeg baseline and CLIPSeg FT. We also note that we have provided a downstream evaluation of the effect of fine-tuning CLIPSeg on 3D localization in Table 1, showing that it provides a significant performance boost and is particularly crucial for less common concepts.

Limitations
As our method uses an optimization-based pipeline applied to each textual query, it is limited by the runtime required to fit each term's segmentation field. In particular, a typical run takes roughly two hours on our hardware setup, described in the supplementary material. We foresee future work building upon our findings to accelerate these results, possibly using architectural modifications such as encoder-based distillation of model predictions.
Furthermore, if the user inputs a query which does not appear in the given scene, our model may segment semantically- or geometrically-related regions, a behavior inherited from the base segmentation model. For example, the spires of Milan Cathedral are segmented when the system is prompted with the term minarets, which are not present in the view but bear visual similarity to spires. Nevertheless, CLIP FT may provide the user with a vocabulary of relevant terms (as discussed in Section 5.3), mitigating this issue (e.g. minarets does not appear among the top terms for images depicting Milan Cathedral). We further discuss this tendency to segment salient, weakly-related regions in the supplementary material.
Additionally, since we rely on semantic concepts that appear across landmarks in our training set, concepts require sufficient coverage in this training data in order to be learned. While our method is not limited to common concepts and shows understanding of concepts in the long tail of the distribution of pseudo-labels (as analyzed in the supplementary material), those that are extremely rare or never occur in our training data may not be properly identified. This is seen in Figure 12, where the localizations of the scene-specific concepts Immaculate Conception and Papal Coat of Arms (terms which never occur in our training data; for example, the similar term coat of arms appears only seven times) incorrectly include other regions.

Conclusions
We have presented a technique for connecting unique architectural elements across different modalities of text, images, and 3D volumetric representations of a scene. To understand and localize domain-specific semantics, we leverage inter-view coverage of a scene in multiple modalities, distilling concepts with an LLM and using view correspondences to bootstrap spatial understanding of these concepts. We use this knowledge as guidance for a neural 3D representation which is view-consistent by construction, and demonstrate its performance on a new benchmark for concept localization in large-scale scenes of tourist landmarks.
Our work represents a step towards the goal of modeling historic and culturally significant sites as explorable 3D models from photos and metadata captured in the wild. We envision a future where these compelling sites are available to all in virtual form, making them accessible and offering educational opportunities that would not otherwise be possible. Several potential research avenues include making our approach interactive, localizing multiple prompts simultaneously, and extending our technique to additional mediums with esoteric concepts, such as motifs or elements in artwork.

Supplementary Material

1. HolyScenes - Additional Details
Landmarks and Categories Used.
Our benchmark spans three landmark building types (cathedrals, mosques, and a synagogue) from different areas around the world. We select scenes that have sufficient RGB imagery for reconstruction with [CZL*22]. The images were taken from IMC-PT 2020 [Yi20] (Notre-Dame Cathedral, St. Paul's Cathedral), MegaDepth [LS18] (Blue Mosque), WikiScenes [WAESS21] (Milan Cathedral), and scraped from Wikimedia Commons using the WikiScenes data scraping procedure (Badshahi Mosque and Hurva Synagogue). The Notre-Dame Cathedral has the most images in the dataset (3,765 images), and the Hurva Synagogue has the fewest (104 images). For semantic categories, we select diverse concepts of different scales. Some of these (such as portal) are applicable to all landmarks in our dataset while others (such as minaret) only apply to certain landmarks. As illustrated in Table 3, we provide segmentations of 3-4 semantic categories for each landmark; these are selected based on the relevant categories in each case (e.g. only the two mosques have minarets).

Annotation Procedure
We produce ground-truth binary segmentation maps to evaluate our method using manual labelling combined with correspondence-guided propagation. We first segment 110 images from 3-4 different categories from each of the six different scenes in our dataset, as shown in Table 3. We then estimate homographies between these images and the remaining images for these landmarks, using shared keypoint correspondences from COLMAP [SF16] and RANSAC. We require at least 100 corresponding keypoints that are RANSAC inliers; we also filter out extreme (highly skewed or rotated) homographies by using the condition number of the first two columns of the homography matrix. When multiple propagated masks can be inferred for a target image, we calculate each pixel's binary value by a majority vote of the warped masks. Finally, we filter these augmented masks by manual inspection. Out of 8,951 images, 6,195 were kept (along with the original manual seeds), resulting in a final benchmark size of 6,305 items. Those that were filtered out are mostly due to occlusions and inaccurate warps. Annotation examples from our benchmark are shown in Figure 13.
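The filtering and voting steps above can be sketched as follows; this is a minimal illustration, where the condition-number threshold in `is_extreme_homography` is an assumed value (the paper does not state its exact cutoff), and homography estimation itself (COLMAP keypoints plus RANSAC) is omitted:

```python
import numpy as np

def is_extreme_homography(H, max_cond=10.0):
    # Reject highly skewed or rotated warps via the condition number
    # of the first two columns of the 3x3 homography matrix.
    # max_cond is an illustrative threshold, not the paper's value.
    return np.linalg.cond(H[:, :2]) > max_cond

def merge_masks(warped_masks):
    # Majority vote, per pixel, over binary masks propagated from
    # several manually annotated source images.
    stack = np.stack(warped_masks).astype(np.float32)
    return (stack.mean(axis=0) >= 0.5).astype(np.uint8)
```

An identity homography passes the filter, while a strongly anisotropic one (e.g. scaling one axis by 100 and shrinking another) is rejected.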

Augmenting the WikiScenes Dataset
Table 3: The HolyScenes benchmark, composed of the Notre-Dame Cathedral (NDC), Milan Cathedral (MC), St. Paul's Cathedral (SPC), Badshahi Mosque (BAM), Blue Mosque (BLM), and the Hurva Synagogue (HS). Above we report the set of semantic categories annotated for each landmark, chosen according to their visible structure. In the columns on the right, we report the number of initial manually segmented images (#Seed), and the final number of ground-truth segmentations after augmentation with filtered warps (#Seg).

The original WikiScenes dataset is as described in Wu et al. [WAESS21]. To produce training data for the offline stages of our system (LLM-based semantic distillation and V&L model semantic adaptation), we augment this cathedral-focused dataset with mosques by using the same procedure to scrape freely-available Wikimedia Commons images collected from the root WikiCategory "Mosques by year of completion". The collected data contains a number of duplicate samples, since the same image may appear under different categories in Wikimedia Commons and is thus retrieved multiple times by the scraping script. In order to deduplicate, we treat the image's filename (as accessed on Wikimedia Commons) as a unique identifier. After de-duplication, we are left with 69,085 cathedral images and 45,668 mosque images. Out of these, we set aside the images from landmarks which occur in HolyScenes (13,743 images total) to prevent test set leakage; the remaining images serve as our training data.
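The de-duplication step is straightforward; a minimal sketch, assuming each scraped record carries its Wikimedia Commons filename under a hypothetical `"filename"` key:

```python
def deduplicate(records):
    # Treat the Wikimedia Commons filename as a unique identifier;
    # keep only the first record seen for each filename.
    seen, unique = set(), []
    for rec in records:
        if rec["filename"] not in seen:
            seen.add(rec["filename"])
            unique.append(rec)
    return unique
```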

LLM-Based Semantic Distillation
To distill the image metadata into concise textual pseudo-labels, we use the instruction-tuned language model Flan-T5 [CHL*22], selecting the 3B parameter Flan-T5-XL variant. The model is given the image caption, related keywords, and filename, and outputs a single word describing a prominent architectural feature within the image, serving as its pseudo-label. Text is generated using beam search decoding with four beams. The prompt given to Flan-T5 includes the instruction to Write "unknown" if it is not specified (i.e. the architectural feature), in order to allow the language model to express uncertainty instead of hallucinating incorrect answers in indeterminate cases, as described in our main paper. We also find the use of the building's name in the prompt (What architectural feature of ⟨BUILDING⟩...) to be important in order to cue the model to omit the building's name from its output (e.g. towers of the Cathedral of Seville vs. simply towers).
To post-process these labels, we employ the following textual cleanup techniques. We (1) apply lowercasing, (2) remove outputs starting with "un-" ("unknown", "undefined", etc.), and (3) remove navigation words (e.g. "west" in "west facade") since these are not informative for learning visual semantics. Statistics on the final pseudo-labels are given in Section 3.1.
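A minimal sketch of this cleanup, with an illustrative (and deliberately incomplete) set of navigation words:

```python
# Illustrative subset of navigation words; the full list is an assumption.
DIRECTIONS = {"north", "south", "east", "west",
              "northern", "southern", "eastern", "western"}

def clean_pseudo_label(label):
    # (1) lowercase; (2) drop outputs starting with "un-" ("unknown",
    # "undefined", ...); (3) strip leading navigation words, as in
    # "west facade" -> "facade". Returns None for discarded labels.
    label = label.strip().lower()
    if label.startswith("un"):
        return None
    words = label.split()
    while words and words[0] in DIRECTIONS:
        words = words[1:]
    return " ".join(words) or None
```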

Semantic Adaptation of V&L Models
We fine-tune CLIP FT on images and associated pseudo-labels, preprocessing by removing all pairs whose pseudo-label begins with "un-" (e.g. "unknown", "undetermined", etc.) and removing initial direction words ("north", "southern", "north eastern", etc.), as these are not visually informative. In total, this consists of 57,874 image and pseudo-label pairs.

To collect correspondence-based supervision for fine-tuning CLIPSeg, we select image pairs (I1, I2) and a pseudo-label P satisfying the following heuristic conditions:
• I2, as a zoomed-out image, should contain P but not perfectly match it as a concept.
• At least 3 inlier keypoints within the region R_P of I1 matching P. R_P is estimated by segmenting I1 with CLIPSeg with prompt P and binarizing with threshold 0.3.
• A low ratio of areas of the region matching P relative to the building's facade, since this suggests a localizable concept. This is estimated as follows: we first find the quadrilateral Q which is the region of I2 corresponding to I1, by projecting I1 with the homography estimated from corresponding keypoints. We then find the facade of the building in I2 by segmenting I2 using CLIPSeg with the prompt cathedral or mosque (as appropriate for the given landmark), which outputs the matrix of probabilities M. Finally, we calculate the sum of elements of M contained in Q divided by the sum of all elements of M, and check whether this is less than 0.5.
Empirically, we find that these heuristics succeed in filtering out many types of uninteresting image pairs and noise while selecting for the correspondences and pseudo-labels that are of interest. Due to computational constraints, we limit our search to 50 images from each landmark in our train set, paired with every other image from the same landmark; this procedure yields 3,651 triplets (I1, I2, P) in total, covering 181 unique pseudo-label categories. To use these correspondences as supervision for training segmentation, we segment I1 using CLIPSeg with prompt P, project this segmentation onto I2 using the estimated homography, and use the resulting segmentation map in the projected region as ground truth for segmenting I2 with P.
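The facade-ratio condition can be sketched as follows, where `M` stands for the CLIPSeg facade probability map for I2 and `quad_mask` for a binary mask of the projected quadrilateral Q (both assumed precomputed):

```python
import numpy as np

def is_localizable(M, quad_mask, max_ratio=0.5):
    # Keep the pair when the facade probability mass inside the
    # projected quadrilateral Q is less than half of the total
    # facade probability mass, suggesting a localizable concept.
    ratio = (M * quad_mask).sum() / max(M.sum(), 1e-8)
    return bool(ratio < max_ratio)
```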
In addition to this data, we collect training data on a larger scale by searching for images from the entire training dataset with crops that are close to particular pseudo-labels. To do this, we run a search by randomly selecting a landmark L and one of its images I, selecting a random pseudo-label P that appears with L (not necessarily with the chosen image) in our dataset, selecting a random crop C of I, and checking its similarity to P with CLIP FT. We check whether the following heuristic conditions hold:
• C must have CLIP FT similarity of at least 0.2 with P.
• C must have higher CLIP FT similarity to P than I does.
• This similarity must be higher than the similarity between C and the 20 most common pseudo-labels in our train dataset (excluding P, if it is one of these common pseudo-labels).
• C, when segmented using CLIPSeg with prompt P, must have some output probability of at least 0.1 in its central area (the central 280×280 region within the 352×352 output matrix).
If these conditions hold, we use the pair (I, P), along with the CLIPSeg segmentation of the crop C with prompt P, as ground-truth data for fine-tuning our segmentation model. Although this search could be run indefinitely, we terminate it after collecting 29,440 items to use as training data.
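The four conditions above can be sketched as a single predicate; the inputs are assumed to be precomputed CLIP FT similarities and CLIPSeg probabilities for the central 280×280 region:

```python
def accept_crop(sim_crop, sim_full, sims_common, center_probs):
    # sim_crop: CLIP-FT similarity of crop C to pseudo-label P
    # sim_full: similarity of the full image I to P
    # sims_common: similarities of C to the 20 most common labels
    #              (P itself excluded)
    # center_probs: CLIPSeg output probabilities in the central region
    return (sim_crop >= 0.2
            and sim_crop > sim_full
            and all(sim_crop > s for s in sims_common)
            and max(center_probs) >= 0.1)
```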
For both sources of data (correspondence-based and crop-based), we further refine the pseudo-labels by converting them to singular form, removing digits and additional direction words, and removing non-localizable concepts and those referring to most of the landmark or its entirety ("mosque", "front", "gothic", "cathedral", "side", "view").
We fine-tune CLIPSeg to produce CLIPSeg FT by training for 10 epochs with learning rate 1e-4. We freeze CLIPSeg's encoders and only train its decoder module. To provide robustness to label format, we randomly augment textual pseudo-labels by converting them from singular to plural form (e.g. "window" → "windows") with probability 0.5. At each iteration, we calculate losses using a single image and ground-truth pair from the correspondence-based data, and a minibatch of four image and ground-truth pairs from the crop-based data. We use four losses for training, summed together with equal weighting, as described in Section 3.2 of the main paper.
CLIPSeg (and CLIPSeg FT) requires a square input tensor with spatial dimensions 352×352. In order to handle images of varying aspect ratios during inference, we apply vertical replication padding to short images, and for wide images we average predictions applied over a horizontally sliding window. In the latter case, we use overlapping windows with a stride of 25 pixels, after resizing images to have a maximum dimension of 500 pixels. Additionally, in outdoor scenes, we apply inference after zooming in to the bounding box of the building in question, in order to avoid attending to irrelevant regions. The building is localized by applying CLIPSeg with the zero-shot prompt cathedral, mosque, or synagogue (as appropriate for the building in question), selecting the smallest bounding box containing all pixels with predicted probabilities above 0.5, and adding an additional 10% margin on all sides. While our model may accept arbitrary text as input, we normalize inputs for metric calculations to plural form ("portals", "windows", "spires", etc.) for consistency.
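A minimal sketch of the sliding-window averaging for wide images, assuming the image has already been resized so its height equals the model's 352-pixel input size and `predict` wraps the segmentation model:

```python
import numpy as np

def sliding_window_predict(predict, image, size=352, stride=25):
    # Average overlapping square-window predictions across a wide image.
    # `predict` maps a (size, size, C) crop to a (size, size) probability
    # map; the image height is assumed to already equal `size`.
    H, W = image.shape[:2]
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    xs = list(range(0, max(W - size, 0) + 1, stride))
    if W > size and xs[-1] != W - size:
        xs.append(W - size)  # ensure the right edge is covered
    for x in xs:
        acc[:, x:x + size] += predict(image[:, x:x + size])
        cnt[:, x:x + size] += 1
    return acc / np.maximum(cnt, 1)
```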

3D Localization
We build on top of the Ha-NeRF [CZL*22] architecture with an added semantic channel, similarly to Zhi et al. [ZLLD21]. This semantic channel consists of an MLP with three hidden layers (dimensions 256, 256, 128) with ReLU activations, and a final output layer for binary prediction with a softmax activation. We first train the Ha-NeRF RGB model of a scene (learning rate 5e-4 for 250K iterations); we then freeze the shared MLP backbone of the RGB and semantic channels and train only the semantic channel head (learning rate 5e-5, 12.5K iterations). We train with batch size 8,192.
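As a minimal NumPy sketch (rather than an actual deep-learning implementation), the semantic head's forward pass looks like the following; the input dimension and weight initialization are purely illustrative:

```python
import numpy as np

def semantic_head_forward(x, weights):
    # Three ReLU hidden layers (256, 256, 128) over the frozen backbone
    # features, followed by a 2-way softmax for binary prediction.
    h = x
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)        # ReLU hidden layers
    W, b = weights[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax over 2 classes

def init_head(in_dim=256, hidden=(256, 256, 128), out_dim=2, seed=0):
    # Illustrative random initialization only.
    rng = np.random.default_rng(seed)
    dims = (in_dim, *hidden, out_dim)
    return [(0.01 * rng.standard_normal((a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]
```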
When training the semantic channel, the targets are binary segmentation masks produced by CLIPSeg FT with a given text prompt, using the inference method described above. We binarize these targets (threshold 0.2) to reduce variance stemming from outputs with low confidence, and we use a binary cross-entropy loss when training on them.
For indoor scenes, we use all available images to train our model. For outdoor scenes, we select 150 views with segmentations for building the 3D semantic field, selecting for images with clear views of the building's entire facade without occlusions. We find that this procedure yields comparable performance to using all the images in the collection, while being more computationally efficient. To select these images, we first segment each candidate image with CLIPSeg using one of the prompts cathedral, mosque, or synagogue (as relevant), select the largest connected component C of the output binary mask (using threshold 0.5), and sort the images by the minimum horizontal or vertical margin length of this component from the image's borders. This prioritizes images where the building facade is fully visible and contained within the boundary of the visible image. To prevent occluded views of the building from being selected, we add a penalty using the proportion of overlap between C and the similar binary mask C′ calculated on the RGB NeRF reconstruction of the same view, since transient occlusions are not typically reconstructed by the RGB NeRF. In addition, we penalize images with less than 10% or more than 90% of their total area covered by C, since these often represent edge cases where the building is barely visible or not fully contained within the image. Written precisely, the scoring formula is given by s = m + c - x, where m is the aforementioned margin size (on a scale from 0 to 1), c is the proportion of the area of C′ overlapping C, and x is a penalty of 1.0 when C covers too little or too much of the image (as described above) and 0 otherwise.
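The scoring formula s = m + c - x can be written directly (margin, overlap, and coverage are assumed to be precomputed from the masks as described above):

```python
def view_score(margin, overlap, coverage):
    # s = m + c - x: margin m of the facade component from the image
    # borders (0..1), overlap c between the image's facade mask C and
    # the RGB-NeRF reconstruction mask C', and penalty x = 1 when the
    # facade covers under 10% or over 90% of the image area.
    x = 1.0 if (coverage < 0.1 or coverage > 0.9) else 0.0
    return margin + overlap - x
```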
Runtime. A typical run (optimizing the volumetric probabilities for a single landmark) takes roughly 2 hours on an NVIDIA RTX A5000 with a single GPU. Optimizing the RGB and density values is only done once per landmark, and takes 2 days on average, depending on the number of images in the collection.

Baseline Comparisons
We provide additional details of our comparison to DFF [KMS22] and LERF [KKG*23]. We train these models on the same images used to train our model. We use a Ha-NeRF backbone; similarly to our method, we train the RGB NeRF representations for 250K steps and then the semantic representations for an additional 150K steps. Otherwise, we follow the original training and implementation details of these models, which we reproduce here for clarity.
For DFF, we implement feature layers as an MLP with 2 hidden layers of dimension 128 and ReLU activations. The input to the DFF model is the images and the corresponding features derived from LSeg; we minimize the difference between the learned features and the LSeg features with an L2 loss, training with batch size 1,024.
For LERF, we use the official implementation, which uses the Nerfacto method and the Nerfstudio API [TWN*23]. The architecture includes a DINO MLP with one hidden layer of dimension 256 and a ReLU activation, and a CLIP MLP consisting of 3 hidden layers of dimension 256, ReLU activations, and a final 512-dimensional output layer. The input to this model consists of images, their CLIP embeddings at different scales, and their DINO features. We use the same losses as the original LERF paper: a CLIP loss on the CLIP embeddings to maximize cosine similarity, and an MSE loss on the DINO features. The CLIP loss is multiplied by a factor of 0.01, as in the LERF paper. We use an image pyramid from scale 0.05 to 0.5 in 7 steps. We train this model with batch size 4,096. We also use the relevancy score with the same canonical phrases as described in the LERF paper: "object", "things", "stuff", and "texture".
We note that the long tail of pseudo-labels includes items shown in our evaluation such as tympanum (29 occurrences), roundel (occurs once, as painted roundel), colonnade (230 occurrences), and pediment (3 occurrences; 44 times as the plural pediments).

CLIPSeg Visualizations
As described in our main paper, we leverage the ability of CLIPSeg to segment salient objects in zoomed-in images even when it lacks fine-grained understanding of the accompanying pseudo-label. To illustrate this, Figure 16 shows several results of inputting the target text prompt door to CLIPSeg along with images that do not have visible doors. As seen there, the model segments salient regions which bear some visual and semantic similarity to the provided text prompt (i.e. possibly recognizing an "opening" agnostic to its fine-grained categorization as a door, portal, window, etc.). Our fine-tuning scheme leverages this capability to bootstrap segmentation knowledge in zoomed-out views by supervising over zoomed-in views where the salient region is known to correspond to its textual pseudo-label.
Additionally, we find that 2D segmentation maps often show a bias towards objects and regions in the center of images, at the expense of the peripheries of scenes. This is seen, for instance, in Figure 16, where the windows in the center are better localized than the windows on the sides of the building.

HaLo-NeRF Visualizations
In Figure 14, we compare segmentation results before and after 3D localization. We see that HaLo-NeRF exhibits 3D consistency, while the 2D segmentation results of CLIPSeg FT, operating on each image separately, are inconsistent between views. We also see that this effect is prominent when using these methods for binary segmentation, obtained by thresholding predictions.
In Figure 15, we demonstrate our ability to perform localization of multiple semantic concepts in a single view of a scene. By providing HaLo-NeRF with different text prompts, the user may decompose a scene into semantic regions to gain an understanding of its composition.

2D Baseline Visualizations
In Figure 17, we visualize outputs of the two 2D baseline segmentation methods (LSeg and ToB) as well as CLIPSeg and our fine-tuned CLIPSeg FT. We see that the baseline methods struggle to attend to the relevant regions in our images, while CLIPSeg FT shows the best understanding of these concepts and their localizations.

We sample 100 random items from our dataset for manual inspection, running pseudo-labeling with our original setting (XL, P1) as well as with alternate model sizes and prompts. Regarding model sizes, while the majority of non-empty generated pseudo-labels are valid as we show in the main paper, we consider how often empty or incorrect pseudo-labels are yielded when varying the model size. Of the 62/100 items that receive an empty, poor, or vague pseudo-label in our original setting, only one receives a valid pseudo-label with a smaller model, confirming the superior performance of the largest (XL) model. Regarding prompt variants, P2 only yields 9/100 valid pseudo-labels (versus 38/100 for P1), while P3 yields 40/100 valid pseudo-labels (31 of these are in common with P1). Thus, the best-performing prompt (P3) is comparable to our original setting, suggesting that our original setting is well-designed to produce useful pseudo-labels.

Figure 16: Above we provide the target prompt door to CLIPSeg (pretrained and fine-tuned) along with images that do not have visible doors. As seen above, the models instead segment more salient regions which bear some visual and semantic similarity to the provided text prompt (in this case, segmenting windows).

Table 4: We report recall at k ∈ {1, 5, 10, 16, 32, 64}, comparing our results (highlighted in the table) to the baseline CLIP model. Best results are highlighted in bold.

CLIP FT Retrieval Results
In Table 4, we show quantitative results for the use of CLIP FT to retrieve relevant terminology, as described in our main paper. In particular, we fix a vocabulary of architectural terms found at least 10 times in the training data, and evaluate text-to-image retrieval on test images (from landmarks not seen during training) with pseudo-labels in this list. As seen in these results, our fine-tuning provides a significant performance boost to CLIP in retrieving relevant terms for scene views, as the base CLIP model is not necessarily familiar out-of-the-box with the fine-grained architectural terminology relevant to our landmarks.

Additional CLIPSeg FT Results
To test the robustness of our CLIPSeg fine-tuning on additional datasets and its preservation of pretraining knowledge, we evaluate segmentation results on two additional datasets: SceneParse150 [ZZP*16, ZZP*17] (general outdoor scene segmentation) and WikiScenes [WAESS21] (architectural terminology).
On SceneParse150, we test on the validation split (2,000 items), selecting a random semantic class per image (from among those classes present in the image's annotations). We segment using the class's textual name and measure average precision, averaged over all items to yield the mean average precision (mAP) metric. We observe a negligible performance degradation after fine-tuning, namely mAP 0.53 before fine-tuning and 0.52 afterwards, suggesting overall preservation of pretraining knowledge.
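A minimal sketch of the per-item average precision used here (`scores` are predicted probabilities and `labels` the binary ground truth, both flattened; mAP is then the mean over items):

```python
import numpy as np

def average_precision(scores, labels):
    # Rank pixels by descending score; AP is the mean of the precision
    # values at the rank of each positive pixel.
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    prec = hits / (np.arange(len(labels)) + 1)
    return (prec * labels).sum() / max(labels.sum(), 1)
```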
semantic segmentation. The emergence of powerful large-scale vision-language models [JYX*21, RKH*21] has propelled a surge of interest in pixel-level semantic segmentation using text prompts [XZW*21, LWB*22, LE22, DXXD22, XDML*22, GGCL22, ZLD22, LWD*23]. A number of these works leverage the rich semantic understanding of CLIP [RKH*21], stemming from large-scale contrastive training on text-image pairs. LSeg [LWB*22] trains an image encoder to align a dense pixel representation with CLIP's embedding for the text description of the corresponding semantic class. OpenSeg [GGCL22] optimizes a class-agnostic region segmentation module to match words extracted from image captions. CLIPSeg [LE22] leverages the activations of CLIP's dual encoders, training a decoder to convert them into a binary segmentation mask. CLIP's zero-shot understanding on the image level has also been leveraged for localization by Decatur et al. [DLH22], who lift CLIP-guided segmentation in 2D views to open-vocabulary localization over 3D meshes.

Figure 2 :
Figure 2: System overview of our approach. (a) We extract semantic pseudo-labels from noisy Internet image metadata using a large language model (LLM). (b) We use these pseudo-labels and correspondences between scene views to learn image-level and pixel-level semantics. In particular, we fine-tune an image segmentation model (CLIPSeg FT) using multi-view supervision, where zoomed-in views and their associated pseudo-labels (such as the image on the left associated with the term "tympanum") provide a supervision signal for zoomed-out views. (c) We then lift this semantic understanding to learn volumetric probabilities over new, unseen landmarks (such as St. Paul's Cathedral, depicted on the right), allowing for rendering views of the segmented scene with controlled viewpoints and illumination settings. See below for the definitions of the concepts shown*.
Input: ARCHED-WALKWAYS-AT RAJON-KI-BAOLI.JPG; "This is a photo of ASI monument number."; Rajon ki Baoli. Output: Archways

Input: CATEDRAL-DE-PALMA-DE-MALLORCA,-FACHADA-SUR,-DESDE-EL-PASEO-DE-LA-MURALLA.JPG; "Catedral de Palma de Mallorca, fachada sur, desde el Paseo de la Muralla."; mallorca catedral cathedral palma spain mallorca majorca; Exterior of Cathedral of Palma de Mallorca; Cathedral of Palma de Mallorca - Full. Output: Facade

Input: SUNDIAL-YENI CAMII2-ISTANBUL.JPG; "sundial outside Yeni Camii. On top of the lines the arabic word Asr (afternoon daily prayer) is given. The ten lines (often they are only 9) indicate the times from 20min to 3h before the prayer. Time is read off at the tip of the shadow. The clock was made around 1669 (1074 H)."; New Mosque (Istanbul). Output: Sundial

Figure 3 :
Figure 3: LLM-based distillation of semantic concepts. The full image metadata (Input), including FILENAME, "caption" and WikiCategories (depicted similarly above), is used for extracting distilled semantic pseudo-labels (Output) with an LLM. Note that the associated images on top (depicted with corresponding colors) are not used as inputs for the computation of their pseudo-labels.

Figure 4 :
Figure 4: Adapting a text-based image segmentation model to architectural landmarks. We utilize image correspondences (such as the pairs depicted on the left) and pseudo-labels to fine-tune CLIPSeg. We propagate the pseudo-label and segmentation of the zoomed-in image to serve as the supervision target, as shown in the central column; we supervise predictions on the zoomed-out image only over the corresponding region (other regions are grayed out for illustration purposes). This supervision (together with the random crops further described in the text) refines the model's ability to recognize and localize architectural concepts, as seen by the improved performance shown on the right.

Figure 5 :
Figure 5: Text-based segmentation before and after fine-tuning. Above we show 2D segmentation results over images belonging to landmarks from HolyScenes (unseen during training). As illustrated above, our weakly-supervised fine-tuning scheme improves the segmentation of domain-specific semantic concepts.

Figure 6: Neural 3D Localization Results. We show results from each landmark in our HolyScenes benchmark (clockwise from top: St. Paul's Cathedral, Hurva Synagogue, Notre-Dame Cathedral, Blue Mosque, Badshahi Mosque, Milan Cathedral), visualizing segmentation maps rendered from 3D HaLo-NeRF representations on input scene images. As seen above, HaLo-NeRF succeeds in localizing various semantic concepts across diverse landmarks.
: Notre-Dame Cathedral (Paris), Milan Cathedral (Milan), St. Paul's Cathedral (London), Badshahi Mosque (Lahore), Blue Mosque (Istanbul) and Hurva Synagogue (Jerusalem). These landmarks span different geographical regions, religions and characteristics, and can readily be associated with accurate 3D reconstructions due to the large

© 2024 The Authors. Computer Graphics Forum published by Eurographics and John Wiley & Sons Ltd.
For text-based 2D segmentation baseline methods, we consider CLIPSeg [LE22] and LSeg [LWB*22]. We also compare to the ToB model proposed by Wu et al. [WAESS21], which learns image segmentation over the WikiScenes dataset using cross-view correspondences as weak supervision. As their model is categorical, operating over only ten categories, we report the performance of ToB only over the semantic concepts included in their model. For 3D NeRF-based segmentation methods, we consider DFF [KMS22] and LERF [KKG*23]. Both of these recent methods utilize text for NeRF-based 3D semantic segmentation. DFF performs semantic scene decomposition using text prompts, distilling text-aligned image features into a volumetric 3D representation and segmenting 3D regions by probing these with the feature representation of a given text query. Similarly, LERF optimizes a 3D language field from multi-scale CLIP embeddings with volume rendering.
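The text-query probing used by these feature-field baselines can be sketched as a cosine-similarity test. This is a minimal sketch under assumed shapes (per-point distilled features of dimension D, one text embedding of the same dimension), not the actual DFF or LERF code.

```python
import numpy as np

# Sketch of DFF/LERF-style probing: each 3D sample carries a distilled
# feature vector; a semantic region is segmented by cosine similarity
# between those features and the embedding of the text query.

def probe_with_text(point_feats: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between N point features (N, D) and one text embedding (D,)."""
    pf = point_feats / np.linalg.norm(point_feats, axis=-1, keepdims=True)
    te = text_emb / np.linalg.norm(text_emb)
    return pf @ te

rng = np.random.default_rng(0)
text = rng.normal(size=8)                       # stand-in for a text embedding
feats = np.stack([
    text + 0.05 * rng.normal(size=8),           # point aligned with the query
    -text,                                      # point opposed to the query
])
sims = probe_with_text(feats, text)
# Thresholding sims selects the region matching the query.
```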
* Ours uses a Ha-NeRF backbone.

Figure 7: Localizing semantic regions in architectural landmarks compared to prior work. We show probability maps for DFF and LERF models on Milan Cathedral, along with our results. As seen above, DFF and LERF struggle to distinguish between different semantic regions on the landmark, while our method accurately localizes the semantic concepts.

Figure 11: Localization for general architectural scenes. HaLo-NeRF can localize various semantic concepts in a variety of scenes in the wild, not limited to the religious domain of HolyScenes. Our localization, marked in green in the first image for each concept, enables focusing automatically on the text-specified region of interest, as shown by the following zoomed-in images in each row.

Figure 12: Limitation examples. Correct results are marked in green boxes and incorrect ones in red. Our method may fail to properly identify terms that never appear in our training data, such as the Immaculate Conception (left) and the Papal Coat of Arms (right).

Figure 14: Results before and after 3D localization. Segmentation results for the prompts windows and rose window are presented in the first and last pairs of rows, respectively. We show the results of CLIPSeg FT and HaLo-NeRF's projected localization in green, observing that HaLo-NeRF yields 3D-consistent results by fusing the 2D predictions of CLIPSeg FT, which exhibit view inconsistencies. We also show binary segmentation (th) obtained with threshold 0.5 in red, seeing that inconsistencies are prominent when using these methods for binary prediction.
3.5. LLM Ablations

As an additional test of our LLM-based pseudo-labeling procedure, we ablate the effect of the LLM model size and the prompt templates used. In particular, we test the following sizes of Flan-T5 [CHL*22]: XL (ours), Large, Base, and Small. These vary in size from 80M (Small) to 3B (XL) parameters. In addition, we test the following prompt templates: P1 (our original prompt, including the phrase ...what architectural feature of...), P2 (...what aspect of the building...), and P3 (...what thing in...).
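The three ablated templates can be sketched as simple string variants. Only the quoted key phrases (P1-P3) appear in the text, so the surrounding wording below is a hypothetical reconstruction for illustration.

```python
# Sketch of the ablated prompt templates. Only the quoted fragments
# ("what architectural feature of", "what aspect of the building",
# "what thing in") come from the paper; the rest is assumed.

TEMPLATES = {
    "P1": "Given this metadata, what architectural feature of the landmark is shown? {metadata}",
    "P2": "Given this metadata, what aspect of the building is shown? {metadata}",
    "P3": "Given this metadata, what thing in the image is shown? {metadata}",
}

def make_prompt(template_id: str, metadata: str) -> str:
    """Instantiate one of the ablated templates with the image's metadata."""
    return TEMPLATES[template_id].format(metadata=metadata)

p1 = make_prompt("P1", "Rajon ki Baoli; arched walkways")
# Each variant would be fed to the same Flan-T5 model to compare the
# quality of the resulting pseudo-labels.
```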

Figure 16: Providing text-based segmentation models with partially related text prompts. Above we provide the target prompt door to CLIPSeg (pretrained and fine-tuned) along with images that do not have visible doors. As seen above, the models instead segment more salient regions which bear some visual and semantic similarity to the provided text prompt (in this case, segmenting windows).

Figure 17: Illustration of baseline 2D segmentation methods. As seen above, the baseline methods (LSeg and ToB) struggle to attend to the relevant regions in the images, while CLIPSeg FT shows the best understanding of these concepts and their localizations, consistent with our quantitative evaluation.

Table 1: Quantitative Evaluation. We report mean average precision (mAP; averaged per category) and per-category average precision over the HolyScenes benchmark, comparing our results (highlighted in the table) to 2D segmentation and 3D localization techniques. Note that ToB uses a categorical model, and hence we only report performance over concepts it was trained on. Best results are highlighted in bold.

The publicly available implementations of DFF and LERF cannot operate on our in-the-wild problem setting, as it does not have
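The reported metric can be made concrete with a small sketch. This assumes the standard ranked-retrieval definition of average precision (mean of the precision values at each positive, in score order), with toy data; it is not the paper's evaluation code.

```python
import numpy as np

# Sketch of the evaluation metric: average precision (AP) over ranked
# pixel scores for one concept, with mAP taken as the mean AP per category.

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """AP = mean precision at the rank of each positive, sorted by score."""
    order = np.argsort(-scores)          # rank pixels by descending score
    labels = labels[order]
    cum_pos = np.cumsum(labels)          # positives retrieved up to rank k
    precision_at_k = cum_pos / (np.arange(len(labels)) + 1)
    return float(precision_at_k[labels == 1].mean())

# Toy example: 4 pixels, positives ranked 1st and 3rd.
scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 0, 1, 0])
ap = average_precision(scores, labels)   # (1/1 + 2/3) / 2 = 0.8333...
m_ap = float(np.mean([ap]))              # mAP averages AP over categories
```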
On WikiScenes, fine-tuning improves all metrics reported by Wu et al. (IoU, precision, recall), as shown in Table 5. As these met-