
caption module

Image captioning and feature extraction module using BLIP and spaCy.

This module provides functionality to generate captions for images using the BLIP model and extract relevant features from the captions using spaCy NLP.
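
A minimal quick-start sketch (assuming samgeo is installed and the default BLIP and spaCy models can be downloaded on first use; the image path is a placeholder):

    from samgeo.caption import ImageCaptioner

    # Build a captioner with the default models and auto-detected device,
    # then caption an image and extract matching aerial features.
    captioner = ImageCaptioner()
    caption, features = captioner.analyze("aerial.jpg", include_features="default")
    print(caption)   # e.g. "an aerial view of a parking lot with cars"
    print(features)  # e.g. ["car", "parking_lot"]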

ImageCaptioner

Image captioning and feature extraction using BLIP and spaCy.

This class provides functionality to generate captions for images using the BLIP model and extract relevant features from the captions using spaCy NLP processing.

Parameters:

blip_model_name (str, default: DEFAULT_BLIP_MODEL)
    Name or path of the BLIP model to use for captioning. Defaults to
    "Salesforce/blip-image-captioning-base".

spacy_model_name (str, default: DEFAULT_SPACY_MODEL)
    Name of the spaCy model to use for NLP processing. Defaults to
    "en_core_web_sm".

device (Optional[str], default: None)
    Device to run the BLIP model on. If None, automatically detects the
    best available device (CUDA, MPS, or CPU).

Attributes:

blip_model_name
    The name of the loaded BLIP model.

spacy_model_name
    The name of the loaded spaCy model.

device
    The device the model is running on.

processor
    The BLIP processor for image preprocessing.

blip_model
    The BLIP model for caption generation.

nlp
    The spaCy NLP pipeline.

Example

    >>> captioner = ImageCaptioner()
    >>> caption, features = captioner.analyze("path/to/image.jpg")
    >>> print(caption)
    "an aerial view of a parking lot with cars"
    >>> print(features)
    ["parking_lot", "car"]

Source code in samgeo/caption.py
class ImageCaptioner:
    """Image captioning and feature extraction using BLIP and spaCy.

    This class provides functionality to generate captions for images using
    the BLIP model and extract relevant features from the captions using
    spaCy NLP processing.

    Args:
        blip_model_name: Name or path of the BLIP model to use for captioning.
            Defaults to "Salesforce/blip-image-captioning-base".
        spacy_model_name: Name of the spaCy model to use for NLP processing.
            Defaults to "en_core_web_sm".
        device: Device to run the BLIP model on. If None, automatically
            detects the best available device (CUDA, MPS, or CPU).

    Attributes:
        blip_model_name: The name of the loaded BLIP model.
        spacy_model_name: The name of the loaded spaCy model.
        device: The device the model is running on.
        processor: The BLIP processor for image preprocessing.
        blip_model: The BLIP model for caption generation.
        nlp: The spaCy NLP pipeline.

    Example:
        >>> captioner = ImageCaptioner()
        >>> caption, features = captioner.analyze("path/to/image.jpg")
        >>> print(caption)
        "an aerial view of a parking lot with cars"
        >>> print(features)
        ["parking_lot", "car"]
    """

    def __init__(
        self,
        blip_model_name: str = DEFAULT_BLIP_MODEL,
        spacy_model_name: str = DEFAULT_SPACY_MODEL,
        device: Optional[str] = None,
    ):
        """Initialize the ImageCaptioner with specified models.

        Args:
            blip_model_name: Name or path of the BLIP model to use.
                Defaults to "Salesforce/blip-image-captioning-base".
            spacy_model_name: Name of the spaCy model to use.
                Defaults to "en_core_web_sm".
            device: Device to run the model on ('cuda', 'mps', 'cpu').
                If None, automatically detects the best available device.
        """
        self.blip_model_name = blip_model_name
        self.spacy_model_name = spacy_model_name
        self.device = device if device else get_device()

        # Load spaCy model
        ensure_spacy_model(spacy_model_name)
        self.nlp = spacy.load(spacy_model_name)

        # Load BLIP model
        self.processor = BlipProcessor.from_pretrained(
            blip_model_name,
            use_fast=True,
        )
        self.blip_model = BlipForConditionalGeneration.from_pretrained(
            blip_model_name
        ).to(self.device)
        self.blip_model.eval()

    @torch.inference_mode()
    def generate_caption(self, image_source: Union[str, Image.Image]) -> str:
        """Generate a caption for the given image.

        Args:
            image_source: The image to caption. Can be a local file path,
                an HTTP(S) URL, or a PIL Image object.

        Returns:
            Generated caption string describing the image content.

        Example:
            >>> captioner = ImageCaptioner()
            >>> caption = captioner.generate_caption("path/to/aerial.jpg")
            >>> print(caption)
            "an aerial view of a building with a parking lot"
        """
        img = load_image(image_source)
        inputs = self.processor(img, return_tensors="pt").to(self.device)
        out = self.blip_model.generate(**inputs)
        caption = self.processor.decode(out[0], skip_special_tokens=True)
        return caption

    def extract_features(
        self,
        caption: str,
        include_features: Optional[Union[str, List[str]]] = None,
        exclude_features: Optional[List[str]] = None,
    ) -> List[str]:
        """Extract features from a caption using NLP processing.

        Uses spaCy to parse the caption and extract relevant noun features
        based on the provided inclusion/exclusion criteria.

        Args:
            caption: The caption text to extract features from.
            include_features: Controls which features to extract:
                - None: Extract any noun (excluding large-scale terms
                  and custom excludes).
                - "default" or ["default"]: Use the aerial_features.json
                  vocabulary for matching.
                - List of strings: Custom allowed features (with or without
                  underscores).
            exclude_features: List of noun lemmas to exclude in addition
                to the built-in large-scale terms.

        Returns:
            Sorted list of extracted feature names (canonical keys or
            noun lemmas).

        Example:
            >>> captioner = ImageCaptioner()
            >>> features = captioner.extract_features(
            ...     "a parking lot with several cars",
            ...     include_features=["default"]
            ... )
            >>> print(features)
            ["car", "parking_lot"]
        """
        doc = self.nlp(caption)
        detected = set()

        # ----------------------- exclusions -----------------------
        active_exclude = set(LARGE_SCALE)
        if exclude_features:
            active_exclude.update(
                ex.lower().replace("_", " ") for ex in exclude_features
            )

        # Normalize include_features semantics
        use_default_vocab = False
        user_include_list: Optional[List[str]] = None

        if include_features is None:
            use_default_vocab = False
        else:
            # allow include_features="default" or ["default"]
            if isinstance(include_features, str):
                include_features = [include_features]

            if any(f.lower() == "default" for f in include_features):
                use_default_vocab = True
            else:
                user_include_list = include_features

        # Build maps for matching
        if use_default_vocab:
            single_map = SINGLE_WORD_FEATURES
            multi_map = MULTIWORD_FEATURES
        elif user_include_list:
            norm = [f.lower().replace("_", " ") for f in user_include_list]
            single_map = {
                phrase: phrase.replace(" ", "_") for phrase in norm if " " not in phrase
            }
            multi_map = {
                phrase: phrase.replace(" ", "_") for phrase in norm if " " in phrase
            }
        else:
            single_map = None
            multi_map = None

        # ----------------------- multi-word (noun chunks) -----------------------
        if use_default_vocab or user_include_list:
            for chunk in doc.noun_chunks:
                chunk_text = chunk.text.lower()
                chunk_lemma = " ".join(tok.lemma_.lower() for tok in chunk)

                for candidate in {chunk_text, chunk_lemma}:
                    if multi_map and candidate in multi_map:
                        detected.add(multi_map[candidate])

        # ----------------------- single-word nouns -----------------------
        original_include = include_features
        for token in doc:
            if token.pos_ != "NOUN":
                continue

            lemma = token.lemma_.lower()
            if lemma in active_exclude:
                continue

            # Case: no include list → accept any noun lemma
            if original_include is None:
                detected.add(lemma)
            else:
                if single_map and lemma in single_map:
                    detected.add(single_map[lemma])

        # ----------------------- fallback if using include list -----------------------
        if (use_default_vocab or user_include_list) and not detected:
            fallback = {
                token.lemma_.lower()
                for token in doc
                if token.pos_ == "NOUN" and token.lemma_.lower() not in active_exclude
            }
            return sorted(fallback)

        return sorted(detected)

    @torch.inference_mode()
    def analyze(
        self,
        image_source: Union[str, Image.Image],
        include_features: Optional[Union[str, List[str]]] = None,
        exclude_features: Optional[List[str]] = None,
    ) -> Tuple[str, List[str]]:
        """Analyze an image by generating a caption and extracting features.

        This is the main entry point that combines caption generation and
        feature extraction into a single call.

        Args:
            image_source: The image to analyze. Can be a local file path,
                an HTTP(S) URL, or a PIL Image object.
            include_features: Controls which features to extract:
                - None: Extract any noun (excluding large-scale terms
                  and custom excludes).
                - "default" or ["default"]: Use the aerial_features.json
                  vocabulary for matching.
                - List of strings: Custom allowed features (with or without
                  underscores).
            exclude_features: List of noun lemmas to exclude in addition
                to the built-in large-scale terms.

        Returns:
            A tuple containing:
                - caption: The BLIP-generated caption string.
                - features: Sorted list of extracted feature names.

        Example:
            >>> captioner = ImageCaptioner()
            >>> caption, features = captioner.analyze(
            ...     "https://example.com/aerial.jpg",
            ...     include_features=["default"],
            ...     exclude_features=["building"]
            ... )
            >>> print(caption)
            "an aerial view of a residential area"
            >>> print(features)
            ["house", "road", "tree"]
        """
        caption = self.generate_caption(image_source)
        features = self.extract_features(
            caption,
            include_features=include_features,
            exclude_features=exclude_features,
        )
        return caption, features

__init__(blip_model_name=DEFAULT_BLIP_MODEL, spacy_model_name=DEFAULT_SPACY_MODEL, device=None)

Initialize the ImageCaptioner with specified models.

Parameters:

blip_model_name (str, default: DEFAULT_BLIP_MODEL)
    Name or path of the BLIP model to use. Defaults to
    "Salesforce/blip-image-captioning-base".

spacy_model_name (str, default: DEFAULT_SPACY_MODEL)
    Name of the spaCy model to use. Defaults to "en_core_web_sm".

device (Optional[str], default: None)
    Device to run the model on ('cuda', 'mps', 'cpu'). If None,
    automatically detects the best available device.

Source code in samgeo/caption.py
def __init__(
    self,
    blip_model_name: str = DEFAULT_BLIP_MODEL,
    spacy_model_name: str = DEFAULT_SPACY_MODEL,
    device: Optional[str] = None,
):
    """Initialize the ImageCaptioner with specified models.

    Args:
        blip_model_name: Name or path of the BLIP model to use.
            Defaults to "Salesforce/blip-image-captioning-base".
        spacy_model_name: Name of the spaCy model to use.
            Defaults to "en_core_web_sm".
        device: Device to run the model on ('cuda', 'mps', 'cpu').
            If None, automatically detects the best available device.
    """
    self.blip_model_name = blip_model_name
    self.spacy_model_name = spacy_model_name
    self.device = device if device else get_device()

    # Load spaCy model
    ensure_spacy_model(spacy_model_name)
    self.nlp = spacy.load(spacy_model_name)

    # Load BLIP model
    self.processor = BlipProcessor.from_pretrained(
        blip_model_name,
        use_fast=True,
    )
    self.blip_model = BlipForConditionalGeneration.from_pretrained(
        blip_model_name
    ).to(self.device)
    self.blip_model.eval()
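
A sketch of pinning the models and device explicitly instead of relying on auto-detection (the model names shown are the documented defaults; "cpu" is one of the documented device strings):

    from samgeo.caption import ImageCaptioner

    # Force CPU inference and name both models explicitly.
    captioner = ImageCaptioner(
        blip_model_name="Salesforce/blip-image-captioning-base",
        spacy_model_name="en_core_web_sm",
        device="cpu",
    )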

analyze(image_source, include_features=None, exclude_features=None)

Analyze an image by generating a caption and extracting features.

This is the main entry point that combines caption generation and feature extraction into a single call.

Parameters:

image_source (Union[str, Image.Image], required)
    The image to analyze. Can be a local file path, an HTTP(S) URL, or a
    PIL Image object.

include_features (Optional[Union[str, List[str]]], default: None)
    Controls which features to extract:
    - None: Extract any noun (excluding large-scale terms and custom excludes).
    - "default" or ["default"]: Use the aerial_features.json vocabulary for matching.
    - List of strings: Custom allowed features (with or without underscores).

exclude_features (Optional[List[str]], default: None)
    List of noun lemmas to exclude in addition to the built-in large-scale terms.

Returns:

Tuple[str, List[str]]
    A tuple containing:
    - caption: The BLIP-generated caption string.
    - features: Sorted list of extracted feature names.

Example

    >>> captioner = ImageCaptioner()
    >>> caption, features = captioner.analyze(
    ...     "https://example.com/aerial.jpg",
    ...     include_features=["default"],
    ...     exclude_features=["building"]
    ... )
    >>> print(caption)
    "an aerial view of a residential area"
    >>> print(features)
    ["house", "road", "tree"]

Source code in samgeo/caption.py
@torch.inference_mode()
def analyze(
    self,
    image_source: Union[str, Image.Image],
    include_features: Optional[Union[str, List[str]]] = None,
    exclude_features: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    """Analyze an image by generating a caption and extracting features.

    This is the main entry point that combines caption generation and
    feature extraction into a single call.

    Args:
        image_source: The image to analyze. Can be a local file path,
            an HTTP(S) URL, or a PIL Image object.
        include_features: Controls which features to extract:
            - None: Extract any noun (excluding large-scale terms
              and custom excludes).
            - "default" or ["default"]: Use the aerial_features.json
              vocabulary for matching.
            - List of strings: Custom allowed features (with or without
              underscores).
        exclude_features: List of noun lemmas to exclude in addition
            to the built-in large-scale terms.

    Returns:
        A tuple containing:
            - caption: The BLIP-generated caption string.
            - features: Sorted list of extracted feature names.

    Example:
        >>> captioner = ImageCaptioner()
        >>> caption, features = captioner.analyze(
        ...     "https://example.com/aerial.jpg",
        ...     include_features=["default"],
        ...     exclude_features=["building"]
        ... )
        >>> print(caption)
        "an aerial view of a residential area"
        >>> print(features)
        ["house", "road", "tree"]
    """
    caption = self.generate_caption(image_source)
    features = self.extract_features(
        caption,
        include_features=include_features,
        exclude_features=exclude_features,
    )
    return caption, features
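
Beyond the docstring example, a sketch of the custom-list mode of include_features (the specific feature names are illustrative; names may be given with or without underscores):

    captioner = ImageCaptioner()
    caption, features = captioner.analyze(
        "path/to/aerial.jpg",
        include_features=["parking lot", "car"],   # custom allowed features
        exclude_features=["building"],             # excluded in addition to the built-in terms
    )
    # Multi-word matches come back with underscores, e.g. "parking_lot".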

extract_features(caption, include_features=None, exclude_features=None)

Extract features from a caption using NLP processing.

Uses spaCy to parse the caption and extract relevant noun features based on the provided inclusion/exclusion criteria.

Parameters:

caption (str, required)
    The caption text to extract features from.

include_features (Optional[Union[str, List[str]]], default: None)
    Controls which features to extract:
    - None: Extract any noun (excluding large-scale terms and custom excludes).
    - "default" or ["default"]: Use the aerial_features.json vocabulary for matching.
    - List of strings: Custom allowed features (with or without underscores).

exclude_features (Optional[List[str]], default: None)
    List of noun lemmas to exclude in addition to the built-in large-scale terms.

Returns:

List[str]
    Sorted list of extracted feature names (canonical keys or noun lemmas).

Example

    >>> captioner = ImageCaptioner()
    >>> features = captioner.extract_features(
    ...     "a parking lot with several cars",
    ...     include_features=["default"]
    ... )
    >>> print(features)
    ["car", "parking_lot"]

Source code in samgeo/caption.py
def extract_features(
    self,
    caption: str,
    include_features: Optional[Union[str, List[str]]] = None,
    exclude_features: Optional[List[str]] = None,
) -> List[str]:
    """Extract features from a caption using NLP processing.

    Uses spaCy to parse the caption and extract relevant noun features
    based on the provided inclusion/exclusion criteria.

    Args:
        caption: The caption text to extract features from.
        include_features: Controls which features to extract:
            - None: Extract any noun (excluding large-scale terms
              and custom excludes).
            - "default" or ["default"]: Use the aerial_features.json
              vocabulary for matching.
            - List of strings: Custom allowed features (with or without
              underscores).
        exclude_features: List of noun lemmas to exclude in addition
            to the built-in large-scale terms.

    Returns:
        Sorted list of extracted feature names (canonical keys or
        noun lemmas).

    Example:
        >>> captioner = ImageCaptioner()
        >>> features = captioner.extract_features(
        ...     "a parking lot with several cars",
        ...     include_features=["default"]
        ... )
        >>> print(features)
        ["car", "parking_lot"]
    """
    doc = self.nlp(caption)
    detected = set()

    # ----------------------- exclusions -----------------------
    active_exclude = set(LARGE_SCALE)
    if exclude_features:
        active_exclude.update(
            ex.lower().replace("_", " ") for ex in exclude_features
        )

    # Normalize include_features semantics
    use_default_vocab = False
    user_include_list: Optional[List[str]] = None

    if include_features is None:
        use_default_vocab = False
    else:
        # allow include_features="default" or ["default"]
        if isinstance(include_features, str):
            include_features = [include_features]

        if any(f.lower() == "default" for f in include_features):
            use_default_vocab = True
        else:
            user_include_list = include_features

    # Build maps for matching
    if use_default_vocab:
        single_map = SINGLE_WORD_FEATURES
        multi_map = MULTIWORD_FEATURES
    elif user_include_list:
        norm = [f.lower().replace("_", " ") for f in user_include_list]
        single_map = {
            phrase: phrase.replace(" ", "_") for phrase in norm if " " not in phrase
        }
        multi_map = {
            phrase: phrase.replace(" ", "_") for phrase in norm if " " in phrase
        }
    else:
        single_map = None
        multi_map = None

    # ----------------------- multi-word (noun chunks) -----------------------
    if use_default_vocab or user_include_list:
        for chunk in doc.noun_chunks:
            chunk_text = chunk.text.lower()
            chunk_lemma = " ".join(tok.lemma_.lower() for tok in chunk)

            for candidate in {chunk_text, chunk_lemma}:
                if multi_map and candidate in multi_map:
                    detected.add(multi_map[candidate])

    # ----------------------- single-word nouns -----------------------
    original_include = include_features
    for token in doc:
        if token.pos_ != "NOUN":
            continue

        lemma = token.lemma_.lower()
        if lemma in active_exclude:
            continue

        # Case: no include list → accept any noun lemma
        if original_include is None:
            detected.add(lemma)
        else:
            if single_map and lemma in single_map:
                detected.add(single_map[lemma])

    # ----------------------- fallback if using include list -----------------------
    if (use_default_vocab or user_include_list) and not detected:
        fallback = {
            token.lemma_.lower()
            for token in doc
            if token.pos_ == "NOUN" and token.lemma_.lower() not in active_exclude
        }
        return sorted(fallback)

    return sorted(detected)
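
A sketch of the include_features=None mode, which keeps every noun lemma except the built-in large-scale terms and any custom excludes (the exact output depends on the spaCy tagger, so the result shown is indicative only):

    captioner = ImageCaptioner()
    features = captioner.extract_features(
        "an aerial view of a parking lot with cars and trees",
        exclude_features=["view"],   # drop this noun in addition to the built-ins
    )
    # With no include list, the remaining noun lemmas are returned directly,
    # e.g. something like ["car", "lot", "parking", "tree"].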

generate_caption(image_source)

Generate a caption for the given image.

Parameters:

image_source (Union[str, Image.Image], required)
    The image to caption. Can be a local file path, an HTTP(S) URL, or a
    PIL Image object.

Returns:

str
    Generated caption string describing the image content.

Example

    >>> captioner = ImageCaptioner()
    >>> caption = captioner.generate_caption("path/to/aerial.jpg")
    >>> print(caption)
    "an aerial view of a building with a parking lot"

Source code in samgeo/caption.py
@torch.inference_mode()
def generate_caption(self, image_source: Union[str, Image.Image]) -> str:
    """Generate a caption for the given image.

    Args:
        image_source: The image to caption. Can be a local file path,
            an HTTP(S) URL, or a PIL Image object.

    Returns:
        Generated caption string describing the image content.

    Example:
        >>> captioner = ImageCaptioner()
        >>> caption = captioner.generate_caption("path/to/aerial.jpg")
        >>> print(caption)
        "an aerial view of a building with a parking lot"
    """
    img = load_image(image_source)
    inputs = self.processor(img, return_tensors="pt").to(self.device)
    out = self.blip_model.generate(**inputs)
    caption = self.processor.decode(out[0], skip_special_tokens=True)
    return caption
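
Because image_source may also be a PIL Image object, a sketch of captioning an image that is already in memory (the file path is a placeholder):

    from PIL import Image
    from samgeo.caption import ImageCaptioner

    captioner = ImageCaptioner()
    img = Image.open("path/to/aerial.jpg")      # any PIL image works
    caption = captioner.generate_caption(img)   # no temporary file needed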

blip_analyze_image(image_source, include_features=None, exclude_features=None, blip_model_name=None, spacy_model_name=None)

Analyze an image by generating a caption and extracting features.

This is a convenience function that provides the full pipeline for image analysis. For repeated use or custom model configurations, consider creating an ImageCaptioner instance directly.

Parameters:

image_source (Union[str, Image.Image], required)
    The image to analyze. Can be a local file path, an HTTP(S) URL, or a
    PIL Image object.

include_features (Optional[Union[str, List[str]]], default: None)
    Controls which features to extract:
    - None: Extract any noun (excluding large-scale terms and custom excludes).
    - "default" or ["default"]: Use the aerial_features.json vocabulary for matching.
    - List of strings: Custom allowed features (with or without underscores).

exclude_features (Optional[List[str]], default: None)
    List of noun lemmas to exclude in addition to the built-in large-scale terms.

blip_model_name (Optional[str], default: None)
    Name or path of the BLIP model to use. If None, uses the default
    "Salesforce/blip-image-captioning-base".

spacy_model_name (Optional[str], default: None)
    Name of the spaCy model to use. If None, uses the default "en_core_web_sm".

Returns:

Tuple[str, List[str]]
    A tuple containing:
    - caption: The BLIP-generated caption string.
    - features: Sorted list of extracted feature names.

Example

    >>> caption, features = blip_analyze_image(
    ...     "path/to/image.jpg",
    ...     include_features=["default"],
    ...     blip_model_name="Salesforce/blip-image-captioning-large"
    ... )
    >>> print(caption)
    "an aerial view of a parking lot with cars"
    >>> print(features)
    ["car", "parking_lot"]

Source code in samgeo/caption.py
@torch.inference_mode()
def blip_analyze_image(
    image_source: Union[str, Image.Image],
    include_features: Optional[Union[str, List[str]]] = None,
    exclude_features: Optional[List[str]] = None,
    blip_model_name: Optional[str] = None,
    spacy_model_name: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Analyze an image by generating a caption and extracting features.

    This is a convenience function that provides the full pipeline for
    image analysis. For repeated use or custom model configurations,
    consider creating an ImageCaptioner instance directly.

    Args:
        image_source: The image to analyze. Can be a local file path,
            an HTTP(S) URL, or a PIL Image object.
        include_features: Controls which features to extract:
            - None: Extract any noun (excluding large-scale terms
              and custom excludes).
            - "default" or ["default"]: Use the aerial_features.json
              vocabulary for matching.
            - List of strings: Custom allowed features (with or without
              underscores).
        exclude_features: List of noun lemmas to exclude in addition
            to the built-in large-scale terms.
        blip_model_name: Name or path of the BLIP model to use.
            If None, uses the default "Salesforce/blip-image-captioning-base".
        spacy_model_name: Name of the spaCy model to use.
            If None, uses the default "en_core_web_sm".

    Returns:
        A tuple containing:
            - caption: The BLIP-generated caption string.
            - features: Sorted list of extracted feature names.

    Example:
        >>> caption, features = blip_analyze_image(
        ...     "path/to/image.jpg",
        ...     include_features=["default"],
        ...     blip_model_name="Salesforce/blip-image-captioning-large"
        ... )
        >>> print(caption)
        "an aerial view of a parking lot with cars"
        >>> print(features)
        ["car", "parking_lot"]
    """
    # Use custom models if specified, otherwise use default captioner
    if blip_model_name is not None or spacy_model_name is not None:
        captioner = ImageCaptioner(
            blip_model_name=blip_model_name or DEFAULT_BLIP_MODEL,
            spacy_model_name=spacy_model_name or DEFAULT_SPACY_MODEL,
        )
    else:
        captioner = _get_default_captioner()

    return captioner.analyze(
        image_source,
        include_features=include_features,
        exclude_features=exclude_features,
    )

ensure_spacy_model(model_name=DEFAULT_SPACY_MODEL)

Download spaCy model only if it's missing.

Parameters:

model_name (str, default: DEFAULT_SPACY_MODEL)
    Name of the spaCy model to ensure is installed. Defaults to
    "en_core_web_sm".

Source code in samgeo/caption.py
def ensure_spacy_model(model_name: str = DEFAULT_SPACY_MODEL) -> None:
    """Download spaCy model only if it's missing.

    Args:
        model_name: Name of the spaCy model to ensure is installed.
            Defaults to "en_core_web_sm".
    """
    try:
        importlib.import_module(model_name)
    except ImportError:
        print(f"↓ Model '{model_name}' not found. Installing...")
        try:
            download(model_name)
            print(f"✓ Model '{model_name}' installed.")
        except Exception as e:
            raise RuntimeError(
                f"Failed to download spaCy model '{model_name}'. "
                f"You may need to install it manually with: python -m spacy download {model_name}. "
                f"Error: {e}"
            ) from e
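
A sketch of calling it ahead of time so that constructing an ImageCaptioner (or calling spacy.load yourself) does not pause to download the model:

    import spacy
    from samgeo.caption import ensure_spacy_model

    ensure_spacy_model("en_core_web_sm")   # no-op if the model is already installed
    nlp = spacy.load("en_core_web_sm")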

extract_features_from_caption(caption, include_features=None, exclude_features=None)

Extract features from a caption using NLP processing.

This is a convenience function that uses the default ImageCaptioner instance. For more control over models, create an ImageCaptioner instance directly.

Parameters:

caption (str, required)
    The caption text to extract features from.

include_features (Optional[Union[str, List[str]]], default: None)
    Controls which features to extract:
    - None: Extract any noun (excluding large-scale terms and custom excludes).
    - "default" or ["default"]: Use the aerial_features.json vocabulary for matching.
    - List of strings: Custom allowed features (with or without underscores).

exclude_features (Optional[List[str]], default: None)
    List of noun lemmas to exclude in addition to the built-in large-scale terms.

Returns:

List[str]
    Sorted list of extracted feature names (canonical keys or noun lemmas).

Example

    >>> features = extract_features_from_caption(
    ...     "a parking lot with several cars",
    ...     include_features=["default"]
    ... )
    >>> print(features)
    ["car", "parking_lot"]

Source code in samgeo/caption.py
def extract_features_from_caption(
    caption: str,
    include_features: Optional[Union[str, List[str]]] = None,
    exclude_features: Optional[List[str]] = None,
) -> List[str]:
    """Extract features from a caption using NLP processing.

    This is a convenience function that uses the default ImageCaptioner
    instance. For more control over models, create an ImageCaptioner
    instance directly.

    Args:
        caption: The caption text to extract features from.
        include_features: Controls which features to extract:
            - None: Extract any noun (excluding large-scale terms
              and custom excludes).
            - "default" or ["default"]: Use the aerial_features.json
              vocabulary for matching.
            - List of strings: Custom allowed features (with or without
              underscores).
        exclude_features: List of noun lemmas to exclude in addition
            to the built-in large-scale terms.

    Returns:
        Sorted list of extracted feature names (canonical keys or
        noun lemmas).

    Example:
        >>> features = extract_features_from_caption(
        ...     "a parking lot with several cars",
        ...     include_features=["default"]
        ... )
        >>> print(features)
        ["car", "parking_lot"]
    """
    captioner = _get_default_captioner()
    return captioner.extract_features(
        caption,
        include_features=include_features,
        exclude_features=exclude_features,
    )

load_aerial_feature_vocab(url=AERIAL_FEATURES_URL)

Load the nested aerial_features.json and flatten to a list of feature keys.

Parameters:

url (str, default: AERIAL_FEATURES_URL)
    URL to the aerial features JSON file. Defaults to the Hugging Face
    hosted version.

Returns:

List[str]
    Sorted list of feature keys extracted from the JSON file.

Source code in samgeo/caption.py
def load_aerial_feature_vocab(url: str = AERIAL_FEATURES_URL) -> List[str]:
    """Load the nested aerial_features.json and flatten to a list of feature keys.

    Args:
        url: URL to the aerial features JSON file. Defaults to the
            Hugging Face hosted version.

    Returns:
        Sorted list of feature keys extracted from the JSON file.
    """
    resp = requests.get(url)
    resp.raise_for_status()
    data = resp.json()

    features = set()

    # data is two-level nested: {category: list OR {subcat: list}}
    for _, val in data.items():
        if isinstance(val, list):
            features.update(val)
        elif isinstance(val, dict):
            for _, sublist in val.items():
                if isinstance(sublist, list):
                    features.update(sublist)

    return sorted(features)
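
A sketch of inspecting the default vocabulary (requires network access to the hosted JSON; the contents are whatever the file currently provides):

    from samgeo.caption import load_aerial_feature_vocab

    vocab = load_aerial_feature_vocab()
    print(len(vocab))    # number of flattened feature keys
    print(vocab[:10])    # first few keys, alphabetically sorted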

load_image(source)

Load a PIL image from various sources.

Supports loading from local file paths, HTTP(S) URLs, or returns the image directly if it's already a PIL.Image.Image.

Parameters:

source (Union[str, Image.Image], required)
    The image source. Can be a local file path (str), an HTTP(S) URL (str),
    or an existing PIL Image object.

Returns:

Image.Image
    PIL Image object converted to RGB mode.

Raises:

TypeError
    If the source type is not supported.
requests.HTTPError
    If downloading from URL fails.
FileNotFoundError
    If local file path doesn't exist.

Source code in samgeo/caption.py
def load_image(source: Union[str, Image.Image]) -> Image.Image:
    """Load a PIL image from various sources.

    Supports loading from local file paths, HTTP(S) URLs, or returns
    the image directly if it's already a PIL.Image.Image.

    Args:
        source: The image source. Can be a local file path (str),
            an HTTP(S) URL (str), or an existing PIL Image object.

    Returns:
        PIL Image object converted to RGB mode.

    Raises:
        TypeError: If the source type is not supported.
        requests.HTTPError: If downloading from URL fails.
        FileNotFoundError: If local file path doesn't exist.
    """
    if isinstance(source, Image.Image):
        return source.convert("RGB")

    if isinstance(source, str):
        if source.startswith("http://") or source.startswith("https://"):
            resp = requests.get(source, stream=True)
            resp.raise_for_status()
            return Image.open(BytesIO(resp.content)).convert("RGB")
        else:
            return Image.open(source).convert("RGB")

    raise TypeError(f"Unsupported image source type: {type(source)}")
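
A sketch of the three supported source types (the URL and file path are placeholders):

    from PIL import Image
    from samgeo.caption import load_image

    img1 = load_image("path/to/local.jpg")               # local file path
    img2 = load_image("https://example.com/aerial.jpg")  # HTTP(S) URL
    img3 = load_image(Image.new("RGB", (64, 64)))        # already a PIL image
    # Each call returns an RGB PIL.Image.Image.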