Image Captioning¶
This notebook demonstrates how to perform image captioning and feature extraction using the BLIP model and spaCy NLP processing. The ImageCaptioner class provides a convenient interface for:
- Generating captions for images from local files or URLs
- Extracting meaningful features (nouns) from captions
- Filtering features using predefined aerial vocabulary or custom lists
Installation¶
Uncomment the following line to install the required packages if needed.
# %pip install "segment-geospatial[samgeo3]"
Import Libraries¶
from samgeo.caption import ImageCaptioner, blip_analyze_image, show_image
Initialize the ImageCaptioner¶
Create an ImageCaptioner instance. You can customize the models used:
- blip_model_name: The BLIP model for caption generation (default: "Salesforce/blip-image-captioning-base")
- spacy_model_name: The spaCy model for NLP processing (default: "en_core_web_sm")
- device: The device to run inference on ("cuda", "mps", or "cpu"). Auto-detected if not specified.
Available BLIP models:
- Salesforce/blip-image-captioning-base (default, ~990MB)
- Salesforce/blip-image-captioning-large (larger, more accurate, ~1.9GB)
captioner = ImageCaptioner(
blip_model_name="Salesforce/blip-image-captioning-base",
spacy_model_name="en_core_web_sm",
)
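If auto-detection does not pick the device you want, you can pass one explicitly. Below is a minimal sketch of a helper that mirrors the device options listed above; it assumes PyTorch is the inference backend (the helper name is our own, not part of samgeo):

```python
def pick_device():
    """Return the best available inference device, falling back to CPU."""
    try:
        import torch  # optional; only needed for GPU/MPS detection

        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"


# Pass the result explicitly instead of relying on auto-detection:
# captioner = ImageCaptioner(device=pick_device())
print(pick_device())
```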
Example 1: Building Image¶
Let's analyze an aerial image of a building.
url1 = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/caption-building.webp"
show_image(url1)
Basic Analysis¶
Use the analyze() method to generate a caption and extract all noun features.
caption, features = captioner.analyze(url1)
print(f"Caption: {caption}")
print(f"Features: {features}")
# Filter features using the built-in aerial vocabulary
caption, features = captioner.analyze(url1, include_features="default")
print(f"Caption: {caption}")
print(f"Aerial Features: {features}")
Custom Feature Filtering¶
You can also provide a custom list of features to look for, or exclude specific features from the results.
# Look only for specific features
caption, features = captioner.analyze(
url1, include_features=["building", "parking_lot", "road", "car", "tree"]
)
print(f"Caption: {caption}")
print(f"Custom Features: {features}")
# Exclude certain features from results
caption, features = captioner.analyze(url1, exclude_features=["view", "image"])
print(f"Caption: {caption}")
print(f"Features (excluding 'view', 'image'): {features}")
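The include/exclude behavior shown above can be thought of as a simple set filter over the extracted nouns. The sketch below illustrates those semantics only; it is not the library's actual implementation, and the helper name and sample feature list are hypothetical:

```python
def filter_features(features, include_features=None, exclude_features=None):
    """Illustrative filter: keep only features in the include list (if given),
    then drop any listed in exclude_features."""
    result = list(features)
    if include_features is not None:
        allowed = set(include_features)
        result = [f for f in result if f in allowed]
    if exclude_features is not None:
        blocked = set(exclude_features)
        result = [f for f in result if f not in blocked]
    return result


# Hypothetical features extracted from a caption:
feats = ["building", "view", "parking_lot", "image", "tree"]
print(filter_features(feats, include_features=["building", "tree", "road"]))
print(filter_features(feats, exclude_features=["view", "image"]))
```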
Example 2: Traffic Sign Image¶
Let's analyze a different type of image: a traffic sign.
url2 = "https://huggingface.co/datasets/giswqs/geospatial/resolve/main/caption-traffic-sign.webp"
show_image(url2)
caption, features = captioner.analyze(url2)
print(f"Caption: {caption}")
print(f"Features: {features}")
# Using aerial vocabulary
caption, features = captioner.analyze(url2, include_features="default")
print(f"Caption: {caption}")
print(f"Aerial Features: {features}")
Using Individual Methods¶
The ImageCaptioner class also provides individual methods for more granular control:
- generate_caption(): Generate only the caption
- extract_features(): Extract features from an existing caption
# Generate caption only
caption = captioner.generate_caption(url1)
print(f"Caption: {caption}")
# Extract features from an existing caption
features = captioner.extract_features(caption)
print(f"All Features: {features}")
aerial_features = captioner.extract_features(caption, include_features="default")
print(f"Aerial Features: {aerial_features}")
Using the Convenience Function¶
For quick one-off analyses, you can use the blip_analyze_image() function directly without creating an ImageCaptioner instance. You can also specify custom models.
# Quick analysis with default models
caption, features = blip_analyze_image(url1)
print(f"Caption: {caption}")
print(f"Features: {features}")
# Using a larger BLIP model for potentially better captions
caption, features = blip_analyze_image(
url1,
include_features="default",
blip_model_name="Salesforce/blip-image-captioning-large",
)
print(f"Caption (large model): {caption}")
print(f"Aerial Features: {features}")