

Please note that Squint Vision Studio defines a “Datasource” as a collection of data used to train and evaluate a model (e.g. Breast Cancer) and a “Dataset” as a datasource descriptor created in Squint Vision Studio to manage the datasource.
Breast Cancer Dataset Classification
This guide walks you through a complete workflow in Squint Vision Studio using the Breast Cancer datasource, a collection of histopathology image patches commonly used to train machine learning and computer vision algorithms. You will be guided through the process of setting up a project, creating a Breast Cancer dataset from the Breast Cancer datasource, importing a model, and leveraging Squint Vision Studio's tools to thoroughly evaluate, interpret, and monitor your model's performance.
The following sections will walk you through how to use Squint Vision Studio, so it is expected that you have the Studio open and have obtained a valid license.

1. Project Setup: Breast Cancer

Creating a Project

  • Navigate to the Project tab.
  • Click Load to open an existing project or click the + icon next to Manage
    Projects to create a new one.
  • Name your project (e.g., Breast Cancer Classification) and optionally add a description.
  • Click Save.

2. Dataset: Breast Cancer

Preparing the Datasource

Download the Breast Histopathology Images datasource from the Kaggle website.
The original dataset comprises 162 whole-slide images of breast cancer (BCa) specimens, scanned at 40x magnification. The most common subtype present is invasive ductal carcinoma (IDC). From these slides, a total of 277,524 image patches of size 50×50 pixels were extracted:
  • 198,738 patches labeled as IDC-negative
  • 78,786 patches labeled as IDC-positive

Each patch is named using the format: u_xX_yY_classC.png.

For example: 10253_idx5_x1351_y1101_class0.png, where:
  • u = Patient ID (10253_idx5)
  • X = x-coordinate of the cropped patch
  • Y = y-coordinate of the cropped patch
  • C = Class label (0 for non-IDC, 1 for IDC)
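
Since all metadata is encoded in the file name, it can be parsed directly. Below is a minimal sketch of how the patient ID, coordinates, and class label might be extracted; the regular expression and function name are our own, not part of the datasource:

import re

def parse_patch_name(filename):
    """Extract patient ID, patch coordinates, and class label from a patch file name."""
    # Expected pattern: <patientID>_x<X>_y<Y>_class<C>.png
    match = re.match(r"(?P<patient>.+)_x(?P<x>\d+)_y(?P<y>\d+)_class(?P<label>[01])\.png$", filename)
    if match is None:
        raise ValueError(f"Unexpected file name format: {filename}")
    return {
        'patient_id': match.group('patient'),
        'x': int(match.group('x')),
        'y': int(match.group('y')),
        'label': int(match.group('label')),
    }

# Example usage
print(parse_patch_name('10253_idx5_x1351_y1101_class0.png'))
# {'patient_id': '10253_idx5', 'x': 1351, 'y': 1101, 'label': 0}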

To balance the dataset, a subset was selected for training and testing:
  • Training set: 126,056 patches
    • Class 0: 63,155
    • Class 1: 62,901
  • Testing set: 15,758 patches
    • Class 0: 7,811
    • Class 1: 7,947
The Breast Cancer datasource should be placed in the SquintVolume/DataSources directory and structured as follows (an example layout is shown after this list):
  • A top-level folder named breastCancer.
  • Inside, two subfolders: train and test.
  • Each of these contains two subfolders: 1, corresponding to IDC, and 0, corresponding to non-IDC, each containing the respective data files.
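
For reference, the resulting layout would look like the following (the file shown under train/0 is the example patch name from above; the remaining files are omitted):

SquintVolume/DataSources/breastCancer/
├── train/
│   ├── 0/
│   │   ├── 10253_idx5_x1351_y1101_class0.png
│   │   └── ...
│   └── 1/
│       └── ...
└── test/
    ├── 0/
    │   └── ...
    └── 1/
        └── ...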

Example code to create a Breast Cancer datasource

The Breast Histopathology Images datasource is originally provided as a large collection of image files, where each file name encodes metadata such as patient ID, patch coordinates, and class label. Each image is a 50×50 pixel patch extracted from whole slide images, and the class label indicates whether the patch contains invasive ductal carcinoma (IDC) or not. This format is useful for direct image-level analysis, but it does not match the folder-based structure required by Squint Vision Studio. Squint Vision Studio expects the data to be organized into directories by class label, with each image saved as an individual file within its respective class folder. Therefore, we need to convert the datasource from its flat file structure into a structured directory format.
We provide the following code as an example of how the datasource can be converted from its original flat layout into a directory structure where the labels can be inferred from the folder names. You can use this code to create your own Breast Cancer datasource, or adapt it to create a datasource out of your own real-world data.
import os
import shutil

# Define paths to training and testing name files
train_file_path = 'training_names.txt'
test_file_path = 'testing_names.txt'

# Load image names for training (a set gives fast membership checks)
with open(train_file_path, 'r') as file:
    train_names = {line.split("',")[0].split('/')[-1] for line in file}

# Load image names for testing
with open(test_file_path, 'r') as file:
    test_names = {line.split("',")[0].split('/')[-1] for line in file}

# Define source and target dataset directories
source_root = 'source_dataset'
target_root = 'SquintVolume/DataSources/breastCancer'

# Iterate through each class folder (0 and 1) in the source dataset
for class_folder in os.listdir(source_root):
    source_class_path = os.path.join(source_root, class_folder)

    # Ensure the target train/test class directories exist
    # (breastCancer/train/<class> and breastCancer/test/<class>)
    os.makedirs(os.path.join(target_root, 'train', class_folder), exist_ok=True)
    os.makedirs(os.path.join(target_root, 'test', class_folder), exist_ok=True)

    # Copy each image into its train or test split based on the name files
    for image_name in os.listdir(source_class_path):
        source_image_path = os.path.join(source_class_path, image_name)

        if image_name in train_names:
            destination_path = os.path.join(target_root, 'train', class_folder, image_name)
            shutil.copy(source_image_path, destination_path)

        elif image_name in test_names:
            destination_path = os.path.join(target_root, 'test', class_folder, image_name)
            shutil.copy(source_image_path, destination_path)

Creating the Dataset

  • In the Studio, navigate to Data ⇀ Load.
  • Click the + icon to create a new dataset.
  • Select the Breast Cancer datasource.
  • Apply data preprocessing as illustrated in the image below.
We use a target image resolution of 64×64 pixels to match the expected input size of our model. The original breast cancer images are 50×50 pixel RGB patches, so we resize them to 64×64 to ensure compatibility with models that require larger input dimensions and to potentially improve feature extraction. Since the dataset consists of RGB images, each image has three channels, so we set the color mode to RGB.
To prepare the data for training, we normalize the pixel values to the range [0, 1] by dividing each value by 255. This normalization improved model performance and training stability in our experiments, so we selected the rescale mode of 0 to 1.

Note: once you are working with your own datasource and model, the settings you specify when creating the dataset should match your own model and data expectations.

The mini batch size should be set to whatever batch size you use while training your model on the datasource. In this case we left the default setting of 32 as that’s the same batch size we use while training our model (see below).
  • Save the dataset.

Exploring the Breast Histopathology Images Dataset

Navigate to Data ⇀ View – Browse the Breast Histopathology Images Using the Image Carousel

The View feature provides an interactive image carousel that lets you visually inspect the images.
  • Scroll through images by category to verify data quality and labeling.
  • Confirm that images are correctly grouped by class.
  • Identify any anomalies, such as misclassified or low-quality images.

This tool is especially useful for validating the dataset before training or evaluating models.

Navigate to Data ⇀ Metrics – View per-class sample distribution and dataset statistics

The Metrics feature offers a statistical overview of the breast cancer dataset.
  • View the number of samples per class in both training and test sets.
  • Detect any class imbalances that could affect model performance and bias.
  • Review the overall dataset size and class diversity.

These insights are essential for diagnosing potential biases and ensuring the dataset is well-prepared for training and evaluation.

Navigate to Data ⇀ Histogram – Analyze per-class pixel intensity distributions

The Histogram feature visualizes the pixel intensity distributions for each class in the breast cancer dataset.
  • Analyze brightness and contrast patterns across different categories.
  • Detect inconsistencies or anomalies in image quality.
  • Compare visual characteristics between classes to assess uniformity.

This tool helps ensure that the dataset is visually consistent and suitable for model training.

3. Model: Breast Cancer Classifier

Uploading the Trained Model

  • Navigate to the Model ⇀ Load page.
  • Load a previously added model, or click on the + icon to add a vision model trained on Breast Cancer to the Studio.

Note:
The model must be trained before uploading. Squint Vision Studio is not intended for training models; it is designed for deploying and analyzing already-trained models. Ensure your model is trained externally and then uploaded to the studio.
Below is an example of how you can train a Convolutional Neural Network (CNN) model on the Breast Cancer datasource before uploading it to Squint Vision Studio:

Structure of the CNN Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, BatchNormalization, MaxPooling2D, Dropout
from tensorflow.keras.initializers import HeUniform

# Initialize the model
model = Sequential()

# First convolutional block
model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer=HeUniform(), padding='same', input_shape=(64, 64, 3)))
model.add(BatchNormalization())
model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer=HeUniform(), padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.3))

# Second convolutional block
model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=HeUniform(), padding='same'))
model.add(BatchNormalization())
model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer=HeUniform(), padding='same'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))

# Third convolutional block
model.add(Conv2D(128, (3, 3), activation='relu', kernel_initializer=HeUniform(), padding='same'))

# Fully connected layers
model.add(Flatten())
model.add(Dense(1024, activation='relu', kernel_initializer=HeUniform()))
model.add(BatchNormalization())
model.add(Dense(512, activation='relu', kernel_initializer=HeUniform()))
model.add(BatchNormalization())
model.add(Dropout(0.3))

# Output layer
model.add(Dense(2, activation='softmax'))

Training the CNN Model

# Compile the model
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Train the model
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    verbose=1,
    shuffle=True,
    validation_split=0.5
)

# Save the trained model
model.save('breastCancer_model.keras')
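
The training snippet above assumes that X_train and y_train are already in memory. Below is a minimal sketch of one way they might be prepared from the breastCancer directory, following the 64×64 RGB, [0, 1]-rescaled preprocessing described earlier; the helper name and loading approach are our own, not a Squint Vision Studio utility:

import os
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.utils import to_categorical

def load_split(split_dir, image_size=(64, 64)):
    """Load all images under split_dir/<class>/ into arrays rescaled to [0, 1]."""
    images, labels = [], []
    for class_name in sorted(os.listdir(split_dir)):  # '0' and '1'
        class_dir = os.path.join(split_dir, class_name)
        for file_name in os.listdir(class_dir):
            img = load_img(os.path.join(class_dir, file_name),
                           color_mode='rgb', target_size=image_size)
            images.append(img_to_array(img) / 255.0)  # normalize pixel values
            labels.append(int(class_name))
    return np.array(images), to_categorical(labels, num_classes=2)

X_train, y_train = load_split('SquintVolume/DataSources/breastCancer/train')
X_test, y_test = load_split('SquintVolume/DataSources/breastCancer/test')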

Explore the model and capture intermediate representations

  • Use the View feature to inspect the model graph.
  • Click on model nodes to view layer details.
  • Select a node and click Generate Model to create a squinting model that captures intermediate outputs. A squinting model is a modified version of your original model designed to capture and output intermediate activations from specific layers during inference. This step is crucial because we need to evaluate how your model sees the data (e.g., by capturing the embeddings of a layer deep in the model) before we can proceed to create a discovery.
  • Once you click “Generate Model” from a model node, navigate back to the Model ⇀ Load page and load the newly created squinting (“SQ”) model.
In this project, we selected the second-to-last "Dense" layer as the feature extraction layer, which has an output shape of (None, 512), as shown in the image below. This means we are capturing the output of this layer to analyze how the model processes data at that stage, where the layer produces 512 features per input sample.
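
For intuition, capturing a layer's activations can be illustrated in plain Keras. The sketch below is only an illustration of the concept, not how Squint Vision Studio builds its squinting models internally; it assumes the trained model file and test images from the earlier examples:

from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Dense

# Load the trained model and pick the second-to-last Dense layer (output shape (None, 512))
base_model = load_model('breastCancer_model.keras')
dense_layers = [layer for layer in base_model.layers if isinstance(layer, Dense)]
feature_layer = dense_layers[-2]

# Build a feature extractor that outputs the 512-dimensional activations of that layer
feature_extractor = Model(inputs=base_model.input, outputs=feature_layer.output)

# Each input image now yields a 512-dimensional embedding
embeddings = feature_extractor.predict(X_test)  # shape: (num_samples, 512)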

4. Discovery: Evaluating the Breast Cancer Model

Creating a Discovery

  • Navigate to the Discovery ⇀ Load page.
  • Click on the + icon to create a new discovery.
  • The model will run inference on the Breast Cancer dataset and generate performance metrics.

Discovery ⇀ Metrics

  • Analyze mistake distributions across classes for both training and testing sets. The image below illustrates how errors are distributed per class, helping to identify which category is more prone to misclassification.
  • Evaluate overall model performance on training and testing sets. Identify the best and worst performing classes (e.g., class 1 performs worst).
  • Examine per-class performance metrics, including accuracy, F1 scores, and error distributions for both training and testing sets.
  • Inspect confusion matrices for training and testing sets.
  • View dataset information used in this discovery, including dataset details and any augmentations applied during preprocessing or training.

Discovery ⇀ Analyze

  • Explore the embedding space of the model (based on the layer selected when creating the squinting model) using the Cognitive Atlas. In the Cognitive Atlas you can explore the relationships your model sees in the data by analyzing scatter plots for both training and testing data (select “train” or “test” under Image Source). You can analyze the model’s perception for all classes at once, or for specific classes using the “class filter” control; by default, all classes are shown.
  • Below is a cognitive atlas generated for this project, illustrating the embedding space on the Breast Cancer training data:
  • Analyze the embedding space by selecting different `Display` options to view all data points, correct predictions, false positives, and false negatives. You can also filter by class using the selection bar on the right (default shows all classes).
    The image below shows false positive errors across all classes in the Breast Cancer training data, visualized on the scatter plot.
As shown in the image, most errors are concentrated near the edges of the clusters in the embedding space. These boundary areas, which we refer to as ambiguous regions, represent zones where the model struggles to distinguish between classes. In these regions, the feature representations of different classes overlap or are very close, making it difficult for the model to make confident predictions. As a result, predictions in ambiguous regions are less reliable and more prone to error.
In contrast, the central areas of the clusters, referred to as trusted regions, are where the data points are more densely packed and clearly separated from other classes. These regions reflect high-confidence zones where the model consistently makes correct predictions. The lack of errors in these areas suggests that the model has learned strong, discriminative features for those examples, making its predictions more dependable.
Understanding the distribution of errors in the embedding space helps in diagnosing model behavior, identifying areas of uncertainty, and guiding improvements such as targeted data augmentation or model calibration.
  • Use the DensityMap option in Squint Vision Studio to visualize the distribution of data in the embedding space.
    When All Data is selected in the Display menu, the DensityMap shows the overall data distribution. It highlights areas with the highest concentration of data points. The image below shows the DensityMap for the data on the Breast Cancer training set. As you can see in the image, the densest regions are located at the center of the clusters, while the edges represent areas of lower density.
When False Positive or False Negative is selected in the Display menu, the DensityMap highlights areas with the highest concentration of mistakes. This helps identify problematic regions or cells in the embedding space. By focusing on these areas, for example by setting triggers or making targeted improvements, you can enhance model performance more effectively. The image below shows the DensityMap for incorrect predictions on the Breast Cancer training set:
  • Interactively explore ambiguous regions in the embedding space by clicking on cells located near the edges of clusters in the scatter plot. These edge cells often represent areas where the model is uncertain, leading to a higher likelihood of misclassifications. When you click on a cell, you can view detailed cell statistics and browse the image gallery of similar embeddings within that region.
    The image below shows cell statistics for a selected cell in an ambiguous region of the breast cancer training set:

Cell Statistics

  • Total Data Count: 725
  • Cell Accuracy: 49%
  • Correct Prediction Count: 354
  • Error Count: 371

Correct Prediction Distribution Visual breakdown of correct predictions by class:

  • Class 0: 350 correct predictions
  • Class 1: 4 correct predictions

Error Distribution Visual breakdown of misclassifications by class:

  • Class 0 Errors: 368
  • Class 1 Errors: 3
The image below shows the image gallery for a selected cell in an ambiguous region of the breast cancer training set:
As you can see, the images are ambiguous and not clearly recognizable.
You can also select the Input Saliency option (top-right corner of the image gallery) to highlight which parts of the images influenced the model's predictions, as shown below:
  • Interactively explore trusted regions in the embedding space by clicking on cells located near the center of clusters, which typically represent areas where the model is trustworthy, resulting in fewer misclassifications. Clicking on a cell reveals cell statistics and an image gallery of consistent embeddings.
    The image below shows cell statistics for a selected cell in a trusted region of class 0 of the breast cancer training set:

Cell Statistics:

  • Total Data Count: 396
  • Cell Accuracy: 100%
  • Correct Prediction Count: 396
  • Error Count: 0

Correct Prediction Distribution:

  • Class 0: 396
The image below shows the image gallery for a selected cell in the trusted region of class 0 of the breast cancer training set:
As you can see, the images are clear and consistent, making them easier to recognize.
You can also select the Input Saliency option (top-right corner of the image gallery) to highlight which parts of the images influenced the model's predictions, as shown below:

5. Trigger: Monitoring Breast Cancer Predictions

A trigger is a set of conditions that a user can design using the Cognitive Atlas to monitor the predictions of a model at runtime.

Note: a premium license is required to access the Trigger feature.

Creating a Trigger

  • Go to the Trigger → Load page.
  • Click the + icon to create a new trigger.
  • In the Discovery → Analyze page, select cells in the embedding space where the model shows uncertainty or frequent misclassifications (e.g., where classes 1 and 0 overlap).
  • These selected cells will appear in the Create Triggers dialog.
  • Define the trigger condition (e.g., "If the model predicts 0 and the input falls in cells (12:16, 13:16), fire alert: missed diagnosis").
  • Save the trigger.

Exporting a Trigger

  • Navigate to the Trigger → Export page.
  • View the trigger's conditions as diagrams showing:
  • The monitored cells for each prediction.
  • The alert value returned when the condition is met.
  • Exporting a trigger generates a runtime watchdog.
  • The watchdog can be integrated into your application using the Squint Watchdog API to monitor predictions and flag ambiguous or out-of-distribution inputs in real time.

Example: Creating a Trigger for High-Risk Misclassifications

Let's walk through an example of how to identify an ambiguous region that leads to high-risk misclassifications: specifically, false positives for class 0. In this scenario, the patient actually has cancer but is incorrectly classified as not having it, which constitutes a missed diagnosis.

Step 1: Analyze Training Mistakes

  • Navigate to the Discovery → Analyze page.
  • Under “Image source” select “Train” and under “Display” select “False positives”.
  • Hover over different cells in the embedding space for class 0.
  • Identify the cell with the lowest accuracy. For example, cells (12:16, 13:16) show lower accuracy, indicating frequent misclassifications in those regions.
  • We consider cells (12:16, 13:16) an ambiguous region for class 0.

Step 2: Validate with Testing Mistakes

  • Switch the Image Source to Test and the Display to False Positives, then locate cells (12:16, 13:16) in the Cognitive Atlas.
  • Confirm that these cells also show lower accuracy in the test set.
  • This consistency across training and testing data reinforces that cells (12:16, 13:16) are a reliable indicator of uncertainty.

Step 3: Define a Trigger

  • Based on this insight, define a trigger condition:
If the model predicts class 0 and the input falls within cells (12:16, 13:16), then send a signal indicating the prediction is not trustworthy.
This trigger helps your application flag potentially high-risk predictions in real time, supporting safer decision-making in clinical or diagnostic workflows.
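
Conceptually, the exported watchdog enforces a rule of this shape at inference time. The sketch below is purely illustrative: the cell notation and function are our own simplification, not the Squint Watchdog API, and the way cells (12:16, 13:16) map to concrete cell indices is an assumption:

# Hypothetical set of monitored cells corresponding to the ambiguous region above
MONITORED_CELLS = {(12, 16), (13, 16)}

def check_prediction(predicted_class, cell):
    """Return an alert when a class 0 prediction lands in a monitored cell."""
    if predicted_class == 0 and cell in MONITORED_CELLS:
        return 'ALERT: possible missed diagnosis - prediction is not trustworthy'
    return None

# Example usage
print(check_prediction(0, (12, 16)))  # fires the alert
print(check_prediction(1, (12, 16)))  # returns None; class 1 predictions are not monitored here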

6. Why Choose Squint Vision Studio

Understand Model Behavior Through Semantic Analysis

Squint Vision Studio enables deep semantic exploration of your model's embedding space. By visualizing how data points cluster and where errors occur, you can gain a clearer understanding of how your model perceives and organizes information. This insight is crucial for:
  • Identifying ambiguous regions where the model is uncertain.
  • Differentiating between types of mistakes based on their location in the Cognitive Atlas.
  • Recognizing patterns in misclassifications that may not be obvious from raw metrics alone.

Improve Model Performance Based on Purpose

The platform allows you to tailor your model improvement strategy to your specific use case by leveraging the spatial structure of the embedding space.
For example:
  • You can set custom triggers to monitor specific types of errors.
    Example: If missed diagnosis is particularly important, you can define a rule like:
"If the model predicts 0 and the input falls in cells (12:16, 13:16), fire alert: missed diagnosis."
  • Once this trigger is activated, you can route the flagged sample to a secondary model trained specifically on data from the ambiguous region, that is, cases where the distinction between 0 and 1 is subtle or unclear. Because it is focused on this narrower, more challenging subset of the data, the secondary model can learn finer-grained patterns and decision boundaries that the primary model might miss, making it more reliable at resolving such ambiguity.
This approach allows you to layer your models intelligently, using the general model for broad classification and specialized models for high-risk or high-importance distinctions.
  • This layered approach allows you to:
  • Prioritize and address high-impact errors.
  • Reduce false positives or negatives in sensitive areas.
  • Improve overall system reliability without overcomplicating the primary model.

Model Adjustment Without Retraining

Retraining a model can be time-consuming, expensive, or even infeasible in production environments. Squint Vision Studio offers a powerful alternative: adjusting model behavior without retraining.
For example:
  • Suppose you discover that a particular visual feature in breast cancer images helps distinguish between early-stage and late-stage tumors, an important factor in determining treatment strategy.
  • Instead of retraining your model to explicitly recognize these as separate classes (e.g., "early-stage tumors" vs. "late-stage tumors"), you can analyze the model's internal representations, such as a scatter plot of embeddings, and identify clusters within class 1 (positive cases).
  • By splitting class `1` into two sub-classes based on their location in the scatter plot, you leverage the natural grouping of similar samples. This clustering reflects underlying patterns the model has already learned (see the sketch after this list).
  • This approach allows you to extend the model's output space to accommodate new requirements (like distinguishing the stage of the breast cancer) without modifying the model weights. Instead, you build on the model's learned representations, making your system more flexible and adaptive.
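
As a rough sketch of this idea, one could cluster the class 1 embeddings (for example, the 512-dimensional features captured by the squinting model) and treat the resulting clusters as sub-classes. The example below uses scikit-learn's KMeans and assumes embeddings and labels are already available as arrays; it is not a built-in Studio feature, only an illustration of the underlying approach:

import numpy as np
from sklearn.cluster import KMeans

# embeddings: (num_samples, 512) array from the feature extraction layer
# labels: (num_samples,) array of original class labels (0 or 1)
class1_mask = labels == 1
class1_embeddings = embeddings[class1_mask]

# Split class 1 into two sub-classes based on its structure in the embedding space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
sub_labels = kmeans.fit_predict(class1_embeddings)

# Extended label space: 0 stays 0, class 1 becomes 1 or 2 depending on its cluster
extended_labels = labels.copy()
extended_labels[class1_mask] = 1 + sub_labels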

Flexible, Interactive, and Cost-Efficient

Squint Vision Studio empowers you to:
  • Interactively explore and annotate the embedding space.
  • Set up real-time alerts for specific error patterns.
  • Make structural changes to your model's interpretation of data without retraining.
  • Save time and resources while maintaining high model performance and adaptability.

7. Insights: Summarizing the Breast Cancer Project

The Insights section serves as a critical component in understanding and communicating the outcomes of the Breast Cancer classification project. It consolidates key findings, performance metrics, and model behavior into a structured, shareable format. This supports internal analysis, model refinement, and transparent collaboration across clinical, research, and technical teams.

Creating an Insight Report

To generate a comprehensive Insight report:

  • Navigate to Insights → Load.
  • Click the + icon to create a new Insight report.

The generated report includes:

  • Dataset Details
    Information such as class distribution for both training and testing sets.
  • Model Metrics and Performance
    Includes model architecture, size, and training parameters, along with evaluation metrics such as accuracy, precision, recall, and F1 score. These are critical for assessing diagnostic reliability and minimizing false positives and negatives.
  • Discovery Results
    Visual tools like confusion matrices and saliency maps reveal which cases are most frequently misclassified and why. These insights are essential for identifying model blind spots and improving clinical relevance.
  • Trigger Definitions and Monitored Conditions
    These highlight specific thresholds and conditions used to monitor model behavior during training and inference, ensuring safety, reliability, and compliance with medical standards.
  • Recommendations
    The Recommendations feature in the Studio's Insight report provides automated, data-driven guidance for improving the performance and quality of your breast cancer classification model. It analyzes key metrics from your dataset and model evaluation, then generates targeted suggestions to help you optimize results.

Exporting an Insight

To share or archive the report:

  • Go to Insights → Export.
  • Export the Insight report as a detailed summary of the Breast Cancer project.

This exported report is invaluable for:

  • Documentation
    Keeping a record of model development, training cycles, and evaluation results for regulatory and research purposes.
  • Presentations
    Communicating findings to clinical teams, data scientists, and stakeholders in healthcare and research.
  • Stakeholder Engagement
    Providing clear, data-driven insights to support decisions in clinical deployment, research validation, or product development.
  • Recommendations Tracking
    Reviewing automated suggestions for improving dataset balance, per-class performance, and overall model quality. These recommendations help guide iterative improvements and ensure the project evolves toward higher diagnostic accuracy, fairness, and robustness.

Why It Matters:

The Insight report transforms raw metrics into actionable intelligence. It helps teams understand model strengths and weaknesses, guides iterative improvements, and ensures that the Breast Cancer project remains transparent, clinically relevant, and aligned with its goals, whether for research, diagnostic support, or deployment in healthcare settings.

Summary

This sample project demonstrates how to use Squint Vision Studio to evaluate a model trained on the Breast Cancer datasource. By following these steps, you can gain deep insights into your model's performance, understand its decision-making process, and identify areas for improvement.
In summary, Squint Vision Studio is more than just a visualization tool; it's a semantic control center for your model, enabling smarter diagnostics, targeted improvements, and flexible adaptation to evolving needs.