IBM Granite Vision 2B Model Locally - Install and test with samples | WORKS 100% | FREE OCR
In this tutorial we are going to install the ibm-granite/granite-vision-3.1-2b-preview model on a Windows machine and then test it by analyzing several images. This is a vision model from IBM that can be used to understand an image and extract data from it.
This is a vision-language model designed for visual understanding of documents and images. It can extract content from tables, charts, infographics, plots, diagrams and many other visual objects. The model is trained on a large collection of diverse public and synthetic datasets, carefully selected so that the model learns to understand a wide range of document types.
Video Instructions for Installing and Using the IBM Granite Vision 2B Model Locally
Visit the Hugging Face page at https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview to find the latest information about the model. Here is a screenshot of the page at the time of writing this tutorial.
Step 1: Create Python virtual environment
First of all, we will create and activate a Python virtual environment with the help of the following commands:
conda create -n aienv python=3.11 -y
conda activate aienv
The above commands will create the virtual environment and then activate it for you.
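To confirm that the right interpreter is active, you can run a quick check from inside the environment (a minimal sketch; the exact paths will differ on your machine):
import sys

# Should report Python 3.11.x
print(sys.version)

# Should point at the Python executable inside the aienv environment
print(sys.executable)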
Step 2: Install Transformers library
The next step is to install the latest version of the Transformers library directly from GitHub. Here is the command:
pip install git+https://github.com/huggingface/transformers.git
The above command will download and install the Transformers library.
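To verify that Transformers installed correctly, you can print its version (a quick sanity check; the exact version string on your machine will differ since we installed from the main branch):
import transformers

# Prints the installed Transformers version
print(transformers.__version__)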
Step 3: Install Python libraries
Install the required Python libraries using the following pip command:
pip install torch torchvision torchaudio einops timm pillow
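Once the installation finishes, you can check whether PyTorch can see your GPU (a quick check; if it prints False, the model will run on the CPU, which is slower but still works):
import torch

# Prints the PyTorch version and whether a CUDA-capable GPU is available
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())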
Step 4: Develop a program that uses the Granite Vision model to understand an image
Create a new Python file, say test_image.py, and add the following code:
# Load the Granite Vision model and its processor
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# Image to analyze and the question to ask about it
image_path = "C:\\Tutorials\\GenAi\\ibm\\hindi.jpg"
prompt = "What is in the image?"

# Build a chat conversation containing the image and the text prompt
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

# Tokenize the conversation and move the tensors to the selected device
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(device)

# Generate the answer and print it
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
Run the above code and you will get the results from the model. Here is a screenshot of the output:
The model is able to identify the content of the image and print its understanding. It performs well, and one more example of using it follows below.
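For convenience you can wrap the generation steps into a small helper and reuse it with different prompts, for example OCR-style text extraction from a document image. This is a minimal sketch that reuses the processor, model and device from the script above; the ask_about_image name and the invoice.jpg path are just illustrative:
# Reusable helper (illustrative sketch): ask any question about a local image
def ask_about_image(image_path, question):
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(device)
    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

# Example: OCR-style extraction from a document image (hypothetical path)
print(ask_about_image("C:\\Tutorials\\GenAi\\ibm\\invoice.jpg",
                      "Extract all the text you can read from this image."))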