IBM Granite Vision 2B Model Locally - Install and test with samples | WORKS 100% | FREE OCR
In this tutorial we are going to install the ibm-granite/granite-vision-3.1-2b-preview model on a Windows machine and then test it by analyzing several images. This is a vision model from IBM that can be used to understand an image and extract data from it.
This is a vision-language model designed for visual understanding of documents and images. It can extract content from tables, charts, infographics, plots, diagrams and many other visual objects. The model is trained on a large collection of diverse public and synthetic datasets, carefully selected so that the model learns to understand a wide range of document types.
Video Instructions for Installing and Using the IBM Granite Vision 2B Model Locally
Visit the Hugging Face page at https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview to find the latest information about the model. Here is a screenshot of the page at the time of writing this tutorial.
Step 1: Create Python virtual environment
First of all, we will create and activate a Python virtual environment with the help of the following commands:
conda create -n aienv python=3.11 -y
conda activate aienv
The above commands will create the virtual environment and then activate it for you.
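To confirm that the right interpreter is active, you can run a quick check from inside the environment (a minimal sketch; the exact paths will differ on your machine):
import sys

# Should report Python 3.11.x
print(sys.version)

# Should point at the Python executable inside the aienv environment
print(sys.executable)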
Step 2: Install Transformers library
The next step is to install the latest version of the Transformers library directly from GitHub. Here is the command:
pip install git+https://github.com/huggingface/transformers.git
The above command will download and install the Transformers library.
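To verify that Transformers installed correctly, you can print its version (a quick sanity check; the exact version string on your machine will differ since we installed from the main branch):
import transformers

# Prints the installed Transformers version
print(transformers.__version__)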
Step 3: Install Python libraries
Install the required Python libraries using the following pip command:
pip install torch torchvision torchaudio einops timm pillow
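Once the installation finishes, you can check whether PyTorch can see your GPU (a quick check; if it prints False, the model will run on the CPU, which is slower but still works):
import torch

# Prints the PyTorch version and whether a CUDA-capable GPU is available
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())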
Step 4: Develop a program that uses the Granite Vision model to understand an image
Create a new Python file, say test_image.py, and add the following code:
# Load the Granite Vision model and its processor
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to(device)

# Image to analyze and the question to ask about it
image_path = "C:\\Tutorials\\GenAi\\ibm\\hindi.jpg"
prompt = "What is in the image?"

# Build a chat conversation containing the image and the text prompt
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "url": image_path},
        {"type": "text", "text": prompt},
    ],
}]

# Tokenize the conversation and move the tensors to the selected device
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(device)

# Generate the answer and print it
output = model.generate(**inputs, max_new_tokens=500)
print(processor.decode(output[0], skip_special_tokens=True))
Run the above code and you will get the results from the model. Here is a screenshot of the output:
The model is able to identify the content of the image and print its understanding. It performs well, and one more example of using it follows below.
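For convenience you can wrap the generation steps into a small helper and reuse it with different prompts, for example OCR-style text extraction from a document image. This is a minimal sketch that reuses the processor, model and device from the script above; the ask_about_image name and the invoice.jpg path are just illustrative:
# Reusable helper (illustrative sketch): ask any question about a local image
def ask_about_image(image_path, question):
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": question},
        ],
    }]
    inputs = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(device)
    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)

# Example: OCR-style extraction from a document image (hypothetical path)
print(ask_about_image("C:\\Tutorials\\GenAi\\ibm\\invoice.jpg",
                      "Extract all the text you can read from this image."))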