Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing in the input stage Using Encoder and Decoder Model Architecture (DONUT model) #192

Open
swapnil-lader opened this issue Nov 15, 2022 · 0 comments

Comments

@swapnil-lader
Copy link

swapnil-lader commented Nov 15, 2022

I am trying to get a model summary of the donut model but am unable to define the input for the torch summary.
###########################################################
import argparse
import gradio as gr
import torch
from PIL import Image
from donut.donut.model import DonutModel
from torchvision import models
from torchsummary import summary

def demo_process_vqa(input_img, question):
global pretrained_model, task_prompt, task_name
# pretrained_model = './donut/result/train_docvqa/20220912_103244'
# task_name = "docvqa"
# task_prompt = "<s_pdf-donut>"
input_img = Image.fromarray(input_img)
user_prompt = task_prompt.replace("{user_input}", question)
print(user_prompt)
output = pretrained_model.inference(input_img, prompt=user_prompt)["predictions"][0]
print('inf_out',output)
return output

def demo_process(input_img):
global pretrained_model, task_prompt, task_name
input_img = Image.fromarray(input_img)
output = pretrained_model.inference(image=input_img, prompt=task_prompt)["predictions"][0]
return output

parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, default="docvqa")
parser.add_argument("--pretrained_path", type=str, default="train_docvqa_for_all_atts/donut/result/train_docvqa/20220915_125713")
args, left_argv = parser.parse_known_args()

task_name = args.task
if "docvqa" == task_name:
task_prompt = "<s_taco_eiko_pdf_donut>{user_input}</s_question><s_answer>"
else: # rvlcdip, cord, ...
task_prompt = f"<s_{task_name}>"

pretrained_model = DonutModel.from_pretrained(args.pretrained_path)

if torch.cuda.is_available():
# pretrained_model.half()
device = torch.device("cuda")
pretrained_model.to(device)
else:
pretrained_model.encoder.to(torch.bfloat16)

summary(pretrained_model, [(1, 3, 1280 , 960), (1, 21),(1, 21)])

The shape of the encoder and decoder is as follows.
Encoder : torch.Size([1, 3, 1280, 960])
Decode : torch.Size([1, 21])

##Model forward architecture looks like this

    encoder_outputs = self.encoder(image_tensors)
    decoder_outputs = self.decoder(
        input_ids=decoder_input_ids,
        encoder_hidden_states=encoder_outputs,
        labels=decoder_labels,
    )
    return decoder_outputs

Can you please guide how to pass down the model input in summary?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant