BUG: When running OCR inference on a multi-page PDF with the page_num parameter set, only the first page is recognized #10259

Closed
minboo opened this issue Jun 29, 2023 · 11 comments
Labels: bug (Something isn't working), Code PR is needed (This issue could inspire a code PR), expneeded (need extra experiment to fix issue), good first issue (Good for newcomers), status/close

minboo commented Jun 29, 2023

Please provide the following information to quickly locate the problem

  • System Environment: the problem occurs on both Windows and Linux
  • Version: both paddleocr and paddlepaddle are the latest releases
    Partial code:
import os
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# A single PaddleOCR instance is created at startup and reused by every request.
ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                use_mp=True,
                total_process_num=4,
                use_gpu=True,
                page_num=999,  # process up to the first 999 pages of a PDF
                cls_model_dir="/workspace/OCR/models/PP-OCRv3/ch_ppocr_mobile_v2.0_cls_infer",
                det_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_det_infer",
                rec_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_rec_infer")


def process_predict(path: str):
    # Run detection, angle classification, and recognition on the file.
    result = ocr.ocr(path, cls=True)
    return result


@app.post("/test")
async def ocr_rec(file: UploadFile = File(...)):
    # Save the upload under a unique name, keeping the original extension.
    upload_folder = "input/upload/"
    os.makedirs(upload_folder, exist_ok=True)
    new_filename = str(uuid.uuid4()) + os.path.splitext(file.filename)[-1]
    file_path = os.path.join(upload_folder, new_filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = process_predict(file_path)

    return {"results": result}

Steps to reproduce: first recognize a single-page PDF, then recognize a multi-page PDF. Only the first page of the multi-page PDF is recognized.

@shiyutang shiyutang added bug Something isn't working good first issue Good for newcomers expneeded need extra experiment to fix issue Code PR is needed This issue could inspire a code PR labels Jun 29, 2023
livingbody (Contributor) commented Jul 4, 2023

Found the problem; a PR is in progress, one moment. PR link: #10290

@livingbody (Contributor)

Changes and test project:
PaddlePaddle (飞桨) AI Studio - AI learning and practical training community
https://aistudio.baidu.com/aistudio/projectdetail/6474682?contributionType=1

dizhenx commented Jul 4, 2023

This is probably caused by a mismatched PyMuPDF version. Try switching to version 1.18.14.

minboo (Author) commented Jul 4, 2023

The PyMuPDF version is definitely 1.18.14, because with any other version, recognizing a PDF fails with AttributeError: 'Document' object has no attribute 'pageCount'. I have a record of all of this.
[image]
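For readers hitting the same AttributeError: it is the kind of failure that appears when code reads `pageCount` from a PyMuPDF release that only exposes the newer `page_count` name. Below is a minimal sketch of a version-tolerant lookup. The helper name `pdf_page_count` and the stand-in classes are hypothetical and exist only so the snippet runs without PyMuPDF installed; this is not what PaddleOCR itself does.

```python
def pdf_page_count(doc):
    """Return the page count of a document-like object.

    Tries the newer snake_case attribute first, then the older
    camelCase one, and finally falls back to len(doc).
    """
    for attr in ("page_count", "pageCount"):
        value = getattr(doc, attr, None)
        if isinstance(value, int):
            return value
    return len(doc)


# Hypothetical stand-ins for document objects from different library versions.
class NewStyleDoc:
    page_count = 3


class OldStyleDoc:
    pageCount = 2


class SizedDoc:
    def __len__(self):
        return 5


new_count = pdf_page_count(NewStyleDoc())
old_count = pdf_page_count(OldStyleDoc())
sized_count = pdf_page_count(SizedDoc())
```

With a real PyMuPDF document, the same helper would be called on the object returned by opening the PDF, assuming that object exposes at least one of the three access paths tried above.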

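@shiyutang (Collaborator)

page_num is fixed when a PaddleOCR instance is initialized; after that, every call to ocr.ocr uses the page_num that was determined by the first PDF passed in. Could you re-initialize a PaddleOCR instance on each call?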
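The behaviour described here, where the first PDF determines page_num for all later calls, can be illustrated with a toy model. This is a sketch under assumptions, not PaddleOCR's actual code: the class names `MutatingLimiter` and `StableLimiter` are hypothetical, pages are plain strings, and 0 is taken to mean "all pages".

```python
class MutatingLimiter:
    """Toy model of the reported bug: the configured limit is overwritten."""

    def __init__(self, page_num=0):
        self.page_num = page_num  # 0 means "all pages" in this sketch

    def select(self, pages):
        if self.page_num > len(pages) or self.page_num == 0:
            # Side effect: the instance-wide limit shrinks to this document's size.
            self.page_num = len(pages)
        return pages[: self.page_num]


class StableLimiter:
    """Same clamping logic, but the per-call limit is a local variable."""

    def __init__(self, page_num=0):
        self.page_num = page_num

    def select(self, pages):
        limit = self.page_num
        if limit > len(pages) or limit == 0:
            limit = len(pages)
        return pages[:limit]


single = ["p1"]
multi = ["p1", "p2", "p3"]

buggy = MutatingLimiter(page_num=999)
buggy.select(single)               # clamps the shared limit to 1
buggy_pages = buggy.select(multi)  # limit is still 1, so only one page comes back

fixed = StableLimiter(page_num=999)
fixed.select(single)
fixed_pages = fixed.select(multi)  # configured limit is untouched, all pages come back
```

Keeping the clamped value in a local variable avoids re-creating the instance for every request, which is the cost objected to in the replies that follow.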

minboo (Author) commented Jul 4, 2023

Re-initializing an instance on every call is very time-consuming. Creating the instance takes longer than the recognition itself, so how is that usable?

minboo (Author) commented Jul 4, 2023

If every call to ocr.ocr uses the page_num determined by the first PDF passed in, then what is the point of the page_num parameter at initialization? I suggest changing this behavior.

@shiyutang (Collaborator)

I suggest trying the PR linked earlier (#10290). I just checked and it does resolve the problem, and it has already been merged.

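@shiyutang (Collaborator)

The replies above have fully answered the question. If you have new questions, feel free to open a new issue at any time, or keep replying under this one.
We have launched an issue-tackling campaign for the PaddlePaddle suites. Interested developers are welcome to join: #10223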

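@ColorfulDick

I reproduced this problem as well. After initializing PaddleOCR and feeding in a PDF file multiple times, sometimes only a limited number of pages are recognized.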

@clSpider

I ran into this problem too. When multi-page PDFs are recognized consecutively, only the first page is recognized.

7 participants