BUG: When running OCR inference on a multi-page PDF with the page_num parameter set, only the first page is recognized #10259

minboo · 2023-06-29T03:07:55Z

Please provide the following information to quickly locate the problem

System Environment: the problem occurs on both Windows and Linux
Version: latest PaddleOCR and paddlepaddle
Partial code:

import os
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()

# One shared engine, created once at startup and reused by every request.
ocr = PaddleOCR(use_angle_cls=True, lang="ch",
                use_mp=True,
                total_process_num=4,
                use_gpu=True,
                page_num=999,
                cls_model_dir="/workspace/OCR/models/PP-OCRv3/ch_ppocr_mobile_v2.0_cls_infer",
                det_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_det_infer",
                rec_model_dir="/workspace/OCR/models/PP-OCRv3/ch_PP-OCRv3_rec_infer")


def process_predict(path: str):
    # Run detection, angle classification, and recognition on the file.
    result = ocr.ocr(path, cls=True)
    return result


@app.post("/test")
async def ocr_rec(file: UploadFile = File(...)):
    # Save the upload under a unique name, keeping the original extension.
    upload_folder = "input/upload/"
    os.makedirs(upload_folder, exist_ok=True)
    new_filename = str(uuid.uuid4()) + os.path.splitext(file.filename)[-1]
    file_path = os.path.join(upload_folder, new_filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)
    result = process_predict(file_path)

    return {"results": result}

Steps to reproduce: first recognize a single-page PDF, then recognize a multi-page PDF. Only the first page of the multi-page PDF is recognized.

livingbody · 2023-07-04T03:14:23Z

Found the problem. A PR is in progress, one moment. PR link: #10290

livingbody · 2023-07-04T03:19:59Z

Changes and test project:
PaddlePaddle AI Studio - AI learning and training community
https://aistudio.baidu.com/aistudio/projectdetail/6474682?contributionType=1

dizhenx · 2023-07-04T07:02:27Z

This is probably caused by the wrong PyMuPDF version. Try switching to version 1.18.14.

minboo · 2023-07-04T07:07:18Z

The PyMuPDF version is definitely 1.18.14, because with any other version PDF recognition fails with AttributeError: 'Document' object has no attribute 'pageCount'. I have records of all this.
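
For readers who hit the AttributeError quoted above, here is a minimal sketch of a version-tolerant page-count lookup. It assumes that newer PyMuPDF releases expose the count under the snake_case name `page_count` while older ones (such as 1.18.14) use `pageCount`. The helper `get_page_count` and the two stand-in classes are hypothetical and are not part of PaddleOCR or PyMuPDF.

```python
def get_page_count(doc):
    """Return the page count of a PyMuPDF-like document object.

    Hypothetical helper: tries the snake_case attribute first, then the
    older camelCase one, and finally falls back to len(doc).
    """
    if hasattr(doc, "page_count"):   # assumed name in newer releases
        return doc.page_count
    if hasattr(doc, "pageCount"):    # name used by older releases
        return doc.pageCount
    return len(doc)                  # last resort for objects defining __len__


# Stand-in objects so the sketch runs without PyMuPDF installed.
class NewStyleDoc:
    page_count = 3


class OldStyleDoc:
    pageCount = 2


new_count = get_page_count(NewStyleDoc())
old_count = get_page_count(OldStyleDoc())
list_count = get_page_count(["page-1", "page-2", "page-3", "page-4"])
```

With a real document, the same function would be called on the object returned by opening the PDF, which avoids pinning one exact PyMuPDF version just for this attribute.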

shiyutang · 2023-07-04T07:31:50Z

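page_num is fixed when a PaddleOCR instance is initialized, and after the first call to ocr.ocr it is determined by the first PDF that was passed in. Could you re-initialize a PaddleOCR instance for each call?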
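
The mechanism described in this comment can be illustrated with a toy model. This is a sketch under assumptions, not PaddleOCR's actual code: the classes `BuggyEngine` and `FixedEngine` are hypothetical, and pages are plain strings. It shows how writing the clamped page limit back to the instance makes the first document's length stick, and how clamping into a local variable avoids that.

```python
class BuggyEngine:
    """Toy stand-in for an OCR engine (hypothetical, for illustration only)."""

    def __init__(self, page_num=0):
        self.page_num = page_num  # configured page limit; 0 means "all pages"

    def ocr(self, pages):
        # Bug: the clamped value overwrites the configured limit, so the
        # length of the first document is reused for every later call.
        if self.page_num > len(pages) or self.page_num == 0:
            self.page_num = len(pages)
        return pages[:self.page_num]


class FixedEngine(BuggyEngine):
    def ocr(self, pages):
        # Fix: clamp into a local variable and leave self.page_num untouched.
        limit = self.page_num
        if limit > len(pages) or limit == 0:
            limit = len(pages)
        return pages[:limit]


single = ["p1"]
multi = ["p1", "p2", "p3"]

buggy = BuggyEngine(page_num=999)
buggy.ocr(single)                # the limit silently becomes 1 here
buggy_result = buggy.ocr(multi)  # only the first page is returned

fixed = FixedEngine(page_num=999)
fixed.ocr(single)
fixed_result = fixed.ocr(multi)  # all three pages are returned
```

The toy reproduces the reported sequence: a single-page document first, then a multi-page one.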

minboo · 2023-07-04T07:36:54Z

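Re-initializing an instance on every call is very time-consuming. Creating the instance takes longer than the recognition itself, so how is that usable?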
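
One caller-side workaround that avoids re-initialization could look like the sketch below. It assumes the engine keeps its page limit in a mutable attribute named `page_num`, as the comments in this thread suggest; that attribute name, the wrapper `ResettingOCR`, and the stand-in `FakeEngine` are assumptions for illustration and should be checked against the installed version. The wrapper is also not safe for concurrent requests, because it mutates shared state before each call.

```python
class ResettingOCR:
    """Hypothetical wrapper that restores the configured page limit
    on the wrapped engine before every call."""

    def __init__(self, engine, page_num):
        self._engine = engine
        self._page_num = page_num

    def ocr(self, *args, **kwargs):
        # Undo whatever a previous call wrote to the engine's page_num.
        self._engine.page_num = self._page_num
        return self._engine.ocr(*args, **kwargs)


class FakeEngine:
    """Stand-in that mimics the state-overwriting behavior described above."""

    def __init__(self, page_num):
        self.page_num = page_num

    def ocr(self, pages):
        if self.page_num > len(pages) or self.page_num == 0:
            self.page_num = len(pages)
        return pages[:self.page_num]


wrapped = ResettingOCR(FakeEngine(page_num=999), page_num=999)
first = wrapped.ocr(["p1"])
second = wrapped.ocr(["p1", "p2", "p3"])
```

The expensive engine is still created only once; only a single attribute assignment is added per call.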

minboo · 2023-07-04T07:41:53Z

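If page_num on every ocr.ocr call is determined by the first PDF passed in, then what is the point of the page_num parameter at initialization? I suggest changing this behavior.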

shiyutang · 2023-07-04T07:47:05Z

I suggest trying the PR that livingbody linked (#10290). I just checked and it resolves the problem, and it has now been merged.

shiyutang · 2023-07-04T07:47:31Z

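The answers above have fully addressed the problem. If you have new questions, feel free to submit a new issue or keep replying under this one.
We have launched an issue-solving campaign for the PaddlePaddle suites, and interested developers are welcome to join: #10223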

ColorfulDick · 2024-05-13T10:28:05Z

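I reproduced this problem as well. After initializing PaddleOCR and feeding in a PDF file multiple times, sometimes only a limited number of pages are recognized.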

clSpider · 2024-06-13T06:45:41Z

I hit this problem too. When multi-page PDFs are recognized consecutively, only the first page is recognized.

paddle-bot bot assigned huangjun12 Jun 29, 2023

shiyutang added the bug, good first issue, expneeded, and Code PR is needed labels Jun 29, 2023

livingbody mentioned this issue Jul 4, 2023

🏅️ PaddlePaddle Suites Happy Open Source regular contest #10223

Closed

shiyutang closed this as completed Jul 4, 2023

paddle-bot bot added the status/close label Jul 4, 2023
