Feature: Azure PDF scan #282

Merged: 14 commits into develop from features/pdf-scanning on Aug 26, 2021

dinhtungdu (Contributor) commented May 23, 2021

Description of the Change

This PR uses the Azure Computer Vision Read API to extract text from multi-page PDF files. It supports both text-based and text-heavy image-based PDF files.

Because the Read API is asynchronous (a file is submitted first and the result is fetched later), this feature uses WP Cron to periodically check for and retrieve the result.
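
To illustrate the periodic check, here is a minimal Python sketch of a single "check and grab" pass. It is not the plugin's PHP code: `check_read_result`, `extract_text`, and the injected `fetch` callable are hypothetical names, and the response shape (`status`, `analyzeResult.readResults[].lines[].text`) is assumed from the v3.x Read API result format.

```python
from typing import Callable, Optional

# Status values that mean "come back later" (assumed from the v3.x Read API).
PENDING = {"notStarted", "running"}


def extract_text(result: dict) -> str:
    """Join every recognized line, page by page, into one string."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    lines = [line["text"] for page in pages for line in page.get("lines", [])]
    return "\n".join(lines)


def check_read_result(fetch: Callable[[], dict]) -> Optional[str]:
    """Run one scheduled check.

    `fetch` is a hypothetical callable that returns the decoded JSON of the
    operation's result URL. Returns the text once the operation succeeded,
    None while it is still pending, and raises if it failed.
    """
    result = fetch()
    status = result.get("status")
    if status == "succeeded":
        return extract_text(result)
    if status in PENDING:
        return None  # the scheduler (WP Cron in this PR) should try again later
    raise RuntimeError(f"Read operation ended with status {status!r}")


# Simulated responses standing in for two consecutive scheduled checks.
responses = iter([
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"readResults": [
        {"lines": [{"text": "Hello"}, {"text": "world"}]},
        {"lines": [{"text": "Page two"}]},
    ]}},
])
first = check_read_result(lambda: next(responses))   # still pending
second = check_read_result(lambda: next(responses))  # text of both pages
```

Passing `fetch` in as a callable keeps the sketch free of network calls, so the scheduling logic can be exercised with canned responses.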

Verification Process

  1. Go to ClassifAI > Image Processing.
  2. See the new setting Enable Scanning PDF.
  3. Enable that feature.
  4. Upload a PDF file.
  5. Right after the file is uploaded, open its media modal and see the Classifai Read PDF field with a disabled In progress! button.
  6. Wait a few minutes for the API to process the file, then check the modal again and see the description field filled with the content of the PDF file.
  7. Open the attachment detail page and see a new metabox, Classifai PDF Processing, with a rescan checkbox.

Checklist:

  • I have read the CONTRIBUTING document.
  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests passed.

Applicable Issues

Changelog Entry

dinhtungdu self-assigned this on May 23, 2021
dinhtungdu requested review from dkotter and helen on May 23, 2021 01:09
jeffpaul added this to the 1.7.0 milestone on May 24, 2021

dkotter (Collaborator) left a comment

Haven't fully tested this out yet, but the code looks good. Left a few minor comments.

dinhtungdu (Contributor, Author) commented

@dkotter Thanks for the heads-up, I fixed those typos.

jeffpaul requested a review from dkotter on June 2, 2021 04:53
dinhtungdu linked an issue on Jun 2, 2021 that may be closed by this pull request
jeffpaul mentioned this pull request on Jul 7, 2021 (21 tasks)

phpbits (Contributor) commented Jul 13, 2021

@jeffpaul @dinhtungdu Confirming that this feature is working as expected. I followed the steps and the PDF file was scanned successfully. See screenshot below:

[Screenshot: Screen Shot 2021-07-14 at 1 04 47 AM]

helen (Contributor) left a comment

Nice, it's working! We probably need to get smarter with the scan button in a future release, because you can do something like request a scan, switch items, come back, and request another scan. It should probably check the status any time that button is loaded up, or load the button with AJAX entirely.
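
A rough Python sketch of the status check suggested above, under stated assumptions: it is not the plugin's code, `ScanTracker` and its methods are hypothetical, and the in-memory set stands in for whatever per-attachment status the plugin actually stores.

```python
from dataclasses import dataclass, field


@dataclass
class ScanTracker:
    """Tracks which attachments have a scan in flight (hypothetical helper)."""

    _in_progress: set = field(default_factory=set)

    def button_state(self, attachment_id: int) -> str:
        # Evaluated every time the button is rendered, so switching items
        # and coming back still shows the correct state.
        return "in_progress" if attachment_id in self._in_progress else "ready"

    def request_scan(self, attachment_id: int) -> bool:
        # Refuse a second request while one is already pending.
        if attachment_id in self._in_progress:
            return False
        self._in_progress.add(attachment_id)
        return True

    def mark_finished(self, attachment_id: int) -> None:
        # Called once the result has been retrieved (or the scan failed).
        self._in_progress.discard(attachment_id)


tracker = ScanTracker()
accepted = tracker.request_scan(42)    # first request for attachment 42
duplicate = tracker.request_scan(42)   # repeated request while still pending
state = tracker.button_state(42)       # state the button should render with
```

Checking the stored state on every render, rather than only once on page load, is what closes the "request, switch items, come back, request again" gap.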

helen merged commit 0b72995 into develop on Aug 26, 2021
helen deleted the features/pdf-scanning branch on August 26, 2021 22:27
jeffpaul mentioned this pull request on Aug 31, 2021 (1 task)
Linked issue (may be closed by this pull request): Update to v3 API to gain PDF OCR functionality