Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store an MD5 hash of uploaded/indexed file and check before prepdocs #835

Merged
merged 6 commits into from
Oct 22, 2023

Conversation

tonybaloney
Copy link
Contributor

Purpose

Running prep docs on every provision is really time consuming, especially if the docs haven't changed.

This PR writes an .md5 file with an MD5 hash of the files in data/ when they get uploaded, then every time the prep docs script is re-run that file is checked against the current hash and the file is skipped if it hasn't changed.

I've added those md5 files to gitignore so they don't get committed and they also won't be uploaded.

Does this introduce a breaking change?

[ ] Yes
[ ] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

@tonybaloney tonybaloney requested a review from pamelafox October 20, 2023 02:08
@tonybaloney
Copy link
Contributor Author

screenshot 2023-10-20 at 13 32 23 screenshot of this working..

@pamelafox
Copy link
Collaborator

Yay! Also adding @mattmsft as a reviewer as we discussed the issue together about avoiding repetitive indexing, and also because Matt has prepdocs-related changes planned in the future. Looks like you were able to make this change fairly unobtrusively.

scripts/prepdocs.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mattgotteiner mattgotteiner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me, we can add tests at a later point

@tonybaloney tonybaloney merged commit ca01af9 into Azure-Samples:main Oct 22, 2023
8 checks passed
@tonybaloney tonybaloney deleted the check_diff_before_upload branch October 22, 2023 22:27
vishalgtingre added a commit to vishalgtingre/azure-search-openai-demo that referenced this pull request Nov 5, 2023
commit 3bf2d19
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Nov 2 09:10:15 2023 -0700

    Fix list file (Azure-Samples#897)

commit b3c55b0
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 1 06:42:42 2023 -0700

    Bump pypdf from 3.16.3 to 3.17.0 in /scripts (Azure-Samples#890)

    Bumps [pypdf](https://github.com/py-pdf/pypdf) from 3.16.3 to 3.17.0.
    - [Release notes](https://github.com/py-pdf/pypdf/releases)
    - [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
    - [Commits](py-pdf/pypdf@3.16.3...3.17.0)

    ---
    updated-dependencies:
    - dependency-name: pypdf
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit cfbb90b
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 31 21:37:55 2023 -0700

    Add more readmes/guides (Azure-Samples#889)

    * Add more readmes/guides

    * Add image

    * Diagram added

commit 4479a2c
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 21:38:26 2023 -0700

    Handle errors better especialyl for streaming (Azure-Samples#884)

commit 0d54f84
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 17:58:09 2023 -0700

    Add exclude files (Azure-Samples#876)

commit 3647826
Author: Roderic Bos <github@rooc.nl>
Date:   Mon Oct 30 17:35:50 2023 +0100

    When using the option --storagekey for the prepdocs script the key might (Azure-Samples#866)

    contain `==` base64 padding at the end. This will fail to succesfully
    login because the script just removes the `=` signs during the split
    action. Copied the version from the app/start.ps1 which is better suited here.

    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit 5daa934
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 30 09:17:55 2023 -0700

    Bump the github-actions group with 1 update (Azure-Samples#880)

    Bumps the github-actions group with 1 update: [actions/setup-node](https://github.com/actions/setup-node).

    - [Release notes](https://github.com/actions/setup-node/releases)
    - [Commits](actions/setup-node@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: actions/setup-node
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: github-actions
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit bfb3ee5
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Fri Oct 27 11:04:37 2023 -0700

    Render entire PDFs instead of single pages (Azure-Samples#840)

    * Adding anchors

    * Show whole file

    * Show whole file

    * Page number support

    * More experiments with whole file

    * Revert unintentional changes

    * Add tests

    * Remove random num

    * Add retry_total=0 to avoid unnecessary network requests

    * Add comment to explain retry_total

    * Bring back deleted file

    * Blob manager refactor after merge

    * Update coverage amount

    * Make mypy happy with explicit check of path

    * Add debug for 3.9

    * Skip in 3.9 since its silly

    * Reduce fail under percentage due to 3.9

commit c989048
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Oct 26 20:01:18 2023 -0700

    add screenshot (Azure-Samples#870)

commit a64a12e
Author: Matt <57731498+mattmsft@users.noreply.github.com>
Date:   Thu Oct 26 12:49:02 2023 -0700

    Refactor prepdocs (Azure-Samples#862)

    * setting up types

    * setting up more types...

    * working on it...

    * prepdocs refactor

    * typing fixes; updating tests

    * more fixes; updating tests

    * fixing retry for embeddings

    * fixing adls gen2 list

    * more test fixes

    * fixes from manual runs

    * more fixes

    * more fixes

    * type fixes

    * more type fixes and test fixes

    * break into modules

    * openai embedding fix

    * novectors fix

    * fix id generation

    * doc strings

    * feedback from pr

    * rename feedback

    * trying to get imports to work

    * update test workflow with pamela's suggestion

    * fix ci again

    * delete __init__

    * mypy configuration

    ---------

    Co-authored-by: Matt Gotteiner <magottei@microsoft.com>

commit 94be632
Author: MaciejLitwiniec <MaciejLitwiniec@users.noreply.github.com>
Date:   Thu Oct 26 15:23:00 2023 +0200

    Updated FAQ so that it reflect PR 835 (Azure-Samples#868)

    * Updated FAQ so that it reflect PR 835

    * Update README.md

    * Update README.md

    ---------

    Co-authored-by: Pamela Fox <pamela.fox@gmail.com>

commit d7bbf9f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 25 14:23:05 2023 -0700

    Bump werkzeug from 3.0.0 to 3.0.1 in /app/backend (Azure-Samples#863)

    Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.0 to 3.0.1.
    - [Release notes](https://github.com/pallets/werkzeug/releases)
    - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
    - [Commits](pallets/werkzeug@3.0.0...3.0.1)

    ---
    updated-dependencies:
    - dependency-name: werkzeug
      dependency-type: indirect
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit d02aa14
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Wed Oct 25 13:48:45 2023 -0700

    Message builder improvements (Azure-Samples#852)

commit 8f55988
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 24 19:53:07 2023 -0700

    Reorder tags to optimize for sample browser (Azure-Samples#853)

commit 16a61bf
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 23 17:03:59 2023 -0700

    Improve follow-up questions and pipe into context (Azure-Samples#832)

    * Add follow-up questions and parsing

    * Test breaking the e2e

    * Actually run tests

    * Fix runner

    * Add conditional

    * Fix the test

    * chat approach

commit ca01af9
Author: Anthony Shaw <anthony.p.shaw@gmail.com>
Date:   Mon Oct 23 09:27:32 2023 +1100

    Store an MD5 hash of uploaded/indexed file and check before prepdocs (Azure-Samples#835)

    * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed

    * Overwrite the hash when it changes

    * Remove open mode parameter

    * fix f-string

    * reformat changes

    * Update prepdocs.py
HughRunyan pushed a commit to RMI/RMI_chatbot that referenced this pull request Mar 26, 2024
…zure-Samples#835)

* Hash the uploaded files locally and skip them if you provision a second time and they haven't changed

* Overwrite the hash when it changes

* Remove open mode parameter

* fix f-string

* reformat changes

* Update prepdocs.py
# if there is a file called .md5 in this directory, see if its updated
stored_hash = None
with open(filename, "rb") as file:
existing_hash = hashlib.md5(file.read()).hexdigest()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called something like : new_hash. Here we are calculating a hash of a new file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants