Export Google Presentation to pptx instead of using get_media #1665
@sacovo is attempting to deploy a commit to the Danswer Team on Vercel. A member of the Team first needs to authorize it. |
@sacovo I tested your PR, and got this error
Have you tested it on your end? |
I didn't run it directly in the backend, but I checked that the method works when supplying ids. This works for me:

from google.oauth2.service_account import Credentials
from googleapiclient import discovery  # type: ignore


def main():
    credentials = Credentials.from_service_account_file('...')
    service = discovery.build("drive", "v3", credentials=credentials)
    files = service.files()
    file_id = "..."
    print(files.get(fileId=file_id).execute())
    # {'kind': 'drive#file', 'id': '...', 'name': '...', 'mimeType': 'application/vnd.google-apps.presentation'}
    content = files.export(fileId=file_id, mimeType="application/vnd.openxmlformats-officedocument.presentationml.presentation").execute()
    print(content[:30])  # Some binary data
    try:
        files.get_media(fileId=file_id).execute()
    except Exception as ex:
        print(ex)  # HttpError 403: Only files with binary content can be downloaded. Use Export with Docs Editors files.


if __name__ == "__main__":
    main()

Do you get different output? |
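The output above suggests a simple rule: Docs Editors files (MIME types under `application/vnd.google-apps.`) have to go through `files.export`, while files with binary content can go through `files.get_media`. A minimal sketch of that branching follows. It is not the connector's actual code; `pick_download_call` and the `EXPORT_TARGETS` table are hypothetical names used only for illustration.

```python
from typing import Optional, Tuple

GOOGLE_APPS_PREFIX = "application/vnd.google-apps."
PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"

# Hypothetical mapping from a Docs Editors MIME type to the export format.
EXPORT_TARGETS = {
    "application/vnd.google-apps.presentation": PPTX_MIME,
}


def pick_download_call(mime_type: str) -> Tuple[str, Optional[str]]:
    """Return (method_name, export_mime_type) for a Drive file's MIME type."""
    if mime_type.startswith(GOOGLE_APPS_PREFIX):
        # Docs Editors files have no binary content, so get_media answers 403.
        target = EXPORT_TARGETS.get(mime_type)
        if target is None:
            raise ValueError(f"no export format configured for {mime_type}")
        return "export", target
    # Uploaded binary files (for example a real .pptx) can be fetched directly.
    return "get_media", None
```

With the result in hand, the caller would invoke either `files.export(fileId=..., mimeType=...)` or `files.get_media(fileId=...)`, the two calls already used in the script above.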
I'm not so sure. After running with this new logic, I reindexed all the files in Google Drive but all the presentations are being marked at |
I tested on my side, and one of my problems is that it usually hits this error
It turns out that when exporting a Google spreadsheet, the exported file is usually large. I found a relevant article on working around this limitation (https://stackoverflow.com/questions/40890534/google-drive-rest-api-files-export-limitation). I'll test it to see if it works. |
This is my workaround to circumvent size restriction:

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py (revision c7af6a4601d897718400c52a95ccd2cb8fd53cca)
+++ b/backend/danswer/connectors/google_drive/connector.py (revision e41e0cc06c75fcd74c0114d6bed92ca132749dc9)
@@ -8,6 +8,8 @@
 from typing import Any
 from typing import cast
+import requests
+
 from google.oauth2.credentials import Credentials as OAuthCredentials # type: ignore
 from google.oauth2.service_account import Credentials as ServiceAccountCredentials # type: ignore
 from googleapiclient import discovery # type: ignore
@@ -304,7 +306,7 @@
     )
-def extract_text(file: dict[str, str], service: discovery.Resource) -> str:
+def extract_text(file: dict[str, str], service: discovery.Resource, credentials: ServiceAccountCredentials) -> str:
     mime_type = file["mimeType"]
     if mime_type not in set(item.value for item in GDriveMimeType):
         # Unsupported file types can still have a title, finding this way is still useful
@@ -334,8 +336,12 @@
         response = service.files().get_media(fileId=file["id"]).execute()
         return pptx_to_text(file=io.BytesIO(response))
     elif mime_type == GDriveMimeType.PPT.value:
-        response = service.files().get_media(fileId=file["id"]).execute()
-        return pptx_to_text(file=io.BytesIO(response))
+        access_token = credentials.token
+        url = f"https://docs.google.com/feeds/download/presentations/Export?id={file['id']}&exportFormat=pptx"
+        headers = {"Authorization": f"Bearer {access_token}"}
+        response = requests.get(url, headers=headers, timeout=300) # 5 minutes timeout
+        file_data = response.content
+        return pptx_to_text(file=io.BytesIO(file_data))
     return UNSUPPORTED_FILE_TYPE_CONTENT
@@ -487,7 +493,7 @@
                 ):
                     continue
-                text_contents = extract_text(file, service) or ""
+                text_contents = extract_text(file, service, self.creds) or ""
                 doc_batch.append(
                     Document( |
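Two things may be worth guarding in a workaround like this: `credentials.token` can be empty if the credentials have not been refreshed yet, and the file id is interpolated into the URL unescaped. A hedged sketch of building the request pieces separately follows. `build_export_request` is a hypothetical helper, and the endpoint is simply the one used in the diff, not something checked against Google's documentation.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode

# Endpoint copied from the workaround diff; treated here as an assumption.
EXPORT_ENDPOINT = "https://docs.google.com/feeds/download/presentations/Export"


def build_export_request(
    file_id: str, access_token: Optional[str]
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for exporting a presentation as pptx."""
    if not access_token:
        # Fail early instead of sending "Bearer None" to the server.
        raise ValueError("access token is missing; refresh the credentials first")
    # urlencode escapes the id so unusual characters cannot break the query.
    query = urlencode({"id": file_id, "exportFormat": "pptx"})
    headers = {"Authorization": f"Bearer {access_token}"}
    return f"{EXPORT_ENDPOINT}?{query}", headers
```

The returned pair can be passed to `requests.get(url, headers=headers, timeout=300)` as in the diff. Calling `response.raise_for_status()` before reading `response.content` would surface HTTP errors before the bytes reach `pptx_to_text`.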
I also ran into an issue with

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py (revision acc6f0fcc1f51a876bdb6bb6593139546073b05b)
+++ b/backend/danswer/connectors/google_drive/connector.py (revision 6ca9a2045872c585ce49ccab321f767482515b01)
@@ -7,6 +7,7 @@
 from itertools import chain
 from typing import Any
 from typing import cast
+from zipfile import BadZipFile
 import requests
@@ -354,7 +355,10 @@
         headers = {"Authorization": f"Bearer {access_token}"}
         response = requests.get(url, headers=headers, timeout=300) # 5 minutes timeout
         file_data = response.content
-        return pptx_to_text(file=io.BytesIO(file_data))
+        try:
+            return pptx_to_text(file=io.BytesIO(file_data))
+        except BadZipFile as exc:
+            logger.exception("Cannot parse pptx at url: %s", url, exc_info=exc)
     return UNSUPPORTED_FILE_TYPE_CONTENT
|
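`BadZipFile` is raised when the bytes are not a valid ZIP archive, and a pptx file is a ZIP container. Assuming the failing downloads are responses that are not pptx data at all (for example an error page), a cheap pre-check with the standard library could complement the `try`/`except` above. `looks_like_pptx` is a hypothetical helper name, not part of the connector.

```python
import io
import zipfile


def looks_like_pptx(data: bytes) -> bool:
    """Best-effort check that the bytes are an Office Open XML package."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            # Every OPC package (pptx, docx, xlsx) carries this part.
            return "[Content_Types].xml" in archive.namelist()
    except zipfile.BadZipFile:
        # is_zipfile only checks for the end-of-archive record, so a
        # damaged archive can still fail when it is actually opened.
        return False
```

A caller could log and return `UNSUPPORTED_FILE_TYPE_CONTENT` when this returns `False`, which keeps the failure path the same as in the diff.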
Fixes #1664 by exporting the presentation as a pptx file.