
Export Google Presentation to pptx instead of using get_media #1665

Closed
wants to merge 1 commit into from

Conversation


@sacovo sacovo commented Jun 19, 2024

Fixes #1664 by exporting the presentation as a pptx file.


vercel bot commented Jun 19, 2024

@sacovo is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.


onimsha commented Jun 20, 2024

@sacovo I tested your PR, and got this error

googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/drive/v3/files/my-file-name/export?mimeType=application%2Fvnd.openxmlformats-officedocument.presentationml.presentation returned "This file cannot be exported by the user.". Details: "[{'message': 'This file cannot be exported by the user.', 'domain': 'global', 'reason': 'cannotExportFile'}]">

Have you tested it on your end?
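For what it's worth, errors like this are easiest to triage by the machine-readable `reason` field in the Drive v3 error body (`cannotExportFile` here, `exportSizeLimitExceeded` for oversized exports). A small helper to pull it out — this function is hypothetical, not part of the PR, and assumes the standard Drive v3 error envelope:

```python
import json


def drive_error_reason(error_content: bytes) -> str:
    """Extract the machine-readable 'reason' (e.g. 'cannotExportFile',
    'exportSizeLimitExceeded') from a Drive v3 HttpError response body."""
    payload = json.loads(error_content)
    errors = payload.get("error", {}).get("errors", [])
    return errors[0].get("reason", "unknown") if errors else "unknown"
```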


sacovo commented Jun 20, 2024

I didn't run it directly in the backend, but I checked that the method works when supplying ids. This works for me:

from google.oauth2.service_account import Credentials
from googleapiclient import discovery  # type: ignore


def main():
    credentials = Credentials.from_service_account_file('...')

    service = discovery.build("drive", "v3", credentials=credentials)

    files = service.files()

    file_id = "..."
    
    print(files.get(fileId=file_id).execute())
    # {'kind': 'drive#file', 'id': '...', 'name': '...', 'mimeType': 'application/vnd.google-apps.presentation'}
    
    content = files.export(fileId=file_id, mimeType="application/vnd.openxmlformats-officedocument.presentationml.presentation").execute()
    
    print(content[:30]) # Some binary data
    
    try:
        files.get_media(fileId=file_id).execute()
    except Exception as ex:
        print(ex)  # HttpError 403: Only files with binary content can be downloaded. Use Export with Docs Editors files.

if __name__ == "__main__":
    main()

Do you get different output?
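The distinction the script demonstrates: Docs Editors files (MIME types under `application/vnd.google-apps.`) have no binary content and must go through `export`, while everything else can use `get_media`. A minimal dispatcher sketch — the `files` handle is as in the script above, and this helper itself is hypothetical, not code from the PR:

```python
GOOGLE_APPS_PREFIX = "application/vnd.google-apps."
PPTX_MIME = (
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation"
)


def download_file_bytes(files, file_id: str, mime_type: str) -> bytes:
    """Export Docs Editors files (which have no binary content);
    download raw bytes for everything else."""
    if mime_type.startswith(GOOGLE_APPS_PREFIX):
        return files.export(fileId=file_id, mimeType=PPTX_MIME).execute()
    return files.get_media(fileId=file_id).execute()
```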

@onimsha

onimsha commented Jun 25, 2024

> I didn't run it directly in the backend, but I checked that the method works when supplying ids. […] Do you get different output?

I'm not so sure. After running with this new logic, I reindexed all the files in Google Drive, but all the presentations are being marked as ignore_for_qa, which indicates that Danswer can't extract the text from these files. I will need to set up a debug script similar to yours to see what my problem is.

@onimsha

onimsha commented Jun 25, 2024

I tested on my side, and one problem is that it usually hits this error:

[{'message': 'This file is too large to be exported.', 'domain': 'global', 'reason': 'exportSizeLimitExceeded'}]

It turns out that when exporting a Google spreadsheet, the exported file is usually large. I found a relevant workaround for this limitation (https://stackoverflow.com/questions/40890534/google-drive-rest-api-files-export-limitation); I'll test it to see if it works.
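For reference, the workaround in that Stack Overflow thread bypasses `files.export` (which caps exports at roughly 10 MB) by hitting the Docs export endpoint directly. A sketch of the URL construction — this endpoint is undocumented and this helper is hypothetical, so treat it as best-effort:

```python
def build_slides_export_url(file_id: str, export_format: str = "pptx") -> str:
    """Build the direct (undocumented) Slides export URL, which is not
    subject to the size limit of the Drive v3 files.export method."""
    return (
        "https://docs.google.com/feeds/download/presentations/Export"
        f"?id={file_id}&exportFormat={export_format}"
    )
```

The actual request still needs an `Authorization: Bearer <access_token>` header carrying a valid OAuth token.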

@broken-wheel

This is my workaround to circumvent the size restriction:

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py	(revision c7af6a4601d897718400c52a95ccd2cb8fd53cca)
+++ b/backend/danswer/connectors/google_drive/connector.py	(revision e41e0cc06c75fcd74c0114d6bed92ca132749dc9)
@@ -8,6 +8,8 @@
 from typing import Any
 from typing import cast
 
+import requests
+
 from google.oauth2.credentials import Credentials as OAuthCredentials  # type: ignore
 from google.oauth2.service_account import Credentials as ServiceAccountCredentials  # type: ignore
 from googleapiclient import discovery  # type: ignore
@@ -304,7 +306,7 @@
                 )
 
 
-def extract_text(file: dict[str, str], service: discovery.Resource) -> str:
+def extract_text(file: dict[str, str], service: discovery.Resource, credentials: ServiceAccountCredentials) -> str:
     mime_type = file["mimeType"]
     if mime_type not in set(item.value for item in GDriveMimeType):
         # Unsupported file types can still have a title, finding this way is still useful
@@ -334,8 +336,12 @@
         response = service.files().get_media(fileId=file["id"]).execute()
         return pptx_to_text(file=io.BytesIO(response))
     elif mime_type == GDriveMimeType.PPT.value:
-        response = service.files().get_media(fileId=file["id"]).execute()
-        return pptx_to_text(file=io.BytesIO(response))
+        access_token = credentials.token
+        url = f"https://docs.google.com/feeds/download/presentations/Export?id={file['id']}&exportFormat=pptx"
+        headers = {"Authorization": f"Bearer {access_token}"}
+        response = requests.get(url, headers=headers, timeout=300)  # 5 minutes timeout
+        file_data = response.content
+        return pptx_to_text(file=io.BytesIO(file_data))
 
     return UNSUPPORTED_FILE_TYPE_CONTENT
 
@@ -487,7 +493,7 @@
                         ):
                             continue
 
-                    text_contents = extract_text(file, service) or ""
+                    text_contents = extract_text(file, service, self.creds) or ""
 
                     doc_batch.append(
                         Document(

@broken-wheel

I also ran into an issue with pptx_to_text failing with a BadZipFile exception. I would rather have indexing continue with problem files skipped than fail at the first error, so I changed the above slightly:

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py	(revision acc6f0fcc1f51a876bdb6bb6593139546073b05b)
+++ b/backend/danswer/connectors/google_drive/connector.py	(revision 6ca9a2045872c585ce49ccab321f767482515b01)
@@ -7,6 +7,7 @@
 from itertools import chain
 from typing import Any
 from typing import cast
+from zipfile import BadZipFile
 
 import requests
 
@@ -354,7 +355,10 @@
         headers = {"Authorization": f"Bearer {access_token}"}
         response = requests.get(url, headers=headers, timeout=300)  # 5 minutes timeout
         file_data = response.content
-        return pptx_to_text(file=io.BytesIO(file_data))
+        try:
+            return pptx_to_text(file=io.BytesIO(file_data))
+        except BadZipFile as exc:
+            logger.exception("Cannot parse pptx at url: %s", url, exc_info=exc)
 
     return UNSUPPORTED_FILE_TYPE_CONTENT
 

Successfully merging this pull request may close these issues.

Google Drive Connector: Only files with binary content can be downloaded. Use Export with Docs Editors files.
4 participants