
Export Google Presentation to pptx instead of using get_media #1665

Closed
wants to merge 1 commit into from

Conversation


@sacovo sacovo commented Jun 19, 2024

Fixes #1664 by exporting the presentation as a pptx file.


vercel bot commented Jun 19, 2024

@sacovo is attempting to deploy a commit to the Danswer Team on Vercel.

A member of the Team first needs to authorize it.


onimsha commented Jun 20, 2024

@sacovo I tested your PR, and got this error

googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/drive/v3/files/my-file-name/export?mimeType=application%2Fvnd.openxmlformats-officedocument.presentationml.presentation returned "This file cannot be exported by the user.". Details: "[{'message': 'This file cannot be exported by the user.', 'domain': 'global', 'reason': 'cannotExportFile'}]">

Have you tested it on your end?
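For what it's worth, errors like this are easiest to triage by the machine-readable `reason` field in the Drive v3 error body (`cannotExportFile` here, `exportSizeLimitExceeded` for oversized exports). A small helper to pull it out — this function is hypothetical, not part of the PR, and assumes the standard Drive v3 error envelope:

```python
import json


def drive_error_reason(error_content: bytes) -> str:
    """Extract the machine-readable 'reason' (e.g. 'cannotExportFile',
    'exportSizeLimitExceeded') from a Drive v3 HttpError response body."""
    payload = json.loads(error_content)
    errors = payload.get("error", {}).get("errors", [])
    return errors[0].get("reason", "unknown") if errors else "unknown"
```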


sacovo commented Jun 20, 2024

I didn't run it directly in the backend, but I checked that the method works when supplying ids. This works for me:

from google.oauth2.service_account import Credentials
from googleapiclient import discovery  # type: ignore


def main():
    credentials = Credentials.from_service_account_file('...')

    service = discovery.build("drive", "v3", credentials=credentials)

    files = service.files()

    file_id = "..."
    
    print(files.get(fileId=file_id).execute())
    # {'kind': 'drive#file', 'id': '...', 'name': '...', 'mimeType': 'application/vnd.google-apps.presentation'}
    
    content = files.export(fileId=file_id, mimeType="application/vnd.openxmlformats-officedocument.presentationml.presentation").execute()
    
    print(content[:30]) # Some binary data
    
    try:
        files.get_media(fileId=file_id).execute()
    except Exception as ex:
        print(ex)  # HttpError 403: Only files with binary content can be downloaded. Use Export with Docs Editors files.

if __name__ == "__main__":
    main()

Do you get different output?
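The distinction the script demonstrates: Docs Editors files (MIME types under `application/vnd.google-apps.`) have no binary content and must go through `export`, while everything else can use `get_media`. A minimal dispatcher sketch — the `files` handle is as in the script above, and this helper itself is hypothetical, not code from the PR:

```python
GOOGLE_APPS_PREFIX = "application/vnd.google-apps."
PPTX_MIME = (
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation"
)


def download_file_bytes(files, file_id: str, mime_type: str) -> bytes:
    """Export Docs Editors files (which have no binary content);
    download raw bytes for everything else."""
    if mime_type.startswith(GOOGLE_APPS_PREFIX):
        return files.export(fileId=file_id, mimeType=PPTX_MIME).execute()
    return files.get_media(fileId=file_id).execute()
```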

@onimsha

onimsha commented Jun 25, 2024

> I didn't run it directly in the backend, but I checked that the method works when supplying ids. […] Do you get different output?

I'm not so sure. After running with this new logic, I reindexed all the files in Google Drive, but all the presentations are being marked as ignore_for_qa, which indicates that Danswer can't extract the text from these files. I will need to set up a debug script similar to yours to see what my problem is.

@onimsha

onimsha commented Jun 25, 2024

I tested on my side, and one problem is that it usually hits this error:

[{'message': 'This file is too large to be exported.', 'domain': 'global', 'reason': 'exportSizeLimitExceeded'}]

It turns out that when exporting a Google spreadsheet, the exported file is usually large. I found a relevant workaround for this limitation (https://stackoverflow.com/questions/40890534/google-drive-rest-api-files-export-limitation); I'll test it to see if it works.
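For reference, the workaround in that Stack Overflow thread bypasses `files.export` (which caps exports at roughly 10 MB) by hitting the Docs export endpoint directly. A sketch of the URL construction — this endpoint is undocumented and this helper is hypothetical, so treat it as best-effort:

```python
def build_slides_export_url(file_id: str, export_format: str = "pptx") -> str:
    """Build the direct (undocumented) Slides export URL, which is not
    subject to the size limit of the Drive v3 files.export method."""
    return (
        "https://docs.google.com/feeds/download/presentations/Export"
        f"?id={file_id}&exportFormat={export_format}"
    )
```

The actual request still needs an `Authorization: Bearer <access_token>` header carrying a valid OAuth token.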

@broken-wheel

This is my workaround to circumvent the size restriction:

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py	(revision c7af6a4601d897718400c52a95ccd2cb8fd53cca)
+++ b/backend/danswer/connectors/google_drive/connector.py	(revision e41e0cc06c75fcd74c0114d6bed92ca132749dc9)
@@ -8,6 +8,8 @@
 from typing import Any
 from typing import cast
 
+import requests
+
 from google.oauth2.credentials import Credentials as OAuthCredentials  # type: ignore
 from google.oauth2.service_account import Credentials as ServiceAccountCredentials  # type: ignore
 from googleapiclient import discovery  # type: ignore
@@ -304,7 +306,7 @@
                 )
 
 
-def extract_text(file: dict[str, str], service: discovery.Resource) -> str:
+def extract_text(file: dict[str, str], service: discovery.Resource, credentials: ServiceAccountCredentials) -> str:
     mime_type = file["mimeType"]
     if mime_type not in set(item.value for item in GDriveMimeType):
         # Unsupported file types can still have a title, finding this way is still useful
@@ -334,8 +336,12 @@
         response = service.files().get_media(fileId=file["id"]).execute()
         return pptx_to_text(file=io.BytesIO(response))
     elif mime_type == GDriveMimeType.PPT.value:
-        response = service.files().get_media(fileId=file["id"]).execute()
-        return pptx_to_text(file=io.BytesIO(response))
+        access_token = credentials.token
+        url = f"https://docs.google.com/feeds/download/presentations/Export?id={file['id']}&exportFormat=pptx"
+        headers = {"Authorization": f"Bearer {access_token}"}
+        response = requests.get(url, headers=headers, timeout=300)  # 5 minutes timeout
+        file_data = response.content
+        return pptx_to_text(file=io.BytesIO(file_data))
 
     return UNSUPPORTED_FILE_TYPE_CONTENT
 
@@ -487,7 +493,7 @@
                         ):
                             continue
 
-                    text_contents = extract_text(file, service) or ""
+                    text_contents = extract_text(file, service, self.creds) or ""
 
                     doc_batch.append(
                         Document(

@broken-wheel

I also ran into an issue with pptx_to_text failing with a BadZipFile exception. I would rather have indexing continue with problem files skipped than fail at the first error, so I changed the above slightly:

diff --git a/backend/danswer/connectors/google_drive/connector.py b/backend/danswer/connectors/google_drive/connector.py
--- a/backend/danswer/connectors/google_drive/connector.py	(revision acc6f0fcc1f51a876bdb6bb6593139546073b05b)
+++ b/backend/danswer/connectors/google_drive/connector.py	(revision 6ca9a2045872c585ce49ccab321f767482515b01)
@@ -7,6 +7,7 @@
 from itertools import chain
 from typing import Any
 from typing import cast
+from zipfile import BadZipFile
 
 import requests
 
@@ -354,7 +355,10 @@
         headers = {"Authorization": f"Bearer {access_token}"}
         response = requests.get(url, headers=headers, timeout=300)  # 5 minutes timeout
         file_data = response.content
-        return pptx_to_text(file=io.BytesIO(file_data))
+        try:
+            return pptx_to_text(file=io.BytesIO(file_data))
+        except BadZipFile as exc:
+            logger.exception("Cannot parse pptx at url: %s", url, exc_info=exc)
 
     return UNSUPPORTED_FILE_TYPE_CONTENT
 

Successfully merging this pull request may close these issues.

Google Drive Connector: Only files with binary content can be downloaded. Use Export with Docs Editors files.
4 participants