Not able to access Azure Delta Lake #600
Hi @ganesh-gawande, thanks for this report - could you share the basic layout of your storage account? Specifically, where is the `_delta_log` folder located?
Hi @roeap -
Not sure I follow - does this mean the delta table is located at the root of the container within the storage account? In your example you mention the URL of the container directly. Also, would it be possible to share the (possibly redacted) contents of the `_delta_log` folder?
Yes, the delta table is located at the root of the container of the storage account. This means that when I open the container in Storage Explorer, I can see the `_delta_log` folder directly. More info - I have partitioned the delta table based on Id and date, so the partition folders sit next to the `_delta_log` folder.
I don't see the json file with the name you mentioned. I can see multiple json files with different numbers, e.g. 00000000000000086096.json, 00000000000000086097.json, 00000000000000086098.json, etc.
Thanks! In that case `adls2://{ContainerName}@{StorageAccountName}.dfs.core.windows.net` should be the way to go. The fact that we are getting the error you mentioned should already mean that we can access the storage, but it seems we are not finding the entry point to the logs. AFAIK the log files should never be deleted, and the all-zero file I mentioned denotes the initial commit. I have to look a bit into our codebase to remind myself how delta-rs starts parsing the log. Given the number of commits, there should also be a file called `_last_checkpoint` in the `_delta_log` folder. So my question is: does that file exist? @houqp - is that correct?
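For context, this is roughly what the `_last_checkpoint` lookup amounts to - a minimal sketch of my own (not delta-rs code), assuming a single-part checkpoint and the standard Delta log layout:

```python
import json
from pathlib import Path

def latest_checkpoint(table_root: str) -> Path:
    """Resolve the checkpoint parquet file referenced by _last_checkpoint."""
    log_dir = Path(table_root) / "_delta_log"
    # _last_checkpoint is a small JSON file, e.g. {"version": 86090, "size": 42}
    meta = json.loads((log_dir / "_last_checkpoint").read_text())
    # Commit and checkpoint files use the version zero-padded to 20 digits.
    return log_dir / f"{meta['version']:020d}.checkpoint.parquet"
```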
Thanks for the insights. This is still not working. I have checked the logs folder of my other delta lake storage as well, but did not find a `_last_checkpoint` file there either. Having said that, in all the delta lake log folders I can locate a checkpoint file. Let me know when you have taken a look at the code to check how delta-rs starts parsing the logs.
Just making sure - the logic is roughly like this: check if we find the `_last_checkpoint` file, and if so, start from the checkpoint it references; otherwise start from version 0.
No file with the name `_last_checkpoint` is present. The checkpoint files are named like `<version>.checkpoint.parquet` - those are the only files with checkpoint in the name. More information:
Hmm, strange... this seems like a corruption in the delta log to me. When databricks creates a checkpoint, it should also create a `_last_checkpoint` file. One way to load the table could be to use the `version` parameter and point it at a version for which a commit file still exists. If you use the Python bindings, you can pass `version` when constructing `DeltaTable`. I'd be interested to know if databricks is able to load that table without specifying a specific version.
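For reference, loading at a pinned version with the Python bindings looks like this (the version number here is purely illustrative):

```python
from deltalake import DeltaTable

# Load the table state as of a specific commit instead of resolving "latest".
table = DeltaTable(
    "adls2://{StorageAccountName}/{ContainerName}/",
    version=86096,  # illustrative: pick a version whose commit file still exists
)
print(table.version())
```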
I am able to query the delta table correctly from Databricks notebooks. I assume that also relies on the delta logs and the latest-checkpoint concept, so if those delta logs were corrupted, the same problem should appear when the delta table is queried from a Databricks notebook. Is this understanding correct? I have the following setting in the Databricks cluster spark configuration - does this have any relation to the names of the delta log or checkpoint files? `spark.databricks.delta.symlinkFormatManifest.fileSystemCheck.enabled false`
@roeap Actually, the first delta entry is not guaranteed to exist. See my update in delta-io/delta#913. Not sure if we are testing that in this repo though.
@ganesh-gawande - sorry for sending you on a wild goose chase ... I recreated the behaviour locally. Seems like we have a bug where we cannot read a table from the root of a container. I opened #602 to track this and should soon be able to get to this.
@wjones127 - would we then expect that the `_last_checkpoint` file always exists?
It's vague in the protocol, but I don't think it necessarily exists. (And it's also not guaranteed to point to the most recent checkpoint.) We probably shouldn't rely on it 😢
@ganesh-gawande - so the path you should be using is `adls2://{StorageAccountName}/{ContainerName}/`. However, I also tried loading a delta log with the initial commit files removed, which only works if there is a `_last_checkpoint` file. @wjones127 @houqp - I do remember the protocol explicitly mentioning lexicographical sort to work with the log (see the sketch below). Should we implement that logic, or first make sure that delta needs to support finding checkpoints without that file - or are we already sure? :) I guess the core logic for loading a specific version can already largely be reused. Likely we would also want to mirror the logic in our writers to create a checkpoint every ten commits.
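A sketch of what that lexicographical fallback could look like - my own illustration, not the delta-rs implementation. Because version numbers are zero-padded to 20 digits, lexicographic and numeric order coincide:

```python
from pathlib import Path
from typing import Optional

def find_latest_checkpoint(log_dir: str) -> Optional[Path]:
    """Fallback when _last_checkpoint is absent: scan and sort the log dir."""
    # sorted() on the zero-padded names equals sorting by version number.
    checkpoints = sorted(Path(log_dir).glob("*.checkpoint.parquet"))
    return checkpoints[-1] if checkpoints else None
```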
Kind of lost track... just making sure, so to clarify: did you also try the test table you created which DOES have a `00000000000000000000.json` file?
Yes - if you see the message with the screenshot, there I had created a new storage account and a new delta table, which has a `00000000000000000000.json` file as well. I am still not able to access it. I am getting the error: Not a Delta table: No snapshot or version 0 found, perhaps adls2://sampledeltalakestorage/sample is an empty dir?
Also if you add a trailing "/"? This is actually the bug that gets fixed in #603 - i.e. if the trailing slash is missing, it also fails for me...
Yes - whether I add a trailing / or not, I still get the same error.
At which point do you see the error? I just tried with the released python package: I can load the table, get metadata, history etc., but am also seeing an error when trying to materialize the dataset. Could you try something like

```python
table = DeltaTable("adls2://{StorageAccountName}/{ContainerName}/")
table.pyarrow_schema()
```

That would help narrow down the source of the error.
Can you please confirm the version you are using? Or point me to the URL to download the latest version so that I can check again?
I used the released version.
When I try to install or upgrade to the new version, I get the following errors. See the first and last lines: it starts downloading 0.5.7, but in the end it installs 0.5.6. Any thoughts on this? I have also upgraded pip to 21.3.1.

```
Collecting deltalake
Cargo, the Rust package manager, is not installed or is not on PATH.
Checking for Rust toolchain....WARNING: Discarding https://files.pythonhosted.org/packages/4f/6c/fe7dafb8e4fed25e97652c1ab1bbd73ae4fb1f32881abc73e9dcaabe1167/deltalake-0.5.7.tar.gz#sha256=b14f7417f72fa363519e7080ed9c99f4fc31f93a8af8428fae2370f090297bc6 (from https://pypi.org/simple/deltalake/) (requires-python:>=3.6). Command errored out with exit status 1: 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\python.exe' 'C:\Users\g.gawande\AppData\Local\Programs\Python\Python36\lib\site-packages\pip_vendor\pep517\in_process_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\G4071~1.GAW\AppData\Local\Temp\tmprw4bi353' Check the logs for full command output.
```
Seems like your system is choosing a source distribution in the case of 0.5.7, while using a pre-compiled wheel in the case of 0.5.6. As you seem to not have cargo (i.e. the rust toolchain) installed on your system, it's failing to build the package locally. As to why that is the case I am not sure, as I am not too familiar with how python (or pip) chooses which install method / artifact to use.
@ganesh-gawande - I was able to dig into loading the table from python. Turns out it was #602 which also caused the issue, since the trailing slash gets truncated when we initialize the file system internally. I was able to load the table using the following workaround:

```python
from deltalake import DeltaTable
from deltalake.fs import DeltaStorageHandler
import pyarrow.fs as pa_fs

path = "adls2://{StorageAccountName}/{ContainerName}/"
table = DeltaTable(path)
filesystem = pa_fs.PyFileSystem(DeltaStorageHandler(path))
ds = table.to_pyarrow_dataset(filesystem=filesystem)
```
@ganesh-gawande is this still relevant, or did you manage to resolve this?
I was able to upgrade to release version 0.5.7 (for that I needed to upgrade my Python version to 3.8.10, then upgrade pip, and then I could upgrade to 0.5.7). After that, I tried the code you gave in #600 (comment). I created a new Delta Lake at the root of the container, which has the `_delta_log` folder. Still I am getting the issue - Not a Delta table: No snapshot or version 0 found. Please find the attached screenshot.
@roeap - please let me know if you need more information on this. I can share the code file and the associated storage account keys with you offline if you want to check it once.
Should be very shortly - the release PR just needs to be updated and merged, but all the pending work we wanted to include is on main.
@ganesh-gawande - the new python bindings are released, could you check if that mitigates your error? |
@roeap - unfortunately I am getting the same error as earlier. I have removed the earlier deltalake package version and installed the new 0.5.8. Sharing all the details again here for your quick reference: my Azure storage structure with the storage account name and container name, the contents of the `_delta_log` folder, and the code snippet I am using, as shared by you. I am getting the same error as before - Not a Delta table: No snapshot or version 0 found.
I am also getting the same issue with the code snippet I used, and the same error as above. Is there any solution for the issue?
Hmm, this is a bit puzzling. Could you remind me which authorization mechanism you are using, and also validate that you can read and list on that account? During tests, as well as at my work, we are successfully working with tables stored in Azure, and there really is not much more one can do other than provide proper authorization. Given your code snippet (and probably discussed above somewhere :)) I assume you are using environment variables, right? There were some issues in Azure list operations that have been fixed recently, but reading a table with the initial version file present should not require any list operation... In any case, if you can build off main and see if that works, that is always helpful :).
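For reference, environment-variable based auth looks roughly like this. `AZURE_STORAGE_ACCOUNT_NAME` is the variable mentioned later in this thread; the key variable name is my assumption - check the ADLS Gen2 how-to for the exact spelling:

```python
import os
from deltalake import DeltaTable

os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "<my-account>"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "<my-key>"  # assumed variable name

table = DeltaTable("adls2://<my-account>/<my-container>/")
print(table.files()[:5])  # sanity check that read and list both work
```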
@roeap I am hitting this same error message today when trying to write to S3 and locally with `deltalake.writer.write_deltalake`. The error message appears when passing a `DeltaTable` object, but not when passing a string. Hopefully this helps with a repro case here.
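A minimal local repro along those lines - my reconstruction, since the original snippet was not preserved:

```python
import pandas as pd
from deltalake import DeltaTable
from deltalake.writer import write_deltalake

df = pd.DataFrame({"id": [1, 2, 3]})

write_deltalake("/tmp/repro_table", df)  # passing a string URI: works
dt = DeltaTable("/tmp/repro_table")
write_deltalake(dt, df, mode="append")   # passing a DeltaTable: raised the error
```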
@michaelenew - thanks for the report! Currently, writing to remote stores is not supported using `write_deltalake`.
Update: I think something has changed in 0.6.0 and the docs aren't published yet. I re-installed with target version 0.5.8 and adls2 worked for reading the table. I'll wait for the 0.6.0 docs to see what may have changed and then try the newer version.

@roeap I am running the latest release 0.6.0 and also running into the Azure error "Not a Delta table: No snapshot or version 0 found, perhaps adls2://accountname/sandbox/taxi_data is an empty dir?" I have pulled the delta table directory down locally, run the DeltaTable call on it, and it works fine. I have double-checked that the ENV entries are coming through - if I remove them, the error output complains about missing auth. I can also use azure.storage.blob with the key and account name and list every file. Below is a summary.
@ganesh-gawande @cdena - we have released a new version, could you check if that mitigates your error?
When will writing to remote stores be supported? @roeap
Current main already has support for writing to remote stores; it should be part of the next release.
May I please have an example of how this works? How and where do I pass the auth parameters for remote Azure blob storage?
In the docs there is an example, as well as a link to the available Azure options. The same storage options can also be passed to the `write_deltalake` function - see the sketch below.
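A sketch of passing those options to the writer, assuming a build with remote-write support (per the discussion above); account and container names are placeholders:

```python
import pandas as pd
from deltalake.writer import write_deltalake

storage_options = {
    "account_name": "<my-account>",
    "account_key": "<my-key>",
}

df = pd.DataFrame({"value": [1, 2, 3]})
write_deltalake(
    "az://<my-container>/some/path",
    df,
    mode="append",
    storage_options=storage_options,
)
```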
@roeap I am doing the below ... and getting the below error ...
You are missing a setting for `AZURE_STORAGE_ACCOUNT_NAME`. By the way, Azure unfortunately is somewhat "special" here. Usually I would recommend using one of the fsspec-based filesystem implementations. In case you have the capacity, I would be really interested to see the difference in performance between the way you specified it (which uses the rust filesystem wrapped in several translation layers) and using the fsspec filesystem directly - see the sketch below. Eventually we will identify the bottlenecks and bring up performance.
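A sketch of that fsspec route, assuming the `adlfs` package provides the Azure filesystem - my illustration, not an official recipe:

```python
import pyarrow.fs as pa_fs
from adlfs import AzureBlobFileSystem
from deltalake import DeltaTable

creds = {"account_name": "<my-account>", "account_key": "<my-key>"}
table = DeltaTable("az://<my-container>/some/path", storage_options=creds)

# Wrap the fsspec filesystem so pyarrow datasets can use it directly,
# bypassing the rust filesystem translation layers mentioned above.
fs = pa_fs.PyFileSystem(pa_fs.FSSpecHandler(AzureBlobFileSystem(**creds)))
ds = table.to_pyarrow_dataset(filesystem=fs)
```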
I do not think `AZURE_STORAGE_ACCOUNT_NAME` is the issue, because I have specified it and I can even get the delta table version. @roeap
Just to clarify - are you using a build of main, or a released version? The released versions right now do not yet really support writing to remote storages. If you are using main, this may be a bug, but you may be able to circumvent it via the storage options.
I am using the released version, which I installed via pip. @roeap
Hello - as an update, I downgraded to version 0.5.8.
Hmm, I and many others are successfully using deltalake with Azure storage. It could be that the space in the URL is a problem. The latest release also fixed a bug using client id / secret. Alternatively, could you try the recommended `az://...` URL format and set the account via parameters?
@roeap - any help would be appreciated.
@bkazour - sure :). The format is

```python
storage_options = {
    "account_name": "<my-account>",
    "account_key": "<my-key>",
}
```

That said, the config you provided should work as well. It may be that there is an issue with a space within the paths, although the underlying object store crate does rigorous testing on all sorts of edge cases in which paths may be specified. Let me know if that works; otherwise I'll have to do some investigating.
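To close the loop on the question above - a usage sketch passing those options when opening the table; the `az://` scheme and `storage_options` parameter are the ones confirmed later in this thread, and the path is a placeholder:

```python
from deltalake import DeltaTable

table = DeltaTable("az://<my-container>/path/to/table", storage_options=storage_options)
print(table.version())
```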
@roeap It is definitely the space in the name. Once I tried it with a file that has no space, it read normally.
I opened #1156 to keep track of the space-character encoding. Closing this generic issue, since I validated various Azure auth methods. Feel free to re-open if the issue persists.
@roeap - I confirmed that the issue reported above is resolved in version 0.7.0. I am able to connect to the Azure storage account with the changed path format `az://{containerName}/path` and with the storage options parameter.
Discussed in #599
Originally posted by ganesh-gawande May 9, 2022
Hi,
I am using the documentation - https://github.com/delta-io/delta-rs/blob/main/docs/ADLSGen2-HOWTO.md
I have tried many versions of the paths, but I am not able to access the Delta lake.
The following error is received: Not a Delta table: No snapshot or version 0 found OR Invalid object URI.
Here are the paths I have tried in my code, but nothing works.