Unable to delete data before a given day #286

Open
mnuccioarpae opened this issue Feb 2, 2022 · 16 comments
@mnuccioarpae
Contributor

mnuccioarpae commented Feb 2, 2022

I need to delete the data before 2011-12-21 for station 6257. So I ran the following commands:

$ DS=/arkivio/arkimet/dataset/locali
$ arki-query 'area:VM2,6257;reftime:<2011-12-21' $DS > 6257-da-eliminare.arkimet
$ arki-check --fix --remove 6257-da-eliminare.arkimet "$DS"

As a check, I ran the following commands:

$ arki-query --summary --dump 'area:VM2,6257' $DS > 6257-summary.txt
$ grep Reftime: 6257-summary.txt | cut -d\  -f4 | sort -u
2011-12-21T00:00:00Z
2011-12-21T10:00:00Z
2011-12-21T11:00:00Z
2011-12-23T00:00:00Z
2014-12-31T23:30:00Z

However, if I restrict the query to a reftime interval that starts before 2011-12-21, I still see the old data. For example, limiting the query to variable 78, I get:

$ arki-query --summary --dump 'area:VM2,6257;product:VM2,78' $DS
SummaryItem:
  Product: VM2(78, bcode=B12101, l1=2000, lt1=103, p1=0, p2=3600, tr=0, unit=C)
  Area: VM2(6257,lat=4436161, lon=1192193, rep=locali)
SummaryStats:
  Count: 86525
  Size: 3156064
  Reftime: 2011-12-21T11:00:00Z to 2022-02-02T08:00:00Z

$ arki-query --summary --dump "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" $DS
SummaryItem:
  Product: VM2(78, bcode=B12101, l1=2000, lt1=103, p1=0, p2=3600, tr=0, unit=C)
  Area: VM2(6257,lat=4436161, lon=1192193, rep=locali)
SummaryStats:
  Count: 237
  Size: 8562
  Reftime: 2011-12-15T00:00:00Z to 2011-12-25T23:00:00Z

$ arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" $DS | head
201112150000,6257,78,6.5,,,000000000
201112150100,6257,78,6.2,,,000000000
201112150200,6257,78,5.9,,,000000000
201112150300,6257,78,5.8,,,000000000
201112150400,6257,78,5.7,,,000000000
201112150500,6257,78,5.7,,,000000000
201112150600,6257,78,5.6,,,000000000
201112150700,6257,78,5.5,,,000000000
201112150800,6257,78,5.5,,,000000000
201112150900,6257,78,5.5,,,000000000

What am I doing wrong?

Thanks

@mnuccioarpae changed the title from "How to delete data before a given day" to "Error deleting data before a given day" on Feb 7, 2022
@mnuccioarpae
Contributor Author

rm $DS/.summaries/* fixed the problem of wrong summary data.

However the data is still there even after having repeated the deletion.

@mnuccioarpae changed the title from "Error deleting data before a given day" to "Unable to delete data before a given day" on Feb 7, 2022
@spanezz
Contributor

spanezz commented Feb 7, 2022

I tried to reproduce the problem and it works as expected:

$ cd /arkivio/arkimet/dataset/locali
$ tar acf ~/issue286.tar.gz config 2011/12-*
$ cd ~
$ mkdir test
$ cd test
$ tar axf ~/issue286.tar.gz
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
25157
$ arki-query 'area:VM2,6257;reftime:<2011-12-21' . > /tmp/cancellare
(the file cancellare is 5.8M)
$ arki-check --fix --remove /tmp/cancellare  .
(the timestamps of the .index files up to 12-20.vm2.index have changed)
$ arki-query --data 'area:VM2,6257;reftime:<2011-12-21' . | wc -l
0

But once I try copypasting your query, I get data:

$ arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" . | wc -l
109

On closer inspection, the data has been removed up to the 21st of December, but the later queries extend to the 25th of December, so they correctly show data between the 21st and the 25th.

It seems that arkimet is working as expected, but it's really hard to see the difference between 21 and 25 among all those numbers (it took me quite a while to see it, too)
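
When a long VM2 listing has to be checked by eye, printing only the earliest and latest reftime makes the covered interval explicit. The following is a minimal sketch: `vm2_reftime_range` is a hypothetical helper, not part of arkimet, and it assumes that the first comma-separated field of each record is the reftime in `YYYYMMDDhhmm` form, as in the `arki-query --data` output earlier in this thread.

```shell
# Print the earliest and latest reftime found in VM2 records read from stdin.
# Assumption: field 1 is a fixed-width YYYYMMDDhhmm timestamp, so plain
# string comparison orders the values chronologically.
vm2_reftime_range() {
    awk -F, '
        NF == 0 { next }                          # skip blank lines
        {
            t = $1 ""                             # force string comparison
            if (min == "" || t < min) min = t
            if (max == "" || t > max) max = t
        }
        END { if (min != "") print min, max }
    '
}
```

Piping the output of `arki-query --data "reftime:>=2011-12-15,<=2011-12-25;area:VM2,6257;product:VM2,78" .` into `vm2_reftime_range` prints both endpoints on one line, so a first reftime earlier than 2011-12-21 is easy to spot.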

@mnuccioarpae
Contributor Author

@spanezz if you look at the results, you can see that the reftime of the first record is 201112150000, which is 15 Dec 2011.

I suspect the problem is not reproducible on a small dataset. I have removed decades of data. The arki-check command took a long time to finish.

Maybe I can try removing the data in batches of smaller subsets, for example, one variable at a time.
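
Removing one variable at a time could be sketched as below. This is only an illustration under stated assumptions: `build_query` and `remove_in_batches` are hypothetical helper names, the list of variable IDs has to be supplied by the caller, and the query syntax simply combines the `area`, `product` and `reftime` matchers already used in this thread.

```shell
# Build the deletion query for one station, one variable and a cutoff date.
build_query() {
    station=$1 variable=$2 cutoff=$3
    printf 'area:VM2,%s;product:VM2,%s;reftime:<%s\n' "$station" "$variable" "$cutoff"
}

# Hypothetical driver: delete one variable at a time and stop at the first failure.
# Usage: remove_in_batches DATASET STATION CUTOFF VARIABLE...
remove_in_batches() {
    ds=$1 station=$2 cutoff=$3
    shift 3
    for variable in "$@"; do
        query=$(build_query "$station" "$variable" "$cutoff")
        refs="to-delete-$station-$variable.arkimet"
        # Save the references first, then pass them to arki-check,
        # exactly as in the two-step procedure from the first comment.
        arki-query "$query" "$ds" > "$refs" || return 1
        arki-check --fix --remove "$refs" "$ds" || return 1
    done
}
```

A call such as `remove_in_batches "$DS" 6257 2011-12-21 78` would process variable 78 only; each batch leaves its own `.arkimet` reference file behind, and a failing step stops the loop instead of going unnoticed.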

@spanezz
Contributor

spanezz commented Feb 9, 2022

It may indeed be time to optimize deletion by passing a query directly to arki-check --remove.

If we both worked on the same dataset, it looks like when I took a copy of it, the data had not been deleted. I now suspect something went wrong in the deletion when you ran it, and worked when I ran it on a subset of the dataset.

I'm also considering making arki-check --remove print statistics on the number of elements deleted.

@mnuccioarpae
Contributor Author

Unfortunately, I did not notice the error immediately because I checked only the summary with arki-query --summary, not the data, and the summary was wrong.

The arki-check command did not print any error message, and I forgot to check the exit status with "echo $?", so I cannot be sure that it ended without error. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?

Maybe we can check that the expected result is consistent with the final result. For example, we can check that the number of records before and after is correct.
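
Such a consistency check can be sketched with two small helpers. Both names (`count_vm2_before`, `check_expected_count`) are hypothetical, and the counting step assumes VM2 records with a `YYYYMMDDhhmm` reftime in the first field, as in the `arki-query --data` output shown above. The check looks at the data itself rather than at the summary.

```shell
# Count VM2 records on stdin whose reftime falls before CUTOFF (YYYYMMDD).
count_vm2_before() {
    awk -F, -v cutoff="$1" 'NF && substr($1, 1, 8) < cutoff { n++ } END { print n + 0 }'
}

# Check that AFTER equals BEFORE minus REMOVED; return non-zero otherwise.
check_expected_count() {
    before=$1 removed=$2 after=$3
    expected=$((before - removed))
    if [ "$after" -eq "$expected" ]; then
        echo "OK: $after records left"
    else
        echo "MISMATCH: expected $expected records, found $after" >&2
        return 1
    fi
}

# Possible usage around a deletion (commented out, needs a real dataset):
#   before=$(arki-query --data 'area:VM2,6257' "$DS" | wc -l)
#   removed=$(arki-query --data 'area:VM2,6257' "$DS" | count_vm2_before 20111221)
#   arki-check --fix --remove 6257-da-eliminare.arkimet "$DS"; echo "exit status: $?"
#   after=$(arki-query --data 'area:VM2,6257' "$DS" | wc -l)
#   check_expected_count "$before" "$removed" "$after"
```

Printing `$?` right after `arki-check` records whether the command reported a failure, and comparing the record counts catches the case where it exits cleanly but the data is still there.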

@mnuccioarpae
Contributor Author

It may indeed be time to optimize deletion by passing a query directly to arki-check --remove.

An advantage of the current system is that it forces you to back up the deleted data.

@spanezz
Contributor

spanezz commented Feb 9, 2022

It may indeed be time to optimize deletion by passing a query directly to arki-check --remove.

An advantage of the current system is that it forces you to back up the deleted data.

That is a very good point, I never considered it that way. It's a tricky backup, since the results of the query do not contain the data. But until the dataset is repacked, the results of the query should contain valid references to the deleted data still in the dataset.

@mnuccioarpae
Contributor Author

It's a tricky backup, since the results of the query do not contain the data

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet".

@spanezz
Contributor

spanezz commented Feb 9, 2022

The arki-check command did not print any error message, and I forgot to check the exit status with "echo $?", so I cannot be sure that it ended without error. However, the summaries were updated to reflect the requested deletion. Is this done only at the end of the transaction?

In theory, the summaries are deleted while the data is being deleted and are regenerated at the end. The actual deletion is performed in the index files, which are the main files you should expect to see modified by the deletion.

@edigiacomo
Member

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet".

Sorry @spanezz but maybe I don't understand which query you're referring to:

# Save the result of the query
[arkimet@arkioss8 ~]$ arki-query 'reftime:=today 00:00' /arkivio/arkimet/dataset/boa > /tmp/buttami.arkimet
# Copy the result in my laptop
[edg 🫒  ~]$ rsync -avz arkimet@arkioss:/tmp/buttami.arkimet .
# Check that the original path is not reachable
[edg 🫒  ~]$ arki-scan --dump buttami.arkimet | head -n 1
Source: BLOB(vm2,/arkivio/arkimet/dataset/boa/2022/02-09.vm2:0+37)
[edg 🫒  ~]$ ls /arkivio/arkimet/dataset/boa
ls: cannot access '/arkivio/arkimet/dataset/boa': No such file or directory
# Extract the data
[edg 🫒  ~]$ arki-scan --data buttami.arkimet | head
202202090000,12626,139,49,,,000000000
202202090000,12626,158,7.7,,,000000000
202202090000,12626,164,0,,,000000000
202202090000,12626,166,0.4,,,000000000
202202090000,12626,629,0.13,,,000000000
202202090000,12626,631,3.6,,,000000000
202202090000,12626,632,3.3,,,000000000
202202090000,12626,683,-0.09,,,000000000
202202090000,12626,684,0.25,,,000000000
202202090000,12628,139,41,,,000000000

@spanezz
Contributor

spanezz commented Feb 9, 2022

I did not know this! I believed that it contained all the data, because I can extract all the records with "arki-scan --data ./file.arkimet".

Sorry @spanezz but maybe I don't understand which query you're referring to:

# Save the result of the query
[arkimet@arkioss8 ~]$ arki-query 'reftime:=today 00:00' /arkivio/arkimet/dataset/boa > /tmp/buttami.arkimet
# Copy the result in my laptop
[edg 🫒  ~]$ rsync -avz arkimet@arkioss:/tmp/buttami.arkimet .
# Check that the original path is not reachable
[edg 🫒  ~]$ arki-scan --dump buttami.arkimet | head -n 1
Source: BLOB(vm2,/arkivio/arkimet/dataset/boa/2022/02-09.vm2:0+37)
[edg 🫒  ~]$ ls /arkivio/arkimet/dataset/boa
ls: cannot access '/arkivio/arkimet/dataset/boa': No such file or directory
# Extract the data
[edg 🫒  ~]$ arki-scan --data buttami.arkimet | head
202202090000,12626,139,49,,,000000000
202202090000,12626,158,7.7,,,000000000
202202090000,12626,164,0,,,000000000
202202090000,12626,166,0.4,,,000000000
202202090000,12626,629,0.13,,,000000000
202202090000,12626,631,3.6,,,000000000
202202090000,12626,632,3.3,,,000000000
202202090000,12626,683,-0.09,,,000000000
202202090000,12626,684,0.25,,,000000000
202202090000,12628,139,41,,,000000000

Ah interesting, then it works for VM2 data only, because the metadata contain enough information to reconstruct the data. For other formats, I would expect this not to work
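
The `Source:` line in the dump above shows what a non-inline query result stores for each record: a reference into a dataset segment rather than a copy of the file. A minimal sketch for pulling that reference apart follows; `parse_blob_source` is a hypothetical helper, and reading `0+37` as offset plus size is an assumption based on the sample line, not on documented behaviour.

```shell
# Split a "Source: BLOB(format,path:offset+size)" line into four
# space-separated parts: format, path, offset, size.
# Assumption: the trailing "N+M" is the offset and size within the segment.
parse_blob_source() {
    printf '%s\n' "$1" |
        sed -n 's/^Source: BLOB(\([^,]*\),\(.*\):\([0-9][0-9]*\)+\([0-9][0-9]*\))$/\1 \2 \3 \4/p'
}
```

Applied to the first line of `arki-scan --dump buttami.arkimet`, this yields the segment path that the metadata points to, which can then be tested with `[ -e "$path" ]` to see whether the referenced segment is reachable from the current machine.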

@edigiacomo
Member

Nice! Is it possible that this feature is format agnostic and it's enabled by the smallfiles option (https://github.com/ARPA-SIMC/arkimet/blob/v1.40-1/doc/datasets.rst)?

@spanezz
Contributor

spanezz commented Feb 9, 2022

No, smallfiles is only supported for VM2, since VM2 data can be reconstructed from its arkimet metadata plus a small string. For all other formats it does not make sense: to preserve the data, one has to copy all of it after the metadata, and that is what --inline does.

It is not possible, however, to delete data using the output of arki-query --inline, because with --inline the reference to the data in the dataset is replaced with the data itself.

spanezz added a commit that referenced this issue Feb 9, 2022
@spanezz
Contributor

spanezz commented Feb 9, 2022

I've reworked deletion for iseg datasets (which are now the only datasets that support deletion) to group data to delete by segment, and do one transaction per segment. The result should be much faster.

I've also added, with --verbose, feedback for each segment:

$ time arki-check --fix --remove 6257-da-eliminare.arkimet  test1 --verbose
INFO test1: 2011/12-19.vm2: 1297 data marked as deleted
INFO test1: 2011/12-18.vm2: 1297 data marked as deleted
INFO test1: 2011/12-17.vm2: 1297 data marked as deleted
INFO test1: 2011/12-14.vm2: 1296 data marked as deleted
INFO test1: 2011/12-20.vm2: 545 data marked as deleted
INFO test1: 2011/12-01.vm2: 1293 data marked as deleted
INFO test1: 2011/12-15.vm2: 1297 data marked as deleted
INFO test1: 2011/12-13.vm2: 1297 data marked as deleted
INFO test1: 2011/12-02.vm2: 1295 data marked as deleted
INFO test1: 2011/12-03.vm2: 1294 data marked as deleted
INFO test1: 2011/12-04.vm2: 1295 data marked as deleted
INFO test1: 2011/12-16.vm2: 1296 data marked as deleted
INFO test1: 2011/12-05.vm2: 1297 data marked as deleted
INFO test1: 2011/12-07.vm2: 1297 data marked as deleted
INFO test1: 2011/12-11.vm2: 1289 data marked as deleted
INFO test1: 2011/12-06.vm2: 1297 data marked as deleted
INFO test1: 2011/12-08.vm2: 1294 data marked as deleted
INFO test1: 2011/12-09.vm2: 1297 data marked as deleted
INFO test1: 2011/12-10.vm2: 1293 data marked as deleted
INFO test1: 2011/12-12.vm2: 1294 data marked as deleted
INFO test1: 25157 data marked as deleted

real	0m4.361s
user	0m1.884s
sys	0m1.120s
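
The per-segment lines can also be cross-checked against the final total. In the sample output above, the 20 per-segment counts add up to 25157, which matches both the last INFO line and the `wc -l` result from the earlier reproduction. The sketch below automates that sum; `check_deleted_totals` is a hypothetical helper, and the line layout it parses is taken from this sample output rather than from a documented format.

```shell
# Read `arki-check --verbose` output on stdin, add up the per-segment
# "N data marked as deleted" counts and compare them with the dataset total.
# Exit status is 0 when the two numbers agree, 1 otherwise.
check_deleted_totals() {
    awk '
        !/data marked as deleted$/ { next }       # ignore timing and other lines
        {
            count = $(NF - 4)
            if (NF == 8) segments += count        # INFO <dataset>: <segment>: N data marked as deleted
            else if (NF == 7) total = count       # INFO <dataset>: N data marked as deleted
        }
        END {
            printf "segments=%d total=%d\n", segments, total
            exit (segments == total ? 0 : 1)
        }
    '
}
```

The helper reads from stdin, so the log has to be captured first; if the INFO lines are written to stderr (an assumption), redirect with `2>&1` before the pipe.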

@spanezz
Contributor

spanezz commented Feb 9, 2022

Redoing the deletion with current master should be much faster and have a far less boring output.

Hopefully data should also stay deleted, but I still have no idea how come it didn't get deleted when you first tried :(

@edigiacomo
Member

@mnuccioarpae arkimet 1.41-1 is available in the Copr repository.
