Feature: add way to create index #24

Open · Shooter3k opened this issue Jul 13, 2021 · 7 comments

@Shooter3k

Would there be any way to build an index of sorts that rling could compare against instead of using the contents of a directory over and over?

Right now I compare small files to 300GB of data over and over, and it would be super helpful if I could somehow stage/store the directory in a smaller/faster way vs the original contents.

So maybe something along the lines of:
Step 1) create an index from a directory of files
Step 2) rling big-file.txt new-file.txt /path/to/old/files/index.txt -use_index

@Waffle2 (Collaborator) commented Jul 14, 2021

:-)

Here's the problem. There isn't a smaller way to store the data (well, you could compress it, but that may end up being slower, rather than faster - depending on the algorithm).

But, let's explore the numbers. I'm going to take some guesses here. Let's say that your 300G of data is really around 30 billion lines (at about 10 characters per line). Feel free to adjust the numbers to reality. Indexing those 30B lines can happen in a number of ways - but let's start by saying you hash each line, and store the hashed value in a big file.

30B lines hashed into MD5 would be 480G (in addition to the 300G you already have). Oops. You could store it in a database, but the index for the database would be pretty big too. A quick test with sqlite3 shows about 40 megabytes for 1,000,000 words, so about 1.2T for your 300G of words.
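If you want to plug your own numbers into those estimates, the arithmetic is just (lines × digest size) and (lines ÷ 1M × 40 MB). A quick awk check, using the same guessed figures as above:

```sh
# Back-of-envelope only; 30e9 lines and ~40 MB per 1M words are the guesses from above.
awk 'BEGIN { printf "MD5 digests: %.0f GB\n", 30e9 * 16 / 1e9 }'            # 16-byte MD5 per line
awk 'BEGIN { printf "sqlite3 index: %.1f TB\n", (30e9 / 1e6) * 40 / 1e6 }'  # ~40 MB per 1M words
```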

This is a problem I started working on about 10 years ago. The best choice may be to store the words in a large online database (like bigtable), and index from there. At least you won't care about the disk space :-)

@roycewilliams (Contributor) commented Jul 14, 2021

If the big file was unchanging, could some metadata about the big file be cached in a way that might save some time?

@Waffle2 (Collaborator) commented Jul 14, 2021

Of course. But the only "metadata" that is required is the contents of the file :-)

Let's start simple - with one file. Say it has 10 lines in it. What would be useful to cache about those 10 lines?

The existing rling is pretty much as efficient as it is possible to be, given the problem set. It reads the files exactly once, and does it at the limit of the underlying hardware. The only way to make it faster at the task you're suggesting would be to have a (very fast) index of the "known words". And that's going to be costly.

There is another approach, which would require that the "index" be exactly as large as the archive (so, in this case, 300G). For this, we create a sorted list of the lines in the archive, and then binary search that list (or create N sorted lists, pick the right sublist by indexing or binary searching, and then binary search within it). This either forces you to change the archived 300G list, or to create the separate sorted list(s). Each new word could then be binary searched.

For 30B words, that's about 35 comparisons per word for the full list (log2 of 30B), or roughly 35 minus log2(N) when splitting into N sorted sublists. Adding new words/lines to the archive would require re-sorting these indexes, which is expensive (or you could do a floating insertion sort - but that is still expensive).
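Not something rling does today, but the sorted-archive idea can be prototyped with standard Unix tools: look(1) does a binary search against a sorted file, and comm(1) can diff two sorted lists in a single pass. A rough sketch, with placeholder file names (-S and --parallel are GNU sort options):

```sh
# One-time (expensive) step: sort the archive into a separate sorted copy.
LC_ALL=C sort -S 50% --parallel=8 archive.txt > archive.sorted

# Probe one candidate line; look(1) binary-searches the sorted file (~log2(N) probes).
# Note: look matches by prefix, not by exact line.
look candidateword archive.sorted

# Or strip every already-archived line from a new list in one sequential pass:
LC_ALL=C sort new-list.txt | LC_ALL=C comm -23 - archive.sorted > truly-new.txt
```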

So, it becomes a question of how much disk you can afford to spare - or what a bigtable index would cost to maintain. My guess is that, currently, processing a new list against that 300G of archive lists takes about an hour - and it takes that hour pretty much regardless of the size of the incoming lists. Batching this once per day would be painless - or you could just do a batch once per week. A whole lot cheaper than (what I imagine) you would pay for 1T of bigtable resource.

Looking at the products, maybe Firestore would be a better option. Cost for this data set would be about $200/month (guessing - 300G will likely need about 1T of firestore), plus query cost at $0.06 per 100K. Is this worth it for you, vs doing 1 hour of computer time once/week?

@Shooter3k (Author)

Right now it takes about 20 minutes to check all 300GB on an m.2 drive. So, I would say you can close this request unless you want to leave it open for some reason.

Thanks for listening :)

@Waffle2 (Collaborator) commented Jul 25, 2021

It's important to always question assumptions, so I do appreciate the questions. 20 minutes is awesome, and I'm glad that this works for you. 250 megabytes per second is great - I would be interested to know how fast you can read all 300G of files with another program.... maybe there is some work to do in the IO, still?
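One crude way to get that raw sequential-read baseline, assuming a Linux box and placeholder paths (drop the page cache first if you want cold-cache numbers):

```sh
# Flush the page cache so the read actually hits the drive (Linux-specific, needs root).
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

# Time a plain sequential read of the same 300G of files; compare the MB/s against rling's 20 minutes.
time cat /path/to/old/files/* > /dev/null
```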

@Shooter3k (Author)

This is actually a really good question. I'm using a Samsung 970 M.2 drive. I'm not sure why rling runs at the RND4K speeds, and I'd love to hear any ideas, but that really seems like a different topic.
[screenshot: CrystalDiskMark_20210725154055]

Shooter3k reopened this Jul 25, 2021

@Waffle2 (Collaborator) commented Jul 25, 2021

The work is split among the available threads. rling reads MAXCHUNK/2 bytes at a time from the remove files (about 25 megabytes), but then it has to split that work across multiple threads for processing. The more threads you have, the more work it is able to do, until you exhaust memory bandwidth.

It would be interesting to run a subset (say one minute's worth of files) using -t 1, -t 2, -t 4, -t 8 (and so on, up to the limit of your CPU), to see the effect on how much work you can do. With the large systems I have here, I was able to exhaust memory bandwidth before I/O (but that was on the Power 8 system with 80 cores).
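A minimal sketch of that experiment, with placeholder file names (the -t thread-count option is the one mentioned above):

```sh
# Run the same small job at increasing thread counts and watch where wall-clock time stops improving.
for t in 1 2 4 8 16; do
    echo "== threads: $t =="
    time rling -t "$t" sample-input.txt sample-output.txt /path/to/old/files/*.txt
done
```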
