Feature: add way to create index #24
Comments
:-) Here's the problem. There isn't a smaller way to store the data (well, you could compress it, but that may end up being slower rather than faster, depending on the algorithm). But let's explore the numbers. I'm going to take some guesses here. Let's say that your 300G of data is really around 30 billion lines (at about 10 characters per line). Feel free to adjust the numbers to reality.

Indexing those 30B lines can happen in a number of ways, but let's start by saying you hash each line and store the hashed value in a big file. 30B lines hashed into MD5 would be 480G (in addition to the 300G you already have). Oops. You could store it in a database, but the index for the database would be pretty big too. A quick test with sqlite3 shows about 40 Mbytes for 1,000,000 words, so about 1.2T for your 300G of words.

This is a problem I started working on about 10 years ago. The best choice may be to store the words in a large online database (like Bigtable), and index from there. At least you won't care about the disk space :-)
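To make the 480G figure concrete, here is a minimal sketch of a per-line hash index (assuming OpenSSL's MD5 is available; the file names are placeholders): every line becomes a fixed 16-byte digest appended to an index file, so 30 billion lines cost roughly 30e9 × 16 bytes ≈ 480G on top of the original data.

```c
/* Sketch only: hash each line of a wordlist to a 16-byte MD5 digest
 * and append the digests to an index file. Build with -lcrypto. */
#include <openssl/md5.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    FILE *in  = fopen(argc > 1 ? argv[1] : "words.txt", "r");
    FILE *out = fopen(argc > 2 ? argv[2] : "words.md5idx", "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    char line[8192];
    unsigned char digest[MD5_DIGEST_LENGTH];    /* 16 bytes per line */
    while (fgets(line, sizeof line, in)) {
        size_t len = strcspn(line, "\r\n");     /* hash the line without its newline */
        MD5((unsigned char *)line, len, digest);
        fwrite(digest, 1, sizeof digest, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}
```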
If the big file were unchanging, could some metadata about the big file be cached in a way that might save some time?
Of course. But the only "metadata" that is required is the contents of the file :-) Let's start simple, with one file. Say it has 10 lines in it. What would be useful to cache about those 10 lines?

The existing rling is pretty much as efficient as it is possible to be, given the problem set. It reads the files exactly once, and does it at the limit of the underlying hardware. The only way to make it faster at the task they are suggesting would be to have a (very fast) index of the "known words". And that's going to be costly.

There is another approach, which would require that the "index" be exactly as large as the archive (so, in this case, 300G). For this, we create a sorted list of the lines in the archive, and then binary search that list (or create N sorted lists, index into the Nth sorted list, and then binary search within it). This either forces you to change the archived 300G list, or to create the separate sorted list(s). Each new word could then be binary searched. For 30B words, that's about 35 iterations per word for the full list, or fewer (roughly 35 minus log2 of N) if you index into one of N smaller lists first. Adding new words/lines to the archive would require re-sorting these indexes, which is expensive (or you could do a floating insertion sort, but that's still expensive).

So, it becomes a question of how much disk you can afford to spare, or what a Bigtable index would cost to maintain. My guess is that, currently, processing a new list against that 300G of archive lists takes about an hour, and it takes that hour pretty much regardless of the size of the incoming lists. Batching this once per day would be painless, or you could just do a batch once per week. A whole lot cheaper than (what I imagine) you would pay for 1T of Bigtable resource. Looking at the products, maybe Firestore would be a better option. Cost for this data set would be about $200/month (guessing - 300G will likely need about 1T of Firestore), plus query cost at $0.06 per 100K reads. Is this worth it for you, vs. doing 1 hour of computer time once a week?
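For illustration, here is a minimal sketch of the binary-search lookup described above (not rling code; the tiny in-memory word list is a placeholder for a sorted archive). The loop does about log2(n) comparisons, which is where the ~35 iterations for 30 billion entries comes from.

```c
/* Sketch only: binary search a sorted list of words. */
#include <stdio.h>
#include <string.h>

static int contains(const char **sorted, size_t n, const char *word) {
    size_t lo = 0, hi = n;                 /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(word, sorted[mid]);
        if (cmp == 0) return 1;
        if (cmp < 0)  hi = mid;
        else          lo = mid + 1;
    }
    return 0;
}

int main(void) {
    const char *archive[] = { "apple", "banana", "cherry", "grape", "melon" };
    printf("%d %d\n", contains(archive, 5, "cherry"),
                      contains(archive, 5, "kiwi"));   /* prints: 1 0 */
    return 0;
}
```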
Right now it takes about 20 minutes to check all 300GB on an M.2 drive. So, I would say you can close this request unless you want to leave it open for some reason. Thanks for listening :)
It's important to always question assumptions, so I do appreciate the questions. 20 minutes is awesome, and I'm glad that this works for you. 250 megabytes per second is great; I would be interested to know how fast you can read all 300G of files with another program... maybe there is still some work to do in the I/O?
The work is split between the number of threads available. It's reading in MAXCHUNK/2 bytes at a time from the remove files (about 25 megabytes), but then it has to split that work into multiple threads for processing. The more threads you have, the more work it is able to do, until you exhaust memory bandwidth. It would be interesting to run a subset (say, 1 minute's worth of files) using -t 1, -t 2, -t 4, -t 8 (and so on, up to the limit of your CPU) to see the effect on how much work you can do. With the large systems I have here, I was able to exhaust memory bandwidth before I/O (but that was on the POWER8 system with 80 cores).
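For anyone curious what "split that work into multiple threads" means in general terms, here is a minimal sketch (not rling's actual code; the chunk size, thread count, and per-line work are placeholder assumptions): read a fixed-size chunk, split it at newline boundaries, and hand each piece to a worker thread.

```c
/* Sketch only: chunked reads divided among worker threads at line boundaries.
 * The per-line "work" here is just counting newlines. Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define CHUNK (25 * 1024 * 1024)   /* ~25 MB per read, akin to MAXCHUNK/2 above */

struct job { const char *start, *end; long lines; };

static void *worker(void *arg) {
    struct job *j = arg;
    long n = 0;
    for (const char *p = j->start; p < j->end; p++)
        if (*p == '\n') n++;
    j->lines = n;
    return NULL;
}

int main(int argc, char **argv) {
    FILE *f = fopen(argc > 1 ? argv[1] : "big-file.txt", "rb");
    if (!f) { perror("fopen"); return 1; }
    char *buf = malloc(CHUNK);
    size_t got;
    long total = 0;
    while ((got = fread(buf, 1, CHUNK, f)) > 0) {
        pthread_t tid[NTHREADS];
        struct job jobs[NTHREADS];
        const char *p = buf, *end = buf + got;
        for (int t = 0; t < NTHREADS; t++) {
            /* divide the remaining bytes evenly among the remaining threads */
            const char *q = (t == NTHREADS - 1) ? end : p + (end - p) / (NTHREADS - t);
            while (q < end && *q != '\n') q++;   /* don't cut a line in half */
            if (q < end) q++;
            jobs[t].start = p; jobs[t].end = q; jobs[t].lines = 0;
            pthread_create(&tid[t], NULL, worker, &jobs[t]);
            p = q;
        }
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += jobs[t].lines;
        }
    }
    free(buf);
    fclose(f);
    printf("lines: %ld\n", total);
    return 0;
}
```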
Would there be any way to build an index of sorts that rling could compare against instead of using the contents of a directory over and over?
Right now I compare small files to 300GB of data over and over and it would be super helpful if I could somehow stage/store the directory in a smaller/faster way vs the original contents.
so maybe something along the lines of:
step 1) create index from directory of files
step 2) rling big-file.txt new-file.txt /path/to/old/files/index.txt -use_index