Discussion on Path-Lookup Strategy (Recursive or Non-Recursive) #3964
Comments
One thing that is not mentioned yet: Synchronizing down-segments among the core ASes in the old system is very hard. This was papered over in the previous version by constantly resyncing ALL the segments in the database, which is very suboptimal.
Right, this was already discussed in the original Issue #3731. From my point of view, abolishing the syncing system is not really disputed; instead, I'm more worried about the non-recursive path lookup (which is only partially related).
TL;DR: I'm not convinced that the problems you describe will actually turn out to be problems, at least not any time soon. Given the lack of evidence, I prefer the simpler system that is easier to implement and reason about over a (potentially) more performant and more defensible but more complex system.

Hi @mlegner, thanks for your write-up. I will try to respond to the raised points individually.
Right. Each layer of caching introduces complexities: How long to retain segments? When to check for new segments? How to deal with stale paths, e.g., revoked ones? All of these points add complexity both for implementing and for reasoning about the system. I'm not convinced that the gains in path-lookup latency outweigh the added complexity, especially since local CSes do in fact cache path segments.
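As a minimal sketch of the policy questions just listed (hypothetical types and names, not the SCION code base), a segment cache has to decide about retention, refresh, and revocation in roughly this way:

```go
package segcache

import (
	"sync"
	"time"
)

// Segment is a hypothetical stand-in for a cached path segment.
type Segment struct {
	ID      string
	Expiry  time.Time // expiration time carried in the segment itself
	Revoked bool      // set once a revocation for this segment is seen
}

type cachedSegment struct {
	seg     Segment
	fetched time.Time // when we last fetched this segment from upstream
}

// SegmentCache has to answer exactly the questions above:
// how long to retain, when to refresh, how to treat stale/revoked segments.
type SegmentCache struct {
	mu      sync.Mutex
	ttl     time.Duration // refresh interval: when to check for newer segments
	entries map[string]cachedSegment
}

// Get returns a cached segment, or false if a (re)fetch from the upstream CS is needed.
func (c *SegmentCache) Get(id string, now time.Time) (Segment, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[id]
	switch {
	case !ok:
		return Segment{}, false // never seen: fetch
	case e.seg.Revoked, now.After(e.seg.Expiry):
		delete(c.entries, id) // stale or revoked: drop and refetch
		return Segment{}, false
	case now.Sub(e.fetched) > c.ttl:
		return Segment{}, false // too old by our own policy: refresh
	default:
		return e.seg, true
	}
}
```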
It's again about added complexity for questionable gains (at least in terms of performance), as you argue yourself.
Both of these cases should be relatively rare overall, since there is one level of caching in the local CS. Besides, the latency difference between contacting a local core CS vs. a remote CS might not be that large after all (depending, of course, on the geographic distribution of CSes and their interconnections). In the end, I don't think it matters much if a single first path lookup takes 200ms instead of 10ms, because the caching in the local CS will already amortize this well enough. Of course, we cannot say this definitively, but without compelling evidence that this is a problem, I prefer to err on the side of simplicity over premature optimization.
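A back-of-the-envelope sketch of the amortization argument, with made-up numbers (10 ms for a cached answer, 200 ms for a first remote lookup) and assumed cache-hit rates:

```go
package main

import "fmt"

func main() {
	const (
		cachedMs = 10.0  // answer served from the local CS cache (assumed)
		remoteMs = 200.0 // first lookup that has to reach a remote CS (assumed)
	)
	// Hypothetical cache-hit rates; the real value depends on traffic locality.
	for _, hit := range []float64{0.90, 0.99, 0.999} {
		expected := hit*cachedMs + (1-hit)*remoteMs
		fmt.Printf("hit rate %.1f%% -> expected lookup latency %.1f ms\n", hit*100, expected)
	}
}
```

Even at a 90% hit rate the expected latency is already below 30 ms, which is the gist of the amortization point.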
This seems to be your main concern about the new lookup strategy. I don't fully understand the argument about filtering based on the SCION path. Why does it matter that there is only a single segment? Regarding per-AS rate limiting, again, we have no evidence that the increased number of source ASes will lead to problems. It's simply too early to tell. Again, my argument here is to err on the side of simplicity over premature optimization.
I'm not convinced by this (as argued above). You'd need to elaborate a bit more on why you think this is the case and ideally have some data to back it up (maybe from simulations or related literature).
Thank you for the feedback. It seems we agree on most points. Unfortunately, it is a bit tricky to underpin all these points with real-world data, as it is difficult to find datasets for the traffic distributions that we are interested in and to map them to SCION ISDs and ASes. Still, we will look into this to get some more confidence and to better assess the potential caching benefits etc. As my point 5 seems to have been unclear, let me elaborate on it somewhat. For the following, I make the (conservative) assumption that we have 100k ASes, 100 ISDs, and 10 core ASes per ISD, so 1k core ASes in total. (With full SCION deployment, the number of ASes would likely increase substantially.)

Old (recursive) lookup
For both registration and lookup, there is a small set (around 2k on average) of allowed clients, which makes DDoS defenses much easier:
New (non-recursive) lookup

While path registration remains unchanged, the lookup requests can now come from every CS anywhere in the Internet. Instead of 2k different clients, we now have 100k (and maybe 1M at some point in the future). This makes DoS defenses trickier: the simple path-based filtering doesn't work anymore, and keeping persistent connections with all these CSes may not be practical either. However, using authentication with DRKey as a first line of defense would still work. Rate limiting probably still works as well but becomes more difficult due to the large number of potential clients. The core CSes in the new lookup system are somewhat comparable to authoritative TLD name servers in their scope and the number of clients they have to handle. Still, there are a few reasons why the situation is worse for CSes:
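To make the rate-limiting concern a bit more concrete, here is a minimal sketch of per-AS rate limiting at a core CS (hypothetical types and names, not the SCION code base). With ~2k known clients the state is tiny; with ~100k potential source ASes it still works, but the limiter only helps if the source AS is authenticated, e.g., via DRKey, so that it cannot be spoofed:

```go
package ratelimit

import (
	"sync"
	"time"
)

// PerASLimiter is a hypothetical token-bucket limiter keyed by source AS.
// In the old system it would track ~2k ASes; in the new system up to ~100k,
// which is still feasible but a much larger bookkeeping and attack surface.
type PerASLimiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second, per AS
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

type bucket struct {
	tokens float64
	last   time.Time
}

func NewPerASLimiter(rate, burst float64) *PerASLimiter {
	return &PerASLimiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow should only be called after the requester's AS has been authenticated
// (e.g., with DRKey); otherwise the source AS could be spoofed and per-AS
// limiting would be useless.
func (l *PerASLimiter) Allow(srcISDAS string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[srcISDAS]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[srcISDAS] = b
	}
	// Refill the bucket proportionally to the elapsed time.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false // over the per-AS budget: drop or deprioritize the request
	}
	b.tokens--
	return true
}
```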
Current situation

I fully understand that all the potential problems would only occur in a much larger system and are not relevant at the moment. Therefore, I agree that it makes sense to keep the simpler system in the current implementation. In the book, we could describe both variants and discuss their pros and cons. Then, at some point in the future, standardization efforts would need to revisit this decision.
I think we can agree on that. Nobody really knows what the future will bring; however, we are not closing any doors with the current approach. If it turns out that we'll need more levels of caching to be able to scale, we can add them, in a backwards-compatible way even. Your arguments that DDoS defense becomes more difficult with a larger potential set of sources intuitively make sense, but no one can say at which point this would become a problem. I propose that in the book we describe the currently implemented approach and have a section about the alternative with some of the tradeoffs you listed here. I do believe that before the SCION network becomes this large, a standardization body would be addressing these questions and would probably have more insight and experience regarding the involved tradeoffs.
That sounds like a good solution to me; thank you for the discussion. I will close this issue for now but will keep it in mind if anything changes in the future.
TL;DR: I see some problems with the new path-lookup system, namely dropping the recursion at core CSes, and would like to have a (brief) discussion about whether the advantages of the new system outweigh its disadvantages.
In the context of writing the new SCION book, I've taken a closer look at the new path-lookup system, mostly based on #3731, the documentation introduced in #3747, and an offline discussion that @matzf pointed me to. The first change, namely removing the path syncing in core ASes, is discussed in some detail in #3731. I agree with those points and think that the changes make sense.
In addition, the lookup process was changed significantly. (To simplify the text I will use "local CS" for the CS in the source's non-core AS, "local core CS" for a CS in a core AS of the source's ISD, and "remote CS" for a CS in a core AS of the destination's ISD.)
In the old system, the request from the local CS to the local core CS was recursive, i.e., the local CS queried the local core CS and that in turn fetched (and cached) down-segments from the remote CS. In the new system, the local CS asks the local core CSes and remote CSes separately for core- and down-segments, respectively.
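To illustrate the difference, here is a small sketch of the two lookup flows from the local CS's point of view (hypothetical interfaces and names, not the actual SCION API):

```go
package lookup

// Segment is a hypothetical stand-in for a path segment.
type Segment struct{ Raw []byte }

// CS is a hypothetical client handle to some control service.
type CS interface {
	// CoreSegments returns core-segments towards the destination ISD.
	CoreSegments(dstISD string) []Segment
	// DownSegments returns down-segments towards the destination AS.
	DownSegments(dstAS string) []Segment
	// Segments models the recursive answer of the old system: the local core
	// CS fetches (and caches) down-segments from the remote CS on our behalf.
	Segments(dstAS string) []Segment
}

// oldLookup: recursive. The local CS only talks to its local core CS, which
// in turn contacts the remote CS and caches the down-segments it receives.
func oldLookup(localCore CS, dstAS string) []Segment {
	return localCore.Segments(dstAS)
}

// newLookup: non-recursive. The local CS asks the local core CS for
// core-segments and the remote CS for down-segments, separately.
func newLookup(localCore, remote CS, dstISD, dstAS string) (core, down []Segment) {
	return localCore.CoreSegments(dstISD), remote.DownSegments(dstAS)
}
```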
Problems of the old system (with recursion)
As far as I understand, the main arguments for the new system are the following:
Problems of the new system (no recursion)
I do understand these points, but I also see significant issues with the new system:
Options
In my opinion, we should choose either the old (recursive) or the new (non-recursive) system and then try to fix the shortcomings of whichever one we pick.
Concerning the old system:
Concerning the new system: