List: ms-ospf
Subject: Re: [Lsr] Multiple failures in Dynamic Flooding
From: tony.li () tony ! li
Date: 2019-03-11 17:41:08
Message-ID: 10A1CA48-0D09-44FF-95ED-8D52FB867B8B () tony ! li
Hi Huaimo,

> In summary for multiple failures, two issues below in
> draft-li-lsr-dynamic-flooding are discussed: 1) how to determine whether the
> current flooding topology is split; and 2) how to repair/connect the flooding
> topology split. For the first issue, the discussions are still going on.
> For the second issue, repairing/connecting the flooding topology split through
> Hello protocol extensions does not work. When a "backup path"/connection of
> multiple hops is needed to connect/repair the flooding topology split, Hello
> cannot go beyond one hop, and thus cannot repair the flooding topology split in
> this case.

You do not try to repair things remotely; they are always repaired locally. If there
are multiple failures in the flooding topology (FT) and it is partitioned, then it
follows that there are multiple remaining connected components of the flooding
topology. Nodes that are adjacent to the failures will update their LSPs and flood
them throughout their connected component. If the FT is partitioned, each component
will see at least two link failures, and each node in the component can detect that
the FT has partitioned. Each node is then capable of enabling temporary flooding on
one or more links that traverse the partition, thereby restoring a functioning FT.
The Area Leader then recomputes and redistributes the revised FT.

To put it yet another way, repair is fully distributed. You should like that. :-)
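
As a rough illustration of the local repair described above, here is a minimal Python sketch. It is not the draft's algorithm: the function names, the link representation (pairs of node names), and the selection rule (enable every local link that is up and whose endpoints sit in different components of the surviving FT) are all assumptions made for the example.

```python
from collections import defaultdict


def components(nodes, links):
    """Map each node to a component id (the smallest node name in its component)."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    comp = {}
    for start in sorted(nodes):
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            n = stack.pop()
            for m in adj[n]:
                if m not in comp:
                    comp[m] = start
                    stack.append(m)
    return comp


def temporary_flooding_links(me, nodes, ft_links, up_links):
    """Hypothetical local decision: links of `me` that are up and cross an FT partition."""
    up = {frozenset(link) for link in up_links}
    # Keep only the FT links that the LSDB still reports as up.
    surviving_ft = [link for link in ft_links if frozenset(link) in up]
    comp = components(nodes, surviving_ft)
    chosen = []
    for a, b in up_links:
        if me in (a, b) and comp[a] != comp[b]:
            chosen.append(tuple(sorted((a, b))))
    return sorted(chosen)


# Ring FT A-B-C-D-A plus a non-FT chord A-C; FT links A-B and C-D have failed.
nodes = {"A", "B", "C", "D"}
ft = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
up = [("B", "C"), ("D", "A"), ("A", "C")]
print(temporary_flooding_links("A", nodes, ft, up))
print(temporary_flooding_links("B", nodes, ft, up))
```

In this toy case the surviving FT splits into {A, D} and {B, C}. Nodes A and C each independently pick the A-C link, while B and D have no local link that crosses the partition, so no remote signalling is needed.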

> > We are not requiring it, but a system could also do a more extensive computation
> > and compare the links between itself and the neighbor by tracing the path in the
> > FT and then confirming that each link is up in the LSDB.
>
> It normally takes a long time such as more than ten minutes to age out and remove
> an LSP/LSA for the neighbor from the LSDB even though the neighbor is disconnected
> physically. How can you decide quickly in tens of milliseconds that the flooding
> topology is disconnected?

You do not wait for LSP/LSA removal. You look for link changes in the LSPs that you
do receive, or for local link changes.
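
To make the idea of looking for link changes concrete, the following Python sketch compares two versions of the same node's LSP and reports which FT links have gone down. The LSP model here (a dict with an `origin` and a set of `neighbors`) is a deliberate simplification assumed for the example; real IS-IS LSPs and OSPF LSAs carry this information in TLVs and link descriptions.

```python
def link_changes(old_lsp, new_lsp):
    """Return (lost, gained) adjacency sets between two versions of one node's LSP."""
    origin = new_lsp["origin"]
    lost = old_lsp["neighbors"] - new_lsp["neighbors"]
    gained = new_lsp["neighbors"] - old_lsp["neighbors"]
    as_links = lambda peers: {frozenset((origin, peer)) for peer in peers}
    return as_links(lost), as_links(gained)


def ft_links_down(old_lsp, new_lsp, ft_links):
    """FT links that the updated LSP no longer advertises (hypothetical helper)."""
    lost, _ = link_changes(old_lsp, new_lsp)
    return lost & {frozenset(link) for link in ft_links}


# B previously advertised adjacencies to A and C; the new LSP drops A.
old = {"origin": "B", "neighbors": {"A", "C"}}
new = {"origin": "B", "neighbors": {"C"}}
ft = [("A", "B"), ("B", "C")]
print(ft_links_down(old, new, ft))
```

The point of the sketch is that the failure shows up as soon as an adjacent node refloods its LSP, so nothing depends on the failed neighbor's own LSP aging out of the LSDB.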

> > As we have discussed, this is not a solution. In fact, this is more dangerous
> > than anything else that has been proposed and seems highly likely to trigger a
> > cascade failure. You are enabling full flooding for many nodes. In dense
> > topologies, even a radius of 3 is very high. For example, in an LS topology, a
> > radius of 3 is sufficient to enable full flooding throughout the entire topology.
> > If that were stable, we would not need Dynamic Flooding at all.
>
> This full flooding is enabled only for a very short time.

All it takes is enabling it at sufficient density to create a cascade failure.
Milliseconds are sufficient for a collapse.
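
The radius point quoted above can be checked with a small breadth-first search. This Python sketch assumes that "LS" means a two-tier leaf-spine fabric in which every leaf connects to every spine; the fabric size and the helper names are made up for the example.

```python
from collections import deque


def leaf_spine(num_spines, num_leaves):
    """Build a two-tier leaf-spine adjacency map (every leaf connects to every spine)."""
    spines = [f"s{i}" for i in range(num_spines)]
    leaves = [f"l{i}" for i in range(num_leaves)]
    adj = {s: set(leaves) for s in spines}
    adj.update({leaf: set(spines) for leaf in leaves})
    return adj


def within_radius(adj, start, radius):
    """Return the set of nodes at most `radius` hops from `start` (plain BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        if dist[n] == radius:
            continue  # do not expand beyond the radius
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return set(dist)


adj = leaf_spine(num_spines=4, num_leaves=16)
for r in (1, 2, 3):
    print(r, len(within_radius(adj, "l0", r)), "of", len(adj))
```

Under this assumption every node is at most two hops from every other node, so a radius of 3 around any failure already covers the whole fabric, which is the density concern raised in the quoted text. Deeper fabrics would need their own check.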

> How do you get that this is more dangerous than anything else and seems highly
> likely to trigger a cascade failure? Can you give some explanations in details?

Again, we do not have absolute metrics on what triggers a cascade failure today. We
have several data points from several different implementations at different points
in time. We know that in the early ’90s, a full mesh of 20 neighbors running L1L2 was
sufficient to trigger one. Obviously things have changed somewhat, but even more
modern implementations have had problems. This is why the MSDC went to BGP.

As a result, we need to be very conservative about what flooding we temporarily
enable. We do not want to walk anywhere near the cliff, as the cascade failure is
fatal to the network.

Tony
_______________________________________________
Lsr mailing list
Lsr@ietf.org
https://www.ietf.org/mailman/listinfo/lsr