Sorry for the long post, folks, but this is something I don't understand, and it might require someone with stronger routing skills.
We had a strange issue that was resolved with route tagging a while ago,
but I could never figure out why the problem happened in the first place. Picture of the topology added.
https://imgur.com/a/83UNU2N
Our IGP is EIGRP within the datacenters and we connect to our branches via BGP (ISP MPLS network).
Before the problem:
We have 2 datacenters, and each has a WAN router (an MPLS router that does BGP to the branches). All our branches connect via the DC1 MPLS router, but if we flip the default-information originate (failover) command to the 2nd DC, we can make all the branches go that way instead. This was also how our 2 DCs talked to each other: they would just go right through their WAN routers to reach the other DC.
Both WAN routers did full mutual redistribution (EIGRP into BGP and BGP into EIGRP) without any filters, so any routes in DC1, DC2 and any branches would get advertised right into each other. Kind of a mess, but it worked.
What started the problem:
One day someone decided to add a 10gig metro-E circuit between our 2 DCs, core-to-core (blue line in the picture). Now the DCs have a new way to talk to each other, because the metric for going through the 10gig link is better than going through the MPLS.
Life was all better now because of faster speeds and less latency between DC-to-DC talk...BUT...a new problem emerged:
The problem was that when a branch fell off the network and came back up (whether from a power outage or a circuit outage), the route for that branch's subnet in the core's routing table would no longer point directly down to the local RTR and on to the branch. Instead, traffic would traverse the 10gig link to the other DC, then go down that DC's WAN router and to the branch. This only affected the return traffic from the branch's perspective.
Example just to make sure I'm clear:
Let's say there is a user in the branch who wants to ping a user off Core1 in DC1. The user sends a ping and it goes up to RTR1 and then to Core1, but the return traffic goes to Core2 in DC2 and down RTR2 to the branch, so this is essentially asymmetric routing. This would ONLY happen when a branch lost connectivity, came back up, and had to be learned on the network again. It was not a problem that was discovered right away when they implemented the 10gig circuit.
So I know this problem was for sure caused by the mutual redistribution, so the higher-ups implemented route tagging at both RTR1 and RTR2 so that routes learned at RTR1 are filtered and not redistributed back in at RTR2 (and vice versa).
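For anyone who wants to picture the tagging fix, here is a rough Python sketch of the idea as I understand it. It is not our actual config: the tag values (100 and 200), the prefixes, and the function names are all made up for illustration. Each WAN router stamps its own tag on whatever it redistributes from BGP into EIGRP, and refuses to redistribute anything carrying one of those tags from EIGRP back into BGP.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical tag values, one per WAN router (the real ones are not in this post).
TAG_RTR1 = 100
TAG_RTR2 = 200


@dataclass
class Route:
    prefix: str
    tag: Optional[int] = None  # None means the route was never tagged


def redistribute_bgp_to_eigrp(routes: List[Route], own_tag: int) -> List[Route]:
    """Stamp every BGP route this router injects into EIGRP with its own tag."""
    return [Route(r.prefix, own_tag) for r in routes]


def redistribute_eigrp_to_bgp(routes: List[Route], deny_tags: Set[int]) -> List[Route]:
    """Skip EIGRP routes whose tag shows they originally came from BGP."""
    return [r for r in routes if r.tag not in deny_tags]


# RTR1 learns a (hypothetical) branch subnet via BGP and injects it into EIGRP.
into_eigrp = redistribute_bgp_to_eigrp([Route("10.1.1.0/24")], TAG_RTR1)

# RTR2 sees that route in EIGRP over the 10gig link, next to a native DC route.
eigrp_at_rtr2 = into_eigrp + [Route("172.16.0.0/16")]
into_bgp = redistribute_eigrp_to_bgp(eigrp_at_rtr2, {TAG_RTR1, TAG_RTR2})

print([r.prefix for r in into_bgp])
```

In this sketch only the untagged DC route makes it back into BGP at RTR2; the branch subnet that RTR1 tagged is dropped, which is what breaks the redistribution loop.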
Here's the part I don't understand: why would the return traffic be going to the other DC? There is something here I'm missing.
I did receive a very brief response from the guy who fixed it. I asked why this happened and he said the following:
"Locally the router has the route in BGP from 2 directions, one is local (from redistribution) and the other is through MPLS, on the BGP RIB the AD is not used to decide the best path, in this case is considering that the Local BGP route is the best way to reach the destination, so is not even considering the MPLS one, and that’s why it is talking the EIGRP path, the TAGS will fix the issue."
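My rough reading of that reply, sketched in Python. This assumes Cisco IOS behavior, where a route the router originates itself (including one redistributed into BGP) gets a default weight of 32768 while a path learned from a BGP neighbor gets weight 0. The comparison below is heavily simplified (real best-path selection has many more steps), and the AS path length of 2 is a made-up number for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BgpPath:
    source: str
    weight: int            # Cisco-specific, local to the router
    local_pref: int = 100
    as_path_len: int = 0


def best_path(paths: List[BgpPath]) -> BgpPath:
    """Simplified selection: highest weight, then highest local preference,
    then shortest AS path. Administrative distance plays no part here."""
    return max(paths, key=lambda p: (p.weight, p.local_pref, -p.as_path_len))


# Path created locally by redistributing the EIGRP route into BGP.
local = BgpPath("redistributed from EIGRP", weight=32768)

# Path for the same prefix learned from the MPLS provider (hypothetical AS path length).
mpls = BgpPath("eBGP from MPLS", weight=0, as_path_len=2)

winner = best_path([local, mpls])
print(winner.source)
```

If this is right, the locally redistributed path wins inside BGP, so the MPLS path is never offered to the routing table and the router keeps following the EIGRP route toward the other DC.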
Can anyone decipher in more plain words what he is attempting to say here?