'Re: [ngw] mta to mta transfer / messages disappered into black hole'


List:       ngw
Subject:    Re: [ngw] mta to mta transfer / messages disappered into black hole
From:       "Jeffrey D. Sessler" <jeff () scrippscollege ! edu>
Date:       2016-06-10 17:06:43
Message-ID: 584FA82F-2C67-4C1A-9B4C-6B963D8A61AC () scrippscollege ! edu

Do you have a maximum message size set on the MTA link? If you did, I would expect
that to show up in the logs.

Did the sending MTA ever report/log that the receiving MTA was down?

Are there routing domains and/or multiple indirect links between MTAs/domains? If the
receiving MTA was not responding, did the sending MTA route it via an indirect link,
and are those missing messages sitting on another MTA that's misbehaving?

Jeff 

On 6/10/16, 9:32 AM, "Marvin Huffaker" <mhuffaker@redjuju.com> wrote:

Okay, I have some good leads on this, thanks for the ideas... I will continue to dig.
A new development though: this morning about 8am the problem happened again, and there
was no replication job or anything unusual happening. I didn't see it first hand,
though, and I can't confirm the situation was identical, but from the sounds of it, it
was the same thing.

Marvin

> > > "Marvin Huffaker" <mhuffaker@redjuju.com> 06/10/16 9:24 AM >>>
Georg... I am baffled as well, but the two scenarios you mention are not possible. I was
looking at the correct MTA. Both MTAs had activity logging during the whole time.
After restarting the receiving MTA, traffic flow resumed, and I confirmed this in the
same agent logs. In the second scenario, nothing changed with the state and, as in the
first scenario, I have continual log activity. There is no time lapse in the logs, just
in the incoming messages.

Marvin

> > > "Georg Fritsch FCP" <gf@fcp.at> 06/10/16 1:05 AM >>>
Hi!

---
2) However on the receiving MTA, there are no logs whatsoever showing a connection
from the Sending MTA was even made or attempted.
---

If you are sure that there was no problem with the sending MTA, there are at least
two explanations which need to be checked:
* You are not actually looking at the receiving MTA, or at the MTA which was receiving
  at that time.
* You are looking at the correct MTA and it was receiving at that time, but the state
  of the MTA is now different (that is, you are not seeing the information/data which
  was once visible at the receiving MTA).

I have had several occasions of data loss in the communication between MTAs (for
different reasons), but I always had a hint in the log.
Georg

> > > "Marvin Huffaker" <mhuffaker@redjuju.com> 10.06.2016 09:24 >>>

It was just a temporary hang as the snapshot was being merged back into the main
image. Nothing else was lost, and mail within the post office continued to function.
It only affected mail that transferred from one MTA to the other. I saw the same
thing with the snapshot happen yesterday (that's a different issue), but this problem
with mailflow did not occur at that time. No other virtual servers were affected. And
the only snapshot was the temporary one that Veeam makes for the duration of the
replication job, which it then removes upon completion.

Marvin

> > > "Georg Fritsch FCP" <gf@fcp.at> 06/09/16 11:52 PM >>>
Hi!

There are multiple servers, and there were problems with snapshots on the virtual
infrastructure (VMware and Veeam) at the time of the problem? Maybe this isn't related
to GW at all. What if some servers reverted to a prior disk snapshot or lost disk
writes, etc., and continued to work on from this baseline? Wouldn't this also be a
plausible explanation for the weird things you are seeing?

Georg

> > > "Marvin Huffaker" <mhuffaker@redjuju.com> 10.06.2016 06:59 >>>
Weirdest situation I have ever seen. GroupWise 2014 R2 HP1 running on Linux,
virtualized on VMware. The post office is in one domain, with the POA and MTA on one
server. The GWIA is in another domain, with the GWIA and MTA on a different server.

Customer post office became inaccessible momentarily due to a Veeam snapshot being
removed (the job had gone WAY too long for some reason). Users became disconnected and
got notifications that the post office was inaccessible. Something you'd expect if
the system was down. But then it snapped out of it after a few minutes and started
running normally, or so it seemed.

After this, the customer reported they were not receiving any inbound email. However,
they were able to send mail out. All internal mail was fine. Only inbound mail was
affected. When I restarted the receiving MTA, the messages immediately started
flowing normally again and things were fine. This is where it gets weird. I went
looking for the missing messages, thinking they would be queued up somewhere. Well,
they weren't. They didn't exist. Every queue was empty. Sparkling clean.

Here are the facts:

1) I was able to trace messages through the GWIA and MTA and show that the messages
   were leaving the MTA en route to the MTA/domain owning the POA. From the sending
   MTA, things looked normal, as I would expect.
2) However, on the receiving MTA, there are no logs whatsoever showing that a
   connection from the sending MTA was even made or attempted.
3) Basically, the messages were being delivered to a black hole and never seen again.
   They were gone. There were no messages tied up in queues, they were just gone.
4) Again, all messages that came in during that time period (appears to be about an
   hour) were lost.
5) Time on both servers is in sync within a second or so.
6) To expand on this, for example: the sending MTA reports delivering the file to the
   receiving MTA at 17:10:02. On the receiving MTA, there is nothing logged until
   17:13:10, and it is an outbound message. No trace of the file that was reportedly
   sent. Nothing on the receiving MTA shows an inbound connection during the time
   period.
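
The comparison described in fact 6 can be sketched as a small script that takes the
times at which the sending MTA reported a delivery and looks for any entry on the
receiving MTA within a few seconds of each. This is only an illustration: the
bare HH:MM:SS strings stand in for timestamps extracted from the agent logs, and the
function name, the five-second tolerance, and the extraction step itself are
assumptions, not part of GroupWise or its log format.

```python
from datetime import datetime, timedelta


def parse_time(stamp):
    """Parse an HH:MM:SS string (assumed, simplified log timestamp)."""
    return datetime.strptime(stamp, "%H:%M:%S")


def unmatched_sends(sent, received, tolerance_seconds=5):
    """Return sender timestamps with no receiver entry within the tolerance.

    Both servers are assumed to be time-synced (fact 5), so a small
    tolerance is enough to pair up the two logs.
    """
    tolerance = timedelta(seconds=tolerance_seconds)
    received_times = [parse_time(r) for r in received]
    missing = []
    for stamp in sent:
        sent_time = parse_time(stamp)
        if not any(abs(sent_time - r) <= tolerance for r in received_times):
            missing.append(stamp)
    return missing


# Times taken from fact 6: delivery reported at 17:10:02, while the next
# entry of any kind on the receiving MTA is at 17:13:10.
sender_log = ["17:10:02"]
receiver_log = ["17:13:10"]
print(unmatched_sends(sender_log, receiver_log))
```

Any timestamp the function returns is a transfer the sender believes it completed but
the receiver never logged, which is the pattern described above.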

I don't know how to prove this, I don't know how to troubleshoot it, and I certainly
don't want to duplicate it again on their production system. That was pretty painful.
I suppose if it happened again I could do a packet trace, but I'm not sure what that
would even prove.
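
Short of a full packet trace, one lighter check that could be run from the sending
server while the problem is occurring is a plain TCP connection test against the
receiving MTA. The sketch below only shows whether something accepts the connection;
it says nothing about what the MTA does with a message afterwards. The function name
is made up for illustration, and the host and port to test are assumptions to be
taken from the actual link configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("receiving-mta.example.com", 7100) would test a hypothetical
host on port 7100, which is assumed here to be the message transfer port configured
for the receiving MTA. A True result while inbound mail is still vanishing would
suggest the listener is accepting connections but not processing them, whereas False
would point at the network or at the listener itself.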

Has anybody run into this before? How is this even possible? 

Marvin

Marvin Huffaker
mhuffaker@redjuju.com
Office: 480-988-7215 (Best Number)
Cell: 480-797-2989 

_______________________________________________
ngw mailing list
ngw@ngwlist.com
http://ngwlist.com/mailman/listinfo/ngw
