No backup
Posted by Paul Cox on July 15th, 2009
So on Monday, we were talking about the problems that ERAM is having. And then at work the other day, I heard about an ERAM test that wound up as an utter failure, and which illustrates something that we were talking about in the comments.
Seems they had the notion to test something about how ERAM works with the FDIO system. FDIO is the flight data system; it sends, well, flight plan data to all the “children” facilities in a given control center’s airspace.
(A jargon note… one project I’d love to have here would be a wiki for FAA stuff, so someone new coming to this blog could just click on these acronyms and have the wiki page pop up and they’d learn what ERAM and FDIO and so forth were. With a staff of a fulltimer or two, man, the things I could do with the Follies… sigh.)
So on Sunday morning, here’s the plan: They’re going to detail a trainee to fax flight plans to the facilities that would normally get them from Seattle Center, and then they can take FDIO down and run their tests.
Apparently, what they figured was they could do a two-hour-dump of the flight plans in the ETMS system (dammit, another acronym! ETMS is the system that our TMU, traffic management unit, uses to monitor supply and demand on our airports, and determines when to start delay programs due to overload) and then use the ETMS strips for the flight plan data to the outlying facilities.
Um, nope. The problem with this plan is that the ETMS strips aren’t really full flight plans. The data dump thing that ETMS does is just intended to serve as an emergency backup, and there’s no times, and the strips aren’t complete, and… well, honestly, I don’t really understand what all is missing from those strips. (Hey, I never work mids, so I never use them- which serves to illustrate how poorly trained some controllers are on the backup procedures, but that’s another post, so never mind.)
Anyway, what happens is that the facilities- like Seattle Approach and Portland Approach and so forth- have a few hundred flights and incomplete flight data. So, naturally, they start calling our TMU. Which has two people staffing it. To try and pass a couple hundred flight plans.
On top of THAT, we also distribute other types of information via FDIO, like significant weather reports. In fact, they’re called SIGMETs, and the approach controls and towers need them so they can inform pilots of weather that can be hazardous to flight.
There’s some SIGMETs out, but these facilities aren’t getting them, so they call for those, too. And they’re kind of long to read and describe and it takes a while, and remember- there’s only two people to do it. Those two people also have their REGULAR job duties to attend to, like playing Minesweeper and surfing the web. (Hey, I’m just kidding my TMU brothers and sisters… and checking to see if any of them are really reading the blog, or just tell me they love it but secretly ignore it. They’ll give me crap at work if they’re reading it!)
Anyway, what winds up happening is aircraft start getting their clearances later, and later, and then they have to start waiting for their clearances, but since the whole system is kind of dorked up we can’t really track the delays, and eventually someone with enough common sense says “screw this noise, end the ERAM test, turn the goddamn FDIO back on and let’s get back in business” and they do it and things get back to normal.
Meanwhile, airlines took delays. They might not have been “reportable”, meaning they were 14 minutes or less, but I’m told we definitely had airplanes waiting on the ground, burning fuel and costing the airlines money, while we couldn’t get our stuff straight.
For a TEST.
Now, the point here is this: We didn’t have backup. There’s systems that we use that simply MUST work, or else things are going to slow down. PERIOD. No way around it; the system, today, is built in such a way that if the main system fails, the backup cannot handle the same demand.
And this was a Sunday morning, the slowest daytime of the week, at Seattle Center, one of the slower centers in the nation for sheer traffic count. What if it’d been Chicago Center, midweek? What if the ERAM system’s link to FDIO failed then? What will we find next?
As people have pointed out in the comments, ERAM is a whole-ball-of-wax kind of deal. There is no separate program that’s a backup; ERAM is its own backup.
Imagine this for a minute: You take two completely identical computers from Dell, and you load them with exactly identical versions of Windows. You load them with exactly all the same programs, like Explorer and Firefox and Word and Minesweeper (gotta have something for the managers to do) and then you hook them into the internet in such a way that they get all of exactly the same information at exactly the same time.
They’re basically mirrors of each other. Computer A’s cooling fan dies, though, and the CPU overheats and the computer starts running all flakey and Minesweeper locks up. What do you do?
Well, you switch over to Computer B while the techs come and fix the fan, right? No problemo, you have a backup system, good move.
But a week later, something else happens. You surf one too many times to the porn page, or to the FAA Follies, and you get a nasty virus. It starts launching pop-up and pop-under windows like mad; each time you close one, another two pop up. It connects with a computer in China or Russia where some 7-year-old kid who’s smarter than you and me put together has his game server, and he takes over your computer and uses it to store old episodes of Battlestar Galactica.
So your computer slows to a crawl and quits working entirely, just like it did the week before. What do you do?
You switch to Computer B, right?
But wait! You’re screwed! Computer B is a physically separate machine, sure, but remember- it’s running exactly all the same programs at the same time as A was, and it got all the same data at the same time… including the virus. And now it’s completely fuxored too.
What do you do? You turn ‘em both off and call the tech and hope that he can’t tell that you were reading a bad website when you downloaded the virus, right?
Except that when it’s ERAM, you just shut down the ATC system. And you have no backup.
THAT, folks, is why we’re so uptight about ERAM. There’s no backup. If the software dies, we’re screwed, blued, and tattooed (as they say in Enumclaw… well, someone said it once somewhere.) The hardware is separate, but the computers are running exactly the same thing, and if that software has a glitch, it’s going to take down EVERYTHING.
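For the programmers in the audience, here's a toy sketch of the Computer A / Computer B story. To be clear, this is not how ERAM is actually built; every name in it (`Channel`, `process_with_failover`, the `"BAD"` message) is made up for illustration. It just assumes two channels running the same code on the same inputs, and shows that failover rescues you from a hardware fault on one box but not from a bug that both boxes share.

```python
from dataclasses import dataclass
from typing import List, Optional


class HardwareFault(Exception):
    """One physical box is broken (the dead cooling fan)."""


class SoftwareFault(Exception):
    """The shared program chokes on a particular input (the virus)."""


@dataclass
class Channel:
    # Hypothetical stand-in for one of two identical computers.
    name: str
    fan_ok: bool = True

    def process(self, message: str) -> str:
        # A hardware problem affects only this one machine.
        if not self.fan_ok:
            raise HardwareFault(f"{self.name}: overheated")
        # The bug lives in the code, so every channel running
        # this same code fails on the same input.
        if message == "BAD":
            raise SoftwareFault(f"{self.name}: crashed on {message!r}")
        return f"{self.name} handled {message}"


def process_with_failover(channels: List[Channel], message: str) -> Optional[str]:
    """Try each channel in order; return None if every one of them fails."""
    for channel in channels:
        try:
            return channel.process(message)
        except (HardwareFault, SoftwareFault):
            continue  # switch over to the next channel
    return None  # nothing left to switch to


if __name__ == "__main__":
    # Week one: A's fan dies, B takes over.
    print(process_with_failover([Channel("A", fan_ok=False), Channel("B")], "flight plan"))
    # Week two: both get the same bad input, and both go down.
    print(process_with_failover([Channel("A"), Channel("B")], "BAD"))
```

The second call comes back empty-handed because the "backup" is running the very same code and was handed the very same input. Separate hardware only protects you from hardware problems.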
Now, during the transition time, we’ll have a backup system… but once we’re on ERAM full time, we’re going to dump the old system. And this is why ERAM needs to be seriously super-perfect, and why it needs to be tested like there’s no tomorrow… and why we need to start being seriously honest about how it’s screwed up right now instead of having puff pieces in the FAA’s internal communication channel.
Because if people don’t get honest about ERAM, and the big mucky-muck decision makers don’t hear the Truth, we’re liable to get this crap forced on us, and down the road we might have the complete failure scenario… for lack of honesty.
July 15th, 2009 at 10:03 am
"THAT, folks, is why we’re so uptight about ERAM. There’s no backup. If the software dies, we’re screwed, blued, and tattooed"
When THAT day happens, I'll be glad I work in a tracon and not the center.
July 15th, 2009 at 11:48 am
Yeah Paul, but FFA guy said that 40 years ago there were problems with the 9020, so isn't this normal, using the flying public as guinea pigs? Oh, and the heroes at ZAU have URET, they don't need strips, URET will save the day… NOT
July 15th, 2009 at 5:04 pm
URET is part of ERAM, not a standalone like it is today. LOL so much for that!
July 15th, 2009 at 6:53 pm
Let's take a look at flight service. Why do you think it is all screwed up? LM, with the FAA's blessing, implemented their "new" computer system, called FS21, before it was ready. LM and the FAA knew it didn't, and wouldn't, work the day they turned it on and closed Anniston AFSS, in February 2007. But the FAA let LM go ahead with consolidation and the fiasco of 2007 resulted. It is 2009 and FS21 still doesn't work correctly. LM is working on a replacement for FS21 because it is so bad. Anyway, there is no backup except for AISR, which the FAA pays for and runs. In 2007, when FS21 would crash (and it happened often), there was no backup. What a fiasco. There are people in the FAA, LM and WCG associated with the AFSS program that should have gone to jail for fraud.
Let's look at notams. Do the center and tower guys ever wonder why they are so messed up? LM uses a program called OPUS for notams. It doesn't play nice with NFDC (National Flight Data Center) computers. They are like oil and water, they don't mix. Back in the FAA days of flight service, when a notam was entered, the NFDC computer would check the notam for accuracy, and would kick it out if it was wrong. The good people at the NFDC would then send a message back to the issuing AFSS telling them the notam was wrong and to please fix it. Now, with OPUS and NFDC not playing nice, the NFDC accepts the notam without checking it (something to do with OPUS being incompatible with the NAS). The only way to catch bad notams at the NFDC is for the specialists there to physically check every notam in the country, and they don't have the bodies for that gigantic job every day. So, a notam could be entered under SEA saying "The FAA follies is the greatest" and it would probably come out with a number and everything. There are also a bunch of other reasons that contribute.
Back to today's post. NATCA and all controllers need to hold the FAA's feet to the fire about this. If ERAM goes like FS21, with the FAA going along even though they know it is screwed up, disaster is in the air, big time. Call your congressional reps or Chairman Oberstar. You can make a difference.
July 16th, 2009 at 5:13 am
Yeah, URET is replaced by EDST (I think it stands for "En Route Decision Support Tool") in ERAM… EDST, unlike URET, gets all its data from ERAM and is completely dependent on it…
July 16th, 2009 at 6:56 am
So wait, are you suggesting we use strips forever? I'm not following you.
July 16th, 2009 at 10:21 am
So wait, are you suggesting you want a receipt from the bank forever? Don't you trust their computers to work correctly 100% of the time?
http://gettheflick.blogspot.com/2009/03/did-someo...
Don Brown
http://gettheflick.blogspot.com/
July 16th, 2009 at 4:10 pm
Having URET going AND working strips seems a bit excessive. I'd be more worried about the GPS system going belly up and having zero backup under ERAM.
July 22nd, 2009 at 11:50 pm
All URET functions are now handled by the CP (Conflict Probe) processor. Right now, with DS, URET is spread across several processors and has more ways to fail, because one processor relies on another, and it has no backup either. The CP has a backup: say I pull the plug on CP1, CP2 takes over with no problem. Now, some of the other articles on here say ERAM has no backup because it runs the same software on both A and B Channel, so that if Channel A has a bug, Channel B will have the same bug. Yes and… no. I won't spend the time trying to explain why I say that; I'd be here a while. But there are "slight" differences in the software between A and B Channel, most of which are in how it talks to ECG. You will find there are software issues with Channel A that you won't find on Channel B, and vice versa.
July 23rd, 2009 at 9:14 pm
Actually, ETMS, TMU and FDIO are not acronyms. They are just initials. An acronym must make a pronounceable word. I suppose FDIO IS being pronounced FIDO, however.
Say, am I anal enough to be a controller?
August 12th, 2010 at 10:35 am
I always prefer to use brushless cooling fans because they last longer and need less maintenance.