Wednesday, May 07, 2008

A Conversation with James Gosling, Part II

 
Failure and Exceptions
 
Summary
James Gosling talks with Bill Venners about how to build solid apps, organize your catch clauses, scale checked exceptions, and deal with failure.

Much has been written recently about the value that checked exceptions add, or subtract, from Java. Some programmers feel that checked exceptions help them build robust applications. Others find that checked exceptions are an annoyance that hinders their productivity. In this article, Java's creator James Gosling voices his opinions about checked exceptions and error handling.

In this interview, which will be published in multiple installments, James Gosling talks about many aspects of programming.

  • In Part I: Analyze this!, Gosling describes the ways in which Jackpot can help programmers analyze, visualize, and refactor their programs.
  • In this installment, Gosling talks about how to build solid apps, organize your catch clauses, scale checked exceptions, and deal with failure.

Creating Solid Software

Bill Venners: In your weblog, you wrote:

Lots of newbie's coming in from the C world complain about exceptions and the fact that they have to put exception handling all over the place—they want to just write their code. But that's stupid: most C code never checks return codes and so it tends to be very fragile. If you want to build something really robust, you need to pay attention to things that can go wrong, and most folks don't in the C world because it's just too damn hard.

One of the design principles behind Java is that I don't care much about how long it takes to slap together something that kind of works. The real measure is how long it takes to write something solid.

What does "solid" mean?

James Gosling: Solid means you can run the software day in and day out. When the usual crazy things happen, the software survives. The software deals with the problems. If you change the system configuration around, the system, as much as possible, copes—or at least recognizes when things are wrong.

One of the traditional things to screw up in C code is opening a data file to read. It's semi-traditional in the C world to not check the return code, because you just know the file is there, right? So you just open the file and you read it. But someday months from now when your program is in deployment, some system administrator reconfigures files, and the file ends up in the wrong place. Your program goes to open the file. It's not there, and the open call returns you an error code that you never check. You take this file descriptor and slap it into your file descriptor variable. The value happens to be -1, which isn't very useful as a file descriptor, but it's still an integer, right? So you're still happily calling reads. And as far as you can tell, the world is all rosy, except the data just isn't there.

Problems like that are really hard to test for. It is really hard to test the unlikely things, if only because the unlikely things never really occur to you. In this example, the programmer will think, "Well of course the file is there. Why would anybody not have the file there?"

A programming language can't solve all the problems. A language can't guarantee that no matter how screwed up the environment gets the program will survive. But anything the language can do to increase the probability that programs will be reasonably graceful under fire is a good thing. For example, just making people at least willfully ignore return codes helps. In Java you can ignore exceptions, but you have to willfully do it. You can't accidentally say, "I don't care." You have to explicitly say, "I don't care."

Bill Venners: You don't have plausible deniability.

James Gosling: Yeah, there's no plausible deniability when it comes to checked exceptions.

Orgnanizing Your Catch Clauses

Bill Venners: I recently published an interview with C#'s creator Anders Hejlsberg in which we talked about checked exceptions and why C# doesn't have them. I wanted to ask you about some of his comments. He said:

It is funny how people think that the important thing about exceptions is handling them. That is not the important thing about exceptions. In a well-written application there's a ratio of ten to one, in my opinion, of try finally to try catch.

In the finally, you protect yourself against the exceptions, but you don't actually handle them. Error handling you put somewhere else. Surely in any kind of event-driven application like any kind of modern UI, you typically put an exception handler around your main message pump, and you just handle exceptions as they fall out that way. But you make sure you protect yourself all the way out by deallocating any resources you've grabbed, and so forth. You clean up after yourself, so you're always in a consistent state. You don't want a program where in 100 different places you handle exceptions and pop up error dialogs. What if you want to change the way you put up that dialog box? That's just terrible. The exception handling should be centralized, and you should just protect yourself as the exceptions propagate out to the handler.

I do catch exceptions in an outer loop sometimes. But most times my catch clauses tend to be spread around the program. I find that catch clauses usually have a natural home—the method that has enough contextual knowledge to know how to deal with the exception. How would you recommend people organize their catch clauses?

James Gosling: I tend to do catches much more frequently than Anders would have you do, because the knowledge of the situation is always fairly localized. When you try to open a file and it's not there, you're coping strategy is really determined by what you were going for. Some guy miles away isn't going to know what to do. The code that tried to open the file knows what to do, whether it be trying a backup file, looking in a different directory, or asking the user for another filename.

Having one big catch clause on the outside really only works if your exception handling philosophy is simply to die. If you have an event loop, you can maybe cause that one event to just be tossed. If you have a plugin architecture, the enclosing environment could respond to a failure in the plugin by disabling it—like an app server deciding to disable a servlet if it sees failures. But if you're not doing an event driven program or plugins, there isn't an outside place where you can take big hunks of functionality and saw them off. On the other hand, typically you should have last ditch try catch blocks. If you're writing a web server, for example, it's a good thing to put a last ditch try catch block around processing a request. But pretty much all that a try catch block like that can do is blow the request away. There's no ability to respond gracefully. There's no ability to take account of local context to cope and adapt, which is really one of the key hallmarks of truly reliable software.

Bill Venners: It adapts to problems?

James Gosling: Instead of just rolling over and dying.

Scalability of Checked Exceptions

Bill Venners: Another concern Anders Hejlsberg had about checked exceptions, which is a concern I've heard many people express, is scalability. He said:

In the small, checked exceptions are very enticing. With a little example, you can show that you've actually checked that you caught the FileNotFoundException, and isn't that great? Well, that's fine when you're just calling one API. The trouble begins when you start building big systems where you're talking to four or five different subsystems. Each subsystem throws four to ten exceptions. Now, each time you walk up the ladder of aggregation, you have this exponential hierarchy below you of exceptions you have to deal with. You end up having to declare 40 exceptions that you might throw. And once you aggregate that with another subsystem you have 80 exceptions in your throws clause. It just balloons out of control.

James Gosling: I've almost never seen that actually happen in large systems. One of the coping strategies I've seen for dealing with that kind if situation is exception wrapping. If you have a subystem that might be hit with hundreds of different exceptions, you can always take those exceptions and wrap them in a new one. You can create a new exception and list other exceptions as their cause. Also, the exception class hierarchy can help. For example, there are dozens of different IOExceptions, but it's common to not declare that you throw the specific subclasses. You just throw IOException.

But in general, when you are writing a declaration in a throws clause, it is good to try to be as specific as possible.

Bill Venners: Why?

James Gosling: Because it gives more information to the context about what exactly went wrong. That helps client programmers discriminate among the different sources of failure in their catch clauses.

Bill Venners: Have you seen the throw clause scalability problem in your visits out in the industry?

James Gosling: I've never seen issues where people have a gazillion items in their throws clause. Three, four, or five, maybe—and even those numbers are rather large. Almost always it's zero or one.

Bill Venners: I've found that many people don't seem to think about programming so much in terms of the interface as an abstract contract and the implementation as one way to implement that contract. They just think in terms of "the code." But the throws clause is part of the interface. If someone makes an implementation change that results in a method being called that throws a new checked exception, that doesn't mean that exception should necessarily appear in throws clauses up the call stack. If you automatically plop exceptions into throws clauses, you're letting the implementation drive the interface, instead of thinking about what the throws clause should contain at the abstract contract level.

James Gosling: That's a place where exception translation can be a good thing. If you have something that's supposed to be reading a database, and somewhere deep inside it, it could conceivably get a MalformedURLException, it probably doesn't make sense to propagate that outward as a MalformedURLException. It's better to think about what kind of exceptions are really a part of the interface, and not just an artifact of how the underlying method was implemented. In the Java exceptions mechanism, exceptions can have a reference to other exceptions as their cause. So you create a new IOException, and you set its cause to this MalformedURLException. And you bundle them together. That's been very effective.

Exception Handling versus the One True Path

Bill Venners: What percentage of time do you think people should be spending programming the one true path—the path taken if nothing goes wrong—versus handling all the potential errors?

James Gosling: It's all over the map. Ideally you should be spending almost all of your time on the one true path. But if you look at various domains, the percentages are different. The worst case would be people doing avionics software. If they spend a few percent of their time working on the one true path, they feel very good about themselves. That's the kind of area where the high nineties percentage of your time is spent dealing with things that could go wrong, and speculating about things that could go wrong. In that area, life is truly dynamic. There are real random number generators up there, with turbulence doing strange things to wing tips and all kinds of weird stuff that may never actually happen.

Almost as bad as avionics is networking software. When you're trying to write some piece of software that is dealing with something across the network, trying to make that truly reliable can be a lot of work. I've done code that tries to maintain reliable database access, for example, and that can be 90% error handling. You look at most protocol implementations and at some level they're almost all error handling. If you believed everything was perfect all you would do was spew packets. You wouldn't worry about timeouts, flow control negotiations, and any of the other things that go on. Networking is all about how you deal with errors at all various different levels. But when you're in a fairly well-controlled regime like, "Here's a web request, compute the result please," there shouldn't be a whole lot that you're actually worried about.

Bill Venners: Yeah, I think there's a spectrum. At one end you have pace makers and airplanes. If you don't deal with certain kinds of unexpected situations people can die. At the other end you have one off scripts. I remember one script I wrote that was pulling information out of web pages and putting it into XML files, and the script worked on maybe 90% of the pages I was processing. And that was good enough because it was faster for me to fix the other 10% by hand than to make the script better at dealing with badly formed input files. In that case, there was a system administrator who was quite happy to spend an hour fixing things by hand, because that was cheaper than spending 2 or 3 hours trying to get the script to be more solid.

James Gosling: Right. It all depends on what the consequences of failure are. If you're just extracting information from pages like that and you're a system administrator, the consequences of failure are pretty minor. But if you're writing transaction reconciliation software for some bank, you could find that oh, thirteen billion dollars leaked out of some system because of a bug. I actually had a friend who basically lost thirteen billion dollars in one night because of a bug.

Bill Venners: Thirteen billion dollars?

James Gosling: Yep. Thirteen billion. They were actually able to reconstruct it from transaction logs, but it was a real big mess. Thirteen billion dollars is real money, so you're actually probably willing to test, willing to spend some time in dealing with handling those exceptions and making things reliable.

Bill Venners: I have heard a lot of people complain that in practice Java's checked exceptions encourage a lot of empty catch clauses. This empty catch clause problem seems to be more common than exceptions being propagated in throws clauses going all the way up the call stack. The compiler forces a programmer to catch an exception, so they catch it but they don't do anything. Then if the exception is ever thrown, it ends up being swallowed. It is not handled and not propagated. It is lost. Do you think that in a lot of places there's a lack of a culture of programmers wanting to deal with failure, perhaps because in the past they didn't have to deal with failure as much?

James Gosling: Anytime somebody has an empty catch clause they should have a creepy feeling. There are definitely times when it is actually the correct thing to do, but at least you have to think about it. In Java you can't escape the creepy feeling.

It really is a culture thing. When you go through college and you're doing assignments, they just ask you to code up the one true path. I certainly never experienced a college course where error handling was at all discussed. You come out of college and the only stuff you've had to deal with is the one true path. You get a job working in some IT department's data center, and they're using the software in production runs. If it doesn't work, it's a real problem. All of a sudden there's all this painful stuff that you don't like to do, and you're feeling grumpy about it because this isn't as much fun as just writing clean code.

There's really nothing a hacker likes more than having an empty editor buffer. You can just start to write code. That's the way university assignments are: here's a problem, just write the code. The real world is much cruftier.

But there's also a spectrum. You talk to people in banks, where large quantities of money get lost if there's a problem. They take failure very seriously. The spookiest folks are the people who have been doing real time software for a long time. I've spent a certain amount of time with that crowd. By and large these folks are very conservative and very careful. A lot of them know what it's like to go visit the family of the deceased and explain the bug to them. There are a number of places that have policies that if a test pilot augers in, then once it's all figured out what happened, the people who engineered the thing that failed have to go explain it to the family. I've actually only met one person who has ever actually had to do that. That would change your attitude about dealing with failure really quickly.

No comments: