2) What is the fundamental difference between a relational database and a hierarchical database and how does that effect the end-user? (Should we even care? If so why?)

From VistApedia
Jump to: navigation, search

While MUMPS has been characterized as "hierarchical", the DBMS that VistA uses, VA FileMan, provides what is more accurately characterized as a polymorphic view of the database. One can readily use relational projections (indeed there are commercial add-ons that give a strict SQL view of the database). The more advantageous view through VA FileMan is more like an object view of the data with abstract Data Types being highly specialized for optimal use and performance. End users usually need not care (except that performance of VA FileMan is demonstrably superior (there are published reports) to SQL on the same hardware and Configuration.)

The relational model itself says nothing about how data is stored, but one of the oldest and most basic methods of storing data is in the form of sequential Records indexed by B-trees. MUMPS implementations generally store data (not just indexes) in the form of B-trees, providing relatively fast access.

The database in M is called by some a "sparse array." This means that there are no "blank spaces" left for data to be later filled into. So with M, if there is no data present, then no space is wasted. I find this to lead to many many fields being defined for a given file. With a traditional database, having all these fields with empty/wasted space, would lead to huge database files. But with M, one can can store years of patient information on a relatively small disk.

[Rick Marshall replies] The difference between a relational database and a hierarchical one (or, for that matter, a polymorphic one like FileMan) has to do with how it models the world.

All databases exist to Record an abstract model of pieces of the world. Databases are usually structured as files (or tables or classes), each of which lists entities of a similar kind, such as patients, or drugs, or visits. Just as the file represents a category of entities, so each Record (or entry or row or object or instance) in that file represents a specific entity, such as a specific patient, a specific drug, or a specific visit. Databases, files, and Records are not the real things they represent, only abstract representations of them. An entry in the patient file is not a real patient, but an abstraction of a patient, a metaphor for that patient. Very much as with poetry, the more closely that metaphor matches the important parts of the real thing it represents, the more powerful the metaphor, the more meaningful, and from the perspective of medical informatics, the more likely it is to assist in improving patient health. Whether you have the right information and whether you have organized it into the right metaphor is largely dictated by how that information will be used--that tells you which operations can be inefficient and which need to be efficient, which tells you how to balance the tradeoffs that are always involved.

A good database designer chooses apt metaphors that match well the kinds of information the clients need to Record. The strategic part of that choice involves selecting the right database paradigm; the tactical part is using that paradigm effectively. WHICH data a file Records is up to the file designer, but HOW that data is stored is up to the database paradigm you choose (relational, hierarchical, network, polymorphic, object-oriented, etc.). As with successful adaptation in nature, the secret to success lies not with rigid orthodoxy but with responsive flexibility, varying your approach to let each problem dictate its own best solution.

Relational databases represent files as tables, Records as rows, fields as columns. This is the spreadsheet's view of the world. In truth, spreadsheets are excellent for certain categories of problems (e.g., inventories of parts), and terrible for others (e.g., Beethoven's Pastorale piano sonata). The relational database is no more the perfect solution to every problem than is the hammer. Neither is it the wrong solution for every problem. It must be used appropriately, to solve those problems for which it is well-suited.

Relational purists, those who insist every database problem must be solved relationally, love the reductionist simplicity of having only a single metaphor for all problems, and argue that by reducing all problems to this common form you increase interoperability. This assertion is false. When used universally and rigidly, the relational metaphor becomes a Procrustean bed, stretching short people and cutting off the legs of tall people so they all fit "perfectly" in the same size bed. A relational solution is most apt for atomic information bound by simple keys, or for groups of such entities likely to be sorted different ways at different times, in which there is no clear way they are usually organized. It is weak at representing asymmetrical relationships, and does not represent behavior at all (making it incompatible with the object-oriented approach to modeling the world has increasingly favored over the last quarter century). The things left out of the relational model are just as important to creating a good model as are the things it includes, and often turn what may appear to be the trivial problem of making two relational systems share data into a nightmare of hidden assumptions and missing relationships.

Monomaniacal devotion to the relational model (or any one technology) is based on the common but erroneous belief that most software problems are technological problems, that using a common technology removes the most important barriers. In actuality, software is a human communication and relationship problem far more than a technical one, consisting of struggling to understand and communicate our needs, first to each other, then to the computer and whoever maintains the code over the decades that follow. Arguing that common use of a relational model solves interoperability is like arguing that world peace has been solved by having all human beings share the same DNA; the proposed solution does not match the context of the problem.

FileMan is a polymorphic database management system. You can use a relational approach when that is appropriate, hierarchical when that is, and so on. It does not fully implement any of these paradigms, but it does enough of most of them to give us a reasonable solution to most the database problems we have faced over the last twenty-eight years.

In VistA, one place we use a relational model is for information relating to visits, because the priority of the relationships among the many visit-related entities is roughly symmetrical--you are just as likely to look up visits by date as you are by patient or by doctor or by any of a number of other things. Since no one of these entities is the clear majority choice for how to organize the rest, we flatten them all into a collection of separate V files that all point to the shared Visit file. As is often the case with relational solutions, this is as flexible as possible, but using the resulting files is then labor-intensive, since the software or users must correctly combine the files they need to get the right results. This was a good use of the relational model by Indian Health Service in RPMS, which the VA had the good sense to adopt into VistA.

One place we use a hierarchical model is for information relating to the menus a user can choose from. To simplify a bit from reality, in VistA all options are collected into menus, which may themselves be collected into other menus and so on; the resulting structure looks something like a tree. For a tree-like list of menu names, see VistA MenuCommands New. To simplify a bit more, each user is assigned a primary menu that contains all the options she needs to do her job. The software that uses this option information is called MenuMan, and it uses this file to solve several different problems that actually require it to use more than one database paradigm for the same data.

When a user is signed into VistA, MenuMan has to make sure she can reach all the options in her primary menu tree, but only those options. For example, if a user tries to jump to an option, MenuMan has to read the option she wants to jump to and then decide whether that option is indeed somewhere among the options she can reach from her primary menu (and its children, and their children, and so on). This is very much a tree operation, not a table operation, and storing this in relational tables would make MenuMan slow, since it would have to recompute all the interrelationships among the options every time it tried to solve this problem. Instead, MenuMan builds a tree for each primary menu that contains all the interrelationships, and builds an index on that tree to quickly identify 1) which options are contained within each tree, and 2) if so, how do you navigate through the tree to reach those options. A partially hierarchical model is exactly right for this problem, since it is about traversing trees, so that is what VistA uses in its compiled menu trees in the ^XUTL global.

On the other hand, options are also managed individually--locked or unlocked, changes to the menu text, sending out new versions if they are patched, and so on. For this kind of management, an option's location within the compiled menu trees is unimportant. All that matters is being able to quickly and independently find and manipulate each option. For this a relational approach is appropriate, so VistA stores options partially relationally in the Option file. Indeed, although all of the information about each option is stored in the Option file, some of it is stored a second time in the menu trees that VistA copies out of the Option file and reorganizes--the same information is stored both relationally and hierarchically because MenuMan needs it one way for some operations but not others. This is more or less the right solution to MenuMan's problems, but the software industry's warring camps would instead choose to twist the solution one way or the other to make it fit their ideological loyalties. The VistA way is not to take one side or the other nor even to remain neutral, but to advocate an end to the war so the two sides can get on with intermarrying to produce healthier, prettier, and hopefully wiser children.

The software industry has a huge failure rate. It is in the midst of an ongoing, growing, forty-year problem called The Software Crisis, in which as the costs of hardware plummet and the power of that hardware increases geometrically, the costs of software are rising, as is the failure rate. No one has solved or shows any immediate signs of solving this industry-wide dilemma, but everyone who admits to the existence of this crisis claims to have a solution, and all of the solutions are technological. Use SQL and all your troubles will go away. No, use C#. No, use brand X instead. These approaches cannot solve the crisis because none of them deal with the fundamental issues. They all propose that The Software Crisis is a werewolf, so all you need is a magic bullet, which by coincidence they just happen to have with them and are willing to part with in exchange for your savings account. An expensive hammer can indeed solve every problem; it just solves most of them badly, and so do these many technologies-du-jour. Each has a small number of proper uses and a large number of inappropriate uses, and each is proposed to be the only solution you need for entire classes of problems.

When an entire industry is this deluded, you can expect crazy behavior. For example, if the American College of Physician Executives says the state of medical informatics is a disaster, except for VistA, which is great, you can expect the industry to complain as one that no, that cannot be so, for look: VistA is not written in brand X. No matter how bad most medical software is, VistA's success cannot be as interesting as which technologies it uses. Failure according to the prescribed technology is the only permitted option. Succeeding with a technology that everyone agrees cannot work must be a mirage. Look away, before the VistA heathens corrupt you with their non-brand-X ways.

So let me play Ariadne to your Theseus and offer you the golden clew with which to escape the labyrinth: ignore technology; focus on features. The short answer to whether you should choose a relational or hierarchical model is: don't. Forget about technology. Leave that to the programmers. Focus instead on the problems you are trying to solve, the features you want, the kinds of information you need. That is by far the harder problem, and something the technologist cannot do for you. If instead you focus on the technology, not only are you artificially constraining the solution (thus increasing the chances of failure) but also neglecting the harder problem that only you can do (thus ensuring failure).

[Greg Woodhouse adds] I like Rick's point about metaphors here. Regardless how expressive a model may be, the set of tools provided by a DBMS does tend to influence the way we model. The basic data type in LISP (List Processing) is the list, and it is no great surprise that a LISP programmer will be more likely to think about a problem in terms of (nested) lists than a C programmer. Similarly, MUMPS arrays are different from C arrays or Perl hashes, and the basic abstraction supported by the language DOES influence they way MUMPS programmers "see the world". But whether, the basic tools you have available are lists, array, decorated trees (globals) or relations, you are able to express and work with structures much more complex than those directly supported by the language.

There is reason why there are so many programming languages out there, even if they are computationally equivalent, and that is some make tasks of a certain type easier. Similarly, there is no "right" database model. The relational model has been tremendously successful, and is in some ways the database analog of Algol-like programming languages (Algol/68, Pascal, Ada, C, ...) which have been similarly successful and influential in the theory of programming languages. But the relational model no more renders other models (such as the Fileman model) irrelevant or useless, than Pascal renders Scheme or Haskell irrelevant.

[Cameron Schehuber] The 3-tier architecture with SQL at the database tier permits (in theory) the ability to swap vendors of the DBMS. The cost is one of performance unless stored procedures are used ... which then end up tying you to the DBMS vendor. ANSI M provides a way out of that problem since the business logic (usually the middle tier) can be combined with the database and the combination can be ported from one vendor's implementation to another with the exact same code and not have to change any "stored procedures" as they are part of the ANSI M code. (Though vendors can certainly have a significant impact on performance).

Greg wrote:

I find it useful to think in terms of Data Types. I believe that what you are saying here is that it is important to abstract away from the primitives used to implement other types. Just as pointers are the basic primitive used in a language like Pascal to implement abstract Data Types, tuples and relations are the basic primitives used in the relational world to model other structures. I believe it is unnecessarily narrow (and in fact, a caricature of the relational model) to think of the table as the basic *abstraction* of this model. That would be like saying pointers and subfiles are the basic abstractions with which one works in Fileman. That's just not true. They are *primitives* used to model abstractions that can be quite complex.

Think about it this way: Bricks and mortar may be used to construct buildings (well, maybe not out here in earthquake country), but when an architect looks at a building, (s)he does not see (just) brick and mortar. There is much more that can be said about buildings than simply that they are built out of certain fundamental components.