Common Issues
In this section, I'll be identifying common issues with the types of systems discussed in the previous section. Depending on previous experience, these might appear obvious (you've seen it before) or surprising (you'll see it eventually). I would suggest that when you take on a new legacy project, you go through these and see what applies. Being aware of and expecting these issues will help you deal with them. I have included an Issues Checklist in the appendix to help with this.
No Documentation
When a system has been running for a long time, it is not unusual for supporting artefacts to be lost. The users of the system will notice if it stops working and will complain, but will anyone notice if items, such as documentation, are lost? This can happen for several reasons:
Support systems may be retired, and information removed. This can happen when there is no obvious link from the documentation to the system it concerns.
Documentation still exists, but no one knows where it is, or it has been moved and the old links/locations are no longer valid. This is a common problem when migrating document management systems.
Information is in an uncommon or unknown format. It's amazing how many files end in .doc.
Information was only ever stored locally (and the machines have been wiped) or not stored at all.
The Agile Manifesto says we should favor working software over comprehensive documentation, but it does not say we should produce no documentation at all. There are various types of metadata about a system whose absence can make supporting it very difficult. For example:
Usernames and passwords: The direct users of the system and its administrators may know their access details but what about the system's sub-components? What about administrator access for the database or directory server? If these are lost, then performing maintenance actions can be incredibly difficult.
Release instructions: Maybe you have the source code, but do you know how to build and release the software? Data is more likely to be kept than a running but unused service, so your build and deploy server probably won't still be there after 10 years. Was your build configuration kept in the source control system?
Last release branch details: Which one of the multiple code branches or labels was released?
Communication protocols: This is one of my personal bugbears. Many systems designed over the last 10 years communicate with each other via XML messaging. Whereas developers used to document binary message formats, they often haven't done so for XML, as it's deemed to be 'human-readable' just because it's text-based. XML data blocks are rarely self-explanatory, and they have so many optional elements that they are very difficult to reverse engineer (see the sketch after this list). This is especially true if you only control one side of the communication, for example, when receiving a message from an external source.
Licenses and other legal agreements: I'm going to talk a little more about licenses later, but can you track down the legal information about how you can use the third-party software and hardware elements in your system? They may be more restrictive than you think.
Users, including external systems: Do you know who your users are? What external systems rely on the one you control?
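To show why undocumented XML is hard to reverse engineer, here is a small, entirely hypothetical message (the element names are invented). Both of the blocks below could be valid output from the same sender; without documentation, you can't tell whether the missing elements are optional, deprecated, or only present under certain conditions:

    <creditUpdate client="C001">
      <limit currency="GBP">5000</limit>
      <review>2012-09-01</review>
    </creditUpdate>

    <creditUpdate client="C002">
      <limit>7500</limit>
      <flags mode="3"/>
    </creditUpdate>

Until you've seen every variant in production (or tracked down the sender's schema), anything you write to consume these messages is a guess.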
It is always worth tracking down and cataloguing all relevant documentation as a first step to maintaining a legacy system.
Lost Knowledge
There is an overlap between lost knowledge and a lack of documentation. You could argue that no knowledge would be lost if everything were documented, but that is unrealistic and incredibly time-consuming. It's likely that the reasoning behind design decisions has been lost: you have the artefact (the system) but not the how and the why of its creation.
Note
Many years ago, I worked on a system that, once a day, collected some customer information and emailed it to some users. As part of the process, a file was created (called client-credit.tmp in the /tmp folder on a Unix machine). The next day, this file was overwritten. When we began upgrading it, we moved the process into memory, so no temporary files were created. I received a very angry phone call a few days after go-live from someone whose system was broken. It turned out that another system (in a totally different part of the organization) was FTP'ing onto the server, copying client-credit.tmp from /tmp and using it. At some point, there must have been a discussion between the two departments about giving access to this information and how to do it. However, once this hack was implemented (maybe this was not supposed to be the permanent solution) it was forgotten and those responsible moved to different jobs. It was interesting that no one from either department knew this was happening.
These types of lost knowledge can create some nasty traps for you to fall into. Some of the techniques and processes discussed in later sections should help you identify and avoid them.
Hidden Knowledge
Also referred to as Special Voodoo, these are useful but obscure pieces of information that may either work around a bug or initiate a non-obvious function. Although not lost, this knowledge is not documented and is deliberately hidden by an individual or a small group of operators. Why would anyone do this? The most common answer is job security. If you're the only person who knows how to bring up the system when it freezes, or how to create a certain report, then you are much harder to replace.
This is very, very common in legacy systems and can mean that the very people you need to help you support and improve the system might be actively working against you. I'm going to be covering this a little more in a later section on Politics.
Note
I once worked for an ISV (Independent Software Vendor) that had written a complicated piece of analysis software. This had started as a simple tool and had grown organically over 10 years and was now a system used by large organizations. There was an operator for this software at one site who would 'work from home' 1 day a month in order to complete a monthly report – which he presented when he returned the next day. His colleagues (and boss) thought that he was working all day on the report, but he spent it playing golf. The report only took 10 minutes to generate by exporting the data from an obscure part of the system and then loading it into a spreadsheet template.
Unused Functionality
In a large system that performs complex functions, there is a good chance that there are huge chunks of functionality that are not used. This tends to be due either to an over-zealous development team adding features that were never used (over-engineering) or to features that are no longer needed. Often, parts of the system are replicated elsewhere in newer systems, and we can think of our legacy system as being partly replaced.
Why do I describe this as a problem? Can't we just ignore these? If you don't know whether part of a system needs to be supported or maintained, you might waste a lot of time trying to migrate something that is not used. However, if you decide that part of the system isn't required and then do not support or maintain it, you might later discover that it was just used very infrequently.
The owners of these rarely used features might be difficult to track down, making it hard to get a sign-off for changes.
Some apparently unused features might still be required, even if you have never used them and never intend to. This is actually very common in software used in a regulated or safety-critical environment, for example, the ability to delete records for security reasons (privacy regulation) or features used only in disaster scenarios. You shouldn't turn off the control systems dealing with the meltdown of a nuclear power station just because they haven't been used in 20 years. Unfortunately, there are examples of safety systems being removed in just this way – it's important you know this.
Even if your system isn't guarding against a potential catastrophe, you should consider whether the Business Continuity Plans and High Availability systems still work.
Is this tractor used? It's old but not abandoned. It might start, but is it used to do any work? Perhaps it's the only tractor on the farm with a certain attachment to drive an essential, but rarely used, piece of equipment. Importantly, if you're the mechanic on the farm, do you spend a huge amount of time keeping it working? Can you even find anyone to ask about this?
No Coherent Design/Inconsistent Implementation
This can be due to an active decision to avoid top-down design but is more often due to a system growing in an organic way.
Many successful, large systems don't start as large systems. They start as a useful tool doing a difficult but specific task. If the tool does the job well, then it'll have features and options added. Over a long period of time, it can morph into a large and complex system with no coherent design. Common indicators of this are:
Business logic in the incorrect place, such as in a client rather than a service. Perhaps this started as a standalone application that morphed into a multiuser application.
Multiple tools for doing the same job – for example, different XML parsers being used in different parts of the system, or several ways of sending an email report (see the sketch after this list). This is often due to a new developer bolting on a feature using the tools they know without looking at the rest of the system.
Inappropriate use of frameworks. For example, using threads against advice in an application server or trying to get a request-response style service to schedule tasks. This is normally due to adding a feature that was never originally envisioned and trying to do it in the current framework.
Many layers and mapping objects. New features may not fit into the current 'design' and require a lot of supporting code to work with what is there.
Sudden changes in code style – from pattern use to formatting – typically where different developers have added features at different points.
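As a minimal, hypothetical sketch of the 'multiple tools for the same job' indicator (the class, method, and element names are all invented), here are two methods from the same imaginary code base reading the same attribute – one written with the JDK's DOM API, the other bolted on later using its SAX API:

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.SAXParserFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    public class InconsistentParsing {

        // Written by the original author using the DOM API.
        static String titleViaDom(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("title");
        }

        // Bolted on years later by a different developer using the SAX API.
        static String titleViaSax(String xml) throws Exception {
            final String[] title = new String[1];
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes attrs) {
                            if ("report".equals(qName)) {
                                title[0] = attrs.getValue("title");
                            }
                        }
                    });
            return title[0];
        }

        public static void main(String[] args) throws Exception {
            String xml = "<report title=\"Monthly Summary\"/>";
            System.out.println(titleViaDom(xml)); // Monthly Summary
            System.out.println(titleViaSax(xml)); // Monthly Summary
        }
    }

Neither method is wrong on its own; the problem is that a maintainer now needs to understand both approaches, and any fix to the parsing logic has to be made twice.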
Some of these may be genuinely bad (time for a refactor), but others may just be inconsistent. Why does this matter?
Lack of coherence in design and consistency in implementation makes a system much harder to modify. It's similar to driving in a different country – the road systems may make perfect sense, but they take a while for a foreigner to get used to. And if you move country every week, you'll never be a good driver again. Similarly, it does take a developer a while to get used to patterns and tools and they will be much less productive in an incoherent system.
It's much harder to predict the non-functional behavior in such a system. If each part of the system works differently, then performance tuning becomes hugely complex.
Fragility (versus Stability)
The first thing to note is that I've used the term fragility rather than instability. The difference is subtle, but important. An unstable system will stop working in a frequent and unpredictable way (if it were predictable, then you'd have specific bugs), whereas a fragile system will work as required but will crash horribly when a change is made.
Fragility can be worse than instability from a maintainer's perspective. If the users consider a system to be fragile, it can be difficult to get permission to modify it. Sometimes this fragility may be perceived rather than actual, and this is common when the system is old and no longer understood – the fear is that making any changes will stop it from working and no one will be able to repair it.
In contrast, sometimes an unstable system can work in your favor, as you can make mistakes, and no one will notice.
Later in the book, I'll be covering some of the tools and techniques for making changes to fragile systems.
Tight Coupling
In a tightly coupled system, it can be very difficult to pull apart the system's different components. This makes it hard to upgrade in a piecemeal fashion, leaving you with a stressful 'big bang' approach to upgrading and maintenance. It's also very difficult to performance-tune problematic functions.
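Here is a minimal, hypothetical sketch of what tight coupling looks like in code (all names are invented). The job constructs its own concrete collaborators, so neither the data store nor the mailer can be replaced, stubbed, or tuned without editing and re-releasing the job itself:

    public class TightCouplingSketch {

        // Imagine this buried deep in a legacy code base: the job builds its own
        // concrete collaborators, so nothing can be swapped or tested in isolation.
        static class ClientCreditJob {
            void run() {
                FileCreditStore store = new FileCreditStore("/tmp/client-credit.tmp");
                ConsoleMailer mailer = new ConsoleMailer();
                mailer.send("finance@example.com", store.export());
            }
        }

        // Concrete classes referenced directly rather than through interfaces.
        static class FileCreditStore {
            private final String path;
            FileCreditStore(String path) { this.path = path; }
            String export() { return "credit data read from " + path; }
        }

        static class ConsoleMailer {
            void send(String to, String body) {
                System.out.println("To: " + to + "\n" + body);
            }
        }

        public static void main(String[] args) {
            new ClientCreditJob().run();
        }
    }

Multiply this pattern across hundreds of classes and a piecemeal upgrade becomes very difficult to plan.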
Loose Coupling
What? Surely loose coupling is a good thing? However, what about the anecdote I told earlier about the temporary file that was FTP'd by another system? Although that is a terrible design, isn't it an example of loose coupling?
With a tightly coupled system, it's usually obvious what the dependencies are, and nothing runs without them being in a specific state. With a loosely coupled system (particularly one where services depend on each other and communicate asynchronously via a bus) this may not be so obvious. This is particularly problematic when there is a lot of unused and rarely used functionality (see earlier issues), as you don't know how or when to trigger potentially important events.
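Here is a minimal, hypothetical sketch of that invisibility (the 'bus' is simulated with an in-memory queue; a real system might use JMS or a message broker). The publishing code compiles and runs with no reference to whoever consumes the message, so the dependency can't be found by reading this class:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class LooseCouplingSketch {

        // Stand-in for a message bus; a real system might use JMS, MQ, or similar.
        static final BlockingQueue<String> BUS = new LinkedBlockingQueue<>();

        // The legacy job simply publishes. Nothing here says who, if anyone, listens.
        static void publishDailyCreditReport() throws InterruptedException {
            BUS.put("client-credit report for today");
        }

        public static void main(String[] args) throws Exception {
            publishDailyCreditReport();
            // Somewhere else entirely - a different team, perhaps a different
            // system - something may be consuming these messages. Or not.
            // You cannot tell by reading the publisher.
            System.out.println("Published: " + BUS.take());
        }
    }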
Zombie Technologies
The reason these are called Zombie rather than Dead is that, when I investigated some example technologies, I found that almost no technology ever dies. There is always some hobbyist somewhere insisting that it's still alive. Therefore, I'd count the following as zombie (dead in reality) technologies:
Technologies where the company that created and supported them no longer exists
Technologies no longer officially supported by their creators (including open source projects that have not been touched in years)
Technologies no longer compatible with the latest versions of the platforms they depend on
Technologies where it is not possible or practical to find engineers that have any knowledge of them
Technologies where important parts have been lost (such as the source code)
I'm sure I'll get complaints about some of the preceding definitions, but these are all situations that make a system much harder to upgrade or maintain. Trying to solve a bug or modify the deployment for these can be difficult and modifying functionality might be impossible.
We should also remember that fashions and the way technologies are used can change a lot over time. Well-written Java code that used best practices from 2001 will look completely different from code a Java developer would write today.
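As a hypothetical illustration (the method names and data are invented), both of the following methods do the same job – collect the names longer than three characters, in upper case – but one is written in a circa-2001 style and the other as most Java developers would write it today:

    import java.util.Enumeration;
    import java.util.List;
    import java.util.Vector;
    import java.util.stream.Collectors;

    public class StyleDrift {

        // Circa-2001 style: raw collection types, Enumeration, explicit casts.
        static Vector filterNamesOldStyle(Vector names) {
            Vector result = new Vector();
            for (Enumeration e = names.elements(); e.hasMoreElements();) {
                String name = (String) e.nextElement();
                if (name.length() > 3) {
                    result.addElement(name.toUpperCase());
                }
            }
            return result;
        }

        // The same logic as most developers would write it today: generics and streams.
        static List<String> filterNamesNewStyle(List<String> names) {
            return names.stream()
                    .filter(name -> name.length() > 3)
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            Vector oldNames = new Vector();
            oldNames.addElement("Ann");
            oldNames.addElement("Bertrand");
            System.out.println(filterNamesOldStyle(oldNames));                   // [BERTRAND]
            System.out.println(filterNamesNewStyle(List.of("Ann", "Bertrand"))); // [BERTRAND]
        }
    }

Neither version is broken; the point is that a maintainer steeped in one era's idioms has to consciously adjust to the other.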
Licensing
All the components in a system, both software and hardware, will be covered by a license. This is something very often forgotten by technologists when inheriting a legacy system. Licenses can say almost anything and might put very restrictive conditions on what you do with components. This applies to everything through the entire stack and potentially even some of the data in the system if it was supplied by a third party with restrictions on its use.
Here are some questions you should ask about a system and its licenses:
Can you make changes?
Does it stop you using virtualization?
Are you prevented from running the software on certain hardware, or is the hardware licensed to run only certain types of software?
Is reverse engineering prohibited and will this affect your ability to instrument it?
Can you modify any source you might have?
Can you even find the licenses?
Is any of the data owned by a third party?
The cost implications of not understanding licensing can be high. In the January 2013 issue of Computing Magazine, some of the implications of licensing and virtualization were discussed and the following quote was given:
Note
Hidden Licensing Costs
"If a company has SQLServer on VMware, licensed on a per core basis that can be dynamically scheduled across hosts – all machines need licenses." – Sean Robinson, License Dashboard quoted in Computing.
As well as problems regarding restricted use, the money already spent on licenses is a sunk cost. If an organization has paid a large amount of money for a product, then it can be politically very difficult to change, even if the alternative has a lower overall cost.
Note
Sunk costs are retrospective (past) costs that have already been incurred and cannot be recovered. They are an interesting area of behavioral economics research. In theory, a sunk cost should not affect future decisions – the money is already spent and can't be recovered. In practice, people find it very difficult to write off previous expenditure, even if this makes economic sense.
Regulation
Regulation is very similar to licensing in that it also concerns external rules imposed on a system, outside the normal functional and non-functional requirements. The difference is that, whereas licenses stay the same when you want to change the system, regulation tends to change over time and force functional changes on you.
Shortly before I wrote this book, there were changes to many websites in the UK. A huge number started informing users that they store cookies and requesting permission to do so. This is due to the Privacy and Electronic Communications Regulations (PECR), as amended in 2011, which are the UK's implementation of the European Union's ePrivacy Directive. It's often referred to as The Cookie Law.
Of course, not all websites have implemented this, and the ones that have tend to be the high-volume websites with permanent development teams. Many smaller organizations with legacy websites that aren't actively developed have not started asking for this permission and are arguably not compliant. I'm not a lawyer though, so seek legal advice if you think you have a problem!
It's interesting that so many organizations have taken this regulation so seriously and it's probably because the use of cookies is so easy for an individual outside the organization to test. We can imagine someone writing a script to find UK-based websites that don't comply and then trying to sue them for infringing their privacy rights.
There are many other regulations that affect IT systems, from industry-specific ones (finance, energy, and telecoms are heavily regulated industries), to wide-ranging, cross-industry ones, such as privacy and data protection. Regulations regarding IT systems are increasing, and systems normally must comply even if they were written a long time before the regulation came into existence.
Please remember that failure to comply with regulations may be a criminal offence. If you breach a license agreement, you might find yourself being sued for a large amount of unpaid fees, but if you fail to comply with regulations (for example, money laundering or safety reporting), the consequences could be much worse.
Politics
Technology workers are often very bad at spotting political factors that affect their projects. Here are some questions you should ask yourself about your project (whether legacy, green-field, or somewhere in-between):
Who owns the project?
Who uses the project?
Who pays for the project?
Who gets fired if it fails?
Whose job is at risk... if successful?
What we should be looking out for are conflicts of interest – so let's look at the last two I've listed.
If a project fails, then the sponsors of the project and the implementers of the project will take the blame and their jobs may be at risk. This is to be expected and is one of the reasons we're motivated to do a good job and why we get harassed so much by project sponsors.
However, we rarely consider who suffers when a project is a success. It's much easier to get funding for a migration or upgrade if there is a defined business benefit; easier still if there are increased revenues or cost savings (profit!). When an IT system saves money, it's often because jobs can be rationalized, that is, people get fired.
Note
Once, I was taken to see a client by a salesman for the software company we worked for. We walked through a room of data entry clerks (most of whom were women) and, being a stereotypical salesman, he stopped to flirt several times (he already had a couple of divorces and was working on his third). Once we left the room, he said, "It's a shame, isn't it?" I was a little confused and asked what he meant. His reply was, "Well, once we've installed the data-importing software you've written, they're all out of a job". I hadn't been out of university for long and was naive to the real-world effects (and potential politics) of what we were doing.
Even if no jobs are made redundant by a system improvement, the users might suffer in other ways. If tasks become standardized/commoditized, then it's much easier to replace a specific user with someone who is cheaper (outsourcing the process) or who has a better attitude. Remember the examples of hidden knowledge I gave earlier? An improved system might remove this advantage for power users.
We must remember that many people simply don't like change. Technology workers are very unusual in that we like change; it's often what drew us into the job in the first place. We relish changes to the way we work, and making ourselves redundant probably holds no fear, as we move jobs every 18 months anyway. The users of a legacy system may have been doing the same job in the same way for 10 years, and the thought of learning something new (and possibly being bad at it) can scare them.
You should also question the costs and benefits. There is a good chance that the cost is incurred by someone different to the beneficiary of the project. This is often the case in larger organizations where the IT department is separate from the business units using the software. Different solutions to issues may incur costs to different groups. For example, a re-write of a piece of software may come out of the business unit's budget, whereas an 'upgrade' might be viewed as the IT department's cost. Virtualizing an old system will probably be a cost to the IT department but the cost increase or savings on business software might fall to the end users. This will cause conflicts of interest and the decisions taken may be based on the political power of the department leaders rather than being the best decision for the organization.
Politics can be very destructive, and I've seen this as the cause of failure of many projects. From personal experience, I'd say that it's more often the cause of failure than bad technical decisions or poor implementation.
Organizational Constraints and Change are Reflected in the System
Beyond the deliberate effects of politics, there are more subtle marks that an organization can leave on the systems developed for it. This is described by Conway's Law (http://en.wikipedia.org/wiki/Conway's_law): organizations design systems that mirror their own communication structures.
Not only are these organizational constraints reflected in the system when it is developed, but the constraints themselves will change over time. As people leave and teams merge or split, the barriers that existed may disappear or new ones may be created. However, the artefacts in the system will remain and build up over time. This can become obvious in long-lived legacy systems.
This affects software and hardware. Consider a system that two different departments in an organization are using. They may want separate databases (that only they have access to) and may want their processes/reports to run on servers they own (perhaps they had to provide some budget for the system's creation, and they want exclusive use of 'their' server). These departments may change dramatically over time, but you'll still have two databases and two servers.
Understanding the organizational structures of software and business teams can help to explain which quirks of the system are due to this and which are due to some large, complex subtlety that you're yet to understand (hat tip to Richard Jaques, whom I stole this sentence from).
External Processes Have Evolved to Fit around the System
An IT system is part of a larger process and workflow. It will have been developed to fit into the process to make it more efficient and effective. However, it doesn't just fit into the outside world; it will also affect it. The processes surrounding the system will change and evolve to work with it and its quirks. With a legacy system, these processes will have become embedded in the organization and can become very difficult to change (remember that people don't like change).
This can lead to the peculiar situation where you must get a newer, better system to work with old and inefficient processes that were originally imposed by the preceding system. When gathering requirements for a new system, you should remember this and make sure you are finding out what is needed rather than what is currently done.
External Systems Have Evolved to Fit around the System
The external systems that interact with the legacy system will have also evolved to match its interfaces. This is especially true if the legacy system is quite old and the external systems were created afterwards.
This can lead to a situation where a replacement system must present old and unwanted interfaces because the dependent systems have no other way to connect to it. Ancient interfaces can therefore survive long after the system they were originally designed for has been removed.
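One common way this plays out is an adapter: the replacement system keeps presenting the ancient interface on top of its new internals. Here is a minimal, hypothetical sketch (all names and formats are invented):

    public class LegacyInterfaceAdapter {

        // The modern replacement service with the API we actually want.
        static class CreditService {
            int creditLimit(String clientId) {
                return 5000; // stub value for the sketch
            }
        }

        // The ancient interface that downstream systems were built against.
        interface LegacyCreditFeed {
            String fixedWidthRecord(String clientId);
        }

        // The adapter keeps the old contract alive on top of the new service.
        static class CreditFeedAdapter implements LegacyCreditFeed {
            private final CreditService service = new CreditService();

            @Override
            public String fixedWidthRecord(String clientId) {
                // Reformat the modern answer into the old fixed-width layout.
                return String.format("%-10s%10d", clientId, service.creditLimit(clientId));
            }
        }

        public static void main(String[] args) {
            LegacyCreditFeed feed = new CreditFeedAdapter();
            System.out.println("[" + feed.fixedWidthRecord("C001") + "]");
        }
    }

The adapter is cheap to write, but it quietly commits the new system to supporting the old contract indefinitely.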
Decaying Data
I've already described how systems can grow organically, but the data within them can also decay (The metaphor isn't perfect but I'm trying to make this interesting). This is a decrease in the overall quality of the data within the system caused by small errors gradually introduced by the users and external feeder systems. Examples include:
Data entry (typing) errors
Copy-and-paste reproduction errors such as missing the last character or extra whitespace
Old data, that is, details that are no longer true, such as an out-of-date address
Feeder systems changing formats
Data corruption
Undeleted test data
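As a minimal, hypothetical sketch (the field names and rules are invented), this is the kind of routine check that would catch several of the problems just listed – for as long as someone remembers to run it:

    import java.util.List;

    public class DataQualityCheck {

        record Client(String id, String name, String postcode) {}

        // Flag the common signs of decay: stray whitespace, missing fields,
        // and records that look like leftover test data.
        static void report(List<Client> clients) {
            for (Client c : clients) {
                if (!c.name().equals(c.name().trim())) {
                    System.out.println(c.id() + ": leading/trailing whitespace in name");
                }
                if (c.postcode() == null || c.postcode().isBlank()) {
                    System.out.println(c.id() + ": missing postcode");
                }
                if (c.name().toLowerCase().contains("test")) {
                    System.out.println(c.id() + ": looks like undeleted test data");
                }
            }
        }

        public static void main(String[] args) {
            report(List.of(
                    new Client("C001", "Acme Ltd ", "AB1 2CD"),
                    new Client("C002", "Test customer", ""),
                    new Client("C003", "Bravo plc", "EF3 4GH")));
        }
    }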
When a system is new, there are often data quality checking tasks to keep the data accurate, but with time these tend to be dropped or forgotten (especially tasks to prune data no longer required). This is another example of something from behavioral economics, called The Tragedy of the Commons.
Note
The Tragedy of the Commons
When many people have joint use of a common resource, they all benefit from using it. The more an individual uses the resource, then the more they personally benefit. If they mistreat it or overuse it, then it may degrade but the degradation is spread amongst all the users. If an individual spends time improving the resource, then any increased benefit will also be spread amongst the whole group. From a purely logical point of view, it makes sense for any individual to use it but not spend any time improving the resource. They are acting independently and rationally (according to self-interest) but the resource will degrade over time.
These errors in data can accumulate, sometimes without being noticed, until the entire system becomes unusable.
Now What?
Did you recognize any of those issues from previous projects you've worked on? I've not listed the system being 'bad' as a problem; it might not be perfect (all real systems have issues and bugs), but if it's a legacy system then it must have value, or it would simply be turned off.
When working on a legacy system, it's worth recording issues and potential issues, such as those listed previously. Consider starting a 'Problems and Issues' document for your legacy system and recording real and potential problems. Please see Appendix 2 - Legacy Project Questions as a potential starting point for this.