Historical introduction to scripting
By the time Clementine Version 4 was released in 1997, the workbench had gained substantial market traction. Its revolutionary visual programming interface had enabled a more business-focused approach to analytics than ever before: all the major families of algorithms were represented in an easy-to-use form, ODBC had enabled integration with a comprehensive range of data sources, and commercial partners were busy rebadging Clementine to reach a wider audience through new market channels.
The workbench still lacked one major capability: automation, which was needed to embed data mining within other applications. It was therefore decided that automation would form the centrepiece of Version 5, provided by two major features: batch mode and scripting. Batch mode allowed the workbench to run without the user interface, so that streams could be run in the background, scheduled to run at a given time or at regular intervals, or run as part of a larger application. Scripting gave the user automated control of stream execution, even when the user was not present; this was also a prerequisite for any complex operation executed in batch mode.
The motivation behind scripting was to provide a number of capabilities:
- Gain control of the order of stream execution where this matters, that is, when using the Set Globals node
- Automate repetitive processes, for example, cross-validation or the exploration of many different sets of fields or options
- Remove the need for user intervention so that streams could run in the background
- Manipulate complex streams, for example, if the need arose to create 1000 different Derive nodes
These motives led to an underlying philosophy of scripting: scripts replace the user, not the stream. The operations of scripting should therefore be at the same level as the actions of the user; that is, scripts would create nodes and link them, control their settings, execute streams, and save streams and models. Scripts would not be used to implement data manipulation or algorithms directly; these would remain in the domain of the stream itself. This reflects a fundamental fact about technologies: they are defined as much by what they cannot do as by what they can. These principles were not inflexible (cross-validation, for example, might be considered part of an algorithm, yet it was one of the first scripts to be written), but they guided the design of the scripting language. A consequence of this philosophy was that there could be no interaction between script and data; this restriction was lifted only later, with the introduction of access to output objects.
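The idea that a script works at the level of the user's actions can be made concrete with a small toy model. The sketch below is written in Python purely for illustration; it is not the Clementine scripting language, and the `Node` and `Stream` classes, the `build_derive_chain` helper, and all node and field names are hypothetical. It shows a script doing only what a user would otherwise do by hand (creating nodes, setting them, linking them, and requesting execution), while any data processing is left to the stream.

```python
# Illustrative toy model only: these classes are hypothetical and are not
# the Clementine scripting language or any real product API.
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "source", "derive", "table"
    name: str
    settings: dict = field(default_factory=dict)


@dataclass
class Stream:
    nodes: list = field(default_factory=list)
    links: list = field(default_factory=list)     # (upstream, downstream) names
    executed: list = field(default_factory=list)  # terminal nodes that were run

    def create(self, kind, name, **settings):
        node = Node(kind, name, dict(settings))
        self.nodes.append(node)
        return node

    def connect(self, upstream, downstream):
        self.links.append((upstream.name, downstream.name))

    def execute(self, terminal):
        # In a real workbench the stream would process the data here;
        # this toy version only records that execution was requested.
        self.executed.append(terminal.name)


def build_derive_chain(stream, n_fields):
    """Create, configure, and link nodes, as a user would do by hand."""
    previous = stream.create("source", "input")
    for i in range(1, n_fields + 1):
        derive = stream.create("derive", f"derive_{i}", new_field=f"ratio_{i}")
        stream.connect(previous, derive)
        previous = derive
    table = stream.create("table", "output")
    stream.connect(previous, table)
    return table


stream = Stream()
terminal = build_derive_chain(stream, 1000)   # the "1000 Derive nodes" case
stream.execute(terminal)
print(len(stream.nodes), len(stream.links), stream.executed)
```

Note that the script never touches a data record: it only builds and runs the stream, which is the separation of roles described above.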
A number of factors influenced the design of the scripting language in addition to the above philosophy:
- In line with the orientation towards nontechnical users, the language should be simple
- The timescale for implementation was short, so the language should be easy to implement
- The language should be familiar, and so should use existing programming concepts and constructs, and not attempt to introduce new ones
These philosophical and practical constraints led to a programming language influenced by BASIC, with structured features taken from POP-11 and an object-oriented approach to nodes taken from Smalltalk and its descendants.