Nov. 11, 2021, noon

Problems that should be approached with machine learning can look extremely similar to problems that should be approached in a different way. There’s often no theoretical reason you cannot approach a problem with machine learning if it can be expressed that way, but we should always be asking ourselves what we actually should be doing instead of just what we can do. #rules #networks #machine learning

The so-called “Law of the Instrument” states: “When all you have is a hammer, everything looks like a nail.” Whether this should be considered, as the article states, a “cognitive bias” (much less a “law”) is up for debate, but it nonetheless describes something significant about how problems are approached. Machine learning is no different.

This may not be a particularly profound point, I recognize, but it is much harder to put into practice than it is to recognize in theory. Allow me to make a demonstration.

Sequence-to-sequence (seq2seq) models are often employed for tasks like machine translation, text summarization, and similar use cases. Seq2seq models are recurrent neural networks that learn associations between input and output in the form of sequences of arbitrary length. The recurrent component (an LSTM or GRU) allows the network to model non-local dependencies between elements. The architecture is composed of an encoder, which takes the input sequence and passes it through the hidden layer on to a decoder, which renders it into the output.

This post is not meant to be an in-depth discussion or tutorial of seq2seq models. There are already plenty of those out there. What is important to know is that seq2seq models are a powerful and important class of models which are straightforwardly applicable in a circumstance where straight transliteration of some sequence of symbols is required, or extension of some input sequence is desired.

Now consider a specific task: the task of mapping Arabic numerals to Roman numerals. This task is a particularization of a broader machine translation task, restricted to the domain of numerals, which has the advantage of the semantics being identical between the input sequence and the output sequence (in theory, anyway). Moreover, it looks like the kind of thing machine learning might be needed for. Consider the following examples:

8 = VIII

389 = CCCLXXXIX

50,002 = ~~L~~II

999,999 = ~~CMXCIX~~CMXCIX

On the surface, there is quite a discrepancy between the representational schemas for Arabic numerals and Roman numerals, despite both systems being base 10. Indeed, upon examining the rules for generating Roman numerals, there are some complexities.

Adjacent symbols of equal value are summed

A symbol of higher value is summed with a symbol of lower value to its right unless the symbol following that symbol is of higher value than it

A symbol of lower value to the left of a symbol of higher value is subtracted from the value of the symbol of higher value

A symbol cannot appear more than three times in sequence
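The rules above can be read as an evaluation procedure. The following is a minimal Python sketch, not the code linked from this post: the names `VALUES`, `roman_to_int`, and `has_valid_runs` are hypothetical, and the symbol values are assumed to be the standard seven. It evaluates a numeral but does not check full well-formedness.

```python
import re

# Standard values of the seven Roman primitives (assumed for this illustration).
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Evaluate a numeral left to right using the additive and subtractive rules."""
    total = 0
    for i, symbol in enumerate(numeral):
        value = VALUES[symbol]
        # A symbol directly left of a higher-valued symbol is subtracted.
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            # Otherwise it is summed (this also covers equal adjacent symbols).
            total += value
    return total


def has_valid_runs(numeral: str) -> bool:
    """Check only the last rule: no symbol more than three times in a row."""
    return re.search(r"(.)\1{3}", numeral) is None


print(roman_to_int("VIII"))       # 8
print(roman_to_int("CCCLXXXIX"))  # 389
print(has_valid_runs("IIII"))     # False
```

Note that `roman_to_int` will happily assign a value to malformed strings such as "IIII"; rejecting those would need additional checks beyond this sketch.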

Roman numerals use the following primitives: I = 1, V = 5, X = 10, L = 50, C = 100, D = 500, M = 1000.

Following the rules given above, the largest number that can be represented in this system is 3999, or MMMCMXCIX (MMM = 3000 / CM = 1000 - 100 = 900 / XC = 100 - 10 = 90 / IX = 10 - 1 = 9). To work around this limit, several competing (and non-standardized) conventions were employed in the Roman world to represent larger numbers. The most common and comprehensible seems to be the vinculum, an overline that denotes multiplication by 1000 (e.g., ~~V~~ would be 5 ✕ 1000, so 5000). For convenience, I am representing the overline as a strikethrough. The vinculum numeral is then adjoined to the ordinary numeral, so that 5226 is ~~V~~CCXXVI. Notably, this move expands the set of primitives from 7 to 14 and allows numbers up to 3,999,999 to be represented.
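To make the vinculum convention concrete, here is a rough Python sketch of one way to produce these numerals. It is not the code linked from this post. The function names `basic_roman` and `vinculum_roman` are hypothetical, and the sketch assumes that for numbers of 4000 and above the entire thousands count is overlined (written here with `~~`), which is only one of several possible conventions.

```python
# Value/symbol pairs in descending order, including the six subtractive pairs.
PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def basic_roman(n: int) -> str:
    """Greedy conversion for 0..3999 (0 yields the empty string)."""
    out = []
    for value, symbol in PAIRS:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def vinculum_roman(n: int) -> str:
    """Convert 1..3,999,999, marking the overlined part with ~~...~~."""
    if not 1 <= n <= 3_999_999:
        raise ValueError("n must be between 1 and 3,999,999")
    if n < 4000:
        return basic_roman(n)
    # Assumption: from 4000 upward, the whole thousands count takes the vinculum.
    thousands, rest = divmod(n, 1000)
    return f"~~{basic_roman(thousands)}~~{basic_roman(rest)}"


print(vinculum_roman(5226))     # ~~V~~CCXXVI
print(vinculum_roman(50_002))   # ~~L~~II
print(vinculum_roman(999_999))  # ~~CMXCIX~~CMXCIX
```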

There are three curious properties of the system when viewed through a modern lens. First, it lacks 0. The history of 0 is interesting in its own right, but suffice it to say for now that the Romans, like much of the ancient world (except for a few, including the Egyptians), lacked the concept itself in their systems. Second is the shift at 4000. Whereas Arabic numerals increment predictably in terms of the writing conventions (at each 10, increment the column to the left), Roman numerals increment more or less predictably until 4000, when they suddenly shift. Last is the terminus of the system at 3,999,999, as vincula do not apply successively (at least, not commonly).

Despite all of these oddities, the task of generating Roman numerals from Arabic numerals is no particular challenge for a seq2seq model. The code and torch files can be found here. The model is 99.3% accurate, and it took comparatively little time to set up and train. With more effort, the accuracy could certainly be pushed up considerably higher.

The purpose of this exercise, though, is a contrast. The rules for generating Roman numerals are known, and it turns out to be fairly trivial to write an algorithm which generates all and only well-formed Roman numerals from Arabic numerals, given nothing but the correspondence between the primitives and their Arabic numeral counterparts. The code for this algorithm is here for interested parties. Even the algorithm itself is some form of overkill, since I wrote it explicitly to be as minimalist as possible in the knowledge provided to the algorithm. Much smaller, more efficient algorithms can be written, or, since there are fewer than 4,000,000 well-formed Roman numerals, a lookup table is a conceivable approach.
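As one illustration of how small such a solution can be, here is a sketch of a digit-wise table lookup for the ordinary range 1 to 3999. It is not the algorithm linked above, which deliberately starts from less knowledge; the table layout and the names `DIGITS`, `to_roman`, and `TABLE` are assumptions made for this example.

```python
# One row per decimal place (ones, tens, hundreds, thousands);
# the index into each row is the digit at that place.
DIGITS = [
    ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"],
    ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"],
    ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"],
    ["", "M", "MM", "MMM"],
]


def to_roman(n: int) -> str:
    """Convert 1..3999 by looking up each decimal digit separately."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be between 1 and 3999")
    parts = []
    for place in range(3, -1, -1):  # thousands first, ones last
        digit = (n // 10 ** place) % 10
        parts.append(DIGITS[place][digit])
    return "".join(parts)


# The full lookup table for the ordinary range is small enough to precompute.
TABLE = {n: to_roman(n) for n in range(1, 4000)}

print(TABLE[389])   # CCCLXXXIX
print(TABLE[3999])  # MMMCMXCIX
```

No training data, no model weights, and every output is correct by construction, which is the contrast this post is drawing.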

From time to time, everything does look like a nail. It’s easy enough to say “keep an open mind,” or “don’t get locked into particular kinds of solutions,” but putting those maxims into practice is a fair bit harder, for sure. It may seem more impressive or interesting or technically savvy to write a model, but the way you really demonstrate skill is with simple, readable, efficient solutions. A solution should be only as complicated as the problem requires. This is a hard lesson to learn, and it requires both depth and breadth, but it is important to never stop asking what should be done to solve a problem, rather than only what can, especially if the answer always seems to be the same thing.
