Implementing a custom MS Word for DOS file parser to properly do GSM SS7

Yes, I'm not kidding!

In recent months, I've been writing quite a bit of GSM MAP (Mobile Application Part) code. MAP is the protocol used heavily in the GSM core network and especially on the roaming interfaces between different operators. It is specified in GSM TS 09.02 and later 3GPP TS 29.002.

The protocol specification relies on an ASN.1 description of the messages as well as the regular BER (Basic Encoding Rules). ASN.1 is this marvelous technology that allows a protocol to be specified in an abstract and formal notation, in an extensible way, removing all the problems of human-written marshalling code, full of errors and differences due to different developers interpreting a human-readable specification in different ways.

So far so good. You would think it should be simple to write a parser and generator for MAP messages: Simply feed the ASN.1 definitions into the ASN.1 compiler of your choice, and it will generate code in the target language you require.

As long as both sides of the communication do that using exactly the same revision of the specification (and don't make implementation mistakes), this will work. The reality looks very different, though :( When I test my code against something like one million real-world messages captured on a production SS7 roaming interface, it already produces errors on packet number six of that trace.

The problem is: The protocol designers have not specified the first versions in a really extensible way, i.e. a given operation originally only returned one atomic data field, and it was later extended to return a sequence of data fields. Thus, there is one additional level of hierarchy in the encoding.
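To make the "one additional level of hierarchy" concrete, here is a minimal sketch of a decoder that tolerates both shapes of a result. It assumes the field in question is a BER OCTET STRING (tag 0x04) that later got wrapped in a SEQUENCE (tag 0x30). The function names and the sample bytes are made up for illustration and are not taken from any MAP operation.

```python
OCTET_STRING = 0x04  # universal, primitive
SEQUENCE = 0x30      # universal, constructed


def read_tlv(buf: bytes, offset: int = 0):
    """Read one BER TLV with a single-octet tag and a definite length.

    Returns (tag, value, offset_of_next_tlv).
    """
    tag = buf[offset]
    length = buf[offset + 1]
    pos = offset + 2
    if length & 0x80:
        # Long form: the low 7 bits give the number of length octets.
        num = length & 0x7F
        if num == 0:
            raise ValueError("indefinite length is not handled in this sketch")
        length = int.from_bytes(buf[pos:pos + num], "big")
        pos += num
    value = buf[pos:pos + length]
    if len(value) != length:
        raise ValueError("truncated TLV")
    return tag, value, pos + length


def extract_field(result: bytes) -> bytes:
    """Return the field, whether it is bare (old) or wrapped (new)."""
    tag, value, _ = read_tlv(result)
    if tag == OCTET_STRING:
        # Old style: the result is the atomic field itself.
        return value
    if tag == SEQUENCE:
        # New style: the field is the first member of a SEQUENCE.
        inner_tag, inner_value, _ = read_tlv(value)
        if inner_tag != OCTET_STRING:
            raise ValueError("unexpected first member in SEQUENCE")
        return inner_value
    raise ValueError(f"unexpected tag 0x{tag:02x}")


old_style = bytes([0x04, 0x03, 0x01, 0x02, 0x03])
new_style = bytes([0x30, 0x05, 0x04, 0x03, 0x01, 0x02, 0x03])
print(extract_field(old_style) == extract_field(new_style))
```

Peeking at the outer tag only works when the two shapes happen to use different tags. A compiler-generated decoder for one revision has no such fallback, which is why it chokes on messages from the other revision.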

Not only that, but in their infinite wisdom, the designers of MAP have also failed to include versioning information in each and every message header. Instead, it is part of the application context name, which is only part of the first message of every conversation.
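In practice this means a decoder has to remember the version per conversation. The following is only a sketch of that bookkeeping. It assumes the application context name arrives as an object identifier whose last two arcs are the context id and the version (which is how I read the MAP spec), and it keys the state on a single transaction id, which simplifies how TCAP really handles originating and destination ids. The class and method names are hypothetical.

```python
class DialogueVersions:
    """Remember the MAP version negotiated at the start of each dialogue."""

    def __init__(self):
        self._by_tid = {}

    def begin(self, tid: bytes, ac_name: tuple) -> None:
        # Assumption: the last two arcs are (context id, version).
        ac_id, version = ac_name[-2], ac_name[-1]
        self._by_tid[tid] = (ac_id, version)

    def lookup(self, tid: bytes):
        # Later messages carry no version, so we have to look it up.
        # Returns None if the first message of the dialogue was never seen.
        return self._by_tid.get(tid)

    def end(self, tid: bytes) -> None:
        self._by_tid.pop(tid, None)


tracker = DialogueVersions()
# Example OID with made-up context id 14 and version 3.
tracker.begin(b"\x00\x01", (0, 4, 0, 0, 1, 0, 14, 3))
print(tracker.lookup(b"\x00\x01"))
print(tracker.lookup(b"\x00\x02"))
```

The second lookup shows the nasty case: if a capture starts in the middle of a conversation, there is simply no version information to go by.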

Furthermore, different versions of the MAP specifications disagree on whether certain fields are deemed optional or not. This is further complicated by somewhat strange versioning habits. There is the Revision number of the TS 09.02 (like 3.8.10), then there is a different version number encoded in the corresponding ASN.1 files like 'version9(9)' and individual operations then have v1/v2/v3 in their application context name.

An even wiser decision must have been to remove the description of older messages from the later versions of the specifications. So even specifications published in the year 2000 no longer include definitions of messages that were still part of them 5 years earlier. Why does it matter? Because today, in 2011, you still see MAP messages on the international SS7 interfaces that are encoded in some of the earliest versions of the MAP protocol!

And if all of this was not enough, the biggest bummer is: For most of the releases of the specification, the ASN.1 text files are not distributed separately, but they are interspersed with human-readable text in the actual specification documents (which can be 600 pages long, nothing you want to cut+paste).

Even worse: If you go to the ETSI homepage and download the PDF version of old 09.02 specs, they will actually provide a PDF with a scanned paper print-out, i.e. no searching and no copy+pasting.

Luckily, the 3GPP has made the history of 3.8.0 and later available on their FTP server. But the documents are in MS Word for DOS format, like they were written originally. This format cannot be opened by OpenOffice, and as far as I know not even by any of the Windows Word versions that MS has released in the last 10 years.

So what did I do? I actually installed MS Word 5.5 for DOS (provided as Freeware from Microsoft) and ran it in DOSEMU, to convert the specs into RTF format. This way I can at least open them and look at them in a modern text processor.

But this still does not solve the copy+paste problem.

I finally found antiword, but it mainly focuses on Word for Windows files and only does rudimentary text extraction from Word for DOS files. But hey, there is an online copy of chapter 16 from the File Formats Handbook, apparently published by Dr.Dobb's (who remembers them!!) at some time in the past.

So what did I do? I wrote a custom parser for those old Word/DOS files, which parses the paragraph format descriptions and tries to identify those sections that contain the ASN.1 code. As they are almost the only part of the specification that is enclosed with a border line on all four sides, this should work pretty well. Early results are quite promising!
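The filtering idea can be sketched roughly like this. This is not the code of the actual tool, and it does not touch the Word for DOS file format at all: it assumes the paragraphs have already been parsed into records that carry their text and the set of sides that have a border. The `Paragraph` type and its fields are hypothetical.

```python
from dataclasses import dataclass

ALL_SIDES = frozenset({"top", "bottom", "left", "right"})


@dataclass
class Paragraph:
    text: str
    borders: frozenset = frozenset()  # sides of the paragraph that have a border line


def boxed_blocks(paragraphs):
    """Group consecutive fully boxed paragraphs into blocks of text."""
    blocks, current = [], []
    for para in paragraphs:
        if para.borders >= ALL_SIDES:
            # Bordered on all four sides: a candidate ASN.1 line.
            current.append(para.text)
        elif current:
            # An unboxed paragraph ends the current block.
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


doc = [
    Paragraph("The following type is used:"),
    Paragraph("Foo ::= SEQUENCE {", ALL_SIDES),
    Paragraph("  bar OCTET STRING }", ALL_SIDES),
    Paragraph("Some more prose.", frozenset({"top"})),
]
for block in boxed_blocks(doc):
    print(block)
```

Anything else in the document that happens to be fully boxed would be picked up too, so the output still needs to go through an ASN.1 validator.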

My hope is now that the ETSI stylesheets did not change too much over time, i.e. that this parser will be able to extract the ASN.1 spec for all of the protocol versions that I can find. If that works, I can run them through a validator, then pretty-print them and put them all in one git tree in chronological order. And maybe at some point in 2011, we will have the marvels of a unified diff between two different MAP versions. The strange part is: Diff was developed in the 1970s, GSM in the late 1980s. They should have known about it back then, and used a revision control system like SCCS to record all the changes they made to the specification.
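Once the ASN.1 text is extracted, the diff itself is the easy part. Git will do it once the files are committed, but as a small illustration, here is the same thing with `difflib` from the Python standard library. The file names and the two ASN.1 snippets are invented for the example.

```python
import difflib


def asn1_diff(old_text: str, new_text: str, old_name: str, new_name: str) -> str:
    """Return a unified diff between two extracted ASN.1 modules."""
    return "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=old_name,
            tofile=new_name,
        )
    )


old = "Foo ::= OCTET STRING\n"
new = "Foo ::= SEQUENCE {\n  bar OCTET STRING\n}\n"
print(asn1_diff(old, new, "map-old.asn", "map-new.asn"), end="")
```

Pretty-printing both versions the same way first matters here, otherwise whitespace and line-wrapping differences between the old documents drown out the real changes.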

I guess all of this is a glimpse of how a digital archaeologist of the 22nd century must feel when analyzing ancient artefacts and trying to understand what the heck his ancestors were doing back then.

UPDATE: The tool can be found at http://cgit.osmocom.org/cgit/asn1_docextract/