At the 34th annual Chaos Communication Congress, a team of Osmocom folks continued the long-standing tradition of operating an experimental Osmocom-based GSM network at the event. Though I originally started that tradition, I'm not involved in the installation and/or operation of that network; all the credit goes to Lynxis, neels, tsaitgaist and the larger team of volunteers surrounding them. My involvement was limited to answering the occasional technical question and looking at bugs that showed up in the software during operation, fixing them on-site where possible.
34C3 marks two significant changes in terms of its cellular network:
- the new post-NITB Osmocom stack was used, with OsmoBSC, OsmoMSC and OsmoHLR
- both a GSM/GPRS network (on 1800 MHz) and, for the first time, a UMTS network (in the 850 MHz band) were operated
The good news is: The team did great work building this network from scratch, in a new venue, and without relying on people that have significant experience in network operation. Definitely, the team was considerably larger and more distributed than at the time when I was still running that network.
The bad news is: There was a seemingly endless number of bugs that were discovered while operating this network. Some shortcomings were known before, but the extent and number of bugs uncovered all across the stack was quite devastating to me. Sure, from some point on day 2 onwards we had a network that provided [some level of] service, and as far as I've heard, roughly 23k calls were switched over it. But that was after more than two days of debugging and bug fixing, and we still saw unexplained behavior and crashes later on.
This came as such a big surprise because we have put a lot of effort into testing over the last few years. This starts with the osmo-gsm-tester software and its continuously running test setup, and continues with the osmo-ttcn3-hacks integration tests, most of which I wrote during the last few months. Both we and some of our users have also (successfully!) performed interoperability testing with other vendors' implementations such as MSCs. And last, but not least, the individual Osmocom developers had been using the new post-NITB stack on their personal machines.
So what does this mean?
- I'm sorry about the sub-standard state of the software and the resulting problems we've experienced in the 34C3 network. The extent of problems surprised me (and I presume everyone else involved).
- I'm grateful that we've had the opportunity to discover all those bugs, thanks to the GSM team at 34C3, to Deutsche Telekom for donating 3 ARFCNs from their spectrum, and to the German regulatory authority Bundesnetzagentur for providing the experimental license in the 850 MHz spectrum.
- We need to have even more focus on automatic testing than we had so far. None of the components should be without exhaustive test coverage on at least the most common transactions, including all their failure modes (such as timeouts, rejects, ...)
My preferred method of integration testing has been to use TTCN-3 and Eclipse TITAN to emulate all the interfaces surrounding a single one of the Osmocom programs (like OsmoBSC) and then test both valid and invalid transactions. For the BSC, this means emulating MS+BTS on Abis, emulating the MSC on A, and emulating the MGW, as well as driving the CTRL and VTY interfaces.
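To illustrate the idea in a language-neutral way, here is a minimal Python sketch rather than actual TTCN-3. It is not taken from the Osmocom test suites: `ToyRelay`, its message names (`REQ`, `ACK`, `REJECT`, `TIMEOUT`) and the 5-second timer are hypothetical stand-ins for a component that sits between a "lower" and an "upper" peer. The test code plays both peers and checks one valid transaction plus three failure modes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyRelay:
    """Hypothetical component under test (NOT an Osmocom API).

    Forwards a request from a 'lower' peer to an 'upper' peer and relays
    the answer back; rejects or times out otherwise.
    """
    timeout: int = 5                     # seconds, arbitrary for this sketch
    now: int = 0                         # fake clock, advanced by tick()
    pending_since: Optional[int] = None  # start time of the open transaction
    to_lower: List[str] = field(default_factory=list)  # sent towards lower peer
    to_upper: List[str] = field(default_factory=list)  # sent towards upper peer

    def rx_lower(self, msg: str) -> None:
        if msg == "REQ" and self.pending_since is None:
            self.to_upper.append("REQ")
            self.pending_since = self.now
        else:
            self.to_lower.append("REJECT")

    def rx_upper(self, msg: str) -> None:
        if self.pending_since is None:
            return  # unsolicited answer: silently dropped
        self.pending_since = None
        self.to_lower.append("ACK" if msg == "ACK" else "REJECT")

    def tick(self, seconds: int) -> None:
        self.now += seconds
        expired = (self.pending_since is not None
                   and self.now - self.pending_since >= self.timeout)
        if expired:
            self.pending_since = None
            self.to_lower.append("TIMEOUT")


def check_valid() -> bool:
    dut = ToyRelay()
    dut.rx_lower("REQ")   # emulated lower peer sends a request
    dut.rx_upper("ACK")   # emulated upper peer accepts it
    return dut.to_upper == ["REQ"] and dut.to_lower == ["ACK"]


def check_reject() -> bool:
    dut = ToyRelay()
    dut.rx_lower("REQ")
    dut.rx_upper("NACK")  # emulated upper peer refuses
    return dut.to_lower == ["REJECT"]


def check_timeout() -> bool:
    dut = ToyRelay()
    dut.rx_lower("REQ")
    dut.tick(4)           # one second before expiry: nothing sent yet
    early = list(dut.to_lower)
    dut.tick(1)           # upper peer never answered
    return early == [] and dut.to_lower == ["TIMEOUT"]


def check_invalid() -> bool:
    dut = ToyRelay()
    dut.rx_lower("GARBAGE")  # malformed input must not be forwarded
    return dut.to_upper == [] and dut.to_lower == ["REJECT"]


def run_all() -> Dict[str, bool]:
    checks = [check_valid, check_reject, check_timeout, check_invalid]
    return {c.__name__: c() for c in checks}


if __name__ == "__main__":
    for name, ok in run_all().items():
        print(f"{name}: {'pass' if ok else 'FAIL'}")
```

The point of the structure is that the component under test never talks to a real peer: every interface is owned by the test, so rejects, garbage and missing answers can be produced on demand.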
I currently see the following areas in biggest need of integration testing:
- OsmoHLR (which needs a GSUP implementation in TTCN-3, which I've created on the spot at 34C3), where we discovered, for example, that updates to the subscriber via VTY/CTRL would surprisingly not result in an InsertSubscriberData to VLR+SGSN
- OsmoMSC, particularly when used with external MNCC handlers. This was so far blocked by the lack of an MNCC implementation in TTCN-3, which I've been working on both on-site and after returning home.
- user plane testing for OsmoMGW and other components. We currently only test the control plane (MGCP), but not the actual user plane, e.g. on the RTP side between the elements.
- UMTS related testing on OsmoHNBGW, OsmoMSC and OsmoSGSN. We currently have no automatic testing at all in these areas.
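As a rough idea of what a user plane check from the list above could look at, the following Python sketch builds and parses the fixed 12-byte RTP header (RFC 3550) and flags sequence gaps, SSRC changes and unexpected payload types in a packet stream. It does not use any Osmocom API. The packets are generated locally here; in a real test they would be captured from the RTP port under test. The helper names (`build_rtp`, `parse_rtp`, `check_stream`) are made up for this example, and the parser assumes packets without CSRC entries or header extensions.

```python
import struct
from typing import Iterable, List, NamedTuple, Tuple

RTP_VERSION = 2
PT_GSM_FR = 3        # static RTP payload type for GSM full rate (RFC 3551)
RTP_HDR_LEN = 12     # fixed header, assuming CC=0 and no extension


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    seq: int
    timestamp: int
    ssrc: int


def build_rtp(seq: int, timestamp: int, ssrc: int, payload: bytes,
              pt: int = PT_GSM_FR, marker: bool = False) -> bytes:
    """Build an RTP packet with no padding, no extension and no CSRCs."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (pt & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp(packet: bytes) -> Tuple[RtpHeader, bytes]:
    """Parse the fixed header; CSRC list and extensions are not handled."""
    if len(packet) < RTP_HDR_LEN:
        raise ValueError("packet shorter than RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:RTP_HDR_LEN])
    hdr = RtpHeader(b0 >> 6, bool(b1 >> 7), b1 & 0x7F, seq, ts, ssrc)
    return hdr, packet[RTP_HDR_LEN:]


def check_stream(packets: Iterable[bytes],
                 expected_pt: int = PT_GSM_FR) -> List[str]:
    """Return a list of human-readable problems found in the stream."""
    problems: List[str] = []
    prev = None
    for i, pkt in enumerate(packets):
        hdr, _payload = parse_rtp(pkt)
        if hdr.version != RTP_VERSION:
            problems.append(f"packet {i}: bad version {hdr.version}")
        if hdr.payload_type != expected_pt:
            problems.append(f"packet {i}: unexpected payload type {hdr.payload_type}")
        if prev is not None:
            # Sequence numbers are 16 bit and wrap around.
            if hdr.seq != (prev.seq + 1) & 0xFFFF:
                problems.append(f"packet {i}: sequence gap")
            if hdr.ssrc != prev.ssrc:
                problems.append(f"packet {i}: SSRC changed")
        prev = hdr
    return problems


def make_stream(first_seq: int, count: int, ssrc: int = 0x1234) -> List[bytes]:
    """Generate dummy GSM-FR-sized packets (33 byte payload, 160 samples each)."""
    return [build_rtp(first_seq + n, 160 * n, ssrc, bytes(33))
            for n in range(count)]


if __name__ == "__main__":
    stream = make_stream(65534, 4)            # crosses the 16-bit wrap
    print(check_stream(stream))               # clean stream
    print(check_stream(stream[:1] + stream[2:]))  # one packet dropped
```

A check like this only validates the RTP framing. Whether the payload still decodes to the expected audio is a separate question that this sketch does not attempt to answer.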
Even before 34C3 and the above-mentioned experiences, I concluded that for 2018 we will pursue a test-driven development approach for all new features added by the sysmocom team to the Osmocom code base. The experience with the many issues at 34C3 has just confirmed that approach. In parallel, we will have to improve test coverage on the existing code base, as outlined above. The biggest challenge will of course be to convince our paying customers of this approach, but I see very little alternative if we want to ensure production quality of our cellular stack.
So here we come: 2018, the year of testing.