matt9j

A networking researcher interested in the challenges of infrastructure at the end of the tail of connectivity

Preamble

I’ve been helping out the Seattle Community Network (SCN) with an ongoing project to build a crowdsourced network performance measurement application for Android. While understanding modern network performance, particularly for wireless networks, is extremely subtle, “speedtests” offer a crude yet popular way to measure a network’s performance, and are easy for general audiences to interpret.

Unsurprisingly, SCN sought to include a “speedtest” capability in their app! A team of an undergraduate researcher (Zhennan (John) Zhou) and a volunteer high school student (Ashwin Chintalapati), organized by Esther Jang, got started on the project and began integrating iperf3 (C, BSD-3) into the application. Due to its maturity, consistent history of open source activity, and explicit offer of “a library version of the functionality that can be used in other programs,” I thought it was a reasonable choice. After a couple of weeks their efforts stalled, though, and I was asked for some input.

This marked the beginning of the journey…

For part one, see my prior post Building iperf3 For Android 11+

Integrating iperf3 with an Android Application

Once we had a stable way to build libiperf, it was time to actually integrate its functionality into our application to provide the “speedtest” capability. Since the test would run for tens of seconds, we needed it to operate asynchronously and not block the main UI thread. The team also wanted the test to be able to complete if the user switches away from the application while the test is running. Android has a great flowchart in their developer documentation for deciding where and how to run tasks in modern Android. Since the test needed to be decoupled from the main application context (to allow switching), the team selected AndroidX’s WorkManager (Overview Ref).

At a high level, WorkManager allows the application to schedule work requests for either one-time work or periodic work. It has some cool capabilities to put constraints on when tasks run (like requiring a WiFi connection, when not on battery, when the device is idle, etc.), and an entire process for managing work windows for periodic maintenance tasks. Fortunately our task is relatively straightforward: we want to immediately run a test in response to user input, allow users to cancel running tests if they start one by mistake, and get the results of the test upon completion.

The Worker class (Ref) provides a synchronous interface for running work in its doWork() method and can be enqueued in the WorkManager to execute immediately. Since libiperf is a C library, we call a native C function across the Java Native Interface which initializes the iperf test c-struct with the intended test parameters and starts the test. All of this works well enough, and allows us to run a single iperf test to completion. At this point things start to get a bit more interesting…

Stopping iperf3 tests

It was an important feature that in-progress tests could be cancelled in case a heavyweight test was started accidentally! The WorkManager provides a straightforward API to cancel enqueued or in-progress work. The Worker class presents a synchronous interface though, so how does the stop signal get to the worker? It turns out there are two ways: either by providing an override for the onStopped() callback, or by polling the isStopped() property in the main doWork() method. Since our doWork() method just indirectly wraps the underlying libiperf run_client() function, which is also synchronous and blocks on an underlying select(), there would not be an easy way to regularly poll isStopped() during the test without extensively modifying libiperf.

Examining the source code for the main iperf3 command line program, which is implemented with libiperf and supports cancellation, we saw that it handles test cancellation by intercepting SIGINT, SIGTERM, and SIGHUP and calling iperf_got_sigend() from the libiperf API whenever one of the signals is received. This looked promising and easy to integrate with the onStopped() worker callback! So we connected the callbacks, started a test, pressed the cancel button, and…

watched our app crash.

Well, not exactly crash, just… close? Digging into the implementation of iperf_got_sigend() quickly reveals a surprising twist for an embeddable API… a call to exit()!

Process assumptions in the libiperf API

It turns out that the iperf API wasn’t quite as embeddable as we had originally hoped, and really was designed for implementing iperf-like command line programs instead of providing iperf as a part of a larger application. In Android the entire application is kept within the same process, including our background workers and the foreground UI. Calling exit doesn’t just stop the background worker, but terminates the entire process/application! We wouldn’t be able to stop our test the same way, and would have to figure out something else.

Manually setting the done flag

Deep into the libiperf source at this point, we noticed a boolean done flag being checked in the client loop, but no high-level API to set it outside iperf_got_sigend(). Setting it to true appeared to cause the client to break out of its main run loop and return, exactly the behavior we wanted! Setting it directly required extending the libiperf API, and we’re planning to upstream these changes for future integrators.

But there is no magic: since the done flag is only checked at each iteration of the run loop, we still have to work around the blocking select at its heart. For uplink tests the loop runs with every transmitted packet, providing plenty of responsiveness. In downlink, though, the loop only runs when a packet is received, which could stall if there is a problem in the network. Fortunately the iperf authors anticipated this problem, and provide a fallback timeout!

Unfortunately, this timeout defaults to (a very conservative) two minutes : /

Configuring the timeout was also not exposed in the high-level API yet, but we were able to add an extension easily along with our earlier efforts. We chose a value of 5 seconds, which should be long enough for the types of connections we hope to measure, but still responsive enough for acceptable UX.
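To make the interaction between the done flag and the select() timeout concrete, here is a minimal sketch of the pattern. It is not libiperf's code: run_receive_loop, demo_stop_idle_loop, and the use of an idle pipe to stand in for a stalled downlink are all hypothetical names and assumptions made for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical stand-in for a blocking client loop; this is not libiperf code.
// Returns the number of bytes read before `done` was observed (or EOF/error).
long run_receive_loop(int fd, const std::atomic<bool>& done, int timeout_ms) {
    long total = 0;
    char buf[4096];
    while (!done.load()) {
        fd_set read_set;
        FD_ZERO(&read_set);
        FD_SET(fd, &read_set);
        // select() may modify the timeout, so rebuild it on every iteration.
        timeval tv{};
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        int ready = select(fd + 1, &read_set, nullptr, nullptr, &tv);
        if (ready < 0) break;      // error (EINTR is not retried, for brevity)
        if (ready == 0) continue;  // timeout: loop back and re-check `done`
        ssize_t n = read(fd, buf, sizeof buf);
        if (n <= 0) break;         // EOF or error
        total += n;
    }
    return total;
}

// Usage sketch: nothing is ever written to the pipe, so only the timeout
// lets the loop notice that `done` was set from another thread.
long demo_stop_idle_loop() {
    int fds[2];
    if (pipe(fds) != 0) return -1;
    std::atomic<bool> done{false};
    long received = -1;
    std::thread worker([&] { received = run_receive_loop(fds[0], done, 50); });
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    done.store(true);  // observed within roughly one timeout period
    worker.join();
    close(fds[0]);
    close(fds[1]);
    return received;
}
```

With no incoming data, the worst-case delay between setting the flag and the loop returning is about one timeout period, which is why a two minute default is painful and a few seconds is tolerable.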

This approach of polling the done flag is eerily similar to just checking isStopped() (the high-level Java function provided by Worker), but has the advantage of not needing to make a heavyweight JNI call for each received packet during the test, and not needing to add JNI code directly into libiperf. Unfortunately though, this approach requires us to reason about threading and concurrent access to a value in C/C++, something that is not needed in the regular single-threaded, signal-driven iperf client!

Setting done from a different thread

The done flag in particular is polled from the main worker thread, which blocks on select(), but will need to be set from a different worker thread provided by the app’s WorkManager running onStopped(). To make matters even more complicated, done is part of libiperf’s internal iperf_test struct, which is allocated at the beginning of the test and freed at the end of the test. We need to protect access to the memory in this struct to ensure it is still valid when accessing it from onStopped().

Since we only run one test at a time (by design, to prevent the tests from interfering), it is relatively straightforward to make a singleton holding a pointer to the current test and a mutex protecting access to the test struct. If multiple concurrent tests were needed, this approach could be extended to implementing a store of the currently running tests, with some kind of ID system to get access to the correct test instance.

Tracking the test instance across threads with a singleton test manager is not particularly elegant or optimized for maximum performance, but is robust, performant enough, and safe.
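A minimal sketch of what such a singleton might look like is below. It is not our actual implementation: FakeTest is a hypothetical stand-in for libiperf's iperf_test struct, TestManager and its method names are made up for illustration, and making the flag a std::atomic<bool> is an assumption of the sketch so that the run loop can read it without holding the mutex.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for libiperf's iperf_test; only the flag matters here.
struct FakeTest {
    std::atomic<bool> done{false};
};

// Singleton holding a pointer to the one test allowed to run at a time.
class TestManager {
public:
    static TestManager& instance() {
        static TestManager manager;  // thread-safe initialization since C++11
        return manager;
    }

    // Called by the thread about to run a test.
    // Returns false if another test is already registered.
    bool register_test(FakeTest* test) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (current_ != nullptr) return false;
        current_ = test;
        return true;
    }

    // Called by the running thread BEFORE it frees the test struct.
    void unregister_test() {
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = nullptr;
    }

    // Called from the stop thread. Returns true if a live test was signalled.
    bool request_stop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (current_ == nullptr) return false;
        current_->done.store(true);
        return true;
    }

private:
    TestManager() = default;
    std::mutex mutex_;
    FakeTest* current_ = nullptr;
};
```

Because unregister_test() takes the same mutex as request_stop(), the stop thread can never dereference the pointer after the running thread has cleared it, as long as the struct is only freed after unregistering.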

So we finally have it all. A means to run iperf tests in the background of our application, and a means to stop them on demand from the foreground UI. In testing I start some tests, and stop some tests, and start a test, and then… nothing happens. The UI is responsive, but no traffic is being generated for the latest tests.

Learning to better understand the Worker contract

It turns out that this is where things get even more complicated. After adding extensive debug logging and an hour or so of narrowing down on where things get hung up, it becomes apparent that the mutex protecting the singleton test manager is locked, preventing new tests from starting without a full restart of the app. But being modern C++ programmers we were using std::lock_guard, an RAII-based wrapper for scoped mutex ownership. This should be impossible!

After much head scratching and many dead ends, we generate a new hypothesis: what if the worker thread is somehow “killed” uncleanly, leaving the process memory in an invalid state? We add more debug telemetry, building an RAII function entry/exit logger to validate our assumptions and verify that both the main worker thread and stop signal thread are returning correctly. After around a dozen tests, it becomes clear that our logging in the cleanup of the JNI function consistently appears, but the logs at the end of the main function do not!
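An RAII entry/exit logger of this kind can be very small. The sketch below is illustrative rather than our actual code: ScopeLogger and run_test_stub are hypothetical names, and the log lines go into a std::vector so the example runs anywhere, whereas on Android they would more likely be sent to logcat.

```cpp
#include <string>
#include <utility>
#include <vector>

// Logs "enter <name>" on construction and "exit <name>" on destruction,
// so every normal return path out of the enclosing scope is recorded.
class ScopeLogger {
public:
    ScopeLogger(std::string name, std::vector<std::string>& sink)
        : name_(std::move(name)), sink_(sink) {
        sink_.push_back("enter " + name_);
    }
    ~ScopeLogger() { sink_.push_back("exit " + name_); }

    ScopeLogger(const ScopeLogger&) = delete;
    ScopeLogger& operator=(const ScopeLogger&) = delete;

private:
    std::string name_;
    std::vector<std::string>& sink_;
};

// Hypothetical function under observation: both return paths log an exit.
int run_test_stub(std::vector<std::string>& log, bool fail_early) {
    ScopeLogger scope("run_test_stub", log);
    if (fail_early) return -1;  // destructor still runs here
    return 0;
}
```

If an "enter" line shows up without a matching "exit", the function never unwound normally, which is exactly the symptom we were hunting for.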

At this point, being neither an expert in Java nor Android’s latest Java runtime, it’s extremely counter-intuitive to me why background worker threads would be killed “mid-function” while the parent process (the application) remains running. On a hunch, and reading between the lines of the onStopped() method documentation, we add a CountDownLatch to our Worker implementation to count down when the native run_test function has completed, and hold our onStopped() implementation from returning until the latch is triggered. Lo and behold, our main function exit logging now consistently appears, and the mutex returns to expected operation.
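The handshake itself is simple. Our Worker is written against the Java API and uses java.util.concurrent.CountDownLatch, but the same idea can be modelled in plain C++ to show the ordering it enforces. Everything below is a hypothetical model, not Android or libiperf code: OneShotLatch plays the role of a CountDownLatch with a count of one, and WorkerModel's do_work() and on_stopped() stand in for the real Worker methods.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-shot latch, playing the role of Java's CountDownLatch(1).
class OneShotLatch {
public:
    void count_down() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            released_ = true;
        }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return released_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool released_ = false;
};

// Hypothetical model of the Worker: on_stopped() does not return until
// do_work() has finished and released its resources.
struct WorkerModel {
    std::atomic<bool> done{false};
    std::atomic<bool> cleaned_up{false};
    OneShotLatch finished;

    void do_work() {
        while (!done.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
        cleaned_up.store(true);  // stands in for RAII destructors running
        finished.count_down();   // only now may on_stopped() return
    }

    void on_stopped() {
        done.store(true);  // ask the run loop to exit
        finished.wait();   // hold the callback until cleanup has happened
    }
};

// Usage sketch: returns true if cleanup was complete when on_stopped() returned.
bool demo_stop_waits_for_cleanup() {
    WorkerModel worker_model;
    std::thread worker([&] { worker_model.do_work(); });
    worker_model.on_stopped();
    bool cleaned = worker_model.cleaned_up.load();
    worker.join();
    return cleaned;
}
```

The important property is only the ordering: the stop callback cannot return before the work function has finished its cleanup.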

The documentation for onStopped() in the managing work guide says:

WorkManager invokes ListenableWorker.onStopped() as soon as your Worker has been stopped. Override this method to close any resources you may be holding onto.

This may be an implementation detail, or a quirk of the interaction between the JNI and WorkManager, but it is something to be aware of if you find yourself using RAII principles to manage long-lived native state in a JNI function. At the point onStopped() returns, it seems all resources must be released, including resources managed with RAII in your main thread!

Epilogue: gracefully handling errors

After all this, we finally were able to complete our integration of the iperf tool into our high-level Android testing app. We did encounter one more issue: libiperf 3.10.1 handled test errors (like an unreachable server or timeout) with an internal call to exit() rather than returning an error code from the run_client() function! This appears to have been accidental, and was concurrently fixed by Joakim Sørensen.

Hopefully as more folks use libiperf externally we can continue to improve the stability and test coverage of this valuable open-source resource!

About

Currently a grad student at the University of Washington in the ICTD Lab...