Connectivity loss caused by Content Filter deadlock

We are using a Content Filter Network Extension to collect telemetry on the network activity of enterprise iOS devices. The filter itself does not block any connections.

We encountered an issue where our Content Filter got stuck in a deadlock in the startFilter method of the NEFilterControlProvider. This resulted in a crash report where we see 64 threads stuck in the startFilter call. While the content filter was stuck in a deadlock, the device network connectivity was lost.

We fixed the deadlock, which came from our logger. However, we would like a better understanding of the following points:

  1. What are the critical paths where a Content Filter can have a device wide impact on network connectivity?
  2. What is the behavior of the OS when the Content Filter is unresponsive (e.g. in startFilter, handle(Report), handleNewFlow)? Will it try to start the filter again? Force kill it?
  3. We saw in our crash reports that startFilter was called multiple times, whereas we expected it to be called only on vendor configuration changes. What is the lifecycle of the filter control provider and filter data provider? When are the different methods like startFilter called?

We would like our Content Filter to never cause disruptions, and to implement circuit breaker behavior in case any issue occurs. Do you have any recommendations on how to achieve this?

Answered by DTS Engineer in 895509022

Content filters are definitely in a privileged position, and it’s absolutely possible that a borked content filter will bork the device as a whole.

The specific pathology you hit, a thread explosion affecting Dispatch, is particularly bad because lots of system code relies on Dispatch to make progress.

1- What are the critical paths where a Content Filter can have a device wide impact on network connectivity?

I don’t see any way to reasonably enumerate all possible ways that a content filter can bork the system. The issue you hit was about the filter being unresponsive, but that’s just one possibility. For example, your content filter could accept a flow, return a pause() verdict, and then never resume the flow. That’ll bork the system, but there’s no good way for the system to detect that your filter has failed, rather than it just taking a long time to resolve the flow.
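
To make the pause() case concrete: one defensive pattern is to pair every paused flow with a deadline, so the flow is always resumed exactly once even if your own logic stalls. Here’s a minimal sketch in plain Swift. ResumeGuard, its parameters, and the idea of a fallback decision are hypothetical names of mine, not NetworkExtension API. In a real data provider, the resume closure is where you would call resumeFlow(_:with:).

```swift
import Foundation
import Dispatch

/// Hypothetical helper (not part of NetworkExtension): guarantees that `resume`
/// runs exactly once, with either a real decision or a fallback at the deadline.
final class ResumeGuard<Decision: Sendable>: @unchecked Sendable {
    private let lock = NSLock()
    private var resumed = false
    private let resume: @Sendable (Decision) -> Void

    init(deadline: DispatchTimeInterval,
         fallback: Decision,
         queue: DispatchQueue,
         resume: @escaping @Sendable (Decision) -> Void) {
        self.resume = resume
        // Arm the deadline. If nobody has finished by then, use the fallback.
        queue.asyncAfter(deadline: .now() + deadline) {
            _ = self.finish(with: fallback)
        }
    }

    /// Returns true if this call delivered the decision, false if the flow
    /// had already been resumed (by an earlier call or by the deadline).
    @discardableResult
    func finish(with decision: Decision) -> Bool {
        lock.lock()
        let isFirst = !resumed
        resumed = true
        lock.unlock()
        // Call out only after releasing the lock, so `resume` can't deadlock on it.
        if isFirst { resume(decision) }
        return isFirst
    }
}
```

The intended use is: create the guard in handleNewFlow, return the pause() verdict, and call finish(with:) once your analysis completes. Whether the fallback should allow or drop the flow is a policy decision. For a telemetry-only filter, allowing is presumably the safer default.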

2- What is the behavior of the OS when the Content Filter is unresponsive … ?

I’m not 100% sure. I’m gonna do some digging and get back to you.

3- We saw that startFilter was called multiple times

Right. This can happen for a variety of reasons. The only thing you can rely on is that the system shouldn’t start the same filter twice. And “same” has two meanings in this case:

  • Within a process, it shouldn’t start the same provider object while it’s already started.
  • Across processes, it shouldn’t start an instance of your provider object in process A while another instance is running in process B.

The lifecycle you typically see is:

  1. The system starts a process to run your appex.
  2. Within that, it instantiates your provider object.
  3. And then starts it.
  4. Then one of two things happens:
    • The process terminates unexpectedly, in which case the system starts again from step 1.
    • The provider stops cleanly, in which case the system terminates the appex process. This may or may not run the provider object’s deinitialiser.

However, this is typical, not guaranteed. It’s possible for the system to instantiate a second instance of your provider object in the same process. This is rare, but possible, with appex packaging. And it’s de rigueur for sysex packaging.

I don’t think the system will ever start the same instance twice (so, it won’t do something like init, start, stop, start, stop, deinit) but I can’t see anything in the API contract to prohibit that.
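
Given that uncertainty, it seems safest to keep state per provider instance rather than in globals, and to make start and stop idempotent. The following is a minimal sketch of that idea in plain Swift. LifecycleGate, begin(), and end() are hypothetical names, not NetworkExtension API, and the sketch assumes you call begin() from startFilter and end() from stopFilter.

```swift
import Foundation

/// Hypothetical per-instance gate (not NetworkExtension API) that makes
/// start and stop idempotent and tolerates start, stop, start on one instance.
final class LifecycleGate: @unchecked Sendable {
    private let lock = NSLock()
    private var running = false

    /// Returns true if the caller should set up its resources now,
    /// false if this instance is already running.
    func begin() -> Bool {
        lock.lock(); defer { lock.unlock() }
        if running { return false }
        running = true
        return true
    }

    /// Returns true if the caller should tear down its resources now,
    /// false if this instance was not running.
    func end() -> Bool {
        lock.lock(); defer { lock.unlock() }
        if !running { return false }
        running = false
        return true
    }
}
```

Each provider object would own one gate as an instance property. That way a second instance in the same process gets its own state, and a repeated start on the same instance becomes a no-op instead of a double setup.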

we expected it to be called only on vendor configuration changes.

A simple configuration change shouldn’t stop your provider. Rather, the system updates the provider object’s filterConfiguration property. As explained in the docs, providers are expected to monitor that via KVO.
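
Here’s roughly what that KVO monitoring can look like. Treat it as a sketch, not a reference implementation: the class name and the apply(_:) helper are hypothetical, and it assumes your settings arrive in the configuration’s vendorConfiguration dictionary.

```swift
import NetworkExtension

final class TelemetryFilterControlProvider: NEFilterControlProvider {
    // Holding the token keeps the observation alive. Releasing it ends it.
    private var configurationObservation: NSKeyValueObservation?

    override func startFilter(completionHandler: @escaping (Error?) -> Void) {
        // `.initial` fires once right away, so the same code path handles the
        // starting configuration and every later change.
        configurationObservation = observe(\.filterConfiguration, options: [.initial, .new]) { provider, _ in
            provider.apply(provider.filterConfiguration)
        }
        // Complete promptly. Don't block here on logging, locks, or I/O.
        completionHandler(nil)
    }

    override func stopFilter(with reason: NEProviderStopReason,
                             completionHandler: @escaping () -> Void) {
        configurationObservation = nil
        completionHandler()
    }

    // Hypothetical helper: interpret the vendor settings however your product needs.
    private func apply(_ configuration: NEFilterProviderConfiguration) {
        let vendorSettings = configuration.vendorConfiguration ?? [:]
        _ = vendorSettings // Placeholder: update telemetry settings here.
    }
}
```

The point of the sketch is that a configuration change is handled inside the observation callback, without waiting for another startFilter call.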

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"
