NSJSONSerialization silently drops U+FEFF from JSON string content — keys merge, characters vanish

Question

Created 1w

TL;DR: NSJSONSerialization deletes U+FEFF (ZERO WIDTH NO-BREAK SPACE / BOM) from anywhere inside parsed JSON strings — not just a leading document BOM, and even when written as the \uFEFF escape (it's removed after unescaping). Distinct strings/keys silently collapse onto their U+FEFF-less twins. If you're seeing JSON keys mysteriously merge or a character disappear from a parsed value, this is probably why. It is not your code. Workaround and exhaustive scope below.

The workaround

Two options, depending on how attached you are to Foundation:
A. Stay on NSJSONSerialization — swap U+FEFF for a private-use sentinel before parsing, restore after. You must handle both the raw bytes and the \uFEFF escape (the escape bites too, since deletion happens post-unescape):

// 1. Pick a private-use scalar you've verified is absent from the source text.
// 2. Replace every in-content U+FEFF (raw char AND \uFEFF escape) with it.
// 3. Parse. NSJSONSerialization preserves the sentinel.
// 4. Recursively restore the sentinel -> U+FEFF in the parsed tree.
static id RestoreSentinel(id o, NSString *s, NSString *bom) {
    if ([o isKindOfClass:NSString.class])
        return [o rangeOfString:s].location == NSNotFound ? o
             : [o stringByReplacingOccurrencesOfString:s withString:bom];
    if ([o isKindOfClass:NSArray.class]) {
        NSMutableArray *a = [NSMutableArray arrayWithCapacity:[o count]];
        for (id e in o) [a addObject:RestoreSentinel(e, s, bom)];
        return a;
    }
    if ([o isKindOfClass:NSDictionary.class]) {
        NSMutableDictionary *d = [NSMutableDictionary dictionary];
        [o enumerateKeysAndObjectsUsingBlock:^(id k, id v, BOOL *stop) {
            d[RestoreSentinel(k, s, bom)] = RestoreSentinel(v, s, bom);
        }];
        return d;
    }
    return o;
}

Swap the escape form with a backslash-parity-aware regex so that \\uFEFF (an escaped backslash followed by the literal text uFEFF) is left intact:

(?<!\\)((?:\\\\)*)\\u[Ff][Ee][Ff][Ff]   ->   $1<sentinel>
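Steps 1 and 2 can also be done in one left-to-right pass without a regex. The following is a minimal sketch in plain C, so it compiles as-is inside an Objective-C target. The function name `swap_feff_for_sentinel` and the choice of U+E000 as the sentinel are assumptions made for illustration; they are not part of the workaround as originally written. Consuming every backslash escape as a two-byte pair gives the same parity behaviour as the regex above, and a BOM at offset 0 is copied through untouched because it sits outside any string.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumed sentinel: U+E000 (private use) encoded as UTF-8.
 * The caller must first verify that it does not occur in the input. */
static const unsigned char kSentinel[3] = {0xEE, 0x80, 0x80};
static const unsigned char kBOM[3]      = {0xEF, 0xBB, 0xBF};

/* True if the four bytes at p spell "FEFF" in any letter case. */
static int is_feff_hex(const unsigned char *p) {
    return tolower(p[0]) == 'f' && tolower(p[1]) == 'e' &&
           tolower(p[2]) == 'f' && tolower(p[3]) == 'f';
}

/* Returns a malloc'ed, NUL-terminated copy of the JSON text in which every
 * in-content U+FEFF (raw EF BB BF or a \uFEFF escape) is replaced by the
 * sentinel. Returns NULL on allocation failure. The caller frees the result. */
char *swap_feff_for_sentinel(const char *json, size_t len, size_t *out_len) {
    const unsigned char *in = (const unsigned char *)json;
    unsigned char *out = malloc(len + 1); /* the output never grows */
    size_t i = 0, o = 0;
    if (out == NULL) return NULL;

    /* A BOM at offset 0 is document framing, not content: copy it through. */
    if (len >= 3 && memcmp(in, kBOM, 3) == 0) {
        memcpy(out, in, 3);
        i = o = 3;
    }
    while (i < len) {
        if (in[i] == '\\' && i + 1 < len) {
            if (in[i + 1] == 'u' && len - i >= 6 && is_feff_hex(in + i + 2)) {
                /* A real \uFEFF escape: emit the sentinel instead. */
                memcpy(out + o, kSentinel, 3);
                o += 3;
                i += 6;
            } else {
                /* Any other escape, including \\ : copy both bytes as a
                 * pair so that a following "uFEFF" stays literal text. */
                out[o++] = in[i++];
                out[o++] = in[i++];
            }
        } else if (len - i >= 3 && memcmp(in + i, kBOM, 3) == 0) {
            /* Raw U+FEFF bytes inside the document. */
            memcpy(out + o, kSentinel, 3);
            o += 3;
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    if (out_len != NULL) *out_len = o;
    return (char *)out;
}
```

The returned buffer can be wrapped in an NSData and handed to NSJSONSerialization, after which RestoreSentinel puts U+FEFF back into the parsed tree.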

B. Don't use Foundation for this file: a spec-compliant C parser like yyjson preserves U+FEFF and is faster on large files. (This is the route swift-transformers took for tokenizer.json.)

Minimal repro

// Object keys collapse:
NSData *d1 = [@"{\"\\uFEFF#\":1,\"#\":2}" dataUsingEncoding:NSUTF8StringEncoding];
id o1 = [NSJSONSerialization JSONObjectWithData:d1 options:0 error:nil];
// EXPECTED: 2 keys ("\uFEFF#" and "#");  ACTUAL: 1 key ("#") — \uFEFF stripped, keys merged

// String content lost:
NSData *d2 = [@"[\"\\uFEFF\"]" dataUsingEncoding:NSUTF8StringEncoding];
id o2 = [NSJSONSerialization JSONObjectWithData:d2 options:0 error:nil];
// EXPECTED: ["\uFEFF"] (one code point);  ACTUAL: [""] (empty string)

Same outcome whether U+FEFF arrives as raw EF BB BF bytes or the \uFEFF escape.

Why this is a bug, not a quirk

Per RFC 8259 §7, a JSON string is a sequence of Unicode code points; U+FEFF is ordinary content and doesn't require escaping. Tolerating a leading document BOM is fine — deleting U+FEFF from string content is not. U+FEFF leads a double life (BOM signal vs. ZERO WIDTH NO-BREAK SPACE character); Foundation treats every occurrence as a stray BOM to scrub.
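To make the distinction concrete: the only position where a parser can reasonably treat EF BB BF as framing is offset 0 of the document. A minimal sketch in plain C of that narrow check follows; the function name `leading_bom_length` is hypothetical, and this illustrates the intended behaviour rather than describing how Foundation is implemented.

```c
#include <stddef.h>
#include <string.h>

/* Returns how many bytes to skip at the start of a UTF-8 JSON document:
 * 3 if it begins with the BOM bytes EF BB BF, otherwise 0.
 * Occurrences anywhere else are string content and must be kept. */
size_t leading_bom_length(const unsigned char *buf, size_t len) {
    static const unsigned char bom[3] = {0xEF, 0xBB, 0xBF};
    return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
}
```

A parser that applies this once, before tokenizing, tolerates BOM-prefixed files without touching U+FEFF inside strings.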

Scope — exhaustive, not anecdotal

I swept all 1,112,064 valid Unicode scalars (U+0000–U+10FFFF minus surrogates) through a parse round-trip, in both the \uXXXX-escape and raw-UTF-8 forms:

U+FEFF is the only scalar altered. Every other scalar round-trips byte-identically, including other invisible and no-break characters (U+200B, U+2060, U+00A0), which all survive.
No Unicode normalization occurs (NFD stays decomposed; combining sequences and compatibility characters are preserved).

So this looks like a BOM-stripping heuristic applied too broadly to string content: narrow and fixable, not general mangling.
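For anyone who wants to reproduce the sweep, the inputs can be generated along these lines. This is a sketch in plain C with hypothetical helper names (`utf8_encode`, `json_escape`, `count_scalars`), not the harness that was actually used. Note that scalars above U+FFFF need a surrogate-pair escape, and that control characters below U+0020, the quotation mark and the backslash cannot appear raw inside a JSON string, so a sweep has to use the escape form for those; how the original sweep handled them is not stated here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* True for Unicode scalar values: U+0000..U+10FFFF minus the surrogates. */
static int is_scalar(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encodes a scalar as UTF-8 into out; returns the number of bytes written. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Writes the JSON \uXXXX escape for a scalar (a surrogate pair above
 * U+FFFF) into out, NUL-terminated; returns the escape's length. */
size_t json_escape(uint32_t cp, char out[13]) {
    if (cp < 0x10000)
        return (size_t)snprintf(out, 13, "\\u%04X", (unsigned)cp);
    cp -= 0x10000;
    return (size_t)snprintf(out, 13, "\\u%04X\\u%04X",
                            (unsigned)(0xD800 + (cp >> 10)),
                            (unsigned)(0xDC00 + (cp & 0x3FF)));
}

/* Counts the scalars a full sweep visits. */
uint32_t count_scalars(void) {
    uint32_t n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_scalar(cp)) n++;
    return n;
}
```

Each scalar is then wrapped as a one-element array, parsed, and the resulting string compared byte-for-byte with the UTF-8 encoding of the input.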

Why it's nasty in practice

U+FEFF is zero-width, so the corruption is invisible — no trace in a diff or editor. Real-world hit: ML tokenizer vocabularies (e.g. Google's Gemma) legitimately contain U+FEFF-bearing tokens; loading tokenizer.json via NSJSONSerialization collapses those keys and assigns wrong token IDs, with zero visible symptom until output is subtly wrong.
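Because nothing shows up on screen, the practical way to find out whether a given file is exposed is to scan its bytes before parsing. A minimal sketch in plain C follows, with a hypothetical `count_feff` helper and `feff_counts` struct; it skips a leading document BOM and counts raw and escaped in-content occurrences separately.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type: in-content U+FEFF occurrences by form. */
typedef struct {
    size_t raw;     /* EF BB BF byte sequences after offset 0 */
    size_t escaped; /* \uFEFF escapes (any hex letter case)   */
} feff_counts;

static int is_bom_at(const unsigned char *p, size_t remaining) {
    return remaining >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF;
}

feff_counts count_feff(const char *json, size_t len) {
    const unsigned char *p = (const unsigned char *)json;
    feff_counts c = {0, 0};
    size_t i = is_bom_at(p, len) ? 3 : 0; /* ignore a leading document BOM */

    while (i < len) {
        if (p[i] == '\\' && i + 1 < len) {
            if (p[i + 1] == 'u' && len - i >= 6 &&
                tolower(p[i + 2]) == 'f' && tolower(p[i + 3]) == 'e' &&
                tolower(p[i + 4]) == 'f' && tolower(p[i + 5]) == 'f') {
                c.escaped++;
                i += 6;
            } else {
                i += 2; /* skip other escapes as a pair, so \\uFEFF is not counted */
            }
        } else if (is_bom_at(p + i, len - i)) {
            c.raw++;
            i += 3;
        } else {
            i++;
        }
    }
    return c;
}
```

If either count is non-zero, the file will not survive a trip through NSJSONSerialization unchanged.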
Filed as FB23271905. Please dupe it if this has bitten you; more duplicates are what get it triaged.

Answer 1

DTS Engineer OP

Apple

5d

Reading this, and your bug report, it’s not clear whether this is a regression. Has this changed in recent releases? Or has it been this way as far back as you’re able to test?

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Answer 2

FossilCoder OP

5d

Hi Quinn,

Thanks so much for looking at this — really appreciate it!

We stumbled onto the bug while building a local LLM inference engine for macOS. We load Gemma-4's tokenizer vocabulary from a tokenizer.json file, and some of the vocab keys contain U+FEFF — for example "<U+FEFF>#". When we parsed the file through NSJSONSerialization, those keys silently collapsed onto their BOM-less twins, overwriting the correct token IDs. The model then produced garbage output and it took a while to track down why.

We've worked around it on our end with a sentinel-swap before parsing, so we're not blocked — we're reporting it because it seems like the kind of silent data corruption that could surprise other developers and be very hard to diagnose.

As for whether it's a regression: you're in a much better position to answer that than we are since you have the source history. We'd love to know either way.

Happy to provide any additional repro material if it helps.

Best, Kolja

Answer 3

DTS Engineer OP

Apple

4d

Recommended

As for whether it's a regression: you're in a much better position to answer that than we are …

You’d be surprised. Tracking down the origin of specific behaviours like this is challenging, especially for Foundation, which is entangled with the Swift open source efforts.

Regardless, if this is new code then I’d be very surprised to see your current bug get traction, primarily because of the compatibility risk involved. Still, I’m not the one you have to convince (-:

I do want to point out an alternative path you could take, namely to engage with the Foundation folks directly via the Swift open source efforts. They’re currently in the process of building a new substrate for JSON serialisation and deserialisation — see “New Codable” prototype available for feedback on the Swift Forums — and it’s easier to imagine a change like this being accepted there.

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

Answer 4

FossilCoder OP

4d

Thanks for the pointer. We'll engage with the Swift Forums thread for the Swift side. For our Objective-C project we've decided to sidestep the issue entirely by switching to yyjson, a spec-compliant C parser. It's MIT-licensed, ships as a single header plus one source file, and is fully testable, and it just does the right thing with U+FEFF. Probably the cleanest resolution for native code anyway.

Thanks again for your time and for the Swift Forums tip!

Best, Kolja