Closed Bug 1907726 (CVE-2024-9399) Opened 10 months ago Closed 9 months ago

Webtransport datagrams can crash UA

Categories

(Core :: Networking, defect, P2)

All
Unspecified
defect

Tracking

()

RESOLVED FIXED
131 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox-esr128 131+ fixed
firefox129 --- wontfix
firefox130 --- wontfix
firefox131 + fixed

People

(Reporter: marten.richter, Assigned: kershaw)

References

(Regression)

Details

(5 keywords, Whiteboard: [necko-triaged][necko-priotity-queue][adv-main131+][adv-esr128.3+])

Attachments

(4 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0

Steps to reproduce:

I am developing a node.js package for webtransport support.
Today, I tried my unit tests, and the unit test using Firefox as a client resulted in a crash in Firefox.
So, in principle, a harmful site can crash Firefox (I do not know if this is enough to satisfy the security flag).

You can find the server code here:
https://github.com/fails-components/webtransport/blob/a58e330ce680018aecb79477d2cdb3e3830134e9/main/test/fixtures/server.js#L194
In principle, a minimal example can be constructed using the package and the server code snippet.
The client code would be:
https://github.com/fails-components/webtransport/blob/a58e330ce680018aecb79477d2cdb3e3830134e9/main/test/datagrams.spec.js#L87

In principle, I can also construct a minimal example, but I have already identified the problem in the code, but I do not know, how one would like to patch it.

Actual results:

It crashed.

Expected results:

It should not crash.

I am not sure if a browser crash is enough to justify the security flag, so remove it at your discretion.
I will now type in the results of my debugging session.

The crash happens at
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1079
so it is a failed assertion, so it is not a very harmful issue except for the crash of the UA.

The mechanism is as follows.
Datagrams arrive before the session is established.
The datagrams leading to the crash were sent via:
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1106
but I have also seen datagrams arriving without a crash using the socket thread as target via
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1114 .

The crash happens as the packet is sent via:
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1106
before the retargeting via:
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1106
called at the SessionReady happened via line:
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/dom/webtransport/parent/WebTransportParent.cpp#480
.
However, the call of
https://searchfox.org/mozilla-central/rev/9dd09bebbff975e6b0fdabe0032753fcc3176e4f/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp#1079
is happening after the retarget happens so that the target and the current thread do not match, and the assertion leads to a saved crash.

It is clearly a typical race condition, which may only occur when the unit tests are executed on the same fast computer, but I am not certain.

I have a working build and debug environment and may write or test a patch. (I would expect that reproducing takes some time.)
Anyway, I do not know how to change it.
One way would be to repost the packet to the right thread if the threads do not match, or to prevent receiving datagrams if the session is not ready. This would be okay as datagrams are, per spec, unreliable, even though I do not find this desirable.
Anyway, if someone from the necko team has suggestions, I will gladly test them.

Just as comment,

Group: firefox-core-security → core-security
Component: Untriaged → Networking
Product: Firefox → Core
Hardware: Unspecified → All
Version: other → unspecified
Flags: needinfo?(kershaw)

What additional info would help?
1.) I can generate a crash and send the crash report to Mozilla.
2.) I can also create a minimal server example using node.js with my package (although I did not test if the timing also occurs in a server library release build and if it is server machine dependent). Up until now, it has always been a Debian docker container under Windows 11 using the debug build of the server library.

1.) would be easiest.

(In reply to marten.richter from comment #4)

What additional info would help?
1.) I can generate a crash and send the crash report to Mozilla.
2.) I can also create a minimal server example using node.js with my package (although I did not test if the timing also occurs in a server library release build and if it is server machine dependent). Up until now, it has always been a Debian docker container under Windows 11 using the debug build of the server library.

1.) would be easiest.

Option 2 would be the best for us.
It it takes too much time, could you record a http log?
In about:logging, please select Logging to file and set MOZ_LOG to timestamp,sync,nsHttp:5,WebTransport:5,nsWebTransport:5.
You could send the log file to [email protected].

Thanks.

Flags: needinfo?(kershaw) → needinfo?(marten.richter)
Group: core-security → network-core-security

Ok, as a first step, I have sent the quested logs to [email protected] .
I have also tagged a crash report with the bug number.

By the way, when I started the computer today, I was not immediately able to reproduce the bug; it seems to depend heavily on the current load.

I will now look if I can create a small proof of concept.

Flags: needinfo?(marten.richter)

I'm sorry. I had only an hour for writing a demonstration, but I am not getting the web transport session (from localhost context) to connect to the web transport server. So I think this will wait until the weekend. If not, the information already provided is sufficient. I am also not sure if it is reproducible at another machine, as it is very timing-dependent.

Thanks Marten for the logs.
Keeping this in priority review until Kershaw has had an initial look at the logs.

Flags: needinfo?(kershaw)
Whiteboard: [necko-triaged][necko-priotity-review]
Severity: -- → S3
Priority: -- → P2

from comment 2

so it is a failed assertion, so it is not a very harmful issue except for the crash of the UA.

That's a MOZ_ASSERT and not a MOZ_RELEASE_ASSERT: in a release build that will protect nothing. Just to be sure, you were running a debug build of Firefox, right? If not we've got additional mysteries beyond this crash!

The crash you mentioned in comment 6 appears to be bp-9b610391-3cf4-41b7-8032-1b3750240715. In that opt build execution continues past the WebTransportSessionProxy::NotifyDatagramReceived thread assertion you linked in comment 2 and then hits a MOZ_RELEASE_ASSERT several frames deep inside the listener->OnDatagramReceived(aData); call at the end of that function. That assertion is in IPC, and is also a "are we on the expected thread?" assertion. So this might be safe after all?

FYI you can find links to your submitted crashes through the "about:crashes" page, and links to those crashes would be helpful for future bug reports. Most folks working on Firefox wouldn't have the access needed to search the comment field

Status: UNCONFIRMED → NEW
Ever confirmed: true

from comment 2

so it is a failed assertion, so it is not a very harmful issue except for the crash of the UA.

That's a MOZ_ASSERT and not a MOZ_RELEASE_ASSERT: in a release build that will protect nothing. Just to be sure, you were running a debug build of Firefox, right? If not we've got additional mysteries beyond this crash!

I did not record this properly on my notes. The first occurrence was inside playwright in a Linux container. I do not know if it is a debug build.
I did not record whether I was using nightly or the release build for the bug report.
So, I create new reports:
nightly: bp-3200d1ee-b4de-43b0-9b1c-051ed0240717
128.0 release: I can not supply an ID, as it did not show up in the list (I tried it several times). However, I have tagged one of the crashes with the bug id.

The crash you mentioned in comment 6 appears to be bp-9b610391-3cf4-41b7-8032-1b3750240715. In that opt build execution continues past the WebTransportSessionProxy::NotifyDatagramReceived thread assertion you linked in comment 2 and then hits a MOZ_RELEASE_ASSERT several frames deep inside the listener->OnDatagramReceived(aData); call at the end of that function. That assertion is in IPC, and is also a "are we on the expected thread?" assertion. So this might be safe after all?

Yes, this is the correct id. It may be safe. But of course, it should not crash, so at least it is a bug.

FYI you can find links to your submitted crashes through the "about:crashes" page, and links to those crashes would be helpful for future bug reports. Most folks working on Firefox wouldn't have the access needed to search the comment field
Thanks, I did not know this. It is my first bug report regarding Firefox with a crash...

Just to clarify, the debugging I did was on the repo, with the usual ./mach build, so I assume it was a debug build?

A default Firefox dev build from ./mach is a debug build, yes.

I was now able to distill a minimal example for the server codes, but I can not reproduce the error as it uses the release server code, and it does show a different timing behavior, which is bad.
So the only way is for reproduction, that you also set up a development environment for my package.

I use Windows 11 and the development is inside a linux development container.
So please clone the repo from:
https://github.com/fails-components/webtransport
in a windows filesystem and start the development container from the development docker file in VS code in windows.
Then, inside the container, check out all submodules and install node dependencies. (It may take a while as it tries to build the package).

Inside transports/http3-quiche may be some build directories please delete them before proceeding (as this may be release builds).
Then run npm run build-debug in the root directory. (If something goes wrong, you can also rebuild-debug.)
Now a debug library should be available and be used automatically.

You can now start the server with:
node main/test/fixtures/server.js,
but before change the following lines to:

 server = new Http3Server({
      port: 8080,
      host: '0.0.0.0',
      /* port: 0,
      host: '127.0.0.1',*/
      secret: 'mysecret',
      cert: certificate.cert, // unclear if it is the correct format
      privKey: certificate.private
    })

Now, the server should be up and running.
Create a webserver for the following code and replace the CERT_HASH to the hash the server printed:

const SERVER_URL = 'https://127.0.0.1:8080'
const CERT_HASH =
  '50:E2:4E:30:9C:90:93:17:E4:82:FA:99:A6:F7:DC:C8:E1:00:35:5D:F2:02:52:BE:09:9F:5D:8D:A1:1D:67:56'

// import { WebTransportPolyfill } from '../lib/index.browser'
// @ts-nocheck
let client
console.log('FYI CERT_HASH', CERT_HASH)

const afterEach = async () => {
  if (client != null) {
    client.close()
    client = undefined
  }
}

// eslint-disable-next-line no-unused-vars
async function getReaderValue(readableStream) {
  const reader = readableStream.getReader()

  try {
    const { done, value } = await reader.read()

    if (done) {
      throw new Error('Stream ended')
    }

    if (!value) {
      throw new Error('Stream value was undefined')
    }

    return value
  } finally {
    reader.releaseLock()
  }
}

class TimeoutError extends Error {
  /**
   * @param {string} message
   */
  constructor(message) {
    super(message)

    this.name = this[Symbol.toStringTag] = 'TimeoutError'
  }
}

/**
 * @template T
 * @param {Promise<T>} promise
 * @param {number} timeout
 * @returns {Promise<T>}
 */
export async function pTimeout(promise, timeout) {
  let ref

  const value = await Promise.race([
    promise,
    new Promise((resolve, reject) => {
      ref = setTimeout(() => {
        reject(new TimeoutError('timeout'))
      }, timeout)
    })
  ])

  clearTimeout(ref)

  return value
}

/**
 * Read a stream contents to the end and return it
 *
 * @template T
 * @param {ReadableStream<T>} readable
 * @param {number} [expected]
 * @returns {T[]}
 */
export async function readStream(readable, expected) {
  const reader = readable.getReader()

  try {
    /** @type {T[]} */
    const output = []
    let outputlength = 0

    while (true) {
      const { done, value } = await reader.read()

      if (done) {
        break
      }

      if (value != null) {
        outputlength += value.length
        output.push(value)
      }

      if (expected != null && outputlength >= expected) {
        break
      }
    }

    return output
  } finally {
    reader.releaseLock()
  }
}

/**
 * @param {ReadableStream<Uint8Array>} stream
 * @returns {string}
 */
export async function readStringFromStream(stream) {
  const decoder = new TextDecoder()
  let output = ''

  for (const buf of await readStream(stream)) {
    output += decoder.decode(buf, {
      stream: true
    })
  }

  output += decoder.decode()

  return output
}

// eslint-disable-next-line no-unused-vars
function readCertHash(certHash) {
  return Uint8Array.from(`${certHash}`.split(':').map((i) => parseInt(i, 16)))
}
// Firefox
export async function main() {
  // client context - waits for the server to open a bidi stream then pipes it back to them
  // client context - pipes the server's datagrams back to them
  console.log('mark 1')
  // eslint-disable-next-line no-undef
  client = new WebTransport(`${SERVER_URL}/datagrams_server_send`, {
    serverCertificateHashes: [
      {
        algorithm: 'sha-256',
        value: readCertHash(CERT_HASH)
      }
    ]
  })
  console.log('mark 2')
  try {
    await client.ready
  } catch (error) {
    console.log('Browser error', error)
    throw error
  }
  console.log('mark 3')

  // datagram transport is unreliable, at least one message should make it through
  const expected = 1
  console.log('mark 4')

  const received = await pTimeout(
    readStream(client.datagrams.readable, expected),
    1000
  )
  console.log('mark 5')
  expect(received).to.have.lengthOf(expected)
  console.log('mark 6')

  await afterEach()
}

and add some code so that the main function is executed.
Please note that the code is extracted from @fails-components/web transports, and its licenses apply (it contains code from different contributors; see the git for the history of the testing code for credits).

I am sorry, that it does not work easier. (The solution I worked on, would have made all this automatic, but used an release build).

Good and bad news:
1.) Bad news: I could not reproduce the bug today for a while (hours) until it worked (probably background load).
2.) Good news: Once it worked also, my minimal example worked (as a Windows build). I will attach it. Just do a npm install followed by a npm start , then connect to the webserver at http://localhost:6060 and if your PC allows the right timing it will crash. But if it does not...., I also cannot help you.

Also, I have attached a debugger to the nightly release build. Yes, the crash happens inside IPC, but the reason is the same.
One way to fix it is to send datagrams (and maybe other wt specific stuff) always to the socketthread, or to buffer the messages before retargetto, but sending them to the nonsocket thread makes no sense and only results in a possible client DOS:

This is a npm code, that worked some times on windows 11. May also work on linux and Mac OS. Just run npm install and npm start. If the timing of the packets works out the UA will crash.

I did some additional debugging.
If I understand the code correctly, the code is constructed in the following way:
1.) If a webtransport command (such as datagram or opening a stream) is received before the session is opened, it will be placed inside "mPendingEvents", which will be processed after the session is opened in WebTransportSessionProxy::OnStopRequest(..), which puts the request on the socket thread, as it is called after retargetTo. Setting the variable " mStopRequestCalled" signals, so that the messages can be sent directly to the socket thread, which happens automatically using retargetTo.
2.) Such precautions are already in place for OnIncomingStreamAvailableInternal, but for Datagrams it comes too late in notifyDatagramReceived, where we may be on the wrong target.

Here is a tested patch:

diff --git a/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp b/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp
--- a/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp
+++ b/netwerk/protocol/webtransport/WebTransportSessionProxy.cpp
@@ -1076,7 +1076,6 @@ void WebTransportSessionProxy::NotifyDat
   nsCOMPtr<WebTransportSessionEventListener> listener;
   {
     MutexAutoLock lock(mMutex);
-    MOZ_ASSERT(mTarget->IsOnCurrentThread());

     if (!mStopRequestCalled) {
       CopyableTArray<uint8_t> copied(aData);
@@ -1090,6 +1089,20 @@ void WebTransportSessionProxy::NotifyDat
     if (mState != WebTransportSessionProxyState::ACTIVE || !mListener) {
       return;
     }
+
+    // It may happen, that the datagram is sent to the main thread and not the socket thread,
+    // and the session was started in the mean time,
+    // in this case, we have to redirect it to the socketthread
+     if (!mTarget->IsOnCurrentThread()) {
+       mTarget->Dispatch(NS_NewRunnableFunction(
+          "WebTransportSessionProxy::OnDatagramReceived",
+          [self = RefPtr{this}, data{std::move(aData)}]() mutable {
+            self->NotifyDatagramReceived(std::move(data));
+          }));
+        return;
+     }
+     MOZ_ASSERT(mTarget->IsOnCurrentThread());
+
     listener = mListener;
   }

That does not crash in my test setup. I did not submit it, as it is still a denial of service issue, and I do not know if you can do some harm to IPC., so I did not want to do it in public if I am not told to do it.
I would also recommend reviewing the other incoming webtransport events if similar patterns occur.

As I did not get feedback. I would post the patch (without referencing the bug) after the 26.8.2024, if there are no objections.

(In reply to marten.richter from comment #17)

As I did not get feedback. I would post the patch (without referencing the bug) after the 26.8.2024, if there are no objections.

Sorry for the late reply. I think your investigation in comment #16 is correct.
I'll submit a patch based on yours if you don't mind.
Thanks!

Flags: needinfo?(kershaw) → needinfo?(marten.richter)

Yes please go ahead, as I have currently no access to my dev maschine this would speed the process up.

Flags: needinfo?(marten.richter)
Assignee: nobody → kershaw
Whiteboard: [necko-triaged][necko-priotity-review] → [necko-triaged][necko-priotity-queue]

We're hitting a release assert, so maybe it isn't a big deal, but we're trying to send from the wrong thread so it is possible other things not guarded by the release assert are also confused.

While debugging I checked the mechanism, and most calls (so except the one with the test) were guarded. But the retargeting itself has a big risk of introducing similar problems in the future as you have to manually check all possible calling orders, but this would require a big structural change.

The crash occurs because WebTransportSessionProxy::OnDatagramReceivedInternal is called before WebTransportSessionProxy::OnStopRequest.
When this happens, WebTransportSessionProxy::mTarget is the main thread, so a task is dispatched to the main thread. This causes WebTransportSessionProxy::NotifyDatagramReceived to be called on the main thread.

If WebTransportSessionProxy::NotifyDatagramReceived is invoked while WebTransportSessionProxy::mStopRequestCalled is true, it can lead to OnDatagramReceived being called on the main thread (instead of the socket thread), resulting in a crash.

Pushed by kjang@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f7c00d9439f9 Make sure WebTransportSessionProxy::NotifyDatagramReceived is called after OnStopRequest, r=necko-reviewers,jesup
Group: network-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 9 months ago
Keywords: regression
Regressed by: 1820424
Resolution: --- → FIXED
Target Milestone: --- → 131 Branch

I have run my tests against current nightly and it seems to be fixed.
(The only failing test I have against firefox are reagrding streamlimits, but this is another story...)

QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-

Please nominate this for ESR128 approval when you get a chance. It grafts cleanly.

Flags: needinfo?(kershaw)

The crash occurs because WebTransportSessionProxy::OnDatagramReceivedInternal is called before WebTransportSessionProxy::OnStopRequest.
When this happens, WebTransportSessionProxy::mTarget is the main thread, so a task is dispatched to the main thread. This causes WebTransportSessionProxy::NotifyDatagramReceived to be called on the main thread.

If WebTransportSessionProxy::NotifyDatagramReceived is invoked while WebTransportSessionProxy::mStopRequestCalled is true, it can lead to OnDatagramReceived being called on the main thread (instead of the socket thread), resulting in a crash.

Original Revision: https://phabricator.services.mozilla.com/D220013

Attachment #9423893 - Flags: approval-mozilla-esr128?

esr128 Uplift Approval Request

  • User impact if declined: Possible crash.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: yes
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: N/A
  • Risk associated with taking this patch: Low risk
  • Explanation of risk level: The patch is straightforward and the logic of the code is not changed.
  • String changes made/needed: N/A
  • Is Android affected?: yes
Flags: needinfo?(kershaw)
Attachment #9423893 - Flags: approval-mozilla-esr128? → approval-mozilla-esr128+
Whiteboard: [necko-triaged][necko-priotity-queue] → [necko-triaged][necko-priotity-queue][adv-main131+]
Whiteboard: [necko-triaged][necko-priotity-queue][adv-main131+] → [necko-triaged][necko-priotity-queue][adv-main131+][adv-esr128.3+]
Attached file advisory.txt
Alias: CVE-2024-9399
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size:

OSZAR »