flypig.co.uk

List items

Items from the current list are shown below.

Blog

All items from February 2024

29 Feb 2024 : Day 171 #
After some considerably thought about the various options and after trying to fix the rendering pipeline with minimal changes, I've decided to change tack today. There's so much code that's been removed from the GLScreenBuffer.h and GLSCreenBuffer.cpp files that I don't see any way to resurrect the functionality without moving large parts of the removed code back in again.

Now, ideally it would be possible to add this to the EmbedLite code, rather than the gecko code. But as a first step I'm going to try to add it back in just as it was before. Following that I can then look at re-architecting it to minimise the changes needed to the gecko code itself. It would be a shame to end up with a patch that essentially just reverts a whole load of changes from upstream, but if that's where we end up, but with a working offscreen renderer, then maybe that's what we'll have to have.

Over the last few days I've already made a few changes to the code, but ironically they've only so far been to the EmbedLite portion of the code. But they're all also aimed at getting the SwapChain object working correctly. If I'm now going to reverse the upstream changes to this particular pipeline, then the SwapChain will be lost (it might get restored later; let's see). So I don't need the changes I made any more.
$ git diff
diff --git a/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
           b/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
index 34cff71f6e07..82cdf357f926 100644
--- a/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
+++ b/embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
@@ -109,6 +109,7 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
 
   GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->GetCompositor())->gl();
   NS_ENSURE_TRUE(context, );
+  bool initSwapChain = false;
 
   // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
   // See: https://phabricator.services.mozilla.com/D75055
@@ -125,6 +126,7 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
 
     SwapChain* swapChain = context->GetSwapChain();
     if (swapChain == nullptr) {
+      initSwapChain = true;
       swapChain = new SwapChain();
       new SwapChainPresenter(*swapChain);
       context->mSwapChain.reset(swapChain);
@@ -133,6 +135,13 @@ EmbedLiteCompositorBridgeParent::PrepareOffscreen()
     if (factory) {
       swapChain->Morph(std::move(factory));
     }
+
+    if (initSwapChain) {
+      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
+      if (!success) {
+          NS_WARNING("Failed to create SwapChain back buffer");
+      }
+    }
   }
 }
 
$ git status
On branch sailfishos-esr91
Your branch is up-to-date with 'origin/sailfishos-esr91'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)
        modified:   embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
        modified:   gecko-dev (untracked content)

no changes added to commit (use "git add" and/or "git commit -a")
$ git checkout embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
Updated 1 path from the index
As you can see, the changes weren't very major anyway. So I've started reconstructing the GlScreenBuffer code. It's actually quite extensive and there are a lot of edge cases related to the EGL code. It's going to take quite a few rounds of changes, failed compilations, following up on missing or changed code and then recompilations. Each of these takes quite a while, so I'm bracing myself for quite a long haul here.

I've made some changes, I'm going to set it to compile and see what errors come out. It's also time for my work day, so I'll return to this — and all of the errors that come out of it — later on this evening.

[...]

I'm back to looking at this again and it's time consider the errors that came out of the most recent partial build. They look a bit like this.
64:31.86 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:98:60: error:
    ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFactory’?
64:31.86    static UniquePtr<ReadBuffer> Create(GLContext* gl, const SurfaceCaps& caps,
64:31.86                                                             ^~~~~~~~~~~
64:31.86                                                             SurfaceFactory
64:31.88 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:99:45:
    error: ‘GLFormats’ does not name a type; did you mean ‘eFormData’?
64:31.88                                        const GLFormats& formats,
64:31.88                                              ^~~~~~~~~
64:31.88                                              eFormData
[...]
64:32.20 ${PROJECT}/gecko-dev/gfx/gl/GLContext.h:3537:54:
    error: ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFormat’?
64:32.20    bool InitOffscreen(const gfx::IntSize& size, const SurfaceCaps& caps);
64:32.20                                                       ^~~~~~~~~~~
64:32.20                                                       SurfaceFormat
64:32.23 ${PROJECT}/gecko-dev/gfx/gl/GLContext.h:3546:59:
    error: ‘SurfaceCaps’ does not name a type; did you mean ‘SurfaceFormat’?
64:32.24    bool CreateScreenBuffer(const gfx::IntSize& size, const SurfaceCaps& caps) {
64:32.24                                                            ^~~~~~~~~~~
64:32.24                                                            SurfaceFormat
[...]
There are, as you can see, many, many, many errors. For the rest of this evening my recipe will be this:
  1. Build code.
  2. Examine compile-time errors.
  3. Fix the first one or two erros shown.
  4. Go to step 1.
This would take an inordinate amount of time with a standard build, but thankfully I can do partial builds just of the gfx/gl code.
$ sfdk engine exec
$ sb2 -t SailfishOS-devel-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
Initially these partial builds were taking less than a second, with errors being hit almost immediately. Now after a couple of hours of fixing compile-time errors the builds are taking longer, maybe nearing ten seconds or so. That's how I'm judging success right now.

My guess is that it'll be a few days at least before I've got all of these errors resolved. I'll continue charting my progress with these diary entries of course, but they may be a little shorter than usual, since the last thing anyone wants to read about is this iterative build-check-fix churn.

At least it's quite fulfilling for me, gradually watching the errors seep away. It's mundane but fulfilling work, just a little laborious. Let's see how far I've got by the end of tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
28 Feb 2024 : Day 170 #
Today I woke up to discover a bad result. The build I started yesterday stalled about half way through. This does happen very occasionally, but honestly since I dropped down to just using a single process, it's barely happened at all. So that's more than a little annoying. Nevertheless I've woken up early today and it does at least mean that my first task of the day is an easy one: kick off the build once again.

So here goes... Once it's done, I'll give the changes I made yesterday a go to see whether they've fixed the segfault.

[...]

Finally the build completed, second time lucky it seems. So now SwapChain, SurfaceFactory and the SharedSurface back buffer should all be created respectively in this order. And this should also be the correct order. Let's find out.

Now there's still a crash, but it does at least get further than last time:
$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
Created LOG for EmbedLiteLayerManager
=============== Preparing offscreen rendering context ===============
CONSOLE message:
OpenGL compositor Initialized Succesfully.
Version: OpenGL ES 3.2 V@0502.0 (GIT@704ecd9a2b, Ib3f3e69395, 1609240670)
    (Date:12/29/20)
Vendor: Qualcomm
Renderer: Adreno (TM) 619
FBO Texture Target: TEXTURE_2D
JSScript: ContextMenuHandler.js loaded
JSScript: SelectionPrototype.js loaded
JSScript: SelectionHandler.js loaded
JSScript: SelectAsyncHelper.js loaded
JSScript: FormAssistant.js loaded
JSScript: InputMethodHandler.js loaded
EmbedHelper init called
Available locales: en-US, fi, ru
Frame script: embedhelper.js loaded
Segmentation fault
That's without the debugger. To find out where precisely it's crashing we can execute it again, but this time with the debugger attached:
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13568]
0x0000007ff110a378 in mozilla::gl::SwapChain::Size
    (this=this@entry=0x7ed81ce090)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff110a378 in mozilla::gl::SwapChain::Size
    (this=this@entry=0x7ed81ce090)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3667cc8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PresentOffscreenSurface (this=0x7fc4b41c20)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:199
#2  0x0000007ff3680fe0 in mozilla::embedlite::nsWindow::PostRender
    (this=0x7fc4c331e0, aContext=<optimized out>)
    at mobile/sailfishos/embedshared/nsWindow.cpp:248
#3  0x0000007ff2a664fc in mozilla::widget::InProcessCompositorWidget::PostRender
    (this=0x7fc4658990, aContext=0x7f17ae4848)
    at widget/InProcessCompositorWidget.cpp:60
#4  0x0000007ff1291074 in mozilla::layers::LayerManagerComposite::Render
    (this=this@entry=0x7ed81afa80, aInvalidRegion=..., aOpaqueRegion=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    Compositor.h:575
#5  0x0000007ff12914f0 in mozilla::layers::LayerManagerComposite::
    UpdateAndRender (this=this@entry=0x7ed81afa80)
    at gfx/layers/composite/LayerManagerComposite.cpp:657
#6  0x0000007ff12918a0 in mozilla::layers::LayerManagerComposite::
    EndTransaction (this=this@entry=0x7ed81afa80, aTimeStamp=..., 
    aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT)
    at gfx/layers/composite/LayerManagerComposite.cpp:572
#7  0x0000007ff12d303c in mozilla::layers::CompositorBridgeParent::
    CompositeToTarget (this=0x7fc4b41c20, aId=..., aTarget=0x0,
    aRect=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#8  0x0000007ff12b8784 in mozilla::layers::CompositorVsyncScheduler::Composite
    (this=0x7fc4d01e30, aVsyncEvent=...)
    at gfx/layers/ipc/CompositorVsyncScheduler.cpp:256
#9  0x0000007ff12b0bfc in mozilla::detail::RunnableMethodArguments
    <mozilla::VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler,
    void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent
    const&), StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=...,
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887
[...]
#21 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p mFrontBuffer
$1 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) 
Looking at the above, it seems that the back buffer isn't causing a crash any more. The problem now seems to be the front buffer. That's okay: that's progress! There are only two situations in which the front buffer gets set. First it happens if the SwapChainPresenter destructor is called. In this case the back buffer held by the presenter is moved into the front buffer, then the presenter's back buffer is set to null. Second it happens when the SwapChain::Swap() method is called. In this case the back buffer held by the presenter and the front buffer held by the swap chain are switched. In some sense, the Swap() method isn't really going to help us because if the front buffer is null beforehand, afterwards the back buffer will be null, which is also no good.

Checking the ESR 78 code, there is no mFrontBuffer variable, but there is an mFront which appears to be doing ostensibly the same thing. The mFront is only every used to switch the back buffer in to it, or to be accessed by EmbedLiteCompositorBridgeParent::GetPlatformImage(). In the latter case it's used, but not set.

So the arrangement isn't so dissimilar. Perhaps the main difference is that in ESR 78 there's no call to get the size of the front buffer as there is in ESR 91. Just as a reminder again: it's this size request that's causing the crash.

In ESR 78 the Swap() method is called from PublishFrame(), which is called from EmbedLiteCompositorBridgeParent::PresentOffscreenSurface(). It would be good to try to find out whether there's anything tying these together, to understand the sequencing, but the code is too convoluted for me to figure that out by hand.

So, instead, I'm going to look at the call to SwapChain::Size(). This is a call I added myself on top of the changes since ESR 91 and which doesn't have an immediately obvious equivalent call in ESR 78, so there must have been some reason why I added it.

Looking at the code in ESR 78 I can see that this is the reason I added this call:
  GLScreenBuffer* screen = context->Screen();
  MOZ_ASSERT(screen);

  if (screen->Size().IsEmpty() || !screen->PublishFrame(screen->Size())) {
    NS_ERROR("Failed to publish context frame");
  }
Compare that to the attempt I made to replicate the functionality in ESR 91:
  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  SwapChain* swapChain = context->GetSwapChain();
  MOZ_ASSERT(swapChain);

  const gfx::IntSize& size = swapChain->Size();
  if (size.IsEmpty() || !swapChain->PublishFrame(size)) {
    NS_ERROR("Failed to publish context frame");
  }
The obvious question is, what is context->Screen() returning in ESR 78 and where is it created. Unfortunately the answer is complex. It's returning the following member of GLContext:
  UniquePtr<GLScreenBuffer> mScreen;
[...]
  GLScreenBuffer* Screen() const { return mScreen.get(); }
This gets created from a call to GLContext::InitOffscreen(), like this:
Delete all breakpoints? (y or n) y
(gdb) b CreateScreenBufferImpl
Breakpoint 7 at 0x7fb8e837d8: file gfx/gl/GLContext.cpp, line 2120.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 7, mozilla::gl::GLContext::
    CreateScreenBufferImpl (this=this@entry=0x7eac109140, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2120
2120                                           const SurfaceCaps& caps) {
(gdb) bt
#0  mozilla::gl::GLContext::CreateScreenBufferImpl
    (this=this@entry=0x7eac109140, size=..., caps=...)
    at gfx/gl/GLContext.cpp:2120
#1  0x0000007fb8e838ec in mozilla::gl::GLContext::CreateScreenBuffer
    (caps=..., size=..., this=0x7eac109140)
    at gfx/gl/GLContext.h:3517
#2  mozilla::gl::GLContext::InitOffscreen (this=0x7eac109140, size=...,
    caps=...)
    at gfx/gl/GLContext.cpp:2578
#3  0x0000007fb8e83ac8 in mozilla::gl::GLContextProviderEGL::CreateOffscreen
    (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::
    REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7fa50ed378)
    at gfx/gl/GLContextProviderEGL.cpp:1443
#4  0x0000007fb8ee475c in mozilla::layers::CompositorOGL::CreateContext
    (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#5  mozilla::layers::CompositorOGL::CreateContext (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:223
#6  0x0000007fb8f053bc in mozilla::layers::CompositorOGL::Initialize
    (this=0x7eac003420, out_failureReason=0x7fa50ed730)
    at gfx/layers/opengl/CompositorOGL.cpp:374
#7  0x0000007fb8fdcf7c in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7f8c99d3f0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1534
#8  0x0000007fb8fe65e8 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7f8c99d3f0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1491
#9  0x0000007fb8fe6730 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7f8c99d3f0,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1587
#10 0x0000007fbb2e31b4 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7f8c99d3f0, aBackendHints=..., 
    aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:77
#11 0x0000007fb88c13d0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7f8c99d3f0, msg__=...)
    at PCompositorBridgeParent.cpp:1391
[...]
#27 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) 
Recall that the call to CreateOffscreen() at frame 3 is now a call to CreateHeadless(). And it looks like that's where things really start to diverge.

After thinking long and hard about this I don't think it's going to be possible to fit everything that's needed into the current SwapChain structure. So tomorrow I'm going to start putting back in all of the pieces from ESR 78 that were ripped out of ESR 91. This should be a much more tractable exercise than trying to reconstruct the functionality from scratch. Once I've got a working renderer I can then take the diff and try to fit as much of what's needed as possible into the swap chain structure.

But I'm not going to be able to do that today as it's time for me to head to bed. I'll pick this up in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
27 Feb 2024 : Day 169 #
I've been trying to get the swap chain to initialise correctly over the last few days. This is part of the code that I made large changes to early on in this process, before the build would fully compile. I'm now having to simultaneously unravel the changes I made, while at the same time finally figuring out what they're supposed to be doing. It's quite a relief to finally get the chance to fix the mistakes I made in the past.

But the task right now is a little more prosaic. I'm just trying to get the thing to run without crashing. Getting the actual rendering working will be stage two of this process.

So I'm still trying to get the back buffer to be initialised before it's accessed. Sounds simple, but the code is a bit of web. We have a call to Resize() which is crashing and a call to PrepareOffscreen() which creates the swap chain. We need to create the swap chain and initialise the back buffer before the Resize() happens.

If we follow the backtraces back, the ordering problem seems to end up here:
PLayerTransactionParent*
EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent
    (const nsTArray<LayersBackend>& aBackendHints, const LayersId& aId)
{
  PLayerTransactionParent* p =
    CompositorBridgeParent::AllocPLayerTransactionParent(aBackendHints, aId);

  EmbedLiteWindowParent *parentWindow = EmbedLiteWindowParent::From(mWindowId);
  if (parentWindow) {
    parentWindow->GetListener()->CompositorCreated();
  }

  if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
    // Prepare Offscreen rendering context
    PrepareOffscreen();
  }
  return p;
}
That's because the call stack for CreateContext(), which is where the SwapChain::Resize() gets called, includes CompositorBridgeParent::AllocPLayerTransactionParent(), whereas the SwapChain object is created in PrepareOffscreen(). As we can see, these happen in the wrong order.

One thing that seems worth trying is configuring the back buffer immediately after creating the swap chain. So I've given it a go by adding the call to ResizeScreenBuffer() in directly after the SwapChain constructor is called, like this:
    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);

      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
      if (!success) {
          NS_WARNING("Failed to create SwapChain back buffer");
      }
    }
When I execute the updated code, this call to resize the screen buffer now triggers a crash.
$ gdb harbour-webview
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
=============== Preparing offscreen rendering context ===============

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 16991]
mozilla::gl::SwapChain::Resize (this=0x7ed81ce090, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb) bt
#0  mozilla::gl::SwapChain::Resize (this=0x7ed81ce090, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7ed819ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff366824c in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PrepareOffscreen (this=this@entry=0x7fc4bef570)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:132
#3  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4bef570, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:90
#4  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4bef570, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#19 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) p size.width
$2 = 1080
(gdb) p size.height
$3 = 2520
(gdb) p mFactory.mTuple.mFirstA
$5 = (mozilla::gl::SurfaceFactory *) 0x0
(gdb) 
As we can see from the above value of mFactory.mTuple.mFirstA and the code below, the reason for the crash is that the SurfaceFactory needed to generate the surface hasn't yet been initialised. As before, it's all about the sequencing.
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
  if (!newBack) return false;

  if (mPresenter->mBackBuffer) mPresenter->mBackBuffer->ProducerRelease();

  mPresenter->mBackBuffer.reset(newBack.release());

  mPresenter->mBackBuffer->ProducerAcquire();

  return true;
}
It turns out, the factory is created before, but isn't set until afterwards:
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
So I've rejigged things. Crucially though, although the factory should be reset independently of whether we're creating a new swap chain or not, we don't want the resize to happen except when it's a new swap chain. I've therefore had to create a new initSwapChain Boolean to capture whether this is a new swap chain or not. If it is, we can then perform the resize after the factory code has executed.
  bool initSwapChain = false;
  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
[...]
    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      initSwapChain = true;
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }

    if (initSwapChain) {
      bool success = context->ResizeScreenBuffer(mEGLSurfaceSize);
      if (!success) {
          NS_WARNING("Failed to create SwapChain back buffer");
      }
    }
  }
This seems worth a try, so I've set the build off running again and we'll see how it pans out when it's done.

As always the build is taking a very long, so we'll have to wait until the morning to find out how this has gone.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
26 Feb 2024 : Day 168 #
Overnight the build I started yesterday successfully finished. That, in itself, is a bit of a surprise (no stupid syntax errors in my code!). This morning I've copied over the packages and installed them, and now I'm on the train ready to debug.

I optimistically run the app without the debugger. The window appears again. There's no rendering, just a white screen, but there's also no immediate crash and no obvious errors in the debug output.

After running for around twenty seconds or so, the app then crashes.
$ time harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
Command terminated by signal 11
real    0m 20.82s
user    0m 0.87s
sys     0m 0.23s
This is quite unexpected behaviour if I'm honest. Something is causing it to crash after a prolonged period ("prolonged" meaning from the perspective of computation, rather than from the perspective of the user).

That was without the debugger; I'd better try it with the debugger to find out why it's crashing.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) r
Starting program: /usr/bin/harbour-webview 
[...]
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 18684]
mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb) bt
#0  mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7edc19ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff119b8d4 in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7edc002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:264
#3  0x0000007ff11b0ea8 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7edc002f10, out_failureReason=0x7f17aac520)
    at gfx/layers/opengl/CompositorOGL.cpp:394
#4  0x0000007ff12c68e8 in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7fc4b7b450, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#5  0x0000007ff12d1964 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4b7b450, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#6  0x0000007ff12d1a94 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4b7b450,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#7  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b7b450, aBackendHints=..., 
    aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#8  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b7b450, msg__=...)
    at PCompositorBridgeParent.cpp:1285
#9  0x0000007ff0ca9fe4 in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
#10 0x0000007ff0bc985c in mozilla::ipc::MessageChannel::DispatchAsyncMessage
    (this=this@entry=0x7fc4d82fb8, aProxy=aProxy@entry=0x7edc002aa0, aMsg=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
[...]
#23 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
As before, it runs for around twenty seconds, then crashes. The line that's causing the crash is this one:
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
[...]
}
And the reason isn't because mFactory is null, it's because this (meaning the SwapChain instance) is null. But when I try to access the memory to show that it's null using the debugger I start getting strange errors:
(gdb) p mFactory
Cannot access memory at address 0x8
(gdb) frame 1
#1  0x0000007ff110dc14 in mozilla::gl::GLContext::ResizeScreenBuffer
    (this=this@entry=0x7edc19ee40, size=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p mSwapChain
Cannot access memory at address 0x7edc19f838
(gdb) p this
$1 = (mozilla::gl::GLContext * const) 0x7edc19ee40
(gdb) frame 2
#2  0x0000007ff119b8d4 in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7edc002f10)
    at gfx/layers/opengl/CompositorOGL.cpp:264
264       bool success = context->ResizeScreenBuffer(mSurfaceSize);
(gdb) p context
$2 = {mRawPtr = 0x7edc19ee40}
(gdb) p context->mRawPtr
Attempt to take address of value not located in memory.
(gdb) p context->mRawPtr->mSwapChain
Attempt to take address of value not located in memory.
I wonder if this is being caused by a memory leak that quickly gets out of hand? Placing a breakpoint on GLContext::ResizeScreenBuffer()K shows that it's not due to repeated calls to this method: this gets called only once, at which point there's an immediate segfault.
(gdb) b GLContext::ResizeScreenBuffer
Breakpoint 1 at 0x7ff110dbdc: file gf
x/gl/GLContext.cpp, line 1885.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview
[...]
Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::GLContext::
    ResizeScreenBuffer (this=this@entry=0x7ed419ee40, size=...)
    at gfx/gl/GLContext.cpp:1885
1885    bool GLContext::ResizeScreenBuffer(const gfx::IntSize& size) {
(gdb) c
Continuing.

Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
mozilla::gl::SwapChain::Resize (this=0x0, size=...)
    at gfx/gl/GLScreenBuffer.cpp:134
134           mFactory->CreateShared(size);
(gdb)                            
I'm curious to know what's happening after twenty seconds that would cause this. Looking more carefully at the backtrace for the crash above, it's strange that an attempt is being made to create the compositor. Shouldn't that have already been created? I wonder if this delay is related to network connectivity.

As usual I'm attempting this debugging on the train. But my development phone has no Internet connectivity here. So perhaps it's waiting for a connection before creating the compositor? Maybe the connection fails after twenty seconds at which point the compositor is created and the library segfaults.

This seems plausible, even if it doesn't quite explain the peculiar nature of the debugging that followed, where I couldn't access any of the variables.

Let's assume this is the case, back up a bit, and try to capture some state before the crash happens. If the crash is causing memory corruption, that might explain the lack of accessible variables. And if that's the case, then catching execution before the memory gets messed up should allow us to get a clearer picture.
(gdb) b CompositorOGL::CreateContext
Breakpoint 2 at 0x7ff119b764: file gfx/layers/opengl/CompositorOGL.cpp,
    line 227.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
We're coming in to London now, so time to pause and rapidly pack up my stuff before we pull in to the station!

[...]

You'll be pleased to hear I made it off the train safely and with all my belongings. It was touch-and-go for a few seconds there though. I'm now travelling in the opposite direction on (I hope) the adjacent tracks. Time to return to that debugging.

I'm happy to discover, despite having literally pulled the plug on my phone mid-debug, that on reattaching the cable and restoring my gnu screen session, the debugger is still in exactly the same state that I left it. Linux is great!

And now we have a bit more luck again from the captured backtrace:
Thread 37 "Compositor" hit Breakpoint 2, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7edc002ed0)
    at gfx/layers/opengl/CompositorOG
L.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) p context
$3 = <optimized out>
(gdb) p mSwapChain
No symbol "mSwapChain" in current context.
(gdb) p context
$4 = <optimized out>
(gdb) bt
#0  mozilla::layers::CompositorOGL::CreateContext (this=this@entry=0x7edc002ed0)
    at gfx/layers/opengl/CompositorOG
L.cpp:227
#1  0x0000007ff11b0ea8 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7edc002ed0, out_failureReason=0x7f17a6b520)
    at gfx/layers/opengl/CompositorOGL.cpp:394
#2  0x0000007ff12c68e8 in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7fc4beb0e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#3  0x0000007ff12d1964 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4beb0e0, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#4  0x0000007ff12d1a94 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4beb0e0,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#5  0x0000007ff36682b8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4beb0e0, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#6  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4beb0e0, msg__=...)
    at PCompositorBridgeParent.cpp:1285
[...]
#21 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) n
[New LWP 32378]
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) n
[New LWP 32389]
[LWP 7850 exited]
232       void* widgetOpenGLContext =
(gdb) n
[New LWP 32476]
[LWP 32389 exited]
234       if (widgetOpenGLContext) {
(gdb) n
248       if (!context && gfxEnv::LayersPreferOffscreen()) {
(gdb) n
249         nsCString discardFailureId;
(gdb) n
250         context = GLContextProvider::CreateHeadless(
(gdb) n
252         if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
(gdb) n
249         nsCString discardFailureId;
(gdb) n
257       if (!context) {
(gdb) n
264       bool success = context->ResizeScreenBuffer(mSurfaceSize);
(gdb) p context
$7 = {mRawPtr = 0x7edc19ee40}
(gdb) p context.mRawPtr
$8 = (mozilla::gl::GLContext *) 0x7edc19ee40
(gdb) p context.mRawPtr.mSwapChain
$9 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, mozilla::DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>}, mFirstA = 0x0}, <No data fields>}}
(gdb) p context.mRawPtr.mSwapChain.mTuple.mFirstA
$10 = (mozilla::gl::SwapChain *) 0x0
(gdb)
We can conclude that the SwapChain hasn't been created yet. Which means this new bit of code I added, which is the code that's crashing, is being called too early. That's not quite what I was expecting. Just to check I've added a breakpoint to EmbedLiteCompositorBridgeParent::PrepareOffscreen(), which is where the SwapChain is created. This is just to double-check the ordering.
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Breakpoint 3 at 0x7ff366810c: file mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp, line 104.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 2, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7ed8002da0)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::
    CreateContext() {
(gdb) 
This confirms it: the CreateContext() call is happening before the PrepareOffscreen() call. I'll need to think about this again then.

The train is now coming in to Cambridge. I'm not taking any chances this time and will be packing up with plenty of time to spare! Sadly that's going to have to be it for today, but I'll pick this up again tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
25 Feb 2024 : Day 167 #
It felt like I had to stop halfway through a thought yesterday. But sometimes tiredness gets the better of me, I can feel myself spiralling into incoherence and the only sensible thing to do is head to bed. Sometimes it's a slow descent while relaxing or reading a book; other times I reach the incoherent end of the spectrum before my mind has even caught up with the fact I'm heading there.

So let me try to regroup. What we learnt yesterday was that OffscreenSize() previously returned mScreen->Size() and mScreen was created in InitOffscreen(). This InitOffscreen() method no longer exists — it was removed in D75055 — but was originally called in GLContextProviderEGL::CreateOffscreen().

The GLContextProviderEGL::CreateOffscreen() method also no longer exists in the codebase, replaced as it was in D79390:
$ git log -1 -S "GLContextProviderEGL::CreateOffscreen" \
    gecko-dev/gfx/gl/GLContextProviderEGL.cpp
commit 4232c2c466220d42223443bd5bd2f3c849123380
Author: Jeff Gilbert <jgilbert@mozilla.com>
Date:   Mon Jun 15 18:26:12 2020 +0000

    Bug 1632249 - Replace GLContextProvider::CreateOffscreen with
    GLContext::CreateOffscreenDefaultFb. r=lsalzman
    
    Differential Revision: https://phabricator.services.mozilla.com/D79390
Looking at the ESR 91 code and the diffs applied to them it's not immediately obvious to me where this was getting called from and what's replacing it now, but we can get a callstack for how that was being called using the debugger on ESR 78. Here's the (abridged) backtrace:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b GLContextProviderEGL::CreateOffscreen Breakpoint 6 at 0x7fb8e839f0:
    file gfx/gl/GLContextProviderEGL.cpp, line 1400.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 6, mozilla::gl::GLContextProviderEGL::
    CreateOffscreen (size=..., minCaps=..., 
    flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE,
    out_failureId=out_failureId@entry=0x7fa516f378)
    at gfx/gl/GLContextProviderEGL.cpp:1400
1400        CreateContextFlags flags, nsACString* const out_failureId) {
(gdb) bt
#0  mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, 
    out_failureId=out_failureId@entry=0x7fa516f378)
    at gfx/gl/GLContextProviderEGL.cpp:1400
#1  0x0000007fb8ee475c in mozilla::layers::CompositorOGL::CreateContext
    (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#2  mozilla::layers::CompositorOGL::CreateContext (this=0x7eac003420)
    at gfx/layers/opengl/CompositorOGL.cpp:223
#3  0x0000007fb8f053bc in mozilla::layers::CompositorOGL::Initialize
    (this=0x7eac003420, out_failureReason=0x7fa516f730)
    at gfx/layers/opengl/CompositorOGL.cpp:374
#4  0x0000007fb8fdcf7c in mozilla::layers::CompositorBridgeParent::NewCompositor
    (this=this@entry=0x7f8c99dc60, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1534
#5  0x0000007fb8fe65e8 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7f8c99dc60, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1491
#6  0x0000007fb8fe6730 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7f8c99dc60,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1587
#7  0x0000007fbb2e31b4 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7f8c99dc60, aBackendHints=..., 
    aId=...) at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:77
#8  0x0000007fb88c13d0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7f8c99dc60, msg__=...)
    at PCompositorBridgeParent.cpp:1391
#9  0x0000007fb88f86ac in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:866
[...]
#24 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6
(gdb) 
So examining the code at frame 1, we see that what was a call to CreateOffscreen():
  // Allow to create offscreen GL context for main Layer Manager
  if (!context && gfxEnv::LayersPreferOffscreen()) {
    SurfaceCaps caps = SurfaceCaps::ForRGB();
    caps.preserve = false;
    caps.bpp16 = gfxVars::OffscreenFormat() == SurfaceFormat::R5G6B5_UINT16;

    nsCString discardFailureId;
    context = GLContextProvider::CreateOffscreen(
        mSurfaceSize, caps, CreateContextFlags::REQUIRE_COMPAT_PROFILE,
        &discardFailureId);
  }
Is now a call to CreateHeadless():
  // Allow to create offscreen GL context for main Layer Manager
  if (!context && gfxEnv::LayersPreferOffscreen()) {
    nsCString discardFailureId;
    context = GLContextProvider::CreateHeadless(
        {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) {
      context = nullptr;
    }
  }
Whether we're on ESR 78 or ESR 91, this is all happening inside CompositorOGL::CreateContext().

Looking at the difference between the previous code that was called in CreateOffscreen() and the new code being called in CreateHeadless() my heart sinks a bit. There's so much that's been removed. It's true that CreateOffscreen() did go on to call CreateHeadless() in ESR 78, but there's so much other initialisation code in ESR 78, I just can't believe we can safely throw it all away.

But I'm going to persevere down this road I've started on, gradually building things back up only where they're needed to get things working. Right now that still means fixing the crash when OffscreenSize() is called.

I've not quite reached the point where the two sides of this circle meet up and the correct position to create the mBackBuffer emerges, but I feel like this exploration is getting us closer.

It's time for work now, I'll pick this up later on today.

[...]

After thinking on this some more, I've come to the conclusion that the right place to set up the mBackBuffer variable is in, or near, the call to GLContextProviderEGL::CreateHeadless(). It's there that the mScreen object would have been created in ESR 78 and checking with the debugger shows that the ordering is appropriate: CreateHeadless() gets called before CompositeToDefaultTarget(), which is what we need.
(gdb) delete break
(gdb) b EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
Breakpoint 3 at 0x7ff3667880: EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget. (2 locations)
(gdb) b GLContextProviderEGL::CreateHeadless
Breakpoint 4 at 0x7ff1133740: file gfx/gl/GLContextProviderEGL.cpp, line 1245.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 4, mozilla::gl::GLContextProviderEGL::
    CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1faad1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
1245        const GLContextCreateDesc& desc, nsACString* const out_failureId) {
(gdb) bt
#0  mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=...,
    out_failureId=out_failureId@entry=0x7f1faad1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
#1  0x0000007ff119b81c in mozilla::layers::CompositorOGL::CreateContext
    (this=this@entry=0x7ee0002f50)
    at gfx/layers/opengl/CompositorOGL.cpp:250
#2  0x0000007ff11b0e24 in mozilla::layers::CompositorOGL::Initialize
    (this=0x7ee0002f50, out_failureReason=0x7f1faad520)
    at gfx/layers/opengl/CompositorOGL.cpp:389
#3  0x0000007ff12c6864 in mozilla::layers::CompositorBridgeParent::
    NewCompositor (this=this@entry=0x7fc4b3d260, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1493
#4  0x0000007ff12d18e0 in mozilla::layers::CompositorBridgeParent::
    InitializeLayerManager (this=this@entry=0x7fc4b3d260, aBackendHints=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1436
#5  0x0000007ff12d1a10 in mozilla::layers::CompositorBridgeParent::
    AllocPLayerTransactionParent (this=this@entry=0x7fc4b3d260,
    aBackendHints=..., aId=...)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:1546
#6  0x0000007ff3668238 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b3d260, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:80
#7  0x0000007ff0c65ad0 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b3d260, msg__=...)
    at PCompositorBridgeParent.cpp:1285
#8  0x0000007ff0ca9fe4 in mozilla::layers::PCompositorManagerParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/
    ProtocolUtils.h:675
[...]
#22 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) c
Continuing.
[New LWP 2694]
=============== Preparing offscreen rendering context ===============
[New LWP 2695]

Thread 36 "Compositor" hit Breakpoint 3, non-virtual thunk to mozilla::embedlite::EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
    (mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>) ()
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.h:58
58        virtual void CompositeToDefaultTarget(VsyncId aId) override;
(gdb) bt
#0  non-virtual thunk to mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget(mozilla::layers::BaseTransactionId
    <mozilla::VsyncIdType>) ()
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.h:58
#1  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0df60, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#2  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4b3d260)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#3  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4b3d260, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#4  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int,
    int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul,
    3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#16 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
The only problem is that the SwapChain object is held by GLContext whereas CreateHeadless() is part of GLContextProviderEGL(), which has access to very little, let alone the SwapChain. The good news is GLContextProviderEGL does have access to GLContext. The structure is something like this:
class GLContext 
public:
    UniquePtr<SwapChain> mSwapChain;
}

class GLContextEGL final : public GLContext {
}

void EmbedLiteCompositorBridgeParent::PrepareOffscreen() {
    const CompositorBridgeParent::LayerTreeState* state =
    CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
    GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->
    GetCompositor())->gl();

    swapChain = new SwapChain();
    new SwapChainPresenter(*swapChain);
    context->mSwapChain.reset(swapChain);
}

already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
    context = GLContextProvider::CreateHeadless({CreateContextFlags::
    REQUIRE_COMPAT_PROFILE}, &discardFailureId);
    // Here we have access to context->mSwapChain;
}
Based on this, it looks like adding code in to CompositorOGL::CreateContext(), after the call to GLContextProvider::CreateHeadless() (which is actually a call to GLContextProviderEGL::CreateHeadless() might be the right place to put the code to create mBackBuffer.

So, after the calls in CompositorOGL::CreateContext() I've inserted the following code:
  bool success = context->ResizeScreenBuffer(mSurfaceSize);
  if (!success) {
      NS_WARNING("Failed to create SwapChain back buffer");
  }
This will make for my first attempt to fix this. I've set the build running and we'll have to find out for sure whether that's improved the situation tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
24 Feb 2024 : Day 166 #
We're back on to looking at the SwapChain code, part of the offscreen rendering pipeline, today. This is all for the purpose of getting the WebView working. Currently opening a WebView will cause the parent app to crash; it's clearly something that needs fixing before ESR 91 will be acceptable for daily use.

We managed to persuade the SwapChain object to be created by force-disabling the embedlite.compositor.external_gl_context static preference. Now we've found that the mBackBuffer is null which also results in a segfault. The mBackBuffer variable is a member of SwapChainPresenter *mPresenter, which is itself contained within the SwapChain class.

Looking through the code in GLScreenBuffer.cpp there are currently only two ways for the mBackBuffer variable to be initialised. Either it gets swapped in as a result of a call to SwapBackBuffer() like this:
std::shared_ptr<SharedSurface> SwapChainPresenter::SwapBackBuffer(
    std::shared_ptr<SharedSurface> back) {
[...]
  auto old = mBackBuffer;
  mBackBuffer = back;
  if (mBackBuffer) {
    mBackBuffer->WaitForBufferOwnership();
    mBackBuffer->ProducerAcquire();
    mBackBuffer->LockProd();
  }
  return old;
}

Or it gets created as a result of a call to the Resize() method like this:
bool SwapChain::Resize(const gfx::IntSize& size) {
  UniquePtr<SharedSurface> newBack =
      mFactory->CreateShared(size);
  if (!newBack) return false;

  if (mPresenter->mBackBuffer) mPresenter->mBackBuffer->ProducerRelease();

  mPresenter->mBackBuffer.reset(newBack.release());

  mPresenter->mBackBuffer->ProducerAcquire();

  return true;
}
We should check whether either of these are being called. It is possible that SwapBackBuffer() is being called but with an empty back parameter, or that the CreateShared() call in Resize() is failing. Either of these would leave us in our current situation. However just as likely is that neither of these are being called and there's not even an attempt being made to initialise mBackBuffer. We need to know!
(gdb) delete break
(gdb) b SwapChainPresenter::SwapBackBuffer
Breakpoint 1 at 0x7ff1109c14: file gfx/gl/GLScreenBuffer.cpp, line 82.
(gdb) b SwapChain::Resize
Breakpoint 2 at 0x7ff110a398: file gfx/gl/GLScreenBuffer.cpp, line 132.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 4362]
0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4b3f070, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$1 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) 
No breakpoint hits before the segfault, so neither of those methods are being called. Armed with this knowledge we must now turn our thoughts to solutions, and there are multiple potential solutions we could choose: guard against a null access in SwapChain::OffscreenSize(); ensure mBackBuffer is created at the same time as the mPresenter that contains it; find out if the process flow should be calling one of the swap or resize methods prior to this call.

It's now time for my work day, so answers to these questions will have to wait until tonight. Still, this is progress.

[...]

To try to get a handle on where this mBackBuffer ought to be created, I thought it might help to figure out some sequencing. Here's how things seem to be happening:
  1. The SwapChain is created in EmbedLiteCompositorBridgeParent::PrepareOffscreen().
  2. The SwapChainPresenter() is created immediately afterwards.
  3. In EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget() the SwapChain::OffscreenSize() method is called, causing the crash.
  4. Immediately after this SwapChain::Resize() is called, which if done earlier, would prevent the crash.
Here's that sequence of code where the crash is caused:
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
      return;
    }
It's also worth noting, I think, that SwapChain::Acquire() calls SwapBackBuffer() with a non-null back buffer parameter, which if called early enough would also prevent the code from crashing subsequently when OffscreenSize() is read.

Having re-reviewed the original D75055 changeset that introduced the SwapChain, along with the history of the related files that aren't part of the changeset, I'm beginning to get a better picture.

For example, the code that makes the failing call to OffscreenSize() was added by me earlier on in this process:
$ git blame \
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \
    -L 153,156
^01b1a0352034 embedthread/EmbedLiteCompositorBridgeParent.cpp
    (Raine Makelainen  2020-07-24 16:25:17 +0300 153)
    MutexAutoLock lock(mRenderMutex);
d59d44a5bccaf embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
    (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 154)
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
d59d44a5bccaf embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp
    (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 155)
    && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
^01b1a0352034 embedthread/EmbedLiteCompositorBridgeParent.cpp
    (Raine Makelainen      2020-07-24 16:25:17 +0300 156)
    return;
Prior to this the logic was somewhat different:
$ git diff d59d44a5bccaf~ d59d44a5bccaf

@@ -153,7 +157,8 @@ EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget
    (VsyncId aId)
[...] 
   if (context->IsOffscreen()) {
     MutexAutoLock lock(mRenderMutex);
-    if (context->OffscreenSize() != mEGLSurfaceSize && !context->
    ResizeOffscreen(mEGLSurfaceSize)) {
+    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
+      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
       return;
     }
   }
The code for returning the offscreen size was also added by me slightly earlier on the same day:
$ git blame gfx/gl/GLScreenBuffer.cpp -L 128,130
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 128)
    const gfx::IntSize& SwapChain::OffscreenSize() const {
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 129)
      return mPresenter->mBackBuffer->mFb->mSize;
f536dbf9f6f8a (David Llewellyn-Jones 2023-09-04 08:52:00 +0100 130)
    }
And as you can see, a lot of code was changed in this commit, especially this code:
diff --git a/gfx/gl/GLScreenBuffer.cpp b/gfx/gl/GLScreenBuffer.cpp
index 0398dd7dc6a2..e71263068777 100644
--- a/gfx/gl/GLScreenBuffer.cpp
+++ b/gfx/gl/GLScreenBuffer.cpp
@@ -116,4 +116,38 @@ SwapChain::~SwapChain() {
   }
 }
 
[...]
+
+const gfx::IntSize& SwapChain::OffscreenSize() const {
+  return mPresenter->mBackBuffer->mFb->mSize;
+}
+
[...]
So much — if not all — of this faulty code is down to me. Back then I was coding without being able to test, so this isn't a huge surprise. But it does also mean I have more scope to control the situation and make changes to the implementation.

Prior to all these changes the OffscreenSize() implementation looked like this:
const gfx::IntSize& GLContext::OffscreenSize() const {
  MOZ_ASSERT(IsOffscreen());
  return mScreen->Size();
}
The mScreen being used here is analogous to the mBackBuffer and is created in this method:
bool GLContext::CreateScreenBufferImpl(const IntSize& size,
                                       const SurfaceCaps& caps) {
  UniquePtr<GLScreenBuffer> newScreen =
      GLScreenBuffer::Create(this, size, caps);
  if (!newScreen) return false;

  if (!newScreen->Resize(size)) {
    return false;
  }

  // This will rebind to 0 (Screen) if needed when
  // it falls out of scope.
  ScopedBindFramebuffer autoFB(this);

  mScreen = std::move(newScreen);

  return true;
}
And the mScreen is created in a method called InitOffscreen().
bool GLContext::InitOffscreen(const gfx::IntSize& size,
                              const SurfaceCaps& caps) {
  if (!CreateScreenBuffer(size, caps)) return false;
[...]
Finally InitOffScreen() is being called in GLContextProviderCGL::CreateOffscreen():
already_AddRefed<GLContext> GLContextProviderCGL::CreateOffscreen(
    const IntSize& size,
    const SurfaceCaps& minCaps,
    CreateContextFlags flags,
    nsACString* const out_failureId) {
  RefPtr<GLContext> gl = CreateHeadless(flags, out_failureId);
  if (!gl) {
    return nullptr;
  }

  if (!gl->InitOffscreen(size, minCaps)) {
    *out_failureId = NS_LITERAL_CSTRING("FEATURE_FAILURE_CGL_INIT");
    return nullptr;
  }

  return gl.forget();
}
It feels like we've nearly come full circle, which would be good because then that would be enough to make a clear decision about how to address this. But it's already late here now and time for me to call it a night, so that decision will have to wait for tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
23 Feb 2024 : Day 165 #
Today I'm following up the tasks I set out for myself yesterday:
 
Tomorrow I'll do a sweep of the other code to check whether any attempt is being made to initialise it somewhere else. If not I'll add in some initialisation code to see what happens.

As you may recall the WebView is crashing because the SwapChain returned from the context is null. It should be created somewhere, but it's not yet clear where. So the first question is whether there's some code to create it that isn't being called, or whether there's nowhere in the code currently set to create it.

A quick grep of the code throws up a few potential places where it could be being created:
$ grep -rIn "new SwapChain(" *
embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp:128:
    swapChain = new SwapChain();
gecko-dev/dom/webgpu/CanvasContext.cpp:95:
    mSwapChain = new SwapChain(aDesc, extent, mExternalImageId, format);
gecko-dev/dom/webgpu/CanvasContext.cpp:139:
    mSwapChain = new SwapChain(desc, extent, mExternalImageId, gfxFormat);
The most promising of these is going to be the code in EmbedLiteCompositorBridgeParent.cpp since that's EmbedLite-specific code. In fact, probably something I added myself during the ESR 91 changes (since SwapChain is new to ESR 91):
$ git blame \
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \
    -L 128,128
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 128)
    swapChain = new SwapChain();
Confirmed. I even added a note to myself at the time to explain that this might need fixing:
$ git blame \\
    embedding/embedlite/embedthread/EmbedLiteCompositorBridgeParent.cpp \\
    -L 113,114
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 113)
    // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
d59d44a5bccaf (David Llewellyn-Jones 2023-09-04 09:03:49 +0100 114)
    // See: https://phabricator.services.mozilla.com/D75055
Here's the relevant bit of code:
void
EmbedLiteCompositorBridgeParent::PrepareOffscreen()
{
  fprintf(stderr,
      "=============== Preparing offscreen rendering context ===============\n");

  const CompositorBridgeParent::LayerTreeState* state =
      CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
  NS_ENSURE_TRUE(state && state->mLayerManager, );

  GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->
      GetCompositor())->gl();
  NS_ENSURE_TRUE(context, );

  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
}
So either this method isn't being run, or context->IsOffscreen() must be set to false. Let's find out. Conveniently I can once again continue with my debugging session from yesterday:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Breakpoint 10 at 0x7ff366808c: file mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp, line 104.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 14342]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
129       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4aebc90, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0b230, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4aebc90)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4aebc90, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#6  mozilla::detail::RunnableMethodArguments<int, int, int, int>::apply
    <mozilla::layers::CompositorBridgeParent, void (mozilla::layers::
    CompositorBridgeParent::*)(int, int, int, int)> (m=<optimized out>,
    o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
So the crash is happening before PrepareOffscreen() is called. Looking through the code there's actually only one place it is called and that's inside EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent(). There it's wrapped in a condition:
  if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
    // Prepare Offscreen rendering context
    PrepareOffscreen();
  }
So I should check whether the problem is that this isn't being called, or the condition is false. A quick debug confirms that it's the latter: the method is entered but the value of the static preference means the PrepareOffScreen() call is never being made:
Thread 36 "Compositor" hit Breakpoint 11, mozilla::embedlite::
    EmbedLiteCompositorBridgeParent::AllocPLayerTransactionParent
    (this=0x7fc4bc68c0, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:79
79      {
(gdb) n
80        PLayerTransactionParent* p =
(gdb) n
83        EmbedLiteWindowParent *parentWindow = EmbedLiteWindowParent::From(mWindowId);
(gdb) n
84        if (parentWindow) {
(gdb) n
85          parentWindow->GetListener()->CompositorCreated();
(gdb) n
88        if (!StaticPrefs::embedlite_compositor_external_gl_context()) {
(gdb) n
mozilla::layers::PCompositorBridgeParent::OnMessageReceived
    (this=0x7fc4bc68c0, msg__=...) at PCompositorBridgeParent.cpp:1286
1286    PCompositorBridgeParent.cpp: No such file or directory.
(gdb) 
As we can see, it comes down to this embedlite.compositor.external_gl_context static preference, which needs to be set to false for the condition to be entered.

This preference isn't being set for the WebView, although it is set for the browser:
$ pushd ../sailfish-browser/
$ grep -rIn "embedlite.compositor.external_gl_context" *
apps/core/declarativewebutils.cpp:239:
    webEngineSettings->setPreference(QString(
    "embedlite.compositor.external_gl_context"), QVariant(true));
data/prefs.js:5:
    user_pref("embedlite.compositor.external_gl_context", true);
tests/auto/mocks/declarativewebutils/declarativewebutils.cpp:62:
    webEngineSettings->setPreference(QString(
    "embedlite.compositor.external_gl_context"), QVariant(true));
$ popd
I'm going to set it to false explicitly for the WebView. But this immediately makes me feel nervous: this setting isn't new and there's a reason it's not being touched in the WebView code. It makes me think that I'm travelling down a rendering pipeline path that I shouldn't be.

So as well as trying out this change I'm also going to ask for some expert advice from the Jolla team about this, just in case it's actually important that I don't set this to false and that the real issue is somewhere else.

But, it's the start of my work day, so that will all have to wait until later.

[...]

I've added in the change to set the embedlite.compositor.external_gl_context static preference to false:
diff --git a/lib/webenginesettings.cpp b/lib/webenginesettings.cpp
index de9e4b86..780f6555 100644
--- a/lib/webenginesettings.cpp
+++ b/lib/webenginesettings.cpp
@@ -110,6 +110,12 @@ void SailfishOS::WebEngineSettings::initialize()
     engineSettings->setPreference(QStringLiteral("intl.accept_languages"),
                                   QVariant::fromValue<QString>(langs));
 
+    // Ensure the renderer is configured correctly
+    engineSettings->setPreference(QStringLiteral("gfx.webrender.force-disabled"),
+                                  QVariant(true));
+    engineSettings->setPreference(QStringLiteral("embedlite.compositor.external_gl_context"),
+                                  QVariant(false));
+
     Silica::Theme *silicaTheme = Silica::Theme::instance();
 
     // Notify gecko when the ambience switches between light and dark
The code has successfully built; now it's time to test it. On a dry run of the new code it crashes seemingly somewhere close to where it was crashing before. But crucially the debug print from inside the PrepareOffscreen() method is now being output. So we've definitely moved a step forwards.
$ harbour-webview
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
=============== Preparing offscreen rendering context ===============
Segmentation fault
To find out what's going on we can step through. And after the new install it's finally time to start a new debugging session.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b EmbedLiteCompositorBridgeParent::PrepareOffscreen
Function "EmbedLiteCompositorBridgeParent::PrepareOffscreen" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (EmbedLiteCompositorBridgeParent::PrepareOffscreen) pending.
(gdb) r
[...]
Thread 36 "Compositor" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteCompositorBridgeParent::PrepareOffscreen
    (this=this@entry=0x7fc4be7560)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:104
104     {
(gdb) n
[LWP 18280 exited]
105       fprintf(stderr,
    "=============== Preparing offscreen rendering context ===============\n");
(gdb) 
=============== Preparing offscreen rendering context ===============
107       const CompositorBridgeParent::LayerTreeState* state =
    CompositorBridgeParent::GetIndirectShadowTree(RootLayerTreeId());
(gdb) 
108       NS_ENSURE_TRUE(state && state->mLayerManager, );
(gdb) 
110       GLContext* context = static_cast<CompositorOGL*>
    (state->mLayerManager->GetCompositor())->gl();
(gdb) p context
$1 = <optimized out>
(gdb) n
111       NS_ENSURE_TRUE(context, );
(gdb) n
3540    ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLContext.h:
    No such file or directory.
(gdb) n
117         if (context->GetContextType() == GLContextType::EGL) {
(gdb) n
119           factory = SurfaceFactory_EGLImage::Create(*context);
(gdb) n
126         SwapChain* swapChain = context->GetSwapChain();
(gdb) n
127         if (swapChain == nullptr) {
(gdb) n
128           swapChain = new SwapChain();
(gdb) n
129           new SwapChainPresenter(*swapChain);
(gdb) n
130           context->mSwapChain.reset(swapChain);
(gdb) n
133         if (factory) {
(gdb) n
134           swapChain->Morph(std::move(factory));
(gdb) n
mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4be7560, aBackendHints=..., aId=...)
    at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:92
92        return p;
(gdb) c
Continuing.

Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) bt
#0  0x0000007ff110a38c in mozilla::gl::SwapChain::OffscreenSize
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be7560, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0b1a0, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4be7560)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4be7560, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
Here's the code that's causing the crash:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
It would be helpful to know which of these values is null, but unhelpfully the values have been optimised out.
(gdb) p mPresenter
value has been optimized out
(gdb) p this
$2 = <optimized out>
However if we go up a stack frame we can have better luck, applying the trick we used on Day 164 to extract the SwapChain object from the context via the UniquePtr class:
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be7560, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context
$3 = (mozilla::gl::GLContext *) 0x7ee019ee40
(gdb) p context->mSwapChain.mTuple.mFirstA
$5 = (mozilla::gl::SwapChain *) 0x7ee01ce090
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter
$6 = (mozilla::gl::SwapChainPresenter *) 0x7ee01a1380
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$7 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer->mFb
Cannot access memory at address 0x20
(gdb) 
As we can see from this, the missing value is the mBackBuffer value inside the SwapChainPresenter object of the SwapChain class.

It's clear what the next step is: find out why the mBackBuffer value isn't being set and, if necessary, set it. But that's a task that'll have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
22 Feb 2024 : Day 164 #
Working on the WebView implementation, yesterday we reached the point where the WebView component no longer crashed the app hosting it. We did this by ensuring the correct layer manager was used for rendering.

But now we're left with a bunch of errors. The ones that need fixing immediately are the following:
[W] unknown:7 - file:///usr/share/harbour-webview/qml/harbour-webview.qml:7:30:
    Type WebViewPage unavailable 
         initialPage: Component { WebViewPage { } } 
                                  ^
[W] unknown:13 - file:///usr/share/harbour-webview/qml/pages/
    WebViewPage.qml:13:5: Type WebView unavailable 
         WebView { 
         ^
[W] unknown:141 - file:///usr/lib64/qt5/qml/Sailfish/WebView/
    WebView.qml:141:9: Type TextSelectionController unavailable 
             TextSelectionController { 
             ^
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
This cascade of errors all reduces to the last:
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
The reason for this is also clear. The spec file for sailfish-components-webview makes clear that libqofono 0.117 or above is needed. I don't have this on my system for whatever reason (I'll need to investigate), but to work around this I hacked the spec file so that it wouldn't refuse to install on a system with a lower version, like this:
diff --git a/rpm/sailfish-components-webview.spec
           b/rpm/sailfish-components-webview.spec
index 766933ba..c311ebcf 100644
--- a/rpm/sailfish-components-webview.spec
+++ b/rpm/sailfish-components-webview.spec
@@ -18,7 +18,7 @@ Requires: sailfishsilica-qt5 >= 1.1.123
 Requires: sailfish-components-media-qt5
 Requires: sailfish-components-pickers-qt5
 Requires: embedlite-components-qt5 >= 1.21.2
-Requires: libqofono-qt5-declarative >= 0.117
+Requires: libqofono-qt5-declarative >= 0.115
 
 %description
 %{summary}.
There's no build-time requirement, so I thought I might get away with it. But clearly not.

It seems a bit odd that a text selector component should be requiring an entire separate phone library in order to work. Let's take a look at why.

The ofono code comes at the end of the file. There are two OfonoNetworkRegistration components called cellular1Status and cellular2Status. These represent the state of the two SIM card slots in the device. You might ask why there are only two; can't you have more than two SIM card slots? Well, yes, but I guess this is a problem for future developers to deal with.

These two components feed into the following Boolean value at the top of the code:
    readonly property bool _canCall: cellular1Status.registered
        || cellular2Status.registered
Later on in the code we see this being used, like this:
        isPhoneNumber = _canCall && _phoneNumberSelected
So what's this all for? When you select some text the browser will present you with some options for what to do with it. Copy to clipboard? Open a link? If it thinks it's a phone number it will offer to make a call to it for you. Unless you don't have a SIM card installed. So that's why libqofono is needed here.

You might wonder how it knows it's a phone number at all. The answer to this question isn't in the sailfish-components-webview code. The answer is in embedlite-components, in the SelectionPrototype.js file where we find this code:
  _phoneRegex: /^\+?[0-9\s,-.\(\)*#pw]{1,30}$/,

  _getSelectedPhoneNumber: function sh_getSelectedPhoneNumber() {
    return this._isPhoneNumber(this._getSelectedText().trim());
  },

  _isPhoneNumber: function sh_isPhoneNumber(selectedText) {
    return (this._phoneRegex.test(selectedText) ? selectedText : null);
  },
So the decision about whether something is a phone number or not comes down to whether it satisfies the regex /^\+?[0-9\s,-.\(\)*#pw]{1,30}$/ and whether you have a SIM card installed.

But that's a bit of a diversion. We only care about this new libqofono. Why is this newer version needed and why don't I have it on my system? Let's find out when and why it was changed. $ git blame import/controls/TextSelectionController.qml -L 14,14 16ef5cdf4 (Pekka Vuorela 2023-01-05 12:09:27 +0200 14) import QOfono 0.2 $ git log -1 16ef5cdf4 commit 16ef5cdf44c2eafd7d93e17a41927ef5da700c2b Author: Pekka Vuorela <pekka.vuorela@jolla.com> Date: Thu Jan 5 12:09:27 2023 +0200 [components-webview] Migrate to new qofono import. JB#59690 Also dependency was missing. The actual change here was pretty small.
$ git diff 16ef5cdf44c2eafd7d93e17a41927ef5da700c2b~ \
    16ef5cdf44c2eafd7d93e17a41927ef5da700c2b
diff --git a/import/controls/TextSelectionController.qml
           b/import/controls/TextSelectionController.qml
index 5c8f2845..71bd83cc 100644
--- a/import/controls/TextSelectionController.qml
+++ b/import/controls/TextSelectionController.qml
@@ -11,7 +11,7 @@
 
 import QtQuick 2.1
 import Sailfish.Silica 1.0
-import MeeGo.QOfono 0.2
+import QOfono 0.2
 
 MouseArea {
     id: root
diff --git a/rpm/sailfish-components-webview.spec
           b/rpm/sailfish-components-webview.spec
index 9a2a3154..5729a8d9 100644
--- a/rpm/sailfish-components-webview.spec
+++ b/rpm/sailfish-components-webview.spec
@@ -18,6 +18,7 @@ Requires: sailfishsilica-qt5 >= 1.1.123
 Requires: sailfish-components-media-qt5
 Requires: sailfish-components-pickers-qt5
 Requires: embedlite-components-qt5 >= 1.21.2
+Requires: libqofono-qt5-declarative >= 0.117
 
 %description
 %{summary}.
The import has been updated as have the requirements. But there's been no change to the code, so the libqofono version requirement is probably only needed to deal with the name change of the import.

None of this seems essential for ESR 91. My guess is that this change has gone into the development code but hasn't yet made it into a release. So I'm going to hack around it for now (being careful not to commit my hacked changes into the repository).

I've already amended the version number in the spec file, so to get things to work I should just have to reverse this change:
-import MeeGo.QOfono 0.2
+import QOfono 0.2
I can do that on-device. This should do it:
sed -i -e 's/QOfono/MeeGo.QOfono/g' \
    /usr/lib64/qt5/qml/Sailfish/WebView/Controls/TextSelectionController.qml
Great! That's removed the QML error. But now the app is back to crashing again before it gets to even try to render something on-screen:
$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[...]
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
Segmentation fault
So it's back to the debugger again. But this will have to wait until this evening.

[...]

It's the evening and time to put the harbour-webview example through the debugger.
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 24061]
mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
129       return mPresenter->mBackBuffer->mFb->mSize;
(gdb) bt
#0  mozilla::gl::SwapChain::OffscreenSize (this=0x0)
    at gfx/gl/GLScreenBuffer.cpp:129
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be8da0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b808c in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4d0c0b0, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    LayersTypes.h:82
#3  0x0000007ff12b80e8 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4be8da0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b8174 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4be8da0, x=<optimized out>,
    y=<optimized out>, width=<optimized out>, height=<optimized out>)
    at gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12b0d10 in mozilla::detail::RunnableMethodArguments<int, int,
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void
    (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int),
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>,
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul,
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
This is now a proper crash, not something induced intentionally by the code. Here's the actual code causing the crash taken from GLSCreenBuffer.cpp:
const gfx::IntSize& SwapChain::OffscreenSize() const {
  return mPresenter->mBackBuffer->mFb->mSize;
}
The problem here being that the SwapChain object itself is null. So we should look in the calling method to find out what's going on there. Here's the relevant code this time from EmbedLiteCompositorBridgeParent.cpp:
void
EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget(VsyncId aId)
{
  GLContext* context = static_cast<CompositorOGL*>(state->mLayerManager->
      GetCompositor())->gl();
[...]
  if (context->IsOffscreen()) {
    MutexAutoLock lock(mRenderMutex);
    if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize
      && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) {
      return;
    }
  }
With a bit of digging we can see that the value being returned by context->GetSwapChain() is null:
(gdb) frame 1
#1  0x0000007ff3667930 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4be8da0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:
    No such file or directory.
(gdb) p context
$2 = (mozilla::gl::GLContext *) 0x7ed819ee00
(gdb) p context->GetSwapChain()
Cannot evaluate function -- may be inlined
(gdb) p context.mSwapChain
$3 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*,
    mozilla::DefaultDelete<mozilla::gl::SwapChain>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>},
    mFirstA = 0x0}, <No data fields>}}
(gdb) p context.mSwapChain.mTuple
$4 = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*,
    mozilla::DefaultDelete<mozilla::gl::SwapChain>,
    (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> =
    {<mozilla::DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>},
    mFirstA = 0x0}, <No data fields>}
(gdb) p context.mSwapChain.mTuple.mFirstA
$5 = (mozilla::gl::SwapChain *) 0x0
(gdb) 
You may recall that way back in the first three weeks of working on Gecko I hit a problem with the rendering pipeline. The GLScreenBuffer structure that the WebView has been using for a long time had been completely removed and replaced with this SwapChain class.

At the time I struggled with how to rearrange the code so that it compiled. I made changes that I couldn't test. And while I did get it to compile, these changes are now coming back to haunt me. Now I need to actually fix this rendering pipeline properly.

There's a bit of me that is glad I'm finally having to do this. I really want to know how it's actually supposed to work.

Clearly the first task will be to figure out why the mSwapChain member of GLContext is never being set. With any luck this will be at the easier end of the difficulty spectrum.

I'm going to try to find where mSwapChain is being — or should be being — set. To do that I'll need to find out where the context is coming from. The context is being passed by CompositorOGL so that would seem to be a good place to start.

Looking through the CompositoryOGL.cpp file we can see that the mGLContext member is being initialised from a value passed in to CompositorOGL::Initialize(). The debugger can help us work back from there.
(gdb) break CompositorOGL::Initialize
Breakpoint 1 at 0x7ff11b0c3c: file gfx/layers/opengl/CompositorOGL.cpp,
    line 380.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 36 "Compositor" hit Breakpoint 1, mozilla::layers::CompositorOGL::
    Initialize (this=0x7ee0002f50, out_failureReason=0x7f1faac520)
    at gfx/layers/opengl/CompositorOGL.cpp:380
380     bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
(gdb)
Ah! This is interesting. It's not being passed in because there are two different overloads of the CompositorOGL::Initialize() method and the code is using the other one. In this other piece of code the context is created directly:
bool CompositorOGL::Initialize(nsCString* const out_failureReason) {
  ScopedGfxFeatureReporter reporter("GL Layers");

  // Do not allow double initialization
  MOZ_ASSERT(mGLContext == nullptr || !mOwnsGLContext,
             "Don't reinitialize CompositorOGL");

  if (!mGLContext) {
    MOZ_ASSERT(mOwnsGLContext);
    mGLContext = CreateContext();
  }
[...]
Let's see what happens with the context creation.
Thread 36 "Compositor" hit Breakpoint 5, mozilla::layers::CompositorOGL::
    CreateContext (this=this@entry=0x7ee0002f50)
    at gfx/layers/opengl/CompositorOGL.cpp:227
227     already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() {
(gdb) n
231       nsIWidget* widget = mWidget->RealWidget();
(gdb) 
232       void* widgetOpenGLContext =
(gdb) 
234       if (widgetOpenGLContext) {
(gdb) 
248       if (!context && gfxEnv::LayersPreferOffscreen()) {
(gdb) b GLContextProviderEGL::CreateHeadless
Breakpoint 6 at 0x7ff1133740: file gfx/gl/GLContextProviderEGL.cpp, line 1245.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 6, mozilla::gl::GLContextProviderEGL::
    CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1245
1245        const GLContextCreateDesc& desc, nsACString* const out_failureId) {
(gdb) n
1246      const auto display = DefaultEglDisplay(out_failureId);
(gdb) 
1247      if (!display) {
(gdb) p display
$8 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 2)
    = {get() = 0x7ee0004cb0}
(gdb) n
1250      mozilla::gfx::IntSize dummySize = mozilla::gfx::IntSize(16, 16);
(gdb) b GLContextEGL::CreateEGLPBufferOffscreenContext
Breakpoint 7 at 0x7ff11335b8: file gfx/gl/GLContextProviderEGL.cpp, line 1233.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 7, mozilla::gl::GLContextEGL::
    CreateEGLPBufferOffscreenContext (
    display=std::shared_ptr<mozilla::gl::EglDisplay> (use count 2, weak count 2)
    = {...}, desc=..., size=..., 
    out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1233
1233        const mozilla::gfx::IntSize& size, nsACString* const
    out_failureId) {
(gdb) b GLContextEGL::CreateEGLPBufferOffscreenContextImpl
Breakpoint 8 at 0x7ff1133160: file gfx/gl/GLContextProviderEGL.cpp, line 1185.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 8, mozilla::gl::GLContextEGL::
    CreateEGLPBufferOffscreenContextImpl (
    egl=std::shared_ptr<mozilla::gl::EglDisplay> (use count 3, weak count 2) =
    {...}, desc=..., size=..., useGles=useGles@entry=false, 
    out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:1185
1185        nsACString* const out_failureId) {
(gdb) n
1186      const EGLConfig config = ChooseConfig(*egl, desc, useGles);
(gdb) 
1187      if (config == EGL_NO_CONFIG) {
(gdb) 
1193      if (GLContext::ShouldSpew()) {
(gdb) 
1197      mozilla::gfx::IntSize pbSize(size);
(gdb) 
1307    include/c++/8.3.0/bits/shared_ptr_base.h: No such file or directory.
(gdb) 
1208      if (!surface) {
(gdb) 
1214      auto fullDesc = GLContextDesc{desc};
(gdb) 
1215      fullDesc.isOffscreen = true;
(gdb) 
1217          egl, fullDesc, config, surface, useGles, out_failureId);
(gdb) b GLContextEGL::CreateGLContext
Breakpoint 9 at 0x7ff1132548: file gfx/gl/GLContextProviderEGL.cpp, line 618.
(gdb) c
Continuing.

Thread 36 "Compositor" hit Breakpoint 9, mozilla::gl::GLContextEGL::
    CreateGLContext (egl=std::shared_ptr<mozilla::gl::EglDisplay>
    (use count 4, weak count 2) = {...}, desc=...,
    config=config@entry=0x55558fc450, surface=surface@entry=0x7ee0008f40,
    useGles=useGles@entry=false, out_failureId=out_failureId@entry=0x7f1faed1c8)
    at gfx/gl/GLContextProviderEGL.cpp:618
618         nsACString* const out_failureId) {
(gdb) n
621       std::vector<EGLint> required_attribs;
(gdb) 
We're getting down into the depths now. It's surprisingly thrilling to be seeing this code again. I recall that this GLContextEGL::CreateGLContext() method is where a lot of the action happens.

But my head is full and this feels like a good place to leave things. Inside this method might be the right place to initialise mSwapChain, but it's definitely not happening here.

Tomorrow I'll do a sweep of the other code to check whether any attempt is being made to initialise it somewhere else. If not I'll add in some initialisation code to see what happens.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
21 Feb 2024 : Day 163 #
We're making good progress with the WebView rendering pipeline. The first issue to fix, which we've been looking at for the last couple of days, has been ensuring the layer manger is of the Client type, rather than the WebRender type. There's a new WEBRENDER_SOFTWARE feature that was introduced between ESR 78 and ESR 91 which is causing the trouble. In previous builds we disabled the WEBRENDER feature, but now with the new feature it's being enabled again. we need to ensure it's not enabled.

So the key questions to answer today are: how was WEBRENDER being disabled on ESR 78; and can we do something equivalent for WEBRENDER_SOFTWARE on ESR 91.

In the gfxConfigureManager.cpp file there are a couple of encouraging looking methods called gfxConfigManager::ConfigureWebRender() and gfxConfigManager::ConfigureWebRenderSoftware(). These enable and disable the web renderer and software web renderer features respectively. Unsurprisingly, the latter is a new method for ESR 91, but the former is available in both ESR 78 and ESR 91, so I'll concentrate on that one first.

When looking at the code in these we also need to refer back to the initialisation method, because that's where some key variables are being created:
void gfxConfigManager::Init() {
[...]
  mFeatureWr = &gfxConfig::GetFeature(Feature::WEBRENDER);
[...]
  mFeatureWrSoftware = &gfxConfig::GetFeature(Feature::WEBRENDER_SOFTWARE);
[...]
So these two variables — mFeatureWr and mFeatureWrSoftware are feature objects which we can then use to enable and disable various features.

In ESR 78 the logic for whether mFeatureWr should be enabled or not is serpentine. I'm not going to try to work through by hand, rather I'll set the debugger on it and see which way it slithers.

Happily my debug session is still running from yesterday (I think it's been running for three days now), so I can continue straight with that. I'll include the full step-through, but there's a lot of it so don't feel you have to follow along, I'll summarise the important parts afterwards.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxConfigManager::ConfigureWebRender
Breakpoint 5 at 0x7fb90a8d88: file gfx/config/gfxConfigManager.cpp, line 194.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::gfx::gfxConfigManager::
    ConfigureWebRender (this=this@entry=0x7fa7972598)
    at gfx/config/gfxConfigManager.cpp:194
194     void gfxConfigManager::ConfigureWebRender() {
(gdb) n
206       mFeatureWrCompositor->SetDefaultFromPref("gfx.webrender.compositor",
    true,
(gdb) n
209       if (mWrCompositorForceEnabled) {
(gdb) n
213       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_COMPOSITOR,
(gdb) n
219       if (!mHwStretchingSupport && mScaledResolution) {
(gdb) n
225       bool guardedByQualifiedPref = ConfigureWebRenderQualified();
(gdb) n
300     obj-build-mer-qt-xr/dist/include/nsTStringRepr.h: No such file or directory.
(gdb) p *mFeatureWr
$15 = {mDefault = {mMessage = '\000' <repeats 63 times>, mStatus =
    mozilla::gfx::FeatureStatus::Unused}, mUser = {mMessage = '\000'
    <repeats 63 times>, mStatus = mozilla::gfx::FeatureStatus::Unused},
    mEnvironment = {mMessage = '\000' <repeats 63 times>,
    mStatus = mozilla::gfx::FeatureStatus::Unused}, mRuntime = {mMessage =
    '\000' <repeats 63 times>, mStatus = mozilla::gfx::FeatureStatus::Unused}, 
  mFailureId = {<nsTSubstring<char>> = {<mozilla::detail::nsTStringRepr<char>> =
    {mData = 0x7fbc7d4f42 <gNullChar> "", mLength = 0, mDataFlags =
    mozilla::detail::StringDataFlags::TERMINATED, mClassFlags =
    mozilla::detail::StringClassFlags::NULL_TERMINATED}, 
      static kMaxCapacity = 2147483637}, <No data fields>}}
(gdb) p mFeatureWr->GetValue()
$16 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->IsEnabled()
$17 = false
(gdb) p mFeatureWr->mDefault.mStatus
$30 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$31 = mozilla::gfx::FeatureStatus::Unused
(gdb) n
235       if (mWrEnvForceEnabled) {
(gdb) p mWrEnvForceEnabled
$18 = false
(gdb) n
237       } else if (mWrForceEnabled) {
(gdb) p mWrForceEnabled
$19 = false
(gdb) n
239       } else if (mFeatureWrQualified->IsEnabled()) {
(gdb) p mFeatureWrQualified->IsEnabled()
$20 = false
(gdb) n
253       if (mWrForceDisabled ||
(gdb) p mWrForceDisabled
$21 = false
(gdb) p mWrEnvForceDisabled
$22 = false
(gdb) p mWrQualifiedOverride.isNothing()
Cannot evaluate function -- may be inlined
(gdb) n
261       if (!mFeatureHwCompositing->IsEnabled()) {
(gdb) n
268       if (mSafeMode) {
(gdb) n
276       if (mIsWindows && !mIsWin10OrLater && !mDwmCompositionEnabled) {
(gdb) p mIsWindows
$23 = false
(gdb) p mIsWin10OrLater
$24 = false
(gdb) p mDwmCompositionEnabled
$25 = true
(gdb) n
283           NS_LITERAL_CSTRING("FEATURE_FAILURE_DEFAULT_OFF"));
(gdb) n
285       if (mFeatureD3D11HwAngle && mWrForceAngle) {
(gdb) n
301       if (!mFeatureWr->IsEnabled() && mDisableHwCompositingNoWr) {
(gdb) p mFeatureWr->IsEnabled()
$26 = false
(gdb) p mDisableHwCompositingNoWr
$27 = false
(gdb) n
324           NS_LITERAL_CSTRING("FEATURE_FAILURE_DEFAULT_OFF"));
(gdb) n
326       if (mWrDCompWinEnabled) {
(gdb) n
334       if (!mWrPictureCaching) {
(gdb) n
340       if (!mFeatureWrDComp->IsEnabled() && mWrCompositorDCompRequired) {
(gdb) n
348       if (mWrPartialPresent) {
(gdb) n
gfxPlatform::InitWebRenderConfig (this=<optimized out>)
    at gfx/thebes/gfxPlatform.cpp:2733
2733      if (Preferences::GetBool("gfx.webrender.program-binary-disk", false)) {
(gdb) c
[...]
That's a bit too much detail there, but the key conclusion is that mFeatureWr (which represents the state of the WEBRENDER feature starts off disabled and the value is never changed. So by the end of the gfxConfigManager::ConfigureWebRender() method the feature remains disabled. It's not changed anywhere else and so we're left with our layer manager being created as a Client layer manager, which is what we need.

We can see that it's set to disabled from the following sequence, copied from the full debugging session above:
(gdb) p mFeatureWr->IsEnabled()
$17 = false
(gdb) p mFeatureWr->mDefault.mStatus
$30 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$31 = mozilla::gfx::FeatureStatus::Unused
Features are made from multiple layers of states. Each layer can be either set or unused. To determine the state of a feature each layer is examined in order until one of them is set to something other than Unused. The first unused layer provides the actual state of the feature.

The layers are the following:
  1. mRuntime
  2. mUser
  3. mEnvironment
  4. mStatus
  5. mDefault
The mDefault layer provides a backstop: if all other layers are Unused then whatever value the mDefault layer takes is the value of the feature (even if that value is Unused).

So, to summarise and bring all this together, the mFeatureWr feature is enabled if all of the following hold:
  1. mFeatureWr->mDefault.mStatus is set to anything other than Unused.
  2. The mStatus value of one of the other layers is set to something other than Unused and is either Available or ForceEnabled.
Looking at the values from the debugging session above, we can therefore see exactly why mFeatureWr->IsEnabled() is returning false: it's simply never had any other value set on it.

Now we need to compare this to the process for ESR 91. Before we get into it it's worth noting that the WEBRENDER feature in ESR 91 is also (correctly) disabled, so we may not see any big differences here with this. Let's see.

Again, I can continue with the debugging session I've been running for the last few days:
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxConfigManager::ConfigureWebRender
Breakpoint 9 at 0x7ff138d708: file gfx/config/gfxConfigManager.cpp, line 215.
(gdb) b gfxConfigManager::ConfigureWebRenderSoftware
Breakpoint 10 at 0x7ff138d41c: file gfx/config/gfxConfigManager.cpp, line 125.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 9, mozilla::gfx::gfxConfigManager::
    ConfigureWebRender (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:215
215     void gfxConfigManager::ConfigureWebRender() {
(gdb) p mFeatureWr->IsEnabled()
$13 = false
(gdb) p mFeatureWr->mDefault.mStatus
$14 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWr->mRuntime.mStatus
$15 = mozilla::gfx::FeatureStatus::Unused
So as we go in to the ConfigureWebRender() method the value is set to disabled. This is the same as for ESR 78.
(gdb) n
230       mFeatureWrCompositor->SetDefaultFromPref("gfx.webrender.compositor",
    true,
(gdb)
233       if (mWrCompositorForceEnabled) {
(gdb)
237       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_COMPOSITOR,
(gdb)
243       if (!mHwStretchingSupport.IsFullySupported() && mScaledResolution) {
(gdb)
253       ConfigureWebRenderSoftware();
(gdb) n
At this point we're jumping in to the ConfigureWebRenderSoftware() method. We're going to continue into it, since we're interested to know what happens there. But it's worth noting that this is a departure from what happens on ESR 78.
Thread 7 "GeckoWorkerThre" hit Breakpoint 10, mozilla::gfx::gfxConfigManager::
    ConfigureWebRenderSoftware (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:125
125     void gfxConfigManager::ConfigureWebRenderSoftware() {
(gdb) p mFeatureWrSoftware->IsEnabled()
$16 = false
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$17 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$18 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mRuntime.mStatus
$19 = mozilla::gfx::FeatureStatus::Unused
Going in we also see that the mFeatureWrSoftware feature is disabled.
(gdb) n
128       mFeatureWrSoftware->EnableByDefault();
(gdb) n
134       if (mWrSoftwareForceEnabled) {
(gdb) p mFeatureWrSoftware->IsEnabled()
$20 = true
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$21 = mozilla::gfx::FeatureStatus::Available
(gdb) p mFeatureWrSoftware->mRuntime.mStatus
$22 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mUser.mStatus
$23 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mEnvironment.mStatus
$24 = mozilla::gfx::FeatureStatus::Unused
(gdb) p mFeatureWrSoftware->mDefault.mStatus
$25 = mozilla::gfx::FeatureStatus::Available
But this is immediately switched to being enabled; in this case set as having a default value of Available. So far there have been no conditions on the execution, so we're guaranteed to reach this state every time. Let's continue.
(gdb) p mWrSoftwareForceEnabled
$33 = false
(gdb) n
136       } else if (mWrForceDisabled || mWrEnvForceDisabled) {
(gdb) p mWrForceDisabled
$26 = false
(gdb) p mWrEnvForceDisabled
$27 = false
Here there was an opportunity to disable the feature if either mWrForceDisabled or mWrEnvForceDisabled were set to true, but since both were set to false we skip over this possibility. This might be our way in to disabling it, so we may want to return to this. But let's continue on with the rest of the debugging for now.
(gdb) n
141       } else if (gfxPlatform::DoesFissionForceWebRender()) {
(gdb) n
145       if (!mHasWrSoftwareBlocklist) {
(gdb) p mHasWrSoftwareBlocklist
$28 = false
At this point the mHasWrSoftwareBlocklist variable is set to false which causes us to jump out of the ConfigureWebRenderSoftware() method early. So we'll return back up the stack to the ConfigureWebRender() method and continue from there.
(gdb) n
mozilla::gfx::gfxConfigManager::ConfigureWebRender
    (this=this@entry=0x7fd7da72f8)
    at gfx/config/gfxConfigManager.cpp:254
254       ConfigureWebRenderQualified();
(gdb) n
256       mFeatureWr->EnableByDefault();
(gdb) n
262       if (mWrSoftwareForceEnabled) {
(gdb) p mFeatureWr->IsEnabled()
$29 = true
(gdb) n
Here we see another change from ESR 78. The mFeatureWr feature is enabled here. We already know it's ultimately disabled so we should keep an eye out for where that happens.
266       } else if (mWrEnvForceEnabled) {
(gdb) 
268       } else if (mWrForceDisabled || mWrEnvForceDisabled) {
(gdb)
275       } else if (mWrForceEnabled) {
(gdb) p mWrForceEnabled
$30 = false
(gdb) n
279       if (!mFeatureWrQualified->IsEnabled()) {
(gdb) p mFeatureWrQualified->IsEnabled()
$31 = false
(gdb) n
282         mFeatureWr->Disable(FeatureStatus::Disabled, "Not qualified",
(gdb) n
287       if (!mFeatureHwCompositing->IsEnabled()) {
(gdb) p mFeatureWr->IsEnabled()
$32 = false
So here it gets disabled again and the reason is because mFeatureWrQualified is disabled. Here's the comment text that goes alongside this in the code (the debugger skips these comments):
    // No qualified hardware. If we haven't allowed software fallback,
    // then we need to disable WR.
So we'll end up with this being disabled whatever happens. There's not much to see in the remainder of the method, but let's skip through the rest of the steps for completeness.
(gdb) n
293       if (mSafeMode) {
(gdb) n
302       if (mXRenderEnabled) {
(gdb) n
312       mFeatureWrAngle->EnableByDefault();
(gdb) n
313       if (mFeatureD3D11HwAngle) {
(gdb) n
335         mFeatureWrAngle->Disable(FeatureStatus::Unavailable,
    "OS not supported",
(gdb) n
339       if (mWrForceAngle && mFeatureWr->IsEnabled() &&
(gdb) n
347       if (!mFeatureWr->IsEnabled() && mDisableHwCompositingNoWr) {
(gdb) n
367       mFeatureWrDComp->EnableByDefault();
(gdb) n
368       if (!mWrDCompWinEnabled) {
(gdb) n
369         mFeatureWrDComp->UserDisable("User disabled via pref",
(gdb) n
373       if (!mIsWin10OrLater) {
(gdb) n
375         mFeatureWrDComp->Disable(FeatureStatus::Unavailable,
(gdb) n
380       if (!mIsNightly) {
(gdb) n
383         nsAutoString adapterVendorID;
(gdb) n
384         mGfxInfo->GetAdapterVendorID(adapterVendorID);
(gdb) n
385         if (adapterVendorID == u"0x10de") {
(gdb) n
383         nsAutoString adapterVendorID;
(gdb) n
396       mFeatureWrDComp->MaybeSetFailed(
(gdb) n
399       mFeatureWrDComp->MaybeSetFailed(mFeatureWrAngle->IsEnabled(),
(gdb) n
403       if (!mFeatureWrDComp->IsEnabled() && mWrCompositorDCompRequired) {
(gdb) n
411       if (mWrPartialPresent) {
(gdb) n
654     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/StaticPrefList_gfx.h:
    No such file or directory.
(gdb) n
433       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_SHADER_CACHE,
(gdb) n
435       if (!mFeatureWr->IsEnabled()) {
(gdb) n
436         mFeatureWrShaderCache->ForceDisable(FeatureStatus::Unavailable,
(gdb) n
441       mFeatureWrOptimizedShaders->EnableByDefault();
(gdb) n
442       if (!mWrOptimizedShaders) {
(gdb) n
446       ConfigureFromBlocklist(nsIGfxInfo::FEATURE_WEBRENDER_OPTIMIZED_SHADERS,
(gdb) n
448       if (!mFeatureWr->IsEnabled()) {
(gdb) n
449         mFeatureWrOptimizedShaders->ForceDisable(FeatureStatus::Unavailable,
(gdb) n
And we're out of the method. So that's it: we can see that mFeatureWr is disabled here, as expected. However when it comes to mFeatureWrSoftware it's a different story. The value is enabled by default; to get it disabled we'll need to ensure one of mWrForceDisabled or mWrEnvForceDisabled is set to true.

Both of these are set in the initialisation method, like this:
void gfxConfigManager::Init() {
[...]
  mWrForceDisabled = StaticPrefs::gfx_webrender_force_disabled_AtStartup();
[...]
  mWrEnvForceDisabled = gfxPlatform::WebRenderEnvvarDisabled();
[...]
Here's the code that creates the former:
ONCE_PREF(
  "gfx.webrender.force-disabled",
   gfx_webrender_force_disabled,
   gfx_webrender_force_disabled_AtStartup,
  bool, false
)
That's from the autogenerated obj-build-mer-qt-xr/modules/libpref/init/StaticPrefList_gfx.h file. This is being generated from the gecko-dev/modules/libpref/init/StaticPrefList.yaml file, the relevant part of which looks like this:
# Also expose a pref to allow users to force-disable WR. This is exposed
# on all channels because WR can be enabled on qualified hardware on all
# channels.
- name: gfx.webrender.force-disabled
  type: bool
  value: false
  mirror: once
The latter is set using an environment variable:
/*static*/
bool gfxPlatform::WebRenderEnvvarDisabled() {
  const char* env = PR_GetEnv("MOZ_WEBRENDER");
  return (env && *env == '0');
}
Okay, we've reached the end of this piece of investigation. What's clear is that there may not be any Sailfish-specific code for disabling the web render layer manager because it's being disabled by default anyway.

For the software web render layer manager we could set the MOZ_WEBRENDER environment variable to 0 to force it to be disabled and this will be handy for testing. But in the longer term we should probably put some code into sailfish-browser to explicitly set the gfx.webrender.force-disabled static preference to true.

As I look in to this I discover something surprising. Even though web render is disabled by default, doing some grepping around the code threw the following up in the sailfish-browser code:
void DeclarativeWebUtils::setRenderingPreferences()
{
    SailfishOS::WebEngineSettings *webEngineSettings =
        SailfishOS::WebEngineSettings::instance();

    // Use external Qt window for rendering content
    webEngineSettings->setPreference(
        QString("gfx.compositor.external-window"), QVariant(true));
    webEngineSettings->setPreference(
        QString("gfx.compositor.clear-context"), QVariant(false));
    webEngineSettings->setPreference(
        QString("gfx.webrender.force-disabled"), QVariant(true));
    webEngineSettings->setPreference(
        QString("embedlite.compositor.external_gl_context"), QVariant(true));
}
This is fine for the browser, but it's not going to get executed for the WebView, so I'll need to set this in WebEngineSettings::initialize() as well. Thankfully, making this change turns out to be pretty straightforward:
diff --git a/lib/webenginesettings.cpp b/lib/webenginesettings.cpp
index de9e4b86..13b21d5b 100644
--- a/lib/webenginesettings.cpp
+++ b/lib/webenginesettings.cpp
@@ -110,6 +110,10 @@ void SailfishOS::WebEngineSettings::initialize()
     engineSettings->setPreference(QStringLiteral("intl.accept_languages"),
                                   QVariant::fromValue<QString>(langs));
 
+    // Ensure the web renderer is disabled
+    engineSettings->setPreference(QStringLiteral("gfx.webrender.force-disabled"),
+                                  QVariant(true));
+
     Silica::Theme *silicaTheme = Silica::Theme::instance();
 
     // Notify gecko when the ambience switches between light and dark
As well as this change I also had to amend the rawwebview.cpp file to accommodate some of the API changes I made earlier to gecko. I guess I've not built the sailfish-components-webview packages recently or this would have come up. Nevertheless the fix isn't anything too dramatic:
diff --git a/import/webview/rawwebview.cpp b/import/webview/rawwebview.cpp
index 1b1bb92a..2eab77f5 100644
--- a/import/webview/rawwebview.cpp
+++ b/import/webview/rawwebview.cpp
@@ -37,7 +37,7 @@ public:
     ViewCreator();
     ~ViewCreator();
 
-    quint32 createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext) override;
+    quint32 createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext, bool hidden) override;
 
     static std::shared_ptr<ViewCreator> instance();
 
@@ -54,9 +54,10 @@ ViewCreator::~ViewCreator()
     SailfishOS::WebEngine::instance()->setViewCreator(nullptr);
 }
 
-quint32 ViewCreator::createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext)
+quint32 ViewCreator::createView(const quint32 &parentId, const uintptr_t &parentBrowsingContext, bool hidden)
 {
     Q_UNUSED(parentBrowsingContext)
+    Q_UNUSED(hidden)
 
     for (RawWebView *view : views) {
         if (view->uniqueId() == parentId) {
Having fixed all this, I've built and transferred the new packages over to my phone. Now when I run the harbour-webview example app I get something quite different to the crash we were seeing before:
[defaultuser@Xperia10III gecko]$ harbour-webview 
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[D] main:30 - WebView Example
[D] main:44 - Using default start URL:  "https://www.flypig.co.uk/search/"
[D] main:47 - Opening webview
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:7 - file:///usr/share/harbour-webview/qml/harbour-webview.qml:7:30:
    Type WebViewPage unavailable 
         initialPage: Component { WebViewPage { } } 
                                  ^
[W] unknown:13 - file:///usr/share/harbour-webview/qml/pages/
    WebViewPage.qml:13:5: Type WebView unavailable 
         WebView { 
         ^
[W] unknown:141 - file:///usr/lib64/qt5/qml/Sailfish/WebView/WebView.qml:141:9:
    Type TextSelectionController unavailable 
             TextSelectionController { 
             ^
[W] unknown:14 - file:///usr/lib64/qt5/qml/Sailfish/WebView/Controls/
    TextSelectionController.qml:14:1: module "QOfono" is not installed 
     import QOfono 0.2 
     ^
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change" {file:
    "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
No crash, several errors, but (of course) still a blank screen: no actual rendering taking place. But this is still really good progress. The WebView application which was completely crashing before, is now running, just not rendering. That means we now have the opportunity to debug and fix it. One more step forwards.

I'll look into the rendering more tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
20 Feb 2024 : Day 162 #
Yesterday we were looking in to the WebView rendering pipeline. We got to the point where we had a backtrace showing the flow that resulted in a WebRender layer manager being created, when the EmbedLite code was expecting a Client layer manager. The consequence was that the EmbedLite code forcefully killed itself.

That was on ESR 91. Today I want to find the equivalent flow on ESR 78 to see how it differs. To do this I need to first install the same harbour-webview-example code that I'm using for testing on my ESR 78 device. Then set it off with the debugger:
$ gdb harbour-webview
[...]
(gdb) b nsBaseWidget::CreateCompositorSession
Function "nsBaseWidget::CreateCompositorSession" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (nsBaseWidget::CreateCompositorSession) pending.
(gdb) r
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 1, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7f8ccbf3d0, aWidth=1080,
    aHeight=2520, aOptionsOut=aOptionsOut@entry=0x7fa7972ac0)
    at widget/nsBaseWidget.cpp:1176
1176        int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
(gdb) n
1180        CreateCompositorVsyncDispatcher();
(gdb) n
1182        gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
(gdb) n
1186        gpu->EnsureGPUReady();
(gdb) n
67      obj-build-mer-qt-xr/dist/include/mozilla/StaticPtr.h:
    No such file or directory.
(gdb) n
1193        bool enableAPZ = UseAPZ();
(gdb) n
1194        CompositorOptions options(enableAPZ, enableWR);
(gdb) n
1198        bool enableAL =
(gdb) n
1203        options.SetUseWebGPU(StaticPrefs::dom_webgpu_enabled());
(gdb) n
50      obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h:
    No such file or directory.
(gdb) n
1210        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
53      obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h:
    No such file or directory.
(gdb) 
39      in obj-build-mer-qt-xr/dist/include/mozilla/layers/CompositorOptions.h
(gdb) 
1217          lm = new ClientLayerManager(this);
(gdb) p enableWR
$1 = false
(gdb) p enableAPZ
$2 = <optimized out>
(gdb) p enableAL
$3 = <optimized out>
(gdb) p gfx::gfxConfig::IsEnabled(gfx::Feature::ADVANCED_LAYERS)
$4 = false
(gdb) p mFissionWindow
$5 = false
(gdb) p StaticPrefs::layers_advanced_fission_enabled()
No symbol "layers_advanced_fission_enabled" in namespace "mozilla::StaticPrefs".
(gdb) p StaticPrefs::dom_webgpu_enabled()
$6 = false
(gdb) p options.UseWebRender()
Cannot evaluate function -- may be inlined
(gdb) p options
$7 = {mUseAPZ = true, mUseWebRender = false, mUseAdvancedLayers = false,
    mUseWebGPU = false, mInitiallyPaused = false}
(gdb) 
As we can see, on ESR 78 things are different: the options.mUseWebRender field is set to false compared to ESR 91 where it's set to true. What's feeding in to these values?

The options structure and its functionality is defined in CompositorOptions.h. Checking through the code there we can see that mUseWebRender is set at initialisation, either to the default value of false if the default constructor is used, or an explicit value if the following constructor overload is used:
  CompositorOptions(bool aUseAPZ, bool aUseWebRender,
                    bool aUseSoftwareWebRender)
      : mUseAPZ(aUseAPZ),
        mUseWebRender(aUseWebRender),
        mUseSoftwareWebRender(aUseSoftwareWebRender) {
    MOZ_ASSERT_IF(aUseSoftwareWebRender, aUseWebRender);
  }
It's never changed after that. So going back to our nsBaseWidget::CreateCompositorSession() code, the only part we need to concern ourselves with is the value that's passed in to the constructor.

For both ESR 78 and ESR 91, the value that's passed in is that of the local enableWR variable. The logic for this value is really straightforward for ESR 78:
    bool enableWR =
        gfx::gfxVars::UseWebRender() && WidgetTypeSupportsAcceleration();
Let's find out how this value is being set:
(gdb) p WidgetTypeSupportsAcceleration()
$8 = true
(gdb) p gfx::gfxVars::UseWebRender()
Cannot evaluate function -- may be inlined
We can't call the UseWebRender() method directly, but we can extract the value it would return by digging into the data structures. This is all following from the code in gfxVars.h:
(gdb) p gfx::gfxVars::sInstance.mRawPtr.mVarUseWebRender.mValue
$11 = false
That's useful, but it doesn't tell us everything we need to know. The next step is to find out where and why this value is being set to false.
$ grep -rIn "gfxVars::SetUseWebRender(" * --include="*.cpp"
gecko-dev/gfx/thebes/gfxPlatform.cpp:2750:    gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3297:    gfxVars::SetUseWebRender(false);
gecko-dev/gfx/ipc/GPUProcessManager.cpp:479:  gfx::gfxVars::SetUseWebRender(false);
These are being set in gfxPlatform::InitWebRenderConfig(), gfxPlatform::NotifyGPUProcessDisabled() and GPUProcessManager::DisableWebRender() respectively.

Let's find out which is responsible.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) break gfxPlatform::InitWebRenderConfig
Breakpoint 2 at 0x7fb9013328: file gfx/thebes/gfxPlatform.cpp, line 2691.
(gdb) b gfxPlatform::NotifyGPUProcessDisabled
Breakpoint 3 at 0x7fb9016fb0: file gfx/thebes/gfxPlatform.cpp, line 3291.
(gdb) b GPUProcessManager::DisableWebRender
Breakpoint 4 at 0x7fb907f858: GPUProcessManager::DisableWebRender. (3 locations)
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 2, gfxPlatform::InitWebRenderConfig
    (this=0x7f8c8bbf60)
    at gfx/thebes/gfxPlatform.cpp:2691
2691    void gfxPlatform::InitWebRenderConfig() {
(gdb) n
2692      bool prefEnabled = WebRenderPrefEnabled();
(gdb) n
2693      bool envvarEnabled = WebRenderEnvvarEnabled();
(gdb) n
2698      gfxVars::AddReceiver(&nsCSSProps::GfxVarReceiver());
(gdb) n
2708      ScopedGfxFeatureReporter reporter("WR", prefEnabled || envvarEnabled);
(gdb) n
2709      if (!XRE_IsParentProcess()) {
(gdb) n
2723      gfxConfigManager manager;
(gdb) n
2725      manager.ConfigureWebRender();
(gdb) n
2733      if (Preferences::GetBool("gfx.webrender.program-binary-disk", false)) {
(gdb) n
2738      if (StaticPrefs::gfx_webrender_use_optimized_shaders_AtStartup()) {
(gdb) n
2739        gfxVars::SetUseWebRenderOptimizedShaders(
(gdb) n
2743      if (Preferences::GetBool("gfx.webrender.software", false)) {
(gdb) p gfxConfig::IsEnabled(Feature::WEBRENDER)
$12 = false
(gdb) n
2749      if (gfxConfig::IsEnabled(Feature::WEBRENDER)) {
(gdb) n
2791      if (gfxConfig::IsEnabled(Feature::WEBRENDER_COMPOSITOR)) {
(gdb) p gfxConfig::IsEnabled(Feature::WEBRENDER_COMPOSITOR)
$13 = false
(gdb) n
2795      Telemetry::ScalarSet(
(gdb) n
2799      if (gfxConfig::IsEnabled(Feature::WEBRENDER_PARTIAL)) {
(gdb) n
2805      gfxVars::SetUseGLSwizzle(
(gdb) n
2810      gfxUtils::RemoveShaderCacheFromDiskIfNecessary();
(gdb) r
[...]
No other breakpoints are hit. So as we can see here, on ESR 78 the value for UseWebRender() is left as the default value of false. The reason for this is that gfxConfig::IsEnabled(Feature::WEBRENDER) is returning false. We might need to investigate further where this Feature::WEBRENDER configuration value is coming from or being set, but let's switch to ESR 91 now to find out how things are happening there.

The value of enableWR has a much more complex derivation in ESR 91 compared to that in ESR 78. Here's the logic (note that I've simplified the code to remove the unnecessary parts):
    bool supportsAcceleration = WidgetTypeSupportsAcceleration();
    bool enableWR;
    if (supportsAcceleration ||
        StaticPrefs::gfx_webrender_unaccelerated_widget_force()) {
      enableWR = gfx::gfxVars::UseWebRender();
    } else if (gfxPlatform::DoesFissionForceWebRender() ||
               StaticPrefs::
                   gfx_webrender_software_unaccelerated_widget_allow()) {
      enableWR = gfx::gfxVars::UseWebRender();
    } else {
      enableWR = false;
    }
In practice supportsAcceleration is going to be set to true, which simplifies things and brings us back to this condition:
      enableWR = gfx::gfxVars::UseWebRender();
Let's follow the same investigatory path that we did for ESR 78.
$ grep -rIn "gfxVars::SetUseWebRender(" * --include="*.cpp"
gecko-dev/gfx/thebes/gfxPlatform.cpp:2713:    gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3435:      gfxVars::SetUseWebRender(true);
gecko-dev/gfx/thebes/gfxPlatform.cpp:3475:    gfxVars::SetUseWebRender(false);
The second of these appears in some code that's compile-time conditional on the platform being Windows XP, so we can ignore it. The other two appear in gfxPlatform::InitWebRenderConfig() and gfxPlatform::FallbackFromAcceleration() respectively. I'm going to go out on a limb and say that we're interested in the former, but let's check using the debugger to make sure.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) b gfxPlatform::InitWebRenderConfig
Breakpoint 7 at 0x7ff12ef954: file gfx/thebes/gfxPlatform.cpp, line 2646.
(gdb) b gfxPlatform::FallbackFromAcceleration
Breakpoint 8 at 0x7ff12f3048: file gfx/thebes/gfxPlatform.cpp, line 3381.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Thread 7 "GeckoWorkerThre" hit Breakpoint 7, gfxPlatform::InitWebRenderConfig
    (this=0x7fc4a48c90)
    at gfx/thebes/gfxPlatform.cpp:2646
2646    void gfxPlatform::InitWebRenderConfig() {
(gdb) n
2647      bool prefEnabled = WebRenderPrefEnabled();
(gdb) n
2648      bool envvarEnabled = WebRenderEnvvarEnabled();
(gdb)
[New LWP 27297]
2653      gfxVars::AddReceiver(&nsCSSProps::GfxVarReceiver());
(gdb) 
2663      ScopedGfxFeatureReporter reporter("WR", prefEnabled || envvarEnabled);
(gdb) 
32      ${PROJECT}/obj-build-mer-qt-xr/dist/include/gfxCrashReporterUtils.h:
    No such file or directory.
(gdb) 
2664      if (!XRE_IsParentProcess()) {
(gdb) 
2678      gfxConfigManager manager;
(gdb) 
2679      manager.Init();
(gdb) 
2680      manager.ConfigureWebRender();
(gdb) 
2682      bool hasHardware = gfxConfig::IsEnabled(Feature::WEBRENDER);
(gdb) 
2683      bool hasSoftware = gfxConfig::IsEnabled(Feature::WEBRENDER_SOFTWARE);
(gdb) 
2684      bool hasWebRender = hasHardware || hasSoftware;
(gdb) p hasHardware
$10 = false
(gdb) p hasSoftware
$11 = true
(gdb) p hasWebRender
$12 = <optimized out>
(gdb) n
2701      if (gfxConfig::IsEnabled(Feature::WEBRENDER_SHADER_CACHE)) {
(gdb) n
2705      gfxVars::SetUseWebRenderOptimizedShaders(
(gdb) n
2708      gfxVars::SetUseSoftwareWebRender(!hasHardware && hasSoftware);
(gdb) n
2712      if (hasWebRender) {
(gdb) n
2713        gfxVars::SetUseWebRender(true);
(gdb) c
[...]
So there we can see that the WebRender layer manager is being activated in ESR 91 due to Feature::WEBRENDER_SOFTWARE being enabled.

So we have a clear difference. In ESR 78 Feature::WEBRENDER is set to false. In ESR 91 the Feature::WEBRENDER_SOFTWARE has been added which is enough for the WebRender layer manager to be enabled.

This is good progress. The next step is to figure out where Feature::WEBRENDER_SOFTWARE is being set to enabled and find out how to disable it. I'll take a look at that tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
19 Feb 2024 : Day 161 #
Yesterday I was complaining about the difficulty debugging while travelling by train. Before I'd even posted the diary entry I'd received some beautiful new creations from Thigg to illustrate my experiences. I think he's captured it rather too well and it's a real joy to be able to share this creation with you.
 
A pig with wings sitting in the storage department of a train with a laptop on its lap, entangled into way too many usb cables.

This is just so great! But although this was the most representative, it wasn't my favourite of the images Thigg created. I'll be sharing some of the others at other times when I have the pleasure of enjoying train-based-development, so watch out for more!

On to a fresh day, and this morning the package I started building yesterday evening on the train has finally finished. But that's not as helpful to me as I was hoping it would be when I kicked it off. The change I made was to annotate the code with some debug output. Since then I've been able to find out all the same information using the debugger.

To recap the situation, we've been looking at WebView rendering. Currently any attempt to use the WebView will result in a crash. That's because the the EmbedLite PuppetWdigetBase code, on discovering that the layer manager is of type LAYERS_WR (Web Renderer) is intentionally triggering a crash. It requires the layer manager to be of type LAYERS_CLIENT to prevent this crash from happening.

So my task for today is to find out where the layer manager is being created and establish why the wrong type is being used. To get a good handle on the situation I'll also need to compare this against the same paths in ESR 78 to find out whey they're different.

Looking through the code there are two obvious places where a WebLayerManager is created. First there's code in PuppetWidget that looks like this:
bool PuppetWidget::CreateRemoteLayerManager(
    const std::function<bool(LayerManager*)>& aInitializeFunc) {
  RefPtr<LayerManager> lm;
  MOZ_ASSERT(mBrowserChild);
  if (mBrowserChild->GetCompositorOptions().UseWebRender()) {
    lm = new WebRenderLayerManager(this);
  } else {
    lm = new ClientLayerManager(this);
  }
[...]
Second there's some code in nsBaseWidget that looks like this (I've left some of the comments in, since they're relevant):
already_AddRefed<LayerManager> nsBaseWidget::CreateCompositorSession(
    int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
[...]
    gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
    // Make sure GPU process is ready for use.
    // If it failed to connect to GPU process, GPU process usage is disabled in
    // EnsureGPUReady(). It could update gfxVars and gfxConfigs.
    gpu->EnsureGPUReady();

    // If widget type does not supports acceleration, we may be allowed to use
    // software WebRender instead. If not, then we use ClientLayerManager even
    // when gfxVars::UseWebRender() is true. WebRender could coexist only with
    // BasicCompositor.
[...]
    RefPtr<LayerManager> lm;
    if (options.UseWebRender()) {
      lm = new WebRenderLayerManager(this);
    } else {
      lm = new ClientLayerManager(this);
    }
[...]
It should be pretty easy to check using the debugger whether either of these are the relevant routes when setting up the layer manager. I still have the debugging session open from yesterday:
(gdb) break nsBaseWidget.cpp:1364
Breakpoint 3 at 0x7ff2a57b64: file widget/nsBaseWidget.cpp, line 1364.
(gdb) break PuppetWidget.cpp:616
Breakpoint 4 at 0x7ff2a67d48: file widget/PuppetWidget.cpp, line 616.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]
Created LOG for EmbedLiteLayerManager

Thread 7 "GeckoWorkerThre" hit Breakpoint 3, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7fc4dad520,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520,
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1364
1364        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
43      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) 
1369          lm = new WebRenderLayerManager(this);
(gdb) p options
$4 = {mUseAPZ = true, mUseWebRender = true, mUseSoftwareWebRender = true,
    mAllowSoftwareWebRenderD3D11 = false, mAllowSoftwareWebRenderOGL = false, 
  mUseAdvancedLayers = false, mUseWebGPU = false, mInitiallyPaused = false}
(gdb) 
The options structure is really clean and it's helpful to be able to see all of the contents like this.

So we now know that the Web Render version of the layer manager is being created in nsBaseWidget::CreateCompositorSession(). There are two questions that immediately spring to mind: first, if the Client version of the layer manager were being created at this point, would it fix things? Second, is it possible to run with the Web Render layer manager instead?

I also want to know exactly what inputs are being used to decide which type of layer manager to use. Stepping through the nsBaseWidget::CreateCompositorSession() is likely to help with this, so let's give that a go.
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) break nsBaseWidget::CreateCompositorSession
Breakpoint 5 at 0x7ff2a578f8: file widget/nsBaseWidget.cpp, line 1308.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/harbour-webview 
[...]

Thread 7 "GeckoWorkerThre" hit Breakpoint 5, nsBaseWidget::
    CreateCompositorSession (this=this@entry=0x7fc4db8a30,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520,
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1308
1308        int aWidth, int aHeight, CompositorOptions* aOptionsOut) {
(gdb) n
1312        CreateCompositorVsyncDispatcher();
(gdb) n
1314        gfx::GPUProcessManager* gpu = gfx::GPUProcessManager::Get();
(gdb) n
1318        gpu->EnsureGPUReady();
(gdb) n
1324        bool supportsAcceleration = WidgetTypeSupportsAcceleration();
(gdb) n
1327        if (supportsAcceleration ||
(gdb) n
1329          enableWR = gfx::gfxVars::UseWebRender();
(gdb) n
195     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/gfxVars.h:
    No such file or directory.
(gdb) n
1338        bool enableAPZ = UseAPZ();
(gdb) n
1339        CompositorOptions options(enableAPZ, enableWR, enableSWWR);
(gdb) p supportsAcceleration
$8 = <optimized out>
(gdb) p enableAPZ
$5 = true
(gdb) p enableWR
$6 = true
(gdb) p enableSWWR
$7 = true
(gdb) n
1357        options.SetUseWebGPU(StaticPrefs::dom_webgpu_enabled());
(gdb) p StaticPrefs::dom_webgpu_enabled()
$9 = false
(gdb) n
mozilla::Atomic<bool, (mozilla::MemoryOrdering)0, void>::operator bool
    (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h:67
67      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) n
nsBaseWidget::CreateCompositorSession (this=this@entry=0x7fc4db8a30,
    aWidth=aWidth@entry=1080, aHeight=aHeight@entry=2520, 
    aOptionsOut=aOptionsOut@entry=0x7fd7da7770)
    at widget/nsBaseWidget.cpp:1364
1364        options.SetInitiallyPaused(CompositorInitiallyPaused());
(gdb) n
43      ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/
    CompositorOptions.h: No such file or directory.
(gdb) n
1369          lm = new WebRenderLayerManager(this);
(gdb) 
That gives us some things to work with, but to actually dig into what this all means will have to wait until the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
18 Feb 2024 : Day 160 #
It's been a long couple of days running an event at work, but now I'm on the train heading home and looking forward to a change of focus for a bit.

And part of that is getting the opportunity to take a look at the backtrace generated yesterday for the WebView rendering pipeline. I won't copy it out again in full, but it might be worth giving a high-level summary.
#0  PuppetWidgetBase::Invalidate (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
#1  PuppetWidgetBase::UpdateBounds (...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:395
#2  EmbedLiteWindowChild::CreateWidget (this=0x7fc4d626d0)
    at xpcom/base/nsCOMPtr.h:851
#3  RunnableMethodArguments<>::applyImpl...
    at obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
Now that I've mentally parsed the backtrace, it's clearly not as useful as I was hoping. But it is something to go on. The line that's causing the crash is the one with MOZ_CRASH() in it below.
void
PuppetWidgetBase::Invalidate(const LayoutDeviceIntRect &aRect)
{
[...]

  if (mozilla::layers::LayersBackend::LAYERS_CLIENT == lm->GetBackendType()) {
    // No need to do anything, the compositor will handle drawing
  } else {
    MOZ_CRASH("Unexpected layer manager type");
  }
[...]
That means that lm->GetBackendType() is returning something other than LAYERS_CLIENT.

It would be nice to know what value is actually being returned, but it looks like this will be easier said than done with the code in its present form. There's nowhere to place the required breakpoint and no variable to extract it from. The LayerManager is an interface and it's not clear what will be inheriting it at this point.

While I'm on the train it's also particularly challenging for me to do any debugging. It is technically possible and I've done it before, but it requires me to attach USB cables between my devices, which is fine until I lose track of time and find I've arrived at my destination. I prefer to spend my time on the train coding, or reviewing code, if I can.

So I'm going to examine the code visually first. So let's suppose it's EmbedLiteAppProcessParentManager that's inheriting from LayerManager. This isn't an absurd suggestion, it's quite possibly the case. So then the value returned will be a constant:
  virtual mozilla::layers::LayersBackend GetBackendType() override {
    return LayersBackend::LAYERS_OPENGL; }
Again, there's nothing to hang a breakpoint from there. So I've added a debug output so the value can be extracted explicitly.
  LOGW("WEBVIEW: Invalidate LAYERS_CLIENT: %d", lm->GetBackendType());
  if (mozilla::layers::LayersBackend::LAYERS_CLIENT == lm->GetBackendType()) {
    // No need to do anything, the compositor will handle drawing
  } else {
    MOZ_CRASH("Unexpected layer manager type");
  }
There's nothing wrong with this approach, except that it requires a rebuild of the code, which I've just set going. Hopefully it'll forge through the changes swiftly.

In the meantime, let's continue with our thought that the layer manager is of type EmbedLiteAppProcessParentManager and that the method is therefore returning LAYERS_OPENGL. The enum in LayersTypes.h shows that this definitely takes a different value from LAYERS_CLIENT:
enum class LayersBackend : int8_t {
  LAYERS_NONE = 0,
  LAYERS_BASIC,
  LAYERS_OPENGL,
  LAYERS_D3D11,
  LAYERS_CLIENT,
  LAYERS_WR,
  LAYERS_LAST
};
Which does make me wonder how this has come about. Isn't it inevitable that the code will crash in this case?

I'll need to check if either the return value or the test condition has changed since ESR 78. But the other possibility is that it's something else inheriting the LayerManager class.

[...]

Now I'm back home and have access to the debugger. The code is still building — no surprise there — so while I wait let's attache the debugger and see what it throws up.
(gdb) p lm->GetBackendType()
$2 = mozilla::layers::LayersBackend::LAYERS_WR
(gdb) ptype lm
type = class mozilla::layers::LayerManager : public mozilla::layers::FrameRecorder {
  protected:
    nsAutoRefCnt mRefCnt;
[...]
    virtual mozilla::layers::LayersBackend GetBackendType(void);
[...]
  protected:
    ~LayerManager();
[...]
} *
(gdb) p this->GetLayerManager(0, mozilla::layers::LayersBackend::LAYERS_NONE, LAYER_MANAGER_CURRENT)
$2 = (mozilla::layers::LayerManager *) 0x7fc4db1250
Direct examination of the LayerManager doesn't show what the original object type is that's inheriting it. But there is a trick you can do with gdb to get it to tell you:
(gdb) set print object on
(gdb) p this->GetLayerManager(0, mozilla::layers::LayersBackend::LAYERS_NONE, LAYER_MANAGER_CURRENT)
$3 = (mozilla::layers::WebRenderLayerManager *) 0x7fc4db1250
(gdb) set print object off
So the actual type of the layer manager is WebRenderLayerManager. This is clearly a problem, because this will always return LAYERS_WR as its backend type:
  LayersBackend GetBackendType() override { return LayersBackend::LAYERS_WR; }
All this debugging has been useful; so useful in fact that it's made the debug prints I added on the train completely redundant. No matter, I'll leave the build running anyway.

Tomorrow I must find out where the layer manager is being created and also what the layer manager type is on ERS 78 for comparison.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
17 Feb 2024 : Day 159 #
I've been working through a singular bug stack for so long now that it feels strange to have a completely open ended set of possibilities for which direction to go next. We had a small diversion yesterday into creating random grids of avatars but before that were focused for a while on the task of getting DuckDuckGo working. Today I have to make some new decisions about where to go next.

There are two things I'd really like to look into. While working on the browser over the last weeks it's been stable enough to use pretty successfully as a browser. But occasionally the renderer crashes out completely, pulling the browser down, for no obvious reason. It's sporadic enough that there's no obvious cause. But if I could get a backtrace from the crash that might be enough to start looking in to it.

So my first option is looking in to these sporadic crashes. They're not nice for users and might signify a deeper issue.

The second option is fixing the webview rendering pipeline. That needs a little explanation. On Sailfish OS the browser is used in one of two ways, either as a web browser, or as a webview embedded in some other application.

The best example of this is the email client which uses the webview to render messages. These often contain HTML, so it makes perfect sense to use a good embedded browser rending engine for them.

So these are two different use-cases. But they also happen to have two different rendering pipelines. Currently the browser pipeline works nicely, but the webview pipeline is broken. I'd really like to fix it.

I've decided to go with the native rendering pipeline task first (issue 1043 on GitHub). It's clearly defined, but also potentially a big job, so needs some attention. But if I continue to see browser crashes I may switch focus to those instead.

For the native rendering pipeline the first step is straightforward: install a webview project on my phone. There are plenty out there, but I also have several basic examples already written in the "projects" folder on my laptop, and which I should be able to just build and install on my phone for testing.

Digging through projects I find one called "harbour-webview-example" (sounds promising) with the following as the main page of the app:
import QtQuick 2.0
import Sailfish.Silica 1.0
import Sailfish.WebView 1.0
import Sailfish.WebEngine 1.0
import uk.co.flypig.webview 1.0

Page {
    allowedOrientations: Orientation.All

    WebView {
        anchors.fill: parent
        active: true
        url: "http://www.sailfishos.org"
        onLinkClicked: {
          WebEngine.notifyObservers("exampleTopic", url)
        }
    }
}
Straightforward. But containing a webview. Attempting to run it, the results sadly aren't good:
$ harbour-webview
[D] unknown:0 - QML debugging is enabled. Only use this in a safe environment.
[...]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 228}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:228:12

Created LOG for EmbedPrefs
Content JS: resource://gre/modules/SearchSettings.jsm, function: get, message:
    [JavaScript Warning: "get: No settings file exists, new profile?
    NotFoundError: Could not open the file at
    .cache/harbour-webview/harbour-webview/.mozilla/search.json.mozlz4"]
Created LOG for EmbedLiteLayerManager
Segmentation fault
There are a couple of JavaScript errors (may or may not be related) and a crash (definitely relevant). Let's see if we can get a backtrace from the crash:
$ gdb harbour-webview 
GNU gdb (GDB) Mer (8.2.1+git9)
Copyright (C) 2018 Free Software Foundation, Inc.
[...]
Thread 7 "GeckoWorkerThre" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 12581]
0x0000007ff367c0a8 in mozilla::embedlite::PuppetWidgetBase::Invalidate
    (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
274         MOZ_CRASH("Unexpected layer manager type");
(gdb) bt
#0  0x0000007ff367c0a8 in mozilla::embedlite::PuppetWidgetBase::Invalidate
    (this=0x7fc4dac130, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
#1  0x0000007ff368093c in mozilla::embedlite::PuppetWidgetBase::UpdateBounds
    (this=0x7fc4dac130, aRepaint=aRepaint@entry=true)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:395
#2  0x0000007ff3689b28 in mozilla::embedlite::EmbedLiteWindowChild::CreateWidget
    (this=0x7fc4d626d0)
    at xpcom/base/nsCOMPtr.h:851
#3  0x0000007ff367a094 in mozilla::detail::RunnableMethodArguments<>::applyImpl
    <mozilla::embedlite::EmbedLiteWindowChild, void
    (mozilla::embedlite::EmbedLiteWindowChild::*)()>
    (mozilla::embedlite::EmbedLiteWindowChild*, void
    (mozilla::embedlite::EmbedLiteWindowChild::*)(), mozilla::Tuple<>&,
    std::integer_sequence<unsigned long>) (args=..., m=<optimized out>,
    o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
[...]
#28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6
(gdb) 
That's definitely something to go on. But unfortunately I'm tight for time today, so digging in to this backtrace will have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
16 Feb 2024 : Day 158 #
A couple of days back I included a small graphic that showed avatars and usernames of everyone who has helped, commented on, liked, boosted and generally supported this dev diary. Here it is again (because you can never say thank you enough!).
 
Many, many avatars with names underneath, showing all of the people I could find who I think have interacted with the dev diary, along with the words 'Thank you'

It's actually a slide from my FOSDEM presentation, and while I do like it, that's not the real reason I'm showing it again here. During my presentation I mentioned that it had taken me a long time to create the slide. And that's true. But I thought it might be interesting to find out how it was created. Maybe it could have been done in a simpler, better, or quicker way?

Creating it required four steps:
  1. Collect the names and avatars.
  2. Process the avatars.
  3. Plot the names and avatars on the canvas.
  4. Tidy things up.
I did my best to automate the process, with small scripts to handle steps two and three. While most of my energy was spent on step one, it's the automated steps that might be of interest to others.

Let me first give a quick overview of how I collected the data in step one. This did take a long time — longer than I expected — primarily because there were more people interacting with my dev diary than I'd quite appreciated. Initially I collected names from the sailfish forum. There's a thread about the dev diary and I picked up most of the names from there, or from direct messages.

Each user on the forum has an avatar, even if it's been auto-generated by the Discourse forum software. Whenever someone posts their avatar is shown next to their comment. But this is a small version of the avatar. Select the username at the top of a post and a more detailed summary of the account is shown, including a larger version of the same image. Right click on this and it can be saved out to disc.

If you try this you'll notice the avatar is actually a square image, even though it's shown in the forum as circular. A mask is being applied to it on the client side. This will be important for step two.

At this point I also added other users I could think of who, while they may not have made a post on the forum, had nevertheless interacted in important ways with the dev diary. This included many different types of interactions such as comments on IRC or matrix. In this case, I also found their avatars and usernames on the forum.

While doing this I kept a CSV file as I was going along containing two columns: username and avatar filename.

Finally I checked my Mastodon account for all the users who had interacted with my posts there. I stepped through all 149 of my dev diary Mastodon posts (as it was at the time), checking who had favourited, boosted, or replied to a post there. Once again I took a copy of their avatar and added their details to the CSV file.

So far so manual. What I was left with was a directory containing 145 images and a CSV file with 145 rows. Here's the first few rows of the CSV file to give you an idea:
000exploit, 000exploit.png
aerique, aerique.png
Adrian McEwen, amcewen.png
ApB, apb.png
Adam T, atib.png
[...]
You'll notice that it's in alphabetical order. That's because after collecting all the details I ran it through sort on the command line.

That brings us on to step two, the processing of the avatars (which sounds far more grand than it is). Different avatars were in different formats (jpeg or PNG), with different sizes, but mostly square in shape. They needed to all be the same size, the same format (PNG) and with a circular mask applied.

For this I used the convert tool, which is part of the brilliant ImageMagick suite. It performs a vast array of image manipulations all from the command line. Here's just the a small flavour from its help output:
$ convert --help
Version: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
Copyright: (C) 1999-2021 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP(4.5) 
Delegates (built-in): bzlib djvu fftw fontconfig freetype heic jbig jng jp2
    jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
Usage: convert-im6.q16 [options ...] file [ [options ...] file ...]
    [options ...] file

Image Settings:
  -adjoin              join images into a single multi-image file
  -affine matrix       affine transform matrix
  -alpha option        activate, deactivate, reset, or set the alpha channel
  -antialias           remove pixel-aliasing
  -authenticate password
                       decipher image with this password
  -attenuate value     lessen (or intensify) when adding noise to an image
[...]
Happily for me, all three of my desired conversions (changing format, changing size, applying a mask) were available using the tool. I put together a very simple bash script which cycles through all of the files with a given format in a folder, processes them in the way that was needed and then output them to a different folder with a suffix added to the filename.

All of the work is done by the convert tool, via this simple addmask.sh script. It's so simple in fact that I can give the entirety of it here without it taking up too much space:
SUFFIX_FROM=$1
OUT_FOLDER=$2
SUFFIX_TO=$3

set -e

if [ $# -ne 3 ]; then
        echo "Syntax: " $0 "<extension-in> <out-folder> <out-suffix>"
        exit 0
fi

SUFFIX_FROM_DOT="${SUFFIX_FROM}."

echo "Converting <image>.$SUFFIX_FROM images to <image>-$SUFFIX_TO.png"
echo

for name in *.$SUFFIX_FROM; do
	newname="${name%.*}$SUFFIX_TO.png"
        echo "Converting $name to $OUT_FOLDER/$newname"
	convert "$name" masks/mask.png -sample 240x240 -alpha Off -compose \
	    CopyOpacity -composite -density 180 -background transparent \
	    "$OUT_FOLDER/$newname"
done
I then called it a couple of times inside the folder with the images to produce the results needed:
$ ./addmask.sh png masked -masked
$ ./addmask.sh jpg masked -masked
After processing, each of the updated images is given a new name, so I had to perform a regex search and replace on my CSV file to update them appropriately. The file now looks like this:
000exploit, 000exploit-masked.png
aerique, aerique-masked.png
Adrian McEwen, amcewen-masked.png
ApB, apb-masked.png
Adam T, atib-masked.png
[...]
I have to admit that I cheated a bit with this script. I originally wrote it back in September 2022 for Jolla's "Sailing for Ten Years" party held in Berlin on 14th October of the same year. Nico Cartron wrote up a nice summary of the event in the Sailfish Community News at the time. I was asked to give a presentation at the event; one of the slides I created for it was a thank you slide not unlike the one above. In that case it was for translators and apps, but it never actually got used during the event.

Nevertheless the script lived on in my file system and finally found itself a use. To be honest, I was pretty tight for time writing up my presentation for FOSDEM so I'm not sure if I'd have gone down this route if I didn't already have something to build on. I made some small changes to it to handle resizing, but that was pretty much the only change.

That brings us to step three. Now having a directory full of nicely processed images, I needed them to be arranged on a canvas, ideally in SVG format, so that I could then embed the result on a slide.

Since starting my role as a Research Data Scientist I've been immersed in random Python scripts. Python has the benefit of a huge array of random libraries to draw from, SVG-writing included in the form of the drawsvg 2 project. It is, honestly, a really simple way to generate SVGs quickly and easily. Now that I've tried it I think I'll be using it more often.

My aim was to arrange the avatars and names "randomly" on the page. I started by creating a method that placed a single avatar on the canvas with the username underneath. Getting the scale, font and formatting correct took a little trial and error, but I was happy that the final values all made sense. The drawsvg coordinate system works just as it should!

Arranging them at random requires an algorithm. My first instinct was to arrange them all in a grid, but with a random "jitter" applied. That is, for each image select a random angle and distance and move it by that amount on the page.

The script I created for this is a little too long to show here, but you can check it out on GitHub.

Here's how I ran it:
$ python3 createthanks.py names.csv thanks-grid.svg --grid
Using grid formation
Exported to thanks-grid.svg
The results weren't good, as you can see for yourself in this image.


 
The same avatars but arranged in a grid formation with random jitter; but the grid formation is still very clear

The avatars have been placed, but the grid formation is still very clear despite the added jitter, plus there's a gap at the end because the number of avatars in total doesn't divide by the number of avatars on each row. I wasn't happy with it.

So I came up with an alternative solution: for each avatar a random location is chosen on the canvas. As each avatar is added its position is stored in an array, then when the next position is chosen it's compared against all of these positions. If it's within a certain distance (60 units) of any of the other points, it's rejected and a new random position is chosen.

Again, you can see this algorithm given in the same file on GitHub. Here's how I ran it:
$ python3 createthanks.py names.csv thanks-random.svg
Using random formation
100%
Exported to thanks-random.svg
This is the approach I ended up using, so you can see the results in the original slide. But it's not ideal for several reasons. Most crucially it's not guaranteed to complete, for example if there isn't enough space to fit all of the avatars the algorithm will hang indefinitely while it tries to find a place to position the next avatar. It's also inefficient, with each location being compared with every other and a potentially large number of rejections being made before a suitable location is found at each step.

But I found given enough space to locate the avatars the process actually finished pretty quickly. And since I only need to run it once, not multiple times, that's actually not a problem.

In retrospect a better algorithm would have been to partition the canvas up into a grid of sizes much smaller than an avatar. Ideally it would be one pixel per pixel, but in practice we don't really know what a pixel means in this context. Besides which something approaching this is likely to be fine. Now store a Boolean associated with each of these grid points indicating whether it's available or used.

After placing an avatar mark the pixels around the location in this grid as being used, to the extent that there will be no overlap if another avatar is placed in any of the unused spots. Keep a count of the available locations.

Then a random number can be chosen between zero and the total remaining available locations in order to select the next spot.

I didn't implement this, but in my head it works really well.

Finally step four involved tidying up the files. Some of the avatars and usernames were overlapping so needed a bit of manual tweaking (but thankfully not too much). Plus I also had to manually make room for the "Thank you" text in the top left of the window. This required a bit more manual shuffling of avatars, but it all worked out in the end. I'm quite happy with how it came out.

That's it. Just a little diversion into the scripts used to create the image; I hope it's been of some interest.

There will be more of the usual gecko dev diary tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
15 Feb 2024 : Day 157 #
The new gecko packages completed their build during the gap. I'm not expecting anything special from them, apart from the debug output that I added yesterday having been removed. But it will be nice to test whether the packages are working. If they are that means I'll have dealt with the session history issues and the actor issues and it'll be time to wind back down the stack to checking DuckDuckGo.

I've just now fired up sailfish-browser and all of the JavaScript errors seem to have been resolved. If nothing else, that's pleasing to see. So now it's time to return to DuckDuckGo again.

You'll recall that I spent a long time getting DuckDuckGo to work. The main problem was the Sec-Fetch-* headers. This allowed the main search page to load. But there was still an issue in that the search results weren't being displayed. I thought this could potentially be related to the session history issue, which is why I started looking at it. But I wasn't at all certain.

Now that the session history is fixed and the JavaScript output is clean, there's a small chance this will have fixed DuckDuckGo as well.

And it has! Searching with DuckDuckGo now produced nice results. And the page feels nice and smooth as well. With the Forwards and Back buttons now also working as expected, it's really starting to feel like a usable browser finally.
 
The screenshots of DuckDuckGo with captions: DuckDuckGo landing page; Search as you type; Search results now appear!

But my euphoria is short-lived. I'm able to get the browser into a state where DuckDuckGo is no longer working correctly: it just displays a blank page again.

After closing and firing up the browser again everything works as expected again.

It takes me ages to figure out how to get it back into the state where DuckDuckGo is broken. The breakage looks the same as the Sec-Fetch-* header issues that we were experiencing a couple of weeks back. If that's the case, then it's likely there's some route to getting to the DuckDuckGo page that's getting the wrong flags from the user interface and then offering up the wrong Sec-Fetch-* headers as a result.

What do I mean by a route? I mean some set of interactions to get there: loading the page from the history, a bookmark, the URL bar, pressing back, as a page loaded from the command line. All of these are routes and each will potentially need slightly different flags so that the gecko engine knows what's going on and can send the correct Sec-Fetch-* headers to match.

I thought I'd fixed them all, but it would have been easy to miss some. And so it appears.

So I try all the ways to open the page I can think of from different starting states. After having exhausted what I think are all the routes I realise I've missed something important.

So far I've been running the browser from the command line each time. What if the issue is to do with the way the browser is starting up?

Starting the browser from the command line isn't the same as starting it by pressing on its icon in the app grid. There are multiple rasons for this, but the most significant two are:
  1. When launched from the app grid the sandbox is used. It's not used when launched from the command line.
  2. When launched from the app grid the booster is used, unlike when launched from the command line.
And indeed, when I launch the app from the grid, DuckDuckGo fails. This is a serious issue: we can't have users being required to launch the app without sandboxing from the command line each time. But it's not immediately clear to me why sandboxing and/or the booster would make any difference.

To find out what's going wrong I need to establish the Sec-Fetch-* header values that are being sent to the site. That's going to be a little tricky because when launching from the app grid there's no debug output being sent to the console. But it might be possible to extract the same info from the system log. Let's try it:
$ devel-su journalctl --system -f
[...]
booster-browser[14470]: [D] unknown:0 - Using Wayland-EGL
autologind[5343]: library "libutils.so" not found
autologind[5343]: library "libcutils.so" not found
autologind[5343]: library "libhardware.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@2.0.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@2.1.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@3.0.so" not found
autologind[5343]: library "android.hardware.graphics.mapper@4.0.so" not found
autologind[5343]: library "libc++.so" not found
autologind[5343]: library "libhidlbase.so" not found
autologind[5343]: library "libgralloctypes.so" not found
autologind[5343]: library "android.hardware.graphics.common@1.2.so" not found
autologind[5343]: library "libion.so" not found
autologind[5343]: library "libz.so" not found
autologind[5343]: library "libhidlmemory.so" not found
autologind[5343]: library "android.hidl.memory@1.0.so" not found
autologind[5343]: library "vendor.qti.qspmhal@1.0.so" not found
[...]
It looks like the logging output is all there. The next step is to figure out a way to set the EMBED_CONSOLE="network" environment variable, so that we can capture the headers used.

But as I write this I'm hurtling towards King's Cross Station on the train and due to arrive shortly. So I'll have to leave that question open until this evening.

[...]

It's the evening now, so time to check out the logging from the SailJailed browser. Before running it I want to forcefully enable network logging. This will be easier than configuring a special environment for it.

I've made some amendments to the Logger.js directly on-device for this:
vim /usr/lib64/mozembedlite/chrome/embedlite/content/Logger.js
The changes applied are the following, in order to force the "extreme" network logging to be used:
[...]
  get stackTraceEnabled() {
    return true;
    //return this._consoleEnv.indexOf("stacktrace") !== -1;
  },

  get devModeNetworkEnabled() {
    return true;
    //return this._consoleEnv.indexOf("network") !== -1;
  }, 
[...]
Next we must set the logger running, also on the device:
$ devel-su journalctl --system -f | grep "dbus-daemon" > sailjail-log-01.txt
And finally run sailfish-browser. The DuckDuckGo page loads, but doesn't render. That's good: that's the bug we want to examine. After closing the browser down I'm left with a surprisingly tractable 80 KiB log file to investigate:
$ ls -lh sailjail-log-01.txt 
-rw-rw-r--    1 defaultu defaultu   80.0K Feb 13 22:35 sailjail-log-01.txt
The bits we're interested in are the Sec-Fetch-* request headers. This is what's in the log file for these:
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/?t=h_
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Cookie : l=wt-wt
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
        If-None-Match : "65cbc605-4716"
Going back to the non-sandboxed run, below are the equivalent headers sent. There are two here because the browser is downloading a second file.
[ Request details ------------------------------------------- ]
    Request: GET status: 200 OK
    URL: https://duckduckgo.com/?t=h_
    [ Request headers --------------------------------------- ]
        Host : duckduckgo.com
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : text/html,application/xhtml+xml,application/xml;q=0.9,
            image/webp,*/*;q=0.8
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Cookie : l=wt-wt
        Upgrade-Insecure-Requests : 1
        Sec-Fetch-Dest : document
        Sec-Fetch-Mode : navigate
        Sec-Fetch-Site : none
        Sec-Fetch-User : ?1
        If-None-Match : "65cbc606-4716"
[ Request details ------------------------------------------- ]
    Request: GET status: 304 Not Modified
    URL: https://content-signature-2.cdn.mozilla.net/chains/remote-settings.
        content-signature.mozilla.org-2024-03-20-10-07-03.chain
    [ Request headers --------------------------------------- ]
        Host : content-signature-2.cdn.mozilla.net
        User-Agent : Mozilla/5.0 (X11; Linux aarch64; rv:91.0) Gecko/20100101
            Firefox/91.0
        Accept : */*
        Accept-Language : en-GB,en;q=0.5
        Accept-Encoding : gzip, deflate, br
        Connection : keep-alive
        Sec-Fetch-Dest : empty
        Sec-Fetch-Mode : cors
        Sec-Fetch-Site : cross-site
        If-Modified-Since : Tue, 30 Jan 2024 10:07:04 GMT
        If-None-Match : "b437816fe2de8ff3d429925523643815"
To be honest I'm a bit confused by this: for the initial request there are no difference between the values provided for the Sec-Fetch-* headers in either case. I was expecting to see a difference.

Maybe a caching issue? Let's try deleting the profile, killing any running booster processes and trying again.
$ mv ~/.local/share/org.sailfishos/browser/.mozilla/ \
    ~/.local/share/org.sailfishos/browser/mozilla.bak
$ ps aux | grep browser
 4628 defaultu /usr/bin/invoker [...] --application=sailfish-browser
 4640 defaultu /usr/bin/firejail [...] --application=sailfish-browser
 4674 defaultu /usr/libexec/mapplauncherd/booster-browser
    --application=sailfish-browser
 4675 defaultu booster [browser]
 4702 defaultu grep browser
 6184 defaultu /usr/bin/invoker --type=silica-session [...]
    --application=jolla-email
 6243 defaultu /usr/bin/firejail [...] --application=sailfish-browser
 6248 defaultu /usr/bin/firejail [...] --application=jolla-email
 6442 defaultu /usr/libexec/mapplauncherd/booster-browser
    --application=jolla-email
 6452 defaultu booster [browser]
$ kill -1 4640
$ kill -1 6243
After deleting the browser's local config storage and killing the SailJail/booster processes, DuckDuckGo then works successfully again. So probably it was a caching issue.

When running the browser from the launcher there's an added complication that usually the booster is still running in the background for performance reasons. So changes to the browser code and files may not always be applied if this is the case. After killing the booster processes, this seems to have been fixed.

After this additional investigation I'm satisfied that this doesn't look like an issue with the code after all.

I'm going to leave it there for today. Tomorrow I'll be looking for a completely new task to work on. But I think the browser has got to the stage where it would be worth having more hands to test it, so the next task may be to figure out how that can happen.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
14 Feb 2024 : Day 156 #
It is the morning, time to take a look into why EmbedLiteGlobalHelper.js isn't being initialised on ESR 91. We saw yesterday that this was why the Window Actors weren't being registered. If we want them registered we're going to have to get EmbedLiteGlobalHelper.js back up and running.

Now I think about it, it's not that EmbedLiteGlobalHelper.js isn't being loaded. We know it's being loaded because we see this line in the debug output when running ESR 91:
$ EMBED_CONSOLE=1 sailfish-browser
[...]
JSComp: EmbedLiteGlobalHelper.js loaded
[...]
That line is being directly output from the EmbedLiteGlobalHelper.js file itself:
function EmbedLiteGlobalHelper()
{
  L10nRegistry.registerSources([new FileSource(
                                   "0-mozembedlite",
                                   ["en-US", "fi", "ru"],
                                   "chrome://browser/content/localization/
                                   {locale}/")])

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
But! There is a line missing from here that's in the ESR 78 code:
function EmbedLiteGlobalHelper()
{
  ActorManagerParent.flush();

  L10nRegistry.registerSource(new FileSource(
                                  "0-mozembedlite",
                                  ["en-US", "fi", "ru"],
                                  "chrome://browser/content/localization/
                                  {locale}/"))

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
Could it be that the call to ActorManagerParent.flush() is what kicks the ActorManagerParent into life? Presumably there's a reason it was removed. It was most likely me that removed it, but I don't recall. But we should be able to find out.
$ git log -S "ActorManagerParent" -1 -- jscomps/EmbedLiteGlobalHelper.js
commit 0bf2601425ec1d8d639255d6a7c32231e7e38eae
Author: David Llewellyn-Jones <david.llewellyn-jones@jolla.com>
Date:   Thu Nov 23 21:55:13 2023 +0000

    Remove EmbedLiteGlobalHelper.js errors
    
    Makes three changes to address errors that were coming from
    EmbedLiteGlobalHelper.js:
    
    1. Use ComponentUtils.generateNSGetFactory() instead of
       XPCOMUtils.generateNSGetFactory().
    
    2. Remove call to ActorManagerParent.flush(). See Bug 1649843.
    
    3. Use L10nRegistry.registerSources() instead of
       L10nRegistry.registerSource(). See Bug 1648631.
    
    See the following related upstream changes:
    
    https://phabricator.services.mozilla.com/D95206
    
    https://phabricator.services.mozilla.com/D81243
That "Bug 1648631" that's being referred to there is described as "Remove legacy JS actors infrastructure and migrate remaining actors to JSWindowActors".

Presumably there was some error resulting from the call to ActorManagerParent.flush(), but November was a long time ago now (way back on Day 88 in fact). According to the diary entry then, the change was to remove the following error:
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js,
    line 32: TypeError: ActorManagerParent.flush is not a function
JavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js,
    line 34: TypeError: L10nRegistry.registerSource is not a function
With all of the other changes we've been making, restoring the removed code to see what happens doesn't sound like the worst idea right now. Let's try it.
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
[...]
Created LOG for EmbedLite
ACTOR: addJSProcessActors
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: addJSWindowActors
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
AJavaScript error: file:///usr/lib64/mozembedlite/components/
    EmbedLiteGlobalHelper.js, line 31: TypeError: ActorManagerParent.flush
    is not a function
CTOR: RegisterWindowActor: 
ACTOR: RegisterWindowActor: 
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
Call EmbedLiteApp::StopChildThread()
It turns out the code I added to output the name of the actors is broken: the log output is being generated, but without the name of the actor concerned. But I don't fancy doing a complete rebuild to fix that so we'll just have to do without for now.

The results here are interesting. This has clearly brought the ActorManagerParent back to life. But there is still this error about flush() not being a function. And when comparing the code in ActorManagerParent.jsm between ESR 78 and ESR 91 it is true that this flush() method has been removed.

That leaves us with an interesting quandary. In ESR 78 the flush() function was being used to trigger initialisation of the module. Now we need something else to do the same.

Thankfully there is a really simple solution. We can just instantiate an ActorManagerParent object without calling any functions on it.
function EmbedLiteGlobalHelper()
{
  // Touch ActorManagerParent so that it gets initialised
  var actor = new ActorManagerParent();

  L10nRegistry.registerSources([new FileSource(
                                   "0-mozembedlite",
                                   ["en-US", "fi", "ru"],
                                   "chrome://browser/content/localization/
                                   {locale}/")])

  Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
}
Now we get a nice clean startup without any errors:
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
[...]
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
ACTOR: addJSProcessActors
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: RegisterProcessActor: 
ACTOR: addJSWindowActors
ACTOR: RegisterWindowActor: 
[...]
ACTOR: RegisterWindowActor: 
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
Call EmbedLiteApp::StopChildThread()
Excellent! The final step then is to remove all of the debugging code I added and see where that leaves us. With any luck, this will resolve the session history issues and allow us to head back in to checking DuckDuckGo once again.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
13 Feb 2024 : Day 155 #
This morning the build I started yesterday has completed successfully. Now to test it.

You'll recall that the purpose of the build was to add some debugging output to find out what's going on with the Window Actors and the LoginManager actor in particular. Is it being registered? Are others being registered? Are others being requested? Every time one of these events occurs we should get some appropriate debug output so that we know.

To actually see the output we'll need to activate the BrowsingContext log output, like this:
$ MOZ_LOG="BrowsingContext:5" sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Creating 0x00000003
    in Parent
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Parent: Connecting
    0x00000003 to 0x00000000 (private=0, remote=0, fission=0, oa=)
ACTOR: Getting LoginManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
ACTOR: Getting LoginManager
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[Parent 4610: Unnamed thread 7908002670]: D/BrowsingContext Parent: Detaching
    0x00000003 from 0x00000000
Call EmbedLiteApp::StopChildThread()
Redirecting call to abort() to mozalloc_abort
It's a little hard to tell, but there are actually only a couple of relevant lines of output in there. Let's clean that up a bit:
$ MOZ_LOG="BrowsingContext:5" sailfish-browser 2>&1 | grep -E "ACTOR:|error"
ACTOR: Getting LoginManager
[Parent 4720: Unnamed thread 7668002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
ACTOR: Getting LoginManager
[Parent 4720: Unnamed thread 7668002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
So what we have is an access to GetActor() but without any apparent registrations for this or any other actors.

This... looks suspicious to me. I'm not entirely convinced that the logging I've added is working, particularly for the printf() output I added to the JSActorService class.

I don't want to have to make another build, but thankfully we can check this using the debugger.
$ MOZ_LOG="BrowsingContext:5" gdb sailfish-browser
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
ACTOR: Getting LoginManager
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[...]
ACTOR: Getting LoginManager
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext ACTOR: GetActor: 
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 542:
    NotFoundError: WindowGlobalChild.getActor: No such JSWindowActor
    'LoginManager'
[Parent 8108: Unnamed thread 7fc0002670]: D/BrowsingContext Parent: Detaching
    0x00000003 from 0x00000000
Call EmbedLiteApp::StopChildThread()
Redirecting call to abort() to mozalloc_abort
[...]
(gdb) info break
Num Type       Disp Enb  What
1   breakpoint keep y    in mozilla::dom::JSActorService::RegisterWindowActor
                            (nsTSubstring<char> const&, mozilla::dom::
                            WindowActorOptions const&, mozilla::ErrorResult&) 
                         at dom/ipc/jsactor/JSActorService.cpp:60
2   breakpoint keep y    in mozilla::dom::JSActorService::UnregisterWindowActor
                            (nsTSubstring<char> const&) 
                         at dom/ipc/jsactor/JSActorService.cpp:109
3   breakpoint keep y    in mozilla::dom::JSActorService::RegisterProcessActor
                            (nsTSubstring<char> const&, mozilla::dom::
                            ProcessActorOptions const&, mozilla::ErrorResult&) 
                         at dom/ipc/jsactor/JSActorService.cpp:231
4   breakpoint keep y    in mozilla::dom::JSActorService::UnregisterProcessActor
                            (nsTSubstring<char> const&) 
                         at dom/ipc/jsactor/JSActorService.cpp:275
(gdb) 
Even though the breakpoints found places to attach, there are no hits. It really does look like the entire actor functionality is missing, which may or may not be intended.

The obvious thing to do now is to check the same thing using ESR 78. So let's do that.
$ gdb sailfish-browser 
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) handle SIGPIPE nostop
Signal        Stop	Print	Pass to program	Description
SIGPIPE       No	Yes	Yes		Broken pipe
(gdb) p aName
$1 = (const nsACString &) @0x7fa69e3d80: {<mozilla::detail::nsTStringRepr
    <char>> = {mData = 0x7fa69e3d94 "AboutHttpsOnlyError", mLength = 19, 
    mDataFlags = (mozilla::detail::StringDataFlags::TERMINATED |
    mozilla::detail::StringDataFlags::INLINE), 
    mClassFlags = mozilla::detail::StringClassFlags::INLINE},
    static kMaxCapacity = 2147483637}
(gdb) p aName.mData
$2 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AboutHttpsOnlyError"
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) p aName.mData
$3 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AudioPlayback"
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService::
    RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., 
    aRv=...) at dom/ipc/JSActorService.cpp:51
51	                                         ErrorResult& aRv) {
(gdb) p aName.mData
$4 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3d94
    "AutoComplete"
(gdb) c
[...] 
There are dozens more of them. But also, crucially, this one: 
Thread 10 "GeckoWorkerThre" hit Breakpoint 1, mozilla::dom::JSActorService:: RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., aRv=...) at dom/ipc/JSActorService.cpp:51 51 ErrorResult& aRv) { (gdb) p aName.mData $17 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7fa69e3a84 "LoginManager" (gdb) bt #0 mozilla::dom::JSActorService::RegisterWindowActor (this=this@entry=0x7f80480850, aName=..., aOptions=..., aRv=...) at dom/ipc/JSActorService.cpp:51 #1 0x0000007fba9881b0 in mozilla::dom::ChromeUtils::RegisterWindowActor (aGlobal=..., aName=..., aOptions=..., aRv=...) at dom/base/ChromeUtils.cpp:1243 #2 0x0000007fbae8b4f4 in mozilla::dom::ChromeUtils_Binding::registerWindowActor (cx_=<optimized out>, argc=<optimized out>, vp=0x7fa69e3b00) at ChromeUtilsBinding.cpp:4263 #3 0x0000007f00460260 in ?? () #4 0x0000007fbcb1ba80 in Interpret (cx=0x7fa69e3ad0, state=...) at js/src/vm/Activation.h:541 #5 0x0000007fbcb1ba80 in Interpret (cx=0x7f802270c0, state=...) at js/src/vm/Activation.h:541 #6 0x0000000000000000 in ?? () Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) 
Eventually it gets through them all and the page loads. None of the other breakpoints get hit, so there don't appear to be any deregistrations.

This is enlightening. It means that this is an iceberg moment: the error from LoginManager is exposing a much more serious problem under the surface. Honestly, I'm puzzled as to why this hasn't caused more breakages of other elements of the browser user interface.

The next step is to look at the backtrace and find out why the same code isn't executing in ESR 91. It's not the best backtrace to be honest because it's backstopped by the interpreter, but it at least gives us something to work with.

So back to ESR 91:
(gdb) b ChromeUtils::RegisterWindowActor
Breakpoint 5 at 0x7ff2c9a0d8: file dom/base/ChromeUtils.cpp, line 1350.
(gdb) b ChromeUtils_Binding::registerWindowActor
Breakpoint 6 at 0x7ff31bf858: file ChromeUtilsBinding.cpp, line 5237.
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /usr/bin/sailfish-browser 
[...]
No hits.

The backtrace gets lost when it hits JavaScript, but it looks like the JavaScript action is happen in the ActorManagerParent.jsm file. So back to ESR 78, and I've added a couple of extra debug prints to the ActorManagerParent.jsm file:
  addJSProcessActors(actors) {
    dump("ACTOR: addJSProcessActors\n");
    this._addActors(actors, "JSProcessActor");
  },
  addJSWindowActors(actors) {
    dump("ACTOR: addJSWindowActors\n");
    this._addActors(actors, "JSWindowActor");
  },
And there's an immediate hit:
$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
ACTOR: addJSProcessActors
ACTOR: addJSWindowActors
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
[..]
And for a bit more clarity:
$ sailfish-browser 2>&1 | grep "ACTOR:"
ACTOR: addJSProcessActors
ACTOR: addJSWindowActors
I've copied those same debug lines over to the ESR 91 code to see if the same methods are being called there.
$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
[...]
Nothing that I can see, but just to be certain:
$ sailfish-browser 2>&1 | grep "ACTOR:"
ACTOR: Getting LoginManager
ACTOR: Getting LoginManager
So somewhere between EmbedLiteFaviconService.js being loaded and EmbedLiteGlobalHelper.js being loaded the actors are registered on ESR 78, but that's not happening on ESR 91.

Time for a break, but when I get back I'll look into this further.

[...]

Back to it. So the question I need to answer now is "where in ESR 78 are the actors getting registered?" There are two clear candidates. The first is BrowserGlue.jsm. This includes code to get the ActorManagerParent:
ChromeUtils.defineModuleGetter(
  this,
  "ActorManagerParent",
  "resource://gre/modules/ActorManagerParent.jsm"
);
It even adds some actors of its own during initialisation:
  // initialization (called on application startup)
  _init: function BG__init() {
[...]
    ActorManagerParent.addJSProcessActors(JSPROCESSACTORS);
    ActorManagerParent.addJSWindowActors(JSWINDOWACTORS);
    ActorManagerParent.addLegacyActors(LEGACY_ACTORS);
    ActorManagerParent.flush();
[...]
  },
Another, potentially more promising candidate, is EmbedLiteGlobalHelper.js. It's more promising for multiple reasons. First, it's part of embedlite-components, which means it's intended for use with sailfish-browser. Second, something in the back of my mind tells me sailfish-browser uses its own version of the browser glue. Third, and perhaps most compelling, the debug output messages come straight before EmbedLiteGlobalHelper.js is claiming to be initialised, which would fit with the the actor initialisation being part of the initialisation of EmbedLiteGlobalHelper.js.

I should be able to check this pretty straightforwardly. If I comment out the code in EmbedLiteGlobalHelper.js related to the actors like this:
//ChromeUtils.defineModuleGetter(
//  this,
//  "ActorManagerParent",
//  "resource://gre/modules/ActorManagerParent.jsm"
//);

Services.scriptloader.loadSubScript("chrome://embedlite/content/Logger.js");

// Common helper service

function EmbedLiteGlobalHelper()
{
  //ActorManagerParent.flush();
[...]
Then the errors received start to look very similar to those for ESR 91:
$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 442:
    NotSupportedError: WindowGlobalChild.getActor: Could not get
    JSWindowActorProtocol: LoginManager is not registered
To double-check, we can run sailfish-browser using the debugger with breakpoints on the relevant methods like this:
$ gdb sailfish-browser
GNU gdb (GDB) Mer (8.2.1+git9)
[...]
(gdb) b JSActorService::RegisterWindowActor
Function "JSActorService::RegisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (JSActorService::RegisterWindowActor) pending.
(gdb) b JSActorService::UnregisterWindowActor
Function "JSActorService::UnregisterWindowActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (JSActorService::UnregisterWindowActor) pending.
(gdb) b JSActorService::RegisterProcessActor
Function "JSActorService::RegisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 3 (JSActorService::RegisterProcessActor) pending.
(gdb) b JSActorService::UnregisterProcessActor
Function "JSActorService::UnregisterProcessActor" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 4 (JSActorService::UnregisterProcessActor) pending.
(gdb) r
[...]
No hits. So, in conclusion it seems that EmbedLiteGlobalHelper.js isn't being initialised on ESR 91. The task now is to find out why.

Once again, this feels like progress, but an answer for this question will have to wait until tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
12 Feb 2024 : Day 154 #
I'm still trying to track down the reason for the following error today:
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
It's being a bit elusive, because there doesn't seem to be any clear difference between the way the LoginManager is set up on ESR 78 compared to ESR 91.

So I thought I should let it settle overnight. One possibility I came up with during this pondering process was that there's an error during the initialisation of LoginManager that's resulting in it not becoming available for use later.

But a careful check of the debug output prior to the above error doesn't give any indication that anything like this is going wrong. Here's the entire output.
[defaultuser@Xperia10III gecko]$ sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libutils.so" not found
library "libcutils.so" not found
library "libhardware.so" not found
library "android.hardware.graphics.mapper@2.0.so" not found
library "android.hardware.graphics.mapper@2.1.so" not found
library "android.hardware.graphics.mapper@3.0.so" not found
library "android.hardware.graphics.mapper@4.0.so" not found
library "libc++.so" not found
library "libhidlbase.so" not found
library "libgralloctypes.so" not found
library "android.hardware.graphics.common@1.2.so" not found
library "libion.so" not found
library "libz.so" not found
library "libhidlmemory.so" not found
library "android.hidl.memory@1.0.so" not found
library "vendor.qti.qspmhal@1.0.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Call EmbedLiteApp::StopChildThread()
Apart from the LoginManager error the output is looking surprisingly clean now. But I do still need to figure this one out.

Another possibility that crossed my mind is that there's a problem with the way WindowGlobalChild::GetActor() is working in general. In other words, the issue isn't really with the LoginManager at all but rather with the code for accessing it.

On the face of it this seems unlikely: the error is only showing for the LoginManager and no other actors. And there are plenty of other uses. I count 167 instances, only 9 of which are for LoginManager.
$ grep -rIn ".getActor(\"" * --include="*.js*" | wc -l
167
$ grep -rIn ".getActor(\"LoginManager\")" * --include="*.js*" | wc -l
9
Nevertheless it is possible that only the LoginManager happens to be being requested. Unlikely, but possible.

To try to find out, I've added some debugging code so that something will be output whenever GetActor() is called off the WindowGlobalChild:
already_AddRefed<JSWindowActorChild> WindowGlobalChild::GetActor(
    JSContext* aCx, const nsACString& aName, ErrorResult& aRv) {

  MOZ_LOG(BrowsingContext::GetLog(), LogLevel::Debug, ("ACTOR: GetActor: ",
      PromiseFlatCString(aName)));

  return JSActorManager::GetActor(aCx, aName, aRv)
      .downcast<JSWindowActorChild>();
}
I've also added some debugging closer to the action in LoginManagerChild.jsm:
  static forWindow(window) {
    let windowGlobal = window.windowGlobalChild;
    if (windowGlobal) {
      dump("ACTOR: Getting LoginManager\n");
      return windowGlobal.getActor("LoginManager");
    }

    return null;
  }
Let's see if either of those provide any insight. Unfortunately these changes require a build, so it will take a while. During the build, I'm going to look into the third possibility I thought about: that the LoginManager isn't being registered correctly.

[...]

Before the build completes I realise that all of the action registering and unregistering actors happens in JSActorService.cpp. There's a hash table called mWindowActorDescriptors which stores all of the actors, alongside registration and removal methods. To help understand what's going on there it will be helpful to add some debug output here too, to expose any actors that are added or removed here. So I've cancelled the build while I add it in.

Here's an example:
void JSActorService::RegisterWindowActor(const nsACString& aName,
                                         const WindowActorOptions& aOptions,
                                         ErrorResult& aRv) {
  MOZ_ASSERT(NS_IsMainThread());
  MOZ_ASSERT(XRE_IsParentProcess());

  printf("ACTOR: RegisterWindowActor: %s", PromiseFlatCString(aName));
[...]
I've added similar debug output for UnregisterWindowActor(), RegisterProcessActor() and UnregisterProcessActor() as well.

Now I've set it building again. Time for a break while my laptop does all the work for me.

[...]

It turns out the break was longer than I anticipated. I thought the build might finish quickly but it's still chugging away several hours later as it heads towards bed time.

So I'll have to pick this up in the morning once it's built. I'm looking forward to finding out what's really happening with this Actor code.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
11 Feb 2024 : Day 153 #
Yesterday we finally saw the session history start to work. But it wasn't without errors and we were still left with the following showing in the console:
Warning: couldn't PurgeHistory. Was it a file download? TypeError:
    legacyHistory.PurgeHistory is not a function

JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
That's two separate errors. We started to look into the first of these, which is emanating from embedhelper.js. The PurgeHistory() method has been renamed to purgeHistory() in nsISHistory.idl. So with any luck if we just make the same change in embedhelper.js it'll fix the first of these.

Happily the embedhelper.js file is part of embedlite-components which makes it super-quick to test. No need to rebuild gecko. And the change does the trick: no more PurgeHistory errors. I experience a couple of cases where it seems to drop some items from the history, but I think this might be teething troubles (and also not just an ESR 91 issue... I'm sure I've seen it in the current release build as well). I'll have to see if I can find a way to reliably reproduce it.

What about the second error relating to LoginManagerChild.jsm then?
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Here's the bit of code from LoginManagerChild.jsm causing the error:
  static forWindow(window) {
    let windowGlobal = window.windowGlobalChild;
    if (windowGlobal) {
      return windowGlobal.getActor("LoginManager");
    }

    return null;
  }
There's no change between this bit of code and the code in ESR 78. So the reason for the error must be buried a little deeper.

The code in WindowGlobalChild.cpp related to this has changed — become much simpler in fact — but I'm not yet convinced that this is the reason for the error.
already_AddRefed<JSWindowActorChild> WindowGlobalChild::GetActor(
    JSContext* aCx, const nsACString& aName, ErrorResult& aRv) {
  return JSActorManager::GetActor(aCx, aName, aRv)
      .downcast<JSWindowActorChild>();
}
The simplification is because the code has been moved in to JSActorManager.cpp, but it's really doing something that looks pretty similar.

There are no obvious differences in the LoginManager.jsm code itself. Just some additional telemetry and minor reformatting.

I've checked a bunch of things. The LoginManager.jsm file is contained within omni.ja. It's apparently accessed in multiple other places in both ESR 78 and ESR 91 in the same way. There is a very small change to the way it's being registered. From this:
    {
        'cid': '{cb9e0de8-3598-4ed7-857b-827f011ad5d8}',
        'contract_ids': ['@mozilla.org/login-manager;1'],
        'jsm': 'resource://gre/modules/LoginManager.jsm',
        'constructor': 'LoginManager',
    },
To this:
    {
        'js_name': 'logins',
        'cid': '{cb9e0de8-3598-4ed7-857b-827f011ad5d8}',
        'contract_ids': ['@mozilla.org/login-manager;1'],
        'interfaces': ['nsILoginManager'],
        'jsm': 'resource://gre/modules/LoginManager.jsm',
        'constructor': 'LoginManager',
    },
But I don't see why that change would make any difference. So right now I'm unfortunately a bit stumped. Maybe sleeping on it will help. I guess I'll find out in the morning.

So that's it for today, but hopefully I'll make a bit more progress on this tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
10 Feb 2024 : Day 152 #
This morning I woke to find the build had completed. Hooray! That means I can test the changes I made yesterday to add an InitSessionHistory() call into the EmbedLite code.

After installing and running the code I still see the occasional related error, but the main errors we were getting before about the sessionHistory being null are no longer appearing.

That's a small but important step. But even more important is the fact that the Back and Forwards buttons are now working as well. Not only a good sign, but also important functionality being restored. I've been finding it quite challenging using the browser without the navigation buttons. So this is a very welcome result.

A couple of additional errors are also appearing now. These are new; first another error related to the history:
Warning: couldn't PurgeHistory. Was it a file download? TypeError: 
    legacyHistory.PurgeHistory is not a function
But also an error about the LoginManager:
JavaScript error: resource://gre/modules/LoginManagerChild.jsm, line 541:
    NotFoundError: WindowGlobalChild.getActor:
    No such JSWindowActor 'LoginManager'
Starting with the first error, the PurgeHistory() method certainly exists in nsISHistory.idl:
  /**
   * Called to purge older documents from history.
   * Documents can be removed from session history for various
   * reasons. For example to  control memory usage of the browser, to
   * prevent users from loading documents from history, to erase evidence of
   * prior page loads etc...
   *
   * @param numEntries        The number of toplevel documents to be
   *                          purged from history. During purge operation,
   *                          the latest documents are maintained and older
   *                          'numEntries' documents are removed from history.
   * @throws                  <code>NS_SUCCESS_LOSS_OF_INSIGNIFICANT_DATA</code>
   *                          Purge was vetod.
   * @throws                  <code>NS_ERROR_FAILURE</code> numEntries is
   *                          invalid or out of bounds with the size of history.
   */
  void PurgeHistory(in long aNumEntries);
It's no longer present in ESR 91 though. Let's find out why not.
$ git log -1 -S "PurgeHistory" docshell/shistory/nsISHistory.idl
commit f9f96d23ca42f359e143d0ae98234021e86179a7
Author: Andreas Farre <farre@mozilla.com>
Date:   Wed Sep 16 14:51:01 2020 +0000

    Bug 1662410 - Part 1: Fix usage of ChildSHistory.legacySHistory . r=peterv
    
    ChildSHistory.legacySHistory isn't valid for content processes when
    session history in the parent is enabled. We try to fix this by either
    delegating to the parent by IPC or move the implementation partially
    or as a whole to the parent.
    
    Differential Revision: https://phabricator.services.mozilla.com/D89353
This is definitely the problem that's causing the issue, as we can see in the diff:
$ git diff f9f96d23ca42f359e143d0ae98234021e86179a7~ \
    f9f96d23ca42f359e143d0ae98234021e86179a7 \
    -- docshell/shistory/nsISHistory.idl
diff --git a/docshell/shistory/nsISHistory.idl b/docshell/shistory/
    nsISHistory.idl
index 3d914924c94d..1f5b9c5477e9 100644
--- a/docshell/shistory/nsISHistory.idl
+++ b/docshell/shistory/nsISHistory.idl
@@ -87,7 +87,7 @@ interface nsISHistory: nsISupports
    * @throws                  <code>NS_ERROR_FAILURE</code> numEntries is
    *                          invalid or out of bounds with the size of history.
    */
-  void PurgeHistory(in long aNumEntries);
+  void purgeHistory(in long aNumEntries);
 
   /**
    * Called to register a listener for the session history component.
@@ -255,7 +255,7 @@ interface nsISHistory: nsISupports
    * Collect docshellIDs from aEntry's children and remove those
    * entries from history.
    *
-   * @param aEntry           Children docshellID's will be collected from 
+   * @param aEntry           Children docshellID's will be collected from
    *                         this entry and passed to RemoveEntries as aIDs.
   */
   [noscript, notxpcom]
@@ -265,7 +265,7 @@ interface nsISHistory: nsISupports
   void Reload(in unsigned long aReloadFlags);
 
   [notxpcom] void EnsureCorrectEntryAtCurrIndex(in nsISHEntry aEntry);
-  
+
   [notxpcom] void EvictContentViewersOrReplaceEntry(in nsISHEntry aNewSHEntry,
       in bool aReplace);
 
   nsISHEntry createEntry();
Helpfully we can immediately see from this that the call hasn't exactly been removed. It's just been slightly renamed, switching the initial uppercase "P" for a lowercase "p". With any luck then, this should be an easy fix.

This is where the call is defined, but to fix it we also now need to know where the call comes from. Doing a quick grep on the code highlights that it's being called in embedhelper.js which is part of the embedlite-components package.

Here's the relevant section:
      case "embedui:addhistory": {
        // aMessage.data contains: 1) list of 'links' loaded from DB, 2) current 'index'.

        let docShell = content.docShell;
        let sessionHistory = docShell.QueryInterface(Ci.nsIWebNavigation).sessionHistory;
        let legacyHistory = sessionHistory.legacySHistory;
        let ioService = Cc["@mozilla.org/network/io-service;1"].getService(Ci.nsIIOService);

        try {
          // Initially we load the current URL and that creates an unneeded entry in History -> purge it.
          if (legacyHistory.count > 0) {
            legacyHistory.PurgeHistory(1);
          }
        } catch (e) {
            Logger.warn("Warning: couldn't PurgeHistory. Was it a file download?", e);
        }

Changing legacyHistory.PurgeHistory(1) to legacyHistory.purgeHistory(1) will hopefully do the trick. Unfortunately I'm already out of time for today, so we'll have to wait until tomorrow to find out for certain. But I feel like we're making progress.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
9 Feb 2024 : Day 151 #
Yesterday we were comparing ESR 78 and ESR 91 to see why the session history is initialised in the former but not in the latter. We reached this bit of code in ESR 91:
  // If we are an in-process browser, we want to set up our session history.
  if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
      !mOwnerContent->HasAttr(kNameSpaceID_None, nsGkAtoms::disablehistory)) {
    // XXX(nika): Set this up more explicitly?
    mPendingBrowsingContext->InitSessionHistory();
  }
As you can see, this could potentially initialise the session history as long as the condition is true going in to it.

But when we stepped through the program we found it wasn't true. That's because mIsTopLevelContent was set to false, which itself was because BrowsingContext::mParentWindow was non-null.

That makes sense intuitively: if the window has a parent, then it's not a top level element.

However with ESR 78 we have a similar situation, because the code looks like this:
  if (aBrowsingContext->IsTop()) {
    aBrowsingContext->InitSessionHistory();
  }
But this is being executed, suggesting that in this case there is no parent window. Why the difference?

Although that seems like a fair question, there's another question I'd like to try to answer first. The D100348 change that caused this problem in the first place moves the InitSessionHistory() call from nsWebBrowser::Create() to sFrameLoader::TryRemoteBrowserInternal(). So clearly the authors of that change thought the execution flow would go via the latter method. So it would be good to know why it isn't for us using EmbedLite.

Looking through the code, there is at least one route to getting to InitSessionHistory() via TryRemoteBrowserInternal() that looks like this (I routed this by hand... quite laborious):
nsFrameLoader::LoadURI()
Document::InitializeFrameLoader()
Document::MaybeInitializeFinalizeFrameLoaders()
nsFrameLoader::ReallyStartLoading()
nsFrameLoader::ReallyStartLoadingInternal()
nsFrameLoader::EnsureRemoteBrowser()
nsFrameLoader::TryRemoteBrowser()
nsFrameLoader::TryRemoteBrowserInternal()
BrowsingContext::InitSessionHistory()
In other words a call to LoadURI() can then call InitializeFrameLoader() which can then call MaybeInitializeFinalizeFrameLoaders() and so on, all the way to InitSessionHistory().

The top half of this, from LoadURI() to ReallyStartLoadingInternal() is certainly being called when the code is executed. We can see that by placing some breakpoints on various methods and checking the results:
Thread 10 "GeckoWorkerThre" hit Breakpoint 4, nsFrameLoader::
    LoadURI (this=this@entry=0x7fc1419ef0, aURI=0x7fc15feb90,
    aTriggeringPrincipal=0x7fc11a47d0, aCsp=0x7ee8004780,
    aOriginalSrc=aOriginalSrc@entry=true) at dom/base/nsFrameLoader.cpp:600
600                                     bool aOriginalSrc) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 3, mozilla::dom::Document::
    InitializeFrameLoader (this=this@entry=0x7fc11a1960, 
    aLoader=aLoader@entry=0x7fc1419ef0) at dom/base/Document.cpp:8999
8999    nsresult Document::InitializeFrameLoader(nsFrameLoader* aLoader) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, mozilla::dom::Document::
    MaybeInitializeFinalizeFrameLoaders (this=0x7fc11a1960)
    at dom/base/Document.cpp:9039
9039    void Document::MaybeInitializeFinalizeFrameLoaders() {
(gdb) c
Continuing.
[...]
Thread 10 "GeckoWorkerThre" hit Breakpoint 9, nsFrameLoader::
    ReallyStartLoadingInternal (this=this@entry=0x7fc147afe0)
    at dom/base/nsFrameLoader.cpp:664
664     nsresult nsFrameLoader::ReallyStartLoadingInternal() {
(gdb) p mIsRemoteFrame
$3 = false
(gdb) 
But that's as far as it goes. Beyond that it's the fact that mIsRemoteFrame is false that prevents the InitSessionHistory() from being called, because the relevant code inside ReallyStartLoadingInternal() looks like this:
  if (IsRemoteFrame()) {
    if (!EnsureRemoteBrowser()) {
      NS_WARNING("Couldn't create child process for iframe.");
      return NS_ERROR_FAILURE;
    }
[...]
We can understand this better by also considering what IsRemoteFrame() is doing:
bool nsFrameLoader::IsRemoteFrame() {
  if (mIsRemoteFrame) {
    MOZ_ASSERT(!GetDocShell(), "Found a remote frame with a DocShell");
    return true;
  }
  return false;
}
If it jumps past this, then it looks like the next place where InitSessionHistory() could potentially get called is in nsFrameLoader::MaybeCreateDocShell(). Maybe that's the place it's supposed to happen for non-remote frames. But in that case we're back to the same problem of mPendingBrowsingContext being false that we started with.

Once again, the reason is that there's a parent browsing context:
(gdb) p mPendingBrowsingContext.mRawPtr->mParentWindow.mRawPtr->
    mBrowsingContext.mRawPtr
$8 = (mozilla::dom::BrowsingContext *) 0x7fc0ba3f00
Contrast this with the ESR 78 way of doing things. In that case the call to >nsWebBrowser::Create() is coming from EmbedLiteViewChild::InitGeckoWindow(). There the BrowsingContext is explicitly created detached and so has no parent widget.

The route via LoadURI() in ESR 91 is far less direct. Given this, one potential solution, which I think I rather like, is to explicitly initialise the session history where EmbedLite initialises the window, like this:
void
EmbedLiteViewChild::InitGeckoWindow(const uint32_t parentId,
                                    mozilla::dom::BrowsingContext
                                    *parentBrowsingContext,
                                    const bool isPrivateWindow,
                                    const bool isDesktopMode,
                                    const bool isHidden)
{
[...]
  RefPtr<BrowsingContext> browsingContext = BrowsingContext::CreateDetached
      (nullptr, parentBrowsingContext, nullptr, EmptyString(),
      BrowsingContext::Type::Content);
  browsingContext->SetUsePrivateBrowsing(isPrivateWindow); // Needs to be called before attaching
  browsingContext->EnsureAttached();
  browsingContext->InitSessionHistory();
[...]
  mWebBrowser = nsWebBrowser::Create(mChrome, mWidget, browsingContext,
                                     nullptr);
In this case the session history is created broadly speaking where it would have been in ESR 78. This should be safe given that it isn't initialised at any later time. And the nice thing about it is that it doesn't require any changes to the core gecko code, only to the EmbedLite code.

So, with this change made, it's time to set off the build. It'll take a while to complete, but once it has, we can give it a spin to test it.

For the record, and while the build is running, here is the backtrace for nsFrameLoader::MaybeCreateDocShell():
(gdb) bt
#0  nsFrameLoader::MaybeCreateDocShell (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:2241
#1  0x0000007ff2def6c0 in nsFrameLoader::ReallyStartLoadingInternal
    (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:751
#2  0x0000007ff2defa88 in nsFrameLoader::ReallyStartLoading
    (this=this@entry=0x7fc1476da0)
    at dom/base/nsFrameLoader.cpp:656
#3  0x0000007ff2d25198 in mozilla::dom::Document::
    MaybeInitializeFinalizeFrameLoaders (this=0x7fc11f88d0)
    at dom/base/Document.cpp:9068
#4  0x0000007ff2cceebc in mozilla::detail::RunnableMethodArguments<>::applyImpl
    <mozilla::dom::Document, void (mozilla::dom::Document::*)()>
    (mozilla::dom::Document*, void (mozilla::dom::Document::*)(),
    mozilla::Tuple<>&, std::integer_sequence<unsigned long>)
    (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#5  mozilla::detail::RunnableMethodArguments<>::apply<mozilla::dom::Document,
    void (mozilla::dom::Document::*)()> (m=<optimized out>, o=<optimized out>, 
    this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
[...]
#20 0x0000007ff4e7e564 in js::RunScript (cx=cx@entry=0x7fc0233bd0, state=...)
    at js/src/vm/Interpreter.cpp:395
#21 0x0000007ff4e7e9b0 in js::InternalCallOrConstruct (cx=cx@entry=0x7fc0233bd0,
    args=..., construct=construct@entry=js::NO_CONSTRUCT, 
    reason=reason@entry=js::CallReason::Call) at js/src/vm/Interpreter.cpp:543
If we get the right outcome from this build this backtrace may not be useful any more, but it's worth taking a record just in case.

The build is running, so that's it for today. We'll find out whether it worked or not tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
8 Feb 2024 : Day 150 #
FOSDEM'24 is over and it's back to semi-normal again now. In particular, as promised, I'm going to be back to writing daily dev diaries and hopefully getting the Sailfish Browser into better shape running ESR 91. My first action is going to be to continue in my attempt to get the back and forwards buttons working by fixing the sessionHistory. Then I'll move back up my task stack to return to the Set-Fetch-* headers and DuckDuckGo.

But before jumping back into coding let me also say how great it was to meet so many interesting people — both old friends and new &mdsah; at FOSDEM. The Linux on Mobile stand was buzzing the entire time with a crowd of curious FOSS enthusiasts. I was also really happy to have the chance to talk about these dev diaries in the FOSS on Mobile Devices devroom. In case you missed it there's a video available, as well as the slides and the LaTeX source for the slides.

I'm not going to go over my talk again here, but I will take the opportunity to share two of the images from the slides.

First, to reiterate my thanks for everyone who has been following this dev diary, helping with coding, sharing ideas, performing testing, writing their own code changes (to either this or other packages that gecko relies on), producing images, boosting or liking on Mastodon, commenting on things. I really appreciate it all. I've tried to capture everyone, but I apologise if I manage to miss anyone off.
 
Many, many avatars with names underneath, showing all of the people I could find who I think have interacted with the dev diary, along with the words 'Thank you'

The other graphic I wanted to share summarises my progress to date. This is of course a significant simplification: it's been a lot more messy than this in practice. But it serves as some kind of overview.
 
A timeline that loops backwards and forwards across the page, showing 149 days. Along the timeline at various points the progress details are marked: 45: First successful build; 50: First successful execution; 83: Rendering works; 85: APZ; 90: JS errors; 96: Search; 100: Static prefs; 128: PDF printing; 143: Sec-Fetch-* headers; 149: Session history

As you can see, this takes us right up to today and the session history.

Before this latest two week break I wrote myself some notes to help me get back up to speed when I returned. Here's what my notes say:
 
When I get back I'll be back to posting these dev diaries again. And as a note to myself, when I do, my first action must be to figure out why nsFrameLoader is being created in ESR 78 but not ESR 91. That might hold the key to why the sessionHistory isn't getting called in ESR 91. And as a last resort, I can always revert commit 140a4164598e0c9ed53.

I'm glad I wrote this, because otherwise I'd have completely forgotten the details of what I was working on by now. So let's continue where we left off.

First off I'll try debugging ESR 78 to get a backtrace for the call to the nsFrameLoader constructor.

Unfortunately, although when I place a breakpoint on nsFrameLoader::Create() it does get hit when using ESR 78, it's not possible to extract a backtrace. The debugger just complains and tries to output a core dump.

Happily by placing breakpoints on promising looking methods that call it or its parents, I'm able to step back through the calls and place a breakpoint on nsGenericHTMLFrameElement::GetContentWindow() which, if hit, is guaranteed to then call nsFrameLoader::Create(). And it does git hit. And from this method it's possible to extract a backtrace:
Thread 10 "GeckoWorkerThre" hit Breakpoint 4, nsGenericHTMLFrameElement::
    GetContentWindowInternal (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:100
100       EnsureFrameLoader();
(gdb) bt
#0  nsGenericHTMLFrameElement::GetContentWindowInternal
    (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:100
#1  0x0000007ff39194bc in nsGenericHTMLFrameElement::GetContentWindow
    (this=this@entry=0x7fba4f9560)
    at dom/html/nsGenericHTMLFrameElement.cpp:116
#2  0x0000007ff3518bf0 in mozilla::dom::HTMLIFrameElement_Binding::
    get_contentWindow (cx=0x7fb82263f0, obj=..., void_self=0x7fba4f9560,
    args=...)
    at HTMLIFrameElementBinding.cpp:857
#3  0x0000007ff35b8290 in mozilla::dom::binding_detail::GenericGetter
    <mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::
    binding_detail::ThrowExceptions> (cx=0x7fb82263f0, argc=<optimized out>,
    vp=<optimized out>)
    at dist/include/js/CallArgs.h:245
#4  0x0000007ff4e19610 in CallJSNative (args=..., reason=js::CallReason::Getter, 
    native=0x7ff35b80c0 <mozilla::dom::binding_detail::GenericGetter
    <mozilla::dom::binding_detail::NormalThisPolicy, mozilla::dom::
    binding_detail::ThrowExceptions>(JSContext*, unsigned int, JS::Value*)>,
    cx=0x7fb82263f0)
    at dist/include/js/CallArgs.h:285
#5  js::InternalCallOrConstruct (cx=cx@entry=0x7fb82263f0, args=...,
    construct=construct@entry=js::NO_CONSTRUCT, 
    reason=reason@entry=js::CallReason::Getter) at js/src/vm/Interpreter.cpp:585
#6  0x0000007ff4e1a268 in InternalCall (reason=js::CallReason::Getter, args=...,
    cx=0x7fb82263f0)
    at js/src/vm/Interpreter.cpp:648
#7  js::Call (reason=js::CallReason::Getter, rval=..., args=..., thisv=...,
    fval=..., cx=0x7fb82263f0)
    at js/src/vm/Interpreter.cpp:665
#8  js::CallGetter (cx=cx@entry=0x7fb82263f0, thisv=..., getter=...,
    getter@entry=..., rval=...)
    at js/src/vm/Interpreter.cpp:789
[...]
#44 0x0000007fefa864ac in QThread::exec() () from /usr/lib64/libQt5Core.so.5
#45 0x0000007fefa8b0e8 in ?? () from /usr/lib64/libQt5Core.so.5
#46 0x0000007fef971a4c in ?? () from /lib64/libpthread.so.0
#47 0x0000007fef65b89c in ?? () from /lib64/libc.so.6
(gdb) 
By reading through all of the functions in this backtrace, starting at the top and moving downwards, I need to find out where the ESR 91 code diverges from ESR 78 in a way that means this method doesn't get called with ESR 91.

This involves reading through each of the methods as they are in ESR 91 to see if they're different at any point. If they are I can run ESR 91 to establish whether that difference is the actual cause of the different execution paths between the two.

But before I do that I want to review things again.

On both ESR 78 and ESR 91 the nsFrameLoader::Create() method is called. However on ESR 91 changeset D100348 means that this no longer calls InitSessionHistory().

Instead, and because of this change, on ESR 91 it's the TryRemoteBrowserInternal() method needs to be called in order for the session history to be initialised. on ESR 91 the route to that method appears to be via nsFrameLoader::GetBrowsingContext().

On ESR 78 the GetBrowsingContext() method gets called like this:
Thread 8 "GeckoWorkerThre" hit Breakpoint 3, nsFrameLoader::GetBrowsingContext
    (this=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:3194
3194      if (IsRemoteFrame()) {
(gdb) bt
#0  nsFrameLoader::GetBrowsingContext (this=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:3194
#1  0x0000007fbb5ad250 in mozilla::dom::HTMLIFrameElement::
    MaybeStoreCrossOriginFeaturePolicy (this=0x7f895858f0)
    at dom/html/HTMLIFrameElement.cpp:252
#2  mozilla::dom::HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
    (this=0x7f895858f0)
    at dom/html/HTMLIFrameElement.cpp:241
#3  0x0000007fbb5ad390 in mozilla::dom::HTMLIFrameElement::RefreshFeaturePolicy
    (this=0x7f895858f0, aParseAllowAttribute=aParseAllowAttribute@entry=true)
    at dom/html/HTMLIFrameElement.cpp:329
#4  0x0000007fbb5ad4b8 in mozilla::dom::HTMLIFrameElement::BindToBrowsingContext
    (this=<optimized out>, aBrowsingContext=<optimized out>)
    at dom/html/HTMLIFrameElement.cpp:72
#5  0x0000007fbc7abf60 in mozilla::dom::BrowsingContext::Embed
    (this=<optimized out>)
    at docshell/base/BrowsingContext.cpp:505
#6  0x0000007fbaae320c in nsFrameLoader::MaybeCreateDocShell (this=0x7f89886f80)
    at dist/include/mozilla/RefPtr.h:313
#7  0x0000007fbaae5468 in nsFrameLoader::ReallyStartLoadingInternal
    (this=this@entry=0x7f89886f80)
    at dom/base/nsFrameLoader.cpp:612
#8  0x0000007fbaae5864 in nsFrameLoader::ReallyStartLoading (
dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
On ESR 91 the GetBrowsingContext() method never seems to get called at all. But I've put breakpoints on most of the other places in the backtrace to check whether they get hit on ESR 91:
Breakpoint 3 at 0x7ff3925d4c: file dom/html/HTMLIFrameElement.cpp, line 221.
(gdb) b HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
Note: breakpoint 3 also set at pc 0x7ff3925d4c.
Breakpoint 4 at 0x7ff3925d4c: file dom/html/HTMLIFrameElement.cpp, line 221.
(gdb) b HTMLIFrameElement::RefreshFeaturePolicy
Breakpoint 5 at 0x7ff3925fb0: file dom/html/HTMLIFrameElement.cpp, line 266.
(gdb) b HTMLIFrameElement::BindToBrowsingContext
Breakpoint 6 at 0x7ff3926134: file dom/html/HTMLIFrameElement.cpp, line 70.
(gdb) b BrowsingContext::Embed
Breakpoint 7 at 0x7ff4ab2b0c: file ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    mozilla/RefPtr.h, line 313.
(gdb) b nsFrameLoader::MaybeCreateDocShell
Breakpoint 8 at 0x7ff2dedc98: file dom/base/nsFrameLoader.cpp, line 2179.
(gdb) b  nsFrameLoader::ReallyStartLoadingInternal
Breakpoint 9 at 0x7ff2def63c: file dom/base/nsFrameLoader.cpp, line 664.
(gdb) info break
Num Type       Disp Enb What
2   breakpoint keep y   in nsFrameLoader::GetBrowsingContext() 
                        at dom/base/nsFrameLoader.cpp:3489
3   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           MaybeStoreCrossOriginFeaturePolicy() 
                        at dom/html/HTMLIFrameElement.cpp:221
4   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           MaybeStoreCrossOriginFeaturePolicy() 
                        at dom/html/HTMLIFrameElement.cpp:221
5   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           RefreshFeaturePolicy(bool) 
                        at dom/html/HTMLIFrameElement.cpp:266
6   breakpoint keep y   in mozilla::dom::HTMLIFrameElement::
                           BindToBrowsingContext(mozilla::dom::BrowsingContext*) 
                        at dom/html/HTMLIFrameElement.cpp:70
7   breakpoint keep y   in mozilla::dom::BrowsingContext::Embed() 
                        at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/
                           RefPtr.h:313
8   breakpoint keep y   in nsFrameLoader::MaybeCreateDocShell() 
                        at dom/base/nsFrameLoader.cpp:2179
9   breakpoint keep y   in nsFrameLoader::ReallyStartLoadingInternal() 
                        at dom/base/nsFrameLoader.cpp:664
(gdb) r
And one of them does get hit:
Thread 10 "GeckoWorkerThre" hit Breakpoint 9, nsFrameLoader::
    ReallyStartLoadingInternal (this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:664
664     nsresult nsFrameLoader::ReallyStartLoadingInternal() {
(gdb) c
Continuing.
Let's see what happens next...
Thread 10 "GeckoWorkerThre" hit Breakpoint 8, nsFrameLoader::
    MaybeCreateDocShell(this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:2179
2179    nsresult nsFrameLoader::MaybeCreateDocShell() {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 7, mozilla::dom::BrowsingContext::
    Embed (this=0x7fc161aa70)
    at docshell/base/BrowsingContext.cpp:711
711       if (auto* frame = HTMLIFrameElement::FromNode(mEmbedderElement)) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 6, mozilla::dom::HTMLIFrameElement::
    BindToBrowsingContext (this=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:70
70      void HTMLIFrameElement::BindToBrowsingContext(BrowsingContext*) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 5, mozilla::dom::HTMLIFrameElement::
    RefreshFeaturePolicy (this=0x7fc14db4c0, 
    aParseAllowAttribute=aParseAllowAttribute@entry=true) at
    dom/html/HTMLIFrameElement.cpp:266
266     void HTMLIFrameElement::RefreshFeaturePolicy(bool aParseAllowAttribute) {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 3, mozilla::dom::HTMLIFrameElement::
    MaybeStoreCrossOriginFeaturePolicy (this=this@entry=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:221
221     void HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy() {
(gdb) c
Continuing.

Thread 10 "GeckoWorkerThre" hit Breakpoint 2, nsFrameLoader::GetBrowsingContext
    (this=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:3489
3489      if (mNotifyingCrash) {
(gdb) 
As a quick reminder, this is what the inside of this method looks like:
BrowsingContext* nsFrameLoader::GetBrowsingContext() {
  if (mNotifyingCrash) {
    if (mPendingBrowsingContext && mPendingBrowsingContext->EverAttached()) {
      return mPendingBrowsingContext;
    }
    return nullptr;
  }
  if (IsRemoteFrame()) {
    Unused << EnsureRemoteBrowser();
  } else if (mOwnerContent) {
    Unused << MaybeCreateDocShell();
  }
  return GetExtantBrowsingContext();
}
Let's compare this to the relevant variable values.
(gdb) p mNotifyingCrash
$1 = false
(gdb) p mIsRemoteFrame
$2 = false
(gdb) p mOwnerContent
$3 = (mozilla::dom::Element *) 0x7fc14db4c0
(gdb) 
With these values IsRemoteFrame() will return false and the MaybeCreateDocShell() path will be entered, rather than the EnsureRemoteBrowser() path that we want.

In ESR 78 we have a slightly different version of this method:
BrowsingContext* nsFrameLoader::GetBrowsingContext() {
  if (IsRemoteFrame()) {
    Unused << EnsureRemoteBrowser();
  } else if (mOwnerContent) {
    Unused << MaybeCreateDocShell();
  }
  return GetExtantBrowsingContext();
}
But given the values of the variables, we'll get the same result:
(gdb) p mIsRemoteFrame
$1 = false
(gdb) p mOwnerContent
$2 = (
dwarf2read.c:10473: internal-error: process_die_scope::process_die_scope
    (die_info*, dwarf2_cu*): Assertion `!m_die->in_process' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) n
(gdb) p (mOwnerContent != 0)
$3 = true
(gdb) 
It's beginning to look like the problem here is that IsRemoteFrame() is always returning false, so that the code we want to get called never gets called.

Having said that, there is also a reference to InitSessionHistory() in MaybeCreateDocShell() so we ought to check that too:
Thread 10 "GeckoWorkerThre" hit Breakpoint 8, nsFrameLoader::MaybeCreateDocShell
    (this=this@entry=0x7fc160da30)
    at dom/base/nsFrameLoader.cpp:2179
2179    nsresult nsFrameLoader::MaybeCreateDocShell() {
(gdb) n
2180      if (GetDocShell()) {
(gdb) 
nsFrameLoader::GetBrowsingContext (this=0x7fc160da30) at dom/base/
    nsFrameLoader.cpp:3500
3500      return GetExtantBrowsingContext();
(gdb) 
mozilla::dom::HTMLIFrameElement::MaybeStoreCrossOriginFeaturePolicy
    (this=this@entry=0x7fc14db4c0)
    at dom/html/HTMLIFrameElement.cpp:232
232       RefPtr<BrowsingContext> browsingContext = mFrameLoader->
    GetBrowsingContext();
(gdb) 
234       if (!browsingContext || !browsingContext->IsContentSubframe()) {
(gdb) 
238       if (ContentChild* cc = ContentChild::GetSingleton()) {
(gdb) 
232       RefPtr<BrowsingContext> browsingContext = mFrameLoader->
    GetBrowsingContext();
(gdb) 
nsFrameLoader::MaybeCreateDocShell (this=this@entry=0x7fc160da30) at
    dom/base/nsFrameLoader.cpp:2237
2237      InvokeBrowsingContextReadyCallback();
(gdb) n
2239      mIsTopLevelContent = mPendingBrowsingContext->IsTopContent();
(gdb) 
2241      if (mIsTopLevelContent) {
(gdb) 
2252      nsCOMPtr<nsIDocShellTreeOwner> parentTreeOwner;
(gdb) 
1363    ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
    No such file or directory.
(gdb) 
859     in ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h
(gdb) 
2258      RefPtr<EventTarget> chromeEventHandler;
(gdb) 
2259      bool parentIsContent = parentDocShell->GetBrowsingContext()->
    IsContent();
(gdb) 
2260      if (parentIsContent) {
(gdb) 
2263        parentDocShell->GetChromeEventHandler(getter_AddRefs
    (chromeEventHandler));
(gdb) 
289     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:
    No such file or directory.
(gdb) 
2278      nsCOMPtr<nsPIDOMWindowOuter> newWindow = docShell->GetWindow();
(gdb) n
2285      newWindow->SetFrameElementInternal(mOwnerContent);
(gdb) 
2288      if (mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
(gdb) 
2295      if (!docShell->Initialize()) {
(gdb) 
2301      NS_ENSURE_STATE(mOwnerContent);
(gdb) 
We've now reached this condition, which we really want the programme counter to enter:
  // If we are an in-process browser, we want to set up our session history.
  if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
      !mOwnerContent->HasAttr(kNameSpaceID_None, nsGkAtoms::disablehistory)) {
    // XXX(nika): Set this up more explicitly?
    mPendingBrowsingContext->InitSessionHistory();
  }
But it isn't to be:
(gdb) p mIsTopLevelContent 
$4 = false
(gdb) n
2304      if (mIsTopLevelContent && mOwnerContent->IsXULElement(nsGkAtoms::browser) &&
(gdb) 
2315      HTMLIFrameElement* iframe = HTMLIFrameElement::FromNode(mOwnerContent);
(gdb) 
The value of the mIsTopLevelContent comes from earlier in the same method:
  mIsTopLevelContent = mPendingBrowsingContext->IsTopContent();
Checking in BrowsingContext.h we see this:
  bool IsTopContent() const { return IsContent() && IsTop(); }
[...]
  bool IsContent() const { return mType == Type::Content; }
[...]
  bool IsTop() const { return !GetParent(); }
And to round this off, the GetParent() method is defined in BrowsingContext.cpp like this:
BrowsingContext* BrowsingContext::GetParent() const {
  return mParentWindow ? mParentWindow->GetBrowsingContext() : nullptr;
}
This call to IsTop() matches the call that the call to InitSessionHistory() is conditioned on elsewhere as well. Let's check the relevant values:
(gdb) p mPendingBrowsingContext.mRawPtr->mType
$5 = mozilla::dom::BrowsingContext::Type::Content
(gdb) p mPendingBrowsingContext.mRawPtr->mParentWindow
$6 = {mRawPtr = 0x7fc11bb430}
(gdb) 
This means that GetParent() is returning a non-null value, hence IsTo() must be returning false.

Which is why the session isn't being initialised here.

That's enough digging for today. More tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment