flypig.co.uk

Gecko-dev Diary

Starting in August 2023 I'll be upgrading the Sailfish OS browser from Gecko version ESR 78 to ESR 91. This page catalogues my progress.

Latest code changes are in the gecko-dev sailfishos-esr91 branch.

There is an index of all posts in case you want to jump to a particular day.


5 most recent items

24 Jun 2024 : Day 268 #
Today hasn't quite been the day of development I was planning. That's okay, it happens sometimes, and while I've not been doing development, the sun has been shining and nature has been making its lazy hum. It's not been bad to take the opportunity to relax.

What's more, my day was made emphatically better by receiving this Gecko-dev related poem from Leif-Jöran Olsson (ljo) on Mastodon:

Summer solstice and a supporting full moon ends the code removal phase. A sea of browser backtrace ejects gives support for switching to incremental introduction of nibbles of code. The WebGL context path is buried together with any remaining anxiety. The energy collected awakens the concavenator to pair up in dynamic duo with flypig's rejuvenated gecko.

Genuine art! It sums up where things are at nicely. As you may recall, I've recently restored GLScreenBuffer alongside a minimal set of changes (the convex hull of its dependencies) needed to get the build to compile. Partial compile, that is.

I kicked off a build overnight, but by the morning it had hit some errors. They look like this:
[...]
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::GetPlatformImage(const std::
    function<void(void*, int, int)>&)’:
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:227:37: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘Screen’
254:42.53    GLScreenBuffer* screen = context->Screen();
254:42.53                                      ^~~~~~

254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void* mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::GetPlatformImage(int*, int*)’:
254:42.53 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:257:37: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘Screen’
254:42.53    GLScreenBuffer* screen = context->Screen();
254:42.53                                      ^~~~~~

254:44.41 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694: 
    EmbedLiteCompositorBridgeParent.o] Error 1
In this and the following error output I've added some newlines to try to separate out the errors and hopefully make them a little clearer.

All of these errors amount to the same thing and will be easy to fix. The necessary change is to restore the GLContext::Screen() method, which I've now done before setting the build off again. The error presumably got past the partial build because the call is made from inside EmbedLiteCompositorBridgeParent.cpp which, as I also discussed yesterday, doesn't get touched by the partial build.

It was a pretty obvious error and someone more astute than I am could certainly have picked it up just by observation, without the need to do the build. But it's also easy when working with compiled languages to rely on the compiler to pick these kinds of errors up. So I missed it and it lost me some time.

My second build failed as well, this time due to the following variable being missing from the GLContext class:
  UniquePtr<GLScreenBuffer> mScreen;
In my defence I had added it, but it got removed again while performing a git checkout -d command to restore the Screen() method. It's a poor defence, but it's how it went down.
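To make the link between these two failures concrete, here's a minimal sketch of how an accessor like Screen() depends on its backing member. The classes below are simplified stand-ins rather than the real Gecko ones, and CreateScreenBuffer() is a hypothetical helper that exists only for this example.

```cpp
#include <memory>

// Simplified stand-in; not the real mozilla::gl::GLScreenBuffer.
struct GLScreenBuffer {
  int width = 0;
  int height = 0;
};

// Simplified stand-in; not the real mozilla::gl::GLContext.
class GLContext {
 public:
  // The accessor only compiles if the mScreen member below exists.
  GLScreenBuffer* Screen() const { return mScreen.get(); }

  // Hypothetical helper, used here only to populate the member.
  void CreateScreenBuffer(int width, int height) {
    mScreen = std::make_unique<GLScreenBuffer>();
    mScreen->width = width;
    mScreen->height = height;
  }

 private:
  std::unique_ptr<GLScreenBuffer> mScreen;
};
```

Restoring the method without the member (or the other way around) leaves the class in a state the compiler will reject, which is the situation the second build ran into.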

So I'm now on to my third build of the day. So far so good, I'm hoping it'll complete before bed-time so as to give me the chance to test it.

Frustratingly it gets all the way to the linker before it fails again.
394:09.19 toolkit/library/build/libxul.so

401:01.08 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    ../../../gfx/gl/Unified_cpp_gfx_gl0.o: in function `mozilla::gl::
    SurfaceFactory::NewTexClient(mozilla::gfx::IntSizeTyped<mozilla::gfx::
    UnknownUnits> const&)':
401:01.10 ${PROJECT}/gecko-dev/gfx/gl/SharedSurface.cpp:204: undefined 
    reference to `mozilla::layers::SharedSurfaceTextureClient::Create(mozilla::
    UniquePtr<mozilla::gl::SharedSurface, mozilla::DefaultDelete<mozilla::gl::
    SharedSurface> >, mozilla::gl::SurfaceFactory*, mozilla::layers::
    LayersIPCChannel*, mozilla::layers::TextureFlags)'

401:01.10 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    libxul.so: hidden symbol 
    `_ZN7mozilla6layers26SharedSurfaceTextureClient6CreateENS_9UniquePtr
    INS_2gl13SharedSurfaceENS_13DefaultDeleteIS4_EEEEPNS3_14SurfaceFactory
    EPNS0_16LayersIPCChannelENS0_12TextureFlagsE' isn't defined

401:01.10 /home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/
    SailfishOS-devel-aarch64.default/opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    final link failed: bad value
The problem here is a method that's being declared in a header but not implemented in the source file. By carefully working through the error output we can see that the missing code is the implementation for SharedSurfaceTextureClient::Create(). Here's the method shown in the error message, but cleaned up and reformatted to make things clearer:
SharedSurfaceTextureClient::Create(
    UniquePtr<SharedSurface, DefaultDelete<SharedSurface> >,
    SurfaceFactory*,
    LayersIPCChannel*,
    TextureFlags
)
We can also see from the error messages that it's being called here:
  RefPtr<layers::SharedSurfaceTextureClient> ret;
  ret = layers::SharedSurfaceTextureClient::Create(std::move(surf), this,
                                                   mAllocator, mFlags);
In TextureClientSharedSurface.h we can see the method signature in the header. The fact there's a signature is the reason the compiler didn't notice and it wasn't until the linker that the error was uncovered:
class SharedSurfaceTextureClient : public TextureClient {
 public:
[...]
  static already_AddRefed<SharedSurfaceTextureClient> Create(
      UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory,
      LayersIPCChannel* aAllocator, TextureFlags aFlags);
[...]
};
But the implementation is indeed missing from TextureClientSharedSurface.cpp. We can get the implementation that we were using before using git diff, which gives us the following:
$ git diff
[...]
-already_AddRefed<SharedSurfaceTextureClient> SharedSurfaceTextureClient::
    Create(
-    UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory,
-    LayersIPCChannel* aAllocator, TextureFlags aFlags) {
-  if (!surf) {
-    return nullptr;
-  }
-  TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(
    );
-  SharedSurfaceTextureData* data =
-      new SharedSurfaceTextureData(std::move(surf));
-  return MakeAndAddRef<SharedSurfaceTextureClient>(data, flags, aAllocator);
-}
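The way a missing definition slips past the compiler and only surfaces at link time can be sketched in a few lines. This is a minimal illustration using hypothetical stand-in types (ExampleSurface, ExampleTextureClient), not the real Gecko classes, and the flag handling is reduced to plain integers.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins for illustration; not the real Gecko types.
struct ExampleSurface {
  int flags = 0;
};

struct ExampleTextureClient {
  int flags = 0;

  // Declaration only. This is all a caller needs in order to compile.
  static std::unique_ptr<ExampleTextureClient> Create(
      std::unique_ptr<ExampleSurface> surf, int extraFlags);
};

// Definition. Without it the code still compiles, but the linker has
// nothing to resolve the call against and reports an undefined reference.
std::unique_ptr<ExampleTextureClient> ExampleTextureClient::Create(
    std::unique_ptr<ExampleSurface> surf, int extraFlags) {
  if (!surf) {
    return nullptr;
  }
  auto client = std::make_unique<ExampleTextureClient>();
  client->flags = surf->flags | extraFlags;
  return client;
}
```

If the definition at the bottom were deleted, every caller would still compile against the declaration and the toolchain would only complain at the link step, much like the libxul.so failure above.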
There's also a mangled method name appearing in the error output. We can demangle it to try to find out if this is something separate we need to fix:
$ c++filt '_ZN7mozilla6layers26SharedSurfaceTextureClient6CreateENS_9
    UniquePtrINS_2gl13SharedSurfaceENS_13DefaultDeleteIS4_EEEEPNS3_14
    SurfaceFactoryEPNS0_16LayersIPCChannelENS0_12TextureFlagsE'
mozilla::layers::SharedSurfaceTextureClient::Create(mozilla::UniquePtr<mozilla::
    gl::SharedSurface, mozilla::DefaultDelete<mozilla::gl::SharedSurface> >, 
    mozilla::gl::SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::
    layers::TextureFlags)
Cleaning that up, we get this:
SharedSurfaceTextureClient::Create(
    UniquePtr<SharedSurface, DefaultDelete<SharedSurface> >,
    SurfaceFactory*,
    LayersIPCChannel*,
    TextureFlags
)
Having demangled and cleaned it up, it's clear this is the same error as before, so nothing more to do on this front.
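The same demangling can be done from inside a program. Here's a minimal sketch using abi::__cxa_demangle() from <cxxabi.h>, which is available with GCC and Clang; the Demangle() wrapper is a name I've made up for the example, and it simply hands back the input when the name can't be demangled.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if the name can't be demangled.
std::string Demangle(const std::string& mangled) {
  int status = 0;
  // With a null output buffer the function allocates the result with
  // malloc(), so it has to be released with free().
  char* demangled =
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || demangled == nullptr) {
    return mangled;
  }
  std::string result(demangled);
  std::free(demangled);
  return result;
}
```

Passing in the hidden symbol name from the linker output should give the same text as the c++filt call, since no type definitions are needed to demangle a name.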

After making these fixes and running the partial build again, it now throws up the following error:
In file included from Unified_cpp_gfx_layers6.cpp:128:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp: In 
    static member function ‘static already_AddRefed<mozilla::layers::
    SharedSurfaceTextureClient> mozilla::layers::SharedSurfaceTextureClient::
    Create(mozilla::UniquePtr<mozilla::gl::SharedSurface>, mozilla::gl::
    SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::layers::
    TextureFlags)’:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:102:63: 
    error: ‘class mozilla::gl::SharedSurface’ has no member named 
    ‘GetTextureFlags’
   TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(
    );
To fix this I need to add in the removed GetTextureFlags() method to SharedSurface.cpp and the related signature in the SharedSurface.h header:
-  // Specifies to the TextureClient any flags which
-  // are required by the SharedSurface backend.
-  virtual layers::TextureFlags GetTextureFlags() const;
[...]
-layers::TextureFlags SharedSurface::GetTextureFlags() const {
-  return layers::TextureFlags::NO_FLAGS;
-}
With this change the partial build finally goes through, including the final linking stage. But I'll still need to run the full build again before I can test anything. So I've kicked it off. There's no way it'll complete before the morning, so that'll have to be it for the day.
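For anyone wondering how the restored method feeds into the flags calculation in Create(), here's a minimal sketch. The flag values and the EXAMPLE_BACKEND_FLAG name are hypothetical, as are the SharedSurfaceCustom class and the CombineFlags() helper; only the shape of the calculation mirrors the code above.

```cpp
#include <cstdint>

// Hypothetical flag values, for illustration only.
enum class TextureFlags : uint32_t {
  NO_FLAGS = 0,
  RECYCLE = 1u << 0,
  EXAMPLE_BACKEND_FLAG = 1u << 1,
};

constexpr TextureFlags operator|(TextureFlags a, TextureFlags b) {
  return static_cast<TextureFlags>(static_cast<uint32_t>(a) |
                                   static_cast<uint32_t>(b));
}

// Simplified stand-in; not the real mozilla::gl::SharedSurface.
class SharedSurface {
 public:
  virtual ~SharedSurface() = default;
  // Default implementation: the backend asks for nothing extra.
  virtual TextureFlags GetTextureFlags() const {
    return TextureFlags::NO_FLAGS;
  }
};

// Hypothetical backend that requires an additional flag.
class SharedSurfaceCustom : public SharedSurface {
 public:
  TextureFlags GetTextureFlags() const override {
    return TextureFlags::EXAMPLE_BACKEND_FLAG;
  }
};

// Mirrors the shape of the expression used in Create().
TextureFlags CombineFlags(TextureFlags requested, const SharedSurface& surf) {
  return requested | TextureFlags::RECYCLE | surf.GetTextureFlags();
}
```

Because the base class returns NO_FLAGS, ORing in its result changes nothing unless a backend overrides the method.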

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
23 Jun 2024 : Day 267 #
I have a bit more time for development today than I had yesterday, so I'm hoping I can properly follow up on this issue I noticed yesterday with the library working or not, depending on which version of the package is installed.

As part of this, I want to explore what happens when I run a configuration with the "working WebGL" packages (i.e. the ones with all of the changes from my latest commit reverted), plus my latest library, but also running the WebView rather than the browser.

I'm expecting this to fail, but it'll be interesting to see where.

[...]

And it does fail. But now I have a backtrace to inspect from it and it's a lot more interesting than the backtraces from the Wayland failure we've been getting so often recently. Here's the backtrace:
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 9220]
0x0000007ff110864c in mozilla::gl::SwapChain::OffscreenSize (this=<optimized 
    out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No 
    such file or directory.
(gdb) bt
#0  0x0000007ff110864c in mozilla::gl::SwapChain::OffscreenSize (
    this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#1  0x0000007ff3666230 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4ad76f0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
#2  0x0000007ff12b64d8 in mozilla::layers::CompositorVsyncScheduler::
    ForceComposeToTarget (this=0x7fc4c39f60, aTarget=aTarget@entry=0x0, 
    aRect=aRect@entry=0x0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:
    82
#3  0x0000007ff12b6534 in mozilla::layers::CompositorBridgeParent::
    ResumeComposition (this=this@entry=0x7fc4ad76f0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#4  0x0000007ff12b65c0 in mozilla::layers::CompositorBridgeParent::
    ResumeCompositionAndResize (this=0x7fc4ad76f0, x=<optimized out>, 
    y=<optimized out>, 
    width=<optimized out>, height=<optimized out>)
    at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:794
#5  0x0000007ff12af15c in mozilla::detail::RunnableMethodArguments<int, int, 
    int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla:
    :layers::CompositorBridgeParent::*)(int, int, int, int), 
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 
    StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 
    2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#6  mozilla::detail::RunnableMethodArguments<int, int, int, int>::apply<mozilla:
    :layers::CompositorBridgeParent, void (mozilla::layers::
    CompositorBridgeParent::*)(int, int, int, int)> (m=<optimized out>, 
    o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#7  mozilla::detail::RunnableMethodImpl<mozilla::layers::
    CompositorBridgeParent*, void (mozilla::layers::CompositorBridgeParent::*)(
    int, int, int, int), true, (mozilla::RunnableKind)0, int, int, int, int>::
    Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
#8  0x0000007ff0801ab8 in nsThread::ProcessNextEvent (this=0x7fc4c01730, 
    aMayWait=<optimized out>, aResult=0x7f1796bcb7)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:869
#9  0x0000007ff07f098c in NS_ProcessNextEvent (aThread=<optimized out>, 
    aThread@entry=0x7fc4c01730, aMayWait=aMayWait@entry=false)
    at ${PROJECT}/gecko-dev/xpcom/threads/nsThreadUtils.cpp:466
#10 0x0000007ff0bbcab0 in mozilla::ipc::MessagePumpForNonMainThreads::Run (
    this=0x7edc001840, aDelegate=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/glue/MessagePump.cpp:300
#11 0x0000007ff0b7b87c in MessageLoop::RunInternal (
    this=this@entry=0x7f1796bdc0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#12 0x0000007ff0b7bac0 in MessageLoop::RunHandler (this=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:352
#13 MessageLoop::Run (this=this@entry=0x7f1796bdc0)
    at ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:334
#14 0x0000007ff08034b8 in nsThread::ThreadFunc (aArg=0x7fc4c018d0)
    at ${PROJECT}/gecko-dev/xpcom/threads/nsThread.cpp:392
#15 0x0000007feca419f0 in ?? () from /usr/lib64/libnspr4.so
#16 0x0000007fefd05a4c in ?? () from /lib64/libpthread.so.0
#17 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
While we're here, let's do a little exploration into why this crash occurred using the debugger.
(gdb) frame 1
#1  0x0000007ff3666230 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    CompositeToDefaultTarget (this=0x7fc4ad76f0, aId=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290
290     ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No 
    such file or directory.
(gdb) p context
$1 = (mozilla::gl::GLContext *) 0x7edc19ede0
(gdb) p context->mSwapChain
$2 = {
  mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, 
    mozilla::DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::
    StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::
    DefaultDelete<mozilla::gl::SwapChain>> = {<No data fields>}, mFirstA = 
    0x7edc1ce070}, <No data fields>}}
(gdb) p context->mSwapChain.mTuple
$3 = {<mozilla::detail::CompactPairHelper<mozilla::gl::SwapChain*, mozilla::
    DefaultDelete<mozilla::gl::SwapChain>, (mozilla::detail::StorageType)1, (
    mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::
    SwapChain>> = {<No data fields>}, mFirstA = 0x7edc1ce070}, <No data fields>}
(gdb) p context->mSwapChain.mTuple.mFirstA
$4 = (mozilla::gl::SwapChain *) 0x7edc1ce070
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter
$5 = (mozilla::gl::SwapChainPresenter *) 0x7edc1a1300
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$6 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
(gdb) 
What's this telling us? Well, it's very similar to the crash we got back on Day 177 when we first started trying out the WebView. The SwapChain is being created and accessed, but the problem occurs deeper inside the object. It's the SharedSurface back buffer that's not been set: a smart pointer stored inside the SwapChainPresenter object, which hangs off the SwapChain, which is itself stored inside a smart pointer in the GLContext:
(gdb) p context->mSwapChain.mTuple.mFirstA->mPresenter->mBackBuffer
$6 = std::shared_ptr<mozilla::gl::SharedSurface> (empty) = {get() = 0x0}
This might be an initialisation issue, or it might be more involved. It's not quite the same as what was happening on Day 177 since the code is different this time. But the underlying issue is the same.
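The chain of pointers the debugger walked through can be sketched using standard smart pointers. The types below are simplified stand-ins, not the real Gecko classes, and the guarded OffscreenSize() is a hypothetical variant written just to show where the empty pointer bites. It isn't a proposed fix, since the real question is why the back buffer never gets created.

```cpp
#include <memory>

// Simplified stand-ins for illustration; not the real Gecko types.
struct SharedSurface {
  int width = 0;
  int height = 0;
};

struct SwapChainPresenter {
  // Empty until a surface is attached.
  std::shared_ptr<SharedSurface> mBackBuffer;
};

struct SwapChain {
  // Non-owning pointer in this sketch.
  SwapChainPresenter* mPresenter = nullptr;

  // Hypothetical guarded variant: reports failure rather than
  // dereferencing an empty back buffer.
  bool OffscreenSize(int* width, int* height) const {
    if (!mPresenter || !mPresenter->mBackBuffer) {
      return false;
    }
    *width = mPresenter->mBackBuffer->width;
    *height = mPresenter->mBackBuffer->height;
    return true;
  }
};

struct GLContext {
  std::unique_ptr<SwapChain> mSwapChain;
};
```

With every link in the chain present except the back buffer, which is the state gdb shows, the unguarded equivalent would dereference a null pointer.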

To be honest, this is just what I'd expect. But it also tells us that this whole process hasn't been in vain: cutting out things brought us to a similar point to before, but we're closer to resolving both the WebGL and WebView issues this time.

The next step is to establish whether the new SwapChain is getting used. I'd previously thought it was never used by the browser, but I have a new perspective now: although it's not used when rendering general web pages, maybe it's used when rendering WebGL within a page? Most pages don't do this, but when they do, I'm now expecting there to be some offscreen rendering.

I've placed a breakpoint on the SwapChain constructor. To start with, here's where the SwapChain gets created when using a WebView component. This is for comparison, captured using the latest code:
=============== Preparing offscreen rendering context ===============
[Switching to LWP 9891]

Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::SwapChain::
    SwapChain (this=0x7ee01ce090)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
63      ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h: No such file or directory.
(gdb) bt
#0  mozilla::gl::SwapChain::SwapChain (this=0x7ee01ce090)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
#1  0x0000007ff3666ac0 in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    PrepareOffscreen (this=this@entry=0x7fc4b01c50)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#2  0x0000007ff3666b7c in mozilla::embedlite::EmbedLiteCompositorBridgeParent::
    AllocPLayerTransactionParent (this=0x7fc4b01c50, aBackendHints=..., 
    aId=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:90
#3  0x0000007ff0c63d90 in mozilla::layers::PCompositorBridgeParent::
    OnMessageReceived (this=0x7fc4b01c50, msg__=...) at 
    PCompositorBridgeParent.cpp:1285
[...]
#18 0x0000007ff6a0289c in ?? () from /lib64/libc.so.6
(gdb) 
As we can see from this, it's created inside the EmbedLiteCompositorBridgeParent::PrepareOffscreen() method. Here's what the code looks like that's creating it, for reference:
void
EmbedLiteCompositorBridgeParent::PrepareOffscreen()
{
  fprintf(stderr, "=============== Preparing offscreen rendering context 
    ===============\n");

  const CompositorBridgeParent::LayerTreeState* state = CompositorBridgeParent::
    GetIndirectShadowTree(RootLayerTreeId());
  NS_ENSURE_TRUE(state && state->mLayerManager, );

  GLContext* context = static_cast<CompositorOGL*>(
    state->mLayerManager->GetCompositor())->gl();
  NS_ENSURE_TRUE(context, );

  // TODO: The switch from GLSCreenBuffer to SwapChain needs completing
  // See: https://phabricator.services.mozilla.com/D75055
  if (context->IsOffscreen()) {
    UniquePtr<SurfaceFactory> factory;
    if (context->GetContextType() == GLContextType::EGL) {
      // [Basic/OGL Layers, OMTC] WebGL layer init.
      factory = SurfaceFactory_EGLImage::Create(*context);
    } else {
      // [Basic Layers, OMTC] WebGL layer init.
      // Well, this *should* work...
      factory = MakeUnique<SurfaceFactory_Basic>(*context);
    }

    SwapChain* swapChain = context->GetSwapChain();
    if (swapChain == nullptr) {
      swapChain = new SwapChain();
      new SwapChainPresenter(*swapChain);
      context->mSwapChain.reset(swapChain);
    }

    if (factory) {
      swapChain->Morph(std::move(factory));
    }
  }
}
Now I want to know whether it's ever used by the browser using an execution flow that doesn't depend on EmbedLite.

When I render a website without WebGL (e.g. the Jolla site) the constructor goes unused. But if I visit a site that uses WebGL (e.g. my personal website where the animated background is generated using a WebGL shader) it does get hit. It comes with a crazy long backtrace that shows it's happening inside a DOM element, which is again what I'd expect. I've chopped quite a lot out from the below backtrace, but still kept the parts I think are most relevant:
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::gl::SwapChain::
    SwapChain (this=0x7fc9ce3588)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
63      ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h: No such file or directory.
(gdb) bt
#0  mozilla::gl::SwapChain::SwapChain (this=0x7fc9ce3588)
    at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:63
#1  0x0000007ff369a54c in mozilla::WebGLContext::WebGLContext (
    this=0x7fc9ce30f0, host=..., desc=...)
    at include/c++/8.3.0/bits/move.h:74
#2  0x0000007ff36a9c90 in mozilla::WebGLContext::<lambda()>::operator() (
    __closure=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h:33
#3  mozilla::WebGLContext::Create (host=..., desc=..., 
    out=out@entry=0x7fcb9660c8)
    at ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:562
#4  0x0000007ff3661920 in mozilla::HostWebGLContext::Create (ownerData=..., 
    desc=..., out=out@entry=0x7fcb9660c8)
    at ${PROJECT}/gecko-dev/dom/canvas/HostWebGLContext.cpp:59
#5  0x0000007ff3691374 in mozilla::ClientWebGLContext::<lambda()>::operator() (
    __closure=<optimized out>)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:625
#6  mozilla::ClientWebGLContext::CreateHostContext (
    this=this@entry=0x7fc9991820, requestedSize=...)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:654
#7  0x0000007ff3691e5c in mozilla::ClientWebGLContext::SetDimensions (
    this=0x7fc9991820, signedWidth=<optimized out>, signedHeight=<optimized 
    out>)
    at ${PROJECT}/gecko-dev/dom/canvas/ClientWebGLContext.cpp:563
#8  0x0000007ff362b27c in mozilla::dom::CanvasRenderingContextHelper::
    UpdateContext (this=0x7e6036c790, aCx=<optimized out>, 
    aNewContextOptions=...,
    aRvForDictionaryInit=...)
    at ${PROJECT}/gecko-dev/dom/canvas/CanvasRenderingContextHelper.cpp:238
#9  0x0000007ff363a348 in mozilla::dom::CanvasRenderingContextHelper::
    GetContext (this=this@entry=0x7e6036c790, aCx=0x7fc81defd0, aContextId=...,
    aContextOptions=..., aRv=...)
    at ${PROJECT}/gecko-dev/dom/canvas/CanvasRenderingContextHelper.cpp:190
#10 0x0000007ff390bf18 in mozilla::dom::HTMLCanvasElement::GetContext (
    this=this@entry=0x7e6036c710, aCx=aCx@entry=0x7fc81defd0, aContextId=...,
    aContextOptions=aContextOptions@entry=..., aRv=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/Value.h:670
#11 0x0000007ff3549764 in mozilla::dom::HTMLCanvasElement_Binding::getContext (
    cx=0x7fc81defd0, obj=..., void_self=0x7e6036c710, args=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/RootingAPI.h:1297
#12 0x0000007ff35e0bec in mozilla::dom::binding_detail::GenericMethod<mozilla::
    dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::
    ThrowExceptions> (cx=0x7fc81defd0, argc=<optimized out>, vp=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:207
#13 0x0000007ff4e7d5d4 in CallJSNative (args=..., reason=js::CallReason::Call,
    native=0x7ff35e09ac <mozilla::dom::binding_detail::GenericMethod<mozilla::
    dom::binding_detail::NormalThisPolicy, mozilla::dom::binding_detail::
    ThrowExceptions>(JSContext*, unsigned int, JS::Value*)>, cx=0x7fc81defd0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/js/CallArgs.h:285
#14 js::InternalCallOrConstruct (cx=cx@entry=0x7fc81defd0, args=..., 
    construct=construct@entry=js::NO_CONSTRUCT, reason=reason@entry=js::
    CallReason::Call)
    at ${PROJECT}/gecko-dev/js/src/vm/Interpreter.cpp:511
[...]
#63 0x0000007fefbb189c in ?? () from /lib64/libc.so.6
(gdb)
For comparison, it's interesting to check whether the backbuffer is already created at this point. The debugger suggests not:
(gdb) p mPresenter->mBackBuffer
$3 = std::shared_ptr<mozilla::gl::SharedSurface> (expired, weak count 0) = {get(
    ) = 0x21}
On this version of the browser the WebGL is working, using offscreen rendering, but the WebView is broken. So now I'm rethinking my need to introduce all the old GLScreenBuffer code. Could I try to use the SwapChain after all? You may recall that I already considered this much earlier, tried it, failed and then reassessed. Maybe I know enough now to make it work? I'm going to look carefully through the code and reconsider.

[...]

I've now spent a good few hours looking through the WebGLContext code, since this is what we see in the backtrace above. There's definitely something in the idea that we should be using this instead of GLContext. But WebGLContext isn't inheriting anything from GLContext and their interfaces look quite different to me. It certainly isn't the case that one would be a drop-in replacement for the other. Quite the contrary in fact. While switching to use WebGLContext might be a better solution in the long-term, I've convinced myself (again) that this isn't what we need right now.

So I'm going back to my original plan, but now we're going in the opposite direction. Rather than removing code I'll now start to reintroduce code. In particular, the one thing I'm convinced that we can't do without is the GLScreenBuffer object, as encapsulated in the GLContext::mScreen member variable.

So I'm adding this class back in. Thankfully git makes this a very easy process:
$ git checkout gfx/gl/GLScreenBuffer.cpp
$ git checkout gfx/gl/GLScreenBuffer.h
This is the minimal change I think is needed to get the WebView working again. I'm building from a base where WebGL is working. So I feel like I'm back on track again.

With just these two files reverted, attempting to build throws up a whole host of errors. Here are just a few:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: In destructor ‘virtual mozilla::
    gl::GLScreenBuffer::~GLScreenBuffer()’:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:205:8: error: invalid use of 
    incomplete type ‘class mozilla::layers::SharedSurfaceTextureClient’
   mBack->Surf()->ProducerRelease();
        ^~
In file included from ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.h:23,
                 from ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:6:
${PROJECT}/gecko-dev/gfx/gl/SharedSurface.h:45:7: note: forward declaration of 
    ‘class mozilla::layers::SharedSurfaceTextureClient’
 class SharedSurfaceTextureClient;
       ^~~~~~~~~~~~~~~~~~~~~~~~~~
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: In member function ‘void 
    mozilla::gl::GLScreenBuffer::BindFB(GLuint)’:
${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:218:10: error: ‘class mozilla::
    gl::GLContext’ has no member named ‘raw_fBindFramebuffer’; did you mean 
    ‘raw_fBlitFramebuffer’?
     mGL->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, mInternalDrawFB);
          ^~~~~~~~~~~~~~~~~~~~
          raw_fBlitFramebuffer
This isn't unexpected. My process now is to reintroduce removed code, but only where absolutely necessary to get the build working again. So I'm essentially doing the opposite of what I was doing before: adding code rather than removing it. As before, git is my biggest help here because it's kept a neat record of everything I've changed. I'm reverting it in small pieces, so it's taking a while to make the changes, but I'm still satisfied that what I'll end up with is the smallest set of changes I can reasonably expect, given that we've added the GLScreenBuffer class back in.

It's going to be the convex hull of the GLScreenBuffer dependencies.

[...]

I've got to the stage where the partial build seems to be compiling. But it required changes to the EmbedLite code, which I don't yet have a method of including in the partial build. But it's already late here, so I'm going to set the full build running overnight and see where that gets us.

Today has been a very productive day of development. If I can be similarly productive tomorrow, I'll feel like all of the work I've been putting in over the last week, despite the slow progress, will nevertheless have been worth it.

22 Jun 2024 : Day 266 #
I'm continuing to strip out methods today after eviscerating the code yesterday. There aren't many changes left and yesterday I established that none of the changed code is being executed before the crash. So I'm not confident that removing the changes will have any effect. But I don't have much else to do at this point so I may as well continue until there are no changes left to make.

It's anomalous though. I have packages built against the code with the commit reverted. I know that reverting all of the changes leaves me with a working browser. So something is clearly amiss.

Nevertheless I'm left with only a few changes now. The code is building and we'll see where this leaves us.

[...]

Code built, installed, executed. And we get the same result: a crash early on in the execution of the browser. I've removed so much code now that this doesn't feel right, so I need to check that something else hasn't broken along the way.

I've tried a whole bunch of things, including removing the profile, using different websites, restarting lipstick. None of this makes any difference.

Installing the packages for the version with working WebGL shows that things are still working for that. And when I then replace the library with the version of the library I've just built... well now that version works too, and with working WebGL as well. But of course the WebView is still broken with this version. But this clearly highlights that the problem isn't where I expected it to be.

So with much frustration I have to concede that something else — something bad — must have been happening elsewhere in the code.

Trying a second test, I install the packages for the version with broken WebGL. That gives the expected result (browser working; WebGL broken). Now I replace the library with my newly built one.

And now it crashes.

So the pattern is:
 
  1. Install working WebGL packages followed by latest libxul.so... works.
  2. Install broken WebGL packages followed by latest libxul.so... crashes.


This is food for thought for sure. This suggests that the problem sits somewhere in the interface between the updated code and one of the EmbedLite code, the QtMozEmbed code, or the sailfish-browser code.

This at least gives me something to go on. I'm going to ruminate on this overnight and try to tackle it tomorrow. This is definitely progress, just not without raising new questions which I'll need to answer.

21 Jun 2024 : Day 265 #
Sadly, when I checked my machine this morning, I discovered the build I kicked off overnight didn't complete successfully. There have been a couple of errors during the compilation step. The first looks like this:
330:03.83 mobile/sailfishos
330:22.42 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp: In member function ‘void mozilla::
    embedlite::EmbedLiteCompositorBridgeParent::PrepareOffscreen()’:
330:22.42 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:116:39: error: ‘class mozilla::gl::
    GLContext’ has no member named ‘Screen’
330:22.42      GLScreenBuffer* screen = context->Screen();
330:22.42                                        ^~~~~~
The second looks like this:
330:22.43 ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/
    EmbedLiteCompositorBridgeParent.cpp:124:74: error: no matching function for 
    call to ‘mozilla::gl::SurfaceFactory_EGLImage::Create(mozilla::gl::
    GLContext*&, std::nullptr_t, mozilla::layers::TextureFlags&)’
330:22.43          factory = SurfaceFactory_EGLImage::Create(context, nullptr, 
    flags);
330:22.43                                                                       
        ^
There are some further errors, but they look like variations on these two. You might think it's odd that the full build failed when the partial build completed successfully last night. This is an occupational hazard of running partial builds. When running a partial build we have to specify the folder to start in. For example, this is the command I used last night:
$ make -j1 -C obj-build-mer-qt-xr/gfx/
This is going to rebuild everything in the gfx directory and anything that depends on it. I chose this as the root because, as far as I could recall, all of the changes I made were in this directory or one of its children. But that's not always enough. If something elsewhere in the project shares a dependency that sits higher up the directory hierarchy than gfx, it won't necessarily get rebuilt.

If I'd run the final linker stage at the end of the process, the error might have been exposed by an undefined reference, but I was so tired last night by the time I'd made all of the changes to the source code that I could barely think straight. So I had neither the energy nor the wit to do this.

Never mind, with any luck I can fix it this morning and get another build running during the day.

The fix appears to involve reverting almost all of the changes made to EmbedLiteCompositorBridgeParent.cpp to re-accommodate GLContext::mScreen, which I'd previously switched for GLContext::mSwapChain in line with changes that happened upstream between ESR 78 and ESR 91. Given that I removed GLScreenBuffer, which mScreen was an instance of, these changes aren't too surprising in retrospect. But I was never going to notice them given my state of tiredness last night.

So anyway, here we are, it's still early and a fresh build is running. Hopefully this one will enjoy more success!

[...]

And happily it does: the build has completed without any errors, in time for some evening development.

Now time to execute it. And the result is...
Thread 38 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 13378]
0x0000007fe7e374cc in wl_proxy_marshal_constructor () from /usr/lib64/
    libwayland-client.so.0
(gdb) bt
#0  0x0000007fe7e374cc in wl_proxy_marshal_constructor () from /usr/lib64/
    libwayland-client.so.0
#1  0x0000007fe7b8742c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned 
    int, unsigned int, int, int, android_wlegl*, wl_event_queue*) ()
   from /usr/lib64/libhybris//eglplatform_wayland.so
#2  0x0000007fe7b874c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/
    libhybris//eglplatform_wayland.so
#3  0x0000007fe7b86728 in WaylandNativeWindow::dequeueBuffer(
    BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//
    eglplatform_wayland.so
#4  0x0000007fe7b4d124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, 
    ANativeWindowBuffer**, int*) () from /usr/lib64/
    libhybris-platformcommon.so.1
#5  0x0000007fe4fa9188 in ?? ()
#6  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
Honestly, I really thought that I'd removed enough of the changes that this error would no longer occur, so I'm surprised that it's still here. I must be missing something important. There aren't really so many active changes in the code now and I'm really struggling to figure out what the problem is. So I've placed breakpoints on the main remaining edited methods. One of these will have to be hit before the crash occurs.

Here's the list of breakpoints I've set:
(gdb) info break
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   <PENDING>  DirectUpdate
2       breakpoint     keep y   <PENDING>  TextureImageEGL::Resize
3       breakpoint     keep y   <PENDING>  TextureImageEGL::ReleaseTexImage
4       breakpoint     keep y   <PENDING>  TextureImageEGL::TextureImageEGL
5       breakpoint     keep y   <PENDING>  DestroyTextureData
6       breakpoint     keep y   <PENDING>  TextureClient::Destroy
7       breakpoint     keep y   <PENDING>  CompositorOGL::PrepareViewport
8       breakpoint     keep y   <PENDING>  CompositorOGL::DrawGeometry
(gdb) 
Astonishingly, not one of these gets hit. This is crazy. As I try to add the final few breakpoints, even on the methods that look unused, I notice that there are a couple of classes that have signatures but no implementation. Is it possible this could be the reason, and that I've been missing it all along?

I've now removed those classes and the few methods that also had signatures without definitions. I had thought that if these were the problems either the compiler would pick up on them or it would just fail when an attempt was made to load the library. Maybe I was wrong.

So, as I say, I've removed the class signatures and related code. The good news is that the partial build completed fine, including the linking stage. Does it run?

No. No it doesn't. The crash, along with its backtrace, remains identical.
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 31572]
0x0000007fe7e364bc in wl_proxy_marshal_constructor () from /usr/lib64/
    libwayland-client.so.0
(gdb) bt
#0  0x0000007fe7e364bc in wl_proxy_marshal_constructor () from /usr/lib64/
    libwayland-client.so.0
#1  0x0000007fe7b8642c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned 
    int, unsigned int, int, int, android_wlegl*, wl_event_queue*) ()
   from /usr/lib64/libhybris//eglplatform_wayland.so
#2  0x0000007fe7b864c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/
    libhybris//eglplatform_wayland.so
#3  0x0000007fe7b85728 in WaylandNativeWindow::dequeueBuffer(
    BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//
    eglplatform_wayland.so
#4  0x0000007fe7b4c124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, 
    ANativeWindowBuffer**, int*) () from /usr/lib64/
    libhybris-platformcommon.so.1
#5  0x0000007fe4f69188 in ?? ()
#6  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
That's not what I was hoping to see and this is deeply frustrating. I've reached the end of my usable hours for today, so I'll have to continue with this tomorrow. I'm not sure how much further I can strip code out until there's nothing left to remove, but I'll continue onward. Right now that seems like the only sane thing to do.

20 Jun 2024 : Day 264 #
Overnight I ran a build having removed the Wayland code added in the last commit, or at least a decent proportion of it. This morning everything was built and it's time to test it out.

Unfortunately we still get a crash. The backtrace isn't identical to the backtrace we were getting a couple of days ago, but it's similar. Similar enough to make me think we've not actually fixed anything yet. Here's the backtrace:
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 25343]
0x0000007fefec60e4 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000007fefec60e4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000007fe7e34170 in wl_proxy_marshal_array_constructor_versioned () from /
    usr/lib64/libwayland-client.so.0
#2  0x0000007fe7e344e8 in wl_proxy_marshal_constructor () from /usr/lib64/
    libwayland-client.so.0
#3  0x0000007fe7b8442c in ServerWaylandBuffer::ServerWaylandBuffer(unsigned 
    int, unsigned int, int, int, android_wlegl*, wl_event_queue*) ()
   from /usr/lib64/libhybris//eglplatform_wayland.so
#4  0x0000007fe7b844c8 in WaylandNativeWindow::addBuffer() () from /usr/lib64/
    libhybris//eglplatform_wayland.so
#5  0x0000007fe7b83728 in WaylandNativeWindow::dequeueBuffer(
    BaseNativeWindowBuffer**, int*) () from /usr/lib64/libhybris//
    eglplatform_wayland.so
#6  0x0000007fe7b4a124 in BaseNativeWindow::_dequeueBuffer(ANativeWindow*, 
    ANativeWindowBuffer**, int*) () from /usr/lib64/
    libhybris-platformcommon.so.1
#7  0x0000007fe4f69188 in ?? ()
#8  0x0000000000000438 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
So following this I've been working away at the code again. It's mostly business as usual: I continue to remove code, reverting pieces back to the code as it was before my latest commit. I went a bit further than usual today though, removing GLScreenBuffer entirely. That's a really significant change and it required plenty of refinement: fixing up code that made use of it; fixing up code that made use of code that made use of it. Each set of changes brings us a little closer to a working WebGL.

So this turned out to be a really big change. Having made it, I now have to build and test it. It's going to be a short post today, but that reflects the fact I've spent all my time stripping out code. Big changes, but there's not so much to say about it if I'm honest.

I've checked that the partial build goes through. But it's really late now, so I may as well set a full build running overnight. Then I can test the changes in the morning.

The one nice thing about the process I'm currently undertaking with these WebGL changes is that I know that the process is bounded. I have a broken version, I have a working version, I just need to find the tipping point between the two where it switches from broken to fixed. Once I have that it won't be the end of the story, because the changes will inevitably have broken the WebView, but at that point, once I know what's breaking the WebGL, I can go back to the working WebView and reapply that specific change. Or so the theory goes.

Before I sign off for today, I'm going to take a moment to indulge in some development philosophy.

The big difference between detective stories and real-life detective work is that in the fictional version the clues are all there, you just need to find them. Real life doesn't come with that guarantee: you can spend your entire life looking for clues that don't exist. I'm sure that's a big part of the reason why, as humans, we prefer computer games over real life. In a computer game we know in advance there's a finite, bounded, solution. A big portion of the uncertainty has already been taken away.

With this particular WebGL bug I'm inhabiting this same happy computer-game place where I know there's a solution, I just have to find it. It may be a slow process, it may be a little arduous at times, but the solution exists somewhere between the last commit and the current commit. We'll get there.
