List items
Items from the current list are shown below.
Gecko
All items from March 2024
31 Mar 2024 : Day 202 #
Recently I've been looking at the EGL mesa display. Yesterday I finished restoring patch 0038, which is designed to fix this part of the rendering pipeline. Today I want to try to understand a little better what's being executed and when, in the hope that this will clarify whether what's happening is happening correctly or not.
In fact, I know there's something not working because rendering is still failing, but my hope is to narrow down what is and isn't working. After all, to badly misquote an expert on the matter, when you have eliminated the impossible, whatever remains, however improbable, must be the source of the bug.
The interesting methods, by which I mean the methods I'm interested in, which are those I've been making changes to over the last couple of days, are the following:
To understand how these are getting called, I've placed breakpoints on all of them:
So executing the code with the debugger I can quickly find out which of these methods are called, in what order, and which are never called at all.
That flow is useful, but it's not the entire picture. I also want to know the call stacks for these method calls. Some of them are obvious because there's only one path for them to be called via. But others, including the calls to GetAppDisplay(), are less clear. This method in particular is called in many different places in the code. It'd be useful to know which paths are causing it to be triggered.
The easiest way to find this out is to get a copy of the backtrace when the breakpoint has been hit. I didn't do this first time round, but a second run through allows me to collect this info too.
Here's the (slightly abridged) backtrace for the first hit of GetAppDisplay()
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
In fact, I know there's something not working because rendering is still failing, but my hope is to narrow down what is and isn't working. After all, to badly misquote an expert on the matter, when you have eliminated the impossible, whatever remains, however improbable, must be the source of the bug.
The interesting methods, by which I mean the methods I'm interested in, which are those I've been making changes to over the last couple of days, are the following:
- GetAppDisplay()
- CreateEmulatorBufferSurface()
- WaylandGLSurface::WaylandGLSurface()
- GLLibraryEGL::Create()
- GLLibraryEGL::DefaultDisplay()
- GLLibraryEGL::CreateDisplay()
- GetAndInitDisplay()
To understand how these are getting called, I've placed breakpoints on all of them:
(gdb) b GetAppDisplay Breakpoint 1 at 0x7ff1105918: file ${PROJECT}/gecko-dev/gfx/gl/ GLContextProviderEGL.cpp, line 161. (gdb) b CreateEmulatorBufferSurface Breakpoint 2 at 0x7ff111d054: file ${PROJECT}/gecko-dev/gfx/gl/ GLContextProviderEGL.cpp, line 980. (gdb) b WaylandGLSurface::WaylandGLSurface Breakpoint 3 at 0x7ff111d17c: file ${PROJECT}/gecko-dev/gfx/gl/ GLContextProviderEGL.cpp, line 952. (gdb) b GLLibraryEGL::Create Breakpoint 4 at 0x7ff1118c50: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 343. (gdb) b GLLibraryEGL::DefaultDisplay Breakpoint 5 at 0x7ff111ce84: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 736. (gdb) b GLLibraryEGL::CreateDisplay Breakpoint 6 at 0x7ff111c8b8: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 747. (gdb) b GetAndInitDisplay Breakpoint 7 at 0x7ff111c82c: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 149. (gdb) r The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /usr/bin/harbour-webview [...]You can probably tell that today is going to be another day of lengthy backtraces. It's true I'm afraid and I can only apologise in advance. As always, these backtraces make for very dull reading, but are important for my records. You won't lose anything by skipping them.
So executing the code with the debugger I can quickly find out which of these methods are called, in what order, and which are never called at all.
Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:161 161 ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: No such file or directory. (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:161 161 in ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 4, mozilla::gl::GLLibraryEGL:: Create (out_failureId=out_failureId@entry=0x7f1f7cb1c8, aDisplay=0x1) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:343 343 ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory. (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 5, mozilla::gl::GLLibraryEGL:: DefaultDisplay (this=0x7edc003200, out_failureId=out_failureId@entry=0x7f1f7cb1c8, aDisplay=aDisplay@entry=0x1) at ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp:736 736 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 6, mozilla::gl::GLLibraryEGL:: CreateDisplay (this=this@entry=0x7edc003200, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f1f7cb1c8, aDisplay=aDisplay@entry=0x1) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:747 747 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 7, mozilla::gl:: GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, display=display@entry=0x1) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149 149 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) c Continuing. [New Thread 0x7f1f4fa7e0 (LWP 2850)] =============== Preparing offscreen rendering context ===============This looks pretty healthy to be honest. After this point, none of the methods are called, so everything is happening during initialisation, which I was hoping would be the case. So far so good.
That flow is useful, but it's not the entire picture. I also want to know the call stacks for these method calls. Some of them are obvious because there's only one path for them to be called via. But others, including the calls to GetAppDisplay(), are less clear. This method in particular is called in many different places in the code. It'd be useful to know which paths are causing it to be triggered.
The easiest way to find this out is to get a copy of the backtrace when the breakpoint has been hit. I didn't do this first time round, but a second run through allows me to collect this info too.
Here's the (slightly abridged) backtrace for the first hit of GetAppDisplay()
Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:161 161 ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/ GLContextProviderEGL.cpp:161 #1 0x0000007ff112e684 in mozilla::gl::GLContextProviderEGL::CreateHeadless ( desc=..., out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1434 #2 0x0000007ff112efd4 in mozilla::gl::GLContextProviderEGL::CreateOffscreen ( size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1476 #3 0x0000007ff1197b04 in mozilla::layers::CompositorOGL::CreateContext ( this=this@entry=0x7edc002f10) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254 #4 0x0000007ff11ad0f4 in mozilla::layers::CompositorOGL::Initialize ( this=0x7edc002f10, out_failureReason=0x7f1f96f520) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:391 #5 0x0000007ff12c2d90 in mozilla::layers::CompositorBridgeParent:: NewCompositor (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493 #6 0x0000007ff12cde0c in mozilla::layers::CompositorBridgeParent:: InitializeLayerManager (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436 #7 0x0000007ff12cdf3c in mozilla::layers::CompositorBridgeParent:: AllocPLayerTransactionParent (this=this@entry=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546 #8 0x0000007ff36649c8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: AllocPLayerTransactionParent (this=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:80 #9 0x0000007ff0c5c3ac in mozilla::layers::PCompositorBridgeParent:: OnMessageReceived (this=0x7fc4b7b7d0, msg__=...) at PCompositorBridgeParent.cpp:1285 #10 0x0000007ff0ca08c0 in mozilla::layers::PCompositorManagerParent:: OnMessageReceived (this=<optimized out>, msg__=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h: 675 [...] #24 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb)The above backtrace shows that the call to GetAppDisplay() is being triggered from GLContextProviderEGL::CreateHeadless(). There it's being called directly. But interestingly, later in the same CreateHeadless() it gets called again, this time indirectly via DefaultEglDisplay() and then DefaultEglLibrary (despite the similar names those are very different methods; I find I need to focus hard on this stuff to properly distinguish). Here's the (again, slightly abridged) backtrace for this second call to GetAppDisplay().
Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:161 161 in ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp (gdb) bt #0 mozilla::gl::GetAppDisplay () at ${PROJECT}/gecko-dev/gfx/gl/ GLContextProviderEGL.cpp:161 #1 0x0000007ff1118e40 in mozilla::gl::DefaultEglLibrary ( out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1519 #2 0x0000007ff112e694 in mozilla::gl::DefaultEglDisplay (aDisplay=0x1, out_failureId=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:29 #3 mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1434 #4 0x0000007ff112efd4 in mozilla::gl::GLContextProviderEGL::CreateOffscreen ( size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1476 #5 0x0000007ff1197b04 in mozilla::layers::CompositorOGL::CreateContext ( this=this@entry=0x7edc002f10) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254 #6 0x0000007ff11ad0f4 in mozilla::layers::CompositorOGL::Initialize ( this=0x7edc002f10, out_failureReason=0x7f1f96f520) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:391 #7 0x0000007ff12c2d90 in mozilla::layers::CompositorBridgeParent:: NewCompositor (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493 #8 0x0000007ff12cde0c in mozilla::layers::CompositorBridgeParent:: InitializeLayerManager (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436 #9 0x0000007ff12cdf3c in mozilla::layers::CompositorBridgeParent:: AllocPLayerTransactionParent (this=this@entry=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546 #10 0x0000007ff36649c8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: AllocPLayerTransactionParent (this=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:80 #11 0x0000007ff0c5c3ac in mozilla::layers::PCompositorBridgeParent:: OnMessageReceived (this=0x7fc4b7b7d0, msg__=...) at PCompositorBridgeParent.cpp:1285 [...] #26 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb)From here I think it starts getting less interesting (I'm pretending it ever was!), but for completeness, here's the backtrace for GLLibraryEGL::Create():
Thread 38 "Compositor" hit Breakpoint 4, mozilla::gl::GLLibraryEGL:: Create (out_failureId=out_failureId@entry=0x7f1f96f1c8, aDisplay=0x1) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:343 343 ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GLLibraryEGL::Create ( out_failureId=out_failureId@entry=0x7f1f96f1c8, aDisplay=0x1) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:343 #1 0x0000007ff1118e50 in mozilla::gl::DefaultEglLibrary ( out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1519 #2 0x0000007ff112e694 in mozilla::gl::DefaultEglDisplay (aDisplay=0x1, out_failureId=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:29 #3 mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1434 #4 0x0000007ff112efd4 in mozilla::gl::GLContextProviderEGL::CreateOffscreen ( size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1f96f1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1476 #5 0x0000007ff1197b04 in mozilla::layers::CompositorOGL::CreateContext ( this=this@entry=0x7edc002f10) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254 #6 0x0000007ff11ad0f4 in mozilla::layers::CompositorOGL::Initialize ( this=0x7edc002f10, out_failureReason=0x7f1f96f520) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:391 #7 0x0000007ff12c2d90 in mozilla::layers::CompositorBridgeParent:: NewCompositor (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493 #8 0x0000007ff12cde0c in mozilla::layers::CompositorBridgeParent:: InitializeLayerManager (this=this@entry=0x7fc4b7b7d0, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436 #9 0x0000007ff12cdf3c in mozilla::layers::CompositorBridgeParent:: AllocPLayerTransactionParent (this=this@entry=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1546 #10 0x0000007ff36649c8 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: AllocPLayerTransactionParent (this=0x7fc4b7b7d0, aBackendHints=..., aId=...) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:80 #11 0x0000007ff0c5c3ac in mozilla::layers::PCompositorBridgeParent:: OnMessageReceived (this=0x7fc4b7b7d0, msg__=...) at PCompositorBridgeParent.cpp:1285 [...] #26 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb)I won't give the backtraces for the others because their location is already pretty fixed in the code. But on checking all of the values passed in to these and the other methods it's clear that the value for mDisplay is consistently set to 1 in all cases:
(gdb) p aDisplay $1 = (EGLDisplay) 0x1 (gdb)That should be a good thing, but this doesn't mean everything is fixed and there's a nagging feeling in the back of my mind. I'll go through the flow and backtraces properly tomorrow, but that's it for today.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
30 Mar 2024 : Day 201 #
Yesterday I restored (manually, because the underlying code has changed so much) patch 0038 back into the codebase. But it doesn't yet quite compile. There are a few sticking point errors that need to be resolved. The first I'm working on looks like this:
The choice of parameters turns out to be relatively straightforward. The out_failureId error string that's actually an output rather than an input is just a pointer which gets pointed at a string literal if something goes wrong. But I'm ignoring errors anyway at this point, so I can just pass in a dummy variable. The aDisplay input value should be set to the result of GetAppDisplay() since this is the whole point of the changes that were introduced in ESR 78 to support the emulator and native devices.
After these changes the code now looks like this:
The third error is more unexpected and looks like this:
Okay, I'll be honest, I'm not really sure what's going on here. More importantly the diff has examples of how this has been changed in other places and which match pretty closely to the situation I'm trying to solve. So I've copied those snippets of code and refactored them for the ESR 91 code. The result is that I've changed the following piece of code:
Having made these changes I'm pleased to discover that the code now compiles fully. I'm able to kick off the partial build and watch it generate the libxul.so file that we need. Having copied over and installed that file, when running the app I discover that... nothing overt has changed.
That's too bad. But what would be interesting to see now is what the actual values for the display are. I'll look in to that tomorrow.
As always if you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
In file included from Unified_cpp_gfx_gl0.cpp:47: ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: In function ‘void* mozilla::gl::CreateEmulatorBufferSurface(mozilla::gl::GLLibraryEGL*, EGLConfig, mozilla::gfx::IntSize&)’: ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:991:55: error: ‘class mozilla::gl::GLLibraryEGL’ has no member named ‘Display’; did you mean ‘fGetDisplay’? EGLSurface surface = egl->fCreateWindowSurface(egl->Display(), config, eglwindow, 0); ^~~~~~~ fGetDisplayThis is code I copied over from ESR 78. The problem here is that egl->Display() no longer exists in ESR 91, for reasons we've already discussed, but most notably because there is no longer one single display in operation: there could be multiple. In ESR 78 the EGLLibrary::Display() method looked like this:
EGLDisplay Display() const { MOZ_ASSERT(mInitialized); return mEGLDisplay; }As you can see it just returns an mEGLDisplay value — the single canonical display for the application — every time. On ESR 91 this method has been removed entirely. It took me a while to decide which alternative method was the best match, but eventually I settled on this one:
std::shared_ptr<EglDisplay> DefaultDisplay(nsACString* const out_failureId, EGLDisplay aDisplay);The difficulty, as you can see, is that while the former required no parameters, the latter has acquired two new parameters. I have to be honest that I'm a bit uncomfortable with a method that takes in an EGLDisplay in order to return an EglDisplay: it should really be the other way around. But I'm going with the flow on this.
The choice of parameters turns out to be relatively straightforward. The out_failureId error string that's actually an output rather than an input is just a pointer which gets pointed at a string literal if something goes wrong. But I'm ignoring errors anyway at this point, so I can just pass in a dummy variable. The aDisplay input value should be set to the result of GetAppDisplay() since this is the whole point of the changes that were introduced in ESR 78 to support the emulator and native devices.
After these changes the code now looks like this:
EGLSurface surface = egl->fCreateWindowSurface(egl->DefaultDisplay( out_failureId, GetAppDisplay())->mDisplay, config, eglwindow, 0);This line now compiles, so time to move on to the second error, which looks like this:
In file included from Unified_cpp_gfx_gl0.cpp:2: ${PROJECT}/gecko-dev/gfx/gl/AndroidSurfaceTexture.cpp: In static member function ‘static already_AddRefed<mozilla::gl::GLContextEGL> mozilla::gl:: AndroidSharedBlitGL::CreateContextImpl(bool)’: ${PROJECT}/gecko-dev/gfx/gl/AndroidSurfaceTexture.cpp:78:52: error: too few arguments to function std::shared_ptr<mozilla::gl::EglDisplay> mozilla::gl: :DefaultEglDisplay(nsACString*, EGLDisplay)’ const auto egl = gl::DefaultEglDisplay(&ignored); ^ In file included from ${PROJECT}/gecko-dev/gfx/gl/AndroidSurfaceTexture.cpp:9, from Unified_cpp_gfx_gl0.cpp:2: ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:27:36: note: declared here inline std::shared_ptr<EglDisplay> DefaultEglDisplay( ^~~~~~~~~~~~~~~~~There's a similar thing going on here, in that there should be an input parameter. But this is a little different because of where the code lives. It's being compiled in as part of the AndroidSurfaceTexture.cpp file. I'm wondering why this file is getting included at all, given the name seems to suggest it's rather Android focused. Nevertheless I've fixed this particular error by passing in EGL_DISPLAY_NONE as the parameter. To be honest I'm hoping this code never actually gets called so that the input value is somewhat irrelevant. Based on this assumption, this change should do the trick.
The third error is more unexpected and looks like this:
In file included from Unified_cpp_gfx_gl0.cpp:47: ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp: In function ‘void* mozilla::gl::CreateEmulatorBufferSurface(mozilla::gl::GLLibraryEGL*, EGLConfig, mozilla::gfx::IntSize&)’: ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:996:36: error: ‘nsTHashMap<nsPtrHashKey<void>, mozilla::gl::WaylandGLSurface*>’ {aka ‘class nsBaseHashtable<nsPtrHashKey<void>, mozilla::gl::WaylandGLSurface*, mozilla::gl::WaylandGLSurface*, nsDefaultConverter<mozilla::gl:: WaylandGLSurface*, mozilla::gl::WaylandGLSurface*> >’} has no member named ‘LookupForAdd’; did you mean ‘LookupResult’? auto entry = sWaylandGLSurface.LookupForAdd(surface); ^~~~~~~~~~~~ LookupResultOn the face of it it's unclear to me what's going wrong and how this should be fixed, but a dig through the git history shows that this error has been caused by change D104216 which relates to Bugzilla bug 1688833. The bug description has this to say:
After bug 1681469, there will be two APIs for LookupForAdd (WithEntryHandle). The new one is based on nsTHashtable::WithEntryHandle which is based on PLDHashTable::WithEntryHandle. The new API isn't currently public, since there are still some open questions which need to be addressed.
Okay, I'll be honest, I'm not really sure what's going on here. More importantly the diff has examples of how this has been changed in other places and which match pretty closely to the situation I'm trying to solve. So I've copied those snippets of code and refactored them for the ESR 91 code. The result is that I've changed the following piece of code:
if (surface) { WaylandGLSurface* waylandData = new WaylandGLSurface(wlsurface, eglwindow); auto entry = sWaylandGLSurface.LookupForAdd(surface); entry.OrInsert([&waylandData]() { return waylandData; }); }So that it now looks like this:
if (surface) { WaylandGLSurface* waylandData = new WaylandGLSurface(wlsurface, eglwindow); sWaylandGLSurface.WithEntryHandle(surface, [&](auto&& entry) { entry.OrInsert(waylandData); }) }Does that look sensible? Well, it compiles and it's consistent with how the change was made in other parts of the code; and right now that's good enough for me.
Having made these changes I'm pleased to discover that the code now compiles fully. I'm able to kick off the partial build and watch it generate the libxul.so file that we need. Having copied over and installed that file, when running the app I discover that... nothing overt has changed.
That's too bad. But what would be interesting to see now is what the actual values for the display are. I'll look in to that tomorrow.
As always if you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
30 Mar 2024 : Day 200 #
I thought I'd already had a go at applying patch 0038 "Fix mesa egl display and buffer initialisation" on Day 55. Looking back at my notes, I did try, but probably not hard enough. The patch failed to stick in many places and now I'm looking at the code it seems there are some really large gaps.
Trying to make a better job of it this time around I've been working my way carefully through the entire patch. I've manually applied all of the changes. As I mentioned yesterday, the patch contains a lot of material important for keeping Sailfish Browser working.
Now when I attempt to build the gecko library I start getting errors in far flung places in the code due to the way the API is accessed. The output below is typical of the errors I'm getting.
An obvious fix would be to make the parameter optional. But these errors do serve a purpose in exposing when things have changed, so it's convenient not to make it optional just yet. Maybe once everything is working I can return to this and add it as an optional parameter for the purposes of simplifying the code and reducing the size of the patches. But for now the added protection of knowing exactly when and where the calls are made is convenient.
I've not yet got through all of the changes so I'll have to continue tomorrow. I feel like I've been making quite good progress though. I just need to get the last pieces compiling.
As an aside, I finally got to watch Peter Mack's (1peter10's) FOSDEM'24 talk on The Linux Phone App Ecosystem. I've been meaning to watch it from before the event and really enjoyed it, especially the way it straddled so many different Linux-based mobile operating systems. If you've not watched it yet yourself I highly recommend it.
I'm hoping to get that completed and compiling tomorrow. In the meantime if you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
Trying to make a better job of it this time around I've been working my way carefully through the entire patch. I've manually applied all of the changes. As I mentioned yesterday, the patch contains a lot of material important for keeping Sailfish Browser working.
Now when I attempt to build the gecko library I start getting errors in far flung places in the code due to the way the API is accessed. The output below is typical of the errors I'm getting.
${PROJECT}/gecko-dev/gfx/gl/AndroidSurfaceTexture.cpp: In static member function ‘static already_AddRefed<mozilla::gl::GLContextEGL> mozilla::gl:: AndroidSharedBlitGL::CreateContextImpl(bool)’: ${PROJECT}/gecko-dev/gfx/gl/AndroidSurfaceTexture.cpp:78:52: error: too few arguments to function std::shared_ptr<mozilla::gl::EglDisplay> mozilla::gl: :DefaultEglDisplay(nsACString*, EGLDisplay)’ const auto egl = gl::DefaultEglDisplay(&ignored); ^The reason for this error is clear: I changed some of the method signatures to include a display parameter. And while I've tried to update the code so that it's called correctly in all of the obvious places, it turns out there are some less obvious places too. In practice it looks to me like these are Android-specific, but if the compiler is going over them they'll need to be fixed.
An obvious fix would be to make the parameter optional. But these errors do serve a purpose in exposing when things have changed, so it's convenient not to make it optional just yet. Maybe once everything is working I can return to this and add it as an optional parameter for the purposes of simplifying the code and reducing the size of the patches. But for now the added protection of knowing exactly when and where the calls are made is convenient.
I've not yet got through all of the changes so I'll have to continue tomorrow. I feel like I've been making quite good progress though. I just need to get the last pieces compiling.
As an aside, I finally got to watch Peter Mack's (1peter10's) FOSDEM'24 talk on The Linux Phone App Ecosystem. I've been meaning to watch it from before the event and really enjoyed it, especially the way it straddled so many different Linux-based mobile operating systems. If you've not watched it yet yourself I highly recommend it.
I'm hoping to get that completed and compiling tomorrow. In the meantime if you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
28 Mar 2024 : Day 199 #
Adam Pigg (piggz, who you'll know from his daily Ofono blog amongst other Sailfish-related things, asked a good question on Mastodon yesterday. "How" Adam asks "was the texture deletion missed? Was it present in 71?" This is in relation to how I ended up solving the app seizure problem. With apologies to those who have been following along and already know, but to recap, the problem turned out to be that an EGL texture was being created for each SharedSurface_EGLImage, but then in the destructor there was no code to delete the texture.
It's a classic resource leakage problem and one you might rightly ask (and that's exactly what Adam did) how such an obvious failure could have snuck through.
The answer is that the code to delete the texture was in ESR 78 but was removed from ESR 91. That sounds strange until you also realise that the code to create the texture was removed as well. A whole raft of changes were made as a result of changeset D75055 and the changes specifically to the SharedSurfaceEGL.cpp file involved stripping out the EGL code that the WebView relies on. When attempting to reintroduce this code back I returned the texture creation code but somehow missed the texture deletion code.
And that's the challenge of where I'm at with the WebView rendering changes now. It's all down to whether I've successfully reversed these changes or not. I know it can work because it works with ESR 78. But getting all of the pieces to balance together is turning out to be a bit of a challenge. It's just a matter of time before everything fits in the right way, but it is, I'm afraid, taking a lot of time.
So thanks Adam for asking the question. It's good to reflect on these things and hopefully in my case learn how to avoid similar problems happening in future. Now on to the work of finding and fixing the other issues.
So today I've been looking through the diffs I mentioned yesterday, but admittedly without making a huge amount of progress. The one thing I've discovered, that I think may be important, is that there's a difference in the way the display is being configured between ESR 78 and ESR 91.
On ESR 78 the display is collected via a call to GetAppDisplay(). This function can be found in GLContextProviderEGL.cpp and looks like this:
The initialisation process has been changed in ESR 91. But there's still a similar process that happens when the library is initialised, the difference being that currently EGL_NO_DISPLAY is passed into the method instead.
Eventually this gets passed on to the CreateDisplay() method, which is where we need the correct value to be. Using the debugger I've checked exactly how this gets called on ESR 91. It's clear from this that the display isn't being set up as it should be.
So it's just a short one today, but rest assured I'll be writing more about all this tomorrow.
Before finishing up, I also just want to reiterate my commitment to base-2 milestones. The fact I'll be hitting a day that just happens to be represented neatly in base 10 is of no interest to me and I won't be making a big deal out of it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
It's a classic resource leakage problem and one you might rightly ask (and that's exactly what Adam did) how such an obvious failure could have snuck through.
The answer is that the code to delete the texture was in ESR 78 but was removed from ESR 91. That sounds strange until you also realise that the code to create the texture was removed as well. A whole raft of changes were made as a result of changeset D75055 and the changes specifically to the SharedSurfaceEGL.cpp file involved stripping out the EGL code that the WebView relies on. When attempting to reintroduce this code back I returned the texture creation code but somehow missed the texture deletion code.
And that's the challenge of where I'm at with the WebView rendering changes now. It's all down to whether I've successfully reversed these changes or not. I know it can work because it works with ESR 78. But getting all of the pieces to balance together is turning out to be a bit of a challenge. It's just a matter of time before everything fits in the right way, but it is, I'm afraid, taking a lot of time.
So thanks Adam for asking the question. It's good to reflect on these things and hopefully in my case learn how to avoid similar problems happening in future. Now on to the work of finding and fixing the other issues.
So today I've been looking through the diffs I mentioned yesterday, but admittedly without making a huge amount of progress. The one thing I've discovered, that I think may be important, is that there's a difference in the way the display is being configured between ESR 78 and ESR 91.
On ESR 78 the display is collected via a call to GetAppDisplay(). This function can be found in GLContextProviderEGL.cpp and looks like this:
// Use the main app's EGLDisplay to avoid creating multiple Wayland connections // See JB#56215 static EGLDisplay GetAppDisplay() { #ifdef MOZ_WIDGET_QT QPlatformNativeInterface* interface = QGuiApplication:: platformNativeInterface(); MOZ_ASSERT(interface); return (EGLDisplay)(interface->nativeResourceForIntegration(QByteArrayLiteral( "egldisplay"))); #else return EGL_NO_DISPLAY; #endif }In our case we have MOZ_WIDGET_QT defined, so it's the first half of the ifdef that's getting compiled. This gets passed in to the GLLibraryEGL::EnsureInitialized() method when the library is initialised.
The initialisation process has been changed in ESR 91. But there's still a similar process that happens when the library is initialised, the difference being that currently EGL_NO_DISPLAY is passed into the method instead.
Eventually this gets passed on to the CreateDisplay() method, which is where we need the correct value to be. Using the debugger I've checked exactly how this gets called on ESR 91. It's clear from this that the display isn't being set up as it should be.
(gdb) b GLLibraryEGL::CreateDisplay Function "GLLibraryEGL::CreateDisplay" not defined. Make breakpoint pending on future shared library load? (y or [n]) y Breakpoint 1 (GLLibraryEGL::CreateDisplay) pending. (gdb) r [...] Thread 37 "Compositor" hit Breakpoint 1, mozilla::gl::GLLibraryEGL:: CreateDisplay (this=this@entry=0x7ed8003200, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f176081c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:747 747 ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed8003200, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f176081c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:747 #1 0x0000007ff111d850 in mozilla::gl::GLLibraryEGL::DefaultDisplay ( this=0x7ed8003200, out_failureId=out_failureId@entry=0x7f176081c8) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:740 #2 0x0000007ff112ef28 in mozilla::gl::DefaultEglDisplay ( out_failureId=0x7f176081c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:33 #3 mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7f176081c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1246 #4 0x0000007ff112f804 in mozilla::gl::GLContextProviderEGL::CreateOffscreen ( size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f176081c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1288 #5 0x0000007ff11982f8 in mozilla::layers::CompositorOGL::CreateContext ( this=this@entry=0x7ed8002f10) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254 #6 0x0000007ff11ad8e8 in mozilla::layers::CompositorOGL::Initialize ( this=0x7ed8002f10, out_failureReason=0x7f17608520) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:391 #7 0x0000007ff12c3584 in mozilla::layers::CompositorBridgeParent:: NewCompositor (this=this@entry=0x7fc46c2070, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1493 #8 0x0000007ff12ce600 in mozilla::layers::CompositorBridgeParent:: InitializeLayerManager (this=this@entry=0x7fc46c2070, aBackendHints=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorBridgeParent.cpp:1436 [...] #26 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb)Besides this investigation I've also started making changes to try to fix the code, but this is still very much a work-in-progress. Hopefully tomorrow I'll have something more concrete to show for my efforts.
So it's just a short one today, but rest assured I'll be writing more about all this tomorrow.
Before finishing up, I also just want to reiterate my commitment to base-2 milestones. The fact I'll be hitting a day that just happens to be represented neatly in base 10 is of no interest to me and I won't be making a big deal out of it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
27 Mar 2024 : Day 198 #
I'm looking forward to getting back to a more balanced cadence with gecko development. It's been frustrating to be stuck on app seizing for the last couple of weeks, now that it's out of the way it'll be nice to focus on other parts of the code. But I'm not going to be wandering too far afield as I continue to try to get the WebView render pipeline working.
What's become clear is that the front and back buffer are being successfully created (and now destroyed!). So now there are two other potential places where the rendering could be failing. It could be that the paint from the Web pages is failing to get on the texture. Or it could be that the paint from the texture is failing to get on the screen.
I'd like to devise ways to test both of those things, but before I do that I want to first check another area that's ripe for failure in my opinion, and that's the setting of the display value. The EGL library uses an EGLDisplay object to control where rendering happens. Although it's part of the Khronos EGL specification, the official documentation is frustratingly vague about what an EGLDisplay actually is. Thankfully the PowerVR documentation has a note that summarises it quite clearly.
The shift from ESR 78 to ESR 91 brought with it a more flexible handling of displays. In particular, while ESR 78 had just a single instance of a display, ESR 91 allows multiple displays to be configured. What the practical benefit of this is I'm not entirely certain of, but handling of EGLDisplay storage has become more complex as a result.
So whereas previously gecko had a single mDisplay value that got used everywhere, the EGLDisplay is now wrapped in a gecko-specific EglDisplay class, defined in GLLibraryEGL.h. This class captures a collection of functionalities, one of which is to store an EGLDisplay value. There can be multiple instances of EglDisplay live at any one time.
The subtle distinction in the capitalisation — EGLDisplay vs. EglDisplay — is critical. The former belongs to EGL whereas the latter belongs to gecko. The fact they're so similar and that the shift from ESR 78 to ESR 91 has resulted in a switch from one to the other in many parts of the code, makes things all the more confusing.
There's plenty of opportunity for errors here. So I'm thinking: this is something to check.
An obvious place to start these checks is with display initialisation. A quick grep of the code for eglInitialize doesn't give any useful results. However as we saw at some length on Monday, all of these EGL library calls have been abstracted away. And eglInitialize() is no different. The gecko code uses a method called GLLibraryEGL::fInitialize() instead.
Grepping for that throws up some more useful references. The most promising one being this:
Unfortunately applying the patch, especially due to the differences in the way EGLDisplay is handled, turned out to be a challenge.
Consequently I'm now working my way through this patch again. It'll take me longer than just today, so I'll continue with it until it's all applied properly and report back if I find anything important tomorrow. Raine (rainemak) also flagged up patches 0045 and 0065. The former claims to "Prioritize GMP plugins over all others, and support decoding video for h264, vp8 & vp9"; whereas the latter will:
It'll take me a while to work through these as well, which means that's it for today. I'll write more about all this tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
What's become clear is that the front and back buffer are being successfully created (and now destroyed!). So now there are two other potential places where the rendering could be failing. It could be that the paint from the Web pages is failing to get on the texture. Or it could be that the paint from the texture is failing to get on the screen.
I'd like to devise ways to test both of those things, but before I do that I want to first check another area that's ripe for failure in my opinion, and that's the setting of the display value. The EGL library uses an EGLDisplay object to control where rendering happens. Although it's part of the Khronos EGL specification, the official documentation is frustratingly vague about what an EGLDisplay actually is. Thankfully the PowerVR documentation has a note that summarises it quite clearly.
EGL uses the concept of a "display". This is an abstract object which will show the rendered graphical output. In most environments it corresponds to a single physical screen. After creating a native display for a given windowing system, EGL can use this handle to get a corresponding EGLDisplay handle for use in rendering.
The shift from ESR 78 to ESR 91 brought with it a more flexible handling of displays. In particular, while ESR 78 had just a single instance of a display, ESR 91 allows multiple displays to be configured. What the practical benefit of this is I'm not entirely certain of, but handling of EGLDisplay storage has become more complex as a result.
So whereas previously gecko had a single mDisplay value that got used everywhere, the EGLDisplay is now wrapped in a gecko-specific EglDisplay class, defined in GLLibraryEGL.h. This class captures a collection of functionalities, one of which is to store an EGLDisplay value. There can be multiple instances of EglDisplay live at any one time.
The subtle distinction in the capitalisation — EGLDisplay vs. EglDisplay — is critical. The former belongs to EGL whereas the latter belongs to gecko. The fact they're so similar and that the shift from ESR 78 to ESR 91 has resulted in a switch from one to the other in many parts of the code, makes things all the more confusing.
There's plenty of opportunity for errors here. So I'm thinking: this is something to check.
An obvious place to start these checks is with display initialisation. A quick grep of the code for eglInitialize doesn't give any useful results. However as we saw at some length on Monday, all of these EGL library calls have been abstracted away. And eglInitialize() is no different. The gecko code uses a method called GLLibraryEGL::fInitialize() instead.
Grepping for that throws up some more useful references. The most promising one being this:
static EGLDisplay GetAndInitDisplay(GLLibraryEGL& egl, void* displayType, EGLDisplay display = EGL_NO_DISPLAY) { if (display == EGL_NO_DISPLAY) { display = egl.fGetDisplay(displayType); if (display == EGL_NO_DISPLAY) return EGL_NO_DISPLAY; if (!egl.fInitialize(display, nullptr, nullptr)) return EGL_NO_DISPLAY; } return display; }That's on ESR 78. On ESR 91 things are different and for good reason. The GetAndInitDisplay() method assumes a single instance of EGLDisplay as discussed earlier. On ESR 91 the display is initialised when its EglDisplay wrapper is created:
// static std::shared_ptr<EglDisplay> EglDisplay::Create(GLLibraryEGL& lib, const EGLDisplay display, const bool isWarp) { // Retrieve the EglDisplay if it already exists { const auto itr = lib.mActiveDisplays.find(display); if (itr != lib.mActiveDisplays.end()) { const auto ret = itr->second.lock(); if (ret) { return ret; } } } if (!lib.fInitialize(display, nullptr, nullptr)) { return nullptr; } [...] }I've chopped off the end of the method there, but the section shown highlights the important part. It's also worth mentioning that in ESR 78 this and the surrounding functionality were all amended by Patch 0038 "Fix mesa egl display and buffer initialisation". I attempted to apply this patch all the way back on Day 55 and it does contain plenty of relevant changes. Here's the way the patch describes itself:
Ensure the same display is used for all initialisations to avoid creating multiple wayland connections. Fallback to a wayland window surface in case pixel buffers aren't supported. This is needed on the emulator.
Unfortunately applying the patch, especially due to the differences in the way EGLDisplay is handled, turned out to be a challenge.
Consequently I'm now working my way through this patch again. It'll take me longer than just today, so I'll continue with it until it's all applied properly and report back if I find anything important tomorrow. Raine (rainemak) also flagged up patches 0045 and 0065. The former claims to "Prioritize GMP plugins over all others, and support decoding video for h264, vp8 & vp9"; whereas the latter will:
Hardcode loopback address for profile lock filename. When engine started without network PR_GetHostByName takes 20 seconds when connman tries to resolve host name. As this is only used as part of the profile lock filename it can as well be like "127.0.0.1:+<pid>".
It'll take me a while to work through these as well, which means that's it for today. I'll write more about all this tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
26 Mar 2024 : Day 197 #
If you've been following any of these diary entries over the last couple of weeks you'll know I've been struggling to diagnose a problem related to graphics surfaces. A serious bug prevented the graphics surface from being properly created, but as soon as that was fixed another serious issue appeared: after a short period of time using the WebView the app started to seize up, rapidly progressing to the entire phone. After a while the watchdog kicked in causing the phone to reboot itself.
This is, as a general rule, not considered ideal behaviour for an application.
Since then I've been generally debugging, monitoring and annotating the code to try to figure out what was causing the problem. As of yesterday I'd narrowed the issue down to the creation of the EGL image associated with an EGL texture. Each frame the app would create the texture, then create the image from the texture and then create a surface from that.
Skipping execution from anywhere up to the image creation and beyond would result in the seizing up happening. This led me to the EGL instructions: creating and destroying the image.
I've been looking at this code in ShareSurfaceEGL.cpp quite deeply for a couple of weeks now. And finally, narrowing down the area of consideration has finally thrown up something useful.
It turns out that while the surface destructor is called correctly and that this calls fDestroyImage() correctly, that's not all it's supposed to be doing.
All of this was stuff we checked yesterday: a call to fDestroyImage() was being called for every call to fCreateImage() except two, allowing for the front and back buffer to exist at all times.
But looking at the code today I realised there was something missing. When the image is created in SharedSurface_EGLImage::Create() it needs a texture to work with. And so we have this code:
Here is our destructor code in ESR 91:
To be clear, there's still no rendering happening to the screen, but this is nevertheless an important step forwards and I'm pretty chuffed to have noticed the missing code. In retrospect, it's something I should have noticed a lot earlier, but this goes to show both how intricate these things are, and where my limitations are as a developer. It's hard to keep all of the execution paths in my head all at the same time. As a result I'm left using these often trial-and-error based approaches to finding fixes.
It's a small victory. But it means that tomorrow I can continue on with the proper job of finding out why the render never makes it to the screen. With this resolved I'm feeling more confident again that it will be possible to get to the bottom of it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
This is, as a general rule, not considered ideal behaviour for an application.
Since then I've been generally debugging, monitoring and annotating the code to try to figure out what was causing the problem. As of yesterday I'd narrowed the issue down to the creation of the EGL image associated with an EGL texture. Each frame the app would create the texture, then create the image from the texture and then create a surface from that.
Skipping execution from anywhere up to the image creation and beyond would result in the seizing up happening. This led me to the EGL instructions: creating and destroying the image.
I've been looking at this code in ShareSurfaceEGL.cpp quite deeply for a couple of weeks now. And finally, narrowing down the area of consideration has finally thrown up something useful.
It turns out that while the surface destructor is called correctly and that this calls fDestroyImage() correctly, that's not all it's supposed to be doing.
All of this was stuff we checked yesterday: a call to fDestroyImage() was being called for every call to fCreateImage() except two, allowing for the front and back buffer to exist at all times.
But looking at the code today I realised there was something missing. When the image is created in SharedSurface_EGLImage::Create() it needs a texture to work with. And so we have this code:
GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size); if (!prodTex) { return ret; } EGLClientBuffer buffer = reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex)); EGLImage image = egl->fCreateImage(context, LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);First create the texture then pass this in to the image creation routine. But while the image is deleted in the destructor, the texture is not!
Here is our destructor code in ESR 91:
SharedSurface_EGLImage::~SharedSurface_EGLImage() { const auto& gle = GLContextEGL::Cast(mDesc.gl); const auto& egl = gle->mEgl; egl->fDestroyImage(mImage); if (mSync) { // We can't call this unless we have the ext, but we will always have // the ext if we have something to destroy. egl->fDestroySync(mSync); mSync = 0; } }The image and sync are both destroyed, but the texture never is. So what happens if we add in the texture deletion? To test this I've added it in and the code now looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() { const auto& gle = GLContextEGL::Cast(mDesc.gl); const auto& egl = gle->mEgl; egl->fDestroyImage(mImage); if (mSync) { // We can't call this unless we have the ext, but we will always have // the ext if we have something to destroy. egl->fDestroySync(mSync); mSync = 0; } if (!mDesc.gl || !mDesc.gl->MakeCurrent()) return; mDesc.gl->fDeleteTextures(1, &mProdTex); mProdTex = 0; }And now, after building and running this new version, the app no longer seizes up!
To be clear, there's still no rendering happening to the screen, but this is nevertheless an important step forwards and I'm pretty chuffed to have noticed the missing code. In retrospect, it's something I should have noticed a lot earlier, but this goes to show both how intricate these things are, and where my limitations are as a developer. It's hard to keep all of the execution paths in my head all at the same time. As a result I'm left using these often trial-and-error based approaches to finding fixes.
It's a small victory. But it means that tomorrow I can continue on with the proper job of finding out why the render never makes it to the screen. With this resolved I'm feeling more confident again that it will be possible to get to the bottom of it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
25 Mar 2024 : Day 196 #
Yesterday I finally narrowed down the error causing the WebView app to seize up during execution to a call to EglDisplay::fCreateImage(). Now it may not be this call that's the problem, it might be the way the result is used or the fact that it's not being freed properly, or maybe the parameters that are being passed in to it. But the fact that we've narrowed it down is likely to be a big help in figuring things out.
The call itself goes through to a method that looks like this:
What does this do? According to the documentation:
Super. Where does that leave us? Well, it tells us that the call to fCreateImage() in our SharedSurface_EGLImage::Create() is really just a bunch of simple wrapper calls that ends up calling an EGL function. What could be going wrong? One obvious potential problem is that the input parameters may be messed up. Another one is that each call to eglCreateImageKHR() creating an EGLImage object should be balanced out with a call to eglDestroyImageKHR() to destroy it.
We do have a call to eglDestroyImageKHR() happening in our SharedSurface_EGLImage destructor. It looks like this:
Just to ensure this matches the behaviour of the previous version I've also tested the same using the debugger on ESR 78. I got the same sequence of calls. First two creates, followed by balanced create and destroy calls so that there are exactly two images in existence at any one time:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
The call itself goes through to a method that looks like this:
EGLImage fCreateImage(EGLContext ctx, EGLenum target, EGLClientBuffer buffer, const EGLint* attribList) const { MOZ_ASSERT(HasKHRImageBase()); return mLib->fCreateImage(mDisplay, ctx, target, buffer, attribList); }Here mLib is an instance of GLLibraryEGL. It looks like we have several layers of wrappers here so let's continue digging. This goes through to the following method that's part of GLLibraryEGL:
EGLImage fCreateImage(EGLDisplay dpy, EGLContext ctx, EGLenum target, EGLClientBuffer buffer, const EGLint* attrib_list) const { WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list)); }That looks similar but it's not quite the same. It is just another wrapper though, this time going through to a dynamically created method. The WRAP() macro looks like this:
#define WRAP(X) \ PROFILE_CALL \ BEFORE_CALL \ const auto ret = mSymbols.X; \ AFTER_CALL \ return retThe PROFILE_CALL, BEFORE_CALL and AFTER_CALL lines are all macros which turn into something functional in the Android build, but in our build are just empty. That means that the WRAP(fCreateImageKHR(dpy, ctx, target, buffer, attrib_list)) statement actually reduces down to just the following:
const auto ret = mSymbols.fCreateImageKHR(dpy, ctx, target, buffer, attrib_list); return retThe mSymbols object has the following defined on it:
EGLImage(GLAPIENTRY* fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx, EGLenum target, EGLClientBuffer buffer, const EGLint* attrib_list);Here EGLImage is a typedef of void* and GLAPIENTRY is an empty define, giving us a final result that looks like this:
void* (*fCreateImageKHR)(EGLDisplay dpy, EGLContext ctx, EGLenum target, EGLClientBuffer buffer, const EGLint* attrib_list);We're still not quite there though. Inside GLLibraryEGL.cpp we find this:
const SymLoadStruct symbols[] = {SYMBOL(CreateImageKHR), SYMBOL(DestroyImageKHR), END_OF_SYMBOLS}; (void)fnLoadSymbols(symbols);This is packing symbols with some data which is then passed in to fnLoadSymbols(), a method for loading symbols from a dynamically loaded library. The define that's used here is the following:
#define SYMBOL(X) \ { \ (PRFuncPtr*)&mSymbols.f##X, { \ { "egl" #X } \ } \ }Notice how here it's playing around with the input argument so that, with a little judicious simplification for clarity, SYMBOL(CreateImageKHR) becomes:
mSymbols.fCreateImageKHR, {{ "eglCreateImageKHR" }}In other words (big reveal, but no big surprise) a call to mSymbols.fCreateImageKHR() will get converted into a call to the EGL function eglCreateImageKHR, loaded in from the EGL driver.
What does this do? According to the documentation:
eglCreateImage is used to create an EGLImage object from an existing image resource buffer. display specifies the EGL display used for this operation. context specifies the EGL client API context used for this operation, or EGL_NO_CONTEXT if a client API context is not required. target specifies the type of resource being used as the EGLImage source (examples include two-dimensional textures in OpenGL ES contexts and VGImage objects in OpenVG contexts). buffer is the name (or handle) of a resource to be used as the EGLImage source, cast into the type EGLClientBuffer. attrib_list is a list of attribute-value pairs which is used to select sub-sections of buffer for use as the EGLImage source, such as mipmap levels for OpenGL ES texture map resources, as well as behavioral options, such as whether to preserve pixel data during creation. If attrib_list is non-NULL, the last attribute specified in the list must be EGL_NONE.
Super. Where does that leave us? Well, it tells us that the call to fCreateImage() in our SharedSurface_EGLImage::Create() is really just a bunch of simple wrapper calls that ends up calling an EGL function. What could be going wrong? One obvious potential problem is that the input parameters may be messed up. Another one is that each call to eglCreateImageKHR() creating an EGLImage object should be balanced out with a call to eglDestroyImageKHR() to destroy it.
We do have a call to eglDestroyImageKHR() happening in our SharedSurface_EGLImage destructor. It looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() { const auto& gle = GLContextEGL::Cast(mDesc.gl); const auto& egl = gle->mEgl; egl->fDestroyImage(mImage); [...]There's an unexpected difference with the way it's called in ESR 78, where the code looks like this:
SharedSurface_EGLImage::~SharedSurface_EGLImage() { const auto& gle = GLContextEGL::Cast(mGL); const auto& egl = gle->mEgl; egl->fDestroyImage(egl->Display(), mImage); [...]Notice the extra egl->Display() value being passed in as a parameter. That's because in ESR 91 EGLLibrary is storing its own copy of the EGLDisplay:
EGLBoolean fDestroyImage(EGLImage image) const { MOZ_ASSERT(HasKHRImageBase()); return mLib->fDestroyImage(mDisplay, image); }That gives us a couple of things to look into: first, is the correctly value being passed in for image? Second, is the value stored for mDisplay valid? The underlying call to eglDestroyImage also has a Boolean return value which will return EGL_FALSE in case something goes wrong. A nice first step would be to check this return value in case it's indicating a problem. To do this I've added some additional debug output to the code:
EGLBoolean result = egl->fDestroyImage(mImage); printf_stderr("RENDER: fDestroyImage() return value: %d\n", result);The result of running it shows a large number of successful calls to fDestroyImage():
[...] [JavaScript Warning: "Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content." {file: "https://jolla.com/themes/unlike/js/ modernizr.js?x98582&ver=2.6.2" line: 4}] RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 RENDER: fDestroyImage() return value: 1 [...]Since this output looks okay I've taken it a step further and added a count to the creation and deletion calls in case it shows any imbalance between the two.
[...] Frame script: embedhelper.js loaded RENDER: fCreateImage() return value: 1, 0 RENDER: fCreateImage() return value: 1, 1 CONSOLE message: [JavaScript Warning: "This page uses the non standard property “zoom”. Consider using calc() in the relevant property values, or using “transform” along with “transform-origin: 0 0”." {file: "https://jolla.com/ " line: 0}] CONSOLE message: [JavaScript Warning: "Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content." {file: "https://jolla.com/themes/unlike/js/ modernizr.js?x98582&ver=2.6.2" line: 4}] RENDER: fCreateImage() return value: 1, 2 RENDER: fDestroyImage() return value: 1, 0 RENDER: fCreateImage() return value: 1, 3 RENDER: fDestroyImage() return value: 1, 1 [...] RENDER: fCreateImage() return value: 1, 316 RENDER: fDestroyImage() return value: 1, 314 RENDER: fCreateImage() return value: 1, 317 RENDER: fDestroyImage() return value: 1, 315 [...]The increasing numbers (going up to 317 and 315 here) tell us that the balance between creates and destroys is pretty clean. There are two creates at the start which don't have matching destroys, after which everything is balanced. It seems unlikely therefore that this is the cause of the seize-ups. What's more, it all makes sense too: at any point in time there should be a front and a back buffer, so there should always be exactly two images in existence at any one time. That's a situation that's confirmed by the numbers.
Just to ensure this matches the behaviour of the previous version I've also tested the same using the debugger on ESR 78. I got the same sequence of calls. First two creates, followed by balanced create and destroy calls so that there are exactly two images in existence at any one time:
fCreateImage fCreateImage fCreateImage fDestroyImage fCreateImage fDestroyImage fCreateImage [...]In conclusion everything here looks in order on ESR 91. So tomorrow I'll move on to checking that the display value is set correctly.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
24 Mar 2024 : Day 195 #
I'm working my way through the SharedSurface_EGLImage::Create() method and gradually increasing the steps that are executed. Over the last few days I first established that preventing the SurfaceFactory_EGLImage from being created was enough to prevent the app from seizing up. Without the factory the surfaces themselves weren't going to get created. Next I enabled the factory but disabled the image creation.
Today I'm allowing the offscreen texture to be created by allowing this call to take place:
Change made, code built, binary transferred and library installed. Now running the app, there's no seizing up. So that takes us one more step closer to finding the culprit. Now I've moved the early return one step later, until after the EGLImage has been created using the texture, after these lines:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
Today I'm allowing the offscreen texture to be created by allowing this call to take place:
GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size);But I've placed a return immediately afterwards so that neither the image nor the surface that builds on this are created. Once again the objective is to find out whether the app seizes up or not. If it does then that would point to the texture being the culprit. If not, it's likely something that follows it.
Change made, code built, binary transferred and library installed. Now running the app, there's no seizing up. So that takes us one more step closer to finding the culprit. Now I've moved the early return one step later, until after the EGLImage has been created using the texture, after these lines:
EGLClientBuffer buffer = reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex)); EGLImage image = egl->fCreateImage(context, LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr); if (!image) { prodGL->fDeleteTextures(1, &prodTex); return ret; }Once again, I've build, transferred and installed the updated library. And now when I run it... the app seizes up! So we have our culprit. The problem seems to be the creation of the image from the surface that's either causing the problem in itself, or triggering something else to cause the problem. The most likely offender in the latter case would be if the created image weren't being freed:
EGLImage image = egl->fCreateImage(context, LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr);This is reminiscent of a problem I experienced earlier which resulted in me having to disable the texture capture for the cover image. Now that it's narrowed down I can look into the underlying reason. That will be my task for tomorrow morning.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
23 Mar 2024 : Day 194 #
Today I'm trying to narrow things down after reconfirming that when there's no SurfaceFactory_EGLImage the app doesn't seize up. I want to focus on two methods. First to check if anything is up with SurfaceFactory_EGLImage::Create(). Second I want to try disabling elements of SharedSurface_EGLImage::Create() in case that makes any difference. It's just a short one today, but still important tasks for helping get to the bottom of things.
First up SurfaceFactory_EGLImage::Create(). There's nothing to disable here (all it does is return the factory) but there are some input parameters to check. I added some debug print code for this:
Next up is SharedSurface_EGLImage::Create(). For the first check I've got it to return almost immediately with a null return value. This may or may not prevent the seizing up from happening. Either way it will be useful to know. If it does, then I'm focusing my intention in the correct place. If it doesn't I know I need to focus elsewhere.
The builds for this aren't taking the same seven hours that a full gecko builds, but they do still take tens of minutes. This seems to make me even more impatient. I think it's because 10 minutes is long enough to be noticeable, but not long enough that it's worth me context switching to some other task.
On copying over the library and executing the WebView app I find that this change has indeed stopped the app from seizing up. So this is excellent news. It means that tomorrow I can continue through the SharedSurface_EGLImage::Create() narrowing down where the problem starts.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
First up SurfaceFactory_EGLImage::Create(). There's nothing to disable here (all it does is return the factory) but there are some input parameters to check. I added some debug print code for this:
if (HasEglImageExtensions(*gle)) { printf_stderr("RENDER: prodGL: %p\n", prodGL); printf_stderr("RENDER: caps: any %d, color %d, alpha %d, bpp16 %d, depth %d, stencil %d, premultAlpha %d, preserve %d\n", caps.any, caps.color, caps.alpha, caps.bpp16, caps.depth, caps.stencil, caps.premultAlpha, caps.preserve); printf_stderr("RENDER: allocator: %p\n", allocator.get()); printf_stderr("RENDER: flags: %#x\n", (uint32_t)flags); printf_stderr("RENDER: context: %p\n", context); // The surface allocator that we want to create this // for. May be null. RefPtr<layers::LayersIPCChannel> surfaceAllocator; ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic, layers::TextureType:: Unknown, true}, caps, allocator, flags, context)); }On ESR 78 the values from this method are the following:
=============== Preparing offscreen rendering context =============== RENDER: prodGL: 0x7eac109140 RENDER: caps: any 0, color 1, alpha 0, bpp16 0, depth 0, stencil 0, premultAlpha 1, preserve 0 RENDER: allocator: (nil) RENDER: flags: 0x2 RENDER: context: 0x7eac004d50And the values in ESR 91 are identical, other than certain structures residing in different places in memory:
=============== Preparing offscreen rendering context =============== RENDER: prodGL: 0x7ed819aa50 RENDER: caps: any 0, color 1, alpha 0, bpp16 0, depth 0, stencil 0, premultAlpha 1, preserve 0 RENDER: allocator: (nil) RENDER: flags: 0x2 RENDER: context: 0x7ed8004be0I forgot to add caps.surfaceAllocator to this list, but using the debugger I was able to confirm that this is set to null in both cases.
Next up is SharedSurface_EGLImage::Create(). For the first check I've got it to return almost immediately with a null return value. This may or may not prevent the seizing up from happening. Either way it will be useful to know. If it does, then I'm focusing my intention in the correct place. If it doesn't I know I need to focus elsewhere.
The builds for this aren't taking the same seven hours that a full gecko builds, but they do still take tens of minutes. This seems to make me even more impatient. I think it's because 10 minutes is long enough to be noticeable, but not long enough that it's worth me context switching to some other task.
On copying over the library and executing the WebView app I find that this change has indeed stopped the app from seizing up. So this is excellent news. It means that tomorrow I can continue through the SharedSurface_EGLImage::Create() narrowing down where the problem starts.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
22 Mar 2024 : Day 193 #
We took a bit of an interlude from my usual plan yesterday to consider some of the many useful suggestions that I've received over the last couple of weeks. I generated quite a few log files but didn't find any obvious discrepancies in them. That's not to say we've seen the end of those ideas and I'm still interested to know if anyone else can spot anything that looks worth following up. In the meantime I'm dropping back into my usual cadence with a plan to test out changes to the source code.
The problem I'm trying to solve is the seizing up of the app. I first noticed this after fixing a bug in the SharedSurface_EGLImage::Create() method. Over the next couple of days I'm planning to work through this method disabling various parts of it to try to pin down exactly which parts are causing the problem.
The start of the method looks like this:
The rendering was already broken and this change will only break it further, but the question I want to know the answer to is: "will this stop the app from seizing up".
The answer is: "yes". This one small change means I can now leave the app running for long periods, swiping between the lipstick home screen and the app, swiping up and down on the page (which no obvious effect, but the touch input is still going through). The app remains responsive, my phone's watchdog doesn't bite and the OS doesn't reboot.
I'm glad about this. If it hadn't been the case it would have meant I'd misunderstood where the problem was coming from. Now reassured I can continue on to partition the code up and try to narrow down the error.
But that will be for tomorrow. For today, it's good to be armed with this backstop of knowledge about where the error is emanating from.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
The problem I'm trying to solve is the seizing up of the app. I first noticed this after fixing a bug in the SharedSurface_EGLImage::Create() method. Over the next couple of days I'm planning to work through this method disabling various parts of it to try to pin down exactly which parts are causing the problem.
The start of the method looks like this:
/*static*/ UniquePtr<SurfaceFactory_EGLImage> SurfaceFactory_EGLImage::Create( GLContext* prodGL, const SurfaceCaps& caps, const RefPtr<layers::LayersIPCChannel>& allocator, const layers::TextureFlags& flags) { const auto& gle = GLContextEGL::Cast(prodGL); //const auto& egl = gle->mEgl; const auto& context = gle->mContext; typedef SurfaceFactory_EGLImage ptrT; UniquePtr<ptrT> ret; if (HasEglImageExtensions(*gle)) { ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic, layers::TextureType:: Unknown, true}, caps, allocator, flags, context)); } return ret; }The original problem was that the HasExtensions() condition was causing the method to be exited early. In fact, really before the method had done anything at all. So I'm forcing an early return to simulate the same behaviour; like this:
[...] typedef SurfaceFactory_EGLImage ptrT; UniquePtr<ptrT> ret; return ret; [...]I've rebuilt it using the standard partial-build process, copied the resulting libxul.so over to my phone and installed it.
The rendering was already broken and this change will only break it further, but the question I want to know the answer to is: "will this stop the app from seizing up".
The answer is: "yes". This one small change means I can now leave the app running for long periods, swiping between the lipstick home screen and the app, swiping up and down on the page (which no obvious effect, but the touch input is still going through). The app remains responsive, my phone's watchdog doesn't bite and the OS doesn't reboot.
I'm glad about this. If it hadn't been the case it would have meant I'd misunderstood where the problem was coming from. Now reassured I can continue on to partition the code up and try to narrow down the error.
But that will be for tomorrow. For today, it's good to be armed with this backstop of knowledge about where the error is emanating from.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
21 Mar 2024 : Day 192 #
Over the last week or so while I've been struggling with what's almost certainly a texture or surface related bug, I've received many more suggestions than I expected from other Sailfish OS developers, both from Sailors (Jolla employees) and independent developers. Receiving all of this helpful input is hugely motivational and much appreciated. I'm always keen to try out others' suggestions, not least because I'm very aware of how much amazing knowledge there is out there and which it's always going to be beneficial to draw from. The Sailfish OS hivemind is impressively well-informed when it comes to technical matters.
However, these diary entries all form part of a pipeline. There's the pipeline of the plans I'm working my way though, on top of which there's also the "editing" pipeline, which means there's a lag of a couple of days between me writing a post and it going online.
So since I already had a set collection of things I wanted to try (checking creation/deletion balance, measuring memory consumption, running the app through valgrind), I've not been in a position to try out some of these suggestions until now — until the other tasks had worked their way through the pipeline — essentially.
Now that I've moved off valgrind and am going back to the source code, now seems like a good time to take a look at some of the suggestions I've received. Now while on the one hand I want to give each suggestion each a fair crack at solving the problem, on the other hand many of the suggestions touch on areas I'm not so familiar with. Consequently I'm going to explain what's happening here, but I may need further input for some of them.
First up, Tone (tortoisedoc) has made a number of useful suggestions recently. The latest is about making use of extra Wayland debugging output:
Thank you for this Tone. Running using WAYLAND_DEBUG=1 certainly produces a lot of output:
Please do feel free to download the output file yourself to take a look. I've also generated a similar file for the working ESR 78 build and which I was hoping may be useful for comparison. On ESR 78 the loop it settles into is similar:
A second nice suggestion, this one made by Tomi (tomin) in the same thread, was to check in case the textures aren't being properly released, resulting in the reserve of file descriptors becoming exhausted:
This sounds very plausible indeed. My suspicion has always been that the particular issue I'm experiencing relates to textures/surfaces not being released, but it hadn't occurred to me at all that it might be file-descriptor related and it certainly wouldn't have occurred to try using lsof to list them.
Executing lsof while the app is running shows over 400 open file descriptors. But there's nothing too dramatic that I can see there. Again, please feel free to check the output file in case you can spot something of interest. Running the command repeatedly to show open file descriptors over time shows a steady increase until it hits 440, at which point it stays fairly steady:
I've also received some helpful advice from Raine (rainemak). As it was a private message I don't want to quote it here, but Raine suggests several ways to make better use of the gecko logging capabilities which I'll definitely be making use of. He also suggested to look at a couple of the patches:
I'll take a careful look at these two and report back.
Finally, I need to reiterate my thanks to Florian Xaver (wosrediinanatour). While I've not had a chance to document in this diary all of the useful points Florian has been making over the last few weeks, I still very much intend to do so.
As I round off, I also want to mention this nice article by Arthur Schiwon (blizzz) which details his impressive attempts to reduce the memory footprint of his Nextcloud Talk app. Very relevant to some of the discussion here in recent days and with thanks to Patrick (pherjung) for pointing it out.
A big thank you for all of the great suggestions, feedback and general encouragement. I'm always happy to get this input and gradually it's all helping to create a clearer picture.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
However, these diary entries all form part of a pipeline. There's the pipeline of the plans I'm working my way though, on top of which there's also the "editing" pipeline, which means there's a lag of a couple of days between me writing a post and it going online.
So since I already had a set collection of things I wanted to try (checking creation/deletion balance, measuring memory consumption, running the app through valgrind), I've not been in a position to try out some of these suggestions until now — until the other tasks had worked their way through the pipeline — essentially.
Now that I've moved off valgrind and am going back to the source code, now seems like a good time to take a look at some of the suggestions I've received. Now while on the one hand I want to give each suggestion each a fair crack at solving the problem, on the other hand many of the suggestions touch on areas I'm not so familiar with. Consequently I'm going to explain what's happening here, but I may need further input for some of them.
First up, Tone (tortoisedoc) has made a number of useful suggestions recently. The latest is about making use of extra Wayland debugging output:
It seems you have exited browser world and are stepping into wayland lands, thats a good sign, problems yes but somewhere else 🙂 you could try the WAYLAND_DEBUG=1 (or "all" iirc) to complete your logs with the wayland surface creation protocol (it might be qt-wayland doesnt support swapchains?). Its a fairly simple protocol.
Thank you for this Tone. Running using WAYLAND_DEBUG=1 certainly produces a lot of output:
$ WAYLAND_DEBUG=1 harbour-webview 2>debug-harbour-webview-esr91-01.txt [D] unknown:0 - QML debugging is enabled. Only use this in a safe environment. [D] main:30 - WebView Example [D] main:44 - Using default start URL: "https://www.flypig.co.uk/search/ " [D] main:47 - Opening webview [1522475.832] -> wl_display@1.get_registry(new id wl_registry@2) [1522477.064] -> wl_display@1.sync(new id wl_callback@3) [1522479.962] wl_display@1.delete_id(3) [1522480.054] wl_registry@2.global(1, "wl_compositor", 3) [...]Having completed the initialisation sequence the output from the app then settles into a loop contianing the following:
[1523976.770] wl_buffer@4278190081.release() [1523976.951] wl_callback@45.done(5115868) [1523985.299] -> wl_surface@20.frame(new id wl_callback@45) [1523985.422] -> wl_surface@20.attach(wl_buffer@4278190080, 0, 0) [1523985.486] -> wl_surface@20.damage(0, 0, 1080, 2520) [1523985.557] -> wl_surface@20.commit() [1523985.581] -> wl_display@1.sync(new id wl_callback@44) [1523995.055] wl_display@1.delete_id(44) [1524013.656] wl_display@1.delete_id(45)Periodically there are also touch events that punctuate the output, presumably the result of me scrolling the page.
[1531447.943] qt_touch_extension@10.touch(5123658, 68, 65538, 7950000, 16980000, 7367, 6740, 50027, 50027, 255, 0, 0, 196608, array) [1531448.190] qt_touch_extension@10.touch(5123665, 68, 65538, 7940000, 16920000, 7358, 6716, 50027, 50027, 255, 0, 0, 196608, array)This continues right up until the app seizes up. Working through the file I don't see any changes towards the end that might explain why things are going wrong, but maybe Tone or another developer with a keener eye and greater expertise than I have can spot something?
Please do feel free to download the output file yourself to take a look. I've also generated a similar file for the working ESR 78 build and which I was hoping may be useful for comparison. On ESR 78 the loop it settles into is similar:
[2484510.845] wl_buffer@4278190080.release() [2484515.820] wl_display@1.delete_id(44) [2484515.928] wl_callback@44.done(186889767) [2484515.967] -> wl_surface@20.frame(new id wl_callback@44) [2484516.006] -> wl_surface@20.attach(wl_buffer@4278190082, 0, 0) [2484516.062] -> wl_surface@20.damage(0, 0, 1080, 2520) [2484516.129] -> wl_surface@20.commit() [2484516.152] -> wl_display@1.sync(new id wl_callback@49) [2484516.942] wl_display@1.delete_id(49)There is a slight difference in ordering: one of the deletes is out of sync across the two versions. Could this be significant? It's quite possible these logs are hiding a critical piece of information, but nothing is currently leaping out at me unfortunately.
A second nice suggestion, this one made by Tomi (tomin) in the same thread, was to check in case the textures aren't being properly released, resulting in the reserve of file descriptors becoming exhausted:
The issue that @flypig is now trying to figure out sounds a bit like the one I had with Qt Scene Graph recently. I wasn’t properly releasing textures so it kept reserving dmabuf file descriptors and then eventually the app crashed because it run out of fds (or rather I think it tried to use fd > 1024 for something that’s not compatible with such “high” value). Anyway lsof -p output might be worth looking at.
This sounds very plausible indeed. My suspicion has always been that the particular issue I'm experiencing relates to textures/surfaces not being released, but it hadn't occurred to me at all that it might be file-descriptor related and it certainly wouldn't have occurred to try using lsof to list them.
Executing lsof while the app is running shows over 400 open file descriptors. But there's nothing too dramatic that I can see there. Again, please feel free to check the output file in case you can spot something of interest. Running the command repeatedly to show open file descriptors over time shows a steady increase until it hits 440, at which point it stays fairly steady:
$ while true ; do bash -c \ 'lsof -p $(pgrep "harbour-webview") | wc -l' ; \ sleep 0.5 ; done [...] 121 280 351 434 435 444 443 443 443 440 440 440From this it doesn't look like it's a case of file descriptor exhaustion, but I'd be very interested in others' opinions on this.
I've also received some helpful advice from Raine (rainemak). As it was a private message I don't want to quote it here, but Raine suggests several ways to make better use of the gecko logging capabilities which I'll definitely be making use of. He also suggested to look at a couple of the patches:
- rpm/0065-sailfishos-gecko-Prioritize-loading-of-extension-ver.patch
- rpm/0047-sailfishos-egl-Drop-swap_buffers_with_damage-extensi.patch
I'll take a careful look at these two and report back.
Finally, I need to reiterate my thanks to Florian Xaver (wosrediinanatour). While I've not had a chance to document in this diary all of the useful points Florian has been making over the last few weeks, I still very much intend to do so.
As I round off, I also want to mention this nice article by Arthur Schiwon (blizzz) which details his impressive attempts to reduce the memory footprint of his Nextcloud Talk app. Very relevant to some of the discussion here in recent days and with thanks to Patrick (pherjung) for pointing it out.
A big thank you for all of the great suggestions, feedback and general encouragement. I'm always happy to get this input and gradually it's all helping to create a clearer picture.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
20 Mar 2024 : Day 191 #
This morning I'm taking the first opportunity to test out the new libxul.so binary generated from the build yesterday. Firing up the app there's no render and although initially things seem to be running fine (apart from the complete lack of anything useful on-screen!) after a short time the browser seizes up again.
So sadly this hasn't fixed the underlying issues. But I'm hoping it still fixed something. I can at least push the app through valgrind to see if it had some effect or not.
So this is exactly what I've done.
This is good news: it seems that the change did the trick and, although the render isn't working, the change also doesn't cause the WebView to crash. It feels like this is one step closer to getting our desired result.
For those who are still following along, or just desperate to get your hands on a build to test, I appreciate it's slow progress. But each fix is a fix that's necessary and I remain confident that eventually all of the pieces will fall in to place. This is very definitely a marathon and not a sprint. An Odyssey not a citybreak. We'll get there!
Having exhausted valgrind as a means to find the problem, we now need a new approach. In order to figure out where to focus, I'm going to go back to the SharedSurface_EGLImage::Create() method. You may recall that the seizing up of execution only happened after we persuaded this method to do something sensible and not just return null all the time.
I want to revisit this. My plan is to go through this method and judiciously remove parts of it until the seizing up no longer happens. The plan is to narrow down exactly which part is causing the issue, which should also narrow down where to look for a possible fix.
Unfortunately other commitments mean it's only a short one today. I'm very much hoping to be able to get to this task of cutting out elements to see the effects tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
So sadly this hasn't fixed the underlying issues. But I'm hoping it still fixed something. I can at least push the app through valgrind to see if it had some effect or not.
So this is exactly what I've done.
$ valgrind --log-file=valgrind-harbour-webview-esr91-02.txt harbour-webviewThe resulting output file still — unsurprisingly — contains no shortage of memory mess, but checking the Compositor thread there's no sign of anything related to fGetDisplay(), GetAndInitDisplay() nor CreateOffscreen().
This is good news: it seems that the change did the trick and, although the render isn't working, the change also doesn't cause the WebView to crash. It feels like this is one step closer to getting our desired result.
For those who are still following along, or just desperate to get your hands on a build to test, I appreciate it's slow progress. But each fix is a fix that's necessary and I remain confident that eventually all of the pieces will fall in to place. This is very definitely a marathon and not a sprint. An Odyssey not a citybreak. We'll get there!
Having exhausted valgrind as a means to find the problem, we now need a new approach. In order to figure out where to focus, I'm going to go back to the SharedSurface_EGLImage::Create() method. You may recall that the seizing up of execution only happened after we persuaded this method to do something sensible and not just return null all the time.
I want to revisit this. My plan is to go through this method and judiciously remove parts of it until the seizing up no longer happens. The plan is to narrow down exactly which part is causing the issue, which should also narrow down where to look for a possible fix.
Unfortunately other commitments mean it's only a short one today. I'm very much hoping to be able to get to this task of cutting out elements to see the effects tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
19 Mar 2024 : Day 190 #
Yesterday I collected valgrind logs from ESR 78 and ESR 91. Even though both logs contain a huge number of errors, I whittled the interesting ones down to just two.
First there's an invalid read size coming from GetAndInitDisplay(). This should be straightforward to track down.
The reason the for the EGLDisplay being deleted is because of this call here:
In ESR 78 the logic around EglDisplay didn't exist. Back then there was just an EGLDisplay value, which is a pointer to an opaque EGL structure and which could be passed around everywhere. In ESR 91 an effort has been made to generalise this, so that multiple such displays can be handled simultaneously. As part of this move the value was wrapped in the similarly — but not identically — named EglDisplay, which contains an EGLDisplay pointer along with some other values and helper methods.
It looks like there's something going wrong with this. Disentangling it could be horrific, but there is something that works in our favour, which is that the two calls don't happen to far apart from one another.
In fact, the call stack for both seems to originate in this method:
The actual failure is happening inside libEGL. I looked at the code there and, to be honest, it's not clear to me what the underlying reason is. Nevertheless, it's clear that there's something going wrong here in the gecko code that can be fixed and which should address it.
When the WebView is initialised it has to create an EGL display. Starting at the DefaultEglDisplay() method I copied out above there's a problem and the problem is that the display is created not once, but twice. To understand this, we have to note the fact that some time ago inside the GLLibraryEGL::Init() method I added the following code:
This piece of code is a problem for two reasons, which become clear when we follow the flow through DefaultEglDisplay():
I'm about to make this change when something else peculiar jumps out at me. The call to GLLibraryEGL::DefaultDisplay() checks whether the value of mDefaultDisplay is valid and only calls CreateDisplay() to create a new one if it's not. That's strange, since the value should already have been set in GLLibraryEGL::Init(). The exception is if CreateDisplay() returns null. It is possible that this is what's causing this issue. Let's check:
I've deleted the five lines. I've set the build running.
While that chugs away, let's briefly look at the other potentially relevant issue that valgrind threw up. This issue relates to APZCTreeManager::PrintAPZCInfo(). Although that doesn't sound especially relevant, the code is from the Compositor thread, so may be worth looking in to just in case.
It's been a long one today and I apologise for that. Sometimes it takes a lot to unravel the code and structure my thoughts around it. The key question now is whether the small change I've made will make any difference to the rendering. We'll find that out tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
First there's an invalid read size coming from GetAndInitDisplay(). This should be straightforward to track down.
==29044== Thread 32 Compositor: ==29044== Invalid read of size 8 ==29044== at 0xCCF62E0: hybris_egl_display_get_mapping (in /usr/lib64/ libEGL.so.1.0.0) ==29044== by 0xCCF63BB: ??? (in /usr/lib64/libEGL.so.1.0.0) ==29044== by 0x758848B: fGetDisplay (GLLibraryEGL.h:193) ==29044== by 0x758848B: mozilla::gl::GetAndInitDisplay(mozilla::gl:: GLLibraryEGL&, void*, void*) (GLLibraryEGL.cpp:151) ==29044== by 0x75889C7: mozilla::gl::GLLibraryEGL::CreateDisplay(bool, nsTSubstring<char>*, void*) (GLLibraryEGL.cpp:813) ==29044== by 0x7589917: mozilla::gl::GLLibraryEGL::DefaultDisplay( nsTSubstring<char>*) (GLLibraryEGL.cpp:745) ==29044== by 0x759AFBF: DefaultEglDisplay (GLContextEGL.h:33) ==29044== by 0x759AFBF: mozilla::gl::GLContextProviderEGL::CreateHeadless( mozilla::gl::GLContextCreateDesc const&, nsTSubstring<char>*) ( GLContextProviderEGL.cpp:1246) ==29044== by 0x759B89B: mozilla::gl::GLContextProviderEGL::CreateOffscreen( mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gl:: SurfaceCaps const&, mozilla::gl::CreateContextFlags, nsTSubstring<char>*) ( GLContextProviderEGL.cpp:1288) ==29044== by 0x76042BB: mozilla::layers::CompositorOGL::CreateContext() ( CompositorOGL.cpp:256) [...] ==29044== Address 0x14c39ef0 is 0 bytes inside a block of size 40 free'd ==29044== at 0x484EAD8: operator delete(void*, unsigned long) ( vg_replace_malloc.c:935) ==29044== by 0xCCF64BB: eglTerminate (in /usr/lib64/libEGL.so.1.0.0) ==29044== by 0x7587473: fTerminate (GLLibraryEGL.h:234) ==29044== by 0x7587473: fTerminate (GLLibraryEGL.h:639) ==29044== by 0x7587473: mozilla::gl::EglDisplay::~EglDisplay() ( GLLibraryEGL.cpp:734) ==29044== by 0x758752B: destroy<mozilla::gl::EglDisplay> (new_allocator.h: 140) ==29044== by 0x758752B: destroy<mozilla::gl::EglDisplay> (alloc_traits.h:487) ==29044== by 0x758752B: std::_Sp_counted_ptr_inplace<mozilla::gl:: EglDisplay, std::allocator<mozilla::gl::EglDisplay>, (__gnu_cxx:: _Lock_policy)2>::_M_dispose() (shared_ptr_base.h:554) ==29044== by 0x7575F33: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>:: _M_release() (shared_ptr_base.h:155) ==29044== by 0x758937F: ~__shared_count (shared_ptr_base.h:728) ==29044== by 0x758937F: ~__shared_ptr (shared_ptr_base.h:1167) ==29044== by 0x758937F: ~shared_ptr (shared_ptr.h:103) ==29044== by 0x758937F: mozilla::gl::GLLibraryEGL::Init(bool, nsTSubstring<char>*, void*) (GLLibraryEGL.cpp:504) ==29044== by 0x75895F7: mozilla::gl::GLLibraryEGL::Create( nsTSubstring<char>*) (GLLibraryEGL.cpp:345) ==29044== by 0x758974F: mozilla::gl::DefaultEglLibrary(nsTSubstring<char>*) ( GLContextProviderEGL.cpp:1331) ==29044== by 0x759AFAB: DefaultEglDisplay (GLContextEGL.h:29) ==29044== by 0x759AFAB: mozilla::gl::GLContextProviderEGL::CreateHeadless( mozilla::gl::GLContextCreateDesc const&, nsTSubstring<char>*) ( GLContextProviderEGL.cpp:1246) ==29044== by 0x759B89B: mozilla::gl::GLContextProviderEGL::CreateOffscreen( mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> const&, mozilla::gl:: SurfaceCaps const&, mozilla::gl::CreateContextFlags, nsTSubstring<char>*) ( GLContextProviderEGL.cpp:1288) ==29044== by 0x76042BB: mozilla::layers::CompositorOGL::CreateContext() ( CompositorOGL.cpp:256) [...] ==29044== Block was alloc'd at ==29044== at 0x484BF24: operator new(unsigned long) (vg_replace_malloc.c:422) ==29044== by 0x14E619FB: waylandws_GetDisplay (in /usr/lib64/libhybris/ eglplatform_wayland.so) ==29044== by 0xCCF63C7: ??? (in /usr/lib64/libEGL.so.1.0.0) ==29044== by 0x14960F73: ??? (in /usr/lib64/qt5/plugins/ wayland-graphics-integration-client/libwayland-egl.so) ==29044== by 0x14778947: QtWaylandClient::QWaylandIntegration:: initializeClientBufferIntegration() (in /usr/lib64/ libQt5WaylandClient.so.5.6.3) [...]Let's dig into this one first. The code that's identified in the error output is the following:
static std::shared_ptr<EglDisplay> GetAndInitDisplay(GLLibraryEGL& egl, void* displayType, EGLDisplay display = EGL_NO_DISPLAY) { if (display == EGL_NO_DISPLAY) { display = egl.fGetDisplay(displayType); } if (!display) return nullptr; return EglDisplay::Create(egl, display, false); }It's the call to egl.fGetDisplay() that we can see towards the top of the error backtrace. But note that there are two call stacks in the output. The first is for the data being read that leads to the code above, the second is for where the memory being accessed was previously deleted:
==29044== Address 0x14c39ef0 is 0 bytes inside a block of size 40 free'd ==29044== at 0x484EAD8: operator delete(void*, unsigned long) ( vg_replace_malloc.c:935)The error itself is "Invalid read of size 8" which makes sense: we're reading from memory that's no longer allocated. Checking the call stack, here's the code (it's specifically the call to fTerminate()) that frees the memory that's subsequently read:
EglDisplay::~EglDisplay() { fTerminate(); mLib->mActiveDisplays.erase(mDisplay); }It looks like value of egl that's being passed into GetAndInitDisplay() is an instance of EglDisplay that has already been deleted. Not good.
The reason the for the EGLDisplay being deleted is because of this call here:
std::shared_ptr<EglDisplay> defaultDisplay = CreateDisplay(forceAccel, out_failureId, aDisplay);It's presumably not the call to CreateDisplay() that's causing the deletion, but rather the replacement of the EGlDisplay contained in the shared_ptr with the one that's returned from it.
In ESR 78 the logic around EglDisplay didn't exist. Back then there was just an EGLDisplay value, which is a pointer to an opaque EGL structure and which could be passed around everywhere. In ESR 91 an effort has been made to generalise this, so that multiple such displays can be handled simultaneously. As part of this move the value was wrapped in the similarly — but not identically — named EglDisplay, which contains an EGLDisplay pointer along with some other values and helper methods.
It looks like there's something going wrong with this. Disentangling it could be horrific, but there is something that works in our favour, which is that the two calls don't happen to far apart from one another.
In fact, the call stack for both seems to originate in this method:
inline std::shared_ptr<EglDisplay> DefaultEglDisplay( nsACString* const out_failureId) { const auto lib = DefaultEglLibrary(out_failureId); if (!lib) { return nullptr; } return lib->DefaultDisplay(out_failureId); }The call to DefaultEglLibrary() in this code ends up calling eglTerminate() which deletes the value. Then the call to DefaultDisplay() attempts to read the value in again. It reads the value that was just deleted, whereas it should — presumably — be using the newly created value.
The actual failure is happening inside libEGL. I looked at the code there and, to be honest, it's not clear to me what the underlying reason is. Nevertheless, it's clear that there's something going wrong here in the gecko code that can be fixed and which should address it.
When the WebView is initialised it has to create an EGL display. Starting at the DefaultEglDisplay() method I copied out above there's a problem and the problem is that the display is created not once, but twice. To understand this, we have to note the fact that some time ago inside the GLLibraryEGL::Init() method I added the following code:
std::shared_ptr<EglDisplay> defaultDisplay = CreateDisplay(forceAccel, out_failureId, aDisplay); if (!defaultDisplay) { return false; } mDefaultDisplay = defaultDisplay;Why did I add this? It was to replace the following code which had been removed from ESR 78 by D85496 in order to "Support multiple EglDisplays per GLLibraryEGL":
mEGLDisplay = CreateDisplay(forceAccel, gfxInfo, out_failureId, aDisplay); if (!mEGLDisplay) { return false; }It seemed natural, when reversing some of the changes moving from ESR 78 to ER 91, to reintroduce the creation of the display, albeit it's now an EglDisplay wrapper rather than the EGLDisplay itself. I remember these changes being a huge source of angst back when I made them around Day 77.
This piece of code is a problem for two reasons, which become clear when we follow the flow through DefaultEglDisplay():
- DefaultEglLibrary() is called. At this stage things are still being initialised and so there is no GLLibraryEGL just yet. As a result the code is called to create it.
- This leads to a call to GLLibraryEGL::Init() which initialises the library.
- Inside this Init() method the code I added to call CreateDisplay() is called, the intention being to mimic the behaviour of ESR 78.
- defaultDisplay is a shared pointer so when the assignment happens there's a call made to the EglDisplay destructor. I wasn't expecting this, but it seems to be borne out by what the debugger shows. This is the first problem: there should be no construction or destruction happening here apart from via the call to CreateDisplay().
- Execution continues until we return back into the body of the DefaultEglDisplay() method. The local lib variable now contains a usable GLLibraryEGL, off of which the DefaultDisplay() method is called.
- DefaultDisplay() then goes ahead and calls CreateDisplay() all over again in order to set up the default EGL display. This is the second problem: the display has already been created by this point; we shouldn't be doing it again.
(gdb) b EglDisplay::~EglDisplay (gdb) b GetAndInitDisplay (gdb) r [...] Thread 38 "Compositor" hit Breakpoint 4, mozilla::gl:: GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, display=display@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149 149 ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, display=display@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149 #1 0x0000007ff111c9c8 in mozilla::gl::GLLibraryEGL::CreateDisplay ( this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:813 #2 0x0000007ff111cdb0 in mozilla::gl::GLLibraryEGL::Init ( this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:504 #3 0x0000007ff111d5f8 in mozilla::gl::GLLibraryEGL::Create ( out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:345 #4 0x0000007ff111d750 in mozilla::gl::DefaultEglLibrary ( out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1331 [...] #29 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb) c Continuing. Thread 38 "Compositor" hit Breakpoint 3, mozilla::gl::EglDisplay:: ~EglDisplay (this=0x7ed8003550, __in_chrg=<optimized out>) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:733 733 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) bt #0 mozilla::gl::EglDisplay::~EglDisplay (this=0x7ed8003550, __in_chrg=<optimized out>) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:733 #1 0x0000007ff111b52c in __gnu_cxx::new_allocator<mozilla::gl::EglDisplay>:: destroy<mozilla::gl::EglDisplay> (__p=<optimized out>, this=<optimized out>) at include/c++/8.3.0/ext/new_allocator.h:140 #2 std::allocator_traits<std::allocator<mozilla::gl::EglDisplay> >:: destroy<mozilla::gl::EglDisplay> (__p=<optimized out>, __a=...) at include/c++/8.3.0/bits/alloc_traits.h:487 #3 std::_Sp_counted_ptr_inplace<mozilla::gl::EglDisplay, std:: allocator<mozilla::gl::EglDisplay>, (__gnu_cxx::_Lock_policy)2>::_M_dispose ( this=<optimized out>) at include/c++/8.3.0/bits/shared_ptr_base.h:554 #4 0x0000007ff1109f34 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>:: _M_release (this=0x7ed8003540) at include/c++/8.3.0/ext/atomicity.h:69 #5 0x0000007ff111d380 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>:: ~__shared_count (this=0x7fd40ea790, __in_chrg=<optimized out>) at include/c++/8.3.0/bits/shared_ptr_base.h:1167 #6 std::__shared_ptr<mozilla::gl::EglDisplay, (__gnu_cxx::_Lock_policy)2>:: ~__shared_ptr (this=0x7fd40ea788, __in_chrg=<optimized out>) at include/c++/8.3.0/bits/shared_ptr_base.h:1167 #7 std::shared_ptr<mozilla::gl::EglDisplay>::~shared_ptr (this=0x7fd40ea788, __in_chrg=<optimized out>) at include/c++/8.3.0/bits/shared_ptr.h:103 #8 mozilla::gl::GLLibraryEGL::Init (this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp:504 #9 0x0000007ff111d5f8 in mozilla::gl::GLLibraryEGL::Create ( out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:345 #10 0x0000007ff111d750 in mozilla::gl::DefaultEglLibrary ( out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1331 [...] #35 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb) c Continuing. Thread 38 "Compositor" hit Breakpoint 4, mozilla::gl:: GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, display=display@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149 149 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) bt #0 mozilla::gl::GetAndInitDisplay (egl=..., displayType=displayType@entry=0x0, display=display@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:149 #1 0x0000007ff111c9c8 in mozilla::gl::GLLibraryEGL::CreateDisplay ( this=this@entry=0x7ed80031c0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7fd40eb1c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:813 #2 0x0000007ff111d918 in mozilla::gl::GLLibraryEGL::DefaultDisplay ( this=0x7ed80031c0, out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:745 #3 0x0000007ff112efc0 in mozilla::gl::DefaultEglDisplay ( out_failureId=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextEGL.h:33 #4 mozilla::gl::GLContextProviderEGL::CreateHeadless (desc=..., out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1246 #5 0x0000007ff112f89c in mozilla::gl::GLContextProviderEGL::CreateOffscreen ( size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags::REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7fd40eb1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1288 #6 0x0000007ff11982bc in mozilla::layers::CompositorOGL::CreateContext ( this=this@entry=0x7ed8002ed0) at ${PROJECT}/gecko-dev/gfx/layers/opengl/CompositorOGL.cpp:254 [...] #27 0x0000007ff6a0289c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/ clone.S:78 (gdb) c [...]Assuming that all of the above is correct, on the face of it there seems to be a simple solution, which is to remove the code I added to GLLibraryEGL::Init() which calls CreateDisplay(). An important consequence of this is that mDefaultDisplay will then no longer be set. We can't just set this inside DefaultEglDisplay() because that's in the wrong context. But it's already being set by GLLibraryEGL::DefaultDisplay(), which is in the right context (it's part of the GLLibraryEGL) and should be enough already.
I'm about to make this change when something else peculiar jumps out at me. The call to GLLibraryEGL::DefaultDisplay() checks whether the value of mDefaultDisplay is valid and only calls CreateDisplay() to create a new one if it's not. That's strange, since the value should already have been set in GLLibraryEGL::Init(). The exception is if CreateDisplay() returns null. It is possible that this is what's causing this issue. Let's check:
(gdb) b GLLibraryEGL.cpp:504 Breakpoint 5 at 0x7ff111cd98: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 504. (gdb) r [...] Thread 37 "Compositor" hit Breakpoint 6, mozilla::gl::GLLibraryEGL:: Init (this=this@entry=0x7ee00031c0, forceAccel=forceAccel@entry=false, out_failureId=out_failureId@entry=0x7f1f7bf1c8, aDisplay=aDisplay@entry=0x0) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:504 504 ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp: No such file or directory. (gdb) p defaultDisplay $14 = std::shared_ptr<mozilla::gl::EglDisplay> (use count -134204000, weak count 125) = {get() = 0x0} (gdb) p mDefaultDisplay $15 = std::weak_ptr<mozilla::gl::EglDisplay> (empty) = {get() = 0x0} (gdb) n 505 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) p defaultDisplay $16 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 1) = {get() = 0x7ee0003570} (gdb) p mDefaultDisplay $17 = std::weak_ptr<mozilla::gl::EglDisplay> (empty) = {get() = 0x0} (gdb) n 508 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) n 510 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) p mDefaultDisplay $18 = std::weak_ptr<mozilla::gl::EglDisplay> (use count 1, weak count 2) = {get( ) = 0x7ee0003570} (gdb) p this $19 = (mozilla::gl::GLLibraryEGL * const) 0x7ee00031c0As we can see that's not what's happening. The actual reason is that when it comes to the call to GLLibraryEGL::DefaultDisplay() the pointer has expired and hence it's created all over again.
(gdb) b GLLibraryEGL::DefaultDisplay Breakpoint 7 at 0x7ff111d7f4: file ${PROJECT}/gecko-dev/gfx/gl/ GLLibraryEGL.cpp, line 741. (gdb) c Continuing. Thread 37 "Compositor" hit Breakpoint 7, mozilla::gl::GLLibraryEGL:: DefaultDisplay (this=0x7ee00031c0, out_failureId=out_failureId@entry=0x7f1f7bf1c8) at ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp:741 741 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) n 742 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb) p this $20 = (mozilla::gl::GLLibraryEGL * const) 0x7ee00031c0 (gdb) p mDefaultDisplay $21 = std::weak_ptr<mozilla::gl::EglDisplay> (expired, weak count 1) = {get() = 0x7ee0003570} (gdb) p ret $22 = std::shared_ptr<mozilla::gl::EglDisplay> (use count 1634882661, weak count 25954) = {get() = 0x7f1f7bf160} (gdb) n 745 in ${PROJECT}/gecko-dev/gfx/gl/GLLibraryEGL.cpp (gdb)The flow all makes sense now and the solution to all of these issues is the same: remove the code I added to Init(). As long as nothing is attempting to access the mDefaultDisplay before GLLibraryEGL::DefaultDisplay() is called, it should all work out fine.
I've deleted the five lines. I've set the build running.
While that chugs away, let's briefly look at the other potentially relevant issue that valgrind threw up. This issue relates to APZCTreeManager::PrintAPZCInfo(). Although that doesn't sound especially relevant, the code is from the Compositor thread, so may be worth looking in to just in case.
==29044== Thread 32 Compositor: ==29044== Mismatched free() / delete / delete [] ==29044== at 0x484E858: operator delete(void*) (vg_replace_malloc.c:923) ==29044== by 0x5CEFFFF: std::__cxx11::basic_stringstream<char, std:: char_traits<char>, std::allocator<char> >::~basic_stringstream() (in /usr/ lib64/libstdc++.so.6.0.25) ==29044== by 0x767C23F: void mozilla::layers::APZCTreeManager:: PrintAPZCInfo<mozilla::layers::LayerMetricsWrapper>(mozilla::layers:: LayerMetricsWrapper const&, mozilla::layers::AsyncPanZoomController const*) (APZCTreeManager.cpp:1014) ==29044== by 0x768DC63: mozilla::layers::HitTestingTreeNode* mozilla::layers: :APZCTreeManager::PrepareNodeForLayer<mozilla::layers::LayerMetricsWrapper>( mozilla::RecursiveMutexAutoLock const&, mozilla::layers:: LayerMetricsWrapper const&, mozilla::layers::FrameMetrics const&, mozilla:: layers::LayersId, mozilla::Maybe<mozilla::layers::ZoomConstraints> const&, mozilla::layers::AncestorTransform const&, mozilla::layers:: HitTestingTreeNode*, mozilla::layers::HitTestingTreeNode*, mozilla::layers:: APZCTreeManager::TreeBuildingState&) (APZCTreeManager.cpp:1323) ==29044== by 0x768E55F: mozilla::layers::APZCTreeManager:: UpdateHitTestingTreeImpl<mozilla::layers::LayerMetricsWrapper>(mozilla:: layers::LayerMetricsWrapper const&, bool, mozilla::layers::LayersId, unsigned int)::{lambda(mozilla::layers::LayerMetricsWrapper)#2}::operator()( mozilla::layers::LayerMetricsWrapper) const (APZCTreeManager.cpp:481) [...] ==29044== Address 0x18c00050 is 0 bytes inside a block of size 513 alloc'd ==29044== at 0x484B7C0: malloc (vg_replace_malloc.c:381) ==29044== by 0x48F61EB: moz_xmalloc (in /usr/lib64/ libqt5embedwidget.so.1.53.9) ==29044== by 0x48F644B: std::__cxx11::basic_string<char, std:: char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long) (in /usr/lib64/libqt5embedwidget.so.1.53.9) ==29044== by 0x5CFC2E7: std::__cxx11::basic_string<char, std:: char_traits<char>, std::allocator<char> >::reserve(unsigned long) (in /usr/ lib64/libstdc++.so.6.0.25) ==29044== by 0x5CF14D7: std::__cxx11::basic_stringbuf<char, std:: char_traits<char>, std::allocator<char> >::overflow(int) (in /usr/lib64/ libstdc++.so.6.0.25) ==29044== by 0x5CFA7B7: std::basic_streambuf<char, std::char_traits<char> >:: xsputn(char const*, long) (in /usr/lib64/libstdc++.so.6.0.25) ==29044== by 0x5CEC1AB: std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std:: basic_ostream<char, std::char_traits<char> >&, char const*, long) (in /usr/ lib64/libstdc++.so.6.0.25) ==29044== by 0x765AC0B: operator<< <std::char_traits<char> > (ostream:561) ==29044== by 0x765AC0B: mozilla::layers::operator<<(std::ostream&, mozilla:: layers::ScrollableLayerGuid const&) (ScrollableLayerGuid.cpp:52) ==29044== by 0x767BE3F: void mozilla::layers::APZCTreeManager:: PrintAPZCInfo<mozilla::layers::LayerMetricsWrapper>(mozilla::layers:: LayerMetricsWrapper const&, mozilla::layers::AsyncPanZoomController const*) (APZCTreeManager.cpp:1015) ==29044== by 0x768DC63: mozilla::layers::HitTestingTreeNode* mozilla::layers: :APZCTreeManager::PrepareNodeForLayer<mozilla::layers::LayerMetricsWrapper>( mozilla::RecursiveMutexAutoLock const&, mozilla::layers:: LayerMetricsWrapper const&, mozilla::layers::FrameMetrics const&, mozilla:: layers::LayersId, mozilla::Maybe<mozilla::layers::ZoomConstraints> const&, mozilla::layers::AncestorTransform const&, mozilla::layers:: HitTestingTreeNode*, mozilla::layers::HitTestingTreeNode*, mozilla::layers:: APZCTreeManager::TreeBuildingState&) (APZCTreeManager.cpp:1323) [...]The definition of the APZCTreeManager::PrintAPZCInfo() method that's referred to in the output above looks like this:
template <class ScrollNode> void APZCTreeManager::PrintAPZCInfo(const ScrollNode& aLayer, const AsyncPanZoomController* apzc) { const FrameMetrics& metrics = aLayer.Metrics(); std::stringstream guidStr; guidStr << apzc->GetGuid(); mApzcTreeLog << "APZC " << guidStr.str() << "\tcb=" << metrics.GetCompositionBounds() << "\tsr=" << metrics.GetScrollableRect() << (metrics.IsScrollInfoLayer() ? "\tscrollinfo" : "") << (apzc->HasScrollgrab() ? "\tscrollgrab" : "") << "\t" << aLayer.Metadata().GetContentDescription().get(); }There may be something to do here, but this is so far removed from the rendering pipeline that I don't see how it can be related to the issues I'm trying to fix. So I'm going to ignore this one and potentially come back to it later. I'm thinking there's some work to be done cleaning up the valgrind output and if I do eventually do so, this would fall under that.
It's been a long one today and I apologise for that. Sometimes it takes a lot to unravel the code and structure my thoughts around it. The key question now is whether the small change I've made will make any difference to the rendering. We'll find that out tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
18 Mar 2024 : Day 189 #
The memory graphs we looked at yesterday told a different story to the one we were expecting. The memory usage didn't shoot through the roof, instead stay stable at between 200-250 MB. So if memory allocation isn't the problem, it does beg the question of what is. It could be that it's graphics related, that is, to do with the way GL and EGL textures are being created and destroyed.
Despite this I still think it may be worthwhile running the harbour-webview app through valgrind. Valgrind is a debugging tool that, amongst other tricks, will tell you the state of the memory when the application is exited. In case some portion of the allocated memory isn't freed up at the end, it will highlight the fact. It'll also tell you all sorts of other useful memory information. But crucially, if memory isn't being freed, then that's typically a sign of a memory leak.
In the case of EmbedLite the memory management is pretty flaky, making it hard to disentangle what's normal from what's exceptional. I've nevertheless collected valgrind logs from both harbour-webview and sailfish-browser. I did this on both ESR 78 and ESR 91 in the hope that performing some kind of comparison might be useful. First on my phone running ESR 78:
There are multiple mismatched frees related to QMozViewPrivate::createView(), android_dlopen(), do_dlopen(), QCoreApplicationPrivate::sendPostedEvents(), QMozSecurity, QSGNode::destroy(), QMozContext and others. These all look unrelated to the parts of the code I'm looking at, plus they also appear for both ESR 78 and ESR 91. So I'm going to leave those to one side for now. It may be that there's some useful work to be done cleaning all of these up, but I don't believe they relate to the rendering issues I'm experiencing.
There are also a collection of memory bugs that happen in the Compositor thread. These are far more likely to relate to the changes I've been looking at. There are some that reference CompositorOGL::GetShaderProgramFor(), but which appear for both ESR 78 and ESR 91. Given they're in both, I think I'll ignore those as well.
There are four further errors which I can see appear in ESR 91 but not ESR 78. These are the ones I want to focus on. The first one doesn't seem to be directly associated with the gecko code. It references EGL, so it's possible it's related, but without something more concrete I'm not sure what to do with it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
Despite this I still think it may be worthwhile running the harbour-webview app through valgrind. Valgrind is a debugging tool that, amongst other tricks, will tell you the state of the memory when the application is exited. In case some portion of the allocated memory isn't freed up at the end, it will highlight the fact. It'll also tell you all sorts of other useful memory information. But crucially, if memory isn't being freed, then that's typically a sign of a memory leak.
In the case of EmbedLite the memory management is pretty flaky, making it hard to disentangle what's normal from what's exceptional. I've nevertheless collected valgrind logs from both harbour-webview and sailfish-browser. I did this on both ESR 78 and ESR 91 in the hope that performing some kind of comparison might be useful. First on my phone running ESR 78:
$ valgrind --log-file=valgrind-harbour-webview-esr78-01.txt harbour-webview $ valgrind --log-file=valgrind-sailfish-browser-esr78-01.txt sailfish-browserAnd then following that on a different phone running ESR 91:
$ valgrind --log-file=valgrind-harbour-webview-esr91-01.txt harbour-webview $ valgrind --log-file=valgrind-sailfish-browser-esr91-01.txt sailfish-browserThe task now is to manually compare the two. The ESR 91 valgrind output shows a lot of errors:
==29044== ERROR SUMMARY: 7057 errors from 159 contexts (suppressed: 0 from 0)That's more than the number of errors in the ESR 78 output, but it's not an order of magnitude different:
==17128== ERROR SUMMARY: 5737 errors from 155 contexts (suppressed: 0 from 0)Skimming through, there are a lot of mismatched frees shown for QtWaylandClient; in fact this seems to be the majority of them. However these appear for both ESR 78 and ESR 91. I don't recommend it, but if you'd like to take a look yourself I've uploaded both the ESR 78 valgrind output and the ESR 91 valdgrind output and you're very welcome to take a look yourself. You might spot something I missed.
There are multiple mismatched frees related to QMozViewPrivate::createView(), android_dlopen(), do_dlopen(), QCoreApplicationPrivate::sendPostedEvents(), QMozSecurity, QSGNode::destroy(), QMozContext and others. These all look unrelated to the parts of the code I'm looking at, plus they also appear for both ESR 78 and ESR 91. So I'm going to leave those to one side for now. It may be that there's some useful work to be done cleaning all of these up, but I don't believe they relate to the rendering issues I'm experiencing.
There are also a collection of memory bugs that happen in the Compositor thread. These are far more likely to relate to the changes I've been looking at. There are some that reference CompositorOGL::GetShaderProgramFor(), but which appear for both ESR 78 and ESR 91. Given they're in both, I think I'll ignore those as well.
There are four further errors which I can see appear in ESR 91 but not ESR 78. These are the ones I want to focus on. The first one doesn't seem to be directly associated with the gecko code. It references EGL, so it's possible it's related, but without something more concrete I'm not sure what to do with it.
==29044== Mismatched free() / delete / delete [] ==29044== at 0x484E2B8: free (vg_replace_malloc.c:872) ==29044== by 0x15CE6943: ??? (in /usr/libexec/droid-hybris/system/lib64/ libEGL.so) ==29044== by 0x15CE590B: ??? (in /usr/libexec/droid-hybris/system/lib64/ libEGL.so) ==29044== by 0x15CE523F: ??? (in /usr/libexec/droid-hybris/system/lib64/ libEGL.so) ==29044== by 0x15CDC4BF: ??? (in /usr/libexec/droid-hybris/system/lib64/ libEGL.so) ==29044== by 0x15CDC7A3: eglGetDisplay (in /usr/libexec/droid-hybris/system/ lib64/libEGL.so) ==29044== by 0xCCF63AF: ??? (in /usr/lib64/libEGL.so.1.0.0) ==29044== by 0x14960F73: ??? (in /usr/lib64/qt5/plugins/ wayland-graphics-integration-client/libwayland-egl.so) ==29044== by 0x14778947: QtWaylandClient::QWaylandIntegration:: initializeClientBufferIntegration() (in /usr/lib64/ libQt5WaylandClient.so.5.6.3) ==29044== by 0x14778C6F: QtWaylandClient::QWaylandIntegration:: clientBufferIntegration() const (in /usr/lib64/libQt5WaylandClient.so.5.6.3) ==29044== by 0x147784F3: QtWaylandClient::QWaylandIntegration::hasCapability( QPlatformIntegration::Capability) const (in /usr/lib64/ libQt5WaylandClient.so.5.6.3) ==29044== by 0x4B3EA1F: QSGRenderLoop::instance() (in /usr/lib64/ libQt5Quick.so.5.6.3) ==29044== Address 0x14bd40c0 is 0 bytes inside a block of size 48 alloc'd ==29044== at 0x484BD54: operator new(unsigned long) (vg_replace_malloc.c:420) ==29044== by 0x15A5C167: ??? (in /usr/libexec/droid-hybris/system/lib64/ libc++.so)Similarly, although the following is apparently happening in the Compositor thread, I don't see a way to tie it to the gecko code.
==29044== Mismatched free() / delete / delete [] ==29044== at 0x484E2B8: free (vg_replace_malloc.c:872) ==29044== by 0x174836EF: ??? (in /odm/lib64/libllvm-glnext.so) ==29044== Address 0x2156db90 is 0 bytes inside a block of size 48 alloc'd ==29044== at 0x484BD54: operator new(unsigned long) (vg_replace_malloc.c:420) ==29044== by 0x163E9167: std::__1::basic_string<char, std::__1:: char_traits<char>, std::__1::allocator<char> >::__grow_by_and_replace( unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, char const*) (in /apex/com.android.vndk.v30/lib64/libc++.so) ==29044== by 0x163E9273: std::__1::basic_string<char, std::__1:: char_traits<char>, std::__1::allocator<char> >::append(char const*) (in / apex/com.android.vndk.v30/lib64/libc++.so) ==29044== by 0x17483623: ??? (in /odm/lib64/libllvm-glnext.so)That leaves two remaining issues, both occurring in the Compositor thread. These look like the most promising avenues to look into, but I'm not going to look at them today; they'll need a bit more time than I have right now. I'll give full details and start working through them tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
17 Mar 2024 : Day 188 #
I was still trying to find memory leaks yesterday, although I came to the conclusion that it may not be a memory leak causing the problem after all. I need to continue investigating to try to get a clearer picture, there are a couple of obvious things to try.
One approach worth trying is running the app through valgrind. I'm not super-hopeful that this will yield helpful results because gecko struggles to maintain a healthy memory map at the best of times. As a result valgrind is likely to generate a huge number of results, most of which are just the consequence of "normal" (for some definition of the word) EmbedLite behaviour and so unrelated to the problem I'm trying to fix.
So before trying out valgrind I'm going to try something simpler. That is, I thought I'd just measure the memory usage and see whether it's growing out of control, or remaining stable.
There are many ways to check memory usage: tops might be one way. But as I'm writing this up as a diary entry I thought it would be better to generate some graphs. This will also give a more concrete idea of what's happening over time.
There's also no shortage of tools for generating memory usage graphs. I've plumped for psrecord, a small Python tool, since it's easy to install and use and generates nice clean graphs of memory usage against time.
The fact it's so easy to install and run a random Python utility is one of the things I really love about Sailfish OS. All I have to do is drop to a shell (I usually have an SSH session running already anyway), create a virtual environment and use pip to install it:
What do we find from these? With both ESR 78 and ESR 91 the browser increases memory to around 250-300 MB. It's nice to see that the memory footprint on ESR 91 is no higher than for ESR 78, in fact if anything it's lower. ESR 78 seems to accumulate memory until the app is shut down, whereas ESR 91 is more consistent.
We see a similar pattern for the WebView. ESR 78 quickly rises to 300 MB of memory usage before jumping up to 350 MB when I start scrolling the page. On ESR 91 the memory rapidly climbs to 200 MB where it stays pretty consistently throughout. Scrolling does cause some memory jitter, but not as much as on ESR 78.
This is all both encouraging and discouraging. It's encouraging to see ESR 91 isn't more of a memory hog than ESR 78. If anything it seems to be leaner. It's discouraging because the lack of excessive memory usage on ESR 91 suggests I may be looking in the wrong place for the solution to the issue I'm trying to solve.
Is that discouraging? Maybe it's encouraging. I have more information than I had before, but in truth I don't feel closer to finding a solution.
I spent a surprising amount of time investigating different ways to collect memory usage data. On top of that generating the graphs also took a whole, given it involved using two different phones and two different apps on each. So I'm going to call it a day. I still want to run the apps through valgrind — maybe this will pick up something new — but I'm going to leave that task until tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
One approach worth trying is running the app through valgrind. I'm not super-hopeful that this will yield helpful results because gecko struggles to maintain a healthy memory map at the best of times. As a result valgrind is likely to generate a huge number of results, most of which are just the consequence of "normal" (for some definition of the word) EmbedLite behaviour and so unrelated to the problem I'm trying to fix.
So before trying out valgrind I'm going to try something simpler. That is, I thought I'd just measure the memory usage and see whether it's growing out of control, or remaining stable.
There are many ways to check memory usage: tops might be one way. But as I'm writing this up as a diary entry I thought it would be better to generate some graphs. This will also give a more concrete idea of what's happening over time.
There's also no shortage of tools for generating memory usage graphs. I've plumped for psrecord, a small Python tool, since it's easy to install and use and generates nice clean graphs of memory usage against time.
The fact it's so easy to install and run a random Python utility is one of the things I really love about Sailfish OS. All I have to do is drop to a shell (I usually have an SSH session running already anyway), create a virtual environment and use pip to install it:
$ python3 -m venv venv $ . ./venv/bin/activate $ pip install --upgrade pip $ pip install psrecord matplotlibBeautiful! I'm sure there's a nice way to use Python on Android (using Termux?) and iOS (Pythonista?) but I'm not sure I'd be able to install and use psrecord quite so easily.
$ harbour-webview & psrecord --plot mem-harbour-webview-esr91.png \ --interval 0.2 "$!" $ sailfish-browser & psrecord --plot mem-sailfish-browser-esr91.png \ --interval 0.2 "$!"Then, when I'm done, I can deactivate and delete the virtual environment to restore my phone to its previous state.
$ deactivate $ rm -rf venvThe resulting graphs are pretty clear, giving both memory usage in MB and CPU usage as a percentage of the total available. We're really only interested in memory usage though. For context, these graphs were collected by running each of the apps for 20 seconds. After 10 seconds I started scrolling the page up and down with my finger, since the problem doesn't seem to manifest itself when the display is static.
What do we find from these? With both ESR 78 and ESR 91 the browser increases memory to around 250-300 MB. It's nice to see that the memory footprint on ESR 91 is no higher than for ESR 78, in fact if anything it's lower. ESR 78 seems to accumulate memory until the app is shut down, whereas ESR 91 is more consistent.
We see a similar pattern for the WebView. ESR 78 quickly rises to 300 MB of memory usage before jumping up to 350 MB when I start scrolling the page. On ESR 91 the memory rapidly climbs to 200 MB where it stays pretty consistently throughout. Scrolling does cause some memory jitter, but not as much as on ESR 78.
This is all both encouraging and discouraging. It's encouraging to see ESR 91 isn't more of a memory hog than ESR 78. If anything it seems to be leaner. It's discouraging because the lack of excessive memory usage on ESR 91 suggests I may be looking in the wrong place for the solution to the issue I'm trying to solve.
Is that discouraging? Maybe it's encouraging. I have more information than I had before, but in truth I don't feel closer to finding a solution.
I spent a surprising amount of time investigating different ways to collect memory usage data. On top of that generating the graphs also took a whole, given it involved using two different phones and two different apps on each. So I'm going to call it a day. I still want to run the apps through valgrind — maybe this will pick up something new — but I'm going to leave that task until tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
16 Mar 2024 : Day 187 #
Yesterday I was struggling. And getting this WebView render pipeline working is starting to feel drawn out. I don't want to make any claims about the larger task, but I'm hoping that a good night's sleep and some time to ponder on the best approach in the shower this morning will have helped me focus on the task I'm struggling with.
And this task is figuring out what — if anything &mash; is different between the execution of TextureClient::Destroy() on ESR 78 compared to ESR 91. I'm still labouring under the hypothesis that it's this method that's causing the (hypothetical) memory leak that's causing ESR 91 execution to seize up over time.
The difficulties I experienced yesterday were twofold. First on ESR 78 the actual section of code that reclaims the allocated memory appeared never to be executed. Second applying the debugger to ESR 91 gave peculiar results: what appeared to be an infinite loop of calls to Destroy() that never allowed me to step in to the method.
To tackle these difficulties I'm going to try two things. First I need to stick a breakpoint inside the conditional code that reclaims the memory on ESR 78, to establish whether it ever gets called. Second I'm going to annotate the ESR 91 code with debug prints. That should allow me to get a better idea of the true execution flow. If the debugger isn't playing by the rules, I'll take my ball somewhere else.
So, first up, checking the ESR 78 flow. The structure of the Destroy() method looks like this on ESR 78:
The reason I've never seen it enter this condition is because data (which is derived from the mData class variable) and actor (which is derived from the mActor class variable) have always been null when entering this method.
Let's do this then.
Let's now try the same thing on ESR 91.
To return to our original questions, I think this has answered both of them. First we can see that on ESR 78 this is definitely a place where memory is actually being freed. But on the other hand we also see it being freed on ESR 91. That doesn't mean that there isn't a problem here, but it does make it less likely.
Nevertheless there has been a change to this code. The mWorkaroundAnnoyingSharedSurfaceLifetimeIssues flag was removed by upstream. It's possible that this is causing the issue we're experiencing, so I'm going to reverse this and reinsert the removed code. I'm not really expecting this to fix things, but having travelled out into the sticks I now need to check under every stone. I've no choice but to figure this thing out if the WebView is going to get back up and running again.
[...]
Having worked carefully through the code and reintroduced the mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable and its associated logic, it's disappointing to find it's not fixed the issue. I'm not out of ideas yet though. Tomorrow I'm going to have a go at profiling the application and using specific memory tools (e.g. valgrind) to try to figure out which memory is being allocated but not deallocated. I'm not sure I hold out much hope of success using valgrind given that gecko is so big and messy and suffering from leakage as it is, but you never know.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
And this task is figuring out what — if anything &mash; is different between the execution of TextureClient::Destroy() on ESR 78 compared to ESR 91. I'm still labouring under the hypothesis that it's this method that's causing the (hypothetical) memory leak that's causing ESR 91 execution to seize up over time.
The difficulties I experienced yesterday were twofold. First on ESR 78 the actual section of code that reclaims the allocated memory appeared never to be executed. Second applying the debugger to ESR 91 gave peculiar results: what appeared to be an infinite loop of calls to Destroy() that never allowed me to step in to the method.
To tackle these difficulties I'm going to try two things. First I need to stick a breakpoint inside the conditional code that reclaims the memory on ESR 78, to establish whether it ever gets called. Second I'm going to annotate the ESR 91 code with debug prints. That should allow me to get a better idea of the true execution flow. If the debugger isn't playing by the rules, I'll take my ball somewhere else.
So, first up, checking the ESR 78 flow. The structure of the Destroy() method looks like this on ESR 78:
void TextureClient::Destroy() { [...] RefPtr<TextureChild> actor = mActor; [...] TextureData* data = mData; if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { mData = nullptr; } if (data || actor) { [...] DeallocateTextureClient(params); } }I'm interested in whether it ever goes inside the condition at the end in order to ultimately call DeallocateTextureClient(). If it doesn't — or if this only happens occasionally, say on shutdown — then I'm likely to be looking in the wrong place.
The reason I've never seen it enter this condition is because data (which is derived from the mData class variable) and actor (which is derived from the mActor class variable) have always been null when entering this method.
Let's do this then.
(gdb) break TextureClient.cpp:583 Breakpoint 5 at 0x7fb8e9b2fc: file gfx/layers/client/TextureClient.cpp, line 585. (gdb) r [...] Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers:: TextureClient::Destroy (this=this@entry=0x7f8ea53bb0) at gfx/layers/client/TextureClient.cpp:585 585 params.allocator = mAllocator; (gdb) c Continuing. [LWP 16174 exited] Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers:: TextureClient::Destroy (this=this@entry=0x7f8effa780) at gfx/layers/client/TextureClient.cpp:585 585 params.allocator = mAllocator; (gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues $21 = false (gdb) p data $22 = (mozilla::layers::TextureData *) 0x7f8dac3a70 (gdb) p actor $23 = <optimized out> (gdb) b DeallocateTextureClient Breakpoint 6 at 0x7fb8e9a908: file gfx/layers/client/TextureClient.cpp, line 490. (gdb) c Continuing. Thread 7 "GeckoWorkerThre" hit Breakpoint 6, mozilla::layers:: DeallocateTextureClient (params=...) at gfx/layers/client/TextureClient.cpp:490 490 void DeallocateTextureClient(TextureDeallocParams params) { (gdb) p params $24 = {data = 0x7f8dac3a70, actor = {mRawPtr = 0x7f8d648720}, allocator = {mRawPtr = 0x7f8cb40ff8}, clientDeallocation = false, syncDeallocation = false, workAroundSharedSurfaceOwnershipIssue = false} (gdb) n 491 if (!params.actor && !params.data) { (gdb) n 324 obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory. (gdb) n 499 if (params.allocator) { (gdb) n 500 ipdlThread = params.allocator->GetThread(); (gdb) n 501 if (!ipdlThread) { (gdb) n 510 if (ipdlThread && !ipdlThread->IsOnCurrentThread()) { (gdb) n 532 if (!ipdlThread) { (gdb) n 540 if (!actor) { (gdb) n 555 actor->Destroy(params); (gdb) n 497 nsCOMPtr<nsISerialEventTarget> ipdlThread; (gdb) n 505 return; (gdb) c Continuing. [...]So that clears things up: it does go inside the condition and it does deallocate the actor. But the data and actor values are non-null and ultimately in this case because we're executing on the IPDL thread, the actor is destroyed directly.
Let's now try the same thing on ESR 91.
(gdb) break TextureClient.cpp:574 Breakpoint 4 at 0x7ff1148ddc: TextureClient.cpp:574. (2 locations) (gdb) r [...] Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers:: TextureClient::Destroy (this=this@entry=0x7fc5612680) at gfx/layers/client/TextureClient.cpp:587 587 if (actor) { (gdb) c Continuing. Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers:: TextureClient::Destroy (this=this@entry=0x7fc55b8690) at gfx/layers/client/TextureClient.cpp:587 587 if (actor) { (gdb) c Continuing. Thread 7 "GeckoWorkerThre" hit Breakpoint 4, mozilla::layers:: TextureClient::Destroy (this=this@entry=0x7fc55b6c10) at gfx/layers/client/TextureClient.cpp:587 587 if (actor) { (gdb) p data $12 = (mozilla::layers::TextureData *) 0x7fc4d79e80 (gdb) p actor $13 = <optimized out> (gdb) b DeallocateTextureClient Breakpoint 5 at 0x7ff1148394: file gfx/layers/client/TextureClient.cpp, line 489. (gdb) c Continuing. Thread 7 "GeckoWorkerThre" hit Breakpoint 5, mozilla::layers:: DeallocateTextureClient (params=...) at gfx/layers/client/TextureClient.cpp:489 489 void DeallocateTextureClient(TextureDeallocParams params) { (gdb) p params $14 = {data = 0x7fc4d79e80, actor = {mRawPtr = 0x7fc59e6620}, allocator = {mRawPtr = 0x7fc46684e0}, clientDeallocation = false, syncDeallocation = false} (gdb) n 490 if (!params.actor && !params.data) { (gdb) n 496 nsCOMPtr<nsISerialEventTarget> ipdlThread; (gdb) n [New LWP 5954] 498 if (params.allocator) { (gdb) n 499 ipdlThread = params.allocator->GetThread(); (gdb) n [LWP 5954 exited] [New LWP 6110] 500 if (!ipdlThread) { (gdb) n 509 if (ipdlThread && !ipdlThread->IsOnCurrentThread()) { (gdb) n [LWP 6110 exited] 867 ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h: No such file or directory. (gdb) n 539 if (!actor) { (gdb) n 548 actor->Destroy(params); (gdb) n 496 nsCOMPtr<nsISerialEventTarget> ipdlThread; (gdb) n mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc55b6c10) at gfx/layers/client/TextureClient.cpp:591 591 DeallocateTextureClient(params); (gdb) c Continuing. [...]Here we see something similar. The inner condition is entered and the DeallocateTextureClient() method is ultimately called on the same thread. The data and actor values are both non-null.
To return to our original questions, I think this has answered both of them. First we can see that on ESR 78 this is definitely a place where memory is actually being freed. But on the other hand we also see it being freed on ESR 91. That doesn't mean that there isn't a problem here, but it does make it less likely.
Nevertheless there has been a change to this code. The mWorkaroundAnnoyingSharedSurfaceLifetimeIssues flag was removed by upstream. It's possible that this is causing the issue we're experiencing, so I'm going to reverse this and reinsert the removed code. I'm not really expecting this to fix things, but having travelled out into the sticks I now need to check under every stone. I've no choice but to figure this thing out if the WebView is going to get back up and running again.
[...]
Having worked carefully through the code and reintroduced the mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable and its associated logic, it's disappointing to find it's not fixed the issue. I'm not out of ideas yet though. Tomorrow I'm going to have a go at profiling the application and using specific memory tools (e.g. valgrind) to try to figure out which memory is being allocated but not deallocated. I'm not sure I hold out much hope of success using valgrind given that gecko is so big and messy and suffering from leakage as it is, but you never know.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
15 Mar 2024 : Day 186 #
Today I'm still searching for memory leaks and one memory leak in particular: after fixing the SurfaceFactory_EGLImage::Create() which, as the name implies, creates EGL surface textures, there's now a massive memory leak that grinds not just the browser but the entire phone to a halt.
I'm hypothesising that there's something created by the method that should get freed. The texture itself is the most likely culprit, since even just a few 1080 by 2520 textures are going to consume a lot of memory. Generate a few of those each frame without freeing them and bad things will happen.
We're definitely seeing bad things happen, so this is my guess. However yesterday I checked a few of the associated destructors and they seem to be getting called.
So today I'm going to try to tackle it from a different angle. Rather than try to figure out what's not getting freed I'm going to try to find out what's causing the crash. I've started off by adjusting the size of the texture created as part of the render loop. Rather than a 1080 by 2520 texture, I've set it to always generate an 8 by 8 texture, by altering this code from GLContext.cpp (the commented part is the old code, replaced by the equivalent lines directly below):
While the updated code builds and gets transferred over to my phone I can continue looking through the code. Here's the backtrace from the SharedSurface_EGLImage::Create() method, the change to which has triggered the memory leak. It's possible, maybe even likely, that the code for reclaiming the resources will live somewhere in or close to the methods in this stack.
My new binary has copied over to my phone. The running app exhibits very similar symptoms as before: responsive at first but quickly seizes up and becoming unresponsive. I killed the process before it took down my phone, but a visual check suggests reducing the texture size didn't have any obvious benefit.
Back to looking through the code, I'm working through SharedSurfaceTextureClient::~SharedSurfaceTextureClient() and I notice this method called Destroy(). Although it's not immediately clear from the way the code is written, this call actually goes through to TextureClient::Destroy(). I recall making changes to this and scanning through my notes confirms it: it was back on Day 176. At the time I wrote this:
Could it be that the changes I made back then are causing the issues I'm experiencing now?
While I can step through the ESR 78 build, unfortunately all of changes integrated using partial builds have messed up the debug source for the ESR 91 build. Stepping through gives me just a slew of useless "TextureClientSharedSurface.cpp: No such file or directory." messages. So I've decided to kick off a full build. With a bit of luck it'll be completed before the end of the day and I'll be able to come back to this in the evening to compare the two executing flows by stepping through them.
[...]
It was just before 9:00 this morning that I set the build going. I could well imagine it running into the night and leaving me with no more time to get the benefit from it. But by 17:13 this evening the build had completed. That means there's still time to perform the comparison of TextureClient::Destroy() running on the two versions.
The results, however, are not what I was expecting. At least on ESR 78 things seem to act normally. The breakpoint is hit and it's possible to step through the code. It seems a little anomalous that both the data and actor variables are null, meaning that the actual dealocation step gets skipped:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
I'm hypothesising that there's something created by the method that should get freed. The texture itself is the most likely culprit, since even just a few 1080 by 2520 textures are going to consume a lot of memory. Generate a few of those each frame without freeing them and bad things will happen.
We're definitely seeing bad things happen, so this is my guess. However yesterday I checked a few of the associated destructors and they seem to be getting called.
So today I'm going to try to tackle it from a different angle. Rather than try to figure out what's not getting freed I'm going to try to find out what's causing the crash. I've started off by adjusting the size of the texture created as part of the render loop. Rather than a 1080 by 2520 texture, I've set it to always generate an 8 by 8 texture, by altering this code from GLContext.cpp (the commented part is the old code, replaced by the equivalent lines directly below):
GLuint CreateTexture(GLContext* aGL, GLenum aInternalFormat, GLenum aFormat, GLenum aType, const gfx::IntSize& aSize, bool linear) { [...] // aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, aSize.width, // aSize.height, 0, aFormat, aType, nullptr); aGL->fTexImage2D(LOCAL_GL_TEXTURE_2D, 0, aInternalFormat, 8, 8, 0, aFormat, aType, nullptr); return tex; }This won't prevent the memory leak, but if this is the texture that's not being freed, it should at least slow it down, which should be discernible in use.
While the updated code builds and gets transferred over to my phone I can continue looking through the code. Here's the backtrace from the SharedSurface_EGLImage::Create() method, the change to which has triggered the memory leak. It's possible, maybe even likely, that the code for reclaiming the resources will live somewhere in or close to the methods in this stack.
#0 SharedSurface_EGLImage::Create, SharedSurfaceEGL.cpp:58 #1 SurfaceFactory_EGLImage::CreateSharedImpl, WeakPtr.h:185 #2 SurfaceFactory::CreateShared, RefCounted.h:240 #3 SurfaceFactory::NewTexClient, SharedSurface.cpp:406 #4 GLScreenBuffer::Swap, UniquePtr.h:290 #5 GLScreenBuffer::PublishFrame, GLScreenBuffer.h:229 #6 EmbedLiteCompositorBridgeParent::PresentOffscreenSurface, EmbedLiteCompositorBridgeParent.cpp:191 #7 embedlite::nsWindow::PostRender, embedshared/nsWindow.cpp:248 #8 InProcessCompositorWidget::PostRender, InProcessCompositorWidget.cpp:60 #9 LayerManagerComposite::Render, Compositor.h:575 #10 LayerManagerComposite::UpdateAndRender, LayerManagerComposite.cpp:657 #11 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572I should take a look more closely at SharedSurfaceTextureClient::~SharedSurfaceTextureClient() which is — in theory — being called each time Swap() is called when the contents of the mBack reference-counted pointer is freed.
My new binary has copied over to my phone. The running app exhibits very similar symptoms as before: responsive at first but quickly seizes up and becoming unresponsive. I killed the process before it took down my phone, but a visual check suggests reducing the texture size didn't have any obvious benefit.
Back to looking through the code, I'm working through SharedSurfaceTextureClient::~SharedSurfaceTextureClient() and I notice this method called Destroy(). Although it's not immediately clear from the way the code is written, this call actually goes through to TextureClient::Destroy(). I recall making changes to this and scanning through my notes confirms it: it was back on Day 176. At the time I wrote this:
This mWorkaroundAnnoyingSharedSurfaceLifetimeIssues is used in ESR 78 to decide whether or not to deallocate the TextureData when the TextureClient is destroyed. This has been removed in ESR 91 and to be honest, I'm comfortable leaving it this way... Maybe this will cause problems later, but my guess is that this will show up with the debugger if that's the case, at which point we can refer back to the ESR 78 code to restore these checks.
Could it be that the changes I made back then are causing the issues I'm experiencing now?
While I can step through the ESR 78 build, unfortunately all of changes integrated using partial builds have messed up the debug source for the ESR 91 build. Stepping through gives me just a slew of useless "TextureClientSharedSurface.cpp: No such file or directory." messages. So I've decided to kick off a full build. With a bit of luck it'll be completed before the end of the day and I'll be able to come back to this in the evening to compare the two executing flows by stepping through them.
[...]
It was just before 9:00 this morning that I set the build going. I could well imagine it running into the night and leaving me with no more time to get the benefit from it. But by 17:13 this evening the build had completed. That means there's still time to perform the comparison of TextureClient::Destroy() running on the two versions.
The results, however, are not what I was expecting. At least on ESR 78 things seem to act normally. The breakpoint is hit and it's possible to step through the code. It seems a little anomalous that both the data and actor variables are null, meaning that the actual dealocation step gets skipped:
(gdb) b TextureClient::Destroy Breakpoint 4 at 0x7fb8e9b1b8: file gfx/layers/client/TextureClient.cpp, line 558. (gdb) c Continuing. [LWP 7428 exited] [Switching to LWP 7404] Thread 37 "Compositor" hit Breakpoint 4, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7ea0118ab0) at gfx/layers/client/TextureClient.cpp:558 558 void TextureClient::Destroy() { (gdb) n [LWP 7308 exited] [LWP 7374 exited] [LWP 7375 exited] 560 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) n 562 if (mActor && !mIsLocked) { (gdb) n 566 mBorrowedDrawTarget = nullptr; (gdb) n 567 mReadLock = nullptr; (gdb) n [New LWP 7538] 569 RefPtr<TextureChild> actor = mActor; (gdb) n 577 TextureData* data = mData; (gdb) n 578 if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { (gdb) n 582 if (data || actor) { (gdb) p mWorkaroundAnnoyingSharedSurfaceLifetimeIssues $11 = true (gdb) p data $12 = (mozilla::layers::TextureData *) 0x0 (gdb) p actor $13 = {mRawPtr = 0x0} (gdb)Here are the most relevant parts of the code associated with the above debug output to help follow what's happening:
void TextureClient::Destroy() { // Async paints should have been flushed by now. MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); if (mActor && !mIsLocked) { mActor->Lock(); } mBorrowedDrawTarget = nullptr; mReadLock = nullptr; RefPtr<TextureChild> actor = mActor; mActor = nullptr; if (actor && !actor->mDestroyed.compareExchange(false, true)) { actor->Unlock(); actor = nullptr; } TextureData* data = mData; if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { mData = nullptr; } if (data || actor) { [...] DeallocateTextureClient(params); } }Although the code does step through okay, on ESR 91 something happens which I can't explain. The debugger suggests execution is getting stuck in a loop calling TextureClient::Destroy():
(gdb) b TextureClient::Destroy Breakpoint 3 at 0x7ff1148c90: file layers/client/ TextureClient.cpp, line 551. (gdb) c Continuing. [LWP 18393 exited] [Switching to LWP 18284] Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc5975380) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb) n 553 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc595de20) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb) 553 MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); (gdb) Thread 7 "GeckoWorkerThre" hit Breakpoint 3, mozilla::layers::TextureClient:: Destroy (this=this@entry=0x7fc55d2280) at layers/client/TextureClient.cpp:551 551 void TextureClient::Destroy() { (gdb)The code is quite similar to the ESR 78 code, so I'm not sure why this might be happening. There's no obviously nested call to Destroy():
void TextureClient::Destroy() { // Async paints should have been flushed by now. MOZ_RELEASE_ASSERT(mPaintThreadRefs == 0); if (mActor && !mIsLocked) { mActor->Lock(); } mBorrowedDrawTarget = nullptr; mReadLock = nullptr; RefPtr<TextureChild> actor = mActor; mActor = nullptr; if (actor && !actor->mDestroyed.compareExchange(false, true)) { actor->Unlock(); actor = nullptr; } TextureData* data = mData; mData = nullptr; if (data || actor) { [...] DeallocateTextureClient(params); } }The backtrace is showing nested calls to TextureClient::~TextureClient(). But none of this explains the repeated hits of the TextureClient::Destroy() breakpoint.
#0 mozilla::layers::TextureClient::Destroy (this=this@entry=0x7fc6f7a3a0) at layers/client/TextureClient.cpp:551 #1 0x0000007ff1149144 in mozilla::layers::TextureClient::~TextureClient (this=0x7fc6f7a3a0, __in_chrg=<optimized out>) at layers/client/TextureClient.cpp:769 #2 0x0000007ff1149310 in mozilla::layers::TextureClient::~TextureClient (this=0x7fc6f7a3a0, __in_chrg=<optimized out>) at layers/client/TextureClient.cpp:764 #3 0x0000007ff110507c in mozilla::AtomicRefCountedWithFinalize <mozilla::layers::TextureClient>::Release (this=0x7fc6f7a3a8) at include/c++/8.3.0/bits/basic_ios.h:282 #4 0x0000007ff1268420 in mozilla::RefPtrTraits <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381 #5 RefPtr<mozilla::layers::TextureClient>::ConstRemovingRefPtrTraits <mozilla::layers::TextureClient>::Release (aPtr=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:381 [...]I was hoping to get this part resolved today, but the situation is confusing and my head has stopped focusing, so I'm going to have to continue tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
14 Mar 2024 : Day 185 #
By the time I'd finished writing my diary entry yesterday I was pretty tired; my mind wasn't entirely with it. But it wasn't just the need for sleep that was causing me confusion. I was also confused as to why the HasEglImageExtensions() was returning false on ESR 91 while HasExtensions() — which is essentially the same functionality — was returning true.
A good night's sleep hasn't helped answer this question unfortunately. My manual inspection of the code coupled with output from the debugger suggested that HasEglImageExtensions() should have been returning true.
What I'd really like to see is explicit output for each of the items in the condition to figure out which is returning false. The debugger on its own won't be any further help with this as there are too many steps optimised out. But if I expand the code a bit, rebuild and redeploy, then I may be able to get a clearer picture. So at least that's a clear path for today.
The first step is to make some changes to the code. I've added in variables to store return values for each of the four flags, all marked as volatile in the hope this will prevent the compiler from optimising them away. Then I print them all out. In practice I don't really care whether they actually get printed out or not, since my plan is to inspect them using the debugger. But I need to do something with them; printing them out is as good as anything.
On executing the updated code I'm surprised to discover that it does actually output to the console. And the results aren't what I was expected. Well, not exactly.
Here's the output I get. And as soon as I see this output I realise the stupid mistake I've made.
I feel more than a little bit silly. But it's okay, the important point is that it's fixed now. I've added in that crucial missing ! and this should now work as expected:
On executing the app and with this change in place we still don't unfortunately get a render. In fact the app now seems to hog CPU cycles and make my phone unresponsive. I have a feeling this is a memory leak, but a bit more digging will help confirm it (or otherwise).
If this change has triggered a memory leak, it's likely because the surface being created by SurfaceFactory_EGLImage::Create() is never being freed. Creating a new 1080 by 2520 texture each frame will start to eat up memory pretty fast. So an obvious next step is to find out where it's being freed on ESR 78 and establish whether the same thing is happening on ESR 91 or not.
Unfortunately it turns out to be harder to find than I'd expected. There are quite a few methods that are used for deleting textures or memory associated with them. I've tried adding breakpoints to all of the following:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
A good night's sleep hasn't helped answer this question unfortunately. My manual inspection of the code coupled with output from the debugger suggested that HasEglImageExtensions() should have been returning true.
What I'd really like to see is explicit output for each of the items in the condition to figure out which is returning false. The debugger on its own won't be any further help with this as there are too many steps optimised out. But if I expand the code a bit, rebuild and redeploy, then I may be able to get a clearer picture. So at least that's a clear path for today.
The first step is to make some changes to the code. I've added in variables to store return values for each of the four flags, all marked as volatile in the hope this will prevent the compiler from optimising them away. Then I print them all out. In practice I don't really care whether they actually get printed out or not, since my plan is to inspect them using the debugger. But I need to do something with them; printing them out is as good as anything.
static bool HasEglImageExtensions(const GLContextEGL& gl) { const auto& egl = *(gl.mEgl); volatile bool imagebase = egl.HasKHRImageBase(); volatile bool tex2D = egl.IsExtensionSupported( EGLExtension::KHR_gl_texture_2D_image); volatile bool external = gl.IsExtensionSupported( GLContext::OES_EGL_image_external); volatile bool image = gl.IsExtensionSupported(GLContext::OES_EGL_image); printf_stderr("RENDER: egl HasKHRImageBase: %d\n", imagebase); printf_stderr("RENDER: egl KHR_gl_texture_2D_image: %d\n", tex2D); printf_stderr("RENDER: gl OES_EGL_image_external: %d\n", external); printf_stderr("RENDER: gl OES_EGL_image: %d\n", image); return egl.HasKHRImageBase() && egl.IsExtensionSupported(EGLExtension::KHR_gl_texture_2D_image) && (gl.IsExtensionSupported(GLContext::OES_EGL_image_external) || gl.IsExtensionSupported(GLContext::OES_EGL_image)); }Now I've set it building. As I write this I'm on a different mode of transport: travelling by bus. It's surprising that using a laptop on a bus feels far more socially awkward compared to using a laptop on a train. It's true the ride is more bumpy and the space more cramped, but I still find it odd that I don't see other people doing it. Everyone is on their phones; nobody (apart from me) ever seems to have a laptop out.
On executing the updated code I'm surprised to discover that it does actually output to the console. And the results aren't what I was expected. Well, not exactly.
RENDER: egl HasKHRImageBase: 1 RENDER: egl KHR_gl_texture_2D_image: 1 RENDER: gl OES_EGL_image_external: 1 RENDER: gl OES_EGL_image: 1They're all coming back true, so I must have been mistaken about why the SurfaceFactory_EGLImage::Create() method is exiting early. I've therefore annotated the Create() method with some more debug output in the hope this will shed more light on things. See the added printf_stderr() calls in the code below.
if (HasEglImageExtensions(*gle)) { printf_stderr("RENDER: !HasEglImageExtensions()\n"); return ret; } MOZ_ALWAYS_TRUE(prodGL->MakeCurrent()); GLuint prodTex = CreateTextureForOffscreen(prodGL, formats, size); if (!prodTex) { printf_stderr("RENDER: !prodTex\n"); return ret; } EGLClientBuffer buffer = reinterpret_cast<EGLClientBuffer>(uintptr_t(prodTex)); EGLImage image = egl->fCreateImage(context, LOCAL_EGL_GL_TEXTURE_2D, buffer, nullptr); if (!image) { prodGL->fDeleteTextures(1, &prodTex); printf_stderr("RENDER: !image\n"); return ret; } ret.reset(new SharedSurface_EGLImage(prodGL, size, hasAlpha, formats, prodTex, image)); printf_stderr("RENDER: returning normally\n"); return ret;None of these should be necessary: it should be possible to extract all of this execution flow from the debugger. But for some reason the conclusion I came to from using the debugger doesn't make sense based on the values HasEglImageExtensions() is returning. Maybe I made a mistake somewhere. Nevertheless, this approach should hopefully give an answer to the question we want to know.
Here's the output I get. And as soon as I see this output I realise the stupid mistake I've made.
RENDER: egl HasKHRImageBase: 1 RENDER: egl KHR_gl_texture_2D_image: 1 RENDER: gl OES_EGL_image_external: 1 RENDER: gl OES_EGL_image: 1 RENDER: !HasEglImageExtensions()So did you notice the stupid mistake? Here's the relevant ESR 78 code:
if (!HasExtensions(egl, prodGL)) { return ret; }And here — oh dear — is what I replaced it with:
if (HasEglImageExtensions(*gle)) { return ret; }See what's missing? It's the crucial negation of the condition. Oh boy. I can see why I made this mistake: it's because elsewhere in the same file the condition is used — correctly — with the opposite effect, like this:
if (HasEglImageExtensions(*gle)) { ret.reset(new ptrT({prodGL, SharedSurfaceType::Basic, layers::TextureType::Unknown, true}, caps, allocator, flags, context)); }In one case the method should return early if the extension check fails; in the other case it should reset the returned texture if the extension check succeeds.
I feel more than a little bit silly. But it's okay, the important point is that it's fixed now. I've added in that crucial missing ! and this should now work as expected:
if (!HasEglImageExtensions(*gle)) { return ret; }I'm not expecting this change to miraculously fix the entire rendering pipeline, but it should certainly help.
On executing the app and with this change in place we still don't unfortunately get a render. In fact the app now seems to hog CPU cycles and make my phone unresponsive. I have a feeling this is a memory leak, but a bit more digging will help confirm it (or otherwise).
If this change has triggered a memory leak, it's likely because the surface being created by SurfaceFactory_EGLImage::Create() is never being freed. Creating a new 1080 by 2520 texture each frame will start to eat up memory pretty fast. So an obvious next step is to find out where it's being freed on ESR 78 and establish whether the same thing is happening on ESR 91 or not.
Unfortunately it turns out to be harder to find than I'd expected. There are quite a few methods that are used for deleting textures or memory associated with them. I've tried adding breakpoints to all of the following:
- SharedSurface_EGLImage::~SharedSurface_EGLImage()
- GLContext::Readback()
- GLContext::fDeleteFramebuffers()
- GLContext::raw_fDeleteTextures()
- SharedSurfaceTextureClient::~SharedSurfaceTextureClient()
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
13 Mar 2024 : Day 184 #
Yesterday we determined that a problem in CreateShared() meant that the method was returning a null SharedSurface_EGLImage on ESR 91 when it should have been returning a valid pointer. The question I want to answer today is: "why"?
Stepping through the code the programme counter is jumping all over the place, making it hard to follow. But eventually it becomes clear that it's the HasEglImageExtensions() method that's returning false, causing CreateShared() to return early with a null return value. Although the method is called HasEglImageExtensions() in ESR 91, in ESR 78 it's called something else; just HasExtensions. Let's take a look at the two versions of it. But they're otherwise largely the same. First the ESR 78 version:
Both egl and gl use different enums, so we'll need to consider them separately.
Here's the enum associated with egl in ESR 78, found in GlLibraryEGL.h:
On ESR 91, the related enum, also found in GlLibraryEGL.h, looks like this:
So, no obvious problems on the egl side. Let's now check the longer enum for gl. Here are the active values based on the ESR 78 list available in GLContext.h:
It's not clear to me right now. Something is wrong, but I can't see where. I'd love to dig deeper in to this today but my mind has reached its limit. I'll have to pick this up again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
Stepping through the code the programme counter is jumping all over the place, making it hard to follow. But eventually it becomes clear that it's the HasEglImageExtensions() method that's returning false, causing CreateShared() to return early with a null return value. Although the method is called HasEglImageExtensions() in ESR 91, in ESR 78 it's called something else; just HasExtensions. Let's take a look at the two versions of it. But they're otherwise largely the same. First the ESR 78 version:
bool SharedSurface_EGLImage::HasExtensions(GLLibraryEGL* egl, GLContext* gl) { return egl->HasKHRImageBase() && egl->IsExtensionSupported(GLLibraryEGL::KHR_gl_texture_2D_image) && (gl->IsExtensionSupported(GLContext::OES_EGL_image_external) || gl->IsExtensionSupported(GLContext::OES_EGL_image)); }Followed by the ESR 91 version:
static bool HasEglImageExtensions(const GLContextEGL& gl) { const auto& egl = *(gl.mEgl); return egl.HasKHRImageBase() && egl.IsExtensionSupported(EGLExtension::KHR_gl_texture_2D_image) && (gl.IsExtensionSupported(GLContext::OES_EGL_image_external) || gl.IsExtensionSupported(GLContext::OES_EGL_image)); }As you can see, they're similar but not quite identical. Unfortunately the debugger claims the IsExtensionSupported() methods have been optimised out. But it's a pretty simple method, just returning as it does the value in the mAvailableExtensions array referenced by aKnownExtension.
bool IsExtensionSupported(EGLExtensions aKnownExtension) const { return mAvailableExtensions[aKnownExtension]; }There's a change on ESR 91 where the aKnownExtension is first redirected via the UnderlyingValue() method. Here's the ESR 91 version:
bool IsExtensionSupported(EGLExtension aKnownExtension) const { return mAvailableExtensions[UnderlyingValue(aKnownExtension)]; }We'll come back to UnderlyingValue() in a bit. Now that we know the implementations we can make use of this info when we perform our debugging to circumnavigate the fact the methods have been optimised out: we can just access the mAvailableExtensions array used by each directly instead. Let's take a look at that. First let's look at the values in ESR 78:
(gdb) b HasExtensions Breakpoint 2 at 0x7fb8e84d70: HasExtensions. (2 locations) (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::SharedSurface_EGLImage:: HasExtensions (egl=0x7eac0036a0, gl=0x7eac109140) at gfx/gl/SharedSurfaceEGL.cpp:59 59 return egl->HasKHRImageBase() && (gdb) p egl.mAvailableExtensions $1 = std::bitset = { [0] = 1, [2] = 1, [3] = 1, [5] = 1, [6] = 1, [7] = 1, [13] = 1, [21] = 1, [22] = 1} (gdb) p gl.mAvailableExtensions $2 = std::bitset = { [1] = 1, [57] = 1, [58] = 1, [60] = 1, [72] = 1, [75] = 1, [77] = 1, [78] = 1, [86] = 1, [87] = 1, [96] = 1, [97] = 1, [100] = 1, [111] = 1, [112] = 1, [113] = 1, [114] = 1, [115] = 1, [117] = 1, [118] = 1, [120] = 1, [121] = 1, [122] = 1, [123] = 1, [125] = 1, [126] = 1, [127] = 1, [128] = 1, [129] = 1, [130] = 1, [131] = 1, [132] = 1} (gdb)And for contrast, let's see what happens on ESR 91 using the same process:
(gdb) b HasEglImageExtensions Breakpoint 1 at 0x7ff11322a0: file include/c++/8.3.0/bitset, line 1163. (gdb) c Continuing. [LWP 26957 exited] [LWP 26952 exited] [New LWP 27078] [LWP 27037 exited] [Switching to LWP 26961] Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::HasEglImageExtensions (gl=...) at ${PROJECT}/gfx/gl/SharedSurfaceEGL.cpp:28 28 ${PROJECT}/gfx/gl/SharedSurfaceEGL.cpp: No such file or directory. (gdb) p egl.mAvailableExtensions $1 = std::bitset = { [0] = 1, [2] = 1, [4] = 1, [5] = 1, [6] = 1, [7] = 1, [8] = 1, [11] = 1, [16] = 1, [17] = 1, [22] = 1} (gdb) p gl.mAvailableExtensions $2 = std::bitset = { [1] = 1, [57] = 1, [58] = 1, [60] = 1, [72] = 1, [75] = 1, [77] = 1, [78] = 1, [86] = 1, [87] = 1, [88] = 1, [97] = 1, [99] = 1, [101] = 1, [102] = 1, [113] = 1, [114] = 1, [115] = 1, [116] = 1, [117] = 1, [119] = 1, [120] = 1, [122] = 1, [123] = 1, [124] = 1, [125] = 1, [127] = 1, [128] = 1, [129] = 1, [130] = 1, [131] = 1, [132] = 1, [133] = 1, [134] = 1} (gdb)It's noticeable that neither the egl nor the gl values are identical across the two versions. The obvious question is whether this is a real difference, or whether the UnderlyingValue() method is obscuring the fact that they're the same. Here's what the code has to say about UnderlyingValue():
/** * Get the underlying value of an enum, but typesafe. * * example: * * enum class Pet : int16_t { * Cat, * Dog, * Fish * }; * enum class Plant { * Flower, * Tree, * Vine * }; * UnderlyingValue(Pet::Fish) -> int16_t(2) * UnderlyingValue(Plant::Tree) -> int(1) */ template <typename T> inline constexpr auto UnderlyingValue(const T v) { static_assert(std::is_enum_v<T>); return static_cast<typename std::underlying_type<T>::type>(v); }So this isn't actually changing the value, it's checking and casting it to the appropriate type. So we can ignore this when we're comparing values and conclude that the mAvailableExtensions array definitely has different indices set to true between ESR 78 and ESR 91. But we still need to check the enums that these represent in order to be sure that these are real differences.
Both egl and gl use different enums, so we'll need to consider them separately.
Here's the enum associated with egl in ESR 78, found in GlLibraryEGL.h:
0: KHR_image_base 2: KHR_gl_texture_2D_image 3: KHR_lock_surface 5: EXT_create_context_robustness 6: KHR_image 7: KHR_fence_sync 13: KHR_create_context 21: KHR_surfaceless_context 22: KHR_create_context_no_errorBased on the HasExtensions() implementation we're interested in KHR_gl_texture_2D_image, KHR_image and KHR_image_base; all of which are present in the list above (indices 2, 6 and 0).
On ESR 91, the related enum, also found in GlLibraryEGL.h, looks like this:
0: KHR_image_base 2: KHR_gl_texture_2D_image 4: ANGLE_surface_d3d_texture_2d_share_handle 5: EXT_create_context_robustness 6: KHR_image 7: KHR_fence_sync 8: ANDROID_native_fence_sync 11: ANGLE_platform_angle_d3d 16: EXT_device_query 17: NV_stream_consumer_gltexture_yuv 22: KHR_create_context_no_errorAgain, looking at the code and based on HasEglImageExtensions() we're interested in the same flags: KHR_gl_texture_2D_image, KHR_image and KHR_image_base. All of these are also present in the ESR 91 list (indices 2, 6 and 0).
So, no obvious problems on the egl side. Let's now check the longer enum for gl. Here are the active values based on the ESR 78 list available in GLContext.h:
1: AMD_compressed_ATC_texture 57: EXT_color_buffer_float 58: EXT_color_buffer_half_float 60: EXT_disjoint_timer_query 72: EXT_multisampled_render_to_texture 75: EXT_read_format_bgra 77: EXT_sRGB 78: EXT_sRGB_write_control 86: EXT_texture_filter_anisotropic 87: EXT_texture_format_BGRA8888 96: IMG_texture_npot 97: KHR_debug 100: KHR_robustness 111: NV_transform_feedback 112: NV_transform_feedback2 113: OES_EGL_image 114: OES_EGL_image_external 115: OES_EGL_sync 117: OES_depth24 118: OES_depth32 120: OES_element_index_uint 121: OES_fbo_render_mipmap 122: OES_framebuffer_object 123: OES_packed_depth_stencil 125: OES_standard_derivatives 126: OES_stencil8 127: OES_texture_3D 128: OES_texture_float 129: OES_texture_float_linear 130: OES_texture_half_float 131: OES_texture_half_float_linear 132: OES_texture_npotFrom the ESR 78 code the ones we're interested in are just OES_EGL_image_external and OES_EGL_image. These are both in the list (indices 114 and 113). What about ESR 91? Here's the enum list in this case:
1: AMD_compressed_ATC_texture 57: EXT_color_buffer_float 58: EXT_color_buffer_half_float 60: EXT_disjoint_timer_query 72: EXT_multisampled_render_to_texture 75: EXT_read_format_bgra 77: EXT_sRGB 78: EXT_sRGB_write_control 86: EXT_texture_filter_anisotropic 87: EXT_texture_format_BGRA8888 88: EXT_texture_norm16 97: KHR_debug 99: KHR_robust_buffer_access_behavior 101: KHR_texture_compression_astc_hdr 102: KHR_texture_compression_astc_ldr 113: OES_EGL_image 114: OES_EGL_image_external 115: OES_EGL_sync 116: OES_compressed_ETC1_RGB8_texture 117: OES_depth24 119: OES_depth_texture 120: OES_element_index_uint 122: OES_framebuffer_object 123: OES_packed_depth_stencil 124: OES_rgb8_rgba8 125: OES_standard_derivatives 127: OES_texture_3D 128: OES_texture_float 129: OES_texture_float_linear 130: OES_texture_half_float 131: OES_texture_half_float_linear 132: OES_texture_npot 133: OES_vertex_array_object 134: OVR_multiview2Once again from the ESR 91 code we can see the ones we're interested in are the same: OES_EGL_image_external and OES_EGL_image. These are both in the list (indices 114 and 113). So what gives? Both methods have the appropriate flags set, so why is one succeeding and the other failure?
It's not clear to me right now. Something is wrong, but I can't see where. I'd love to dig deeper in to this today but my mind has reached its limit. I'll have to pick this up again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
12 Mar 2024 : Day 183 #
Before we get in to my usual dev diary I want to fist draw attention to a couple of new development blogs that have started up recently. Both related to mobile operating system development and Linux in particular.
First up is Adventures with Sailfish and Ofono from Adam Pigg (piggz). Adam will be well-known to many amSailfish OS users and already featured here back on Day 178.
A long-time porter of Sailfish OS, Adam is responsible for the native PinePhone, PinePhone Pro and Volla ports, amongst others.
He's recently turned his hand to contributing Sailfish-specific Ofono changes to upstream, with the aim of reversing the divergence that's grown between the two over the years. If successful, this would not only add functionality to upstream Ofono, it would also allow updating the Sailfish OS version and benefiting from recent upstream improvements as well.
Adam recently started writing a semi-daily blog about it. He's up to Day 5 already and it's a great read.
But it's not just Adam. Peter Mack (1peter10) from LINMOB.net has started a developer diary to chart his explorations of — and improvements to — Mobile Config Firefox. The project is aimed at getting Firefox nicely configured for mobile use, part of the postmarketOS project. Peter has already written about his first pull request to tidy up the URL bar for use on smaller displays.
I'm really enjoying reading about others' approach to development and watching as things progress. I'll be following along avidly to both.
It might seem a little obvious, but I also recommend Peter's weekly Mobile Linux Update as the best way to catch up on all the latest activity in the mobile Linux space. I like to think I'm keeping up with developments in the world of Sailfish OS, but keeping up with activity across all of the various distributions is a real challenge. I'd say it was impossible, except that this is exactly what Peter does, making it possible for the rest of us to keep up in the process.
On a separate but related note, I also want to give a shout-out to Florian Xaver (wosrediinanatour). Florian has been extremely helpful reviewing some of the code changes mentioned here in my diary. He's been sharing useful advice and tips. I'm going to go into this in more detail in a future diary entry, but for now, let me just say that I'm grateful for the input.
Alright, let's move back on to the gecko development track. After taking some steps to align the ESR 78 and ESR 91 offscreen rendering pipeline yesterday, I'm following on with more of the same today. My plan is to step through various methods I know to be relevant as part of the render process and see whether they differ between ESR 78 and ESR 91. I have a pretty good setup for this. Two phones, one with ESR 78 another with ESR 91. Two SSH sessions, one for each phone, running my test application through the debugger. Then on another display I have Qt Creator running with ESR 78 code on one side and ESR 91 code on the other.
With this setup I can step through the code simultaneously on ESR 78 and ESR 91 to establish whether they diverge or not, and if so where. The first method I'm going to look at is the same one we started with yesterday, which is GLScreenBuffer::Swap(). What I'd really like to do is show the debugger output side-by-side here, but the line lengths are too wide for it to comfortably fit, so I'm just going to have to list them here serially.
Working this way it doesn't take long before I'm able to identify a critical issue. First ESR 78.
The SurfaceFactory_EGLImage class must be inheriting the method from SurfaceFactory::NewTexClient(). So we can check by stepping through that. First the ESR 78 code.
On ESR 91 there's a big difference: although mAllocator and mFlags are set correctly, the call to CreateShared returns null. Immediately afterwards the method notices this and returns early.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
First up is Adventures with Sailfish and Ofono from Adam Pigg (piggz). Adam will be well-known to many amSailfish OS users and already featured here back on Day 178.
A long-time porter of Sailfish OS, Adam is responsible for the native PinePhone, PinePhone Pro and Volla ports, amongst others.
He's recently turned his hand to contributing Sailfish-specific Ofono changes to upstream, with the aim of reversing the divergence that's grown between the two over the years. If successful, this would not only add functionality to upstream Ofono, it would also allow updating the Sailfish OS version and benefiting from recent upstream improvements as well.
Adam recently started writing a semi-daily blog about it. He's up to Day 5 already and it's a great read.
But it's not just Adam. Peter Mack (1peter10) from LINMOB.net has started a developer diary to chart his explorations of — and improvements to — Mobile Config Firefox. The project is aimed at getting Firefox nicely configured for mobile use, part of the postmarketOS project. Peter has already written about his first pull request to tidy up the URL bar for use on smaller displays.
I'm really enjoying reading about others' approach to development and watching as things progress. I'll be following along avidly to both.
It might seem a little obvious, but I also recommend Peter's weekly Mobile Linux Update as the best way to catch up on all the latest activity in the mobile Linux space. I like to think I'm keeping up with developments in the world of Sailfish OS, but keeping up with activity across all of the various distributions is a real challenge. I'd say it was impossible, except that this is exactly what Peter does, making it possible for the rest of us to keep up in the process.
On a separate but related note, I also want to give a shout-out to Florian Xaver (wosrediinanatour). Florian has been extremely helpful reviewing some of the code changes mentioned here in my diary. He's been sharing useful advice and tips. I'm going to go into this in more detail in a future diary entry, but for now, let me just say that I'm grateful for the input.
Alright, let's move back on to the gecko development track. After taking some steps to align the ESR 78 and ESR 91 offscreen rendering pipeline yesterday, I'm following on with more of the same today. My plan is to step through various methods I know to be relevant as part of the render process and see whether they differ between ESR 78 and ESR 91. I have a pretty good setup for this. Two phones, one with ESR 78 another with ESR 91. Two SSH sessions, one for each phone, running my test application through the debugger. Then on another display I have Qt Creator running with ESR 78 code on one side and ESR 91 code on the other.
With this setup I can step through the code simultaneously on ESR 78 and ESR 91 to establish whether they diverge or not, and if so where. The first method I'm going to look at is the same one we started with yesterday, which is GLScreenBuffer::Swap(). What I'd really like to do is show the debugger output side-by-side here, but the line lengths are too wide for it to comfortably fit, so I'm just going to have to list them here serially.
Working this way it doesn't take long before I'm able to identify a critical issue. First ESR 78.
(gdb) b GLScreenBuffer::Swap Breakpoint 2 at 0x7fb8e672b8: file dist/include/mozilla/UniquePtr.h, line 287. (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x7eac003c00, size=...) at dist/include/mozilla/UniquePtr.h:287 287 in dist/include/mozilla/UniquePtr.h (gdb) p size $1 = (const mozilla::gfx::IntSize &) @0x7eac1deaa8: {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) n 382 if (!newBack) return false; (gdb) p newBack $2 = {mRawPtr = 0x7eac138860} (gdb)And now ESR 91.
(gdb) break GLScreenBuffer::Swap Breakpoint 2 at 0x7ff1106f8c: file ${PROJECT}/gfx/gl/GLScreenBuffer.cpp, line 506. (gdb) c Continuing. [LWP 22652 exited] Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642d50, size=...) at ${PROJECT}/gfx/gl/GLScreenBuffer.cpp:506 506 in ${PROJECT}/gfx/gl/GLScreenBuffer.cpp (gdb) p size $1 = (const mozilla::gfx::IntSize &) @0x7edc1a21e4: {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) n [LWP 22657 exited] 290 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb) n 509 ${PROJECT}/gfx/gl/GLScreenBuffer.cpp: No such file or directory. (gdb) p newBack $2 = {mRawPtr = 0x0} (gdb)Debugging on the ESR 91 build isn't so clean due to the partial build messing up some of the debug source alignment, but we can nevertheless see that the call to SurfaceFactory_EGLImage::NewTexClient() is returning something sensible in ESR 78, but null in ESR 91. Here's the relevant code:
RefPtr<layers::SharedSurfaceTextureClient> newBack = mFactory->NewTexClient(size); if (!newBack) return false;Let's ensure that mFactory is valid and of the correct type. First on ESR 78:
(gdb) p mFactory.mTuple.mFirstA $4 = (mozilla::gl::SurfaceFactory *) 0x7eac139b60 (gdb) set print object on (gdb) p mFactory.mTuple.mFirstA $5 = (mozilla::gl::SurfaceFactory_EGLImage *) 0x7eac139b60 (gdb) set print object off (gdb)And then, to compare, on ESR 91
(gdb) p mFactory.mTuple.mFirstA $7 = (mozilla::gl::SurfaceFactory *) 0x7edc1dc470 (gdb) set print object on (gdb) p mFactory.mTuple.mFirstA $8 = (mozilla::gl::SurfaceFactory_EGLImage *) 0x7edc1dc470 (gdb) set print object off (gdb)That all looks similar, so the next thing to check is what's happening inside SurfaceFactory_EGLImage::NewTexClient() that's preventing it from doing what it's supposed to. But when I try to place a breakpoint on SurfaceFactory_EGLImage::NewTexClient() I discover I can't: it doesn't exist.
The SurfaceFactory_EGLImage class must be inheriting the method from SurfaceFactory::NewTexClient(). So we can check by stepping through that. First the ESR 78 code.
(gdb) b SurfaceFactory::NewTexClient Breakpoint 3 at 0x7fb8e6f338: file gfx/gl/SharedSurface.cpp, line 287. (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SurfaceFactory:: NewTexClient (this=0x7eac139b60, size=...) at gfx/gl/SharedSurface.cpp:287 287 SurfaceFactory::NewTexClient(const gfx::IntSize& size) { (gdb) bt #0 mozilla::gl::SurfaceFactory::NewTexClient (this=0x7eac139b60, size=...) at gfx/gl/SharedSurface.cpp:287 #1 0x0000007fb8e672d8 in mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x7eac003c00, size=...) at dist/include/mozilla/UniquePtr.h:287 [...] #25 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6 (gdb) p size $6 = (const mozilla::gfx::IntSize &) @0x7eac140ec8: {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) p mRecycleFreePool $7 = {mQueue = std::queue wrapping: std::deque with 0 elements} (gdb) n 1367 include/c++/8.3.0/bits/stl_deque.h: No such file or directory. (gdb) n 300 UniquePtr<SharedSurface> surf = CreateShared(size); (gdb) n 301 if (!surf) return nullptr; (gdb) p surf.mTuple.mFirstA $8 = (mozilla::gl::SharedSurface *) 0x7eac1802b0 (gdb) n 292 dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb) n 289 dist/include/mozilla/RefPtr.h: No such file or directory. (gdb) n 305 mAllocator, mFlags); (gdb) n 307 StartRecycling(ret); (gdb) p ret $10 = {mRawPtr = 0x7eac138910} (gdb) p mAllocator $11 = {mRawPtr = 0x0} (gdb) p mFlags $12 = mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT (gdb)Notice how the call to CreateShared() returns an object which then when moved into Create() returns another object. The allocator is null and the flags are set to ORIGIN_BOTTOM_LEFT.
On ESR 91 there's a big difference: although mAllocator and mFlags are set correctly, the call to CreateShared returns null. Immediately afterwards the method notices this and returns early.
(gdb) b SurfaceFactory::NewTexClient Breakpoint 3 at 0x7ff111d888: file ${PROJECT}/gfx/gl/SharedSurface.cpp, line 393. (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SurfaceFactory:: NewTexClient (this=0x7edc1dc470, size=...) at ${PROJECT}/gfx/gl/SharedSurface.cpp:393 393 ${PROJECT}/gfx/gl/SharedSurface.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::SurfaceFactory::NewTexClient (this=0x7edc1dc470, size=...) at ${PROJECT}/gfx/gl/SharedSurface.cpp:393 #1 0x0000007ff1106fac in mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642d50, size=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 [...] #24 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb) p size $17 = (const mozilla::gfx::IntSize &) @0x7ed81a22e4: {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) p mRecycleFreePool $18 = {mQueue = std::queue wrapping: std::deque with 0 elements} (gdb) p mAllocator $19 = {mRawPtr = 0x0} (gdb) p mFlags $20 = mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT (gdb) p mRecycleFreePool $21 = {mQueue = std::queue wrapping: std::deque with 0 elements} (gdb) n 394 in ${PROJECT}/gfx/gl/SharedSurface.cpp (gdb) n 406 in ${PROJECT}/gfx/gl/SharedSurface.cpp (gdb) n 407 in ${PROJECT}/gfx/gl/SharedSurface.cpp (gdb) p surf.mTuple.mFirstA $22 = (mozilla::gl::SharedSurface *) 0x0 (gdb) n 79 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h: No such file or directory. (gdb) n mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555643a10, size=...) at ${PROJECT}/gfx/gl/GLScreenBuffer.cpp:509 509 ${PROJECT}/gfx/gl/GLScreenBuffer.cpp: No such file or directory. (gdb)So it would seem that there's a problem in CreateShared(), so the next step will be to drill down into that. That's all I've time for today though; we'll pick this up again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
11 Mar 2024 : Day 182 #
We reached an important milestone yesterday when our WebView-augmented app finally ran without crashing for the first time. There's still nothing being rendered to the screen, but preventing the app from crashing feels like an important waypoint on the journey towards the result we want.
But crashes have their uses. In particular, when an app crashes the debugger provides a backtrace which can be the bedrock of an investigation; a place to start and fall back to in case things spiral out of control. Now we're presented with something more nebulous: somewhere in the program there are one or many bugs, or differences in execution, between the ESR 78 and ESR 91 rendering pipelines that we need to find and rewire. We've been here before, with the original browser rendering pipeline. That time it took a few weeks before I managed to find the root cause. I'm not expecting this to be any easier.
The first thing to check is whether the render loop is actually being called. Here we have something to go on in the form of the GLScreenBuffer::Swap() method. This should be called every frame in order to move the image on the back buffer onto the front buffer. We can use the debugger to see whether it's being called.
So let's try and find out why. In ESR 78 the code that's being called from CompositorVsyncScheduler::Composite() is the following:
I'm going to leave it there for today. Tomorrow we'll look further into the differences between ESR 78 and ESR 91.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
But crashes have their uses. In particular, when an app crashes the debugger provides a backtrace which can be the bedrock of an investigation; a place to start and fall back to in case things spiral out of control. Now we're presented with something more nebulous: somewhere in the program there are one or many bugs, or differences in execution, between the ESR 78 and ESR 91 rendering pipelines that we need to find and rewire. We've been here before, with the original browser rendering pipeline. That time it took a few weeks before I managed to find the root cause. I'm not expecting this to be any easier.
The first thing to check is whether the render loop is actually being called. Here we have something to go on in the form of the GLScreenBuffer::Swap() method. This should be called every frame in order to move the image on the back buffer onto the front buffer. We can use the debugger to see whether it's being called.
(gdb) b GLScreenBuffer::Swap Breakpoint 1 at 0x7ff1106f8c: file ${PROJECT}/gecko-dev/gfx/gl/ GLScreenBuffer.cpp, line 506. (gdb) c Continuing. [LWP 1224 exited] [LWP 1218 exited] [New LWP 1735] [LWP 1255 exited]No hits. In one sense this is bad: something is broken. On the other hand, it's also good: we already knew something was broken, this at least gives us something concrete to fix. So the next step is to see whether it's also actually being called on ESR 78. It could be that I've misunderstood how this rendering pipeline is supposed to work.
(gdb) b GLScreenBuffer::Swap Breakpoint 1 at 0x7fb8e672b8: file obj-build-mer-qt-xr/dist/include/mozilla/ UniquePtr.h, line 287. (gdb) c Continuing. [LWP 20447 exited] [LWP 20440 exited] [LWP 20511 exited] [LWP 20460 exited]Also no hit! Ah... that is until I touch the screen. Then suddenly this:
[New LWP 20679] [New LWP 20680] [New LWP 20681] [New LWP 20682] [Switching to LWP 20444] Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x7eac003c00, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:287 287 obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb)We should get a backtrace from ESR 78 to compare against.
(gdb) bt #0 mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x7eac003c00, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:287 #1 0x0000007fbb2e2d8c in mozilla::gl::GLScreenBuffer::PublishFrame (size=..., this=0x7eac003c00) at obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:171 #2 mozilla::embedlite::EmbedLiteCompositorBridgeParent:: PresentOffscreenSurface (this=0x7f8c99d3b0) at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:183 #3 0x0000007fbb2f8600 in mozilla::embedlite::nsWindow::PostRender (this=0x7f8c87db30, aContext=<optimized out>) at mobile/sailfishos/embedshared/nsWindow.cpp:395 #4 0x0000007fb8fbff4c in mozilla::layers::LayerManagerComposite::Render (this=this@entry=0x7eac1988c0, aInvalidRegion=..., aOpaqueRegion=...) at obj-build-mer-qt-xr/dist/include/mozilla/TimeStamp.h:452 #5 0x0000007fb8fc02c4 in mozilla::layers::LayerManagerComposite:: UpdateAndRender (this=this@entry=0x7eac1988c0) at gfx/layers/composite/LayerManagerComposite.cpp:647 #6 0x0000007fb8fc0514 in mozilla::layers::LayerManagerComposite:: EndTransaction (aFlags=mozilla::layers::LayerManager::END_DEFAULT, aTimeStamp=..., this=0x7eac1988c0) at gfx/layers/composite/ LayerManagerComposite.cpp:566 #7 mozilla::layers::LayerManagerComposite::EndTransaction (this=0x7eac1988c0, aTimeStamp=..., aFlags=mozilla::layers::LayerManager::END_DEFAULT) at gfx/layers/composite/LayerManagerComposite.cpp:536 #8 0x0000007fb8fe7f9c in mozilla::layers::CompositorBridgeParent:: CompositeToTarget (this=0x7f8c99d3b0, aId=..., aTarget=0x0, aRect=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #9 0x0000007fbb2e288c in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget (this=0x7f8c99d3b0, aId=...) at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:159 #10 0x0000007fb8fc7988 in mozilla::layers::CompositorVsyncScheduler::Composite (this=0x7f8cbacc40, aId=..., aVsyncTimestamp=...) at gfx/layers/ipc/CompositorVsyncScheduler.cpp:249 #11 0x0000007fb8fc5ff0 in mozilla::detail::RunnableMethodArguments <mozilla::layers::BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::layers:: BaseTransactionId<mozilla::VsyncIdType>, mozilla::TimeStamp), StoreCopyPassByConstLRef<mozilla::layers::BaseTransactionId<mozilla:: VsyncIdType> >, StoreCopyPassByConstLRef<mozilla::TimeStamp>, 0ul, 1ul> (args=..., m=<optimized out>, o=<optimized out>) at obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:925 [...] #24 0x0000007fbe70d89c in ?? () from /lib64/libc.so.6 (gdb)Going back to ESR 91, it turns out the method isn't missing after all, it just needs a bit of prodding to get it to be called. So on touching the screen I get the same result. We should compare the backtraces. Here's what we get from ESR 91:
Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555643950, size=...) at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506 506 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555643950, size=...) at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506 #1 0x0000007ff36667a8 in mozilla::gl::GLScreenBuffer::PublishFrame (size=..., this=0x5555643950) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:229 #2 mozilla::embedlite::EmbedLiteCompositorBridgeParent::PresentOffscreenSurface (this=0x7fc49fde60) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:191 #3 0x0000007ff367fc0c in mozilla::embedlite::nsWindow::PostRender (this=0x7fc49fc290, aContext=<optimized out>) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:248 #4 0x0000007ff2a64ec0 in mozilla::widget::InProcessCompositorWidget::PostRender (this=0x7fc4b8df00, aContext=0x7f1f970848) at ${PROJECT}/gecko-dev/widget/InProcessCompositorWidget.cpp:60 #5 0x0000007ff128f9f4 in mozilla::layers::LayerManagerComposite::Render (this=this@entry=0x7ed41bb450, aInvalidRegion=..., aOpaqueRegion=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/Compositor.h:575 #6 0x0000007ff128fe70 in mozilla::layers::LayerManagerComposite:: UpdateAndRender (this=this@entry=0x7ed41bb450) at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:657 #7 0x0000007ff1290220 in mozilla::layers::LayerManagerComposite::EndTransaction (this=this@entry=0x7ed41bb450, aTimeStamp=..., aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT) at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:572 #8 0x0000007ff12d19bc in mozilla::layers::CompositorBridgeParent:: CompositeToTarget (this=0x7fc49fde60, aId=..., aTarget=0x0, aRect=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #9 0x0000007ff12b7104 in mozilla::layers::CompositorVsyncScheduler::Composite (this=0x7fc429cac0, aVsyncEvent=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorVsyncScheduler.cpp:256 #10 0x0000007ff12af57c in mozilla::detail::RunnableMethodArguments<mozilla:: VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&), StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=..., m=<optimized out>, o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887 [...] #22 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)Both are largely similar, both are initially triggered from the vsync scheduler CompositorVsyncScheduler(), which makes sense for a render update pipeline. But there are some differences too. In between the vsync scheduler and the layer manager's EndTransaction() call we have this on ESR 78:
#7 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:536 #8 CompositorBridgeParent::CompositeToTarget, mozilla/RefPtr.h:313 #9 EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget, EmbedLiteCompositorBridgeParent.cpp:159 #10 CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:249Whereas on ESR 91 we have this:
#7 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572 #8 CompositorBridgeParent::CompositeToTarget, RefPtr.h:313 #9 CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:256It could be that this is a completely benign change, either due to us hitting a breakpoint at a slightly different point in the cycle, or due to upstream changes that are unrelated to the rendering issues we're experiencing. But I'm honing in on it because I remember there being differences in the way the target is set up and this immediately looks suspicious to me. Especially suspicious is the fact that EmbedLiteCompositorBridgeParent has been written out of the ESR 91 execution flow. That's Sailfish-specific code, so that could well indicate a problem.
So let's try and find out why. In ESR 78 the code that's being called from CompositorVsyncScheduler::Composite() is the following:
// Tell the owner to do a composite mVsyncSchedulerOwner->CompositeToDefaultTarget(aId); mVsyncNotificationsSkipped = 0;In ESR 91 we have a strange addition to this.
// Tell the owner to do a composite mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr); mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId); mVsyncNotificationsSkipped = 0;The extra line spacing makes this look very intentional, but the real question I'd like to know the answer to is: "was this change done by upstream or by me?". If it was upstream then it's almost certainly intentional. If it was me, well, then it could well be a mistake. We can find out, as always, using git.
git blame gfx/layers/ipc/CompositorVsyncScheduler.cpp -L 255,259 Blaming lines: 1% (5/370), done. 7a2ef4343bb1d (Kartikaya Gupta 2018-02-01 16:28:53 -0500 255) // Tell the owner to do a composite 97287dc1b1d82 (Markus Stange 2020-07-18 05:17:39 +0000 256) mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr); 0acadeba1ac39 (David Llewellyn-Jones 2023-08-28 14:55:57 +0100 257) mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId); 7a2ef4343bb1d (Kartikaya Gupta 2018-02-01 16:28:53 -0500 258) 133e28473a0f8 (Sotaro Ikeda 2016-11-18 02:37:04 -0800 259) mVsyncNotificationsSkipped = 0;Well, that's interesting. There is a line inserted by me, but it's not the line I was expecting. It looks very much like I added the line and intended to remove the line before, but forgot. I'm going to rectify this.
$ git diff gfx/layers/ipc/CompositorVsyncScheduler.cpp diff --git a/gfx/layers/ipc/CompositorVsyncScheduler.cpp b/gfx/layers/ipc/CompositorVsyncScheduler.cpp index 2e8e58a2c46b..3abe24ceeeea 100644 --- a/gfx/layers/ipc/CompositorVsyncScheduler.cpp +++ b/gfx/layers/ipc/CompositorVsyncScheduler.cpp @@ -253,9 +253,7 @@ void CompositorVsyncScheduler::Composite (const VsyncEvent& aVsyncEvent) { mLastComposeTime = SampleTime::FromVsync(aVsyncEvent.mTime); // Tell the owner to do a composite - mVsyncSchedulerOwner->CompositeToTarget(aVsyncEvent.mId, nullptr, nullptr); mVsyncSchedulerOwner->CompositeToDefaultTarget(aVsyncEvent.mId); - mVsyncNotificationsSkipped = 0; TimeDuration compositeFrameTotal = TimeStamp::Now() - aVsyncEvent.mTime;Line removed. Now to build and see what happens when we try to run it.
(gdb) b GLScreenBuffer::Swap Breakpoint 1 at 0x7ff1106f8c: file ${PROJECT}/gecko-dev/gfx/gl/ GLScreenBuffer.cpp, line 506. (gdb) c Continuing. [LWP 22665 exited] [New LWP 22740] [Switching to LWP 22660] Thread 36 "Compositor" hit Breakpoint 1, mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642d50, size=...) at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506 506 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp: No such file or directory. (gdb) bt #0 mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642d50, size=...) at ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:506 #1 0x0000007ff3666788 in mozilla::gl::GLScreenBuffer::PublishFrame (size=..., this=0x5555642d50) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/GLScreenBuffer.h:229 #2 mozilla::embedlite::EmbedLiteCompositorBridgeParent::PresentOffscreenSurface (this=0x7fc49fde10) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:191 #3 0x0000007ff367fbec in mozilla::embedlite::nsWindow::PostRender (this=0x7fc49fc240, aContext=<optimized out>) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/nsWindow.cpp:248 #4 0x0000007ff2a64ea4 in mozilla::widget::InProcessCompositorWidget::PostRender (this=0x7fc4b88770, aContext=0x7f177f87e8) at ${PROJECT}/gecko-dev/widget/InProcessCompositorWidget.cpp:60 #5 0x0000007ff128f9f4 in mozilla::layers::LayerManagerComposite::Render (this=this@entry=0x7edc1bb450, aInvalidRegion=..., aOpaqueRegion=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/ Compositor.h:575 #6 0x0000007ff128fe70 in mozilla::layers::LayerManagerComposite:: UpdateAndRender (this=this@entry=0x7edc1bb450) at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:657 #7 0x0000007ff1290220 in mozilla::layers::LayerManagerComposite::EndTransaction (this=this@entry=0x7edc1bb450, aTimeStamp=..., aFlags=aFlags@entry=mozilla::layers::LayerManager::END_DEFAULT) at ${PROJECT}/gecko-dev/gfx/layers/composite/LayerManagerComposite.cpp:572 #8 0x0000007ff12d19a0 in mozilla::layers::CompositorBridgeParent:: CompositeToTarget (this=0x7fc49fde10, aId=..., aTarget=0x0, aRect=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #9 0x0000007ff3666488 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget (this=0x7fc49fde10, aId=...) at ${PROJECT}/gecko-dev/mobile/sailfishos/embedthread/ EmbedLiteCompositorBridgeParent.cpp:165 #10 0x0000007ff12b70fc in mozilla::layers::CompositorVsyncScheduler::Composite (this=0x7fc4b82ef0, aVsyncEvent=...) at ${PROJECT}/gecko-dev/gfx/layers/ipc/CompositorVsyncScheduler.cpp:256 #11 0x0000007ff12af57c in mozilla::detail::RunnableMethodArguments<mozilla:: VsyncEvent>::applyImpl<mozilla::layers::CompositorVsyncScheduler, void (mozilla::layers::CompositorVsyncScheduler::*)(mozilla::VsyncEvent const&), StoreCopyPassByConstLRef<mozilla::VsyncEvent>, 0ul> (args=..., m=<optimized out>, o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:887 [...] #23 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)Let's strip out the part we're interested in.
#7 LayerManagerComposite::EndTransaction, LayerManagerComposite.cpp:572 #8 CompositorBridgeParent::CompositeToTarget, RefPtr.h:313 #9 EmbedLiteCompositorBridgeParent::CompositeToDefaultTarget, EmbedLiteCompositorBridgeParent.cpp:165 #10 CompositorVsyncScheduler::Composite, CompositorVsyncScheduler.cpp:256If we compare this with the previous backtrace from ESR 78 we can see that's now aligning fully. Unfortunately, despite fixing this issue, it doesn't give us a working render: the screen is still just plain white. But it will be one step on the way to fixing things fully.
I'm going to leave it there for today. Tomorrow we'll look further into the differences between ESR 78 and ESR 91.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
10 Mar 2024 : Day 181 #
Today it's time to test out the changes I made yesterday, adding in the SharedSurface_Basic functionality that got lost in the transition to ESR 91. The key change is that now the ProdTexture() will be overridden and so the call to it from SurfaceFactory during initialisation should — I'm hoping — no longer trigger a crash.
One of the downsides of using the massive partial libxul.so builds packed full of debugging information is that they just take up so much room on the device. But after deleting the library, the reinstallation of it then goes through without a hitch. It does strike me as a little odd that the calculation for how much space is needed doesn't take into account how much will be removed as well as how much will be added. I guess this is important for allowing transactional updates.
Nevertheless, this still seems to be a crash due to a missing override, as we can see from the error message that's output: "Did you forget to override this function". It's an induced crash again. But this time the surface is of type SharedSurface_EGLImage. Probably we'll have to add the overrides into this class as well. This will be similar to the work I did yesterday, but this time applied to a different class that's also inheriting from SharedSurface.
Looking at the SharedSurface_EGLImage class definition in SharedSurfaceEGL.h there are some very distinct differences between ESR 78 and ESR 91, including the lack of a ProdTexture() override in ESR 91. Here are the relevant code pieces from ESR 78 (I've rearranged some of the line orders for clarity):
The mFormats variable is also missing from ESR 91, but I can't see anywhere that's used in a meaningful way, so I'll leave that out. The Cast() method has also been removed. But the logic for this is pretty simple and it looks like this has just been replaced with the same logic and direct cast in the various places it's used in the code, rather than in a separate method. Given this, there looks to be no need to revert this particular change.
The other change I've had to make is to the SharedSurface_EGLImage::Create() and SurfaceFactory_EGLImage::Create() methods. I've changed their implementations slightly and redirected them to use the new (old?) constructors.
With all of these changes in place compilation the partial build now goes through. I've linked the partial build, copied it over to my phone and manually copied it to the correct directory. Now to see whether it's had any effect.
I'm afraid to say, there's still a long journey ahead of us. But we are still, slowly but surely, moving forwards.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
One of the downsides of using the massive partial libxul.so builds packed full of debugging information is that they just take up so much room on the device. But after deleting the library, the reinstallation of it then goes through without a hitch. It does strike me as a little odd that the calculation for how much space is needed doesn't take into account how much will be removed as well as how much will be added. I guess this is important for allowing transactional updates.
$ rpm -U xulrunner-qt5-91.*.rpm xulrunner-qt5-debuginfo-91.*.rpm \ xulrunner-qt5-debugsource-91.*.rpm xulrunner-qt5-misc-91.*.rpm installing package xulrunner-qt5-91.9.1+git1.aarch64 needs 7MB more space on the / filesystem $ rm /usr/lib64/xulrunner-qt5-91.9.1/libxul.so $ rpm -U xulrunner-qt5-91.*.rpm xulrunner-qt5-debuginfo-91.*.rpm \ xulrunner-qt5-debugsource-91.*.rpm xulrunner-qt5-misc-91.*.rpmWhen running the new code it still almost immediately crashes. And that's not a surprise; I'm expecting at least a few more cycles of this "run-crash-debug-fix" process before we have something working.
Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 4396] 0x0000007ff1107dc4 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gfx/gl/SharedSurface.h:157 157 MOZ_CRASH("GFX: Did you forget to override this function?"); (gdb) bt #0 0x0000007ff1107dc4 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gfx/gl/SharedSurface.h:157 #1 0x0000007ff1106cc4 in mozilla::gl::ReadBuffer::Attach (this=0x7ed41a1700, surf=surf@entry=0x7ed419f9c0) at gfx/gl/GLScreenBuffer.cpp:718 #2 0x0000007ff1106ebc in mozilla::gl::GLScreenBuffer::Attach (this=this@entry=0x5555642e30, surf=0x7ed419f9c0, size=...) at gfx/gl/GLScreenBuffer.cpp:486 #3 0x0000007ff1106f60 in mozilla::gl::GLScreenBuffer::Swap (this=this@entry=0x5555642e30, size=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 [...] #25 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb) p this $6 = <optimized out> (gdb) frame 1 #1 0x0000007ff1106cc4 in mozilla::gl::ReadBuffer::Attach (this=0x7ed41a1700, surf=surf@entry=0x7ed419f9c0) at gfx/gl/GLScreenBuffer.cpp:718 718 colorTex = surf->ProdTexture(); (gdb) p surf $3 = (mozilla::gl::SharedSurface *) 0x7ed419f9c0 (gdb) set print object on (gdb) p surf $5 = (mozilla::gl::SharedSurface_EGLImage *) 0x7ed419f9c0 (gdb) set print object off (gdb)My initial reaction is that this is the same error that I spent yesterday trying to fix. But on closer inspection it's actually a little different. So maybe the changes made yesterday were actually worthwhile after all?
Nevertheless, this still seems to be a crash due to a missing override, as we can see from the error message that's output: "Did you forget to override this function". It's an induced crash again. But this time the surface is of type SharedSurface_EGLImage. Probably we'll have to add the overrides into this class as well. This will be similar to the work I did yesterday, but this time applied to a different class that's also inheriting from SharedSurface.
Looking at the SharedSurface_EGLImage class definition in SharedSurfaceEGL.h there are some very distinct differences between ESR 78 and ESR 91, including the lack of a ProdTexture() override in ESR 91. Here are the relevant code pieces from ESR 78 (I've rearranged some of the line orders for clarity):
class SharedSurface_EGLImage : public SharedSurface { [...] protected: mutable Mutex mMutex; const GLFormats mFormats; GLuint mProdTex; [...] virtual GLuint ProdTexture() override { return mProdTex; } [...]In comparison, the ProdTexture() method and associated mProdTex member variable are both missing in ESR 91. I'll need to add them in, along with all the logic associated with them.
The mFormats variable is also missing from ESR 91, but I can't see anywhere that's used in a meaningful way, so I'll leave that out. The Cast() method has also been removed. But the logic for this is pretty simple and it looks like this has just been replaced with the same logic and direct cast in the various places it's used in the code, rather than in a separate method. Given this, there looks to be no need to revert this particular change.
static SharedSurface_EGLImage* Cast(SharedSurface* surf) { MOZ_ASSERT(surf->mType == SharedSurfaceType::EGLImageShare); return (SharedSurface_EGLImage*)surf; }Possibly these were changes I made myself at some point in the (now distant!) past while performing this update.
The other change I've had to make is to the SharedSurface_EGLImage::Create() and SurfaceFactory_EGLImage::Create() methods. I've changed their implementations slightly and redirected them to use the new (old?) constructors.
With all of these changes in place compilation the partial build now goes through. I've linked the partial build, copied it over to my phone and manually copied it to the correct directory. Now to see whether it's had any effect.
$ harbour-webview [D] unknown:0 - QML debugging is enabled. Only use this in a safe environment. [D] main:30 - WebView Example [D] main:47 - Opening webview [D] unknown:0 - Using Wayland-EGL [...] Created LOG for EmbedLiteLayerManager =============== Preparing offscreen rendering context =============== CONSOLE message: OpenGL compositor Initialized Succesfully. [...] Frame script: embedhelper.js loaded CONSOLE message: [JavaScript Warning: "This page uses the non standard property “zoom”. Consider using calc() in the relevant property values, or using “transform” along with “transform-origin: 0 0”." {file: "https://sailfishos.org/" line: 0}] CONSOLE message: [JavaScript Warning: "Layout was forced before the page was fully loaded. If stylesheets are not yet loaded this may cause a flash of unstyled content." {file: "https://sailfishos.org/wp-includes/js/jquery/ jquery.min.js?ver=3.5.1" line: 2}]The good news is the WebView test app is no longer crashing. It stays running and even responds to touch input. But it's not rendering. The screen is just showing a completely white page. This is definitely good progress though. It means that tomorrow I can dive back in to the debugger to compare execution with ESR 78, see where they're diverging and hopefully gradually get them to align closer until the rendering works.
I'm afraid to say, there's still a long journey ahead of us. But we are still, slowly but surely, moving forwards.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
9 Mar 2024 : Day 180 #
I'm picking up where I left off, reintroducing code around the SharedSurface_Basic class. I've added back in the missing methods and code that ESR 78 includes as part of SharedSurface_Basic:
The good news is that with SharedSurface_Basic::Create() being a static function, it'll need to be fully qualified (in other words, prefixed with the class name) which makes it a lot easier to search the code for. On ESR 78 we have these cases:
I've now changed the ESR 91 code so that it uses the same Create() override as the single version available in ESR 78. I'm really hoping to match up this particular set of functionality as closely as possible.
Finally I need to do a similar check for the SharedSurface_Basic::Wrap() method. This is also static, which again helps with discovery.
Since I've only just now added the Wrap() method back in to ESR 91 there's no point doing a search there. All we'll find is the new code I just added. But let's do it anyway for completeness.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
- Create(). There's already an existing Create() method, but this new implementation provides a new signature.
- Wrap(). This was missing from the ESR 91 version. It feels like we'll need it
- Cast(). Similarly this might turn out to be needed.
- SharedSurface_Basic(). Again, there was already a constructor but this one provides some new features.
- ~SharedSurface_Basic(). In ESR 91 the destructor had been removed.
- LockProdImpl(). This is an override; it's not clear that it's needed.
- UnlockProdImpl(). Similarly here.
- ProducerAcquireImpl(). And here.
- ProducerReleaseImpl(). And also here.
- ProdTexture(). However this override we need; the fact this was missing from ESR 91 is the reason we saw a segfault.
- const GLuint mTex.
- const bool mOwnsTex.
- GLuint mFB.
The good news is that with SharedSurface_Basic::Create() being a static function, it'll need to be fully qualified (in other words, prefixed with the class name) which makes it a lot easier to search the code for. On ESR 78 we have these cases:
$ grep -rIn "SharedSurface_Basic::Create(" * gecko-dev/gfx/gl/SharedSurfaceGL.h:82: return SharedSurface_Basic::Create(mGL, mFormats, size, hasAlpha); gecko-dev/gfx/gl/SharedSurfaceGL.cpp:22: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create( gecko-dev/gfx/gl/SharedSurface.cpp:41: tempSurf = SharedSurface_Basic::Create(gl, factory->mFormats, src->mSize,Comparing that to the instances in ESR 91, we can see they're surprisingly similar:
$ grep -rIn "SharedSurface_Basic::Create(" * gecko-dev/gfx/gl/SharedSurfaceGL.h:72: return SharedSurface_Basic::Create(desc); gecko-dev/gfx/gl/SharedSurfaceGL.cpp:21: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create( gecko-dev/gfx/gl/SharedSurfaceGL.cpp:57: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Create( gecko-dev/gfx/gl/SharedSurface.cpp:66: tempSurf = SharedSurface_Basic::Create({{gl, SharedSurfaceType::Basic,On ESR 91 there are two different versions of SharedSurface_Basic::Create() because we just added a new one. But there are also differences in the way the method is called and I'd like to fix that so that the ESR 91 code better aligns with the ESR 78 code.
I've now changed the ESR 91 code so that it uses the same Create() override as the single version available in ESR 78. I'm really hoping to match up this particular set of functionality as closely as possible.
Finally I need to do a similar check for the SharedSurface_Basic::Wrap() method. This is also static, which again helps with discovery.
$ grep -rIn "SharedSurface_Basic::Wrap(" * gecko-dev/gfx/gl/SharedSurfaceGL.cpp:44: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl, $ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.h 38: static UniquePtr<SharedSurface_Basic> Wrap(GLContext* gl, $ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.cpp 44: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,The Wrap() method isn't an override, so this should be enough to demonstrate that the method isn't actually being used at all in ESR 78. So there should be no need to worry too much about it in ESR 91. And in fact once things are working it should almost certainly be removed. But I want to get a working executable before worrying about that.
Since I've only just now added the Wrap() method back in to ESR 91 there's no point doing a search there. All we'll find is the new code I just added. But let's do it anyway for completeness.
$ grep -rIn "SharedSurface_Basic::Wrap(" * gecko-dev/gfx/gl/SharedSurfaceGL.cpp:79: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl, $ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.h 26: static UniquePtr<SharedSurface_Basic> Wrap(GLContext* gl, $ grep -rIn "Wrap(" gecko-dev/gfx/gl/SharedSurfaceGL.cpp 79: UniquePtr<SharedSurface_Basic> SharedSurface_Basic::Wrap(GLContext* gl,Having made these changes the partial builds have now all gone through, so it's time to set off the full build so I can test the executable again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
8 Mar 2024 : Day 179 #
The build started yesterday has now completed; let's not waste any time and get straight to testing it out.
One thing that we are able to work with is the differences between the ESR 78 code (working) and the ESR 91 code (broken). And they are different. In ESR 78 the SharedSurfaceTextureData constructor at the top of the stack looks like this:
I could fire up my second development phone and place a breakpoint on the SharedSurfaceTextureClient constructor to compare, but I'm on the train and one laptop and two phones is already leaving me cramped. A laptop and three phones would crowd me out entirely. So let's find out why surf is null by looking through the ESR 91 code instead.
The odd thing is that the parent method has plenty of checks for it not being null:
So accessing this value that's been optimised out is actually more difficult than I'd thought. I can't just go up (down) a stack frame and check it there after all.
The solution will be to place a breakpoint on SharedSurfaceTextureClient::Create() and inspect the value before it's moved. Let's try that out.
And now suddenly it's hit me. There's something very wrong with this ordering:
I'm going to attempt to run the library generated from the partial build. Unfortunately the partial builds mess up the symbol references so it's not always possible to debug with them. I've also stripping them of debug symbols to make uploading them quicker, so even if they don't get messed up, I still can't use them. But running it may nevertheless help to find out whether this fix has made any difference at all.
[...]
I'm home now. After digging around in the code a bit and comparing with the execution of ESR 78, the reason for the crash has become clear. At the top of the backtrace is SharedSurface::ProdTexture(). But this method is designed to crash; it looks like this:
For the same reason, when I place a breakpoint on SharedSurface::ProdTexture() in the ESR 78 version of the code it doesn't get hit. On the other hand, when I place a breakpoint on ReadBuffer::Create() which is further down the stack trace that does get hit. After which if we place a breakpoint on all instances of ProdTexture() we do get a hit, but it's from SharedSurface_Basic::ProdTexture() rather than from SharedSurface:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
$ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Created LOG for EmbedLiteLayerManager [New LWP 4065] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 4059] 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 283 obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb) bt #0 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ed81af3c0, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 #1 0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient:: Create (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/ cxxalloc.h:33 #2 0x0000007ff111e038 in mozilla::gl::SurfaceFactory::NewTexClient (this=0x7ed80043d0, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:289 #3 0x0000007ff1107088 in mozilla::gl::GLScreenBuffer::Resize (this=0x5555643e90, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #4 0x0000007ff1131c04 in mozilla::gl::GLContext::CreateScreenBufferImpl (this=this@entry=0x7ed819ee40, size=..., caps=...) at gfx/gl/GLContext.cpp:2150 #5 0x0000007ff1131cc8 in mozilla::gl::GLContext::CreateScreenBuffer (caps=..., size=..., this=0x7ed819ee40) at gfx/gl/GLContext.h:3555 #6 mozilla::gl::GLContext::InitOffscreen (this=this@entry=0x7ed819ee40, size=..., caps=...) at gfx/gl/GLContext.cpp:2398 [...] #29 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)This is progress: the application got a bit further today. But just from reading through the backtrace I'm not entirely certain what's going on here. The issues — and their locations — are being obscured by the various pointer wrappers (UniquePtr and RefPtr) in use. I'm finding it hard to get a purchase. The first actually useful location is line 2150 of GLContext.cpp but that's way down in frame 4.
One thing that we are able to work with is the differences between the ESR 78 code (working) and the ESR 91 code (broken). And they are different. In ESR 78 the SharedSurfaceTextureData constructor at the top of the stack looks like this:
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)) {}Whereas in ESR 91 I appear to have added some additional initialisation steps:
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)), mDesc(), mFormat(), mSize(surf->mDesc.size) { }One possibility is that the value of surf is null. This wouldn't necessarily cause a problem until we try to read the mDesc entry while setting mSize. I'm not able to extract the value of surf directly as the debugger informs me it's been "optimized out". But if I go up (down?) a stack frame I can seer what was passed in for its value.
(gdb) p surf $3 = <optimized out> (gdb) frame 1 #1 0x0000007ff125d210 in mozilla::layers::SharedSurfaceTextureClient::Create (surf=..., factory=factory@entry=0x7ed80043d0, aAllocator=0x0, aFlags=<optimized out>) at obj-build-mer-qt-xr/dist/include/mozilla/ cxxalloc.h:33 33 obj-build-mer-qt-xr/dist/include/mozilla/cxxalloc.h: No such file or directory. (gdb) p surf $4 = { mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*, mozilla::DefaultDelete<mozilla::gl::SharedSurface>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>}, mFirstA = 0x0}, <No data fields>}} (gdb) p surf.mTuple.mFirstA $5 = (mozilla::gl::SharedSurface *) 0x0 (gdb)So the fact that surf is null provides us with an immediate reason for the crash, but it leaves us with a question: should the value be null and we shouldn't be attempting to access its contents, or should it be a non-null value going in to this method?
I could fire up my second development phone and place a breakpoint on the SharedSurfaceTextureClient constructor to compare, but I'm on the train and one laptop and two phones is already leaving me cramped. A laptop and three phones would crowd me out entirely. So let's find out why surf is null by looking through the ESR 91 code instead.
The odd thing is that the parent method has plenty of checks for it not being null:
already_AddRefed<SharedSurfaceTextureClient> SharedSurfaceTextureClient::Create( UniquePtr<gl::SharedSurface> surf, gl::SurfaceFactory* factory, LayersIPCChannel* aAllocator, TextureFlags aFlags) { if (!surf) { return nullptr; } TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(); SharedSurfaceTextureData* data = new SharedSurfaceTextureData(std::move(surf)); return MakeAndAddRef<SharedSurfaceTextureClient>(data, flags, aAllocator); }So the value going in isn't null. And now that I look at it carefully, I see that I have this all wrong: the surf variable isn't an instance of SharedSurface, it's a UniquePtr wrapping an instance of SharedSurface. When the value inside the unique pointer is moved, the value inside the unique pointer that it's coming from gets set to zero. That's the whole point of unique pointers.
So accessing this value that's been optimised out is actually more difficult than I'd thought. I can't just go up (down) a stack frame and check it there after all.
The solution will be to place a breakpoint on SharedSurfaceTextureClient::Create() and inspect the value before it's moved. Let's try that out.
(gdb) b SharedSurfaceTextureClient::Create Breakpoint 1 at 0x7ff125d1a4: file gfx/layers/client/ TextureClientSharedSurface.cpp, line 113. (gdb) r The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" hit Breakpoint 1, mozilla::layers:: SharedSurfaceTextureClient::Create (surf=..., factory=factory@entry=0x7ee0004310, aAllocator= 0x0, aFlags=mozilla::layers::TextureFlags::ORIGIN_BOTTOM_LEFT) at gfx/layers/client/TextureClientSharedSurface.cpp:113 113 LayersIPCChannel* aAllocator, TextureFlags aFlags) { (gdb) p surf $6 = { mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::SharedSurface*, mozilla::DefaultDelete<mozilla::gl::SharedSurface>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::SharedSurface>> = {<No data fields>}, mFirstA = 0x7ee01a1c00}, <No data fields>}} (gdb) p surf.mTuple.mFirstA $7 = (mozilla::gl::SharedSurface *) 0x7ee01a1c00 (gdb) p surf.mTuple.mFirstA->mDesc.size $9 = {<mozilla::gfx::BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx:: UnknownUnits> >> = {{{width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::gfx::UnknownUnits> = {<No data fields>}, <No data fields>} (gdb) p surf.mTuple.mFirstA->mDesc.size.width $10 = 1080 (gdb) p surf.mTuple.mFirstA->mDesc.size.height $11 = 2520 (gdb)This is all now looking a lot more healthy than I thought. Since it's not that the value is null going in, what else could be causing the crash here? It could be that there are multiple calls to this Create() method and this isn't the one causing the problem. But that's easy to check as well:
(gdb) n 114 if (!surf) { (gdb) 49 obj-build-mer-qt-xr/dist/include/mozilla/TypedEnumBits.h: No such file or directory. (gdb) 117 TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(); (gdb) 119 new SharedSurfaceTextureData(std::move(surf)); (gdb) Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. 0x0000007ff125d16c in mozilla::layers::SharedSurfaceTextureData:: SharedSurfaceTextureData (this=0x7ee01af300, surf=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:283 283 obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb)Stepping forwards to the next call to the SharedSurfaceTextureData constructor triggers the crash. So we're definitely in the right place and the problem isn't a null value for the surf parameter after all.
And now suddenly it's hit me. There's something very wrong with this ordering:
: mSurf(std::move(surf)), mDesc(), mFormat(), mSize(surf->mDesc.size)The sequence here is going to be:
- Move surf into mSurf. This will leave the value stored inside surf as null.
- Create mDesc and mFormat in their default state.
- Attempt to access a value from inside surf. But surf has been moved out of the unique pointer wrapper and into another one, so we can't do this.
SharedSurfaceTextureData::SharedSurfaceTextureData( UniquePtr<gl::SharedSurface> surf) : mSurf(std::move(surf)), mDesc(), mFormat(), mSize(mSurf->mDesc.size) { }There may be other reasons why this constructor causes problems later, such as having the default values for mDesc and mFormat. I'm not sure how important these are. But this fix should get us closer to finding out.
I'm going to attempt to run the library generated from the partial build. Unfortunately the partial builds mess up the symbol references so it's not always possible to debug with them. I've also stripping them of debug symbols to make uploading them quicker, so even if they don't get messed up, I still can't use them. But running it may nevertheless help to find out whether this fix has made any difference at all.
$ make -j1 -C obj-build-mer-qt-xr/gfx/layer $ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit $ strip obj-build-mer-qt-xr/toolkit/library/build/libxul.so [...] $ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \ defaultuser@172.28.172.2:~/Documents/Development/gecko/ $ ssh defaultuser@172.28.172.2 [...] $ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/ $ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 29907] 0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so (gdb) bt #0 0x0000007ff1107e00 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #1 0x0000007ff1106948 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #2 0x0000007ff1106c30 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #3 0x0000007ff1106e5c in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #4 0x0000007ff1107104 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #5 0x0000007ff1131c64 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #6 0x0000007ff1131d28 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #7 0x0000007ff1131e74 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #8 0x0000007ff11999f8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #9 0x0000007ff11aefe8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #10 0x0000007ff12c4c98 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #11 0x0000007ff12cfd14 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so [...] #25 0x0000007ff07fe8d8 in ?? () from /usr/lib64/xulrunner-qt5-91.9.1/libxul.so #26 0x0000007feca3c9f0 in ?? () from /usr/lib64/libnspr4.so #27 0x0000007fefd00a4c in ?? () from /lib64/libpthread.so.0 #28 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)Clearly that's not as helpful as we might have hoped. Maybe it's worth a try without stripping out the debug symbols.
$ make -j16 -C `pwd`/obj-build-mer-qt-xr/toolkit [...] $ scp obj-build-mer-qt-xr/toolkit/library/build/libxul.so \ defaultuser@172.28.172.2:~/Documents/Development/gecko/The problem with using the non-stripped version is that it's a large 2.7GiB file (that's three times larger than the RPM packages, even including the debuginfo). The consequence is that it's actually taking my entire train journey to copy the file over to my phone (via my other phone using a Wifi hotspot). It's finally got there... but with only a few minutes to spare before we pull in to Cambridge station. I'm going to have to be quick!
$ ssh defaultuser@172.28.172.2 [...] $ devel-su cp libxul.so /usr/lib64//xulrunner-qt5-91.9.1/ $ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 2262] 0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gecko-dev/gfx/gl/SharedSurface.h:154 154 gecko-dev/gfx/gl/SharedSurface.h: No such file or directory. (gdb) bt #0 0x0000007ff1107e00 in mozilla::gl::SharedSurface::ProdTexture (this=<optimized out>) at gecko-dev/gfx/gl/SharedSurface.h:154 #1 0x0000007ff1106948 in mozilla::gl::ReadBuffer::Create (gl=0x7ed819ee40, caps=..., formats=..., surf=surf@entry=0x7ed81a1cc0) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:653 #2 0x0000007ff1106c30 in mozilla::gl::GLScreenBuffer::CreateRead (this=this@entry=0x55556434d0, surf=surf@entry=0x7ed81a1cc0) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:584 #3 0x0000007ff1106e5c in mozilla::gl::GLScreenBuffer::Attach (this=this@entry=0x55556434d0, surf=0x7ed81a1cc0, size=...) at gecko-dev/gfx/gl/GLScreenBuffer.cpp:488 #4 0x0000007ff1107104 in mozilla::gl::GLScreenBuffer::Resize (this=0x55556434d0, size=...) at obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #5 0x0000007ff1131c64 in mozilla::gl::GLContext::CreateScreenBufferImpl (this=this@entry=0x7ed819ee40, size=..., caps=...) at gecko-dev/gfx/gl/GLContext.cpp:2150 #6 0x0000007ff1131d28 in mozilla::gl::GLContext::CreateScreenBuffer (caps=..., size=..., this=0x7ed819ee40) at gecko-dev/gfx/gl/GLContext.h:3555 #7 mozilla::gl::GLContext::InitOffscreen (this=this@entry=0x7ed819ee40, size=..., caps=...) at gecko-dev/gfx/gl/GLContext.cpp:2398 #8 0x0000007ff1131e74 in mozilla::gl::GLContextProviderEGL::CreateOffscreen (size=..., minCaps=..., flags=flags@entry=mozilla::gl::CreateContextFlags:: REQUIRE_COMPAT_PROFILE, out_failureId=out_failureId@entry=0x7f1778a1c8) at gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1308 [...] #30 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb)This is a more interesting backtrace! And it's certainly different from the one we had yesterday. But now we really are pulling in to the station so it's time to close up my laptop and prepare for the next stage of my journey home. When I get back, I'll dig in to this crash further.
[...]
I'm home now. After digging around in the code a bit and comparing with the execution of ESR 78, the reason for the crash has become clear. At the top of the backtrace is SharedSurface::ProdTexture(). But this method is designed to crash; it looks like this:
virtual GLuint ProdTexture() { MOZ_ASSERT(mAttachType == AttachmentType::GLTexture); MOZ_CRASH("GFX: Did you forget to override this function?"); }As you can see from the text passed to the crash macro, this function is never supposed to be called. It's supposed to be overridden by something else in a class that inherits from SharedSurface.
For the same reason, when I place a breakpoint on SharedSurface::ProdTexture() in the ESR 78 version of the code it doesn't get hit. On the other hand, when I place a breakpoint on ReadBuffer::Create() which is further down the stack trace that does get hit. After which if we place a breakpoint on all instances of ProdTexture() we do get a hit, but it's from SharedSurface_Basic::ProdTexture() rather than from SharedSurface:
(gdb) b ReadBuffer::Create Breakpoint 2 at 0x7fb8e66720: file gfx/gl/GLScreenBuffer.cpp, line 501. (gdb) r [...] Thread 36 "Compositor" hit Breakpoint 2, mozilla::gl::ReadBuffer::Create (gl=0x7eac109130, caps=..., formats=..., surf=surf@entry=0x7eac10a6e0) at gfx/gl/GLScreenBuffer.cpp:501 501 SharedSurface* surf) { (gdb) p surf $1 = (mozilla::gl::SharedSurface *) 0x7eac10a6e0 (gdb) p surf->mAttachType $2 = mozilla::gl::AttachmentType::GLTexture (gdb) b ProdTexture Breakpoint 3 at 0x7fb83835d0: ProdTexture. (5 locations) (gdb) c Continuing. Thread 36 "Compositor" hit Breakpoint 3, mozilla::gl::SharedSurface_Basic:: ProdTexture (this=0x7eac10a6e0) at gfx/gl/SharedSurfaceGL.h:65 65 virtual GLuint ProdTexture() override { return mTex; } (gdb)The SharedSurface_Basic class is defined in the SharedSurfaceGL.h file and in ESR 78 it does override the ProdTexture() method:
// For readback and bootstrapping: class SharedSurface_Basic : public SharedSurface { [...] virtual GLuint ProdTexture() override { return mTex; } [...] };However in the ESR 91 code there's no such override. I should have added one in, but the need hadn't made it on to my radar. So I'll make this change now. Unfortunately there are a collection of cascading changes that make this a slightly larger job than I'd hoped. Nothing crazy, but too much to do tonight, so I'll have to leave it until the morning.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
7 Mar 2024 : Day 178 #
Good morning! Well, it is for me at least. By the time this gets posted it'll be the evening. But for me right now it's early and when I check in on my laptop I can see that the build has completed.
We'll take a look at the build presently, but before we do, thank you to Adam Pigg (piggz) for commenting on my meander into virtual methods yesterday. As Adam rightly pointed out, my C++ code was missing some accoutrements that he'd have preferred to see included.
Adam is right of course, not only are these sensible ways to improve the code, but the use of override would also have improved clarity. Adam has kindly provide his own, improved, version of the code which I'll share below. He then went on to make a few, perhaps more controversial, suggestions"
I'll leave it to the reader to judge whether these would actually be improvements or not. Here's Adam's updated version of the code. One thing to note is that support for C++23's std::print requires at least GCC 14 or Clang 18 if you want to compile this at home, but Adam has confirmed it all works as expected using the online Godbolt Compiler Explorer.
The latest Gecko build incorporates all of the GLScreenBuffer code that I've been adding in and following changes made yesterday should also now no longer make use of the SwapChain class. I've copied the packages over to my development phone, installed them and now it's time to test them.
Looking through the ESR 78 code there's only one place I can discern that sets the mScreen variable and that's this one:
Let's continue digging backwards and find out where InitOffscreen() gets called. It turns out it's called in quite a few places:
We're building up a pretty clear path for how mScreen ends up initialised in ESR 78. The CreateContext() method is the first time we've reached something on this path which also exists in ESR 91. So this would seem to be a good place to switch back to ESR 91 and try to reconstruct the path in the opposite direction.
Just to make sure we're not being led in the wrong direction, it's also worth checking that CreateContext() is actually being called in ESR 91 and that this is happening before the segmentation fault causes the application to crash.
So the next step is to add the CreateOffscreen() code to GLContextProviderEGL. There's code in ESR 78 to copy over, so that part's straightforward. However various interfaces it makes use of have been changed or are missing. I've had to make some quite heavy, but nevertheless justifiable, changes to the code. For example the GLLibraryEGL::EnsureInitialized() method has been replaced by GLLibraryEGL::Init(). That's more than just a name change: the former can be safely called multiple times, whereas the latter can only be called once (or so it appears). Consequently I've removed the call entirely, on the assumption that the Init() method is being called somewhere else. We'll find out whether that's true or not when we come to execute the program.
Nevertheless the partial builds now compile, so it's time to run the full build again, which will take more hours than there are left in the day. So we'll return to this again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
We'll take a look at the build presently, but before we do, thank you to Adam Pigg (piggz) for commenting on my meander into virtual methods yesterday. As Adam rightly pointed out, my C++ code was missing some accoutrements that he'd have preferred to see included.
Sorry, but i have to deduct points for using raw pointers and not smart pointers, and missing the override keyword ;)
Adam is right of course, not only are these sensible ways to improve the code, but the use of override would also have improved clarity. Adam has kindly provide his own, improved, version of the code which I'll share below. He then went on to make a few, perhaps more controversial, suggestions"
I'd also drop using namespace, and maybe mix in some C++23, and swap cout for std::print 🙃
I'll leave it to the reader to judge whether these would actually be improvements or not. Here's Adam's updated version of the code. One thing to note is that support for C++23's std::print requires at least GCC 14 or Clang 18 if you want to compile this at home, but Adam has confirmed it all works as expected using the online Godbolt Compiler Explorer.
#includeThanks for that Adam! I'm always up for a bit of blog-based code review. Okay, now back to the Gecko changes from yesterday.#include #include class Parent { public: virtual ~Parent() {}; std::string hello() { return std::string("Hello son"); } std::string wave() { return std::string("Waves"); } virtual std::string goodbye() { return std::string("Goodbye son"); } }; class Child : public Parent { public: std::string hello() { return std::string("Hello Mum"); } std::string goodbye() override { return std::string("Goodbye Mum"); } }; int main() { std::unique_ptr parent(new Parent); std::shared_ptr child = std::make_shared (); std::shared_ptr vparent = std::dynamic_pointer_cast (child); std::println("1. {} ", parent->hello()); std::println("2. {} ", child->hello()); std::println("3. {} ", vparent->hello()); std::println("4. {} ", parent->wave()); std::println("5. {} ", child->wave()); std::println("6. {} ", vparent->wave()); std::println("7. {} ", parent->goodbye()); std::println("8. {} ", child->goodbye()); std::println("9. {} ", vparent->goodbye()); std::println("10. {} ", reinterpret_cast (parent.get())); std::println("11. {} ", reinterpret_cast (child.get())); std::println("12. {} ", reinterpret_cast (vparent.get())); return 0; }
The latest Gecko build incorporates all of the GLScreenBuffer code that I've been adding in and following changes made yesterday should also now no longer make use of the SwapChain class. I've copied the packages over to my development phone, installed them and now it's time to test them.
$ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r [...] =============== Preparing offscreen rendering context =============== [New LWP 20044] Thread 37 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 20040] 0x0000007ff110a0a8 in mozilla::gl::GLScreenBuffer::Size (this=0x0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 290 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: No such file or directory. (gdb) bt #0 0x0000007ff110a0a8 in mozilla::gl::GLScreenBuffer::Size (this=0x0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #1 mozilla::gl::GLContext::OffscreenSize (this=this@entry=0x7ed419ee40) at gfx/gl/GLContext.cpp:2141 #2 0x0000007ff3666804 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget (this=0x7fc4ba91a0, aId=...) at mobile/sailfishos/embedthread/EmbedLiteCompositorBridgeParent.cpp:156 #3 0x0000007ff12b6f50 in mozilla::layers::CompositorVsyncScheduler:: ForceComposeToTarget (this=0x7fc4cabe80, aTarget=aTarget@entry=0x0, aRect=aRect@entry=0x0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/LayersTypes.h:82 #4 0x0000007ff12b6fac in mozilla::layers::CompositorBridgeParent:: ResumeComposition (this=this@entry=0x7fc4ba91a0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #5 0x0000007ff12b7038 in mozilla::layers::CompositorBridgeParent:: ResumeCompositionAndResize (this=0x7fc4ba91a0, x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>) at gfx/layers/ipc/CompositorBridgeParent.cpp:794 #6 0x0000007ff12afbd4 in mozilla::detail::RunnableMethodArguments<int, int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151 [...] #18 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb) frame 1 #1 mozilla::gl::GLContext::OffscreenSize (this=this@entry=0x7ed419ee40) at gfx/gl/GLContext.cpp:2141 2141 return mScreen->Size(); (gdb) p mScreen $1 = { mTuple = {<mozilla::detail::CompactPairHelper<mozilla::gl::GLScreenBuffer*, mozilla::DefaultDelete<mozilla::gl::GLScreenBuffer>, (mozilla::detail::StorageType)1, (mozilla::detail::StorageType)0>> = {<mozilla::DefaultDelete<mozilla::gl::GLScreenBuffer>> = {<No data fields>}, mFirstA = 0x0}, <No data fields>}} (gdb) p mScreen.mTuple.mFirstA $2 = (mozilla::gl::GLScreenBuffer *) 0x0 (gdb)So that's an immediate crash, the reason being that the code is calling the OffScreenSize() method of the mScreen member variable of type GLScreenBuffer, but mScreen is null. That is, it's not been initialised yet.
Looking through the ESR 78 code there's only one place I can discern that sets the mScreen variable and that's this one:
bool GLContext::CreateScreenBufferImpl(const IntSize& size, const SurfaceCaps& caps) { UniquePtr<GLScreenBuffer> newScreen = GLScreenBuffer::Create(this, size, caps); [...] mScreen = std::move(newScreen); return true; }The only place this is called is here:
bool CreateScreenBuffer(const gfx::IntSize& size, const SurfaceCaps& caps) { if (!IsOffscreenSizeAllowed(size)) return false; return CreateScreenBufferImpl(size, caps); }But this is just some indirection. Clearly the method we're really interested in is CreateScreenBuffer(). This is also only called in one place:
bool GLContext::InitOffscreen(const gfx::IntSize& size, const SurfaceCaps& caps) { if (!CreateScreenBuffer(size, caps)) return false; [...] mCaps = mScreen->mCaps; MOZ_ASSERT(!mCaps.any); return true; }It's worth noticing here that soon after calling the CreateScreenBuffer() method and in the same call, the mCaps member of mScreen is being accessed. If mScreen were null at the time of this access, this would immediately trigger a segmentation fault. So clearly, by the end of the InitOffscreen() method, it's expected that mScreen should be a valid instance of GLScreenBuffer.
Let's continue digging backwards and find out where InitOffscreen() gets called. It turns out it's called in quite a few places:
$ grep -rIn "InitOffscreen(" * --include="*.cpp" gecko-dev/gfx/gl/GLContextProviderGLX.cpp:1035: if (!gl->InitOffscreen(size, minCaps)) { gecko-dev/gfx/gl/GLContextProviderWGL.cpp:532: if (!gl->InitOffscreen(size, minCaps)) return nullptr; gecko-dev/gfx/gl/GLContextProviderEGL.cpp:1443: if (!gl->InitOffscreen(size, minOffscreenCaps)) { gecko-dev/gfx/gl/GLContext.cpp:2576: bool GLContext::InitOffscreen(const gfx::IntSize& size,The only one of these context providers we care about, given it's the only one used by sailfish-browser, is GLContextProviderEGL. The instance in GLContext.cpp is the method definition so we can ignore that. So there's only one case to concern ourselves with. The call in GLContextProviderEGL occurs in the GLContextProviderEGL::CreateOffscreen() method, which is rather long so I won't list the entire contents here. Suffice it to say that this is the call we need to be finding some equivalent of in ESR 91. This itself is only called from CompositorOGL::CreateContext().
We're building up a pretty clear path for how mScreen ends up initialised in ESR 78. The CreateContext() method is the first time we've reached something on this path which also exists in ESR 91. So this would seem to be a good place to switch back to ESR 91 and try to reconstruct the path in the opposite direction.
Just to make sure we're not being led in the wrong direction, it's also worth checking that CreateContext() is actually being called in ESR 91 and that this is happening before the segmentation fault causes the application to crash.
(gdb) b CompositorOGL::CreateContext Breakpoint 1 at 0x7ff119a348: file gfx/layers/opengl/CompositorOGL.cpp, line 227. (gdb) r The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /usr/bin/harbour-webview [...] Thread 38 "Compositor" hit Breakpoint 1, mozilla::layers::CompositorOGL:: CreateContext (this=this@entry=0x7ed8002f10) at gfx/layers/opengl/CompositorOGL.cpp:227 227 already_AddRefed<mozilla::gl::GLContext> CompositorOGL::CreateContext() { (gdb)Yes, it looks to be the case. So now back to building the path. The call to CreateOffscreen() in ESR 78 has been replaced by a call to CreateHeadless() in ESR 91. This is code we've looked at before, but finally we're getting to unravel it a bit. Here's the ESR 78 version:
SurfaceCaps caps = SurfaceCaps::ForRGB(); caps.preserve = false; caps.bpp16 = gfxVars::OffscreenFormat() == SurfaceFormat::R5G6B5_UINT16; nsCString discardFailureId; context = GLContextProvider::CreateOffscreen( mSurfaceSize, caps, CreateContextFlags::REQUIRE_COMPAT_PROFILE, &discardFailureId);And here's the equivalent code, also executed from within CompositorOGL::CreateContext(), from ESR 91.
nsCString discardFailureId; context = GLContextProvider::CreateHeadless( {CreateContextFlags::REQUIRE_COMPAT_PROFILE}, &discardFailureId); if (!context->CreateOffscreenDefaultFb(mSurfaceSize)) { context = nullptr; }This gives us all the pieces we need — both mSurfaceSize and caps are available in ESR 91 — so we just need to reconstruct the call to reflect what's happening in ESR 78.
So the next step is to add the CreateOffscreen() code to GLContextProviderEGL. There's code in ESR 78 to copy over, so that part's straightforward. However various interfaces it makes use of have been changed or are missing. I've had to make some quite heavy, but nevertheless justifiable, changes to the code. For example the GLLibraryEGL::EnsureInitialized() method has been replaced by GLLibraryEGL::Init(). That's more than just a name change: the former can be safely called multiple times, whereas the latter can only be called once (or so it appears). Consequently I've removed the call entirely, on the assumption that the Init() method is being called somewhere else. We'll find out whether that's true or not when we come to execute the program.
Nevertheless the partial builds now compile, so it's time to run the full build again, which will take more hours than there are left in the day. So we'll return to this again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
6 Mar 2024 : Day 177 #
Yesterday I spent the day fixing linker errors. By the evening there was one last error that popped up at the end of the full build that looked like this:
One of the reasons I want to explain is that these vtable errors are a little cryptic. They also related to an interesting implementation detail of C++. They're telling us that there's something wrong with the SharedSurfaceTextureClient implementation, but not telling us exactly what. The line number the error refers to is in the SharedSurfaceTextureClient constructor, which is also not particularly helpful:
However, sometimes you want to change the behaviour — or at least some of it — from the parent class. That might be where you'd override some of that functionality.
Where am I going with this? Bear with me! The point is that there are also two ways that a method can be overridden. It can be statically overridden or dynamically overridden. The former case is useful when the compiler can always tell that the method to be used is the one from the child class. In particular, if you never cast a class to make it look like its parent. If you cast it to the parent, any statically overridden methods will use the parent class's implementation.
However, if you dynamically override a method, the child's implementation will be used even if you cast the class upwards. Let's take some simple example (this doesn't appear in the Gecko codebase!).
Now let's see what happens when we build and run this code.
However, there's also this peculiar vparent variable. This is the instance of Child that's been dynamically cast to look like a Parent. We can see that child and parent are actually the same object because they point to the same place (lines 11 and 12) compared to the parent object which has a different pointer (line 10).
Casing a Child to a Parent is perfectly safe because the former is inheriting from the latter. In other words, the code knows that everything a Parent has a Child will have as well. Safe.
But in the case of the dynamic cast, the calls to any overridden virtual methods are resolved at runtime rather than compile time. In particular, even though vparent is of type Parent, calling the goodbye() method on it calls the child's implementation of goodbye() (line 9).
The way this is achieved is with a vtable (this is where we're going!). This is a list of pointers to methods stored at runtime. Because goodbye() is marked as virtual in the Parent class, a pointer to this method is stored in the vtable. When the compiler calls the method it gets the address to call from the vtable, rather than from some fixed value stored at compile time.
Now when the Child dynamically overrides goodbye() it overwrites the value in the vtable. The result is that when the goodbye() class is called from vparent, the Child instance is referenced even though the class looks otherwise just like a Parent.
Finally we get to our error. The error is suggesting that there's some method that should be virtual, but no vtable has been created. This is most likely because there's a signature for a virtual method, but no implementation for any virtual method.
It's all a bit obscure, made more so because there are no virtual methods in the definition. However we do have ~SharedSurfaceTextureClient() in the class definition that doesn't have an implementation. Plus the class is inheriting from TextureData and this has a virtual destructor. As a result the ~SharedSurfaceTextureClient() destructor will also need to be added to the vtable.
But while ~SharedSurfaceTextureClient() appears in the class definition, there is no implementation of it. This is therefore likely what's causing our error.
The solution, after this very long explanation, is that we need to add in the implementation of the class destructor. Thankfully there's a destructor implementation in the ESR 78 code we can use:
The fix is to switch back from SwapChain to the previous GLScreenBuffer interface, like to:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient'I promised to spend a bit of time today explaining what was going on and how I fixed it.
One of the reasons I want to explain is that these vtable errors are a little cryptic. They also related to an interesting implementation detail of C++. They're telling us that there's something wrong with the SharedSurfaceTextureClient implementation, but not telling us exactly what. The line number the error refers to is in the SharedSurfaceTextureClient constructor, which is also not particularly helpful:
SharedSurfaceTextureClient::SharedSurfaceTextureClient( SharedSurfaceTextureData* aData, TextureFlags aFlags, LayersIPCChannel* aAllocator) : TextureClient(aData, aFlags, aAllocator) { }The vtable the error is referring to is the "virtual table" of methods that can be dynamically overridden. When you subclass a class in C++ you can override certain methods from the parent class. In other words, when the code calls the method in the subclass, you can make sure it uses a method written for the subclass, rather than using the inherited method from the parent class. Both overriding and not overriding are useful. The whole point of creating a subclass is that you re-use some functionality from the parent class, so if you don't override a method it's great that the implementation from the parent class can be used.
However, sometimes you want to change the behaviour — or at least some of it — from the parent class. That might be where you'd override some of that functionality.
Where am I going with this? Bear with me! The point is that there are also two ways that a method can be overridden. It can be statically overridden or dynamically overridden. The former case is useful when the compiler can always tell that the method to be used is the one from the child class. In particular, if you never cast a class to make it look like its parent. If you cast it to the parent, any statically overridden methods will use the parent class's implementation.
However, if you dynamically override a method, the child's implementation will be used even if you cast the class upwards. Let's take some simple example (this doesn't appear in the Gecko codebase!).
#include <iostream> using namespace std; class Parent { public: string hello() { return string("Hello son"); } string wave() { return string("Waves"); } virtual string goodbye() { return string("Goodbye son"); } }; class Child : public Parent { public: string hello() { return string("Hello Mum"); } string goodbye() { return string("Goodbye Mum"); } }; int main() { Parent* parent = new Parent(); Child* child = new Child(); Parent* vparent = dynamic_cast<Parent*>(child); cout << "1. " << parent->hello() << "\n"; cout << "2. " << child->hello() << "\n"; cout << "3. " << vparent->hello() << "\n"; cout << "4. " << parent->wave() << "\n"; cout << "5. " << child->wave() << "\n"; cout << "6. " << vparent->wave() << "\n"; cout << "7. " << parent->goodbye() << "\n"; cout << "8. " << child->goodbye() << "\n"; cout << "9. " << vparent->goodbye() << "\n"; cout << "10. " << parent << "\n"; cout << "11. " << child << "\n"; cout << "12. " << vparent << "\n"; delete child; delete parent; return 0; }Here we can see from the class definition that the Child class is inheriting from the Parent class. The Child class doesn't have any implementation of the wave() method because we're expecting it to be inherited from the Parent. The Child class statically overrides its parent's hello() implementation but dynamically overrides its parent's goodbye() method. We can see this because of the virtual keyword that's added before the name of the method. Note that it's actually specified on the parent's goodbye() method. It's parents that decide whether their children can inherit methods virtually or not.
Now let's see what happens when we build and run this code.
$ g++ main.cpp $ ./a.out 1. Hello son 2. Hello Mum 3. Hello son 4. Waves 5. Waves 6. Waves 7. Goodbye son 8. Goodbye Mum 9. Goodbye Mum 10. 0x563a9f694eb0 11. 0x563a9f694ed0 12. 0x563a9f694ed0Notice how the Child class is still able to successfully call the wave() method (line 5). When the child class calls hello() and goodbye() it calls its own implementations of these methods.
However, there's also this peculiar vparent variable. This is the instance of Child that's been dynamically cast to look like a Parent. We can see that child and parent are actually the same object because they point to the same place (lines 11 and 12) compared to the parent object which has a different pointer (line 10).
Casing a Child to a Parent is perfectly safe because the former is inheriting from the latter. In other words, the code knows that everything a Parent has a Child will have as well. Safe.
But in the case of the dynamic cast, the calls to any overridden virtual methods are resolved at runtime rather than compile time. In particular, even though vparent is of type Parent, calling the goodbye() method on it calls the child's implementation of goodbye() (line 9).
The way this is achieved is with a vtable (this is where we're going!). This is a list of pointers to methods stored at runtime. Because goodbye() is marked as virtual in the Parent class, a pointer to this method is stored in the vtable. When the compiler calls the method it gets the address to call from the vtable, rather than from some fixed value stored at compile time.
Now when the Child dynamically overrides goodbye() it overwrites the value in the vtable. The result is that when the goodbye() class is called from vparent, the Child instance is referenced even though the class looks otherwise just like a Parent.
Finally we get to our error. The error is suggesting that there's some method that should be virtual, but no vtable has been created. This is most likely because there's a signature for a virtual method, but no implementation for any virtual method.
It's all a bit obscure, made more so because there are no virtual methods in the definition. However we do have ~SharedSurfaceTextureClient() in the class definition that doesn't have an implementation. Plus the class is inheriting from TextureData and this has a virtual destructor. As a result the ~SharedSurfaceTextureClient() destructor will also need to be added to the vtable.
But while ~SharedSurfaceTextureClient() appears in the class definition, there is no implementation of it. This is therefore likely what's causing our error.
The solution, after this very long explanation, is that we need to add in the implementation of the class destructor. Thankfully there's a destructor implementation in the ESR 78 code we can use:
SharedSurfaceTextureClient::~SharedSurfaceTextureClient() { // XXX - Things break when using the proper destruction handshake with // SharedSurfaceTextureData because the TextureData outlives its gl // context. Having a strong reference to the gl context creates a cycle. // This needs to be fixed in a better way, though, because deleting // the TextureData here can race with the compositor and cause flashing. TextureData* data = mData; mData = nullptr; Destroy(); if (data) { // Destroy mData right away without doing the proper deallocation handshake, // because SharedSurface depends on things that may not outlive the // texture's destructor so we can't wait until we know the compositor isn't // using the texture anymore. It goes without saying that this is really bad // and we should fix the bugs that block doing the right thing such as bug // 1224199 sooner rather than later. delete data; } }As I mentioned yesterday, the partial builds all passed. But last night I also set off the full build again. So what was the outcome of this? The good news is that the full build went through as well, as a consequence of which I now have five shiny new xulrunner-qt5 packages to test on my phone. I'm not expecting them to work first time, but testing them is an essential step in finding out how to progress. Let's give them a go...
$ gdb harbour-webview GNU gdb (GDB) Mer (8.2.1+git9) [...] (gdb) r Starting program: /usr/bin/harbour-webview [...] =============== Preparing offscreen rendering context =============== [New LWP 29323] Thread 36 "Compositor" received signal SIGSEGV, Segmentation fault. [Switching to LWP 29318] mozilla::gl::SwapChain::OffscreenSize (this=0x0) at gfx/gl/GLScreenBuffer.cpp:141 141 return mPresenter->mBackBuffer->mFb->mSize; (gdb) bt #0 mozilla::gl::SwapChain::OffscreenSize (this=0x0) at gfx/gl/GLScreenBuffer.cpp:141 #1 0x0000007ff3666884 in mozilla::embedlite::EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget (this=0x7fc4b7c500, aId=...) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:290 #2 0x0000007ff12b6fbc in mozilla::layers::CompositorVsyncScheduler:: ForceComposeToTarget (this=0x7fc4d23b20, aTarget=aTarget@entry=0x0, aRect=aRect@entry=0x0) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/layers/ LayersTypes.h:82 #3 0x0000007ff12b7018 in mozilla::layers::CompositorBridgeParent:: ResumeComposition (this=this@entry=0x7fc4b7c500) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313 #4 0x0000007ff12b70a4 in mozilla::layers::CompositorBridgeParent:: ResumeCompositionAndResize (this=0x7fc4b7c500, x=<optimized out>, y=<optimized out>, width=<optimized out>, height=<optimized out>) at gfx/layers/ipc/CompositorBridgeParent.cpp:794 #5 0x0000007ff12afc40 in mozilla::detail::RunnableMethodArguments<int, int, int, int>::applyImpl<mozilla::layers::CompositorBridgeParent, void (mozilla::layers::CompositorBridgeParent::*)(int, int, int, int), StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, StoreCopyPassByConstLRef<int>, 0ul, 1ul, 2ul, 3ul> (args=..., m=<optimized out>, o=<optimized out>) at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151 [...] #17 0x0000007ff6a0489c in ?? () from /lib64/libc.so.6 (gdb) p this $1 = (const mozilla::gl::SwapChain * const) 0x0 (gdb)Immediately there's a segmentation fault and it feels like we've been here before. Here's where the crash is triggered:
const gfx::IntSize& SwapChain::OffscreenSize() const { return mPresenter->mBackBuffer->mFb->mSize; }It's being triggered because the instance of SwapChain this is being called from is null. But hang on; didn't we just spend days getting rid of the SwapChain code and swapping in GLScreenBuffer precisely to avoid this error? Well, yes, we did. Clearly after changing all that code I missed a case of SwapChain being used.
The fix is to switch back from SwapChain to the previous GLScreenBuffer interface, like to:
@@ -151,8 +153,7 @@ EmbedLiteCompositorBridgeParent:: CompositeToDefaultTarget(VsyncId aId) if (context->IsOffscreen()) { MutexAutoLock lock(mRenderMutex); - if (context->GetSwapChain()->OffscreenSize() != mEGLSurfaceSize - && !context->GetSwapChain()->Resize(mEGLSurfaceSize)) { + if (context->OffscreenSize() != mEGLSurfaceSize && + !context->ResizeOffscreen(mEGLSurfaceSize)) { return; } }I've also checked through the rest of the EmbedliteCompositionBridgeParent code to ensure there are no other SwapChain references in there. And, with that, we're done for the day. I've spent a long time explaining vtables today and very little time actually coding, but the change is an essential one and I can't test it without another build. So I've set it building again and with any luck I'll be able to test it again tomorrow.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
5 Mar 2024 : Day 176 #
Just as I was explaining in my diary entry yesterday, the first thing I like to do after I wake up following a night of hard work — all performed by my laptop building the latest Gecko changes of course — is to scan the output for red. Red indicates errors.
When I peeked in this morning there was no red showing on the console. But my excitement was short-lived. The exception to the "errors are red" rule is when they come from the linker rather than the compiler and that's what seems to have happened here.
I'm not sure why the linker doesn't bother highlighting errors in red, but on close inspection they definitely are errors.
The fact it compiled without error is nevertheless exciting in itself, just not as exciting as actually having a binary to test. So what are these errors? As is the way with the linker, they're all "undefined reference" or symbol-related errors. This happens when the code calls something that may, for example, have a method signature but no method definition. That can be common with pure virtual functions that aren't subsequently overridden, say, but there are other reasons too. For example I might have added a method signature simply without adding in the method body. Here's a sample of the errors that came out:
The partial builds all now pass, so it's time to do a full rebuild. Here's the updated tally of changes once again to highlight today's progress:
[...]
An early and unusually short build run (just 5 hours 12 minutes) meant that I've got a second bit of the cherry! The build finished before the end of the day. It's not a successful build unfortunately, but by discovering this now rather than in the morning I'm able to claim an entire day back. So I'm very happy about that.
Here are the linker errors. Well, actually, I think it's just one error that's spread over many lines:
With these latest changes the partial builds still all pass. So it's time to set the full build off again. But this also means it's the end of my gecko development for the day. More tomorrow when we'll see how things have gone.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
When I peeked in this morning there was no red showing on the console. But my excitement was short-lived. The exception to the "errors are red" rule is when they come from the linker rather than the compiler and that's what seems to have happened here.
I'm not sure why the linker doesn't bother highlighting errors in red, but on close inspection they definitely are errors.
The fact it compiled without error is nevertheless exciting in itself, just not as exciting as actually having a binary to test. So what are these errors? As is the way with the linker, they're all "undefined reference" or symbol-related errors. This happens when the code calls something that may, for example, have a method signature but no method definition. That can be common with pure virtual functions that aren't subsequently overridden, say, but there are other reasons too. For example I might have added a method signature simply without adding in the method body. Here's a sample of the errors that came out:
403:37.88 toolkit/library/build/libxul.so 410:16.07 aarch64-meego-linux-gnu-ld: ../../../gfx/gl/GLScreenBuffer.o: in function `mozilla::gl::GLScreenBuffer::~GLScreenBuffer()': 410:16.07 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:296: undefined reference to `mozilla::gl::SurfaceCaps::~SurfaceCaps()' 410:16.07 aarch64-meego-linux-gnu-ld: /jol403:37.88 toolkit/library/build/libxul.so 410:16.07 aarch64-meego-linux-gnu-ld: ../../../gfx/gl/GLScreenBuffer.o: in function `mozilla::gl::GLScreenBuffer::~GLScreenBuffer()': 410:16.07 ${PROJECT}/gecko-dev/gfx/gl/GLScreenBuffer.cpp:296: undefined reference to `mozilla::gl::SurfaceCaps::~SurfaceCaps()' 410:16.07 aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/gfx/gl/ GLScreenBuffer.cpp:296: undefined reference to `mozilla::gl::SurfaceCaps::~SurfaceCaps()' 410:16.07 aarch64-meego-linux-gnu-ld: ../../../gfx/gl/Unified_cpp_gfx_gl0.o: in function `mozilla::gl::GLContext::GLContext(mozilla::gl::GLContextDesc const&, mozilla::gl::GLContext*, bool)': 410:16.07 ${PROJECT}/gecko-dev/gfx/gl/GLContext.cpp:290: undefined reference to `mozilla::gl::SurfaceCaps::SurfaceCaps()' [...] 410:16.10 aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol `_ZNK7mozilla2gl9GLContext15ChooseGLFormatsERKNS0_11SurfaceCapsE' isn't defined 410:16.10 aarch64-meego-linux-gnu-ld: final link failed: bad value 410:16.10 collect2: error: ld returned 1 exit statusLet me summarise all of the errors we have here and make things a bit clearer. The following are all the error locations followed by the names of the missing implementations.
GLScreenBuffer.cpp:296: SurfaceCaps::~SurfaceCaps() GLContext.cpp:290: SurfaceCaps::SurfaceCaps() SharedSurface.cpp:362: SurfaceCaps::SurfaceCaps(mozilla::gl::SurfaceCaps const&) SharedSurface.cpp:361: GLContext::ChooseGLFormats( mozilla::gl::SurfaceCaps const&) const SharedSurface.cpp:362: SurfaceCaps::SurfaceCaps() SharedSurface.cpp:338: SurfaceCaps::SurfaceCaps() SurfaceTypes.h:38: SurfaceCaps::SurfaceCaps() SurfaceTypes.h:38: SurfaceCaps::operator=(mozilla::gl::SurfaceCaps const&) SurfaceTypes.h:38: SurfaceCaps::~SurfaceCaps() SharedSurface.cpp:350: SurfaceCaps::operator=(mozilla::gl::SurfaceCaps const&) SharedSurface.cpp:338: SurfaceCaps::~SurfaceCaps() SharedSurface.cpp:367: SurfaceCaps::~SurfaceCaps() GLContext.cpp:296: SurfaceCaps::~SurfaceCaps() SharedSurfaceGL.cpp:39: SurfaceCaps::SurfaceCaps() SharedSurfaceGL.cpp:39: SurfaceCaps::~SurfaceCaps() RefPtr.h:590: SharedSurfaceTextureClient::SharedSurfaceTextureClient( mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags, mozilla::layers::LayersIPCChannel*)So, what's going on with these? Let's take a look. First up we have line 296 of GLScreenBuffer.cpp. This line, it turns out, is where the GLScreenBuffer destructor implementation starts in the code:
GLScreenBuffer::~GLScreenBuffer() { mFactory = nullptr; mRead = nullptr; if (!mBack) return; // Detach mBack cleanly. mBack->Surf()->ProducerRelease(); }Any instance of GLScreenBuffer will also hold an instance of SurfaceCaps as we can see in the header:
public: const SurfaceCaps mCaps;This is an actual instance of SurfaceCaps not just a pointer to it, so when the GLScreenBuffer instance is destroyed mCaps will be too, triggering a call to the SurfaceCaps destructor. If we look in the SurfaceTypes.h header file we can see that there is a destructor specified:
struct SurfaceCaps final { [...] SurfaceCaps(); SurfaceCaps(const SurfaceCaps& other); ~SurfaceCaps(); [...]But nowhere is the body of the method defined. Compare that to the ESR 78 code where the SurfaceCaps class looks the same. The difference is that checking in the GLContect.cpp source we find these:
// These are defined out of line so that we don't need to include // ISurfaceAllocator.h in SurfaceTypes.h. SurfaceCaps::SurfaceCaps() = default; SurfaceCaps::SurfaceCaps(const SurfaceCaps& other) = default; SurfaceCaps& SurfaceCaps::operator=(const SurfaceCaps& other) = default; SurfaceCaps::~SurfaceCaps() = default;They're pretty much the simplest implementations you can get. But they are at least implementations. So with any luck adding these in to the ESR 91 code as well will solve quite a few of the linker errors. That's not everything though. After we take away the instances that these four methods deal with we're left with these two:
SharedSurface.cpp:361: GLContext::ChooseGLFormats( mozilla::gl::SurfaceCaps const&) const RefPtr.h:590: SharedSurfaceTextureClient::SharedSurfaceTextureClient( mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags, mozilla::layers::LayersIPCChannel*)Let's check SharedSurface.cpp line 361 next. The line looks like this:
mFormats(partialDesc.gl->ChooseGLFormats(caps)),If we look in the GLContext class, sure enough we can see the method signature:
// Only varies based on bpp16 and alpha. GLFormats ChooseGLFormats(const SurfaceCaps& caps) const;
We can see the same signature in the ESR 78 code. Once again though, the body of the method is defined in GLContext.cpp of ESR 78, whereas it's nowhere to be found in ESR 91. So I've copied the missing code over:
GLFormats GLContext::ChooseGLFormats(const SurfaceCaps& caps) const { GLFormats formats; // If we're on ES2 hardware and we have an explicit request for 16 bits of // color or less OR we don't support full 8-bit color, return a 4444 or 565 // format. bool bpp16 = caps.bpp16; [...]Finally we have a reference to SharedSurfaceTextureClient. The reference to line 590 of RefPtr.h is unhelpful; what we really want to know is where the RefPtr is being used. A quick search suggests it's inside GLScreenBuffer where there are a couple of cases that look like this:
bool GLScreenBuffer::Swap(const gfx::IntSize& size) { RefPtr<layers::SharedSurfaceTextureClient> newBack = mFactory->NewTexClient(size); [...]There's no definition of SharedSurfaceTextureClient::SharedSurfaceTextureClient() in the ESR 91 code, but there is in the ESR 78 code, in the TextureClientSharedSurface.cpp source:
SharedSurfaceTextureClient::SharedSurfaceTextureClient( SharedSurfaceTextureData* aData, TextureFlags aFlags, LayersIPCChannel* aAllocator) : TextureClient(aData, aFlags, aAllocator) { mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true; }I've copied that code over as well. But my reading that's all of them. So this may mean we're ready to kick off a full rebuild. Before doing so, it's worth spending a bit of extra time to check that the changes pass our three partial builds already.
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/ [...] $ make -j1 -C obj-build-mer-qt-xr/dom/canvas [...] $ make -j1 -C obj-build-mer-qt-xr/gfx/layers [...]The first two work fine, but the last generates a new error:
${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp:109:3: error: ‘mWorkaroundAnnoyingSharedSurfaceLifetimeIssues’ was not declared in this scope mWorkaroundAnnoyingSharedSurfaceLifetimeIssues = true; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~This mWorkaroundAnnoyingSharedSurfaceLifetimeIssues variable is present in ESR 78, used to decide whether or not to deallocate the TextureData when the TextureClient is destroyed. Here's the logic it's part of:
void TextureClient::Destroy() { [...] TextureData* data = mData; if (!mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { mData = nullptr; } if (data || actor) { [...] params.workAroundSharedSurfaceOwnershipIssue = mWorkaroundAnnoyingSharedSurfaceOwnershipIssues; if (mWorkaroundAnnoyingSharedSurfaceLifetimeIssues) { params.data = nullptr; } else { params.data = data; } [...] DeallocateTextureClient(params);This logic has been removed in ESR 91 and, if I'm honest, I'm comfortable leaving it this way. Here's what the equivalent ERS 91 code looks like:
void TextureClient::Destroy() { [...] TextureData* data = mData; mData = nullptr; if (data || actor) { [...] DeallocateTextureClient(params); } }As you can see, there are no references to mWorkaroundAnnoyingSharedSurfaceLifetimeIssues, or anything like it, there at all. Consequently I'm just going to remove the references to it from our new SharedSurfaceTextureClient constructor as well:
SharedSurfaceTextureClient::SharedSurfaceTextureClient( SharedSurfaceTextureData* aData, TextureFlags aFlags, LayersIPCChannel* aAllocator) : TextureClient(aData, aFlags, aAllocator) { }Maybe this will cause problems later, but my guess is that if there are going to be problems, they'll be pretty straightforward to track down with the debugger, at which point we can refer back to the ESR 78 code to restore these checks.
The partial builds all now pass, so it's time to do a full rebuild. Here's the updated tally of changes once again to highlight today's progress:
$ git diff --numstat 145 16 gfx/gl/GLContext.cpp 73 8 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 590 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 305 5 gfx/gl/SharedSurface.cpp 132 4 gfx/gl/SharedSurface.h 16 10 gfx/gl/SharedSurfaceEGL.cpp 13 3 gfx/gl/SharedSurfaceEGL.h 81 2 gfx/gl/SharedSurfaceGL.cpp 61 0 gfx/gl/SharedSurfaceGL.h 60 0 gfx/gl/SurfaceTypes.h 27 0 gfx/layers/client/TextureClientSharedSurface.cpp 25 0 gfx/layers/client/TextureClientSharedSurface.h $ git diff --shortstat 14 files changed, 1710 insertions(+), 49 deletions(-)So off the build goes again:
$ sfdk build -d --with git_workaround NOTICE: Appending changelog entries to the RPM SPEC file… Setting version: 91.9.1+git1+sailfishos.esr91.20240302180401. 2f1f19ac7d73+gecko.dev.5292b747b036 Directory walk started [...]This will likely take until the morning, at which point I'll be eagerly checking for errors once again.
[...]
An early and unusually short build run (just 5 hours 12 minutes) meant that I've got a second bit of the cherry! The build finished before the end of the day. It's not a successful build unfortunately, but by discovering this now rather than in the morning I'm able to claim an entire day back. So I'm very happy about that.
Here are the linker errors. Well, actually, I think it's just one error that's spread over many lines:
308:11.87 toolkit/library/build/libxul.so 311:51.74 aarch64-meego-linux-gnu-ld: ../../../gfx/layers/ Unified_cpp_gfx_layers6.o: in function `mozilla::layers:: SharedSurfaceTextureClient::SharedSurfaceTextureClient (mozilla::layers::SharedSurfaceTextureData*, mozilla::layers::TextureFlags, mozilla::layers::LayersIPCChannel*)': 311:51.74 ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient' 311:51.74 aarch64-meego-linux-gnu-ld: ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient' 311:51.74 aarch64-meego-linux-gnu-ld: libxul.so: hidden symbol `_ZTVN7mozilla6layers26SharedSurfaceTextureClientE' isn't defined 311:51.74 aarch64-meego-linux-gnu-ld: final link failed: bad value 311:51.75 collect2: error: ld returned 1 exit statusThe essence of error is the following:
TextureClientSharedSurface.cpp:108: undefined reference to `vtable for mozilla::layers::SharedSurfaceTextureClient'I'm going to hold of explaining what's going on here until tomorrow. This post is already rather long and I don't have the energy to go in to the details tonight. But I have added in a fix.
With these latest changes the partial builds still all pass. So it's time to set the full build off again. But this also means it's the end of my gecko development for the day. More tomorrow when we'll see how things have gone.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
4 Mar 2024 : Day 175 #
I've been really struggling to get the GLScreenBuffer code back compiling again over the last few days. This is partly because a whole raft of interfaces have changed and these changes have seeped through into many different places in the code. Often the changes are quite small, such as using a reference rather than a pointer. But other times they're more significant, such as the way member variables in SharedSurface have changed. To look into the latter in a little more detail, the previous version looked like this:
Besides all this I've also had what feels like several long days at work. The result is that when I come to work on Gecko of an evening the code swims around in front of my eyes and refuses to stay still as my mind drifts hither and thither.
Nevertheless I've been persevering with my "fix, compile, examine" routine and have continued to make some slight progress. Once again, I don't want to go into the full detail of the changes I've had to make, because it really is just reintroducing code that was removed upstream. So there's a lot of it, and I'm not even spending any time attempting to understand it.
But it has now got to the point where all three directories I've been making changes to are compiling when I do the partial builds:
Before I do that, here are the latest stats showing the changes I've made in relation to the offscreen rendering pipeline:
Alright, time to set the build off. Hopefully this will complete without errors, but I'm not betting on it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
class SharedSurface { public: [...] const SharedSurfaceType mType;Th const AttachmentType mAttachType; const WeakPtr<GLContext> mGL; const gfx::IntSize mSize; const bool mHasAlpha; const bool mCanRecycle;But the new code bundles all this up into a structure so that it now looks more like this:
struct PartialSharedSurfaceDesc { const WeakPtr<GLContext> gl; const SharedSurfaceType type; const layers::TextureType consumerType; const bool canRecycle; }; struct SharedSurfaceDesc : public PartialSharedSurfaceDesc { gfx::IntSize size = {}; }; class SharedSurface { public: [...] const SharedSurfaceDesc mDesc;Even this might seem like a small change, but it can make it really hard to match up method signatures given the ESR 78 version takes a collection of individual parameters, whilst the ESR 91 version requires just a single SharedSurfaceDesc instance.
Besides all this I've also had what feels like several long days at work. The result is that when I come to work on Gecko of an evening the code swims around in front of my eyes and refuses to stay still as my mind drifts hither and thither.
Nevertheless I've been persevering with my "fix, compile, examine" routine and have continued to make some slight progress. Once again, I don't want to go into the full detail of the changes I've had to make, because it really is just reintroducing code that was removed upstream. So there's a lot of it, and I'm not even spending any time attempting to understand it.
But it has now got to the point where all three directories I've been making changes to are compiling when I do the partial builds:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/ [...] $ make -j1 -C obj-build-mer-qt-xr/dom/canvas [...] $ make -j1 -C obj-build-mer-qt-xr/gfx/layers [...]I've been here before of course. Just because the partial builds compile without errors doesn't mean the full build will pass. So the next step for me tonight is to set the full build running so we can come back to it and check its status in the morning.
Before I do that, here are the latest stats showing the changes I've made in relation to the offscreen rendering pipeline:
$ git diff --numstat 73 16 gfx/gl/GLContext.cpp 73 8 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 590 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 305 5 gfx/gl/SharedSurface.cpp 132 4 gfx/gl/SharedSurface.h 16 10 gfx/gl/SharedSurfaceEGL.cpp 13 3 gfx/gl/SharedSurfaceEGL.h 81 2 gfx/gl/SharedSurfaceGL.cpp 61 0 gfx/gl/SharedSurfaceGL.h 60 0 gfx/gl/SurfaceTypes.h 21 0 gfx/layers/client/TextureClientSharedSurface.cpp 25 0 gfx/layers/client/TextureClientSharedSurface.h $ git diff --shortstat 14 files changed, 1632 insertions(+), 49 deletions(-)That's not a huge increase on yesterday, but is at least non-zero. This reflects the fact that as things have progressed the fixes have become smaller, but thornier. All of the main changes this time seem to have been to the GLContext code.
Alright, time to set the build off. Hopefully this will complete without errors, but I'm not betting on it.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
3 Mar 2024 : Day 174 #
As I walk, half asleep, from bedroom to kitchen for breakfast each morning I have to go past the office space where I work on my laptop. Whenever I leave a build running overnight I can't resist the urge to peak in and see how things have gone as I go past.
This morning I peaked in and saw a barrage of errors, highlighted in red, spanning back into the history of the terminal buffer. That's not quite the morning wake-up I was hoping for!
Nevertheless, as I like to tell myself, it still represents progress. And so today I have the task of fixing these fresh errors. Compile-time errors are generally easier to fix than runtime errors, so I don't mind at all really.
That was earlier today though and as I write this I'm already on the train. I've made the smart move of connecting all my devices to the same Wifi hotspot on my phone, which means no wires today. It's all surprisingly well arranged, although that hasn't prevented me from knocking my phone under the seat in front of me already once this journey.
This calls for another one of thigg's amazing images. This one feels a lot less chaotic than the last, albeit with a stalker fox in the background adding an ominous air to proceedings!
So that's me today. Before I start, let's have some stats from git so we can compare what happens now with what we get at the end of the day.
[...]
I've managed to plough through quite a few issues, almost all related to the SharedSurfaceGL and EmbedliteCompositorBridgeParent classes. Unfortunately I've not managed to get to the point where the partial builds are going through without error. That means there's no point in setting the full build to run over night. Instead, I'll have to pick up the errors where I left off in the morning.
Just to demonstrate that some progress has been made, here are the new stats for this evening.
Comment
This morning I peaked in and saw a barrage of errors, highlighted in red, spanning back into the history of the terminal buffer. That's not quite the morning wake-up I was hoping for!
Nevertheless, as I like to tell myself, it still represents progress. And so today I have the task of fixing these fresh errors. Compile-time errors are generally easier to fix than runtime errors, so I don't mind at all really.
That was earlier today though and as I write this I'm already on the train. I've made the smart move of connecting all my devices to the same Wifi hotspot on my phone, which means no wires today. It's all surprisingly well arranged, although that hasn't prevented me from knocking my phone under the seat in front of me already once this journey.
This calls for another one of thigg's amazing images. This one feels a lot less chaotic than the last, albeit with a stalker fox in the background adding an ominous air to proceedings!
So that's me today. Before I start, let's have some stats from git so we can compare what happens now with what we get at the end of the day.
$ git diff --numstat 38 4 gfx/gl/GLContext.cpp 59 7 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 590 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 305 5 gfx/gl/SharedSurface.cpp 130 3 gfx/gl/SharedSurface.h 16 10 gfx/gl/SharedSurfaceEGL.cpp 13 3 gfx/gl/SharedSurfaceEGL.h 8 2 gfx/gl/SharedSurfaceGL.cpp 3 0 gfx/gl/SharedSurfaceGL.h 60 0 gfx/gl/SurfaceTypes.h 21 0 gfx/layers/client/TextureClientSharedSurface.cpp 25 0 gfx/layers/client/TextureClientSharedSurface.h $ git diff --shortstat 14 files changed, 1450 insertions(+), 35 deletions(-)Now on to the bug fixing!
[...]
I've managed to plough through quite a few issues, almost all related to the SharedSurfaceGL and EmbedliteCompositorBridgeParent classes. Unfortunately I've not managed to get to the point where the partial builds are going through without error. That means there's no point in setting the full build to run over night. Instead, I'll have to pick up the errors where I left off in the morning.
Just to demonstrate that some progress has been made, here are the new stats for this evening.
$ git diff --numstat 58 4 gfx/gl/GLContext.cpp 70 7 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 590 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 305 5 gfx/gl/SharedSurface.cpp 131 4 gfx/gl/SharedSurface.h 16 10 gfx/gl/SharedSurfaceEGL.cpp 13 3 gfx/gl/SharedSurfaceEGL.h 72 2 gfx/gl/SharedSurfaceGL.cpp 60 0 gfx/gl/SharedSurfaceGL.h 60 0 gfx/gl/SurfaceTypes.h 21 0 gfx/layers/client/TextureClientSharedSurface.cpp 25 0 gfx/layers/client/TextureClientSharedSurface.h $ git diff --shortstat 14 files changed, 1603 insertions(+), 36 deletions(-)If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
2 Mar 2024 : Day 173 #
This morning I wake up to find the build has failed while attempting to compile dom/canvas/WebGLContext.cpp. This follows from all of the changes I've been making to try to bring back GLScreenBuffer. Yesterday I used partial builds, performed inside the scratchbox build target, to test my changes. Now that I've tried to build the whole of gecko it's failed with the following error:
It's not an ideal sign though as it suggests the changes I've made aren't self-contained. Ideally they would be. But let's run with it and focus on fixing the issue.
The good news is that I can trigger the same error back inside the scratchbox target by building the directory containing the file that's causing the error:
Performing a diff on the change we've made to SurfaceFactory_Basic() we can see how things are and how they were for comparison:
So, we need to look inside SurfaceFactory.cpp and specifically at the SurfaceFactory constructor. I fear this is going to get messy.
The mCaps, mAllocator and mFlags members only seem to get used directly in SurfaceFactory::NewTexClient(). They're public, so that doesn't mean they don't get used elsewhere, but I suspect the only other places they're used are in new code I've added for GLScreenBuffer. This won't be used by WebGLContext.
Similarly the NewTexClient() method is only called within GLScreenBuffer::Swap() and GLScreenBuffer::Resize(). The WebGLContext code doesn't have to worry or care about these as the WebGLContext code is entirely orthogonal.
In conclusion, it should be safe to set all the new parameters to some default or null values. So I've added in a new version of the SurfaceFactory_Basic constructor like this:
Okay, it's now time to set off a full build again. It's quite possible something else will fail, but running the build is the simplest way to find out. This will, of course, take a while. If it's done by the end of the day there may be time to test it this evening; let's see.
In the meantime, it's worth noting that ultimately the mCaps, mAllocator and mFlags parameters we've been discussing only seem to get used by the two different versions of GLScreenBuffer::CreateFactory(). However I can't find any code that actually calls either of these. So it's quite possible that eventually all of the code related to these parameters will be found to be redundant and can be removed. But we'll come back to that later.
[...]
Sadly this wasn't the last of the errors; the build failed again. The error is rather interesting:
So I'll need to add in the GetTextureFlags() to SharedSurface.h. It's a simple method because all of the functionality is supposed to come from it being overridden. So I've added this in to SharedSurface.h:
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
66:19.17 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h: In instantiation of ‘typename mozilla::detail::UniqueSelector<T>:: SingleObject mozilla::MakeUnique(Args&& ...) [with T = mozilla::gl:: SurfaceFactory_Basic; Args = {mozilla::gl::GLContext&}; typename mozilla::detail::UniqueSelector<T>::SingleObject = mozilla:: UniquePtr<mozilla::gl::SurfaceFactory_Basic, mozilla::DefaultDelete <mozilla::gl::SurfaceFactory_Basic> >]’: 66:19.17 ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:929:67: required from here 66:19.17 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:609:23: error: no matching function for call to ‘mozilla::gl::SurfaceFactory_Basic:: SurfaceFactory_Basic(mozilla::gl::GLContext&)’ 66:19.17 return UniquePtr<T>(new T(std::forward<Args>(aArgs)...)); 66:19.17 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 66:19.17 In file included from ${PROJECT}/gecko-dev/dom/canvas/ WebGLContext.cpp:52, 66:19.17 from Unified_cpp_dom_canvas1.cpp:119: 66:19.18 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.h:31:12: note: candidate: ‘mozilla::gl::SurfaceFactory_Basic::SurfaceFactory_Basic(mozilla::gl:: GLContext*, const mozilla::gl::SurfaceCaps&, const mozilla::layers::TextureFlags&)’ 66:19.18 explicit SurfaceFactory_Basic(GLContext* gl, 66:19.18 ^~~~~~~~~~~~~~~~~~~~ 66:19.18 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceGL.h:31:12: note: candidate expects 3 arguments, 1 providedThis can happen and, in fact, I was almost expecting it. A partial build will only builds the files inside a particular subdirectory and its children (depending on how the build scripts are structured). So if there's some other bit of code in a different branch of the directory hierarchy that uses something in this subdirectory, then it won't be compiled against the changes. If the interface changes and the a consumer of that interface is neither updated nor compiled against it, then we end up where we are now.
It's not an ideal sign though as it suggests the changes I've made aren't self-contained. Ideally they would be. But let's run with it and focus on fixing the issue.
The good news is that I can trigger the same error back inside the scratchbox target by building the directory containing the file that's causing the error:
$ make -j1 -C obj-build-mer-qt-xr/dom/canvas [...] ${PROJECT}/gecko-dev/dom/canvas/WebGLContext.cpp:929:67: required from here ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/UniquePtr.h:609:23: error: no matching function [...]This makes it much easier to test my fixes. So with all this in place I can get down to fixing the error. Here's the code causing the problem:
if (!swapChain->mFactory) { NS_WARNING("Failed to make an ideal SurfaceFactory."); swapChain->mFactory = MakeUnique<gl::SurfaceFactory_Basic>(*gl); }These lines and all the code around it have changed quite considerably since ESR 78, but I'm still able to find a portion of code in the ESR 78 codebase that looks to be doing something similar:
if (!factory) { // Absolutely must have a factory here, so create a basic one factory = MakeUnique<gl::SurfaceFactory_Basic>(gl, gl->Caps(), flags); mBackend = layers::LayersBackend::LAYERS_BASIC; }The error is caused by the fact I restored the previously removed SurfaceCaps and TextureFlags parameters to the SurfaceFactory_Basic() method. All of these end up being fed into the SurfaceFactory constructor.
Performing a diff on the change we've made to SurfaceFactory_Basic() we can see how things are and how they were for comparison:
$ git diff gfx/gl/SharedSurfaceGL.cpp [...] -SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext& gl) - : SurfaceFactory({&gl, SharedSurfaceType::Basic, - layers::TextureType::Unknown, true}) {} +SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext* gl, + const SurfaceCaps& caps, + const layers::TextureFlags& flags) + : SurfaceFactory({gl, SharedSurfaceType::Basic, + layers::TextureType::Unknown, true}, caps, nullptr, flags) {}This doesn't really help us though: the caps and flags, which are the ones causing the problem, get passed straight through. The important change is further down. The reason I'm chasing these is because I need to find sensible default values to set them to. There may not be sensible defaults, but if there are then it will allow us to create a new method that doesn't require these additional parameters, which is what we need for the code to compile.
So, we need to look inside SurfaceFactory.cpp and specifically at the SurfaceFactory constructor. I fear this is going to get messy.
$ git diff gfx/gl/SharedSurface.cpp [...] -SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc) - : mDesc(partialDesc), mMutex("SurfaceFactor::mMutex") {} [...] +SurfaceFactory::SurfaceFactory(const PartialSharedSurfaceDesc& partialDesc, + const SurfaceCaps& caps, + const RefPtr<layers::LayersIPCChannel>& allocator, + const layers::TextureFlags& flags) + : mDesc(partialDesc), + mCaps(caps), + mAllocator(allocator), + mFlags(flags), + mFormats(partialDesc.gl->ChooseGLFormats(caps)), + mMutex("SurfaceFactor::mMutex") +{ + ChooseBufferBits(caps, &mDrawCaps, &mReadCaps); +}The constructor itself is just storing the values and calling ChooseBufferBits() with some of them in order to set the mDrawCaps and mReadCaps member variables. It's unlikely that mDrawCaps or mReadCaps are being used by the WebGLContext code because I added them in as part of these changes (it's possible they're used by a method that's called in WebGLContext, but I don't believe that's the case).
The mCaps, mAllocator and mFlags members only seem to get used directly in SurfaceFactory::NewTexClient(). They're public, so that doesn't mean they don't get used elsewhere, but I suspect the only other places they're used are in new code I've added for GLScreenBuffer. This won't be used by WebGLContext.
Similarly the NewTexClient() method is only called within GLScreenBuffer::Swap() and GLScreenBuffer::Resize(). The WebGLContext code doesn't have to worry or care about these as the WebGLContext code is entirely orthogonal.
In conclusion, it should be safe to set all the new parameters to some default or null values. So I've added in a new version of the SurfaceFactory_Basic constructor like this:
SurfaceFactory_Basic::SurfaceFactory_Basic(GLContext& gl) : SurfaceFactory({&gl, SharedSurfaceType::Basic, layers::TextureType::Unknown, true}, SurfaceCaps(), nullptr, 0) {}I've run partial builds on the original source directory that contains the changes we've been looking at, as well as the directory containing the code that failed during the full build, like this:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/ [...] $ make -j1 -C obj-build-mer-qt-xr/dom/canvas [...]Both complete successfully without errors. Hooray!
Okay, it's now time to set off a full build again. It's quite possible something else will fail, but running the build is the simplest way to find out. This will, of course, take a while. If it's done by the end of the day there may be time to test it this evening; let's see.
In the meantime, it's worth noting that ultimately the mCaps, mAllocator and mFlags parameters we've been discussing only seem to get used by the two different versions of GLScreenBuffer::CreateFactory(). However I can't find any code that actually calls either of these. So it's quite possible that eventually all of the code related to these parameters will be found to be redundant and can be removed. But we'll come back to that later.
[...]
Sadly this wasn't the last of the errors; the build failed again. The error is rather interesting:
193:24.93 In file included from Unified_cpp_gfx_layers6.cpp:128: 193:24.93 ${PROJECT}/gecko-dev/gfx/layers/client/TextureClientSharedSurface.cpp: In static member function ‘static already_AddRefed<mozilla::layers:: SharedSurfaceTextureClient> mozilla::layers::SharedSurfaceTextureClient:: Create(mozilla::UniquePtr<mozilla::gl::SharedSurface>, mozilla::gl::SurfaceFactory*, mozilla::layers::LayersIPCChannel*, mozilla::layers::TextureFlags)’: 193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:102:63: error: ‘class mozilla::gl::SharedSurface’ has no member named ‘GetTextureFlags’ 193:24.94 TextureFlags flags = aFlags | TextureFlags::RECYCLE | surf->GetTextureFlags(); 193:24.94 ^~~~~~~~~~~~~~~ 193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:104:51: error: no matching function for call to ‘mozilla::layers::SharedSurfaceTextureData::SharedSurfaceTextureData( std::remove_reference<mozilla::UniquePtr<mozilla::gl::SharedSurface>&>:: type)’ 193:24.94 new SharedSurfaceTextureData(std::move(surf)); 193:24.94 ^ 193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:34:1: note: candidate: ‘mozilla::layers::SharedSurfaceTextureData::SharedSurfaceTextureData( const mozilla::layers::SurfaceDescriptor&, mozilla::gfx::SurfaceFormat, mozilla::gfx::IntSize)’ 193:24.94 SharedSurfaceTextureData::SharedSurfaceTextureData( 193:24.94 ^~~~~~~~~~~~~~~~~~~~~~~~ 193:24.94 ${PROJECT}/gecko-dev/gfx/layers/client/ TextureClientSharedSurface.cpp:34:1: note: candidate expects 3 arguments, 1 providedInteresting because this is in part of the code that I made changes to in order to get the partial build to complete, but it turns out I wasn't actually building this code at all, it was just using the header.
So I'll need to add in the GetTextureFlags() to SharedSurface.h. It's a simple method because all of the functionality is supposed to come from it being overridden. So I've added this in to SharedSurface.h:
// Specifies to the TextureClient any flags which // are required by the SharedSurface backend. virtual layers::TextureFlags GetTextureFlags() const;In addition to this I've inserted a simple implementation for it to SharedSurface.cpp:
layers::TextureFlags SharedSurface::GetTextureFlags() const { return layers::TextureFlags::NO_FLAGS; }But to actually match the functionality of ESR 78 I've also added this override, directly into SharedSurfaceEGL.h:
virtual layers::TextureFlags GetTextureFlags() const override { return layers::TextureFlags::DEALLOCATE_CLIENT; }They're all super-simple methods and with any luck shouldn't trigger any additional errors. I now have to do a partial build of the now three the directories that I've touched with changes:
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/ [...] $ make -j1 -C obj-build-mer-qt-xr/dom/canvas [...] $ make -j1 -C obj-build-mer-qt-xr/gfx/layers [...]Happily these all compile successfully and without errors. So it's time to kick off another full build again. I don't expect this to be done before I go to bed this evening, so I'll have to pick this up again tomorrow. In the morning we'll find out what new gruesome errors there are to deal with!
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
1 Mar 2024 : Day 172 #
Having failed to fix the offscreen (WebView) rendering pipeline using a scalpel, over the last few days I've resorted to using a sledgehammer. What I'm now doing is essentially reverting the changes made upstream that stripped out all of the GLScreenBuffer goodness that the Sailfish WebView relied on. Unfortunately the changes aren't included in a single neat commit, so I'm having to do this by hand, copying and pasting over the changes from ESR 78 back into ESR 91.
Ever present and willing, git is happy to give me a summary of how many changes I've made so far:
The changes to the other files are all intended to accommodate the reintroduction of GLScreenBuffer.
I could live with some big changes being made to a single file, but having to make a large number of changes to many files is really not ideal for the project. It'll contribute to the burden of maintenance and future upgrades. But my suspicion is that not all of the code in GLScreenBuffer is actually used. Rather than try to trim out the fat as I go along, my plan is to introduce everything and get the renderer to a state where it's working, then work on trimming out the unnecessary code afterwards.
Once that's done, I'll then move on to re-architecting the code to try to minimise the changes to the Gecko library itself. It may be that we can move some of the changes into the EmbedLite code, say, in a similar way to what we did with the printing changes. If that can be done, it'll make future maintenance that much easier.
That's the summary of where things are at. Now I'm heading back into the code to perform my bug fix cycle:
[...]
After several hours of going around the bug fix cycle I've finally reached the point where the partial build completes without any compiler errors. I've added in a lot of code, not always fully understanding the purpose, but nevertheless with the intention of matching the structure and purpose of the GLScreenBuffer code from ESR 78.
Here are the final stats — for comparison — after these changes.
So far I've only been running quick partial builds. To find out if things are really working correctly I'll have to do a full rebuild, which will need to run overnight. So this seems like a good time to stop for the day, ready to start again in the morning.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
Ever present and willing, git is happy to give me a summary of how many changes I've made so far:
$ git diff --shortstat 11 files changed, 1012 insertions(+), 22 deletions(-)So that's over a thousand new lines added to ESR 91. Let's break that down further to find out what's been changed:
$ git diff --numstat 38 4 gfx/gl/GLContext.cpp 59 7 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 592 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 26 0 gfx/gl/SharedSurface.cpp 14 0 gfx/gl/SharedSurface.h 15 9 gfx/gl/SharedSurfaceEGL.cpp 4 1 gfx/gl/SharedSurfaceEGL.h 60 0 gfx/gl/SurfaceTypes.h 22 0 gfx/layers/client/TextureClientSharedSurface.hAs we can see, all of the changes are to the rendering code. That's encouraging. And the majority, as expected, are re-introducing code to the GLScreenBuffer class. In fact, not just adding code into the class, but reintroducing the class in its entirety. It's this GLScreenBuffer class that was completely removed and replaced with a class called SwapChain in the upstream transition from ESR 78 to ESR 91.
The changes to the other files are all intended to accommodate the reintroduction of GLScreenBuffer.
I could live with some big changes being made to a single file, but having to make a large number of changes to many files is really not ideal for the project. It'll contribute to the burden of maintenance and future upgrades. But my suspicion is that not all of the code in GLScreenBuffer is actually used. Rather than try to trim out the fat as I go along, my plan is to introduce everything and get the renderer to a state where it's working, then work on trimming out the unnecessary code afterwards.
Once that's done, I'll then move on to re-architecting the code to try to minimise the changes to the Gecko library itself. It may be that we can move some of the changes into the EmbedLite code, say, in a similar way to what we did with the printing changes. If that can be done, it'll make future maintenance that much easier.
That's the summary of where things are at. Now I'm heading back into the code to perform my bug fix cycle:
- Build code.
- Examine compile-time errors.
- Fix the first one or two erros shown.
- Go to step 1.
[...]
After several hours of going around the bug fix cycle I've finally reached the point where the partial build completes without any compiler errors. I've added in a lot of code, not always fully understanding the purpose, but nevertheless with the intention of matching the structure and purpose of the GLScreenBuffer code from ESR 78.
Here are the final stats — for comparison — after these changes.
$ git diff --numstat 38 4 gfx/gl/GLContext.cpp 59 7 gfx/gl/GLContext.h 11 0 gfx/gl/GLContextTypes.h 590 0 gfx/gl/GLScreenBuffer.cpp 171 1 gfx/gl/GLScreenBuffer.h 301 5 gfx/gl/SharedSurface.cpp 126 3 gfx/gl/SharedSurface.h 16 10 gfx/gl/SharedSurfaceEGL.cpp 9 3 gfx/gl/SharedSurfaceEGL.h 6 4 gfx/gl/SharedSurfaceGL.cpp 3 1 gfx/gl/SharedSurfaceGL.h 60 0 gfx/gl/SurfaceTypes.h 12 0 gfx/layers/client/TextureClientSharedSurface.cpp 22 0 gfx/layers/client/TextureClientSharedSurface.h $ git diff --shortstat 14 files changed, 1424 insertions(+), 38 deletions(-)Most of the changes I've made this evening seem to have been to SharedSurface.cpp. That makes sense as it seems there were many GL/EGL related calls that had been removed.
So far I've only been running quick partial builds. To find out if things are really working correctly I'll have to do a full rebuild, which will need to run overnight. So this seems like a good time to stop for the day, ready to start again in the morning.
If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.