flypig.co.uk

List items

Items from the current list are shown below.

Blog

All items from May 2024

31 May 2024 : Day 249 #
Overnight I've been running a build. For the last few days — including last night — I've been on holiday lodging in a hotel, which has been lovely, but sadly today is the last day which means I'll be spending today travelling. It would have been great to have the build complete overnight so that I can start my edit-rebuild-test cycle while I travel. But the build failed, which means I have some more fixing to do first.

Here's the build error.
98:13.71 In file included from Unified_cpp_gfx_gl1.cpp:2:
98:13.71 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceEGL.cpp: In member function 
    ‘virtual bool mozilla::gl::SharedSurface_EGLImage::ReadPixels(GLint, GLint, 
    GLsizei, GLsizei, GLenum, GLenum, GLvoid*)’:
98:13.71 ${PROJECT}/gecko-dev/gfx/gl/SharedSurfaceEGL.cpp:186:38: error: ‘void 
    mozilla::gl::GLContext::raw_fDeleteFramebuffers(GLsizei, const GLuint*)’ is 
    private within this context
98:13.71    gl->raw_fDeleteFramebuffers(1, &fbo);
98:13.71                                       ^
The problem here is that I moved raw_fDeleteFramebuffers() into the private block of the class definition it's part of. It ought to be in the private block, but there's obviously an attempt being made in SharedSurfaceEGL.cpp to call it from another class.

My suspicion is that this is part of the debug code I added. It looks like this is being used by a ReadPixels() method I added for the purposes of extracting pixel colours:
bool SharedSurface_EGLImage::ReadPixels(GLint x, GLint y, GLsizei width, 
    GLsizei height,
                        GLenum format, GLenum type, GLvoid* pixels) {
  const auto& gl = GLContextEGL::Cast(mDesc.gl);

  // See https://stackoverflow.com/a/53993894
  GLuint fbo;
  gl->fGenFramebuffers(1, &fbo);
  gl->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, fbo);
  gl->fFramebufferTexture2D(LOCAL_GL_FRAMEBUFFER, LOCAL_GL_COLOR_ATTACHMENT0, 
    LOCAL_GL_TEXTURE_2D, mProdTex, 0);

  gl->raw_fReadPixels(x, y, width, height, format, type, pixels);

  gl->raw_fBindFramebuffer(LOCAL_GL_FRAMEBUFFER, 0);
  gl->raw_fDeleteFramebuffers(1, &fbo);

  return true;
}
There's a little nuance here though. This SharedSurface_EGLImage::ReadPixels() method is called from the GLScreenBuffer::ReadPixels() method, like so:
bool GLScreenBuffer::ReadPixels(GLint x, GLint y, GLsizei width, GLsizei height,
                                GLenum format, GLenum type, GLvoid* pixels) {
  // If the currently bound framebuffer is backed by a SharedSurface
  // then it might want to override how we read pixel data from it.
  // This is normally only the default framebuffer, but we can also
  // have SharedSurfaces bound to other framebuffers when doing
  // readback for BasicLayers.
  SharedSurface* surf;
  if (GetReadFB() == 0) {
    surf = SharedSurf();
  } else {
    surf = mGL->mFBOMapping[GetReadFB()];
  }
  if (surf) {
    return surf->ReadPixels(x, y, width, height, format, type, pixels);
  }

  return false;
}
This method exists in the GLScreenBuffer class of ESR 78 as well, so it's initially surprising that this is happy to call surf->ReadPixels() given SharedSurface_EGLImage::ReadPixels() doesn't exist.

Looking in to this more deeply, the reason this doesn't cause a problem in ESR 78 appears to be because the SharedSurface_EGLImage version of the method is an override. So if I were to remove it, the inherited version of the method would be called instead:
  virtual bool ReadPixels(GLint x, GLint y, GLsizei width, GLsizei height,
                          GLenum format, GLenum type, GLvoid* pixels) {
    return false;
  }
Making this change would allow things to compile, but it's not going to be adding a great deal of value. So I could just get rid of the SharedSurface_EGLImage version of the call and rely on this default implementation. But I can't at the moment see where the GLScreenBuffer version of the call is used so it seems to me it would be better to remove the GLScreenBuffer variant of the method as well.

Which is what I've done. If this method is being used it'll cause an error during compilation, which will slow things down, but will at least make clear where the caller lives. And if it goes through then we've managed to find a nice simplification, which is after all the purpose of this effort.

I've set the build off, but while it's running it might be a good idea to check in case something that consumes libxul.so is expecting access to this ReadPixels() method. One of the reasons I'm wondering is that there are certainly situations in which the texture buffer is read, such as when the images used for the tab previews and bookmark icons are generated. So it's possible it's a method we need to keep.

In QtMozEmbed there is something similar, but it calls GLES directly, rather than via libxul.so:
QImage gl_read_framebuffer(const QRect &rect)
{
    QSize size = rect.size();
    int x = rect.x();
    int y = rect.y();

    while (glGetError());

    QImage img(size, QImage::Format_RGB32);
    GLint fmt = GL_BGRA_EXT;
    glReadPixels(x, y, size.width(), size.height(), fmt, GL_UNSIGNED_BYTE, 
    img.bits());
    if (!glGetError())
        return img.mirrored();

    QImage rgbaImage(size, QImage::Format_RGBX8888);
    glReadPixels(x, y, size.width(), size.height(), GL_RGBA, GL_UNSIGNED_BYTE, 
    rgbaImage.bits());
    if (!glGetError())
        return rgbaImage.mirrored();
    return QImage();
}
There's also something similar in the sailfish-browser code. However, looking carefully at this shows that it's debug code I added myself. It's not needed for the final app and in fact, it's about time I removed it.
$ git diff
diff --git a/apps/core/declarativewebcontainer.cpp b/apps/core/
    declarativewebcontainer.cpp
index 60d5327c..1e2d63ce 100644
--- a/apps/core/declarativewebcontainer.cpp
+++ b/apps/core/declarativewebcontainer.cpp
@@ -694,9 +694,28 @@ void DeclarativeWebContainer::clearWindowSurface()
     QOpenGLFunctions_ES2* funcs = 
    m_context->versionFunctions<QOpenGLFunctions_ES2>();
     Q_ASSERT(funcs);
 
-    funcs->glClearColor(1.0, 1.0, 1.0, 0.0);
+    funcs->glClearColor(0.0, 1.0, 0.0, 0.0);
     funcs->glClear(GL_COLOR_BUFFER_BIT);
     m_context->swapBuffers(this);
+
+    QSize screenSize = QGuiApplication::primaryScreen()->size();
+    size_t bufferSize = screenSize.width() * screenSize.height() * 4;
+    uint8_t* buf = static_cast<uint8_t*>(calloc(sizeof(uint8_t), bufferSize));
+
+    funcs->glReadPixels(0, 0, screenSize.width(), screenSize.height(),
+                            GL_RGBA, GL_UNSIGNED_BYTE, buf);
+
+    int xpos = screenSize.width() / 2;
+    int ypos = screenSize.height() / 2;
+    int pos = xpos * ypos * 4;
+
+    volatile char red = buf[pos];
+    volatile char green = buf[pos + 1];
+    volatile char blue = buf[pos + 2];
+    volatile char alpha = buf[pos + 3];
+
+    printf(&quot;Colour: (%d, %d, %d, %d)\n&quot;, red, green, blue, alpha);
+    free(buf);
 }
 
 void DeclarativeWebContainer::dumpPages() const
Thankfully these debug changes are all neatly encapsulated and caught by git, so removing them is as simple as performing a git checkout.
$ git checkout apps/core/declarativewebcontainer.cpp
Updated 1 path from the index
$ git diff
$ 
I don't dare run a sailfish-browser build while the gecko build is already running (maybe it's safe... but I'm not certain). So I've lodged the fact I need to in my mental task queue. There were also some changes to embedlite-components and sailfish-components-webview which I've removed for the former and will need to commit for the latter. They're not related to ReadPixels() but I'm making another mental note here to rebuild and reinstall them later once the gecko build has completed.

Well, the build is continuing so that's it for today. More tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
30 May 2024 : Day 248 #
I'm tidying things up. If you've been following previous posts you'll know there was some visual celebration provided by artist-in-residence thigg when the rendering started working, but now things need to be put in order and pushed to the repository. Given this I need to get the code into a fit state, which means removing debug code, refactoring the changes, making sure everything still works correctly.

In other words, it's time to clean things up. And thigg has created a rather apt image to illustrate this process as well. It's not all glamorous you know!  
Tidying up after the party! A pig with wings in a room, detritus strewn everywhere including phones, cake half drunk glasses of unknown liquid and wires. A fox lies on the floor asleep.

My first clean-up task is a bit of an odd one. In some of the debug output I've pasted on previous days you'll have seen lines that look a bit like this:
frame014.data: Colour before: (8, 0, 152, 255), 1
frame015.data: Colour before: (1, 0, 255, 255), 1
frame016.data: Colour before: (223, 5, 0, 255), 1
frame017.data: Colour before: (71, 94, 162, 255), 1
You may recall that I added these in to test the colour at specific pixels of the texture used for offscreen rendering. I also introduced some code to export to disc images from the surface as textures. These produced some rubbish looking images which neither I nor the community ever successfully figured out (I now suspect they were just memory artefacts).

For example, I added a bunch of code into the GLScreenBuffer::Swap() method that looks like this:
    result = ReadPixels(0, 0, size.width, size.height, LOCAL_GL_RGB, 
    LOCAL_GL_UNSIGNED_BYTE, buf);
Right now this and all the related code can still be found in the source tree, not just in the ESR 91 code, but in the ESR 78 code as well. None of it is needed for a working browser and it's outlived its usefulness, so I'll need to remove it before pushing the latest changes to remote. All fine. But can I recall which specific parts are needed and which can be removed? No, no I can't.

I have a solution though. If I compare the changes with the original ESR 78 code it should become clear which bits I've added for debugging purposes and which are needed for offscreen rendering. But that means I first have to remove the debug code from ESR 78 so I have something to compare against.

Removing the code from ESR 78 is possible but it's also harder than it should be because it's now mixed in with all of the changes made by the patches. I didn't apply these to the ESR 78 as commits, but rather let the build processes apply them automatically, so they've been applied as a bunch of local changes, all mixed together.

To get things back to where I need them to be I need to reset the ESR 78 project folder to a clean state, then apply all the patches again. This will mess up the build state (meaning that in order to a build I'll have to build everything rather than just what's changed), but that's okay because in general I don't need to run ESR 78 builds very often. Let's hope it stays that way.

There are a few changes I made to the spec file that I want to keep, but everything else can be cleaned out.
$ cd gecko-dev-project/gecko-dev/gecko-dev
$ git reset --hard
$ git clean -xdf
$ cd ..
$ git stash
$ sfdk apply
$ git stash pop
The penultimate
apply
step applies the patches. After this I now have a nice clean ESR 78 source tree to compare my ESR 91 code against.

This has allowed me to remove a bunch of debug code from the GLContext.h file of ESR 91.

Now I'm wondering whether I can fold the GLScreenBuffer class into SwapChain. There's definitely some overlap, but also quite a bit of divergence too. Below is an annotated listing of the SwapChain header. Lines prefixed with a "+" symbol also appear in GLScreenBuffer. Those prefixed with a "-" appear only in SwapChain.
class SwapChain final {
  friend class SwapChainPresenter;

 public:
+  UniquePtr<SurfaceFactory> mFactory;
-  bool mPreserve = false;
-  std::shared_ptr<SharedSurface> mPrevFrontBuffer;

 private:
-  std::queue<std::shared_ptr<SharedSurface>> mPool;
-  std::shared_ptr<SharedSurface> mFrontBuffer;
-  SwapChainPresenter* mPresenter = nullptr;

 public:
+  SwapChain();
+  virtual ~SwapChain();

-  void ClearPool();
-  const auto& FrontBuffer() const { return mFrontBuffer; }
-  UniquePtr<SwapChainPresenter> Acquire(const gfx::IntSize&);

+  const gfx::IntSize& Size() const;
-  const gfx::IntSize& OffscreenSize() const;
+  bool Resize(const gfx::IntSize & size);
+  bool PublishFrame(const gfx::IntSize& size) { return Swap(size); }
+  void Morph(UniquePtr<SurfaceFactory> newFactory);

private:
+  bool Swap(const gfx::IntSize& size);
};
As well as these there are also a large number of methods in GLScreenBuffer which don't appear in SwapChain. These are less critical though, because as additions there won't be any attempt to make use of them in the existing code.

If I were to merge GLScreenBuffer into SwapChain I'd need to create an implementations for all of the lines indicated with a "-" in the class definition above. They can be dummy implementations (or make use of the existing implementations from the SwapChain class), so this process doesn't have to be complicated, but it would need to be done.

The fact that SwapChain is marked as final means that I'm not able to inherit from it. I could remove the final specifier of course, but that may not be a great idea. Some other parts of the code might be making assumptions based on it.

I think I'm going to have a go at performing this merge, but as it happens I'm not able to make these changes straight away anyway. That's because I need to run a full rebuild. I tried running a partial build and its come back complaining that the build config has changed:
$ sfdk engine exec
$ sb2 -t SailfishOS-esr78-aarch64.default
$ source `pwd`/obj-build-mer-qt-xr/rpm-shared.env
$ make -j1 -C obj-build-mer-qt-xr/gfx/gl/
make: Entering directory 'obj-build-mer-qt-xr/gfx/gl'
config/rules.mk:338: *** Build configuration changed. Build with |mach build| 
    or run |mach build-backend| to regenerate build config.  Stop.
make: Leaving directory 'obj-build-mer-qt-xr/gfx/gl'
For the next few days I'm planning to make small changes to the code, followed by rebuilds and tests to check things are working (a tight edit-rebuild-test cycle). To make the most of this I need the partial build working, which means running a full build now.

Well, by running the full build and watching it fail I have at least determined that the following changes are necessary:
diff --git a/gfx/gl/GLContext.h b/gfx/gl/GLContext.h
index 863414aaeb44..0991afbf045c 100644
--- a/gfx/gl/GLContext.h
+++ b/gfx/gl/GLContext.h
@@ -37,3 +38,2 @@
 #include &quot;GLDefs.h&quot;
-#include &quot;GLScreenBuffer.h&quot;
 #include &quot;GLTypes.h&quot;
@@ -42,2 +42,3 @@
 #include &quot;mozilla/WeakPtr.h&quot;
+#include &quot;SurfaceTypes.h&quot;
 
@@ -61,2 +62,5 @@ class GLReadTexImageHelper;
 struct SymLoadStruct;
+class SwapChain;
+class SurfaceCaps;
+class GLScreenBuffer;
 }  // namespace gl
Having restored them, the build is off again. Now it's back to the full build again; until this is done, there's not much point making any more changes to the code today.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
29 May 2024 : Day 247 #
It's a bright new morning and I'm in a good mood! I've been working on the offscreen rendering pipeline since Day 159. That's a total of 87 days. It took me 28 days to get printing working; 33 days to get the browser rending working and 45 days to get the initial build to complete. So 87 days is the longest stretch of singly-directed work so far. It amounts to approximately three and a half weeks of full-time work.

Technically it's not all there yet and my task for the next few days is clear: I need to get my commits in order. I've made a large number of changes to get this working and my plan is to split them into two commits, which will eventually become two patches on top of the upstream Gecko code.

The first commit will be an equivalent to patch 0044, reverting the upstream D5005 changeset (Bug 1445570), but in a simplified form to make future maintenance easier.

The second wil lbe a patch to reintroduced GLScreenBuffer and essentially reverse the changes made in upstream changeset D75055 (Bug 1632249).

The second patch is going to be the hardest to get into shape because I had to make a lot of changes which I supsect can be simplified quite considerably, but I'm not certain. Calling the changes a reversel of D75055 is unfortunately oversimplifying things somehwat. It has the feel of a quite carefully balance stack of Jenga blocks, where removing a single piece might result in the entire stack coming tumblng down (which, for the avoidance of doubt, is a metaphor for the rendering not working again).

Unlike Jenga I have the benefit of git which will restack the blocks to a working position in an instant if things go awry. Nevertheless I want to get the lightest and neatest structure I can: the fewer changes in the patch now, the less maintenance will be needed in future.

That's all to come. Today I'm just going to worry about the first, much simpler, patch.

Having simplified it as much as possible, what are we left with? Well, here's the summary of the changes as provided by git:
 dom/base/nsDOMWindowUtils.cpp    |   1 -
 layout/base/nsDocumentViewer.cpp |   8 ++
 layout/base/nsPresContext.cpp    | 142 ++++++++++++++++++-------------
 layout/base/nsPresContext.h      |  33 +++++++
 layout/base/nsRefreshDriver.cpp  |   6 --
 5 files changed, 125 insertions(+), 65 deletions(-)
Let's compare that to the chnages from patch 0044:
 dom/base/nsDOMWindowUtils.cpp                 |   1 -
 layout/base/nsDocumentViewer.cpp              |   8 ++
 layout/base/nsPresContext.cpp                 | 132 ++++++++++--------
 layout/base/nsPresContext.h                   |  32 ++++-
 layout/base/nsRefreshDriver.cpp               |  43 ++----
 layout/base/nsRefreshDriver.h                 |   7 +-
 .../test/test_restyles_in_smil_animation.html |   4 +-
 7 files changed, 129 insertions(+), 98 deletions(-)
Well, it's a simplification. Not a huge simplification (190 changes vs. 227 changes), but a simplification nonetheless. More importantly, it's done.

With the easy patch dealt with it's now time to turn to the complex patch. Here's the current local commit history.
$ git log --oneline -5
555957f3a8ee (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches) Restore 
    NotifyDidPaint event and timers
bc575f7ac075 Prevent errors from DownloadPrompter
06b15998d179 Enable dconf
95276b73d273 Restore GLScreenBuffer and TextureImageEGL
c6ea49286566 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches) Disable SessionStore 
    functionality
I want to rearrange these so that commit 95276b73d273 is the most recent. That particular commit is the really big change made to get the renderer back working, as we can see from the change stats:
 dom/base/nsGlobalWindowOuter.cpp              |   6 +
 gfx/gl/AndroidSurfaceTexture.cpp              |   2 +-
 gfx/gl/GLContext.cpp                          | 217 +++++-
 gfx/gl/GLContext.h                            |  93 ++-
 gfx/gl/GLContextEGL.h                         |  10 +-
 gfx/gl/GLContextProviderEGL.cpp               | 284 +++++++-
 gfx/gl/GLContextProviderImpl.h                |  26 +
 gfx/gl/GLContextTypes.h                       |  11 +
 gfx/gl/GLLibraryEGL.cpp                       |  13 +-
 gfx/gl/GLLibraryEGL.h                         |   8 +-
 gfx/gl/GLScreenBuffer.cpp                     | 646 ++++++++++++++++++
 gfx/gl/GLScreenBuffer.h                       | 172 ++++-
 gfx/gl/GLTextureImage.cpp                     |  38 +-
 gfx/gl/SharedSurface.cpp                      | 321 ++++++++-
 gfx/gl/SharedSurface.h                        | 139 +++-
 gfx/gl/SharedSurfaceEGL.cpp                   | 113 ++-
 gfx/gl/SharedSurfaceEGL.h                     |  44 +-
 gfx/gl/SharedSurfaceGL.cpp                    | 152 ++++-
 gfx/gl/SharedSurfaceGL.h                      |  98 ++-
 gfx/gl/SurfaceTypes.h                         |  60 ++
 gfx/gl/TextureImageEGL.cpp                    | 221 ++++++
 gfx/gl/TextureImageEGL.h                      |  80 +++
 gfx/gl/moz.build                              |   1 +
 gfx/layers/CompositorTypes.h                  |   1 +
 gfx/layers/client/TextureClient.cpp           |  42 +-
 gfx/layers/client/TextureClient.h             |   3 +
 .../client/TextureClientSharedSurface.cpp     |  50 ++
 .../client/TextureClientSharedSurface.h       |  25 +
 gfx/layers/ipc/CompositorVsyncScheduler.cpp   |   2 -
 gfx/layers/opengl/CompositorOGL.cpp           |  35 +-
 gfx/thebes/gfxPlatform.cpp                    |   2 +-
 31 files changed, 2776 insertions(+), 139 deletions(-)
I'm anticipating that a lot of this will remain necessary, but if it's possible to cut it down, anything that can be removed will be a bonus. First though to rearrange the patches using an interactive rebase. Thankfully none of the commits overlap with these changes, so the rebase completes without a hitch. Here's how things look now.
$ git log --oneline -5
c3c263a98d03 (HEAD -> FIREFOX_ESR_91_9_X_RELBRANCH_patches) Restore 
    GLScreenBuffer and TextureImageEGL
d3ba4df29a32 Restore NotifyDidPaint event and timers
f55057391ac0 Prevent errors from DownloadPrompter
eab04b8c0d80 Enable dconf
c6ea49286566 (origin/FIREFOX_ESR_91_9_X_RELBRANCH_patches) Disable SessionStore 
    functionality
After carefully looking through the code I'm confident it could benefit from some changes. For example I'm fairly sure I added a bunch of code just for debugging purpose, such as when I was extracting images from the textures to help with debugging at various points. I should remove all that code.

Plus I suspect a bunch of the other methods were also added simply to align with the previous ESR 78 implementation. That doesn't mean they're actually needed and, in fact, some methods may not even get used at all. It's a hazard of the incremental approach I took.

The following lists the new methods that I added over the last few months. My plan is to add breakpoints to them, run the code and see which ones are hit. Any that aren't hit will then be candidates for removal.
GLContext::GuaranteeResolve()
GLContext::OffscreenSize()
GLContext::CreateScreenBuffer()
GLContext::CreateScreenBufferImpl()
GLContext::fBindFramebuffer()
GLContext::InitOffscreen()
CreateTexture()
CreateTextureForOffscreen()
GLContext::fBindFramebuffer()
GLContext::raw_fBindFramebuffer()
CreateTextureForOffscreen()
CreateTexture()
DefaultEglLibrary()
GLContextProviderEGL::CreateOffscreen()
GLScreenBuffer::Create()
GLScreenBuffer::CreateFactory()
GLScreenBuffer::CreateFactory()
GLScreenBuffer::GLScreenBuffer()
GLScreenBuffer::~GLScreenBuffer()
GLScreenBuffer::BindFB()
GLScreenBuffer::BindDrawFB()
GLScreenBuffer::BindReadFB()
GLScreenBuffer::BindReadFB_Internal()
GLScreenBuffer::GetDrawFB()
GLScreenBuffer::GetReadFB()
GLScreenBuffer::GetFB()
GLScreenBuffer::CopyTexImage2D()
GLScreenBuffer::ReadPixels()
GLScreenBuffer::Morph()
GLScreenBuffer::Attach()
GLScreenBuffer::Swap()
GLScreenBuffer::Resize()
GLScreenBuffer::CreateRead()
CreateRenderbuffersForOffscreen()
ReadBuffer::Create()
ReadBuffer::~ReadBuffer()
ReadBuffer::Attach()
ReadBuffer::Size()
SharedSurface::ProdCopy()
SharedSurface::SharedSurface()
SharedSurface::GetTextureFlags()
ChooseBufferBits()
SurfaceFactory::SurfaceFactory()
SurfaceFactory::~SurfaceFactory()
SurfaceFactory::NewTexClient()
SurfaceFactory::StartRecycling()
SurfaceFactory::StopRecycling()
SurfaceFactory::RecycleCallback()
SurfaceFactory::Recycle()
SharedSurface_EGLImage::SharedSurface_EGLImage()
SharedSurface_EGLImage::ReadPixels()
SurfaceFactory_Basic::SurfaceFactory_Basic()
SharedSurface_Basic::Wrap()
SharedSurface_Basic::SharedSurface_Basic()
SharedSurface_Basic::~SharedSurface_Basic()
SharedSurface_GLTexture::Create()
SharedSurface_GLTexture::~SharedSurface_GLTexture()
SharedSurface_GLTexture::ProducerReleaseImpl()
SharedSurface_GLTexture::ToSurfaceDescriptor()
GLFormatForImage()
GLTypeForImage()
TextureImageEGL::TextureImageEGL()
TextureImageEGL::~TextureImageEGL()
TextureImageEGL::DirectUpdate()
TextureImageEGL::BindTexture()
TextureImageEGL::Resize()
TextureImageEGL::BindTexImage()
TextureImageEGL::ReleaseTexImage()
TextureImageEGL::DestroyEGLSurface()
CreateTextureImageEGL()
TileGenFuncEGL()
SharedSurfaceTextureClient::SharedSurfaceTextureClient()
SharedSurfaceTextureClient::Create()
SharedSurfaceTextureClient::~SharedSurfaceTextureClient()
There are also a few datastructures which I should consider removing as well:
struct GLFormats
struct SurfaceCaps
enum class AttachmentType
An important part of these changes involves adding back the GLScreenBuffer method in to the code. But when the upstream change was made to remove this it was replaced with a class called SwapChain. So as part of all these changes I also want to see whether I can roll GLScreenBuffer into SwapChain. My gut tells me this would be a neat simplification if it's possible.

There are some changes that go along with this. I'm keeping a note of it here so I don't lose track of the things to consider.
 
  • Combine SwapChain with GLScreenBuffer.
  • Rename all GLContext::mScreen usage to GLContext::mSwapChain.
  • Switch calls to Screen() with calls to GetSwapChain().


In addition to all of the changes above there are also some changes I need to take more care over. The following methods were added in order to accommodate Wayland rendering and so might be important for getting the browser working on native ports. However, in truth while related to offscreen rendering, these changes aren't directly required for it and I'm not able to test these changes myself. I therefore need to take care about whether these are really needed or not.
GetAppDisplay
_wl_compositor_create_surface
_wl_surface_destroy
LoadWaylandFunctions
UnloadWaylandFunctions
WaylandGLSurface::WaylandGLSurface
WaylandGLSurface::~WaylandGLSurface
CreateEmulatorBufferSurface
Finally many of these changes may be conditional on whether they're related to the Sailfish OS build or not. In this case I should consider wrapping some of them in a MOZ_WIDGET_QT precompiler conditional.

Phew, that's a lot to get through. But there is some kind of plan here and in practice I'm hoping it won't take as long to get through all of this as the lists above might suggest.

But this is enough for today. I'll start to look in to all these changes tomorrow and hopefully will immediately be able to start slimming down the changes.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
28 May 2024 : Day 246 #
Hooray! After a lot of to-ing and fro-ing, I now have a successful build and a bunch of fresh packages. Now to test them. I've transferred them over to my phone, installed the packages and set the harbour-webview test application running.

After my epic train journey building these packages yesterday, and in the hope that these packages might work, thigg, AI artist in residence, has kindly created a picture of the aftermath. I think this captures the vibe rather nicely! Thank you thigg as always for your brilliant work.
 
A train carriage with no people in with a USB cable abandoned on one of the seats; the ambience is rather lonely.

Astonishingly, after all this when I execute my harbour-webview test app using the new packages I still just get a blank screen. That's okay though. Presumably it means the NotifyDidPaintForSubtree events still aren't being called. I now have all the structure for that to happen, so I'll just have to do a bit of debugging to find out why. That, at least, is the plan on paper.

So let's get stuck in with the debugger.
(gdb) info break
Num     Type           Address            What
1       0x0000007ff2c81914 in nsRootPresContext::EnsureEventualDidPaintEvent(
    BaseTransactionId<TransactionIdType>) at nsPresContext.cpp:2689
        breakpoint already hit 4 times
2       0x0000007ff7ee8ff0 in QMozViewPrivate::OnFirstPaint(int, int) at 
    qmozview_p.cpp:1122
        breakpoint already hit 1 time
3       <MULTIPLE>         
        breakpoint already hit 2 times
3.1     0x0000007ff367468c in mozilla::embedlite::EmbedLiteViewChild::
    OnFirstPaint(int, int) at EmbedLiteViewChild.cpp:1479
3.2     0x0000007ff3674774 EmbedLiteViewChild.h:60
(gdb) c
[...]
Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 1, nsRootPresContext::
    EnsureEventualDidPaintEvent (this=0x7fc5112ba0, 
    aTransactionId=aTransactionId@entry=...)
    at nsPresContext.cpp:2689
2689    {
(gdb) c
Continuing.

Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, non-virtual thunk to 
    mozilla::embedlite::EmbedLiteViewChild::OnFirstPaint(int, int) ()
    at EmbedLiteViewChild.h:60
60        NS_DECL_NSIEMBEDBROWSERCHROMELISTENER
(gdb) c
Continuing.

Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 3, mozilla::embedlite::
    EmbedLiteViewChild::OnFirstPaint (this=0x7fc4ae79c0, aX=0, aY=0)
    at EmbedLiteViewChild.cpp:1479
1479    {
(gdb) c
Continuing.

Thread 1 &quot;harbour-webview&quot; hit Breakpoint 2, QMozViewPrivate::
    OnFirstPaint (this=0x55557db0f0, aX=0, aY=0) at qmozview_p.cpp:1122
1122        mIsPainted = true;
(gdb) c
[...]
As you can see from this all of the expected methods are being called. So it's a bit strange that it's not rendering anything. And in fact... the odd thing is that now, with the debugger running, it is rendering something. If I clear the breakpoints, the render appears just fine.

And if I restart the app without the debugger it's now working fine as well. The page is being rendered correctly to the screen. Weird. So although it didn't work the first time, it is now working correctly.

There are a bunch of reasons why this might happen, such as a profile folder that needed refreshing or an executable that hadn't quite been properly quit. So I'll put it down to an anomaly. Whatever the reason It looks like the WebView is now working correctly.

It feels like this justifies a bit of a celebration and thankfully thigg is on hand once again to provide us with the content needed for just that. After the rather sombre image above, this feels a lot more upbeat and I really do like it. Finally, the Firefox has yielded its rendering riches!
 
A medieval style block-carved image, mostly black and white, with a flying pig on the left, its front hoofs on a golden stool, holding a mobile phone. On the right an orange fox sits on a throne holding a sword upright with flames on its head (a Firefox!)

Thank you again thigg for your amazing work. I'd happily continue the celebrations, but in practice there's more work to be done. Although that's all for the harbour-webview test app it's time now to try a proper app that uses the WebView to render. The obvious example of this is the email application, but I don't have an email account set up on this phone as I only use it for development. I'll set up an email account to check this in due course, but I'm looking for a quick test and entering my account details will take some time.

The good news is that the Settings app also uses the WebView, specifically in the account creation pages, depending on the account type. Google, VK, DropBox, OneDrive and Twitter all use a WebView to show Terms and Conditions pages, and in the case of Twitter it's used for OAuth configuration as well.

But when I try to use any of them the Settings app crashes. Here's the backtrace I get when the crash occurs:
Thread 10 &quot;GeckoWorkerThre&quot; received signal SIGSEGV, Segmentation 
    fault.
[Switching to LWP 2476]
0x0000007fc4fe8fdc in mozilla::embedlite::PuppetWidgetBase::Invalidate (
    this=0x7fb4e698b0, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
274         MOZ_CRASH(&quot;Unexpected layer manager type&quot;);
(gdb) bt
#0  0x0000007fc4fe8fdc in mozilla::embedlite::PuppetWidgetBase::Invalidate (
    this=0x7fb4e698b0, aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:274
#1  0x0000007fc4fed870 in mozilla::embedlite::PuppetWidgetBase::UpdateBounds (
    this=0x7fb4e698b0, aRepaint=aRepaint@entry=true)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:395
#2  0x0000007fc4ff6a5c in mozilla::embedlite::EmbedLiteWindowChild::
    CreateWidget (this=0x7fb4e19370)
    at xpcom/base/nsCOMPtr.h:851
#3  0x0000007fc4fe6fc8 in mozilla::detail::RunnableMethodArguments<>::
    applyImpl<mozilla::embedlite::EmbedLiteWindowChild, void (mozilla::
    embedlite::EmbedLiteWindowChild::*)()>(mozilla::embedlite::
    EmbedLiteWindowChild*, void (mozilla::embedlite::EmbedLiteWindowChild::*)(
    ), mozilla::Tuple<>&, std::integer_sequence<unsigned long>) (args=..., 
    m=<optimized out>, o=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1151
#4  mozilla::detail::RunnableMethodArguments<>::apply<mozilla::embedlite::
    EmbedLiteWindowChild, void (mozilla::embedlite::EmbedLiteWindowChild::*)()> 
    (
    m=<optimized out>, o=<optimized out>, this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1154
#5  mozilla::detail::RunnableMethodImpl<mozilla::embedlite::
    EmbedLiteWindowChild*, void (mozilla::embedlite::EmbedLiteWindowChild::*)(
    ), true, (mozilla::RunnableKind)1>::Run (this=<optimized out>)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsThreadUtils.h:1201
#6  0x0000007fc214e63c in mozilla::RunnableTask::Run (this=0x55562b6ce0)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
[...]
#28 0x0000007ff6f4289c in ?? () from /lib64/libc.so.6
(gdb) 
And here's where that error is coming from in PuppetWidgetBase.cpp according to this backtrace:
  if (mozilla::layers::LayersBackend::LAYERS_CLIENT == lm->GetBackendType()) {
    // No need to do anything, the compositor will handle drawing
  } else {
    MOZ_CRASH(&quot;Unexpected layer manager type&quot;);
  }
As we can see from this, it's an enforced crash which happens whenever the layer backend is set to anything other than LAYERS_CLIENT. According to the debugger, ours is set to LAYERS_WR, so it's no wonder the crash is happening:
(gdb) p lm->GetBackendType()
$1 = mozilla::layers::LayersBackend::LAYERS_WR
(gdb) 
The interesting thing is that this isn't the first time we've hit this error. In fact, it's not the second time either.

The first time was back on Day 50 where the problem followed from the fact that gfx::gfxVars::UseWebRender() was set to true when it was expected to be set to false.

The second time was on Day 160 when the problem also amounted to the WebView layer manager being used. It took a couple of days of investigation (I eventually figured it out on Day 163) to figure out the reason, but it turned out to be because the gfx.webrender.force-disabled preference was set to false when it should have been set to true.

This preference is managed by the pref.js file inside the .mozilla profile folder. This folder is stored in different places for different apps. In the case of the Settings app the profile folder looks like this:
$ ls -a /home/defaultuser/.cache/org.sailfishos/Settings/
.         ..        .mozilla
$ ls -a /home/defaultuser/.cache/org.sailfishos/Settings/.mozilla/
.                   .parentlock         cert9.db            cookies.sqlite-wal  
    pkcs11.txt          startupCache        storage.sqlite
..                  cache2              cookies.sqlite      key4.db             
    security_state      storage             ua-update.json
$ rm -rf /home/defaultuser/.cache/org.sailfishos/Settings/.mozilla
It's notable that there is no pref.js file to be found here. That contrasts with the harbour-webview configuration file, which not only exists, but which also includes a setting for gfx.webrender.force-disabled:
$ pwd
/home/defaultuser/.cache/harbour-webview/harbour-webview/.mozilla
$ grep &quot;gfx\.webrender\.force-disabled&quot; prefs.js 
user_pref(&quot;gfx.webrender.force-disabled&quot;, true);
But as we noted, there's no equivalent settings file for the Settings app:
$ pwd
/home/defaultuser/.cache/org.sailfishos/Settings/.mozilla
$ grep &quot;gfx\.webrender\.force-disabled&quot; prefs.js 
grep: prefs.js: No such file or directory
There's obviously something to fix here, but right now I just want to test the rendering. So to move things forwards I'm going to copy over the preferences file from the harbour-webview app and put it in the configuration folder for the Settings app.
$ cp ../../../harbour-webview/harbour-webview/.mozilla/prefs.js .
$ ls
cache2              cookies.sqlite      key4.db             prefs.js            
    startupCache        storage.sqlite
cert9.db            cookies.sqlite-wal  pkcs11.txt          security_state      
    storage             ua-update.json
$ grep &quot;gfx\.webrender\.force-disabled&quot; prefs.js 
user_pref(&quot;gfx.webrender.force-disabled&quot;, true);
After copying this file over and re-starting the Settings app I find things are now working as expected. I can happily view the Terms and Conditions pages for the different accounts and even start to enter my Twitter credentials if I want to. There is clearly a problem to solve here, but it's not directly related to rendering. The problem is that the pref.js file isn't being created when it should be.

I need to keep my problems partitioned, otherwise they become too big and never get resolved. I'll create an issue for this pref.js problem and continue concentrating on getting the rendering wrapped up.

So what is there still to wrap up on the rendering front? Well, specifically I need to now go through the code and try to simplify things, partition everything into a series of clean commits so that I can turn them into nice patches.

I've had to make a really quite huge number of changes to get this WebView offscreen rendering working. Now I have to see if it can be simplified. This is a big task and not one I'm going to start today, so that'll have to do for now. I'll pick up the next steps tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
27 May 2024 : Day 245 #
This morning I woke up early unable to sleep. Last night I set a build running and although it's not nervousness about the build that's preventing me from sleeping, it does nevertheless provide a good opportunity for me to check the progress of the build.

Unfortunately the build that was running overnight has failed. There are errors resulting from the changes I made to nsPresContext.cpp, the first of which looks like this:
205:02.94 In file included from Unified_cpp_layout_base2.cpp:20:
205:02.94 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp: In member 
    function ‘void nsPresContext::DetachPresShell()’:
205:02.94 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:839:3: error: 
    ‘thisRoot’ was not declared in this scope
205:02.94    thisRoot->CancelAllDidPaintTimers();
205:02.94    ^~~~~~~~
205:02.95 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:839:3: note: 
    suggested alternative: ‘IsRoot’
205:02.95    thisRoot->CancelAllDidPaintTimers();
205:02.95    ^~~~~~~~
205:02.95    IsRoot
The problem here is an attempt to use the thisRoot variable in the nsPresContext::DetachPresShell() method without it having been declared. Here's the specific line that's failing from the ESR 91 code:
  // The did-paint timer also depends on a non-null pres shell.
  thisRoot->CancelAllDidPaintTimers();
In comparison, here's what the equivalent code looks like in ESR 78:
  if (IsRoot()) {
    nsRootPresContext* thisRoot = static_cast<nsRootPresContext*>(this);

    // Have to cancel our plugin geometry timer, because the
    // callback for that depends on a non-null presshell.
    thisRoot->CancelApplyPluginGeometryTimer();

    // The did-paint timer also depends on a non-null pres shell.
    thisRoot->CancelAllDidPaintTimers();
  }
Note that both of these blocks of code are post-patch. That is, they're the code as it looks after patch 0044 (or my manual changes in the case of ESR 91) have been applied. Looking at the ESR 78 code gives an immediate idea about what the code ought to look like in ESR 91. I've made changes in line with this so that now the ESR 91 code looks like this:
  if (IsRoot()) {
    nsRootPresContext* thisRoot = static_cast<nsRootPresContext*>(this);

    // The did-paint timer also depends on a non-null pres shell.
    thisRoot->CancelAllDidPaintTimers();
  }
That should do the trick. There's another error in the build output though, which is the following:
205:03.09 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp: At global scope:
205:03.09 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:2677:1: error: 
    ‘mOutstandingTransactionIdvoid’ does not name a type
205:03.09  mOutstandingTransactionIdvoid
205:03.09  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To understand this better it helps once again to compare the ESR 78 and ESR 91 code. But this time I'm going to switch it around so that we see the ESR 78 code first. Here's what it looks like there:
void
nsRootPresContext::EnsureEventualDidPaintEvent(TransactionId aTransactionId)
{
And here's what it looks like in ESR 91:
mOutstandingTransactionIdvoid
nsRootPresContext::EnsureEventualDidPaintEvent(TransactionId aTransactionId)
{
Notice the error? Honestly I don't know what's going on here but I suspect I accidentally hit Ctrl-V at some point when the cursor was in the wrong place and pasted the variable in the buffer in front of the return type.

Removing the extraneous mOutstandingTransactionIdvoid so it's restored back to the original code seems to be the correct fix here.

These are the only two errors thrown up by the compiler, so after fixing them both I've set the build running again.

Unfortunately the next attempt to build also fails. This time the compilation stages all pass without incident, it gets right up to the linking stage before throwing an undefined reference error. Apparently I've forgotten to define nsRootPresContext::Detach():
216:06.48 toolkit/library/build/libxul.so
219:52.11 opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    ../../../layout/base/Unified_cpp_layout_base2.o:(
    .data.rel.ro.local._ZTV17nsRootPresContext[_ZTV17nsRootPresContext]+0x38): 
    undefined reference to `nsRootPresContext::Detach()'
219:52.11 opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    libxul.so: hidden symbol `_ZN17nsRootPresContext6DetachEv' isn't defined
219:52.11 opt/cross/bin/aarch64-meego-linux-gnu-ld: 
    final link failed: bad value
219:52.11 collect2: error: ld returned 1 exit status
Sure enough the nsRootPresContext::Detach() method is defined in the class header:
class nsRootPresContext final : public nsPresContext {
 public:
  nsRootPresContext(mozilla::dom::Document* aDocument, nsPresContextType aType);
  virtual ~nsRootPresContext();
  virtual bool IsRoot() override { return true; }

  virtual void Detach() override;
[...]
But there's no code to define the body of the method in the cpp file. Hence the error. The correct method body that should have been included can be seen in the 0044 patch file:
+/* virtual */
+void nsRootPresContext::Detach() {
+  CancelAllDidPaintTimers();
+  nsPresContext::Detach();
+}
+
For some reason this never made it in to my manually patched ESR 91 code. I must have missed it.

Both of the methods called in this method — nsRootPresContext::CancelAllDidPaintTimers() and nsPresContext::Detach() — are defined and accessible, so adding the missing code shouldn't break compilation. I've added it in and kicked the build off once again. The fact it got all the way through to linking on the previous attempt will hopefully mean that there are fewer source files to be recompiled this time around and the build will complete relatively swiftly.

Having said that, for the rest of the day I'll be travelling. That means a day of switching between various buses and trains. Not ideal when it comes to a lengthy build since, while I can allow the build to continue when I'm on the train, my laptop will have to be suspended and resumed as I move from one leg of the journey to the next.

Still, the build continues, on and off, throughout the day.

[...]

It's evening now and as I sit here at my destination after a day spent travelling, the build continues to run, which means I'll have to leave it running overnight. Let's hope I wake up to a complete build and an opportunity to test the result in the morning.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
26 May 2024 : Day 244 #
I'm continuing to investigate the best way to re-enable the OnFirstPaint() call. I've now been through the code — and the patch — several times to see if I can find a way to avoid manually applying the patch in full.

The actual purpose of the patch appears to be to set up some timers so that the NotifyDidPaintForSubtree() method is triggered after the screen paint has completed. Here's where the timer is created and kicked off:
  nsCOMPtr<nsITimer> timer;
  RefPtr<nsRootPresContext> self = this;
  nsresult rv = NS_NewTimerWithCallback(
    getter_AddRefs(timer),
    NewNamedTimerCallback([self, aTransactionId](){
      nsAutoScriptBlocker blockScripts;
      self->NotifyDidPaintForSubtree(aTransactionId);
     }, &quot;NotifyDidPaintForSubtree&quot;), 100, nsITimer::TYPE_ONE_SHOT,
    Document()->EventTargetFor(TaskCategory::Other));
This piece of code would be added to the nsPresContext.cpp source file by the patch. As you can see it creates a one-shot timer that runs for 100 milliseconds (one tench of a second) before calling the NotifyDidPaintForSubtree() method.

A lot of the surrounding code is all related to ensuring the timer does — or doesn't — run under various scenarios, such as after the window has been destroyed.

Looking carefully through the patch, applying it to the most recent ESR 91 code doesn't seem to be too awkward. However the code after applying the patch seems to use mPendingTransaction and mCompletedTransaction in contrast to the original code that uses mNextTransactionId, mOutstandingTransactionId and mCompletedTransaction.

In the new ESR 91 code these have been switched out for an array called mPendingTransactions. In order to help get things straight in my head I'm going to try to summarise what each of these is used for.

In the non-patched ESR 78 code, all of these are of type TransactionId, which is a class, but essentially one that's a wrapper for a monotonically increasing uint64_t used as an identifier. Here are the three instances and what they're for in ESR 78 (before patch 0044 is applied):
  • mNextTransactionId: The most recently allocated transaction id.
  • mOutstandingTransactionId: Captures mCompletedTransaction + (pending transaction count).
  • mCompletedTransaction: The most recently completed transaction id.

Based on this we always have mNextTransactionId >= mOutstandingTransactionId >= mCompletedTransaction. The number of transactions not yet completed is given by (mOutstandingTransactionId - mCompletedTransaction).

Because the ids are monotonically increasing, all three of these variables are needed to keep track of things. After applying the ESR 78 patch the arrangement is rejigged so that only two variables are needed:
  • mPendingTransaction: The most recently allocated transaction id.
  • mCompletedTransaction: The most recently completed transaction id.

Here mPendingTransaction is replacing mNextTransactionId. Both increment when a new transaction is created. However when a transaction is revoked mPendingTransaction in the post-patch version of the code is decreased, whereas in the pre-patch version it was mOutstandingTransactionId that decreased.

Consequently if we want to find out how many open transactions there are to complete in the post-patch ESR 78 build, it now becomes (mPendingTransaction - mCompletedTransaction).

In the non-patched ESR 91 code this has all changed quite substantially. Now we have a variable to ensure ids aren't repeated, plus an array of TransactionId objects.
  • mNextTransactionId: The most recently allocated transaction id.
  • mPendingTransactions: An extensible array of transaction ids, representing all open transactions.

This gives a more flexible arrangement. When transactions are created mNextTransactionId is incremented and used to get the next id. This way ids are monotonically increasing. The newly minted id is added to the mPendingTransactions array when it's opened and removed from the array when it's revoked.

With this arrangement we can use mPendingTransactions.Length() to find out how many open transactions there are.

As you can see from this the ESR 91 approach supports everything that the other two approaches support. So there's no point in tearing it down and replacing it with one of the earlier approaches. Better to amend the changes to the ESR 91 codebase to suit this new approach.

What does this mean in practice? It means that the changes patching nsPresContext.cpp need to go in as before, but the changes in nsRefreshDriver.cpp can be greatly simplified. Most of the changes simply aren't needed if we're not having to change the way these ids are being managed. In fact the only change we need to make to this file is to drop the call to NotifyRevokingDidPaint():
$ git diff layout/base/nsRefreshDriver.cpp
diff --git a/layout/base/nsRefreshDriver.cpp b/layout/base/nsRefreshDriver.cpp
index 7726971e63c9..34406f9cf2fe 100644
--- a/layout/base/nsRefreshDriver.cpp
+++ b/layout/base/nsRefreshDriver.cpp
@@ -2647,12 +2647,6 @@ void nsRefreshDriver::RevokeTransactionId(
     FinishedWaitingForTransaction();
   }
 
-  // Notify the pres context so that it can deliver MozAfterPaint for this
-  // id if any caller was expecting it.
-  nsPresContext* pc = GetPresContext();
-  if (pc) {
-    pc->NotifyRevokingDidPaint(aTransactionId);
-  }
   // Remove aTransactionId from the set of outstanding transactions since we're
   // no longer waiting on it to be completed, but don't revert
   // mNextTransactionId since we can't use the id again.
I've applied all of these changes to the code — manually this time — and checked through them. This time I've tried to understand what each change is doing before applying it, so I'm much more confident of about the result this time. I've started the build off so that it's all set to run overnight. Let's see how this has gone when we return to it in the morning.

Nothing builds first time of course, but whatever errors are thrown up should hopefully be more straightforward to deal with than the errors I got with my previous attempts.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
25 May 2024 : Day 243 #
It was nice to get such appreciative comments after my last post, thank you for those who got in touch saying they were interested to hear my thoughts on the Jolla C2. I'm glad something good came out of my stupidity in sending my laptop to sleep mid-build! A special mention has to go to Leif-Jöran Olsson (ljo) for his amazing poetry in describing the situation!
 
The 3-way merged double oh double four patch brought the light to webview rendering. Manually reapplying might bring the build in shape. Community squared arrived, squared in joy. Stoically stochastic interventions might as well bring the young ones to reappear. The gecko's nest will be filled with seedlings to feast on. Congrats!

This really made my day; thank you ljo! And I needed it to because I'm just not happy with how all this patching, or more precisely reverting, has been going. As ljo aludes to in his poem, yesterday I applied a three-way merge of the Sailfish patch 0044. This patch was designed to revert upstream commit 1deccd7ac14706ad, but it's intended to be applied on top of the ESR 78 codebse. Things have changed with ESR 91, so the patch no longer applies directly.

For reference, here's the description that goes alongside patch 0044.
commit c3d0a8fb29cbb2998c238a2c85801059ed636070 (HEAD -> 
    FIREFOX_ESR_91_9_X_RELBRANCH_patches)
Author: Andrew den Exter <andrew.den.exter@qinetic.com.au>
Date:   Mon Sep 6 09:52:04 2021 +0000

    Revert &quot;Bug 1445570 - Remove EnsureEventualAfterPaint timer. 
    r=tnikkel&quot;

    This reverts commit 1deccd7ac14706ad1849343cfb2b93df191a1c42.
And here's the description of the commit that the patch is aiming to revert.
$ git log -1 --grep &quot;Bug 1445570&quot;
commit 1deccd7ac14706ad1849343cfb2b93df191a1c42
Author: Mantaroh Yoshinaga <mantaroh@gmail.com>
Date:   Thu Sep 6 02:21:39 2018 +0000

    Bug 1445570 - Remove EnsureEventualAfterPaint timer. r=tnikkel
    
    MozReview-Commit-ID: C7WICJ5Q0ES
    
    Differential Revision: https://phabricator.services.mozilla.com/D5005
    
    --HG--
    extra : moz-landing-system : lando
I'm copying these out here because I think they're going to be helpful as we continue onwards.

Since patching has been so unsuccessful, rather than trying to apply the Sailfish patch for ESR 78, why don't I go directly to the source and attempt to revert commit 1deccd7ac14706ad instead? Maybe that will bring better results.

It seems like a good plan, but in practice when I try to do this I get a bunch of conflicts.
$ git revert 1deccd7ac14706ad1849343cfb2b93df191a1c42
Auto-merging dom/base/nsDOMWindowUtils.cpp
CONFLICT (content): Merge conflict in dom/base/nsDOMWindowUtils.cpp
Auto-merging layout/base/nsPresContext.cpp
CONFLICT (content): Merge conflict in layout/base/nsPresContext.cpp
Auto-merging layout/base/nsPresContext.h
CONFLICT (content): Merge conflict in layout/base/nsPresContext.h
Auto-merging layout/base/nsRefreshDriver.cpp
CONFLICT (content): Merge conflict in layout/base/nsRefreshDriver.cpp
Auto-merging layout/base/nsRefreshDriver.h
CONFLICT (content): Merge conflict in layout/base/nsRefreshDriver.h
Auto-merging layout/reftests/text/reftest.list
CONFLICT (content): Merge conflict in layout/reftests/text/reftest.list
Auto-merging layout/style/test/test_restyles_in_smil_animation.html
error: could not revert 1deccd7ac147... Bug 1445570 - Remove 
    EnsureEventualAfterPaint timer. r=tnikkel
hint: After resolving the conflicts, mark them with
hint: &quot;git add/rm <pathspec>&quot;, then run
hint: &quot;git revert --continue&quot;.
hint: You can instead skip this commit with &quot;git revert --skip&quot;.
hint: To abort and get back to the state before &quot;git revert&quot;,
hint: run &quot;git revert --abort&quot;.
I don't much enjoy conflicts so I wonder if there's a different revert strategy that I can use to make this go more smoothly. Checking the git-revert documentation does indeed throw up some promising suggestions.
$ man git-revert
[...]
       --strategy=<strategy>
           Use the given merge strategy. Should only be used once. See the 
    MERGE STRATEGIES section in git-merge(1) for details.
This suggests checking the MERGE STRATEGIES section of the git-merge man pages. But that section is rather long, so rather than copying it all out here, I recommend to check out the online documentation instead. According to this page, alongside the standard ort ("Ostensibly Recursive’s Twin") strategy — which is what I used here — there's also a so-called recursive strategy. The ort approach didn't go great, so lets try the alternative.
$ git revert --strategy=recursive 1deccd7ac14706ad1849343cfb2b93df191a1c42
[...]
Unfortunately that gives me similarly poor results. What I really want is a strategy that goes back through the history, removes the commit to be reverted, applies all subsequent commits on top of that, then generates a diff of the result against the current codebase. It's possible this is exactly what these strategies are doing already, but I'm not smart enough to figure that out from the documentation.

Actually, after more reflection, where the documentation talks about common ancestors, I think it's this that it's refering to. Nevertheless it didn't give good results, so I want to try a different approach.

Rather than applying the patch, or reverting the commit, there is a third option. And that's to figure out how NotifyInvalidation() ought to be called in the existing ESR 91 codebase. If there is such a way then it's possible this patch could be skipped entirely, which would have the nice side-effect of making future maintenance easier.

I'll explore this third option a little further over the next few days (I say "few days" because it's going to take at least one full build, so this isn't a task that can be completed in a day).

The first thing I notice is that the code that's been removed isn't the only place that nsPresContext::NotifyDidPaintForSubtree() gets called. It also gets called from inside the nsAutoNotifyDidPaint class destructor:
class nsAutoNotifyDidPaint {
 public:
  nsAutoNotifyDidPaint(PresShell* aShell, PaintFlags aFlags)
      : mShell(aShell), mFlags(aFlags) {}
  ~nsAutoNotifyDidPaint() {
    if (!!(mFlags & PaintFlags::PaintComposite)) {
      mShell->GetPresContext()->NotifyDidPaintForSubtree();
    }
  }

 private:
  PresShell* mShell;
  PaintFlags mFlags;
};
This destructor does get called as a result of in instance being created on the stack in the PresShell::Paint() method. At the end of the method the instance is destroyed and the destructor called.

However, in this case the NotifyDidPaintForSubtree() method doesn't get called because !!(mFlags & PaintFlags::PaintComposite) evaluates to false.

There's also a call inside nsView::DidCompositeWindow() but this never gets called. In this case the method is part of an implementation of the nsIWidgetListener interface. We do have some instances of this interfaces in use, most notably inherited from nsBaseWidget into EmbedLitePuppetWidget:
class EmbedLitePuppetWidget : public PuppetWidgetBase
{
[...]
class PuppetWidgetBase : public nsBaseWidget
{
[...]
class nsBaseWidget : public nsIWidget, public nsSupportsWeakReference {
[...]
  nsIWidgetListener* mWidgetListener;
  nsIWidgetListener* mAttachedWidgetListener;
[...]
However whilst these are used in the EmbedLitePuppetWidget code they don't seem to be set there. The mAttachedWidgetListener variable gets set by nsView but that's a direction we already explored. The mWidgetListener variable is also set by nsView in addition to AppWindow. The AppWindow class is never used, but although the nsView class is instantiated we already know that nsView::DidCompositeWindow() is never called. I'm now not entirely sure why it doesn't get called. But it doesn't.

None of this is directly helping us to solve the problem, but it does allow us to explore possibilties and rule things out. This is all going to require a bit more thought. I'll continue with this line of investigation tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
24 May 2024 : Day 242 #
It's funny how even when asleep I'm capable of making stupid mistakes. Yesterday I applied patch 0044 in order to restore the OnFirstPaint() trigger functionality. This is critical for getting the offscreen rendering pipeline to work. But as is the case after applying a patch I needed to rebuild the packages before they could be tested on my phone.

I kicked the build off to run overnight and then — d'oh! — promptly sent my laptop to sleep. It doesn't take a genius to realise that the build isn't going to run while the laptop is sleeping.

So here I am this morning without packages to test. At least my laptop is now awake and the build is continuing. But this has lost me the opportunity to do some development this morning.

Instead I've spent the time ordering myself a Jolla C2, the new Sailfish OS phone announced at Jolla Love Day 2 this last Monday. It's an interesting move by Jolla. For years now members of the community have been encouraging Jolla to offer Sailfish OS pre-installed. On the other hand, many in the community are also keen to see Sailfish OS offered on a very specific selection of handsets, be they PinePHones, Fairphones or brands that are more widely available in Europe, especially those known for being more "open".

My own opinion is that this is a good move by Jolla. There is no perfect phone and every choice has its benefits and drawbacks. Going with Sony phones all those years ago was controversial, but Sony hardware has (in my opinion) been both excellent and offering good value for the hardware. Nice materials, unassuming design and unexpectedly solid features (with headphones jacks, SD card slots, notification lights, and excellent screens to boot). Even more importantly Sony retained their commitment to openness over the long term, when other manufacturers have long since lost interest. On the other hand PinePhone and Fairphone have credentials that are really appealing for Sailfish users, who tend to prize openness and ethical considerations over price and features alone.

The new Reeder phone offers something a little different to either of these, which is the opportunity for Jolla to work hand-in-hand with the manufacturer. Jolla has a good track record doing this and the results historically have been very good. The new Reeder phone might not be open in the same way that the Sony or PinePhone are, but it's likely Jolla will have been given access to much more of what they need, including support, in order to build a really solid software experience for the hardware.

I'm speculating of course and only time will tell, but given we're still expecting Sailfish OS releases for the Xperia 10 IV and Xperia 10 V, and given I already have one of the latter, these are exciting times as far as I'm concerned.

Alright, let's get back to development. The build has carried on for a while but eventually failed with an error. Here's the debug output generated from the build process.
813:10.10 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.cpp: In member 
    function ‘void nsRefreshDriver::Tick(mozilla::VsyncId, mozilla::TimeStamp, 
    nsRefreshDriver::IsExtraTick)’:
813:10.10 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.cpp:2430:41: error: 
    ‘mNextTransactionId’ was not declared in this scope
813:10.10        transactionId.AppendInt((uint64_t)mNextTransactionId);
813:10.10                                          ^~~~~~~~~~~~~~~~~~
813:10.12 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.cpp:2430:41: note: 
    suggested alternative: ‘GetTransactionId’
813:10.13        transactionId.AppendInt((uint64_t)mNextTransactionId);
813:10.13                                          ^~~~~~~~~~~~~~~~~~
813:10.13                                          GetTransactionId
813:14.49 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694: 
    nsRefreshDriver.o] Error 1
The error is telling us that mNextTransactionIddoesn't exist and that's correct because it was removed by the patch. But there's still an instance remaining that the patch didn't capture. To get this through I've had to make one small change. This code looks to be related to profiling so probably isn't really needed for our purposes anyway, but we do need it to compile.
$ git diff layout/base/nsRefreshDriver.cpp
diff --git a/layout/base/nsRefreshDriver.cpp b/layout/base/nsRefreshDriver.cpp
index 33c90bb1f324..bb43dea053ea 100644
--- a/layout/base/nsRefreshDriver.cpp
+++ b/layout/base/nsRefreshDriver.cpp
@@ -2427,7 +2427,7 @@ void nsRefreshDriver::Tick(VsyncId aId, TimeStamp 
    aNowTime,
     nsCString transactionId;
     if (profiler_can_accept_markers()) {
       transactionId.AppendLiteral(&quot;Transaction ID: &quot;);
-      transactionId.AppendInt((uint64_t)mNextTransactionId);
+      transactionId.AppendInt((uint64_t)mPendingTransaction);
     }
     AUTO_PROFILER_MARKER_TEXT(
         &quot;ViewManagerFlush&quot;, GRAPHICS,
This is a similar change to that made elsewhere in the code, so should hopefully be pretty benign.

As the build is still taking a while I've decide I can't let all this potential development time go to waste. So I've decided to take the plunge and perform a test while the build continues. The current issue my most recent changes are aimed at addressing is the fact that the mIsPainted variable in the QMozViewPrivate class never gets set to true. So I'm wondering what will happen if I forcefully set it to true at a suitable time using the debugger.

If I'm honest I'm a little nervous about performing this test. If it turns out to have no effect then it'll leave me scratching my head yet again. I really want this rendering to work.

But I've plucked up the courage and am going to give it a go. First set the breakpoint on the method where the test for this flag is being made.
(gdb) b QuickMozView::updatePaintNode
Function &quot;QuickMozView::updatePaintNode&quot; not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (QuickMozView::updatePaintNode) pending.
(gdb) r
Starting program: /usr/bin/harbour-webview
[...]
Now run the program through to the first time this breakpoint is hit. I don't want to flip the flag here because the rendering pipeline and textures aren't fully set up yet, so I'm going to wait for a later hit of the breakpoint.
Thread 9 &quot;QSGRenderThread&quot; hit Breakpoint 1, QuickMozView::
    updatePaintNode (this=0x555586dc20, oldNode=0x0) at quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) c
Continuing.
[...]
=============== Preparing offscreen rendering context ===============
[...]
As we can see the rendering context has now been configured. Following this the breakpoint is hit again. Now is the time to change the flag from true to false. Here goes:
Thread 9 &quot;QSGRenderThread&quot; hit Breakpoint 1, QuickMozView::
    updatePaintNode (this=0x555586dc20, oldNode=0x7fc800c640) at 
    quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) p d->mIsPainted
$1 = false
(gdb) set d->mIsPainted = true
(gdb) p d->mIsPainted
$2 = true
(gdb) disable break
(gdb) c
Continuing.
[...]
Finally I disabled the breakpoint and set the program running again. And the result? I'm astonished to find that the result is a fully rendered Sailfish OS website!
 
A phone showing the sailfishos.org website next to a laptop on a table with the debugger running in a console window.

Scrolling around, selecting links and interacting with the site all seem to work nicely. This is really encouraging. This means that once this patch is properly applied, that should be the final piece of the puzzle needed to get the WebView working again.
 
Three screenshots of the WebView rendering: the sailfish.org site at the top; the same site scrolled down a bit; and shop.jolla.com with the new Jolla C2 on display

I'm quite excited by this!

Unfortunately this is a debugging hack so we're not there yet. And the build has failed again. At least now, motivated by my excitement, I have plenty of energy to fix the build.

Here's the latest output from the failed build:
302:48.25 In file included from Unified_cpp_layout_base2.cpp:20:
302:48.25 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp: In member 
    function ‘void nsPresContext::DetachPresShell()’:
302:48.25 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:843:15: error: 
    ‘class nsRootPresContext’ has no member named 
    ‘CancelApplyPluginGeometryTimer’; did you mean ‘mApplyPluginGeometryTimer’?
302:48.25      thisRoot->CancelApplyPluginGeometryTimer();
302:48.25                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
302:48.25                mApplyPluginGeometryTimer
302:48.40 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp: In destructor 
    ‘virtual nsRootPresContext::~nsRootPresContext()’:
302:48.40 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:2683:3: error: 
    ‘CancelApplyPluginGeometryTimer’ was not declared in this scope
302:48.40    CancelApplyPluginGeometryTimer();
302:48.40    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
302:48.47 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:2683:3: note: 
    suggested alternative:  mApplyPluginGeometryTimer’
302:48.47    CancelApplyPluginGeometryTimer();
302:48.47    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
302:48.47    mApplyPluginGeometryTimer
302:48.51 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp: In member 
    function ‘void nsRootPresContext::ComputePluginGeometryUpdates(nsIFrame*, 
    nsDisplayListBuilder*, nsDisplayList*)’:
302:48.51 ${PROJECT}/gecko-dev/layout/base/nsPresContext.cpp:2712:21: error: 
    ‘nsPluginFrame’ does not name a type; did you mean ‘nsPageFrame’?
302:48.51          static_cast<nsPluginFrame*>(iter.Get()->GetKey(
    )->GetPrimaryFrame());
302:48.51                      ^~~~~~~~~~~~~
302:48.51                      nsPageFrame
These are actually rather odd errors. For example nsPageFrame has been added to the code even though it's not part of the upstream change. It must have been pulled in due to some changes applied later.

I'm not very happy with this, it's all a bit messy. So I've decided to revert my attempt to automatically merge in the patch and go back to a manual approach instead.

This will require a bit of a rework, which unfortunately I don't have time to do tonight. But I'll start on it in the morning and with any luck may be able to get it done tomorrow. Which means that's it for today. But we're definitely getting there!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
23 May 2024 : Day 241 #
It's a good job I stopped working on this last night when I did. On returning to it this morning with a clearer head it didn't take long to identify that the changes needed to revert the D5005 upstream code changes had already been wrapped up into a patch — patch 0044 to be precise — that I'd not yet applied.

If the patch can be applied cleanly, it could save me a lot of time. So first step is to do a dry run of applying the patch. This won't actually make any changes to the code, but it will tell me whether the patch is going to apply without incident if I attempt it properly.
$ patch --dry-run -d gecko-dev -p1 < rpm/
    0044-Revert-Bug-1445570-Remove-EnsureEventualAfterPaint-t.patch 
checking file dom/base/nsDOMWindowUtils.cpp
Hunk #1 succeeded at 407 (offset 51 lines).
checking file layout/base/nsDocumentViewer.cpp
Hunk #1 succeeded at 1573 (offset -44 lines).
Hunk #2 succeeded at 1718 (offset -45 lines).
Hunk #3 succeeded at 3649 (offset -254 lines).
checking file layout/base/nsPresContext.cpp
Hunk #1 succeeded at 314 (offset 3 lines).
Hunk #2 FAILED at 804.
Hunk #3 succeeded at 2021 (offset 66 lines).
Hunk #4 FAILED at 2039.
Hunk #5 FAILED at 2121.
Hunk #6 FAILED at 2613.
Hunk #7 succeeded at 2716 with fuzz 2 (offset -132 lines).
4 out of 7 hunks FAILED
checking file layout/base/nsPresContext.h
Hunk #1 succeeded at 355 (offset -5 lines).
Hunk #2 succeeded at 912 (offset -4 lines).
Hunk #3 FAILED at 1363.
Hunk #4 FAILED at 1455.
2 out of 4 hunks FAILED
checking file layout/base/nsRefreshDriver.cpp
Hunk #1 FAILED at 1166.
Hunk #2 FAILED at 1239.
Hunk #3 FAILED at 2337.
Hunk #4 FAILED at 2396.
Hunk #5 FAILED at 2404.
5 out of 5 hunks FAILED
checking file layout/base/nsRefreshDriver.h
Hunk #1 FAILED at 507.
1 out of 1 hunk FAILED
checking file layout/style/test/test_restyles_in_smil_animation.html
There are quite a few nice succeeded lines there. But also a disturbing number of FAILED lines as well. That's FAILED in all caps, just in case the word itself doesn't sufficiently convey the damage being done.

The advice Raine would give me in this situation is to apply a three-way merge using git. So that's what I'm going to try.
$ git am --3way ../rpm/
    0044-Revert-Bug-1445570-Remove-EnsureEventualAfterPaint-t.patch
Applying: Revert &quot;Bug 1445570 - Remove EnsureEventualAfterPaint timer. 
    r=tnikkel&quot;
Using index info to reconstruct a base tree...
M       dom/base/nsDOMWindowUtils.cpp
M       layout/base/nsDocumentViewer.cpp
M       layout/base/nsPresContext.cpp
M       layout/base/nsPresContext.h
M       layout/base/nsRefreshDriver.cpp
M       layout/base/nsRefreshDriver.h
Falling back to patching base and 3-way merge...
Auto-merging layout/base/nsRefreshDriver.h
CONFLICT (content): Merge conflict in layout/base/nsRefreshDriver.h
Auto-merging layout/base/nsRefreshDriver.cpp
CONFLICT (content): Merge conflict in layout/base/nsRefreshDriver.cpp
Auto-merging layout/base/nsPresContext.h
CONFLICT (content): Merge conflict in layout/base/nsPresContext.h
Auto-merging layout/base/nsPresContext.cpp
CONFLICT (content): Merge conflict in layout/base/nsPresContext.cpp
Auto-merging layout/base/nsDocumentViewer.cpp
Auto-merging dom/base/nsDOMWindowUtils.cpp
error: Failed to merge in the changes.
Patch failed at 0001 Revert &quot;Bug 1445570 - Remove EnsureEventualAfterPaint 
    timer. r=tnikkel&quot;
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run &quot;git am --continue&quot;.
If you prefer to skip this patch, run &quot;git am --skip&quot; instead.
To restore the original branch and stop patching, run &quot;git am 
    --abort&quot;.
This has done a better job than the patch did, but there are still some problems which require manual intervention. But after lots of manual changes, comparing the source code against the patch file, eventually I'm able to get it there.
git am --continue
It's still the morning and before I start work. It would be nice to kick off the build before I have to move on to my proper work day. That way there's an outside chance that the build will be completed and ready to test by this evening. So although I'd normally read carefully through the changes to check whether they look sound, I'm just going to kick off the build and hope for the best. Here we go!
sfdk build -d -p --with git_workaround
Sadly as the build trundles on into my work day it hits an error in the code.
27:10.41 In file included from ${PROJECT}/obj-build-mer-qt-xr/dist/include/
    mozilla/dom/DocumentTimeline.h:16,
27:10.41                  from ${PROJECT}/gecko-dev/dom/animation/Animation.cpp:
    15,
27:10.41                  from Unified_cpp_dom_animation0.cpp:2:
27:10.41 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h: In member function 
    ‘bool nsRefreshDriver::AtPendingTransactionLimit()’:
27:10.41 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h:514:12: error: 
    ‘mPendingTransactions’ was not declared in this scope
27:10.41      return mPendingTransactions.Length() == 2;
27:10.42             ^~~~~~~~~~~~~~~~~~~~
27:10.42 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h:514:12: note: 
    suggested alternative: ‘mPendingTransaction’
27:10.42      return mPendingTransactions.Length() == 2;
27:10.42             ^~~~~~~~~~~~~~~~~~~~
27:10.42             mPendingTransaction
27:10.42 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h: In member function 
    ‘bool nsRefreshDriver::TooManyPendingTransactions()’:
27:10.42 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h:517:12: error: 
    ‘mPendingTransactions’ was not declared in this scope
27:10.42      return mPendingTransactions.Length() >= 2;
27:10.43             ^~~~~~~~~~~~~~~~~~~~
27:10.43 ${PROJECT}/gecko-dev/layout/base/nsRefreshDriver.h:517:12: note: 
    suggested alternative: ‘mPendingTransaction’
27:10.43      return mPendingTransactions.Length() >= 2;
27:10.43             ^~~~~~~~~~~~~~~~~~~~
27:10.43             mPendingTransaction
27:24.10 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:694: 
    Unified_cpp_dom_animation0.o] Error 1
27:24.10 make[3]: *** [${PROJECT}/gecko-dev/config/recurse.mk:72: dom/animation/
    target-objects] Error 2
These errors are clearly all caused by the patches I applied. Unfortunately I can't fix them while I'm at work, which means looking in to them in the evening. If I can get them done today there's a chance I'll be able to run the build overnight, but let's see.

The error is about use of mPendingTransactions in the nsRefreshDriver class. It's true that the member variable isn't defined there. Should it be?

It's not defined in the code in the upstream changeset on either side of the diff. Nor does it appear in the patch that we applied. Which is a little odd, except that there is reference in all of those to a mPendingTransaction member variable. Note the subtle difference: mPendingTransaction vs. mPendingTransactions (singular vs. plural).

And looking more carefully at the patch we can see the reason. Here are the changes that the patch tried to apply:
diff --git a/layout/base/nsRefreshDriver.h b/layout/base/nsRefreshDriver.h
index 07feab12e079..deec935f25f4 100644
--- a/layout/base/nsRefreshDriver.h
+++ b/layout/base/nsRefreshDriver.h
@@ -507,12 +507,7 @@ class nsRefreshDriver final : public mozilla::layers::
    TransactionIdAllocator,
   RefPtr<nsRefreshDriver> mRootRefresh;
 
   // The most recently allocated transaction id.
-  TransactionId mNextTransactionId;
-  // This number is mCompletedTransaction + (pending transaction count).
-  // When we revoke a transaction id, we revert this number (since it's
-  // no longer outstanding), but not mNextTransactionId (since we don't
-  // want to reuse the number).
-  TransactionId mOutstandingTransactionId;
+  TransactionId mPendingTransaction;
   // The most recently completed transaction id.
   TransactionId mCompletedTransaction;
But this failed and I made a manual intervention; one which I thought looked pretty similar, but which in fact included a crucial destructive change:
diff --git a/layout/base/nsRefreshDriver.h b/layout/base/nsRefreshDriver.h
index 29c595ba5ba8..b9a0fa4bb2ed 100644
--- a/layout/base/nsRefreshDriver.h
+++ b/layout/base/nsRefreshDriver.h
@@ -528,8 +528,9 @@ class nsRefreshDriver final : public mozilla::layers::
    TransactionIdAllocator,
   RefPtr<nsRefreshDriver> mRootRefresh;
 
   // The most recently allocated transaction id.
-  TransactionId mNextTransactionId;
-  AutoTArray<TransactionId, 3> mPendingTransactions;
+  TransactionId mPendingTransaction;
+  // The most recently completed transaction id.
+  TransactionId mCompletedTransaction;
As you can see, my change removed the mPendingTransactions (plural) variable. The good news is that after applying the patch this mPendingTransactions variable is only used in two places:
  bool AtPendingTransactionLimit() {
    return mPendingTransactions.Length() == 2;
  }
  bool TooManyPendingTransactions() {
    return mPendingTransactions.Length() >= 2;
  }
Neither of these two methods is used anywhere, so it should be safe to remove them. On the other hand, it seems that by applying the patch I've trampled over rather more changes than I'd anticipated. So while removing these two methods to get the build to pass is a simple solution, I'm also going to need to spend some time checking the other changes.

But, having worked through the patch and read through it side-by-side with the changes made to the ESR 91 code, it all looks sensible, albeit rather messy. It looks like the ESR 91 code now matches the code that would have been expected had the patch been applied to ESR 78 as a base.

I've therefore set the build running again. If it builds that'll be a good sign. The good news is also that once it's built, due to the work I've been doing over the last few days, it should be really easy to check whether this change has had the desired effect. All I need to do is check whether the OnFirstPaint() method is being called. If it is, then that's a good sign things have worked out.

Since it's building now I can't do much else until it completes, so I'm going to call it a night. By the morning it should be done and we can continue with some testing.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
22 May 2024 : Day 240 #
We're continuing to dig deeper into the first paint trigger code today. I've spent a couple of days digging around this already now and it's turning out to be a surprisingly complex investigation. I'm hoping the reward will be worth it, and to be honest, I think it will be: WebView rendering isn't going to work without getting to the bottom of this.

Yesterday we saw that the crucial OnFirstPaint() call was being made from EmbedLiteViewParent::RecvOnFirstPaint(). That's because the QMozViewPrivate class implements the EmbedLiteViewListener interface like this:
class QMozViewPrivate : public QObject,
    public mozilla::embedlite::EmbedLiteViewListener
{
[...]
This listener is set up to get called by EmbedLiteViewParent on receipt of the OnFirstPaint message:
mozilla::ipc::IPCResult EmbedLiteViewParent::RecvOnFirstPaint(const int32_t &aX,
                                                              const int32_t &aY)
{
  LOGT();
  NS_ENSURE_TRUE(mView && !mViewAPIDestroyed, IPC_OK());

  mView->GetListener()->OnFirstPaint(aX, aY);
  return IPC_OK();
}
This is called at the far end of a channel after receipt of a message sent potentially across a thread or process boundary. Sending things across process boundaries like this messes up the backtraces but are otherwise a pretty commonplace in Gecko. We just need to find out where the message is being sent from. Searching the code for where the Send happens is usually the way to go with this, and here it's no different. These send and receive messages are auto-generated from the interface files, so they always follow a strict pattern. A search for a call to SendOnFirstPaint() therefore throws up the following in the ESR 78 code:
NS_IMETHODIMP
EmbedLiteViewChild::OnFirstPaint(int32_t aX, int32_t aY)
{
  if (mDOMWindow) {
    nsCOMPtr<nsIDocShell> docShell = mDOMWindow->GetDocShell();
    if (docShell) {
      RefPtr<PresShell> presShell = docShell->GetPresShell();
      if (presShell) {
        nscolor bgcolor = presShell->GetCanvasBackground();
        Unused << SendSetBackgroundColor(bgcolor);
      }
    }
  }

  return SendOnFirstPaint(aX, aY) ? NS_OK : NS_ERROR_FAILURE;
}
We want to find out where this is called and for that we can use the debugger. Using it gives us the following backtrace.
Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 2, non-virtual thunk to 
    mozilla::embedlite::EmbedLiteViewChild::OnFirstPaint(int, int) ()
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.h:57
57        NS_DECL_NSIEMBEDBROWSERCHROMELISTENER
(gdb) bt
#0  non-virtual thunk to mozilla::embedlite::EmbedLiteViewChild::OnFirstPaint(
    int, int) ()
    at mobile/sailfishos/embedshared/EmbedLiteViewChild.h:57
#1  0x0000007fbb306f38 in WebBrowserChrome::HandleEvent (this=0x7f8ccba130, 
    aEvent=0x7f8dd57a60)
    at mobile/sailfishos/utils/WebBrowserChrome.cpp:443
#2  0x0000007fb9d4b4a0 in mozilla::EventListenerManager::HandleEventSubType (
    this=this@entry=0x7f8ccbd3d0, aListener=<optimized out>, 
    aListener@entry=0x7f8cdb7d78, aDOMEvent=0x7f8dd57a60, 
    aCurrentTarget=<optimized out>, aCurrentTarget@entry=0x7eac138da0)
    at dom/events/EventListenerManager.cpp:1087
#3  0x0000007fb9d4e984 in mozilla::EventListenerManager::HandleEventInternal (
    this=0x7f8ccbd3d0, aPresContext=0x7f8cf82940, aEvent=0x7f8d5c7e80, 
    aDOMEvent=0x7fa7972858, aCurrentTarget=<optimized out>, 
    aEventStatus=<optimized out>, aItemInShadowTree=<optimized out>)
    at dom/events/EventListenerManager.cpp:1279
#4  0x0000007fb9d4f604 in mozilla::EventTargetChainItem::HandleEventTargetChain 
    (aChain=..., aVisitor=..., aCallback=aCallback@entry=0x0, aCd=...)
    at dom/events/EventDispatcher.cpp:558
#5  0x0000007fb9d50eb8 in mozilla::EventDispatcher::Dispatch (
    aTarget=<optimized out>, aPresContext=0x7f8cf82940, aEvent=0x7f8d5c7e80, 
    aDOMEvent=0x7f8dd57a60, aEventStatus=<optimized out>, aCallback=0x0, 
    aTargets=<optimized out>)
    at dom/events/EventDispatcher.cpp:1055
#6  0x0000007fba7f8f50 in nsPresContext::FireDOMPaintEvent (this=0x7f8cf82940, 
    aList=aList@entry=0x7ecc05ae78, aTransactionId=..., aTimeStamp=...)
    at obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:847
#7  0x0000007fba7f9090 in DelayedFireDOMPaintEvent::Run (this=0x7ecc05ae50)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#8  0x0000007fb917d1a8 in nsContentUtils::RemoveScriptBlocker ()
    at dom/base/nsContentUtils.cpp:5407
#9  0x0000007fb917d218 in nsContentUtils::RemoveScriptBlocker ()
    at dom/base/nsContentUtils.cpp:5386
#10 0x0000007fba7fd020 in nsAutoScriptBlocker::~nsAutoScriptBlocker (
    this=<synthetic pointer>, __in_chrg=<optimized out>)
    at dom/base/nsContentUtils.h:3461
#11 nsRootPresContext::<lambda()>::operator() (__closure=0x7f8dd1cfe8)
    at layout/base/nsPresContext.cpp:2825
#12 mozilla::layers::GenericNamedTimerCallback<nsRootPresContext::
    EnsureEventualDidPaintEvent(nsPresContext::TransactionId)::<lambda()> >::
    Notify(nsITimer *) (this=0x7f8dd1cfd0) at obj-build-mer-qt-xr/dist/include/
    mozilla/layers/APZThreadUtils.h:81
#13 0x0000007fb84653e0 in nsTimerImpl::Fire (this=0x7f8cecfb50, aGeneration=1)
    at obj-build-mer-qt-xr/dist/include/mozilla/TimeStamp.h:136
#14 0x0000007fb84654c0 in nsTimerEvent::Run (this=0x7ed4001ab0)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#15 0x0000007fb845e4c4 in nsThread::ProcessNextEvent (aResult=0x7fa7972d77, 
    aMayWait=<optimized out>, this=0x7f8c02f610)
    at xpcom/threads/nsThread.cpp:1211
[...]
#32 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) 
That's a long backtrace but the magic appears to be happening inside a call to DelayedFireDOMPaintEvent() and we can check for this by searching through the code. It gets set up in nsPresContext::NotifyDidPaintForSubtree() inside nsPresContext.cpp where we have the following code, which is basically the same in both ESR 78 and ESR 91. This is a long piece of code, but the important parts are the two calls to DelayedFireDOMPaintEvent() which you can see around the middle and towards the end of the method:
void nsPresContext::NotifyDidPaintForSubtree(
    TransactionId aTransactionId, const mozilla::TimeStamp& aTimeStamp) {
  if (mFirstContentfulPaintTransactionId && !mHadContentfulPaintComposite) {
    if (aTransactionId >= *mFirstContentfulPaintTransactionId) {
      mHadContentfulPaintComposite = true;
      RefPtr<nsDOMNavigationTiming> timing = mDocument->GetNavigationTiming();
      if (timing && !aTimeStamp.IsNull()) {
        timing->NotifyContentfulPaintForRootContentDocument(aTimeStamp);
      }
    }
  }

  if (IsRoot()) {
    static_cast<nsRootPresContext*>(this)->CancelDidPaintTimers(aTransactionId);

    if (mTransactions.IsEmpty()) {
      return;
    }
  }

  if (!PresShell()->IsVisible() && mTransactions.IsEmpty()) {
    return;
  }

  // Non-root prescontexts fire MozAfterPaint to all their descendants
  // unconditionally, even if no invalidations have been collected. This is
  // because we don't want to eat the cost of collecting invalidations for
  // every subdocument (which would require putting every subdocument in its
  // own layer).

  bool sent = false;
  uint32_t i = 0;
  while (i < mTransactions.Length()) {
    if (mTransactions[i].mTransactionId <= aTransactionId) {
      if (!mTransactions[i].mInvalidations.IsEmpty()) {
        nsCOMPtr<nsIRunnable> ev = new DelayedFireDOMPaintEvent(
            this, &mTransactions[i].mInvalidations,
            mTransactions[i].mTransactionId, aTimeStamp);
        nsContentUtils::AddScriptRunner(ev);
        sent = true;
      }
      mTransactions.RemoveElementAt(i);
    } else {
      i++;
    }
  }

  if (!sent) {
    nsTArray<nsRect> dummy;
    nsCOMPtr<nsIRunnable> ev =
        new DelayedFireDOMPaintEvent(this, &dummy, aTransactionId, aTimeStamp);
    nsContentUtils::AddScriptRunner(ev);
  }
As anticipated there's no hit for this method on ESR 91, but on ESR 78 we can get a backtrace:
(gdb) break DelayedFireDOMPaintEvent::DelayedFireDOMPaintEvent
Breakpoint 8 at 0x7fba7fcc8c: DelayedFireDOMPaintEvent::
    DelayedFireDOMPaintEvent. (2 locations)
(gdb) r
[...]
Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 8, 0x0000007fba7fcc8c in 
    DelayedFireDOMPaintEvent::DelayedFireDOMPaintEvent (aTimeStamp=..., 
    aTransactionId=..., aList=<optimized out>, aPresContext=<optimized out>, 
    this=<optimized out>)
    at layout/base/nsPresContext.h:173
173         return mPresShell;
(gdb) bt
#0  0x0000007fba7fcc8c in DelayedFireDOMPaintEvent::DelayedFireDOMPaintEvent (
    aTimeStamp=..., aTransactionId=..., aList=<optimized out>, 
    aPresContext=<optimized out>, this=<optimized out>)
    at layout/base/nsPresContext.h:173
#1  nsPresContext::NotifyDidPaintForSubtree (this=0x7f8ce6dda0, 
    aTransactionId=..., aTimeStamp=...)
    at layout/base/nsPresContext.cpp:2082
#2  0x0000007fba7fd01c in nsRootPresContext::<lambda()>::operator() (
    __closure=0x7f8dc1af58)
    at obj-build-mer-qt-xr/dist/include/mozilla/TimeStamp.h:400
#3  mozilla::layers::GenericNamedTimerCallback<nsRootPresContext::
    EnsureEventualDidPaintEvent(nsPresContext::TransactionId)::<lambda()> >::
    Notify(nsITimer *) (this=0x7f8dc1af40) at obj-build-mer-qt-xr/dist/include/
    mozilla/layers/APZThreadUtils.h:81
#4  0x0000007fb84653e0 in nsTimerImpl::Fire (this=0x7f8c72bfd0, aGeneration=1)
    at obj-build-mer-qt-xr/dist/include/mozilla/TimeStamp.h:136
#5  0x0000007fb84654c0 in nsTimerEvent::Run (this=0x7ecc001ab0)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#6  0x0000007fb845e4c4 in nsThread::ProcessNextEvent (aResult=0x7fa7972d77, 
    aMayWait=<optimized out>, this=0x7f8c02f610)
    at xpcom/threads/nsThread.cpp:1211
#7  nsThread::ProcessNextEvent (this=0x7f8c02f610, aMayWait=<optimized out>, 
    aResult=0x7fa7972d77)
    at xpcom/threads/nsThread.cpp:1047
#8  0x0000007fb845cdcc in NS_ProcessPendingEvents (
    aThread=aThread@entry=0x7f8c02f610, aTimeout=10)
    at xpcom/threads/nsThreadUtils.cpp:449
#9  0x0000007fba63c430 in nsBaseAppShell::NativeEventCallback (
    this=0x7f8c3e6ac0)
    at widget/nsBaseAppShell.cpp:87
#10 0x0000007fba64f0bc in nsAppShell::event (this=<optimized out>, e=<optimized 
    out>)
    at widget/qt/nsAppShell.cpp:75
[...]
#23 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb) 
Frustratingly it's called within a lambda function, which is messing up the backtrace yet again. So we'll have to resort to digging through the code again. There we can see that the NotifyDidPaintForSubtree() method gets called in a lambda function that's triggered inside itself. But this can't be the one we're after because we'd have seen the hit on the containing method first.

But there's another case where a lambda function gets configured to then call nsPresContext::NotifyDidPaintForSubtree() and that's in the following method from nsPresContext.cpp (this is the ESR 78 version):
void
nsRootPresContext::EnsureEventualDidPaintEvent(TransactionId aTransactionId)
{
  for (NotifyDidPaintTimer& t : mNotifyDidPaintTimers) {
    if (t.mTransactionId == aTransactionId) {
      return;
    }
  }

  nsCOMPtr<nsITimer> timer;
  RefPtr<nsRootPresContext> self = this;
  nsresult rv = NS_NewTimerWithCallback(
    getter_AddRefs(timer),
    NewNamedTimerCallback([self, aTransactionId](){
      nsAutoScriptBlocker blockScripts;
      self->NotifyDidPaintForSubtree(aTransactionId);
     }, &quot;NotifyDidPaintForSubtree&quot;), 100, nsITimer::TYPE_ONE_SHOT,
    Document()->EventTargetFor(TaskCategory::Other));

  if (NS_SUCCEEDED(rv)) {
    NotifyDidPaintTimer* t = mNotifyDidPaintTimers.AppendElement();
    t->mTransactionId = aTransactionId;
    t->mTimer = timer;
  }
}
That's the only other case I can see, so we could check to find out how this EnsureEventualDidPaintEvent() method gets called. However as I was in the process of configuring the debugger I noticed that this method doesn't exist in ESR 91. That's unexpected. In ESR 78 it's called from nsPresContext::NotifyInvalidation() like this:
void nsPresContext::NotifyInvalidation(TransactionId aTransactionId,
                                       const nsRect& aRect) {
  MOZ_ASSERT(GetContainerWeak(), &quot;Invalidation in detached pres 
    context&quot;);

  // If there is no paint event listener, then we don't need to fire
  // the asynchronous event. We don't even need to record invalidation.
  // MayHavePaintEventListener is pretty cheap and we could make it
  // even cheaper by providing a more efficient
  // nsPIDOMWindow::GetListenerManager.

  nsPresContext* pc;
  for (pc = this; pc; pc = pc->GetParentPresContext()) {
    TransactionInvalidations* transaction =
        pc->GetInvalidations(aTransactionId);
    if (transaction) {
      break;
    } else {
      transaction = pc->mTransactions.AppendElement();
      transaction->mTransactionId = aTransactionId;
    }
  }
  if (!pc) {
    nsRootPresContext* rpc = GetRootPresContext();
    if (rpc) {
      rpc->EnsureEventualDidPaintEvent(aTransactionId);
    }
  }

  TransactionInvalidations* transaction = GetInvalidations(aTransactionId);
  MOZ_ASSERT(transaction);
  transaction->mInvalidations.AppendElement(aRect);
}
But as we can see, the call to EnsureEventualDidPaintEvent() has been removed in the ESR 91 version of the same method:
void nsPresContext::NotifyInvalidation(TransactionId aTransactionId,
                                       const nsRect& aRect) {
  MOZ_ASSERT(GetContainerWeak(), &quot;Invalidation in detached pres 
    context&quot;);

  // If there is no paint event listener, then we don't need to fire
  // the asynchronous event. We don't even need to record invalidation.
  // MayHavePaintEventListener is pretty cheap and we could make it
  // even cheaper by providing a more efficient
  // nsPIDOMWindow::GetListenerManager.

  nsPresContext* pc;
  for (pc = this; pc; pc = pc->GetParentPresContext()) {
    TransactionInvalidations* transaction =
        pc->GetInvalidations(aTransactionId);
    if (transaction) {
      break;
    } else {
      transaction = pc->mTransactions.AppendElement();
      transaction->mTransactionId = aTransactionId;
    }
  }

  TransactionInvalidations* transaction = GetInvalidations(aTransactionId);
  MOZ_ASSERT(transaction);
  transaction->mInvalidations.AppendElement(aRect);
}
I wasn't expecting that. For completeness, the backtrace from ESR 78 confirms that this is indeed the correct method to be looking at, at least on the ESR 78 side:
Thread 7 &quot;GeckoWorkerThre&quot; hit Breakpoint 9, nsRootPresContext::
    EnsureEventualDidPaintEvent (this=0x7f8ce70330, 
    aTransactionId=aTransactionId@entry=...)
    at layout/base/nsPresContext.cpp:2813
2813    {
(gdb) bt
#0  nsRootPresContext::EnsureEventualDidPaintEvent (this=0x7f8ce70330, 
    aTransactionId=aTransactionId@entry=...)
    at layout/base/nsPresContext.cpp:2813
#1  0x0000007fba7fb480 in nsPresContext::NotifyInvalidation (
    this=this@entry=0x7f8ce70330, aTransactionId=..., aRect=...)
    at layout/base/nsPresContext.cpp:1964
#2  0x0000007fba7fb5d8 in nsPresContext::NotifyInvalidation (
    this=this@entry=0x7f8ce70330, aTransactionId=..., aRect=...)
    at layout/base/nsPresContext.cpp:1927
#3  0x0000007fbaa4dbe4 in nsDisplayList::PaintRoot (
    this=this@entry=0x7fa7971f48, aBuilder=aBuilder@entry=0x7fa7970170, 
    aCtx=aCtx@entry=0x0, 
    aFlags=aFlags@entry=13) at layout/painting/nsDisplayList.cpp:2526
#4  0x0000007fba7d656c in nsLayoutUtils::PaintFrame (
    aRenderingContext=aRenderingContext@entry=0x0, 
    aFrame=aFrame@entry=0x7f8dba5c40, aDirtyRegion=..., 
    aBackstop=aBackstop@entry=4294967295, 
    aBuilderMode=aBuilderMode@entry=nsDisplayListBuilderMode::Painting, 
    aFlags=<optimized out>, 
    aFlags@entry=(nsLayoutUtils::PaintFrameFlags::WidgetLayers | nsLayoutUtils::
    PaintFrameFlags::ExistingTransaction | nsLayoutUtils::PaintFrameFlags::
    NoComposite)) at layout/base/nsLayoutUtils.cpp:4168
#5  0x0000007fba79a258 in mozilla::PresShell::Paint (
    this=this@entry=0x7f8d2dd790, aViewToPaint=aViewToPaint@entry=0x7f8d210390, 
    aDirtyRegion=..., 
    aFlags=aFlags@entry=mozilla::PaintFlags::PaintLayers)
    at layout/base/PresShell.cpp:6254
#6  0x0000007fba6110ac in nsViewManager::ProcessPendingUpdatesPaint (
    this=this@entry=0x7f8d210df0, aWidget=aWidget@entry=0x7f8d210e60)
    at obj-build-mer-qt-xr/dist/include/nsTArray.h:554
#7  0x0000007fba6113c0 in nsViewManager::ProcessPendingUpdatesForView (
    this=this@entry=0x7f8d210df0, aView=<optimized out>, 
    aFlushDirtyRegion=aFlushDirtyRegion@entry=true) at view/nsViewManager.cpp:
    395
#8  0x0000007fba611b50 in nsViewManager::ProcessPendingUpdates (
    this=0x7f8d210df0)
    at view/nsViewManager.cpp:1018
#9  nsViewManager::ProcessPendingUpdates (this=<optimized out>)
    at view/nsViewManager.cpp:1004
#10 0x0000007fba611c0c in nsViewManager::WillPaintWindow (
    this=this@entry=0x7f8d210df0, aWidget=0x7f8d210e60)
    at view/nsViewManager.cpp:664
#11 0x0000007fba611c88 in nsView::WillPaintWindow (this=<optimized out>, 
    aWidget=<optimized out>)
    at view/nsView.cpp:1048
#12 0x0000007fbb2f6344 in mozilla::embedlite::PuppetWidgetBase::Invalidate (
    aRect=..., this=0x7f8d210e60)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:268
#13 mozilla::embedlite::PuppetWidgetBase::Invalidate (this=0x7f8d210e60, 
    aRect=...)
    at mobile/sailfishos/embedshared/PuppetWidgetBase.cpp:253
#14 0x0000007fbb2ef768 in mozilla::embedlite::EmbedLitePuppetWidget::Show (
    this=0x7f8d210e60, aState=<optimized out>)
    at mobile/sailfishos/embedshared/EmbedLitePuppetWidget.cpp:97
#15 0x0000007fba7f03e8 in nsDocumentViewer::Show (this=0x7f8d15df90)
    at layout/base/nsDocumentViewer.cpp:2132
#16 0x0000007fba7f8b7c in nsPresContext::EnsureVisible (this=0x7f8ce70330)
    at obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:857
#17 0x0000007fba79c114 in mozilla::PresShell::UnsuppressAndInvalidate (
    this=0x7f8d2dd790)
    at obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#18 0x0000007fba7a1638 in mozilla::PresShell::ProcessReflowCommands (
    this=this@entry=0x7f8d2dd790, aInterruptible=aInterruptible@entry=false)
    at layout/base/PresShell.cpp:9712
#19 0x0000007fba7a0630 in mozilla::PresShell::DoFlushPendingNotifications (
    this=this@entry=0x7f8d2dd790, aFlush=...)
    at layout/base/PresShell.cpp:4209
[...]
#33 0x0000007fba772840 in mozilla::VsyncRefreshDriverTimer::
    RefreshDriverVsyncObserver::NotifyVsync (this=0x7f8cc9de00, aVsync=...)
    at layout/base/nsRefreshDriver.cpp:644
[...]
#53 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
(gdb)
Alright, let's rewind back to the top again. We've now reached the point where we know how the OnFirstPaint() method is getting called in ESR 78. And we can see why it's not happening the same way in ESR 91, viz. that the code to call it has been removed from the NotifyInvalidation() method. The next step is to find out why it's been removed and what should be triggering the OnFirstPaint() instead (assuming there is still something).

The best way to investigate this further will be to look a the Bugzilla ticket and associated changeset where this change was made upstream. We've searched for these kinds of changes many times before and now is no different. Here's the log for the change:
$ git log -S EnsureEventualDidPaintEvent -1 -- layout/base/nsPresContext.cpp
commit 1deccd7ac14706ad1849343cfb2b93df191a1c42
Author: Mantaroh Yoshinaga <mantaroh@gmail.com>
Date:   Thu Sep 6 02:21:39 2018 +0000

    Bug 1445570 - Remove EnsureEventualAfterPaint timer. r=tnikkel
    
    MozReview-Commit-ID: C7WICJ5Q0ES
    
    Differential Revision: https://phabricator.services.mozilla.com/D5005
    
    --HG--
    extra : moz-landing-system : lando
After reading through Bug 1445570 and the associated code changes my conclusion is that unravelling this is going to be too complicated for my brain in its current state, so best to let it sink in overnight. I'll pick this up again tomorrow where we'll try to find out what's replacing the trigger that we need in the updated ESR 91 code.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
21 May 2024 : Day 239 #
I got a bit carried away yesterday delving into the reasons why QuickMozView::updatePaintNode() was failing to fire on ESR 91. My conclusion was that it came down to the mIsPainted flag, part of the QMozViewPrivate class and which was remaining stubbornly set to false when running ESR 91.

While in this state the invalidTexture flag inside updatePaintNode() never gets set to false since it's defined like this:
    const bool invalidTexture = !mComposited
            || !d->mIsPainted
            || !d->mViewInitialized
            || !d->mHasCompositor
            || !d->mContext->registeredWindow()
            || !d->mMozWindow;
Notice how if any single value in this list is set to false then the combined value of invalidTexture will be set to true. The consequence of this is that the following clause is always executed:
    if (!mTexture && invalidTexture) {
        QSGSimpleRectNode *node = static_cast<QSGSimpleRectNode *>(oldNode);
        if (!node) {
            node = new QSGSimpleRectNode;
        }
        node->setColor(d->mBackgroundColor);
        node->setRect(boundingRect);

        return node;
    }
The return at the end of this means the updatePaintNode() method returns early. This means that the code beyond the clause never gets executed and crucially this section is never entered:
    if (!node) {
        QMozExtTexture * const texture = new QMozExtTexture;
        mTexture = texture;

        connect(texture, &QMozExtTexture::getPlatformImage, d->mMozWindow, 
            &QMozWindow::getPlatformImage, Qt::DirectConnection);

        node = new MozExtMaterialNode;

        node->setTexture(mTexture);
    }
Here we can see both mTexture and the MozExtMaterialNode object being created. That's what we need for the rendering to proceed. So if this portion of code is never executed, we'll never get the rendered page to appear.

That all sounds very plausible and comprehensible, but it begs the question: "why is mIsPainted never being set to true?". That's what I plan to find out over the coming days.

Unsurprisingly we can see the initial value it's set to in the QMozViewPrivate constructor, where it's initially set to false:
QMozViewPrivate::QMozViewPrivate(IMozQViewIface *aViewIface, QObject *publicPtr)
    : mViewIface(aViewIface)
[...]
    , mIsPainted(false)
[...]
It's also set to false after a call to QMozViewPrivate::reset():
QMozViewPrivate::reset()
{
    if (mIsPainted) {
        mIsPainted = false;
        mViewIface->firstPaint(-1, -1);
    }
[...]
But the important part is where it gets set to true and that only happens in one place; here in the OnFirstPaint() method:
void QMozViewPrivate::OnFirstPaint(int32_t aX, int32_t aY)
{
    mIsPainted = true;
    mViewIface->firstPaint(aX, aY);
}
This method doesn't get explicitly called anywhere in the QtMozEmbed code, so to find out where it's actually getting used I'm going to drop to the debugger on my ESR 78 device.
(gdb) break QMozViewPrivate::OnFirstPaint
Breakpoint 13 at 0x7fbfbf56b8: file qmozview_p.cpp, line 1113.
(gdb) r
[...]
Thread 1 &quot;harbour-webview&quot; hit Breakpoint 13, QMozViewPrivate::
    OnFirstPaint (this=0x55557b08f0, aX=0, aY=0) at qmozview_p.cpp:1113
1113        mIsPainted = true;
(gdb) bt                         
#0  QMozViewPrivate::OnFirstPaint (this=0x55557b08f0, aX=0, aY=0) at 
    qmozview_p.cpp:1113
#1  0x0000007fbb2f5d90 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#2  0x0000007fb8932f50 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#3  0x0000007fb8920238 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#4  0x0000007fb88428cc in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#5  0x0000007fb88482d8 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#6  0x0000007fb8849d58 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#7  0x0000007fb8809d48 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#8  0x0000007fb880e740 in ?? () from /usr/lib64/xulrunner-qt5-78.15.1/libxul.so
#9  0x0000007fbfbed924 in MessagePumpQt::HandleDispatch (this=0x555578ef30) at 
    qmessagepump.cpp:63
#10 0x0000007fbfbeda9c in MessagePumpQt::event (this=<optimized out>, 
    e=<optimized out>) at qmessagepump.cpp:51
#11 0x0000007fbebe1144 in QCoreApplication::notify(QObject*, QEvent*) () from /
    usr/lib64/libQt5Core.so.5
#12 0x0000007fbebe12e8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (
    ) from /usr/lib64/libQt5Core.so.5
#13 0x0000007fbebe36b8 in QCoreApplicationPrivate::sendPostedEvents(QObject*, 
    int, QThreadData*) () from /usr/lib64/libQt5Core.so.5
[...]
#21 0x0000005555557bcc in main ()
(gdb) 
Well that's an unhelpful mess. But it is at least getting called from inside the Gecko code and that in itself is both helpful to know and a good thing (I was concerned it might be called from inside Qt which would have made it considerably harder to track down; the fact it's coming from Gecko means we can skip a layer of Qt indirection and get closer to the source of the problem more quickly).

But as you can see the Gecko debug symbols are messed up: we can see eight calls to methods inside libxul in a row, but it's not telling us the names of the methods.

So I'm going to reinstall the packages, including the debug packages, to get the symbols back. Before the gap I was using a custom build with extra debug output and exporting of textures, but I don't need any of that any more so I can just reinstall the packages from the official repositories. Much easier and quicker than rebuilding the packages locally and installing them that way:
# zypper install --force xulrunner-qt5-78.15.1+git33.2-1.21.1.jolla 
    xulrunner-qt5-debuginfo-78.15.1+git33.2-1.21.1.jolla xul
runner-qt5-debugsource-78.15.1+git33.2-1.21.1.jolla 
    xulrunner-qt5-misc-78.15.1+git33.2-1.21.1.jolla
Loading repository data...
Reading installed packages...
Forcing installation of 'xulrunner-qt5-78.15.1+git33.2-1.21.1.jolla.aarch64' 
    from repository 'jolla'.
Forcing installation of 
    'xulrunner-qt5-debuginfo-78.15.1+git33.2-1.21.1.jolla.aarch64' from 
    repository 'jolla'.
Forcing installation of 
    'xulrunner-qt5-debugsource-78.15.1+git33.2-1.21.1.jolla.aarch64' from 
    repository 'jolla'.
Forcing installation of 
    'xulrunner-qt5-misc-78.15.1+git33.2-1.21.1.jolla.aarch64' from repository 
    'jolla'.
[...]
2 packages to downgrade, 2 to reinstall, 2  to change vendor.
Overall download size: 758.8 MiB. Already cached: 0 B. After the operation, 2.4 
    GiB will be freed.
Continue? [y/n/v/...? shows all options] (y): y
[...]
Okay, with the debug symbols now properly aligned we can try getting the backtrace again, but hopefully this time with a bit more helpful detail.
Thread 1 &quot;harbour-webview&quot; hit Breakpoint 1, QMozViewPrivate::
    OnFirstPaint (this=0x55555cb830, aX=0, aY=0) at qmozview_p.cpp:1113
1113        mIsPainted = true;
(gdb) bt
#0  QMozViewPrivate::OnFirstPaint (this=0x55555cb830, aX=0, aY=0) at 
    qmozview_p.cpp:1113
#1  0x0000007fbb2f5d90 in mozilla::embedlite::EmbedLiteViewParent::
    RecvOnFirstPaint (this=0x55555d9d70, aX=@0x7fffffea18: 0, aY=@0x7fffffea28: 
    0)
    at mobile/sailfishos/embedshared/EmbedLiteViewParent.cpp:237
#2  0x0000007fb8932f50 in mozilla::embedlite::PEmbedLiteViewParent::
    OnMessageReceived (this=0x55555d9d70, msg__=...) at 
    PEmbedLiteViewParent.cpp:1600
#3  0x0000007fb8920238 in mozilla::embedlite::PEmbedLiteAppParent::
    OnMessageReceived (this=<optimized out>, msg__=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:866
#4  0x0000007fb88428cc in mozilla::ipc::MessageChannel::DispatchAsyncMessage (
    this=this@entry=0x7f8c888578, aProxy=aProxy@entry=0x555560d1d0, aMsg=...)
    at obj-build-mer-qt-xr/dist/include/mozilla/ipc/ProtocolUtils.h:866
#5  0x0000007fb88482d8 in mozilla::ipc::MessageChannel::DispatchMessage (
    this=0x7f8c888578, aMsg=...)
    at ipc/glue/MessageChannel.cpp:2100
#6  0x0000007fb8849b50 in mozilla::ipc::MessageChannel::RunMessage (
    this=<optimized out>, aTask=...)
    at ipc/glue/MessageChannel.cpp:1959
#7  0x0000007fb8849d58 in mozilla::ipc::MessageChannel::MessageTask::Run (
    this=0x7f8e127650)
    at obj-build-mer-qt-xr/dist/include/mozilla/ipc/MessageChannel.h:610
#8  0x0000007fb8809d48 in MessageLoop::RunTask (aTask=..., this=0x55557c5200)
    at ipc/chromium/src/base/message_loop.cc:487
#9  MessageLoop::DeferOrRunPendingTask (pending_task=..., this=0x55557c5200)
    at ipc/chromium/src/base/message_loop.cc:478
#10 MessageLoop::DeferOrRunPendingTask (this=0x55557c5200, pending_task=...)
    at ipc/chromium/src/base/message_loop.cc:476
#11 0x0000007fb880e740 in MessageLoop::DoWork (this=<optimized out>)
    at ipc/chromium/src/base/message_loop.cc:551
#12 MessageLoop::DoWork (this=0x55557c5200) at ipc/chromium/src/base/
    message_loop.cc:530
#13 0x0000007fbfbed924 in MessagePumpQt::HandleDispatch (this=0x55557a6f40) at 
    qmessagepump.cpp:63
#14 0x0000007fbfbeda9c in MessagePumpQt::event (this=<optimized out>, 
    e=<optimized out>) at qmessagepump.cpp:51
#15 0x0000007fbebe1144 in QCoreApplication::notify(QObject*, QEvent*) () from /
    usr/lib64/libQt5Core.so.5
#16 0x0000007fbebe12e8 in QCoreApplication::notifyInternal2(QObject*, QEvent*) (
    ) from /usr/lib64/libQt5Core.so.5
#17 0x0000007fbebe36b8 in QCoreApplicationPrivate::sendPostedEvents(QObject*, 
    int, QThreadData*) () from /usr/lib64/libQt5Core.so.5
[...]
#25 0x0000005555557bcc in main ()
(gdb) 
That's much better and gives us a much clearer idea of where OnFirstPaint() is getting called. But as we can see from this it's actually being triggered by receipt of a message. The next step will therefore be to find out where the OnFirstPaint message is being sent from.

For completeness I also performed the same checks on ESR 91. You'll not be surprised to hear that the QMozViewPrivate::OnFirstPaint() method is never called in the newer build.

The ESR 78 backtrace tells us what should be happening on ESR 91. So our task now is to find out why it's not. This takes us a step forwards and gives us something to look into further; we'll pick this avenue of investigation up again tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
20 May 2024 : Day 238 #
It's been an exciting and packed day of Jolla and Sailfish OS news today with the announcement of not one but two new Jolla devices, plus some significant changes to the way Sailfish OS will be developed and funded in the future. This isn't the place to dwell on these topics, but I am excited for the future of the operating system. I didn't want it to go without saying, but I do want to focus on Gecko here, so let's head straight in to development.

Yesterday I made what could be an important discovery. The QMozExtTexture::updateTexture() method is being called on ESR 78, but not on ESR 91. If I can track down the underlying reason for this, then it may well help getting the offscreen rendering back up and running again.

The next step is to find out at which point the failure is happening and try to understand why. Before we get into it I want to warn you that this is rather a long post, mostly containing debugging output which is dull at the best of times. But it does lead to some useful conclusions, so if you're interest to follow but don't want to have to read through all of the details, don't feel bad if you want to skip straight to the end!

If you're still reading this then I'll assume you're in it for the long haul. Don't say I didn't warn you though! So the best place to start this investigation is with the backtrace of the call. We saw this yesterday and it looked like this:
#0  QMozExtTexture::updateTexture (this=0x7f90024e30) at qmozexttexture.cpp:64
#1  0x0000007fbfc04f18 in MozMaterialNode::preprocess (this=0x7f9001ef60) at 
    qmozextmaterialnode.cpp:81
#2  0x0000007fbf8f0c44 in QSGRenderer::preprocess() () from /usr/lib64/
    libQt5Quick.so.5
#3  0x0000007fbf8f0e5c in QSGRenderer::renderScene(QSGBindable const&) () from /
    usr/lib64/libQt5Quick.so.5
#4  0x0000007fbf8f14c0 in QSGRenderer::renderScene(unsigned int) () from /usr/
    lib64/libQt5Quick.so.5
#5  0x0000007fbf8ffe58 in QSGRenderContext::renderNextFrame(QSGRenderer*, 
    unsigned int) () from /usr/lib64/libQt5Quick.so.5
#6  0x0000007fbf9429d0 in QQuickWindowPrivate::renderSceneGraph(QSize const&) (
    ) from /usr/lib64/libQt5Quick.so.5
#7  0x0000007fbf916c04 in ?? () from /usr/lib64/libQt5Quick.so.5
#8  0x0000007fbf91cc10 in ?? () from /usr/lib64/libQt5Quick.so.5
#9  0x0000007fbea290e8 in ?? () from /usr/lib64/libQt5Core.so.5
#10 0x0000007fb74b0a4c in start_thread (arg=0x7fffffe9bf) at pthread_create.c:
    479
#11 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
This backtrace is, of course, from the ESR 78 build. There is no equivalent backtrace for the ESR 91 build because the QMozExtTexture::updateTexture() method never gets called in ESR 91. But maybe some of the other methods in the backtrace are being called. Let's find out.

To establish this I'm putting breakpoints on each of the functions in the backtrace and then running the ESR 91 build. If one of them hits I move to one higher up the stack (that is, the next line up that has a smaller number). Eventually one will fail to hit and at that point I'll know where the failure is happening. So here goes.
Thread 68 &quot;QSGRenderThread&quot; hit Breakpoint 4, 0x0000007ff7b6e984 in 
    QQuickWindowPrivate::renderSceneGraph(QSize const&)@plt ()
   from /usr/lib64/libQt5Quick.so.5
(gdb) delete break 4
(gdb) b QSGRenderContext::renderNextFrame
Breakpoint 5 at 0x7fe82e4474 (2 locations)
(gdb) c
Continuing.

Thread 68 &quot;QSGRenderThread&quot; hit Breakpoint 5, 0x0000007fe82e4474 in 
    QSGRenderContext::renderNextFrame(QSGRenderer*, unsigned int)@plt ()
   from /usr/lib64/qt5/plugins/scenegraph/libcustomcontext.so
(gdb) delete break 5
(gdb) b QSGRenderer::preprocess
Breakpoint 6 at 0x7ff7be7a50
(gdb) c
Continuing.

Thread 68 &quot;QSGRenderThread&quot; hit Breakpoint 6, 0x0000007ff7be7a50 in 
    QSGRenderer::preprocess() () from /usr/lib64/libQt5Quick.so.5
(gdb) delete break 6
(gdb) b MozMaterialNode::preprocess
Breakpoint 7 at 0x7ff7ef899c
(gdb) c
Continuing.
frame008.data: Colour before: (0, 0, 0, 255), 1
frame009.data: Colour before: (0, 0, 0, 255), 1
frame010.data: Colour before: (0, 0, 0, 255), 1
frame011.data: Colour before: (0, 0, 0, 255), 1
[...]
As we can see from the above, the QQuickWindowPrivate::renderSceneGraph() method does get called. So does the QSGRenderContext::renderNextFrame() method. but the QSGRenderer::preprocess() method? That never gets called.

That means the failure is happening inside the QSGRenderContext::renderNextFrame() method. This call isn't part of the Gecko code, or indeed part of the QtMozEmbed code, but rather part of the Qt stack. In particular, part of the "Qt Scene Graph" stack (hence the "QSG" prefix of the method).

I'm not super familiar with how the Qt Scene Graph does it's thing, but there is plenty of good documentation out there. So I've done some reading around to get a better idea.

One part sprang out at me, since it specifically relates to the method that isn't getting called. The explanation that follows is taken from the Qt documentation:
 
void QSGNode::preprocess()

Override this function to do processing on the node before it is rendered.

Preprocessing needs to be explicitly enabled by setting the flag QSGNode::UsePreprocess. The flag needs to be set before the node is added to the scene graph and will cause the preprocess() function to be called for every frame the node is rendered.

So maybe the problem is that the QSGNode::UsePreprocess flag is never being set? A quick scan of the QtMozEmbed code throws up the following bit of code for controlling the flag, in the qmozextmaterialnode.cpp file:
MozMaterialNode::MozMaterialNode()
{
    setFlag(UsePreprocess);

    setGeometry(&m_geometry);
}
So, we should try to establish whether this is getting called or not. Running with some breakpoints shows that it is indeed getting called on the ESR 78 build.
(gdb) break MozMaterialNode::MozMaterialNode
Breakpoint 2 at 0x7fbfbdadc4 (2 locations)
(gdb) r
[...]
Thread 11 &quot;QSGRenderThread&quot; hit Breakpoint 2, MozMaterialNode::
    MozMaterialNode (this=this@entry=0x7f0001ef60) at qmozextmaterialnode.cpp:27
27      MozMaterialNode::MozMaterialNode()
(gdb) n
619     /usr/include/qt5/QtCore/qrect.h: No such file or directory.
(gdb) 
27      MozMaterialNode::MozMaterialNode()
(gdb) 
31          setGeometry(&m_geometry);
(gdb) 
That's no big surprise, since if it were not being called the other parts of the render pipeline would also be failing and they're not. How about on ESR 91 though? We know the render pipeline isn't working there and that could be because this call isn't happening. Let's find out by adding the same breakpoint on the ESR 91 build and executing it just as we did for ESR 78:
(gdb) break MozMaterialNode::MozMaterialNode
Breakpoint 8 at 0x7ff7ece9e4 (2 locations)
(gdb) r
[...]
frame014.data: Colour before: (8, 0, 152, 255), 1
frame015.data: Colour before: (1, 0, 255, 255), 1
frame016.data: Colour before: (223, 5, 0, 255), 1
frame017.data: Colour before: (71, 94, 162, 255), 1
^C
Thread 1 &quot;harbour-webview&quot; received signal SIGINT, Interrupt.
0x0000007ff69f8740 in poll () from /lib64/libc.so.6
(gdb) info break
Num     Type           Disp Enb Address            What
8       breakpoint     keep y   <MULTIPLE>         
8.1                         y     0x0000007ff7ece9e4 <MozMaterialNode::
    MozMaterialNode()@plt+4>
8.2                         y     0x0000007ff7ef9028 <MozMaterialNode::
    MozMaterialNode()+16>
(gdb) 
Here the breakpoint doesn't get called. You can see that I double-checked that the breakpoint is set correctly and it is. So this is looking very much like a smoking gun.

We'll use the same approach we used before to try to figure out why this isn't getting called. We take the backtrace of the ESR 78 build where it is getting called and work our way up through it until we find out what's failing on the ESR 91 side. Here's the backtrace, taken from ESR 78:
#0  MozMaterialNode::MozMaterialNode (this=this@entry=0x7f0001ef60) at 
    qmozextmaterialnode.cpp:31
#1  0x0000007fbfc05834 in MozExtMaterialNode::MozExtMaterialNode (
    this=0x7f0001ef60) at qmozextmaterialnode.cpp:376
#2  0x0000007fbfc03ea4 in QuickMozView::updatePaintNode (this=0x55558433a0, 
    oldNode=<optimized out>)
    at /usr/include/xulrunner-qt5-78.15.1/mozilla/cxxalloc.h:33
#3  0x0000007fbf941c50 in QQuickWindowPrivate::updateDirtyNode(QQuickItem*) () 
    from /usr/lib64/libQt5Quick.so.5
#4  0x0000007fbf94214c in QQuickWindowPrivate::updateDirtyNodes() () from /usr/
    lib64/libQt5Quick.so.5
#5  0x0000007fbf943270 in QQuickWindowPrivate::syncSceneGraph() () from /usr/
    lib64/libQt5Quick.so.5
#6  0x0000007fbf9164a8 in ?? () from /usr/lib64/libQt5Quick.so.5
#7  0x0000007fbf917134 in ?? () from /usr/lib64/libQt5Quick.so.5
#8  0x0000007fbf91cc10 in ?? () from /usr/lib64/libQt5Quick.so.5
#9  0x0000007fbea290e8 in ?? () from /usr/lib64/libQt5Core.so.5
#10 0x0000007fb74b0a4c in start_thread (arg=0x7fffffe9bf) at pthread_create.c:
    479
#11 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
It's important to note that the call at the top to MozMaterialNode::MozMaterialNode() — which is the one we're interested in — isn't a normal call, it's a constructor. It's not being called directly, but rather as a side-effect of a call to the MozExtMaterialNode class constructor. This inherits from the MozMaterialNode class as we can see in this code copied from qozextmaterialnode.cpp:
class MozExtMaterialNode : public MozMaterialNode
{
public:
    MozExtMaterialNode();
    ~MozExtMaterialNode();
[...]
As an aside, there's also a MozRgbMaterialNode class, but it looks like we're not using that:
class MozRgbMaterialNode : public MozMaterialNode
{
public:
    MozRgbMaterialNode();
    ~MozRgbMaterialNode();
[...]
A quick check confirms that it's definitely only the MozExtMaterialNode class in use for us here:
(gdb) break MozRgbMaterialNode
Breakpoint 3 at 0x7fbfc056a0: file qmozextmaterialnode.cpp, line 176.
(gdb) b MozExtMaterialNode
Breakpoint 4 at 0x7fbfbdc174 (2 locations)
(gdb) r
[...]
Thread 10 &quot;QSGRenderThread&quot; hit Breakpoint 4, 0x0000007fbfbdc174 in 
    MozExtMaterialNode::MozExtMaterialNode()@plt () from /usr/lib64/
    libqt5embedwidget.so.1
(gdb)                            
Another check of the backtrace and an examination of the code shows that the place where this is constructed is inside the QuickMozView::updatePaintNode() method; part of QtMozEmbed. We're going to be spending a bit of time inside this code, so it's worth taking a look for yourself at the source. But for reference, here's the full source for the method we're dealing with:
QSGNode *
QuickMozView::updatePaintNode(QSGNode *oldNode, UpdatePaintNodeData *)
{
    // If the dimensions are entirely invalid return no node.
    if (width() <= 0 || height() <= 0) {
        delete oldNode;

        delete mTexture;
        mTexture = nullptr;

        return nullptr;
    }

    const bool invalidTexture = !mComposited
            || !d->mIsPainted
            || !d->mViewInitialized
            || !d->mHasCompositor
            || !d->mContext->registeredWindow()
            || !d->mMozWindow;

    if (mTexture && invalidTexture) {
        delete oldNode;
        oldNode = nullptr;

        delete mTexture;
        mTexture = nullptr;
    }

    QRectF boundingRect(d->renderingOffset(), d->mSize);

    if (!mTexture && invalidTexture) {
        QSGSimpleRectNode *node = static_cast<QSGSimpleRectNode *>(oldNode);
        if (!node) {
            node = new QSGSimpleRectNode;
        }
        node->setColor(d->mBackgroundColor);
        node->setRect(boundingRect);

        return node;
    }

    if (!mTexture) {
        delete oldNode;
        oldNode = nullptr;
    }

    MozMaterialNode *node = static_cast<MozMaterialNode *>(oldNode);

    if (!node) {
#if defined(QT_OPENGL_ES_2)
        QMozExtTexture * const texture = new QMozExtTexture;
        mTexture = texture;

        connect(texture, &QMozExtTexture::getPlatformImage, d->mMozWindow, 
    &QMozWindow::getPlatformImage, Qt::DirectConnection);

        node = new MozExtMaterialNode;
#else
#warning &quot;Implement me for non ES2 platform&quot;
//        node = new MozRgbMaterialNode;
        return nullptr;
#endif
Note the call there to new MozExtMaterialNode. That's the line we're really interested in. On ESR 78 when we step through the code we initially get an execution flow that enters the (!mTexture && invalidTexture) clause. Once we're in there we know the method is going to return early and the MozExtMaterialNode constructor isn't going to get called until we re-enter the method.

After a few cycles through like this, there's suddenly a change. While the mTexture value remains unset, the invalidTexture value flips to false which means the clause gets skipped. We then end up inside the (!node) section, which is where things get interesting. At that point the texture is created, as is the MozExtMaterialNode object and now things are properly set up for rendering.

Let's break this down a bit further. Here's what the initial few cycles look like:
(gdb) break QuickMozView::updatePaintNode
Breakpoint 8 at 0x7faf06f764 (2 locations)
(gdb) r
[...]
Thread 10 &quot;QSGRenderThread&quot; hit Breakpoint 8, QuickMozView::
    updatePaintNode (this=0x55558454e0, oldNode=0x0) at quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) delete break
Delete all breakpoints? (y or n) y
(gdb) break quickmozview.cpp:214
Breakpoint 9 at 0x7fbfc03fac: file quickmozview.cpp, line 214.
(gdb) c
Continuing.
[...]
Thread 10 &quot;QSGRenderThread&quot; hit Breakpoint 9, QuickMozView::
    updatePaintNode (this=0x55558454e0, oldNode=0x7f0801eab0) at 
    quickmozview.cpp:216
216         if (!node) {
(gdb) 
Stepping through this section more carefully we can see why the invalidTexture flag is set to true in all these cases:
Thread 10 &quot;QSGRenderThread&quot; hit Breakpoint 12, QuickMozView::
    updatePaintNode (this=0x5555842660, oldNode=0x7f0800c400) at 
    quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) p mComposited
$23 = true
(gdb) p d->mIsPainted
$24 = false
(gdb) p d->mViewInitialized
$25 = true
(gdb) p d->mHasCompositor
$26 = true
(gdb) p d->mContext->registeredWindow()
$27 = (QMozWindow *) 0x555560ce60
(gdb) p d->mMozWindow
$28 = {wp = {d = 0x5555bbe8e0, value = 0x555560ce60}}
(gdb) 
Crucially here, we can see that p d->mIsPainted is set to false. While any of the items shown here are set to false the invalidTexture flag will be set to true. But then after three or four cycles like this the d->mIsPainted flag suddenly flips to true at which point everything else flows through as we want.
Thread 10 &quot;QSGRenderThread&quot; hit Breakpoint 12, QuickMozView::
    updatePaintNode (this=0x5555842660, oldNode=0x7f0800c400) at 
    quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) p d->mIsPainted
$37 = true
(gdb) n
181         const bool invalidTexture = !mComposited
(gdb) 
186                 || !d->mMozWindow;
(gdb) 
196         QRectF boundingRect(d->renderingOffset(), d->mSize);
(gdb) 
198         if (!mTexture && invalidTexture) {
(gdb) 
210             delete oldNode;
(gdb) 
218             QMozExtTexture * const texture = new QMozExtTexture;
(gdb) 
219             mTexture = texture;
(gdb) 
221             connect(texture, &QMozExtTexture::getPlatformImage, 
    d->mMozWindow, &QMozWindow::getPlatformImage, Qt::DirectConnection);
(gdb) p mTexture
$38 = (QSGTexture *) 0x7f08024070
(gdb) n
223             node = new MozExtMaterialNode;
(gdb) 
230             node->setTexture(mTexture);
(gdb) 
Nice! But this never happens on ESR 91. In the case of ESR 91 the d->mIsPainted flag remains set to false throughout. Consequently we never get to the point where the texture or MozExtMaterialNode object are created on ESR 91:
(gdb) break QuickMozView::updatePaintNode
Breakpoint 9 at 0x7fe790b764 (2 locations)
(gdb) r
[...]
Thread 9 &quot;QSGRenderThread&quot; hit Breakpoint 1, QuickMozView::
    updatePaintNode (this=0x555586cf70, oldNode=0x0) at quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {

(gdb) break quickmozview.cpp:223
Breakpoint 11 at 0x7fbfc03ea4: file quickmozview.cpp, line 230.
(gdb) r
[...]
We end up just getting this same sequence again and again and again:
Thread 9 &quot;QSGRenderThread&quot; hit Breakpoint 7, QuickMozView::
    updatePaintNode (this=0x555586d0d0, oldNode=0x7fc800c640) at 
    quickmozview.cpp:172
172         if (width() <= 0 || height() <= 0) {
(gdb) n
181         const bool invalidTexture = !mComposited
(gdb) p mComposited
$7 = true
(gdb) p d->mIsPainted
$8 = false
(gdb) n
188         if (mTexture && invalidTexture) {
(gdb) p mTexture
$9 = (QSGTexture *) 0x0
(gdb) n
196         QRectF boundingRect(d->renderingOffset(), d->mSize);
(gdb) 
198         if (!mTexture && invalidTexture) {
(gdb) 
200             if (!node) {
(gdb) 
203             node->setColor(d->mBackgroundColor);
(gdb) 
204             node->setRect(boundingRect);
(gdb) 
206             return node;
(gdb) c
Clearly the question we have to answer is "why is d->mIsPainted never set to true?". If we can answer this, who knows, we might even be on the right path to solving the rendering problem.

This has been a rather long investigation today already, so I'm going to leave it there for now. But tomorrow I'm going to come back to this in order to try to answer this question. This feels like it could be a very fruitful line of enquiry!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
19 May 2024 : Day 237 #
I was easing myself gently back into Gecko development yesterday with a quick look over some of the work that others in the Sailfish community have been doing while I've been having a break, related to deciphering the messed up textures that are coming out of the render surface.

Today I'm back looking at the code in earnest. And it's been a fruitful process. My starting point has been to take a look not at the Gecko code, but from the other end: starting with QtMozEmbed and sailfish-components-webview. Taking things easy, I'm just browsing through the code, trying to find where the Surface, as used by Gecko, connects with the texture rendered to the screen by the WebView. If I can figure that out I'll feel like I'm on solid ground.

It looks like a key piece of the puzzle happens in QMozExtTexture::updateTexture(). For example, on ESR 78, when rendering is taking place, the method gets hit often. So often that it appears to be every frame:
Thread 9 &quot;QSGRenderThread&quot; hit Breakpoint 1, QMozExtTexture::
    updateTexture (this=0x7f90024e30) at qmozexttexture.cpp:64
64      {
(gdb) bt
#0  QMozExtTexture::updateTexture (this=0x7f90024e30) at qmozexttexture.cpp:64
#1  0x0000007fbfc04f18 in MozMaterialNode::preprocess (this=0x7f9001ef60) at 
    qmozextmaterialnode.cpp:81
#2  0x0000007fbf8f0c44 in QSGRenderer::preprocess() () from /usr/lib64/
    libQt5Quick.so.5
#3  0x0000007fbf8f0e5c in QSGRenderer::renderScene(QSGBindable const&) () from /
    usr/lib64/libQt5Quick.so.5
#4  0x0000007fbf8f14c0 in QSGRenderer::renderScene(unsigned int) () from /usr/
    lib64/libQt5Quick.so.5
#5  0x0000007fbf8ffe58 in QSGRenderContext::renderNextFrame(QSGRenderer*, 
    unsigned int) () from /usr/lib64/libQt5Quick.so.5
#6  0x0000007fbf9429d0 in QQuickWindowPrivate::renderSceneGraph(QSize const&) (
    ) from /usr/lib64/libQt5Quick.so.5
#7  0x0000007fbf916c04 in ?? () from /usr/lib64/libQt5Quick.so.5
#8  0x0000007fbf91cc10 in ?? () from /usr/lib64/libQt5Quick.so.5
#9  0x0000007fbea290e8 in ?? () from /usr/lib64/libQt5Core.so.5
#10 0x0000007fb74b0a4c in start_thread (arg=0x7fffffe9bf) at pthread_create.c:
    479
#11 0x0000007fbe70b89c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/
    clone.S:78
This contrasts with the ESR 91 build, where the same breakpoint is never hit. I placed a breakpoint on all of the updateTexture() methods just to be sure:
Num Disp Enb Address       What
3   keep y   <MULTIPLE>         
3.1      y   0x07ff7b64ba4 <QSGDistanceFieldGlyphCache::updateTexture()@plt+4>
3.2      y   0x07ff7b64d64 <QSGDefaultPainterNode::updateTexture()@plt+4>
3.3      y   0x07ff7bf071c <QSGDefaultPainterNode::updateTexture()+20>
3.4      y   0x07ff7bf5230 <QSGDistanceFieldGlyphCache::updateTexture()+24>
3.5      y   0x07ff7c1babc <QSGDefaultLayer::updateTexture()+12>
3.6      y   0x07ff7ef846c <QMozExtTexture::updateTexture()+20>
The fact there are hits for ESR 78, but not for ESR 91, is definitely of interest.

This isn't something that's come up before. Until now I've focused almost exclusively on getting the texture rendered to at the other end. But it looks now like the QtMozEmbed code is failing somewhere; the render update isn't being called. Just to be clear though, this isn't due to any change that I've made in QtMozEmbed. At least that seems unlikely at any rate, given I've not made an notable changes to these parts of the code.

More likely, some part of the initialisation process in the Gecko code is failing, or not happening as it should. Something like that could well lead to a situation where the render update doesn't get called. For example, it could be that the texture is never set as part of the Qt Scene Graph. But that's speculation, it could be all sorts of things.

I'm rather excited by this. Maybe this is the key problem we've been searching for? But I don't want to go too far today while I'm still getting myself up to speed with things. So I'll continue looking in to this further tomorrow.

Also worth mentioning is that tomorrow is Jolla Love Day 2. I'm excited to find out what Jolla have in store for us and while I won't be able to attend in-person, I'll definitely be there both online and in spirit!

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
18 May 2024 : Day 236 #
It's been two weeks since I last published an entry in my Gecko dev diary series. The gap was necessary for me to fit various other things into my life which just weren't going to be compatible with me working on the Gecko code, specifically my attendance of the — HPC/AI Days — conference in Durham for a week, followed by preparations for a couple of Pint of Science events I was involved with during this last week. This was my first involvement with Pint of Science and I have to admit it was a lot of fun.

The break from Gecko offered me some healthy respite, but I promised to start back up again today and here we are. I've not yet got back in to the rhythm of Gecko development yet, so my first few entries may be somewhat relaxed. I'll be taking it gently to begin with.

One important reason for this is the problem I'm trying to solve. As I left things a fortnight ago I'm still trying to figure out why the offscreen rendering pipeline is broken. Since starting on trying to fix this problem I've come a long way, but I'm still not there. More importantly, by this point it's become a problem without a clear solution. That means I no longer have a clear path forwards.

The daily rhythm of these posts is important for keeping me focused on the task, but it can also be a distraction. It means chopping the work into day-sized chunks. That can be a hindrance when what the task really needs is a deep analysis, during which there may not be so much to write about. This work is easiest to write about when I'm making code changes. When I'm not making changes to the code, things can be a bit too intangible.

That's where I am today.

Before I get in to the actual work I want to first talk about confusing textures and the amazing work of the Sailfish community in their attempts to disentangle them. If you've been following these diary entries for some time you'll know that before the gap I spent a fair amount of time trying to extract texture image data from the render surface used by the offline renderer.

The result of all this work ended up being a collection of raw dumps of pixel data which, after attempting to convert them to something visible, looked like this.
 
The corrupted, swizzled image data from before the gap. After two weeks setting around unchanged it still just looks like fuzzy pixels to me.

I tried all sorts to get the images into better shape but without any luck. But the courageous Sailfish community took up the baton and continued the work while I was taking a break from it.

There were great contributions from across the community, but I want to especially highlight the efforts and ideas of Tone (tortoisedoc), Adam Pigg (piggz), Ville Nummela (vige, my ex-colleague from Jolla), Mark Washeim (poetaster), thigg, attah, remote and kan_ibal, all of whom provided genuinely useful input.
 
Three screenshots: a texture with some ghostly images shown on it, a zoomed in texture with a few bright pixels and a window showing a similar texture with some metadata in a desktop window

Here you can see some of the efforts to decipher the code. On the left is the result of poetaster having processed the image using ImageMagick. As poetaster explains:
 
I thought it might be an offset issue, but after tooling about in gimp, I'm not so sure. What 'should' they look like. The RGB data does seem to be RBG, the RGBA, I'm not sure. You said unsigned?... bits of sailfish logo can also be recognized. It does look unsigned judging from the results of the convert operation.

In the centre is a close up rendered by piggz using PixelViewer.
 
There is definitely almost an image there! For this I used PixelViewer, which allows to quickly try many different possible formats, this one is RGBA_8888... This is making a great puzzle... just need to figure out how to put the bits together.

When Adam talks about bits I'm not sure whether he's talking about computer bits or puzzle pieces, but either way I agree it should just be a matter of putting them together int the right way.

Finally on the right we can see the image also rendered using PixelViewer this time by kan_ibal in combination with a Python script.
 
There are blocks 8x8 pixels. It reminds me a jpeg compression and looks like a stage of quantization and DCT.

Again, this all sounds very plausible to me, alghouth the image still isn't popping out just yet. Thank you to everyone for your input on this. Unfortunately while none of these resulted in a definitive conclusion, it's all very helpful input. If anyone else would like to give this a go, feel free to take a look at the textures and pass them through your favourite raw image processing pipeline. I'd love to get to the bottom of this.

I'm going to leave it there for today. But tomorrow I'll be starting to look at the Gecko code again in earnest.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
3 May 2024 : Day 235 #
As discussed at the start of the week, this is going to be my last developer diary post for a little bit. But I want to make absolutely clear that this is a temporary pause. I'm heading to the HPC Days Conference in Durham next week where I'll be giving a talk on Matching AI Research to HPC Resource. I'm expecting it to be a packed schedule and, as is often the case with this kind of event, I'm not expecting to be able to fit in my usual gecko work alongside this. So there will be a pause, but I'll be back on the 18th May to restart from where I'm leaving things today.

During the pause I'm hoping write up some of the blog posts that I've put on hold while working on the gecko engine, so things may not be completely silent around here. But the key message I want to get across is that I'm not abandoning gecko. Not at all. I'll be back to it in no time.

Nevertheless, as I write this I'm still frustrated and stumped by my lack of progress with the offscreen rendering. So I'm also hoping that a break will give me the chance to come up with some alternative approaches. Over the last couple of days I attempted to capture the contents of the surface used to transfer the offscreen texture onscreen, but the results were inconclusive at best.

On the forums I received helpful advice from Tone (tortoisedoc). Tone highlighted the various areas where texture decoding can go awry:
 
two things can go wrong (assuming the data is correct in the texture):
  • alignment of data (start, stride)
  • format of pixel (ARGB / RGBA etc)
Which one is the texture you are look at configured with?

Is the image the same the browser would show? Which image are you displaying in the embedded view?


These are all great points and great questions. Although I know what sort of image I'm asking for (either RGB or RGBA in UNSIGNED_BYTE format) the results don't seem to be matching that. It's made more complex by the fact that the outline of the image is there (which suggests things like start and stride are correct) but still many of the pixels are just blank. It's somewhat baffling and I think the reason might be more to do with an image that doesn't exist rather than an image which is being generated in the wrong format.

But I could be wrong and I'll continue to consider these possibilities and try to figure out the underlying reason. I really appreciate the input and as always it gives me more go to on.

But today I've shifted focus just briefly by giving the code a last review before the pause: sifting through the code again to try to spot differences.

There were many places I thought there might be differences, such as the fact that the it wasn't clear to me that the GLContext->mDesc.isOffscreen flag was being set. But stepping through the application in the debugger showed it was set after all. So no fix to be had there.

The only difference I can see — and it's too small to make a difference I'm sure — is that the Surface capabilities are set at slightly different places. It's possible that in the gap between where it's set in ESR 78 and the place it's set slightly later in ESR 91, something makes use of it and a difference results.

I'm not convinced to be honest, but to avoid even the slightest doubt I've updated the code so that the values are now set explicitly:
GLContext::GLContext(const GLContextDesc& desc, GLContext* sharedContext,
                     bool useTLSIsCurrent)
    : mDesc(desc),
      mUseTLSIsCurrent(ShouldUseTLSIsCurrent(useTLSIsCurrent)),
      mDebugFlags(ChooseDebugFlags(mDesc.flags)),
      mSharedContext(sharedContext),
      mWorkAroundDriverBugs(
          StaticPrefs::gfx_work_around_driver_bugs_AtStartup()) {
  mCaps.any = true;
  mCaps.color = true;
[...]
}

[...]

bool GLContext::InitImpl() {
[...]
  // TODO: Remove SurfaceCaps::any.
  if (mCaps.any) {
    mCaps.any = false;
    mCaps.color = true;
    mCaps.alpha = false;
  }
[...]
The updated code is built and right now transferring over to my development phone. As I say though, this doesn't look especially promising to me. I'm not holding my breath for a successful render.

And I was right: no improvements after this change. Argh. How frustrating.

I'll continue looking through the fine details of the code. There has to be a difference in here that I'm missing. And while it doesn't make for stimulating diary entries, just sitting and sifting through the code seems like an essential task nonetheless.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.

If you've got this far, then not only do you have my respect and thanks, but I'd also urge you to return on the 18th May when I'll be picking things up from where I've left them today.
Comment
2 May 2024 : Day 234 #
Yesterday I added code to the build to try to capture the image from the surface texture that's supposed to be used for rendering to the screen. I got poor results with it though: the colours are wrong and the images look corrupted somehow.

So today I've been playing around with the data to try to figure out why. One possibility that sprang to mind is that potentially the data needs swizzling. Swizzling is the act of rearranging pixels, components or bits within the components. One possibility is that the data is being read little-endian (or big-endian) but GIMP is interpreting it as big-endian (or little-endian). So reversing the endianness might help.

In order to check this I've written a simple Python script that allows me to rearrange bits and bytes within the image in a simplified way. I just have to run the image through the script:
#!/bin/python3
import sys

def swizzle(byte):
    result = 0
    for pos in range(8):
        result |= ((byte & (1 << pos)) >> pos) << (7 - pos)
    return result

def convert(filein, fileout):
    with open(filein, &quot;rb&quot;) as fin:
        data = fin.read(-1)
    with open(fileout, &quot;wb&quot;) as fout:
        for pos in range(0, len(data), 4):
            r1, g1, b1, a1 = data[pos:pos + 4]

            # Rearrange the bits and bytes
            r2 = swizzle(r1)
            g2 = swizzle(g1)
            b2 = swizzle(b1)

            fout.write(r2.to_bytes(1, 'little'))
            fout.write(g2.to_bytes(1, 'little'))
            fout.write(b2.to_bytes(1, 'little'))

if len(sys.argv) != 3:
    print(&quot;Please provide input and output filenames&quot;)
else:
    filein = sys.argv[1]
    fileout = sys.argv[2]
    print(f&quot;Converting from: {filein}&quot;)
    print(f&quot;Converting to: {fileout}&quot;)
    convert(filein, fileout)
    print(f&quot;Done&quot;)
After passing the images through this script and rearranging the data in as many ways as I can think of, I'm still left with a very messy result. The two new versions I created are one with just the alpha channel removed; and a second which also reverses the bits in each byte. Here are the three variants I'm left with:
 
  1. frame060-00.data: Original file (RGBA).
  2. frame060-01.data: Alpha removed (RGB).
  3. frame060-02.data: Alpha removed and swizzled (RGB).


The version without the alpha channel was created using a change to the gecko code by setting the texture format to RGB. I did this because I was concerned all of the zeroed-out values (which appear as black pixels in the image) might have been somehow related to this. But as you can see, the result is identical to generating an RGBA texture and then using the Python script to remove the alpha channel.

It's hard to see the difference between the first two versions and the third, which has the bit direction reversed. The colours are actually different, but because black remains black even after swizzling, the overall darkness of the image remains.
 
The swizzled image data still looks corrupted.

As you can see we get very similar results irrespective of these changes. When we look at the data using a hex editor we can see why. The majority of the entries are zero, which is why we're getting such a preponderance of black in the images:
$ hexdump -vC -s 0x1e0 -n 0x090 frame060-00.data | less
000001e0  ff 03 04 ff 0f 00 00 ff  00 01 00 ff 00 00 f0 ff  |................|
000001f0  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000200  ff 03 04 ff 0f 00 00 ff  80 00 00 ff 00 00 d0 ff  |................|
00000210  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000220  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000230  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000240  ff 03 04 ff 0f 00 00 ff  00 01 00 ff 00 00 e0 ff  |................|
00000250  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000260  00 00 00 ff 00 00 00 ff  00 00 00 ff 00 00 00 ff  |................|
00000270
$ hexdump -vC -s 0x1e0 -n 0x090 frame060-01.data | less
000001e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001f0  00 00 00 00 00 00 00 00  ff 03 04 0f 00 00 00 01  |................|
00000200  00 00 00 d0 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000220  00 00 00 00 00 00 00 00  ff 03 04 0f 00 00 80 00  |................|
00000230  00 00 00 d0 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000240  ff 03 04 0f 00 00 00 01  00 00 00 e0 00 00 00 00  |................|
00000250  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000270
$ hexdump -vC -s 0x1e0 -n 0x090 frame060-02.data | less
000001e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000001f0  00 00 00 00 00 00 00 00  ff c0 20 f0 00 00 00 80  |.......... .....|
00000200  00 00 00 0b 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000210  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000220  00 00 00 00 00 00 00 00  ff c0 20 f0 00 00 01 00  |.......... .....|
00000230  00 00 00 0b 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000240  ff c0 20 f0 00 00 00 80  00 00 00 07 00 00 00 00  |.. .............|
00000250  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000270
Inverting the colours doesn't help: we just end up with a predominance of white. It looks more like only certain points in the texture are being exported.

Swizzling then doesn't appear to be the answer. Another possibility, I thought, might be that I'm using the wrong parameters for the call to raw_fReadPixels() when actually requesting the data. Maybe these parameters have to match the underlying texture?

To test this out I've tried to determine the values that are used when the surface textures are created. I used the debugger for this. But I thought I'd also take the opportunity to check that we have the same values for ESR 78 and ESR 91. So first off, this is what I get when I check ESR 78:
Thread 37 &quot;Compositor&quot; hit Breakpoint 3, mozilla::gl::
    SharedSurface_Basic::Create (gl=0x7eac109140, formats=..., size=..., 
    hasAlpha=false)
    at gfx/gl/SharedSurfaceGL.cpp:24
24          bool hasAlpha) {
(gdb) n
25        UniquePtr<SharedSurface_Basic> ret;
(gdb) p formats
$1 = (const mozilla::gl::GLFormats &) @0x7eac00595c: {color_texInternalFormat = 
    6407, color_texFormat = 6407, color_texType = 5121, 
  color_rbFormat = 32849, depthStencil = 35056, depth = 33190, stencil = 36168}
(gdb) p size
$2 = (const mozilla::gfx::IntSize &) @0x7eac003564: {<mozilla::gfx::
    BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> >> = 
    {{{
        width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::
    gfx::UnknownUnits> = {<No data fields>}, <No data fields>}
(gdb) p hasAlpha
$3 = false
(gdb) p gl
$4 = (mozilla::gl::GLContext *) 0x7eac109140
(gdb) 
And here's the check for ESR 91. The details all looks identical to me:
Thread 37 &quot;Compositor&quot; hit Breakpoint 3, mozilla::gl::
    SharedSurface_Basic::Create (gl=0x7ee019aa50, formats=..., size=..., 
    hasAlpha=false)
    at gfx/gl/SharedSurfaceGL.cpp:59
59          bool hasAlpha) {
(gdb) n
60        UniquePtr<SharedSurface_Basic> ret;
(gdb) p formats
$4 = (const mozilla::gl::GLFormats &) @0x7ee00044e4: {color_texInternalFormat = 
    6407, color_texFormat = 6407, color_texType = 5121, 
  color_rbFormat = 32849, depthStencil = 35056, depth = 33190, stencil = 36168}
(gdb) p size
$5 = (const mozilla::gfx::IntSize &) @0x7f1f92effc: {<mozilla::gfx::
    BaseSize<int, mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> >> = 
    {{{
        width = 1080, height = 2520}, components = {1080, 2520}}}, <mozilla::
    gfx::UnknownUnits> = {<No data fields>}, <No data fields>}
(gdb) p hasAlpha
$6 = false
(gdb) p gl
$7 = (mozilla::gl::GLContext *) 0x7ee019aa50
(gdb) 
The actual format is specified in GLContext::ChooseGLFormats(). Here are the values taken from the debugger:
{
    color_texInternalFormat = 6407,
    color_texFormat = 6407,
    color_texType = 5121, 
    color_rbFormat = 32849,
    depthStencil = 35056,
    depth = 33190,
    stencil = 36168
}
Checking these against the appropriate enums, these values are equivalent to the following:
{
    color_texInternalFormat = LOCAL_GL_RGB,
    color_texFormat = LOCAL_GL_RGB,
    color_texType = LOCAL_GL_UNSIGNED_BYTE, 
    color_rbFormat = LOCAL_GL_RGB8,
    depthStencil = LOCAL_GL_DEPTH24_STENCIL8,
    depth = LOCAL_GL_DEPTH_COMPONENT24,
    stencil = LOCAL_GL_STENCIL_INDEX8
}
I've used these values for the parameters to ReadPixles(), including removing the alpha channel. But sadly the results are practically identical. Here's the new output generated during the capture for reference:
frame000.data: Colour before: (0, 0, 0), 1
frame001.data: Colour before: (0, 0, 0), 1
frame002.data: Colour before: (0, 0, 0), 1
frame003.data: Colour before: (0, 0, 0), 1
frame004.data: Colour before: (208, 175, 205), 1
frame005.data: Colour before: (39, 0, 0), 1
frame006.data: Colour before: (67, 115, 196), 1
frame007.data: Colour before: (0, 0, 0), 1
frame008.data: Colour before: (0, 0, 0), 1
frame009.data: Colour before: (157, 149, 42), 1
frame010.data: Colour before: (0, 0, 0), 1
frame011.data: Colour before: (0, 32, 12), 1
frame012.data: Colour before: (71, 118, 198), 1
frame013.data: Colour before: (0, 0, 0), 1
frame014.data: Colour before: (0, 0, 0), 1
This is all interesting stuff, but it doesn't seem to get us any closer to what we were hoping for, which is supposed to be a copy of the image that ought to be shown on the screen. It's all been a bit of an unproductive diversion. I'll have to try to make more progress on that tomorrow.

If you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment
1 May 2024 : Day 233 #
Before I get in to my diary entry today I want to add a reminder that I'll not be posting diaries next week, or the week after. Next week I'll be attending a conference and I need a bit of time to sort out a few other things in my wider life. So this will give me the chance to do that. But this is only a temporary gap; I'll be back straight after to continue where I left off.

With that out of the way, let's get on with the entry for today. If you've read any of my diary entries over the last few days you'll know I've been trying to extract something useful from the GLScreenBuffer::Swap() method. I added some code in to the method to read off pixel data from the render surface which generated some plausible looking output.

But I wasn't totally convinced: it looked to me like there were too many zero entries that I couldn't explain

So today I've been trying to do a couple of things. First I tried to get something to compare against; second I tried to make the code that captures the colour at a point a little more general by also saving out the entire image buffer to disk.

Let's tackle these in order. The obvious way to get something to compare against is to add the same code to the ESR 78 library and try running that. And this is exactly what I've been doing. The surrounding code in ESR 91 is almost identical to that in ESR 78, so it's a pretty straightforward task to just copy over the code changes from ESR 91 to ESR 78.

Having done this, built the library and deployed it to my phone, here's what's output when I now execute the browser:
=============== Preparing offscreen rendering context ===============
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
[...]
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (45, 91, 86, 255), 1
Colour after: (45, 91, 86, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (128, 13, 160, 255), 1
Colour after: (128, 13, 160, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
Colour before: (0, 0, 0, 255), 1
Colour after: (0, 0, 0, 255), 1
[...]
There are a couple of lines of interesting output here, but most of them just show black pixels and nothing else. That's not ideal. To be clear, this is output from a WebView app that is fully working. So I'd have expected to see a lot more colour data coming out in the debug output.

This has left me more confused than enlightened, so I've gone on to implement the second idea as well: exporting the image data to file so I can check what the image actually looks like.

Once again the code for this is pretty straightforward. All I'm doing is taking a copy of the entire surface, rather than just one pixel of it. This is provided as a contiguous block of memory: a raw buffer with the pixel values stored in it. So I'm just dumping this out to a file. To avoid completely thrashing my phone I've set it up to output an image only once in every ten renders. The filenames increment each time an image is exported, so I should capture several steps as the page completes is render.

Here's the code I'm using for this, added to the GLScreenBuffer::Swap() method. This is hacky code at best, but it's also temporary, so I'm not too concerned about the style here. Something quick and dirty is fine, as long as... well, as long as it works!
  static int count = 0;
  static int filecount = 0;
  size_t bufferSize;
  uint8_t* buf;
  bool result;
  int xpos;
  int ypos;
  int pos;
  volatile char red;
  volatile char green;
  volatile char blue;
  volatile char alpha;

  if (count % 10 == 0) {
    //bool GLScreenBuffer::ReadPixels(GLint x, GLint y, GLsizei width, GLsizei 
    height, GLenum format, GLenum type, GLvoid* pixels)
    bufferSize = sizeof(char) * size.width * size.height * 4;
    buf = static_cast&lt;uint8_t*&gt;(calloc(sizeof(uint8_t), bufferSize));
    result = ReadPixels(0, 0, size.width, size.height, LOCAL_GL_RGBA, 
    LOCAL_GL_UNSIGNED_BYTE, buf);

    xpos = size.width / 2;
    ypos = size.height / 2;
    pos = (xpos + (size.width * ypos)) * 4;

    red = buf[pos];
    green = buf[pos + 1];
    blue = buf[pos + 2];
    alpha = buf[pos + 3];

    printf_stderr(&quot;Colour before: (%d, %d, %d, %d), %d\n&quot;, red, 
    green, blue, alpha, result);

    #define FORMAT &quot;/home/defaultuser/Documents/Development/gecko/
    frame%03d.dat&quot;

    // Export out the pixel data
    int const len = 61 + 10;
    char filename[61 + 10];
    snprintf(filename, len, FORMAT, filecount);
    FILE *fh = fopen(filename, &quot;w&quot;);
    fwrite(buf, sizeof(char), bufferSize, fh);
    fclose(fh);
    free(buf);
    filecount += 1;
  }
  count += 1;
After building and running the code I get some sensible looking output. As you can see there are sixteen frames generated before I quit the application. The first five look empty, but eventually some colours start coming through.
frame000.data: Colour before: (0, 0, 0, 255), 1
frame001.data: Colour before: (0, 0, 0, 255), 1
frame002.data: Colour before: (0, 0, 0, 255), 1
frame003.data: Colour before: (0, 0, 0, 255), 1
frame004.data: Colour before: (0, 0, 0, 255), 1
frame005.data: Colour before: (187, 125, 127, 255), 1
frame006.data: Colour before: (67, 115, 196, 255), 1
frame007.data: Colour before: (18, 0, 240, 255), 1
frame008.data: Colour before: (162, 225, 0, 255), 1
frame009.data: Colour before: (128, 202, 255, 255), 1
frame010.data: Colour before: (0, 0, 0, 255), 1
frame011.data: Colour before: (240, 255, 255, 255), 1
frame012.data: Colour before: (255, 159, 66, 255), 1
frame013.data: Colour before: (68, 115, 196, 255), 1
frame014.data: Colour before: (0, 0, 0, 255), 1
frame015.data: Colour before: (0, 192, 7, 255), 1
The textures captured from the execution; the colours are wrong and the image looks corrupted.

After copying the frame data from my phone over to my laptop I'm able to load them into GIMP (the GNU Image Manipulation Package) using the raw plugin. This is activated for files with a .data extension and allows the data to be loaded in as if it were pure pixel data without any header or metadata. Because there's no header you have to specify certain parameters manually, such as the width, height and format of the image data.

I always forget exactly what the dimensions of the screens on my development Xperia 10 II devices are, but thankfully GSMArena is only a few clicks away to check:
 
Resolution: 1080 x 2520 pixels, 21:9 ratio (~457 ppi density)

Adding the dimensions into the raw data loader seems to do the trick.

The code I added to gecko requested a texture format of RGBA, so that's what I need to use when loading the data in. Sadly the results are not what I had hoped. There are clearly some data related to the render and it's worth noting that the buffer where these are stored is initialised to contain zeroes each frame, so the data is real, not just artefacts from the memory or previous render.

But most of the pixels are black, the colours are wrong and the images seem to be mangled in very strange ways as you can see in the screenshots.

It's too late for me to figure this out tonight so I'll have a think about it overnight and come back to it in the morning.

As always, if you'd like to read any of my other gecko diary entries, they're all available on my Gecko-dev Diary page.
Comment