List items

Items from the current list are shown below.

Blog

All items from November 2023

30 Nov 2023 : Day 93 #

Unfortunately a clerical error yesterday (me forgetting to start it) left me with an incomplete build this morning. I kicked it off immediately, but as I write this in the evening it's still running.

So, while I have my patched, partial-build, version installed, and since the changes didn't add or remove any lines, a practical task for me to work on while the build continues is to debug the changes I made using gdb to check whether the changes are actually having any effect.

The drill is: fire up the phone with the patched version on; ssh into it; start gdb with sailfish-browser set as its executable; stick breakpoints on EmbedLiteViewChild::RecvLoadURL(), nsDocShellLoadState::CreateFromLoadURIOptions() and WebNavigationFlagsToFixupFlags(); run the executable; then check out various variables once the breakpoint is hit. The interesting variables are likely to be the following:

In RecvLoadURL() we should take a look at sAllowKeyWordURL to ensure it's set to true.
In RecvLoadURL() we should take a look at flags see which if its bits are set.
In CreateFromLoadURIOptions() we should take a look at loadFlags see which if its bits are set.
In WebNavigationFlagsToFixupFlags() we should take a look at fixupFlags see which if its bits are set.

So, here goes:

[...]
Thread 8 "GeckoWorkerThre" hit Breakpoint 1, mozilla::embedlite::
    EmbedLiteViewChild::RecvLoadURL (this=0x7f88ba4070, url=...)
    at ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:534
534     ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:
        No such file or directory.
(gdb) p sAllowKeyWordURL
$1 = <optimized out>
(gdb) n
[LWP 11437 exited]
535     in ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp
(gdb) n
867     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
        No such file or directory.
(gdb) n
545     ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp:
        No such file or directory.
(gdb) n
546     in ${PROJECT}/gecko-dev/mobile/sailfishos/embedshared/EmbedLiteViewChild.cpp
(gdb) p /x flags
$3 = 0x340000
(gdb)

Hitting our first breakpoint we can't determined what sAllowKeyWordURL is, but we can determine its effects, which are to the flags variable. By the end of the method this is set to 0x340000 which is equivalent to having the following three bits set:

LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP = 0x100000
LOAD_FLAGS_FIXUP_SCHEME_TYPOS = 0x200000
LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL = 0x40000

We can check that this is correct by dropping briefly into the Python interpreter:

$ python3
>>> (0x100000 + 0x200000 + 0x40000) - 0x340000
0

The larger of these two bits wouldn't be set if sAllowKeyWordURL was set to false, so sAllowKeyWordURL must be set to true. That means that our partial build did work. That sounds like good news, but it's a bit of a disappointment for me because it means that the changes made so far haven't yet fixed the problem: there's more to do.

The values of the other variables which we wanted to look at will all now be determined by this, so arguably there's no need to check. But I'll continue stepping through the executable just to be sure, and because maybe it's interesting to see.

(gdb) c
Continuing.

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, nsDocShellLoadState::
    CreateFromLoadURIOptions (aBrowsingContext=0x7f88b67bd0, aURI=...,
    aLoadURIOptions=..., aResult=aResult@entry=0x7f9f3db3c8)
    at ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp:226
226     ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp:
    No such file or directory.
(gdb) p /x aLoadURIOptions.mLoadFlags
$4 = 0x340000
(gdb) p loadFlags
$5 = <optimized out>
(gdb) c
Continuing.

That's our second breakpoint there, and as we can see the flags have safely made it through to aLoadURIOptions.mLoadFlags in CreateFromLoadURIOptions(). Let's continue onward.

Thread 8 "GeckoWorkerThre" hit Breakpoint 3, WebNavigationFlagsToFixupFlags
    (aNavigationFlags=3407872, aURIString=..., aURI=0x0)
    at ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp:211
211     in ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp
(gdb) p /x aNavigationFlags
$6 = 0x340000
(gdb) n
[W] unknown:0 - QConnmanEngine: Unable to translate the bearer type of the
    unknown connection type: ""
nsDocShellLoadState::CreateFromLoadURIOptions (aBrowsingContext=0x7f88b67bd0,
    aURI=..., aLoadURIOptions=..., aResult=aResult@entry=0x7f9f3db3c8)
    at ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp:266
266     in ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp
(gdb) 
270     in ${PROJECT}/gecko-dev/docshell/base/nsDocShellLoadState.cpp
(gdb) p /x fixupFlags
$14 = 0x9
(gdb) c
Continuing.

There's a bit to be disentangled here. This is our third breakpoint and we can see that the correct bits made it safely into aNavigationFlags inside WebNavigationFlagsToFixupFlags(). Stepping through the method, eventually the fixupFlags variable gets set to 0x9. That's also what we'd expect as it tallies with the following flags taken from nsIURIFixup.idl:

    /**
     * Allow the fixup to use a keyword lookup service to complete the URI.
     * The fixup object implementer should honour this flag and only perform
     * any lengthy keyword (or search) operation if it is set.
     */
    const unsigned long FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP = 1;

    /*
     * Fix common scheme typos.
     */
    const unsigned long FIXUP_FLAG_FIX_SCHEME_TYPOS = 8;

So everything is as we expected. But continuing on from here we still get an error page showing in the browser with the text "The address isn't valid" rather than the online search results we were hoping for.

Unfortunately the partial build doesn't show the source code lines after all, which makes it challenging to step through effectively. I'd really like to step through the rest of the nsDocShellLoadState::CreateFromLoadURIOptions() method to see whether there are any more clues as to why the search provider isn't being contacted. To get proper debugging back I'm going to have to wait for the build to complete.

[...]

So finally the build completed and I can now step through properly! So brace yourself, this is going to be a bit of a long section showing me stepping through the CreateFromLoadURIOptions() method. If gdb debugging isn't your thing, not to worry, you can safely skip past it.

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, nsDocShellLoadState::
    CreateFromLoadURIOptions (aBrowsingContext=0x7f88cba350, aURI=...,
    aLoadURIOptions=..., aResult=aResult@entry=0x7f9f3d33c8) at /usr/src/debug/
    xulrunner-qt5-91.9.1-1.aarch64/docshell/base/nsDocShellLoadState.cpp:226
226         const LoadURIOptions& aLoadURIOptions, nsDocShellLoadState** aResult) {
(gdb) n
227       uint32_t loadFlags = aLoadURIOptions.mLoadFlags;
(gdb)
233       nsCOMPtr<nsIURI> uri;
(gdb) p /x loadFlags
$1 = 0x340000
(gdb) n
234       nsCOMPtr<nsIInputStream> postData(aLoadURIOptions.mPostData);
(gdb) n
237       NS_ConvertUTF16toUTF8 uriString(aURI);
(gdb) n
239       uriString.Trim(" ");
(gdb) n
241       uriString.StripCRLF();
(gdb) n
[LWP 20950 exited]
188     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsTStringRepr.h:
        No such file or directory.
(gdb) n
245       rv = NS_NewURI(getter_AddRefs(uri), uriString);
(gdb) n
359     ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
        No such file or directory.
(gdb) n
30      ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsError.h:
        No such file or directory.
(gdb) n
252       } else if (!sURIFixup && !XRE_IsContentProcess()) {
(gdb) n
262       nsAutoString searchProvider, keyword;
(gdb) p fixup
$2 = true
(gdb) n
264       if (fixup) {
(gdb) n
266             WebNavigationFlagsToFixupFlags(uri, uriString, loadFlags);
(gdb) n
270         if (!(fixupFlags & nsIURIFixup::FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP)) {
(gdb) n
274         if (aBrowsingContext->UsePrivateBrowsing()) {
(gdb) n
278         RefPtr<nsIInputStream> fixupStream;
(gdb) n
279         if (!XRE_IsContentProcess()) {
(gdb) n
280           nsCOMPtr<nsIURIFixupInfo> fixupInfo;
(gdb) n
281           sURIFixup->GetFixupURIInfo(uriString, fixupFlags,
(gdb) p fixupInfo                 
$3 = {<nsCOMPtr_base> = {mRawPtr = 0x0}, <No data fields>}
(gdb) n
1363    ${PROJECT}/obj-build-mer-qt-xr/dist/include/nsCOMPtr.h:
        No such file or directory.
(gdb) n
JavaScript error: resource://gre/modules/URIFixup.jsm, line 443:
    NS_ERROR_MALFORMED_URI: Couldn't build a valid uri
283           if (fixupInfo) {
(gdb) n
266             WebNavigationFlagsToFixupFlags(uri, uriString, loadFlags);
(gdb) n
304         if (fixupStream) {
(gdb) n
312       if (rv == NS_ERROR_MALFORMED_URI) {
(gdb) n
262       nsAutoString searchProvider, keyword;
(gdb) n
237       NS_ConvertUTF16toUTF8 uriString(aURI);
(gdb) n
234       nsCOMPtr<nsIInputStream> postData(aLoadURIOptions.mPostData);
(gdb) n
233       nsCOMPtr<nsIURI> uri;
(gdb) n
nsDocShell::LoadURI (this=0x7f88adf170, aURI=..., aLoadURIOptions=...) at
    /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/docshell/base/nsDocShell.cpp:3256
3256      if (NS_ERROR_MALFORMED_URI == rv) {
(gdb) c                                       
Continuing.
[...]

Looking at this step through it's clear that the problem happens on line 281 where the fixupInfo object is returned. Or, more specifically, the fact that it's not returned.

To understand why we have to move back into JavaScript mode, in the URIFixup.jsm file. From the getFixupURIInfo() it looks like we'll descend into the tryKeywordFixupForURIInfo() method, which will then call then keywordToURI() method. That's where the info.engine value gets set:

    // Try falling back to the search service's default search engine
    // We must use an appropriate search engine depending on the private
    // context.
    let engine = isPrivateContext
      ? Services.search.defaultPrivateEngine
      : Services.search.defaultEngine;

That's the value which will eventually be returned in the fixupInfo.keywordProviderName parameter, as well as the equally important info.preferredURI parameter.

Why aren't these being set properly I wonder...

That'll be the question for tomorrow.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

29 Nov 2023 : Day 92 #

It's already the evening here while I start this post. I've not had a chance to do any gecko development during the day but am hoping to make up for some lost time now.

Yesterday we were looking at search in the address bar. We made a bit of progress: previously there were a couple of errors that simply stopped all search in its tracks. After having fixed them I'm not getting a page stating that "The address isn't valid". I think that's a step forwards!

There are still a couple of errors remaining, one of which we tracked down to some code in the URIFixup.jsm file. The task now is to figure out what changed between ESR 78 and ESR 91 that might make this error occur.

Taking a look through each of them side by side, the code in the getFixupURIInfo() method has changed quite a bit, but ultimately it arrives in the same place and there's no evidence in the ESR 78 code that the hand-off to search happens within this method. The return value is the same. So maybe it's worth taking a look at the code that calls this method instead.

Searching through the code for instances of getFixupURIInfo() is unsatisfactory (it's not clear which is the relevant one) so I'm going to try to tackle it from the other end. When you enter some text into the address bar and hit enter the following JavaScript is executed from inside the QML (this is from Overlay.qml in the sailfish-browser repository):

    function loadPage(url, newTab) {
        if (url == "about:config") {
            pageStack.animatorPush(Qt.resolvedUrl("ConfigWarning.qml"),
                {"browserPage": browserPage})
        } else if (url == "about:settings") {
            pageStack.animatorPush(Qt.resolvedUrl("../SettingsPage.qml"))
        } else {
            if (webView && webView.tabModel.count === 0) {
                webView.clearSurface();
            }
            // let gecko figure out how to handle malformed URLs
            var pageUrl = url
            if (!isNaN(pageUrl) && pageUrl.trim()) {
                pageUrl = "\"" + pageUrl.trim() + "\""
            }

            if (!searchField.enteringNewTabUrl && !newTab) {
                webView.releaseActiveTabOwnership()
                webView.load(pageUrl)
            } else {
                // Loading will start once overlay animator has animated chrome
                // visible.
                enteredUrl = pageUrl
            }
        }

        overlayAnimator.showChrome()
    }

The interesting parts here are the text "let gecko figure out how to handle malformed URLs" and the action, which is the webView.load(pageUrl) call. This is where we transition from sailfish-browser to qtmozembed.

So this is a call to QuickMozView::load() which calls the private method which does nothing other than call QMozViewPrivate::load() with the same parameter. This is where we transition from qtmozembed to EmbedLite in gecko-dev.

This is a call to mozilla::embedlite::EmbedLiteView::LoadURL() which we can find in embedding/embedlite/EmbedLiteView.cpp and which looks like this:

void
EmbedLiteView::LoadURL(const char* aUrl)
{
  LOGT("url:%s", aUrl);
  Unused << mViewParent->SendLoadURL(NS_ConvertUTF8toUTF16(
      nsDependentCString(aUrl)));
}

From the header file we can see that mViewParent is an instance of PEmbedLiteViewParent. This ends up as a call to EmbedLiteViewChild::RecvLoadURL(). This is an interesting method so worth copying out in full. Specifically interesting because of the flags it sets and passes on through a call to nsIWebNavigation::LoadURI():

mozilla::ipc::IPCResult EmbedLiteViewChild::RecvLoadURL(const nsString &url)
{
  LOGT("url:%s", NS_ConvertUTF16toUTF8(url).get());
  NS_ENSURE_TRUE(mWebNavigation, IPC_OK());

  uint32_t flags = 0;
  if (sAllowKeyWordURL) {
    flags |= nsIWebNavigation::LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP;
    flags |= nsIWebNavigation::LOAD_FLAGS_FIXUP_SCHEME_TYPOS;
  }
  flags |= nsIWebNavigation::LOAD_FLAGS_DISALLOW_INHERIT_PRINCIPAL;

  LoadURIOptions loadURIOptions;
  loadURIOptions.mTriggeringPrincipal = nsContentUtils::GetSystemPrincipal();
  loadURIOptions.mLoadFlags = flags;
  mWebNavigation->LoadURI(url, loadURIOptions);

  return IPC_OK();
}

It's worth noting that nsIWebNavigation is just an interface. The mWebNavigation implementation of it comes from nsWebBrowser. All this does is pass the parameters along to nsDocShell::LoadURI(). We're deep in C++ territory now, but at any moment we're going to burst out into some JavaScript.

The code in nsDocShell::LoadURI() is considerably more complex. At the top of the method it contains this interesting sequence:

  RefPtr<nsDocShellLoadState> loadState;
  nsresult rv = nsDocShellLoadState::CreateFromLoadURIOptions(
      mBrowsingContext, aURI, aLoadURIOptions, getter_AddRefs(loadState));

  uint32_t loadFlags = aLoadURIOptions.mLoadFlags;
  if (NS_ERROR_MALFORMED_URI == rv) {
    MOZ_LOG(gSHLog, LogLevel::Debug,
            ("Creating an active entry on nsDocShell %p to %s (because "
             "we're showing an error page)",
             this, NS_ConvertUTF16toUTF8(aURI).get()));

This is interesting because of the MOZ_LOG output, which is currently hidden, but which we can enable.

These log messages can be controlled by category (in this case gSHLog which is set to "SessionHistory") and by log level (in this case Debug which takes the value 4). To view them, we set the MOZ_LOG environment variable, like this:

MOZ_LOG="SessionHistory:4" sailfish-browser

With this enabled we see the following output:

JavaScript error: resource://gre/modules/URIFixup.jsm, line 443:
    NS_ERROR_MALFORMED_URI: Couldn't build a valid uri
[Parent 6189: Unnamed thread 705c002670]: D/SessionHistory Creating an active
    entry on nsDocShell 705c999460 to test search (because we're showing an
    error page)

On the ESR 78 install I get neither of those errors. So, as suspected, the NS_ERROR_MALFORMED_URI return value is being caught and changed before it reaches the DocShell somewhere inside the nsDocShellLoadState::CreateFromLoadURIOptions() call.

And now we've reach the middle point from either end. Here's the bit of code inside this method running up to the call to GetFixupURIInfo() that's generating the error:

nsresult nsDocShellLoadState::CreateFromLoadURIOptions(
    BrowsingContext* aBrowsingContext, const nsAString& aURI,
    const LoadURIOptions& aLoadURIOptions, nsDocShellLoadState** aResult) {
[...]
  nsAutoString searchProvider, keyword;
  bool didFixup = false;
  if (fixup) {
    uint32_t fixupFlags =
        WebNavigationFlagsToFixupFlags(uri, uriString, loadFlags);

    // If we don't allow keyword lookups for this URL string, make sure to
    // update loadFlags to indicate this as well.
    if (!(fixupFlags & nsIURIFixup::FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP)) {
      loadFlags &= ~nsIWebNavigation::LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP;
    }
    // Ensure URIFixup will use the right search engine in Private Browsing.
    if (aBrowsingContext->UsePrivateBrowsing()) {
      fixupFlags |= nsIURIFixup::FIXUP_FLAG_PRIVATE_CONTEXT;
    }

    RefPtr<nsIInputStream> fixupStream;
    if (!XRE_IsContentProcess()) {
      nsCOMPtr<nsIURIFixupInfo> fixupInfo;
      sURIFixup->GetFixupURIInfo(uriString, fixupFlags,
                                 getter_AddRefs(fixupInfo));
[...]

Inside the GetFixupURIInfo() method there are a couple of conditions that look a like this:

    // If we still haven't been able to construct a valid URI, try to force a
    // keyword match.
    if (keywordEnabled && fixupFlags & FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP) {
      tryKeywordFixupForURIInfo(info.originalInput, info, isPrivateContext);
    }

Placing some debug output into GetFixupURIInfo() I can see that these aren't entered because of the flags: neither the FIXUP_FLAG_FIX_SCHEME_TYPOS nor the FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP flag are set.

The answer to why that is might be in the WebNavigationFlagsToFixupFlags() method that we can see above inside CreateFromLoadURIOptions(). Now while GetFixupURIInfo() is JavaScript, the class that's calling it is C++, so I can debug and step through that easily. I'm going to give that a go to see whether that sheds any further light on things.

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, nsDocShellLoadState::
    CreateFromLoadURIOptions (aBrowsingContext=0x7f88cccb40, aURI=...,
    aLoadURIOptions=..., aResult=aResult@entry=0x7f9f3f23c8) at
    /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/docshell/base/
    nsDocShellLoadState.cpp:226
226         const LoadURIOptions& aLoadURIOptions, nsDocShellLoadState** aResult) {
(gdb) p /x aLoadURIOptions.mLoadFlags
$5 = 0x40000
(gdb) n
227       uint32_t loadFlags = aLoadURIOptions.mLoadFlags;
(gdb) n
233       nsCOMPtr<nsIURI> uri;
(gdb) n
[LWP 25851 exited]
234       nsCOMPtr<nsIInputStream> postData(aLoadURIOptions.mPostData);
(gdb) n
237       NS_ConvertUTF16toUTF8 uriString(aURI);
(gdb) n
239       uriString.Trim(" ");
(gdb) n
241       uriString.StripCRLF();
(gdb) p uriString.mData
$6 = (mozilla::detail::nsTStringRepr<char>::char_type *) 0x7f9f3f2204 "abc def"
(gdb) n
245       rv = NS_NewURI(getter_AddRefs(uri), uriString);
(gdb) n
252       } else if (!sURIFixup && !XRE_IsContentProcess()) {
(gdb) n
262       nsAutoString searchProvider, keyword;
(gdb) p fixup
$7 = true
(gdb) n
264       if (fixup) {
(gdb) n

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, WebNavigationFlagsToFixupFlags
    (aNavigationFlags=262144, aURIString=..., aURI=0x0)
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/docshell/base/
    nsDocShellLoadState.cpp:211
211       if (aURI) {
(gdb) p /x aNavigationFlags
$8 = 0x40000
(gdb) n
nsDocShellLoadState::CreateFromLoadURIOptions (aBrowsingContext=0x7f88cccb40,
    aURI=..., aLoadURIOptions=..., aResult=aResult@entry=0x7f9f3f23c8)
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/docshell/base/
    nsDocShellLoadState.cpp:266
266             WebNavigationFlagsToFixupFlags(uri, uriString, loadFlags);
(gdb) n
270         if (!(fixupFlags & nsIURIFixup::FIXUP_FLAG_ALLOW_KEYWORD_LOOKUP)) {
(gdb) n
271           loadFlags &= ~nsIWebNavigation::LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP;
(gdb) n
274         if (aBrowsingContext->UsePrivateBrowsing()) {
(gdb) n
278         RefPtr<nsIInputStream> fixupStream;
(gdb) p /x fixupFlags
$9 = 0x0
(gdb) n
279         if (!XRE_IsContentProcess()) {
(gdb) n
280           nsCOMPtr<nsIURIFixupInfo> fixupInfo;
(gdb) n
281           sURIFixup->GetFixupURIInfo(uriString, fixupFlags,
(gdb) p fixupFlags
$10 = 0
(gdb)

From this it looks like the problem has already happened by the time CreateFromLoadURIOptions() has been called with aLoadURIOptions.mLoadFlags set to 0x40000. The values we're interested in are the following, so we'd really want at least one of these taken from docshell/base/nsIWebNavigation.idl to be set:

  const unsigned long LOAD_FLAGS_ALLOW_THIRD_PARTY_FIXUP = 0x100000;
  const unsigned long LOAD_FLAGS_FIXUP_SCHEME_TYPOS = 0x200000;

This takes us all the way back to EmbedLiteViewChild::RecvLoadURL() which I already copied out above. There we can see that these flags are dependent on the value of sAllowKeyWordURL. This is a global static variable, which is being set to false. Maybe that's our problem there.

Setting the flags using the debugger doesn't seem to make any difference:

Thread 8 "GeckoWorkerThre" hit Breakpoint 3, mozilla::embedlite::
    EmbedLiteViewChild::RecvLoadURL (this=0x7f88d03170, url=...)
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/mobile/sailfishos/
    embedshared/EmbedLiteViewChild.cpp:534
534     {
(gdb) p sAllowKeyWordURL
$11 = <optimized out>
(gdb) n
535       LOGT("url:%s", NS_ConvertUTF16toUTF8(url).get());
(gdb) n
545       LoadURIOptions loadURIOptions;
(gdb) p /x flags
$13 = 0x40000
(gdb) n
546       loadURIOptions.mTriggeringPrincipal = nsContentUtils::GetSystemPrincipal();
(gdb) n
547       loadURIOptions.mLoadFlags = flags;
(gdb) n
548       mWebNavigation->LoadURI(url, loadURIOptions);
(gdb) set loadURIOptions.mLoadFlags = 0x340000
(gdb) p /x loadURIOptions.mLoadFlags
$14 = 0x340000
(gdb) c
Continuing.
JavaScript error: resource://gre/modules/URIFixup.jsm, line 446: NS_ERROR_MALFORMED_URI: Couldn't build a valid uri
[Parent 24558: Unnamed thread 7f88002670]: D/SessionHistory Creating an active entry on nsDocShell 7f88b0c190 to test search (because we're showing an error page)

So I've edited the file to set sAllowKeyWordURL to be true:

$ git diff embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
diff --git a/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
           b/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
index e06d68947ae9..5761689b6877 100644
--- a/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
+++ b/embedding/embedlite/embedshared/EmbedLiteViewChild.cpp
@@ -81,7 +81,7 @@ static struct {
     bool longTap;
 } sPostAZPCAsJson;
 
-static bool sAllowKeyWordURL = false;
+static bool sAllowKeyWordURL = true;
 
 static void ReadAZPCPrefs()
 {

This is in the C++ code, so I'll need to build it to test it. I've done a partial rebuild, but unfortunately that doesn't seem to have been enough. Given changing the variable using gdb and the partial build give the same results, I suspect there's more that needs changing.

Nevertheless, it's late and time to stop for the day, so I may as well run the build overnight to see whether that makes a difference. Just in case. I can then continue looking at this tomorrow.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

28 Nov 2023 : Day 91 #

Over the last couple of days we were looking at an error coming from the URLQueryStrippingListService JavaScript code. Although I committed some sort of "fix", it was a bit unsatisfactory because the fix was essentially to suppress the error. The error itself was harmless, but maybe it was masking some more serious issue?

This morning I committed the changes and recorded the details on the issue. Today I want to move on to something else. This is another JavaScript issue with error output showing in the console:

JavaScript Error: "TypeError: Services.search.addEngine is not a function"
    {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteSearchEngine.js" line: 53}

This looks like it should be an easy one, but you never know. You really never know.

The first thing to note is that this error is coming from EmbedLiteSearchEngine.js which isn't part of the gecko-dev source tree, but rather comes from embedlite-components. Here's what it looks like:

      case "embedui:search": {
        var data = JSON.parse(aData);
        switch (data.msg) {
          case "loadxml": {
            Services.search.addEngine(data.uri, null, data.confirm).then(
              engine => {
                var message = {
                  "msg": "search-engine-added",
                  "engine": (engine && engine.name) || "",
                  "errorCode": 0,
                }
                Services.obs.notifyObservers(null, "embed:search",
                  JSON.stringify(message));
              },
              errorCode => {
                // For failure conditions see nsISearchService.idl
                var message = {
                  "msg": "search-engine-added",
                  "engine": "",
                  "errorCode": errorCode
                }
                Services.obs.notifyObservers(null, "embed:search",
                  JSON.stringify(message));
              }
            );
            break;
          }

It's the Services.search.addEngine() line that's causing the error. It's not complaining about search existing, so this would imply that Services.search exists but doesn't have a method called addEngine().

For the curious, this code is used to add "new" search engines to the browser. Some search provider sites (e.g. Wikipedia or Qwant) supply a small piece of XML code conforming to the OpenSearch standard. This explains to the browser the REST API that can be used to access the search capability. For example, it details what URL to use, what parameter to call the search term and how to get suggestions.

The Sailfish Browser identifies these OpenSearch hints when you visit a page that provides one and adds them to the list of available search engines in the Settings. This particular bit of code asks for the search provider to be installed using this OpenSearch XML file.

But as we've seen, in ESR 91 this no longer exists. To find out why not we have to check the Services.search module that's supposed to be offering the functionality.

You'll recall from a couple of days back that there was something similar happening with Services.appinfo. It's quite possible we have the same thing happening again.

The first thing to do is find out where this search module is. A quick grep of the code shows that, at least in a standard gecko build, it's defined in gecko-dev/toolkit/components/search/components.conf like this:

    {
        'js_name': 'search',
        'cid': '{7319788a-fe93-4db3-9f39-818cf08f4256}',
        'contract_ids': ['@mozilla.org/browser/search-service;1'],
        'interfaces': ['nsISearchService'],
        'jsm': 'resource://gre/modules/SearchService.jsm',
        'constructor': 'SearchService',
        'processes': ProcessSelector.MAIN_PROCESS_ONLY,
    },

I can't find any evidence that we override this for sailfish-browser, so we should take a look at the SearchService.jsm file referenced in the components configuration.

In the ESR 78 version of the file we can see the method that we need:

  async addEngine(engineURL, iconURL, confirm, extensionID) {
    SearchUtils.log('addEngine: Adding "' + engineURL + '".');
    await this.init();
    let errCode;
    try {
      var engine = new SearchEngine({
        uri: engineURL,
        isBuiltin: false,
      });
[...]

The method doesn't exist in ESR 91, which is likely why the error is appearing. There is however this method which looks like it should be doing something similar:

  async addOpenSearchEngine(engineURL, iconURL) {
    logConsole.debug("addEngine: Adding", engineURL);
    await this.init();
    let errCode;
    try {
      var engine = new OpenSearchEngine();

Before switching over to it, I'm going to check the history to check that this really is what we want. I also need to figure out what to do with the confirm parameter that is no longer passed in.

Here's the git blame from ESR 78:

$ git blame toolkit/components/search/SearchService.jsm -L 2707
Blaming lines:  29% (1138/3844), done.
94bb650cd4eea toolkit/components/search/SearchService.jsm
    (Andreea Pavel            2020-04-22 22:13:03 +0300 2707)
    async addEngine(engineURL, iconURL, confirm, extensionID) {
4a06c925ac8ae toolkit/components/search/SearchService.jsm
    (Victor Porof             2019-07-05 11:14:05 +0200 2708)
    SearchUtils.log('addEngine: Adding "' + engineURL + '".');
caceac87e1c78 toolkit/components/search/SearchService.jsm
    (Dale Harvey              2020-02-04 22:22:43 +0000 2709)
    await this.init();
481ae95c00e75 toolkit/components/search/nsSearchService.js
    (Mike de Boer             2019-02-02 11:27:21 +0000 2710)
    let errCode;
56ee1324c98ca browser/components/search/nsSearchService.js
    (gavin%gavinsharp.com     2006-03-17 07:16:00 +0000 2711)
    try {
7fa6125c63f18 toolkit/components/search/SearchService.jsm
    (Mark Banner              2019-05-01 18:51:04 +0000 2712)
    var engine = new SearchEngine({
7fa6125c63f18 toolkit/components/search/SearchService.jsm
    (Mark Banner              2019-05-01 18:51:04 +0000 2713)
    uri: engineURL,
690733b0642c3 toolkit/components/search/SearchService.jsm
    (Mark Banner              2020-03-20 16:44:05 +0000 2714)
    isBuiltin: false,
7fa6125c63f18 toolkit/components/search/SearchService.jsm
    (Mark Banner              2019-05-01 18:51:04 +0000 2715)
    });

And here's the git blame from the new method in ESR 91:

$ git blame toolkit/components/search/SearchService.jsm -L 1866
Blaming lines:  37% (1104/2969), done.
e70637b6d9207 toolkit/components/search/SearchService.jsm
    (Mark Banner              2020-07-09 09:56:25 +0000 1866)
    async addOpenSearchEngine(engineURL, iconURL) {
ed636bbf8b0b5 toolkit/components/search/SearchService.jsm
    (Mark Banner              2020-06-04 21:47:57 +0000 1867)
    logConsole.debug("addEngine: Adding", engineURL);
caceac87e1c78 toolkit/components/search/SearchService.jsm
    (Dale Harvey              2020-02-04 22:22:43 +0000 1868)
    await this.init();
481ae95c00e75 toolkit/components/search/nsSearchService.js
    (Mike de Boer             2019-02-02 11:27:21 +0000 1869)
    let errCode;
56ee1324c98ca browser/components/search/nsSearchService.js
    (gavin%gavinsharp.com     2006-03-17 07:16:00 +0000 1870)
    try {
269a6fbae9928 toolkit/components/search/SearchService.jsm
    (Mark Banner              2021-02-10 18:12:08 +0000 1871)
    var engine = new OpenSearchEngine();

What's encouraging here is that the following three lines don't just share code, they also share commit history:

caceac87e1c78 toolkit/components/search/SearchService.jsm
    (Dale Harvey              2020-02-04 22:22:43 +0000 1868)
    await this.init();
481ae95c00e75 toolkit/components/search/nsSearchService.js
    (Mike de Boer             2019-02-02 11:27:21 +0000 1869)
    let errCode;
56ee1324c98ca browser/components/search/nsSearchService.js
    (gavin%gavinsharp.com     2006-03-17 07:16:00 +0000 1870)
    try {

That suggests that the addOpenSearchEngine() method was built around the code from addEngine(). The fact that nobody bothered to update the log messages, which still use "addEngine" is also a good sign. The changes to the method name appear to have been made in commit e70637b6d9207:

$ git log -1 e70637b6d9207
commit e70637b6d92078216ad927bf23fb83e6619ef2e1
Author: Mark Banner <standard8@mozilla.com>
Date:   Thu Jul 9 09:56:25 2020 +0000

    Bug 1632448 - Remove now unused confirm and extensionID parameters for
    nsISearchService.addEngine, and rename it. r=daleharvey
    
    Renaming to addOpenSearchEngine to make it more explicit about what it is
    actually doing.
    
    Depends on D82524
    
    Differential Revision: https://phabricator.services.mozilla.com/D82525

Okay, that's very encouraging indeed. Based on all this I'm convinced that switching to use addOpenSearchEngine() is the fix we need. From the commit description it's also clear that the confirm parameter can also be safely dropped, but because that parameter is being passed in to EmbedLiteSearchEngine.js by whoever is calling it, that may need a little extra work to unravel.

Here's the simple fix that I'm applying in embedlite-components:

$ git diff
diff --git a/jscomps/EmbedLiteSearchEngine.js b/jscomps/EmbedLiteSearchEngine.js
index b224f10..b3fc3a3 100644
--- a/jscomps/EmbedLiteSearchEngine.js
+++ b/jscomps/EmbedLiteSearchEngine.js
@@ -50,7 +50,7 @@ EmbedLiteSearchEngine.prototype = {
         var data = JSON.parse(aData);
         switch (data.msg) {
           case "loadxml": {
-            Services.search.addEngine(data.uri, null, data.confirm).then(
+            Services.search.addOpenSearchEngine(data.uri, null).then(
               engine => {
                 var message = {
                   "msg": "search-engine-added",

The embedui:search event is triggered from a few places, all of which are in the sailfish-browser code. That makes sense as this is essentially the interface used to allow the QML front end to interact with the gecko backend. Here are the cases where it appears:

$ grep -rIn "embedui:search" *
apps/core/datafetcher.cpp:169:
    SailfishOS::WebEngine::instance()->notifyObservers(QLatin1String(
    "embedui:search"), QVariant(loadsearch));
apps/core/settingmanager.cpp:126:
    webEngine->notifyObservers(QLatin1String("embedui:search"),
    QVariant(defaultSearchEngine));
apps/core/settingmanager.cpp:160:
    webEngine->notifyObservers(QLatin1String("embedui:search"),
    QVariant(loadsearch));
apps/core/settingmanager.cpp:170:
    webEngine->notifyObservers(QLatin1String("embedui:search"),
    QVariant(removeMsg));
apps/browser/settings/searchenginemodel.cpp:168:
    SailfishOS::WebEngine::instance()->notifyObservers(QLatin1String(
    "embedui:search"), QVariant(message));

Of these, only the following use the loadxml topic. These are the ones we're really interested in:

apps/core/datafetcher.cpp:169:
    SailfishOS::WebEngine::instance()->notifyObservers(QLatin1String(
    "embedui:search"), QVariant(loadsearch));
apps/core/settingmanager.cpp:160:
    webEngine->notifyObservers(QLatin1String("embedui:search"),
    QVariant(loadsearch));

These all pass in a confirm parameter. Strictly speaking we could leave these as they are since the parameter will now just be ignored, but for simplicity and cleanliness of code it makes sense to remove them. So I've also made these changes to the sailfish-browser code:

$ git diff
diff --git a/apps/core/datafetcher.cpp b/apps/core/datafetcher.cpp
index 8aea2874..bcef476f 100644
--- a/apps/core/datafetcher.cpp
+++ b/apps/core/datafetcher.cpp
@@ -165,7 +165,6 @@ void DataFetcher::saveAsSearchEngine()
                 QVariantMap loadsearch;
                 loadsearch.insert(QLatin1String("msg"), QVariant(QLatin1String("loadxml")));
                 loadsearch.insert(QLatin1String("uri"), QVariant(url.toString()));
-                loadsearch.insert(QLatin1String("confirm"), QVariant(false));
                 SailfishOS::WebEngine::instance()->notifyObservers(QLatin1String("embedui:search"), QVariant(loadsearch));
 
                 updateStatus(Ready);
diff --git a/apps/core/settingmanager.cpp b/apps/core/settingmanager.cpp
index fee5eb28..e84afa57 100644
--- a/apps/core/settingmanager.cpp
+++ b/apps/core/settingmanager.cpp
@@ -156,7 +156,6 @@ void SettingManager::handleObserve(const QString &message, const QVariant &data)
                     // load opensearch descriptions
                     loadsearch.insert(QLatin1String("msg"), QVariant(QLatin1String("loadxml")));
                     loadsearch.insert(QLatin1String("uri"), QVariant(QString("file://%1").arg(configs[searchName])));
-                    loadsearch.insert(QLatin1String("confirm"), QVariant(false));
                     webEngine->notifyObservers(QLatin1String("embedui:search"), QVariant(loadsearch));
                 }
             }

Straightforward stuff, but making these changes still feels good.

Making these changes fixes the error output, which is great. While testing the changes I also experienced the following error, which I've created a new issue for:

JavaScript Error: "TypeError: XPCOMUtils.generateNSGetFactory is not a function"
    {file: "file:///usr/lib64/mozembedlite/components/PromptService.js" line: 955}

In fact, if I try to search in the address bar I get the following couple of errors:

JavaScript error: resource://gre/modules/URIFixup.jsm, line 443:
    NS_ERROR_MALFORMED_URI: Couldn't build a valid uri
JavaScript error: file:///usr/lib64/mozembedlite/components/PromptService.js,
    line 955: TypeError: XPCOMUtils.generateNSGetFactory is not a function

These two errors prevent the search from working at all so will need to be fixed.

I reckon the second of these about XPCOMUtils.generateNSGetFactory not being a function may be the same as the issue we saw on Day 88. In that case the error was coming from EmbedLiteGlobalHelper.js and we fixed it by switching XPCOMUtils to ComponentUtils. Maybe the same thing will work again.

The PromptService.js file can be found in embedlite-components, so I've made the equivalent changes to that:

$ git diff jscomps/PromptService.js
diff --git a/jscomps/PromptService.js b/jscomps/PromptService.js
index 86117f4..996d8fb 100644
--- a/jscomps/PromptService.js
+++ b/jscomps/PromptService.js
@@ -6,6 +6,7 @@ const Cc = Components.classes;
 const Cr = Components.results;
 const Cu = Components.utils;
 
+const { ComponentUtils } = ChromeUtils.import("resource://gre/modules/ComponentUtils.jsm");
 const { XPCOMUtils } = ChromeUtils.import("resource://gre/modules/XPCOMUtils.jsm");
 const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");
 const { PrivateBrowsingUtils } = ChromeUtils.import("resource://gre/modules/PrivateBrowsingUtils.jsm");
@@ -952,4 +953,4 @@ AuthPromptAdapter.prototype = {
   }
 };
 
-this.NSGetFactory = XPCOMUtils.generateNSGetFactory([PromptService, AuthPromptAdapterFactory]);
+this.NSGetFactory = ComponentUtils.generateNSGetFactory([PromptService, AuthPromptAdapterFactory

The next error I need to address is the NS_ERROR_MALFORMED_URI that's coming from docshell/base/URIFixup.jsm. More specifically it's happening in a call to the getFixupURIInfo(uriString, fixupFlags = FIXUP_FLAG_NONE) method.

The method is trying — really quite hard in fact — to interpret the text I enter as a URL of some sort. After all other attempts to force it into something coherent have failed it falls through into this bit of code, which is the code generating the error:

    if (!info.preferredURI) {
      // We couldn't salvage anything.
      throw new Components.Exception(
        "Couldn't build a valid uri",
        Cr.NS_ERROR_MALFORMED_URI
      );
    }

Rather than doing this it should be interpreting the text as a search string and calling the search provider. The URL bar search won't work without this, so it'll need to be fixed.

But it's getting quite late here now, so time to call it a day. I'll return to this in the morning.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

27 Nov 2023 : Day 90 #

Yesterday we were looking at the following error:

JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 227

I put a breakpoint and several log outputs lines in the code and established that the same "profile-after-change" event that's causing the URLQueryStrippingListService module to be initialised is also causing the error to fire. Everything about the sequence of events — event message being sent; module initialising; module receiving the event — seems sensible and exactly as I'd expected it to work, apart from the fact that the error is also being generated.

Searching across the Web doesn't bring up an obvious reason for the error. There are two posts that look relevant, but which don't quite provide an answer. The first is on the Mozilla support forum where one user mentions they're seeing the error, while also experiencing pages displaying poorly.

The error is fixed by making a configuration change. What's not clear from the discussion is whether the configuration change stops the error from appearing in the console, or just fixes the rendering problems. I suspect it's just the latter, since with pages rendering correctly I can imagine the user no longer worrying further about the log output.

I did also track down the configuration change to being equivalent to setting the value of the browser.display.document_color_use option in about:config. Changing the value of this didn't have any effect on whether the error is generated on Sailfish OS (and to be honest, I'd have been surprised if it had).

The second page discusses the same error appearing in the Thunderbird console for release 91.0b6. As of now this bug is still open and there's no proposed fix. That makes me think that this error can't be too important.

There are quite a few modules that use the "profile-after-change" event to trigger their initialisation: InstallerPrefs, CrashService, BHRTelemetryService, CrashMonitor, PurgeTrackerService, URLQueryStrippingListService, nsTerminatorTelemetry, nsUpdateTimerManager, SanityTest, formHistoryStartup, nsUpdateServiceStub.

I looked carefully through the code for all these modules and in the majority of cases the "profile-after-change" is specifically handled, usually to perform some startup/initialisation task. In none of the cases apart from URLQueryStrippingListService is there code that would generate an error if the event is received.

I've convinced myself that the current flow is reasonable and that the error is harmless. At least it's harmless enough for Thunderbird to ignore it.

But I don't much like it so I'm going to patch the file like this:

$ git diff
diff --git a/toolkit/components/antitracking/URLQueryStrippingListService.jsm
           b/toolkit/components/antitracking/URLQueryStrippingListService.jsm
index 991fd0d11f0d..b8eb2c25c358 100644
--- a/toolkit/components/antitracking/URLQueryStrippingListService.jsm
+++ b/toolkit/components/antitracking/URLQueryStrippingListService.jsm
@@ -220,6 +220,10 @@ class URLQueryStrippingListService {
         let prefValue = Services.prefs.getStringPref(data, "");
         this._onPrefUpdate(data, prefValue);
         break;
+      case "profile-after-change":
+        if (!!this.initialized) {
+          Cu.reportError(`Unexpected post-init event ${topic}`);
+        }
       default:
         Cu.reportError(`Unexpected event ${topic}`);
     }

This will mask the error as long as the "profile-after-change" event occurs only before the module has initialised. I'm not totally happy with making this change: it doesn't feel like it should be necessary. I don't see evidence of it triggering on my desktop Firefox instance. But this change does really seem reasonable to me.

If you've been following closely you might wonder why this change should work: the this.initialized variable is set in the _init() method and we already established that this gets called before the event is received.

This is all true. Nevertheless by the time the observe() method is called with a "profile-after-change" topic this.initialized still hasn't been set. That's because _init() is an async method that awaits for certain settings to be updated. While it's waiting the observe() method is triggered. After which the _init() method completes, setting the this.initialized variable in the process. If a second "profile-after-change" message were to be received it will now still cause an error. But the first one won't. That shouldn't have any practical effect, except to clean up our log output a little.

I'm going to leave it there for today. Once I've committed these changes and closed the task I'll pick another one to look at for tomorrow.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

26 Nov 2023 : Day 89 #

Now that I'm back looking at gecko it's been fun to dig into the JavaScript rather than the C++ for a change. There are pros and cons with both, but not having to recompile the code when working with JavaScript makes a nice change. On the other hand, and perhaps ironically, I don't have a good JavaScript debugger to match gdb. Is there even such a thing? I've always found JavaScript debugging to be more painful than it should be... non-sequential event-based programming.

So I've been tackling some of the JavaScript errors generated in the logs. Since we finished off those being generated from EmbedLiteGlobalHelper.js last night, that means this morning it's time to choose another.

Let's go for issue #1045:

JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 224}

The standard stages are going to start looking familiar: find the file, check the line, check the file the line relates to.

So, find the file. The URLQueryStrippingListService.jsm file is in the gecko-dev source tree as ./toolkit/components/antitracking/URLQueryStrippingListService.jsm. The problematic block of code looks like this:

  observe(subject, topic, data) {
    switch (topic) {
      case "xpcom-shutdown":
        this._shutdown();
        break;
      case "nsPref:changed":
        let prefValue = Services.prefs.getStringPref(data, "");
        this._onPrefUpdate(data, prefValue);
        break;
      default:
        Cu.reportError(`Unexpected event ${topic}`);
    }
  }

The error output is coming from the Cu.reportError() call at the end. At this stage it may also be worth noting that the URLQueryStrippingListService.jsm file doesn't exist in the ESR 78 codebase, However there is a very similar block of code in a file called UrlClassifierSkipListService.jsm instead:

  observe(subject, topic, data) {
    if (topic != "nsPref:changed" || data != this.prefName) {
      Cu.reportError(`Unexpected event ${topic} with ${data}`);
      return;
    }

    this.prefValue = Services.prefs.getStringPref(this.prefName, null);
    this.notifyObservers();
  }

Now the error is flagging up the event named profile-after-change. That must be being fired somewhere, so we should find out where.

There are multiple observers put on this event in the embedlite-components code:

$ grep -rIn "profile-after-change" *
jscomps/EmbedliteDownloadManager.js:184:
    Services.obs.addObserver(this, "profile-after-change", false);
jscomps/EmbedliteDownloadManager.js:187:
    case "profile-after-change":
jscomps/EmbedliteDownloadManager.js:188:
    Services.obs.removeObserver(this, "profile-after-change");
jscomps/EmbedLiteSearchEngine.js:30:
    Services.obs.addObserver(this, "profile-after-change", false);
jscomps/EmbedLiteSearchEngine.js:33:
    case "profile-after-change": {
jscomps/EmbedLiteSearchEngine.js:34:
    Services.obs.removeObserver(this, "profile-after-change");
jscomps/EmbedLiteGlobalHelper.js:49:
    Services.obs.addObserver(this, "profile-after-change", false);
jscomps/EmbedLiteGlobalHelper.js:61:
    case "profile-after-change": {

But none of these are responsible for sending the event. If it turns out that the event has been removed, or renamed, we may need to circle back to update all of these references.

The name of the event appears all over the place in the gecko code, so it doesn't look like it's been removed or changed. At least one place where the event is fired is in toolkit/xre/nsXREDirProvider.cpp line 978:

    obsSvc->NotifyObservers(nullptr, "profile-after-change", kStartup);

There's also a similar line in the ESR 78 code — it doesn't look like it's changed — which makes me think this probably isn't the cause. But let's continue looking through the code.

There are other instances, but from what I can tell all the other places are part of the test code, so not what we're interested in. While digging through the grep output I also discover the following in toolkit/profile/notifications.txt which could turn out to be useful:

"profile-after-change"
  Called after the profile has changed. Use this notification
  to make changes that are dependent on what some other listener
  did during its profile-do-change. For example, to respond to
  new preferences.

At this point I'm still quite unclear as to why the error is being triggered. Earlier I mentioned that debugging the JavaScript is more troublesome. In this case it looks like the event might be firing in the C++ code, in which case I can stick a breakpoint on it to see whether it tallies with the error being printed to the log. That might help identify what's firing this event and why the URLQueryStrippingListService event is suddenly so unhappy about receiving it. It feels like an ordering issue to me.

But I'm currently on the train, due to arrive in to St. Pancras Station soon and so not about to start wiring up an SSH session for debugging. So this will have to wait now until this evening.

[...]

I'm on the train back home now and have spent most of the journey working my way through the code linked in Bug 1706608. After reading through it carefully it's still not clear to me why the "profile-after-change" event is triggering an error. I thought maybe the discussion in changeset D117376 was a hint: there it talks about not being able to rely on the event to start up the URLQueryStrippingListService module. But that doesn't explain why the event causes problems.

This is a little frustrating, but maybe once I'm home I'll be able to run some debugging and that might help.

[...]

So now I'm home and in a much better position to performs some debugging. Firs let's do a vanilla run without the debugger to compare against. The thing to notice is that the error message is the next log output line after the UserAgentOverrideHelper app-startup output and before the Created LOG for EmbedPrefs output.

[defaultuser@Xperia10II-DualSIM gecko]$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:0 - Unable to open bookmarks
    "/home/defaultuser/.local/share/org.sailfishos/browser/bookmarks.json"
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 224}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:224:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager

Now I'm going to add in a breakpoint on line 978 of nsXREDirProvider.cpp. This is the one line I was able to identify that sends out a profile-after-change event. Recall that one of the things we'd like to find out is whether this is the event that's triggering the error message, or if there are some instances that I missed in my visual check.

Here's the gdb output from running with this breakpoint. In the debugging output you'll see lots of "New LWP", "Switching to LWP" or "LWP exited" lines. LWP stands for "Lightweight Process" which is the term used for a thread. These lines aren't output when you just run the app, which is why we see them in this gdb output but not in the standard execution output that we saw above. In short, these "LWP" lines can be ignored when comparing the debug output here with the previous debug output.

(gdb) b nsXREDirProvider.cpp:978
[...]
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
Missing separate debuginfo for /lib64/libnss_db.so.2
Try: zypper install -C "debuginfo(build-id)=08373f23b9c39aa3af7ef3008920a379a5e7df7e"
[New LWP 16903]
JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
[LWP 16900 exited]
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
[New LWP 16904]
[New LWP 16905]
[New LWP 16906]
[New LWP 16907]
[New LWP 16908]
[Switching to LWP 16885]

Thread 8 "GeckoWorkerThre" hit Breakpoint 1,
    nsXREDirProvider::DoStartup (this=)
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/toolkit/xre/
    nsXREDirProvider.cpp:978
978	    obsSvc->NotifyObservers(nullptr, "profile-after-change", kStartup);
(gdb) n
Missing separate debuginfo for /usr/lib64/libsoftokn3.so
Try: zypper install -C "debuginfo(build-id)=932611cf165ab104fa7bae7176d0bfb63368f6bc"
Missing separate debuginfo for /usr/lib64/libfreeblpriv3.so
Try: zypper install -C "debuginfo(build-id)=ea9e4dc0f2a49e2a02ae37540a8561bf4f1ac79a"
[New LWP 16941]
[New LWP 16942]
982	    (void)NS_CreateServicesFromCategory("profile-after-change", nullptr,
(gdb) n
Missing separate debuginfo for /usr/lib64/libnssckbi.so
Try: zypper install -C "debuginfo(build-id)=c2e22ddaa9e2d7802fd7ff14979ba02fb6508bcd"
Missing separate debuginfo for /usr/lib64/libtasn1.so.6
Try: zypper install -C "debuginfo(build-id)=95f83c2cec19f8b9030e796b6c25c31a10c185e8"
985	    if (gSafeMode && safeModeNecessary) {
(gdb) n
999	    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SAFE_MODE_USAGE, mode);
(gdb) n
1001	    obsSvc->NotifyObservers(nullptr, "profile-initial-state", nullptr);
(gdb) n
899	        mozilla::components::AppStartup::Service());
(gdb) n
885	        mozilla::services::GetObserverService();
(gdb) n
GeckoLoader::InitEmbedding (aProfilePath=0x555585a310
    "/home/defaultuser/.local/share/org.sailfishos/browser")
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/mobile/sailfishos/utils/
    GeckoLoader.cpp:230
230	  LOGF("InitEmbedding successfully");
(gdb) c
Continuing.
[New LWP 16974]
[LWP 16901 exited]
CONSOLE message:
[LWP 16941 exited]
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 224}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:224:12

Created LOG for EmbedPrefs
Created LOG for EmbedLiteLayerManager
[...]

What's interesting about this output for me is that it's consistent with a situation where the "profile-after-change" event triggered in nsXREDirProvider.cpp is also the event that causes the error.

Here we see UserAgentOverrideHelper app-startup in the output, followed by the breakpoint being hit. I step through seven lines of C++ code before continuing. After continuing we get the error message "Unexpected event profile-after-change" followed by the Created LOG for EmbedPrefs output.

My next thought is that maybe the problem is that URLQueryStrippingListService.jsm is being initialised too early. According to what I've read in the bug reports and comments in the code, it's the "profile-after-change" event that's supposed to trigger the start up of the URLQueryStrippingListService.jsm module. If so, maybe it's already started when the "profile-after-change" event is triggered and that's why the error is happening.

As a little background, we can see some evidence of this reasoning in this comment that can be found in toolkit/xre/nsXREDirProvider.cpp:

    obsSvc->NotifyObservers(nullptr, "profile-after-change", kStartup);

    // Any component that has registered for the profile-after-change category
    // should also be created at this time.
    (void)NS_CreateServicesFromCategory("profile-after-change", nullptr,
                                        "profile-after-change");

This is from directly before the "profile-after-change" event is triggered. We can see that the URLQueryStrippingListService service is registered to be started by this event in the toolkit/components/antitracking/components.conf file. Notice the categories section:

    {
        'cid': '{afff16f0-3fd2-4153-9ccd-c6d9abd879e4}',
        'contract_ids': ['@mozilla.org/query-stripping-list-service;1'],
        'singleton': True,
        'jsm': 'resource://gre/modules/URLQueryStrippingListService.jsm',
        'constructor': 'URLQueryStrippingListService',
        'categories': {'profile-after-change': 'URLQueryStrippingListService'},
    },

To see whether the module is being started up too early — before the "profile-after-change" event is sent — I put debug output in a few further places. Well I didn't actually use debug output, instead I copied the error generating line and used that. So the output will actually appear as errors in the log.

I placed these error generators all in the URLQueryStrippingListService code in the following places:

At the top of the constructor; this should happen at the point when the module is initialised.
At the top of the _init() method. This is where the observers are registered.
At the top of the observe() method. This will be triggered whenever an event message is received.

With these added here's what happens, starting at the breakpoint, which is the point where the "profile-after-change" event is triggered.

Thread 8 "GeckoWorkerThre" hit Breakpoint 1, nsXREDirProvider::DoStartup
    (this=)
    at /usr/src/debug/xulrunner-qt5-91.9.1-1.aarch64/toolkit/xre/
    nsXREDirProvider.cpp:978
978         obsSvc->NotifyObservers(nullptr, "profile-after-change", kStartup);
(gdb) c
Continuing.
Missing separate debuginfo for /usr/lib64/libsoftokn3.so
Try: zypper install -C "debuginfo(build-id)=932611cf165ab104fa7bae7176d0bfb63368f6bc"
Missing separate debuginfo for /usr/lib64/libfreeblpriv3.so
Try: zypper install -C "debuginfo(build-id)=ea9e4dc0f2a49e2a02ae37540a8561bf4f1ac79a"
[New LWP 18838]
[New LWP 18839]
Missing separate debuginfo for /usr/lib64/libnssckbi.so
Try: zypper install -C "debuginfo(build-id)=c2e22ddaa9e2d7802fd7ff14979ba02fb6508bcd"
Missing separate debuginfo for /usr/lib64/libtasn1.so.6
Try: zypper install -C "debuginfo(build-id)=95f83c2cec19f8b9030e796b6c25c31a10c185e8"
[New LWP 18840]
[LWP 18817 exited]
[New LWP 18841]
CONSOLE message:
[JavaScript Error: "URLQUERYSTRIPPINGLISTSERVICE: constructor"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 29}]
URLQueryStrippingListService@resource://gre/modules/
    URLQueryStrippingListService.jsm:29:8

CONSOLE message:
[JavaScript Error: "URLQUERYSTRIPPINGLISTSERVICE: init"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 49}]
_init@resource://gre/modules/URLQueryStrippingListService.jsm:49:8
URLQueryStrippingListService@resource://gre/modules/
    URLQueryStrippingListService.jsm:45:30

CONSOLE message:
[JavaScript Error: "URLQUERYSTRIPPINGLISTSERVICE: observe"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 217}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:217:8

CONSOLE message:
[JavaScript Error: "Unexpected event profile-after-change"
    {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 227}]
observe@resource://gre/modules/URLQueryStrippingListService.jsm:227:12

Created LOG for EmbedPrefs

The code progresses exactly as we might expected and completely consistent with a model where it's the event which causes the class to be instantiated:

On continuing after the breakpoint the constructor is called.
Immediately following this the _init() method is called.
Immediately following this the "profile-after-change" event is received.

So the module is definitely instantiated after nsXREDirProvider sends out the event. It's initialised before the event is received. Immediately after being initialised the event is received. No other events are received in between these two points.

This means that things appear to be working exactly as expected. So why generate an error? I'm going to have to continue trying to find an answer to this question tomorrow.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

25 Nov 2023 : Day 88 #

This morning the build has finished. I'm always happy to see that! I've transferred the packages over to my development phone and installed them.

You'll recall that yesterday and the day before we were working to remove the Services.appinfo errors from the log output while running the browser. The attempted fix was to add some annotations to our @mozilla.org/xre/app-info;1 interface specified in embedding/embedlite/components/components.conf that overrides the standard interface from toolkit/xre/components.conf.

I didn't see any easy way to test it without a rebuild, so has it done the trick?

Let's give it a whirl.

$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:0 - Unable to open bookmarks
    "/home/defaultuser/.local/share/org.sailfishos/browser/bookmarks.json"
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JavaScript error:
    file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js, line 108:
    TypeError: XPCOMUtils.generateNSGetFactory is not a function
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
CONSOLE message:
[JavaScript Error: "TypeError: XPCOMUtils.generateNSGetFactory is not a function"
    {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js"
    line: 108}]
@file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js:108:32
[...]

There's still plenty of output but the Services.appinfo errors are now all gone. Great!

Once I've pushed these changes I'll close the associated issue ticket.

It would be good to clean out all of the really obvious errors from the log. The next one seems to be this:

JavaScript error:
    file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js, line 108:
    TypeError: XPCOMUtils.generateNSGetFactory is not a function

There are a bunch of other errors related to mozembedlite/components components too. During the Sailfish OS Community Meeting this week Raine offered to help with the embedlite-components errors, so I'll need to be a little careful about which ones I pick to work on, to avoid us working on the same thing.

So my next task will be to pick out all of the obvious errors I can see and list them in the issue tracker, so that we can assign them and avoid any conflicts.

I also really need to move my changes over to the official repository. It's definitely reached that point now, I just really don't want to have to turn everything into patches as it's going to be a lot of work. But it needs to be done; maybe I'll have time over the weekend.

It's the end of the morning now, time for work!

[...]

The evening! Time top sort out some of these errors. Here's the list that I can see just at the moment. Raine already submitted issue tickets for several of them and I've created some additional tickets to cover the rest:

#1042 : JavaScript error: chrome://embedlite/content/embedhelper.js, line 259: TypeError: sessionHistory is null
#1044: JavaScript Error: "TypeError: XPCOMUtils.generateNSGetFactory is not a function" {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js" line: 108}
#1045 : JavaScript Error: "Unexpected event profile-after-change" {file: "resource://gre/modules/URLQueryStrippingListService.jsm" line: 224}
#1041: JavaScript Error: "NS_ERROR_FILE_NOT_FOUND: " {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteChromeManager.js" line: 213}
#1046: JavaScript Error: "TypeError: Services.search.addEngine is not a function" {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteSearchEngine.js" line: 53}

My plan was to look at issue #1044 and I think that's a good one to continue with:

JavaScript Error: "TypeError: XPCOMUtils.generateNSGetFactory is not a function"
    {file: "file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js"
    line: 108}

This EmbedLiteGlobalHelper.js file isn't part of the gecko-dev source tree at all but can instead be found in embedlite-components.

Here's the line that's causing the trouble:

this.NSGetFactory = XPCOMUtils.generateNSGetFactory([EmbedLiteGlobalHelper]);

To figure out what's going on I'll need to dig into this generateNSGetFactory() method. It's used quite a lot in the ESR 78 codebase (33 instances) and also in the embedlite-components code (25 times). It's also used quite liberally in the ESR 91 codebase too (29 instances). So why would it be generating an error in this particular place I wonder?

Something I notice while looking through these instances is that some look like this:

const { XPCOMUtils } = ChromeUtils.import(
  "resource://gre/modules/XPCOMUtils.jsm"
);
[...]
this.NSGetFactory = XPCOMUtils.generateNSGetFactory([EmbedLiteGlobalHelper]);

Whereas others look like this:

const { ComponentUtils } = ChromeUtils.import(
  "resource://gre/modules/ComponentUtils.jsm"
);
[...]
this.NSGetFactory = ComponentUtils.generateNSGetFactory([HandlerService]);

Could this be the reason? It seems worth a try. In order to test this out I also need to find out where the EmbedLiteGlobalHelper.js file ends up once it gets installed onto a phone. This shouldn't be too hard to figure out. There are probably clever, correct, ways to do this, but this way should also work:

$ find /usr -iname "EmbedLiteGlobalHelper.js" 2>/dev/null
/usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js

I've edited the file on-device to make the following changes:

$ git diff
--- a/EmbedLiteGlobalHelper.js
+++ b/EmbedLiteGlobalHelper.js
@@ -6,7 +6,7 @@ const Cc = Components.classes;
 const Ci = Components.interfaces;
 const Cr = Components.results;
 
-const { XPCOMUtils } = ChromeUtils.import("resource://gre/modules/XPCOMUtils.jsm");
+const { ComponentUtils } = ChromeUtils.import("resource://gre/modules/ComponentUtils.jsm");
 const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");
 const { LoginManagerParent } = ChromeUtils.import("resource://gre/modules/LoginManagerParent.jsm");
 const { L10nRegistry, FileSource } = ChromeUtils.import("resource://gre/modules/L10nRegistry.jsm");
@@ -105,4 +105,4 @@ EmbedLiteGlobalHelper.prototype = {
   QueryInterface: ChromeUtils.generateQI([Ci.nsIObserver, Ci.nsISupportsWeakReference, Ci.nsIFormSubmitObserver])
 };
 
-this.NSGetFactory = XPCOMUtils.generateNSGetFactory([EmbedLiteGlobalHelper]);
+this.NSGetFactory = ComponentUtils.generateNSGetFactory([EmbedLiteGlobalHelper]);

Happily the "XPCOMUtils.generateNSGetFactory is not a function" error no longer appears, but now I get a couple of new errors:

JavaScript error: file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js,
    line 32: TypeError: ActorManagerParent.flush is not a function
JavaScript error: file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js,
    line 34: TypeError: L10nRegistry.registerSource is not a function

These errors are a positive sign: they suggest execution is getting further into the module than it did before. To take the first error, I can see the flush() method in the ESR 78 code. Here's what it looks like:

  /**
   * Serializes the current set of registered actors into ppmm.sharedData, for
   * use by ActorManagerChild. This must be called before any frame message
   * managers have been created. It will have no effect on existing message
   * managers.
   */
  flush() {
    for (let [name, data] of this.childGroups) {
      sharedData.set(`ChildActors:${name || ""}`, data);
    }
    sharedData.set("ChildSingletonActors", this.singletons);
  },

This has been removed from the ESR 91 code though, which is why the error is triggering. Let's find out why it was removed.

$ git log -S "flush" -1 ./toolkit/modules/ActorManagerParent.jsm
commit c5eff6620590dbf2c5eceb62a1c42fe8e71c8f48
Author: Neil Deakin 
Date:   Fri Nov 6 15:46:11 2020 +0000

    Bug 1649843, remove now unused legacy actor code, r=kmag
    
    Differential Revision: https://phabricator.services.mozilla.com/D95206

What I'm seeing in this diff is that all of the calls to ActorManagerParent.flush() have been removed. It's not being immediately replaced by something in this diff, but that doesn't mean that there wasn't something else added earlier (or later). In many places I'm seeing calls like this, which don't seem to appear in the EmbedLite code:

    ActorManagerParent.addJSProcessActors(JSPROCESSACTORS);
    ActorManagerParent.addJSWindowActors(JSWINDOWACTORS);

These all seem to relate to JSActors which have some associated documentation. But that's a rabbit hole that's going to be too complex to go down today. So today I'm just going to remove the call to flush() to match what's happening in the diff. Maybe we'll have to return to this later.

Now let's try to tackle the second error related to L10nRegistry.registerSource(). Checking the ESR 78 and ESR 91 code against each other, it looks like L10nRegistry.registerSources() (plural) is now used instead. Let's check this.

First from ESR 78:

  /**
   * Adds a new resource source to the L10nRegistry.
   *
   * @param {FileSource} source
   */
  registerSource(source) {
    if (this.hasSource(source.name)) {
      throw new Error(`Source with name "${source.name}" already registered.`);
    }
    this.sources.set(source.name, source);

    if (isParentProcess) {
      this._synchronizeSharedData();
      Services.locale.availableLocales = this.getAvailableLocales();
    }
  }

And then from ESR 91:

  /**
   * Adds new resource source(s) to the L10nRegistry.
   *
   * Notice: Each invocation of this method flushes any changes out to extant
   * content processes, which is expensive. Please coalesce multiple
   * registrations into a single sources array and then call this method once.
   *
   * @param {Array} sources
   */
  registerSources(sources) {
    for (const source of sources) {
      if (this.hasSource(source.name)) {
        throw new Error(`Source with name "${source.name}" already registered.`);
      }
      this.sources.set(source.name, source);
    }
    if (isParentProcess && sources.length > 0) {
      this._synchronizeSharedData();
      Services.locale.availableLocales = this.getAvailableLocales();
    }
  }

So they are different, but in practice it just means that we need to pass in an array of resources, rather than just a single resource. Checking with git blame also takes us back to the upstream diff and associated Bugzilla bug:

$ git log -1 391ecb85154ba
commit 391ecb85154bab7416ab01f3de35f506327704c8
Author: Aaron Klotz 
Date:   Fri Jun 26 21:04:21 2020 +0000

    Bug 1648631: Part 1 - Make L10nRegistryService._synchronizeSharedData() explicitly flush and convert source registration to accept arrays; r=zbraniecki
    
    Setting a key/value pair on the parent process message manager's `sharedData`
    initiates a pending runnable that runs at **idle priority**. If the current
    thread never gets a chance to idle, then those registry changes will never be
    synchronized.
    
    This patch adds an explicit flush so that `_synchronizeSharedData` does indeed
    actually synchronize.
    
    We replace the scalar `registerSource`, `updateSource`, and `removeSource` with
    variants that accept arrays instead. This allows us to process multiple
    registration changes while deferring the synchronization until the entire list
    has been processed. The scalar variants were removed to future-proof against
    perf issues.
    
    Differential Revision: https://phabricator.services.mozilla.com/D81243

That looks like pretty good corroboration.

Here are the remaining changes I've made to address these two errors:

 function EmbedLiteGlobalHelper()
 {
-  ActorManagerParent.flush();
-
-  L10nRegistry.registerSource(new FileSource(
-                                  "0-mozembedlite",
-                                  ["en-US", "fi", "ru"],
-                                  "chrome://browser/content/localization/{locale}/"))
+  L10nRegistry.registerSources([new FileSource(
+                                   "0-mozembedlite",
+                                   ["en-US", "fi", "ru"],
+                                   "chrome://browser/content/localization/{locale}/")])
 
   Logger.debug("JSComp: EmbedLiteGlobalHelper.js loaded");
 }

I've edited these changes directly on my device and now when I run the browser I just see the following coming up in relation to EmbedLiteGlobalHelper.js:

JSComp: EmbedLiteGlobalHelper.js loaded
EmbedLiteGlobalHelper app-startup

Much better! As I mentioned before, these changes aren't part of the gecko-dev codebase but rather make their way into the embedlite-components repository instead.

That'll have to be all for today. More tomorrow!

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

24 Nov 2023 : Day 87 #

Yesterday we started looking at the errors output from the privileged JavaScript shims, in particular the errors related to Services.appinfo like this one:

JavaScript error: resource://gre/modules/LoginRecipes.jsm, line 56:
    TypeError: Services.appinfo is undefined

What we found was that the Services object is being instantiated correctly but without the appinfo member. If we hack this back in manually, the errors go away.

So we need to figure out why Services is missing this piece. Running git blame on the Services.jsm file gives us a clue:

$ git blame toolkit/modules/Services.jsm
Blaming lines: 100% (7/7), done.
82ff7027aac0f Services.jsm (Gervase Markham 2012-05-21 12:12:37 +0100 1)
     /* This Source Code Form is subject to the terms of the Mozilla Public
82ff7027aac0f Services.jsm (Gervase Markham 2012-05-21 12:12:37 +0100 2)
      * License, v. 2.0. If a copy of the MPL was not distributed with this
82ff7027aac0f Services.jsm (Gervase Markham 2012-05-21 12:12:37 +0100 3)
      * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
eaf57306e97c0 Services.jsm (Gavin Sharp     2010-01-28 13:31:45 -0500 4) 
682b1ec3b2c92 Services.jsm (Florian Quèze   2018-02-23 20:50:01 +0100 5)
    var EXPORTED_SYMBOLS = ["Services"];
eaf57306e97c0 Services.jsm (Gavin Sharp     2010-01-28 13:31:45 -0500 6) 
a1cb850855b77 Services.jsm (Kris Maglione   2020-07-09 21:42:53 +0000 7)
    var Services = Cu.createServicesCache();

Strictly speaking we're interested in why all of the following lines were removed, but these aren't shown here because only lines that exist in the latest version are shown. But we can use the last line to find out. It's more than likely this line was changed at the same time the others were removed.

Checking the diff for this change confirms this. The amount of removed code is large, so I'm not going to show it all here, but here's an extract to provide a flavour:

$ git diff a1cb850855b77~ a1cb850855b77 toolkit/modules/Services.jsm
diff --git a/toolkit/modules/Services.jsm b/toolkit/modules/Services.jsm
index 0dc46f4fe5f8..9eb47cd45653 100644
--- a/toolkit/modules/Services.jsm
+++ b/toolkit/modules/Services.jsm
@@ -2,171 +2,6 @@
  * License, v. 2.0. If a copy of the MPL was not distributed with this
  * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
 
-/* eslint mozilla/use-services:off */
-
[...]
-XPCOMUtils.defineLazyServiceGetters(Services, initTable);
-
-initTable = undefined;
+var Services = Cu.createServicesCache();

Let's have a read of the git log associated with this change.

$ git log -1 a1cb850855b77
commit a1cb850855b77c78710637bd946ab62d14ba1b0d
Author: Kris Maglione 
Date:   Thu Jul 9 21:42:53 2020 +0000

    Bug 1464542: Part 3c - Change Services.jsm to use the C++-implemented
    services cache. r=mccr8
    
    Differential Revision: https://phabricator.services.mozilla.com/D81420

Checking this log message leads us back to the Phabricator changeset, which leads us to the Bugzilla Bug, which leads us to the full set of six changesets that are associated with the bug. One of those is changeset D81419 with the title "Part 3b - Add existing Services.jsm registrations to the new services cache". This is the one that's especially of interest to us.

If we take a look at the changes to toolkit/xre/components.conf in this changeset we can see that some additional annotations have been added to the appinfo component definition.

Referring back to the original issue in Bugzilla, we can see that the purpose was to migrate the code in Services.jsm from this JavaScript file to the C++ code. As part of that process an approach was implemented to automatically attach classes to Services, as we can see from this comment:

Any class entry with a 'js_name' attribute automatically becomes available on the services cache with that name, and any interfaces listed in its 'interfaces' list are automatically queried on it.

I believe this is our smoking gun.

It's early morning here and I have to head to work, so I'll continue digging in to these changes this evening.

[...]

It's evening now, time to get back to gecko debugging. So it seems that the problem is that the appinfo interface that used to be provided in JavaScript is now provided through a C++ interface. Annotations added to the components.conf file make these interfaces available in a persistent form through the Services interface.

However, with EmbedLite, for both ESR 78 and ESR 91 we specifically remove the default appinfo interface and replace it with our own. This is done through patch 0072 which has the title "Cleanup static components definitions". Here's our EmbedLite change in action:

diff --git a/toolkit/xre/components.conf b/toolkit/xre/components.conf
 Classes = [
-    {
-        'cid': '{95d89e3e-a169-41a3-8e56-719978e15b12}',
-        'contract_ids': [
-            '@mozilla.org/xre/app-info;1',
-            '@mozilla.org/xre/runtime;1',
-        ] + crash_reporter,
-        'legacy_constructor': 'mozilla::AppInfoConstructor',
-        'headers': ['nsAppRunner.h'],
-        'processes': ProcessSelector.ALLOW_IN_SOCKET_PROCESS,
-    },

Although it may not be entirely clear from this, this change is removing the appinfo class from the configuration. The upstream diff that's applied between ESR 78 and ESR 91 would have changed this section — had it not been deleted — to the following:

Classes = [
    {
        'js_name': 'appinfo',
        'cid': '{95d89e3e-a169-41a3-8e56-719978e15b12}',
        'contract_ids': [
            '@mozilla.org/xre/app-info;1',
            '@mozilla.org/xre/runtime;1',
        ] + crash_reporter,
        'interfaces': ['nsIXULRuntime', 'nsIXULAppInfo'],
        'legacy_constructor': 'mozilla::AppInfoConstructor',
        'headers': ['nsAppRunner.h'],
        'processes': ProcessSelector.ALLOW_IN_SOCKET_PROCESS,
        'overridable': True,
    },

This update version is clearer in my view, because it makes the appinfo connection explicit. In EmbedLite this gets added back in to the embedding/embedlite/components/components.conf file. But of course, currently it doesn't have the additional annotations. So my plan is to add in the annotations, perform a rebuild and see whether this helps. Here's the change I've made:

$ git diff
diff --git a/embedding/embedlite/components/components.conf
           b/embedding/embedlite/components/components.conf
index b13472df7051..a86a9430c57c 100644
--- a/embedding/embedlite/components/components.conf
+++ b/embedding/embedlite/components/components.conf
@@ -14,15 +14,18 @@ Classes = [
         'headers': ['EmbedWidgetFactoryRegister.h'],
     },
     {
+        'js_name': 'appinfo',
         'cid': '{bb11767a-9c26-11e2-bfb2-9f3b52956e00}',
         'contract_ids': [
             '@mozilla.org/xre/app-info;1',
             '@mozilla.org/xre/runtime;1'
         ],
+        'interfaces': ['nsIXULRuntime', 'nsIXULAppInfo'],
         'singleton': True,
         'type': 'mozilla::embedlite::EmbedLiteXulAppInfo',
         'categories': {'app-startup': 'EmbedLite Xul App Info Component'},
         'constructor': 'mozilla::embedlite::EmbedLiteXulAppInfo::GetSingleton',
-        'headers': ['/include/mozilla/embedlite/EmbedLiteXulAppInfo.h']
+        'headers': ['/include/mozilla/embedlite/EmbedLiteXulAppInfo.h'],
+        'overridable': True,
     },
 ]

Having made this change I now need to perform the rebuild. Unfortunately it's going to be a full rebuild because this is quite an intrusive changes. Tomorrow I'll see whether it's had any positive effect.

This is a small change. If it fixes it then that will be great, if it doesn't it'll need more investigation. But already I feel like we're making progress again. I much prefer these small, self-contained tasks compared to the lengthy open-ended debugging tasks (like the rendering fixes). I'm looking forward to more of these.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

23 Nov 2023 : Day 86 #

Getting back into the swing of things this morning. Yesterday we looked at the changes Raine made to get APZ (Asynchronous Pan/Zoom) working. Today I'm going to start hacking away at the privileged JavaScript layer.

We've covered Mozilla's Inter-Process Definition Language in previous diary entries. Gecko uses IPDL files which define an interface which is then used to automatically generate a virtual C++ class. You can identify IPDL files by their .ipdl extension.

Gecko uses a similar approach for communication between C++ and JavaScript based on the W3C's Web IDL (Web Interface Definition Language). Again, you write interface definition files, in this case using the .webidl extension. They look pretty similar to IPDL files. These files are also used to automatically generate C++ virtual classes that can then be implemented in the C++ code.

In addition to being accessible from C++, gecko also then exposes these classes to its JavaScript parser. Gecko provides a lot of functionality like this. Things like cookie management, permissions, stored logins, in-page search and much, much, more are exposed to the JavaScript layer in this way.

The result is that between the user interface and the C++ library there is also a layer of JavaScript. For security reasons not all JavaScript can access these IDL interfaces (you wouldn't want any random piece of JavaScript downloaded from the Web accessing your stored passwords!), but in this layer between the user interface and the library the JavaScript is privileged. That is, it has special access to these interfaces.

EmbedLite has its own set of JavaScript modules that live in this privileged layer. I call them shims because they fill a thin gap between the two layers on either side. Often they're just shepherding data from one side to the other, but you can fill them up with all sorts of interesting functionality. For example, the User Agent exceptions that the Sailfish Browser uses are managed by the ?? JavaScript module.

Right now, if you run ESR 91 with debugging output enabled, you'll see a lot of output, including errors, thrown out to the console. Many of these will be coming from this JavaScript layer.

These JavaScript errors aren't fatal: they won't crash the application. But they more than likely will cause the particular JavaScript module they're coming from to fail. This makes the overall browser more robust, but makes individual functions prone to silent failure.

Many of these errors are likely to be because one of the IDL files was changed between ESR 78 and ESR 91, so that the interfaces used by the JavaScript aren't what was expected by our EmbedLite layer. Let's take a look.

$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:0 - Unable to open bookmarks
    "/home/defaultuser/.local/share/org.sailfishos/browser/bookmarks.json"
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JavaScript error: resource://gre/modules/LoginRecipes.jsm, line 56:
    TypeError: Services.appinfo is undefined
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
JavaScript error: 
    file:///usr/lib64/mozembedlite/components/UserAgentOverrideHelper.js,
    line 110: TypeError: Services.appinfo is undefined
JavaScript error: resource://gre/modules/EnterprisePoliciesParent.jsm,
    line 500: TypeError: Services.appinfo is undefined
JavaScript error: resource://gre/modules/AddonManager.jsm, line 1479:
    NS_ERROR_NOT_INITIALIZED: AddonManager is not initialized
JavaScript error: resource://gre/modules/URLQueryStrippingListService.jsm,
    line 42: TypeError: Services.appinfo is undefined
CONSOLE message:
[JavaScript Error: "TypeError: Services.appinfo is undefined"
    {file: "resource://gre/modules/LoginRecipes.jsm" line: 56}]
LoginRecipesParent@resource://gre/modules/LoginRecipes.jsm:56:7
get recipeParentPromise@resource://gre/modules/LoginManagerParent.jsm:1429:24
@file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js:17:1

CONSOLE message:
[JavaScript Error: "TypeError: Services.appinfo is undefined"
    {file: "file:///usr/lib64/mozembedlite/components/UserAgentOverrideHelper.js"
    line: 110}]
ua_init@file:///usr/lib64/mozembedlite/components/UserAgentOverrideHelper.js:110:19
observe@file:///usr/lib64/mozembedlite/components/UserAgentOverrideHelper.js:43:19

CONSOLE message:
[JavaScript Error: "TypeError: Services.appinfo is undefined" {file:
    "resource://gre/modules/EnterprisePoliciesParent.jsm" line: 500}]
_getConfigurationFile@resource://gre/modules/EnterprisePoliciesParent.jsm:500:1
_readData@resource://gre/modules/EnterprisePoliciesParent.jsm:556:27
JSONPoliciesProvider@resource://gre/modules/EnterprisePoliciesParent.jsm:477:10
_chooseProvider@resource://gre/modules/EnterprisePoliciesParent.jsm:148:24
_initialize@resource://gre/modules/EnterprisePoliciesParent.jsm:118:25
BG_observe@resource://gre/modules/EnterprisePoliciesParent.jsm:285:14
[...]

One of the recurrent errors seems to relate to Services.appinfo, so that's likely to be a good place to start. We can see it near the start. Most of the EmbedLite modules are loading without error, but then we hit this:

JavaScript error: resource://gre/modules/LoginRecipes.jsm, line 56:
    TypeError: Services.appinfo is undefined

This LoginRecipes.jsm file is part of the password manager, found in the gecko-dev sources ./toolkit/components/passwordmgr/LoginRecipes.jsm. Here's the bit of code generating the error:

/**
 * Create an instance of the object to manage recipes in the parent process.
 * Consumers should wait until {@link initializationPromise} resolves before
 * calling methods on the object.
 *
 * @constructor
 * @param {String} [aOptions.defaults=null] the URI to load the recipes from.
 *                                          If it's null, nothing is loaded.
 *
 */
function LoginRecipesParent(aOptions = { defaults: null }) {
  if (Services.appinfo.processType != Ci.nsIXULRuntime.PROCESS_TYPE_DEFAULT) {
    throw new Error(
      "LoginRecipesParent should only be used from the main process"
    );
  }
  this._defaults = aOptions.defaults;
  this.reset();
}

Now this is part of the main gecko source, so the problem probably isn't coming from here but rather from whatever is creating Services.appinfo. Earlier in the file we can see this line:

const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");

The Services.jsm file referred to there is pretty simple:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

var EXPORTED_SYMBOLS = ["Services"];

var Services = Cu.createServicesCache();

I'm not sure what the problem is here, but let's narrow it down. Let's add the Services.appinfo parts of Services that we need directly into this file.

All of these JavaScript shims are stored in the /usr/lib64/xulrunner-qt5-91.9.1/omni.ja file that gets installed alongside the libxul.so library. This omni.ja file is actually just a zip file, so we can unpack it, edit the contents and then repack it again.

All of this happens on the device:

$ zypper install -y zip unzip
$ mkdir omni
$ cd omni/
$ cp /usr/lib64/xulrunner-qt5-91.9.1/omni.ja .
$ unzip omni.ja 
$ find . -iname "Services.jsm"
./modules/Services.jsm

I've edited the ./modules/Services.jsm files so it now looks like this:

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

var EXPORTED_SYMBOLS = ["Services"];

var Services = Cu.createServicesCache();

Services.appinfo = {}
Services.appinfo.processType = Ci.nsIXULRuntime.PROCESS_TYPE_DEFAULT

Essentially I've hacked in a return value for Services.appinfo.processType so that the check in LoginRecipes.jsm will pass. We can now package back up the omni.ja and put the result back in the right place again.

$ zip -qr9XD omni.ja *
$ devel-su mv omni.ja /usr/lib64/xulrunner-qt5-91.9.1/

Now when we run the browser the particular error generated by LoginRecipes.jsm is gone:

$ EMBED_CONSOLE=1 sailfish-browser
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
greHome from GRE_HOME:/usr/bin
libxul.so is not found, in /usr/bin/libxul.so
Created LOG for EmbedLiteTrace
[W] unknown:0 - Unable to open bookmarks
    "/home/defaultuser/.local/share/org.sailfishos/browser/bookmarks.json"
[D] onCompleted:105 - ViewPlaceholder requires a SilicaFlickable parent
Created LOG for EmbedLite
JSComp: EmbedLiteConsoleListener.js loaded
JSComp: ContentPermissionManager.js loaded
JSComp: EmbedLiteChromeManager.js loaded
JSComp: EmbedLiteErrorPageHandler.js loaded
JSComp: EmbedLiteFaviconService.js loaded
JavaScript error:
    file:///usr/lib64/mozembedlite/components/EmbedLiteGlobalHelper.js,
    line 108: TypeError: XPCOMUtils.generateNSGetFactory is not a function
JSComp: EmbedLiteOrientationChangeHandler.js loaded
JSComp: EmbedLiteSearchEngine.js loaded
JSComp: EmbedLiteSyncService.js loaded
EmbedLiteSyncService app-startup
JSComp: EmbedLiteWebrtcUI.js: loaded
JSComp: EmbedLiteWebrtcUI.js: got app-startup
JSComp: EmbedPrefService.js loaded
EmbedPrefService app-startup
JSComp: EmbedliteDownloadManager.js loaded
JSComp: LoginsHelper.js loaded
JSComp: PrivateDataManager.js loaded
JSComp: UserAgentOverrideHelper.js loaded
UserAgentOverrideHelper app-startup
JavaScript error:
    file:///usr/lib64/mozembedlite/components/UserAgentOverrideHelper.js,
    line 110: TypeError: Services.appinfo.version is undefined
JavaScript error: resource://gre/modules/EnterprisePoliciesParent.jsm,
    line 500: TypeError: Services.appinfo.name is undefined

We get a different error, and we still get errors related to other parts of Services.appinfo that are missing. That makes sense because we didn't hack in a return values for the other parts that are now generating errors.

What does this tell us? It suggests that the Service component is being created properly, but that the appinfo portion isn't implemented. The question now is: where should it be implemented?

That part I've not yet found. Finding it shouldn't be hard, but it's late and my brain has decided I've done enough for today, so discovering it will have to wait for tomorrow.

Before I forget, one thing I notice is that the call to Services.scriptloader.loadSubScript() in EmbedLiteConsoleListener.js doesn't seem to be triggering an error. If I can find where scriptLoader() is implemented that my well give me a clue about appinfo.

If you'd like to catch up on all the diary entries, they're all available on my Gecko-dev Diary page.

Comment

22 Nov 2023 : Day 85 #

It's been two weeks since I paused my gecko dev diary, aside from a one-off intermediate blip. As anyone who's been following along will know, the pause was so I could take part in a Book Dash. That finished at the end of last week; if you're interested in reading about my Book Dash experience you'll find a summary on my blog.

Although I wasn't able to do much work on gecko myself during the gap, that didn't stop others from hacking away at it. As explained in the blip, mal, keto and direc85 continued working to get all three architectures set up and building correctly on OBS. That's a big step forwards, because it means anyone can now add the OBS repository to their device and test the packages for real.

Since then Raine, who you'll know from earlier entries, has also been working on the code. He submitted several changes which, when taken together, get the APZ functionality back up and running.

APZ functionality? That sounds a bit technical.

APZ stands for Async Pan/Zoom, which, to boil it down even further, essentially means "Touch input". You may recall that, before the gap, rendering had just starting working, but touch input was broken. That meant you could see what was at the top of the page, but there was no way to scroll around, zoom in or out, or in fact interact with the page at all.

APZ is pretty essential browser functionality.

Raine made two essential changes. First he ensured that the time stamps being passed in to the multi-touch event methods are set correctly:

diff --git a/embedding/embedlite/EmbedLiteView.cpp b/embedding/embedlite/EmbedLiteView.cpp
index 0d6ef9f98780..6b1693255f07 100644
--- a/embedding/embedlite/EmbedLiteView.cpp
+++ b/embedding/embedlite/EmbedLiteView.cpp
@@ -326,7 +326,7 @@ EmbedLiteView::ReceiveInputEvent(const EmbedTouchInput &aEvent)
   NS_ENSURE_TRUE(mViewImpl,);
 
   mozilla::MultiTouchInput multiTouchInput(static_cast(aEvent.type),
-                                           aEvent.timeStamp, TimeStamp(), 0);
+                                           aEvent.timeStamp, TimeStamp::Now(), 0);
 
   for (const mozilla::embedlite::TouchData &touchData : aEvent.touches) {
     nsIntPoint point = nsIntPoint(int32_t(floorf(touchData.touchPoint.x)),

Without this change a zero timestamp is passed in, which prevents touch event "deltas" from being calculated correctly.

A "delta" is a fancy term used for a "difference". This tends to be calculated as the difference between the positions before and after, divided by the difference between the timestamps:

$$
\frac{p_2 - p_1}{t_2 - t_1},
$$
where $p_1, p_2$ are the positions of the current and previous touch respectively and $t_1, t_2$ are the times at which they occurred. As you can imagine, if $t_1$ and $t_2$ are both zero, you end up trying to divide by zero, which you can't do, causing touch input to break.

So setting those timestamps correctly is essential.

That was a change on the EmbedLite side (the Sailfish OS specific code). But there was also a change on the gecko side to ensure the following bit of code is skipped:

bool gfxPlatform::AsyncPanZoomEnabled() {
#if !defined(MOZ_WIDGET_ANDROID) && !defined(MOZ_WIDGET_UIKIT) && !defined(MOZ_WIDGET_QT)
  // For XUL applications (everything but Firefox on Android)
  // we only want to use APZ when E10S is enabled. If
  // we ever get input events off the main thread we can consider relaxing
  // this requirement.
  if (!BrowserTabsRemoteAutostart()) {
    return false;
  }
#endif
  [...]

Without this change the gfxPlatform::AsyncPanZoomEnabled() method returns false, blocking touch input from working.

With these two changes integrated the touch interface now works, pretty much as expected: you can pan by dragging, zoom by pinching, and select links by tapping.

That means we're actually at what I'd call "usable browser" state, which is pretty phenomenal.

Usable, but not yet pleasant or functionally complete. There are still many other small changes to change and hooks to hook. Things like PDF printing of pages, user agent string configuration, WebRTC, playing video and audio...

The list is extensive. But that's what we have an issue tracker for and that's why it's crucial to maintain momentum and keep the development going.

Alright, that's it for today. We're into stage 3 now, which should involve lots of smaller fixes. That means I should hopefully be able to pick up some speed in these dev diaries.

Will we get a working browser by Christmas? I wouldn't want to commit to it, but we should certainly be able to make good progress before then.

This is also the best time for others to get involved. There are three main things to focus on:

Applying existing patches. We don't want to apply patches unnecessarily, but where they're needed to fix bugs or restore functionality, we should get them applied.
Adjusting the embedlite privileged JavaScript to match the new APIs. Some things changed since ESR 78 and they need fixing.
Fixing new bugs by delving in to the gecko C++ or browser Qt code. Hopefully there will be fewer of these.

We already have a bunch of issues related to these on the tracker and the list will grow over time. If you're comfortable with C++ but not particularly familiar with the gecko code then the first type of task would be a great way to get involved. If JavaScript is more your thing then the second type of task would no doubt suit your skills. If you really want to get to grips with the innards of gecko and how it works on Sailfish OS, then go for the third.

Personally, I'm planning to start with the second type of task. There's already a torrent of errors being generated by the JavaScript code which I'd love to clean up. More on this tomorrow.

If you'd like to catch up on all the diary entries before the gap, they're all available on my Gecko-dev Diary page.

Comment

19 Nov 2023 : The Turing Way Book Dash #

I've been making noises about taking part in The Turing Way Book Dash for the last couple of weeks. All of the actual dashing took place last week, and in spite of the long build up, it feels like it's come and gone rather quickly. It was a fantastic week of writing, reviewing, collaborating and generally working a lot with markdown and git. The Turing Way is undoubtedly an impressive project and the Book Dash somehow manages to capture the project's spirit and use that energy to generate words.

The Turing Way project is illustrated as a road or path with shops for different data science skills. People can go in and out with their shopping cart and pick and choose what they need.

In this post I'll be sharing my experiences: what is the Book Dash for? How does it work? Why was I involved? If you're interested in reproducible data science, in the broadest sense, then you should consider joining a future Book Dash. Applications are open to anyone and reading this post might help you decide on whether it's right for you.

But, without wanting to spoil the punchline: it almost certainly is. On my travels through space and time I've met many very talented researchers, software developers, technical writers and data scientists. If you fall into any of these camps, then there's a good chance you'd have something to contribute to The Turing Way.

What is The Turing Way book dash?

Let's start with The Turing Way. It's an online book, originally started at The Alan Turing Institute (where I work), but increasingly written by and directed by the community it tries to serve.

Here's how the book describes itself:

The Turing Way project is open source, open collaboration, and community-driven. We involve and support a diverse community of contributors to make data science accessible, comprehensible and effective for everyone. Our goal is to provide all the information that researchers and data scientists in academia, industry and the public sector need to ensure that the projects they work on are easy to reproduce and reuse.

That's a great description, but what it doesn't quite say is that The Turing Way project itself tries to embody the good practices it preaches. That means it aims to be open and community-oriented. The processes used to achieve this are themselves documented in The Turing Way. It doesn't quite reach Russellian levels of paradoxical self-reference, but it sometimes feels that might be what it's aiming for.

The Turing Way has a Book Dash every six months. This one was the tenth, which means they must have been running for around five years. The idea is that people at The Turing are busy. People in the community are busy. This can make it hard to focus and contributions to the book can get stuck or de-prioritised. The Book Dash is a full week to focus on getting stuff done and getting changes merged into the book.

I say merged, because the book is built around git and GitHub. It follows the same general rules that many open source projects follow: anyone can create a pull request, there's a solid continuous integration pipeline with a range of pre-merge checks and once something is merged the book automatically republishes itself with the new material. If you're a software developer, you'll feel right at home with the way things work.

A diagram describing how GitHub action listen to an event (for example, 'PR' created, issue created, PR merged) and then trigger a job which can be testing, sorting, labelling or deployment.

My Book Dash motivations

Earlier this year I attended RSECon23, a big shindig for Research Software Engineers held in Wales. During the conference dinner one of my colleagues was forced to suffer me ranting about code review.

I've been involved in software development, both commercial and open source, for at least twenty years. My first experience of code review was when I worked at Codemasters in the early noughties. Software development has changed radically since then and code review has become an established norm. And yet, from then until now, I've always found the process deeply uncomfortable at best, and soul-destroyingly antagonistic at worst.

Arrogance leads me to believe I might know some solutions. My colleague, who was probably trying to find a way out of the conversation, suggested I channel some of my frustration into content for The Turing Way. I've had a strong respect for The Turing Way since before I started work at The Turing, so this was immediately appealing.

By coincidence the deadline for applying to join the Book Dash was the following day, so I spent a bleary-eyed couple of late-night hours completing the application form. If I hadn't been so tired I might have had second thoughts. I'm glad I didn't.

Onboarding and dawning realisation

The Book Dash is split into three pieces. A couple of weeks before the Book Dash itself there's an onboarding session, during which we were encouraged to structure our thoughts in a SMART way and to plan our week's work. Between onboarding and the dash there's a git primer session. This covers the basics of git and GitHub for those not familiar with them and also touches on how they're used during the Book Dash. Finally there's the Book Dash itself, five days of intense working, each day split into five 150 minute sessions starting at 8:00 in the morning and running through until 21:30 at night (UK time).

It was the onboarding session when the realisation of what I'd done really sank it. Ironically it came about through a misunderstanding. During one of the breakouts my sessions partner explained that their line manager had been hugely supportive of their work on the Book Dash and that they'd also taken some time off as annual leave.

This commitment impressed me, especially given I was planning to work on the Book Dash only in the evenings. This information, coupled with the intense working sessions throughout the day, made me think I'd not fully appreciated how much would be expected of me during the Book Dash.

Following this realisation I rearranged my annual leave to take a couple of days off during the Book Dash. This would mean working in the evenings on three of the days, then working full days for two.

Two people sit across from each other at their laptops on a table but in different places. On the left hand side it's the morning, the sun is shining, someone is bringing in a cup of hot drink. On the screen is a videocall with four participants. On the right hand side it's might with the moon outside and the person is also taking part in a videocall. There is also a cat.

I also committed to work on preparation the week and weekend before the Book Dash. I decided the best way to be sure I'd make progress during the week was to have enough material prepared in advance.

With two additional days and enough material already prepared I felt convinced I could do something worthwhile by the end of the Book Dash.

It turns out I was wrong about two things, and right about two things. We'll come back to these.

Here are the plans I set myself during the onboarding session. We'll come back to these too.

Specific: What do I want to accomplish?:
1. Updated references to Twitter so that they reference X and/or Mastodon
2. Contribute to the Code Review sections to introduce more material about how to handle code review respectfully and avoiding the pitfalls of power imbalance inherent in the process.
Measurable: How will I measure my success?:
1. All references to Twitter are gone. Create a PR that removes references to Twitter.
2. Start a separate PR to discuss Mastodon, just as a skeleton.
3. A PR that introduces - at a minimum - a new section on code review power dynamics.
Attainable: Can I realistically achieve this goal? What steps will I take?
1. The switching from Twitter to X is somewhat mechanical so should be a quick win.
2. The aim to create a PR related to Mastodon only needs to be a skeleton, but can be more fully realised if there's time. There's a lot of good material and expertise to draw from, from the Turing Way community.
3. Creating the text on Code Review is more of a challenge, but I'm highly motivated to get something done. The hardest challenge will be understanding the text that's already there and aligning what I'm hoping to include with it.
Relevant: Does this goal meet a specific need?:
1. The shift away from Twitter - both the name and platform - I believe makes this timely.
2. On a personal level I continue to find code review challenging in spite of the existing good material written about it. It also remains a really important part of the development process, so I hope any improvements that can be made in the sections of the Turing Way would remain relevant.
Time-bound: What is my target deadline? (potentially between 14-18 Nov):
1. I'm aiming for a simple Twitter/X PR before the end of Tuesday 15th.
2. The remaining days I'll dedicate to the Code Review changes, ideally a PR before the end of the 18th.
Goal statement:
1. My goal is to ensure no reference to Twitter goes unchallenged, and to capture my thoughts on the challenges of code review in a way that will be beneficial to others.

Preparation

On the Monday before the Book Dash we had a git training session. This was an excellent refresher. I continue to be astonished at how every organisation or project I've worked on uses git slightly differently. It seems to tap in to a project's culture. The Turing Way doesn't work with forks, but rather gives all contributors direct merge and write access to upstream repository. That surprised me a little at first — it's a very trusting approach — but in retrospect it makes perfect sense for a project so defined by community.

I also discovered that on The Turing Way you merge feature branches rather than rebase them. There are pros and cons to both, but these are the kinds of cultural differences it's helpful to understand before force-pushing changes or complaining about unsquashed commits (which is what I might do on other projects).

Cultural differences, you see.

The instructor for the git session was knowledgeable and clearly understood these nuances well. Very reassuring.

On the following evenings in the week prior to the Book Dash I read through the Community Handbook and started collecting ideas and material together about Code Review.

While reading through the Handbook I started noticing typos and small grammatical errors here and there. The Turing Way being an open project, it feels negligent to leave them unfixed. So I put together a pre-emptive pull request full of small changes. Everyone's first pull request to The Turing Way gets adorned with a huge "Thanks" banner. It's a great way to make new contributors feel welcome and is just one of the many small touches that make the project so special.

The week of the Book Dash part I

My Book Dash week itself was a game of two halves. For the first half I continued my usual Alan Turing Institute work (unrelated to the Book Dash) from home during the day. Then in the evenings I attended the 17:00 and 20:00 Book Dash sessions. These are carefully orchestrated using Etherpad and Zoom, and and with instant messaging provided by Slack. The sessions provide an opportunity to collaborate with others, or to simply focus on writing material. The Cuckoo timer is used to split the session up into Pomodoro periods.

I picked a rather mechanical task as a way of easing myself into the process: converting all of the Twitter references to use X instead. It turns out there are quite a few, plus there was some work to be done rewriting the parts that no longer apply. X has diverged a fair bit from the site it was when the Twitter material was written.

Although I was able to make progress on this, I admit I found it quite challenging to focus initially during the sessions. My problem wasn't so much that I didn't know what I was doing, but more that there were so many possibilities that I could have been doing. It was difficult to pick something to focus on. This was exacerbated by the numerous notifications from GitHub streaming into my inbox. When multiple people are working intently on a project for a period of time this is one of the consequences. At this early stage the notifications were a really useful way to get an understanding of how things were supposed to work. But they sure were distracting.

Plus, working from 8:00 to 17:00 on my usual job before switching straight to several hours of Book Dash was a challenge. Switching context from one to the other left my head reeling, leaving a gap of up to an hour before I could properly focus.

Besides updating the Twitter content I'd hoped to write something about Mastodon, but this didn't get any further than me writing an issue about it. The week of the Book Dash part II

Things changed for the second half of the week. These were the days I booked as annual leave. Ironically I decided to travel to London for them, so I was in the same office where I usually work in the British Library. This is the first time I've chosen to spend my holiday in the office.

It felt very different to my usual working day. I switched to my second task, that of writing about Code Review. There's already a lot of great material in The Turing Way on the subject, but I wanted to adjust it in three main ways:

To make a clearer distinction between open and closed source projects; I strongly believe that the different modes do and should support different approaches to Code Review.
Emphasising the need to build and run the code as part of the code review process. Code Review shouldn't just be a theoretical exercise.
Adding in a section on when Code Review goes wrong. This is one of the few topics on which I feel I have some useful experience to share.

I handled the first two points on the Thursday and the third point on the Friday. I managed to do a fair bit of writing and am happy with the progress I made. I also had a great session with the illustrator from Scriberia, who managed to instantly take my poorly defined idea and turn it into something clear and visually striking. That was a great experience.

Celebrations

On the final day we had two public celebration sessions. I was also able to do a fair bit of writing between them.

The celebrations were superb. The Turing Way team knows how to run an online event that's inclusive, accessible, lively, but which also manages to allow everyone's voice to be heard. It's hard to overstate how hard it is to get this mix right, but the team managed it with style.

It made for a brilliant showcase of all the work everyone had done during the week; an excellent way to round things off.

Final thoughts and the future

Although it was a bit of a slow start, I think I eventually found my rhythm during the Book Dash. I wrote some words, made some edits, got a couple of pull requests merged. The largest of my contributions is still in a draft pull request, but I feel like it's now gained sufficient mass and momentum that I will finish it. Throughout the week I got great feedback from The Turing Way team and the community. I was also particularly pleased to get detailed and insightful feedback on my pull request from thigg, my fellow Sailfish OS contributor, after posting about it on Mastodon. The text draws on the literature, but also on my personal experience, which makes the need for a plurality of views all the more important. It's really difficult for me to express how happy I am that I got such useful feedback from the reviewers.

Going back to the plan I threw together during the onboarding session, here's how I got on:

All references to Twitter are gone. Create a PR that removes references to Twitter. Done!
Start a separate PR to discuss Mastodon, just as a skeleton. Not done! But I did at least create an issue.
A PR that introduces - at a minimum - a new section on code review power dynamics. Done!

It was a shame not to write anything about Mastodon, but once I've finished off the material on Code Review, that'll be my next task.

Finally, what were the two things I was wrong about and the two things I was right about?

I was wrong to think I could get away with doing the Book Dash in the evenings alone. It requires sustained commitment.

I was wrong to think my breakout session partner had taken two days off to work on the Book Dash. They actually took it as a break from both work and the Book Dash. I'd just misunderstood!

But I was right to conclude, based on this faulty information, that I should myself take some days off to work exclusively on the Book Dash. If there is a next time (which I hope there is) I'll aim to take the entire week off for it.

And finally, I was right to think that success during the week depended on spending some time prior to plan things out. If I hadn't had a pull request ready and several issues lodged already at the start, I think I'd have lost too much focus early on.

The Book Dash was a brilliant experience; I'm glad I took part. It successfully embodied the principles of openness, reproducibility and community that the book itself espouses. And was great fun. Thanks to everyone involved!

All the images in this post are The Turing Way project illustrations by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

If you've been following along with my Gecko dev diary you'll know I took a couple of weeks out to take part in the Book Dash. I'll soon be heading back to daily gecko development, so if you've been waiting for that, thank you for your patience and it'll be returning very soon.

Comment

12 Nov 2023 : Day 84 #

This is a blip, an anomaly, a perturbation in the space time continuum. I'm not supposed to be writing this while I pause my gecko development for a couple of weeks while I attend to a Book Dash. But I'm not the only person developing the gecko build and the world doesn't wait, so here's a quick update on some of the activity that's been happening despite my best efforts!

Before I get onto that though, I just want to go on a slight diversion. The world outside my window is turning from green to a vibrant yellow with the occasional red berries poking through. Soon everything will be orange, and soon after that the colour will drain away to leave just the browns of the skeletons beneath. It's all just too wonderful not to share with you.

The trees and bushes as seen from my window, a mixture of green and yellow leaves.

So what's the reason for this update? One of the key requirements to having gecko in a state that others can test and develop, and ultimately be deployed to users phones, is there being an OBS build available.

OBS, which will be familiar to many Sailfish OS users and developers, is the Open Build Service. When I use this term it's usually in relation to the OBS provided by Jolla for use by the community, but there are of course other OBS deployments. Jolla has its own internal OBS for building the full Sailfish OS repositories, and the software that runs it was originally developed for SUSE (where the 'O' stands for 'openSUSE').

If you're familiar with runners on GitLab or GitHub Actions you'll already have a good idea about OBS: it's essentially a bunch of VMs running remotely that build software and are accessible through a Web interface. OBS offers an environment that's very similar to the scratchbox2 environment provided by the Sailfish SDK, which means it's the ideal place to build software for Sailfish OS. If you want to know more about OBS I strongly recommend the Sailfish OS Chum documentation, which does a great job of explaining what it is and how to use it. Sailfish OS Chum is built around OBS.

Many moons ago — back on Day 53 in fact — I configured a project on OBS to build gecko. Things got stuck quite early on in the gecko build and I left it in a semi-configured state. I planned to go back to it, but never quite got around to it.

Over the last couple of days mal, keto and direc85 have been hard at work getting it into a usable state.

There were a bunch of changes needed for this. On the gcc side, direc85 identified a couple of bugs, 90139 and 96206, each caused an Internal Compiler Error (ICE) during the build. He and mal backported these so that they're incorporated into the OBS backend and available for the build.

These issues don't affect the aarch64 target which is why I'd not experienced them myself.

An additional problem arose in relation to the Python requirements. Back on Day 2 I hit an issue with this, the fix for which involved me configuring the build process to create a Python virtual environment. The Python requirements are then pip installed into this environment during the build.

This is fine for a local build, but on OBS there's no Internet access available after the prepare step completes; that means for most of the build, including at the point where this virtual environment is set up.

The result is that the build was failing on OBS due to it not being able to successfully complete creation of a suitable virtual environment. Mal identified this and switched the build to use a different approach. Essentially, he switched from using the create-mach-environmen to setting MACH_USE_SYSTEM_PYTHON instead:

$ git diff 4e8eeba521 5bd0410081
diff --git a/rpm/xulrunner-qt5.spec b/rpm/xulrunner-qt5.spec
index b3a7a4419142..5e397b2d11fe 100644
--- a/rpm/xulrunner-qt5.spec
+++ b/rpm/xulrunner-qt5.spec
@@ -470,7 +470,8 @@ echo 'mk_add_options LDFLAGS="$FIX_LDFLAGS"' >> "${MOZCONFIG}"
 
 RPM_BUILD_NCPUS=`nproc`
 
-./mach create-mach-environment
+export MACH_USE_SYSTEM_PYTHON=1
+
 #./mach build -j$RPM_BUILD_NCPUS
 ./mach build -j1
 # This might be unnecessary but previously some files

The consequence of this is that the virtual environment is no longer used and instead the preexisting Python install is used instead. Since all of the required packages are already available in the system install, this does the job and prevents the need to download anything during the build process.

Overnight mal triggered the OBS build to run, while at the same time I had the latest changes building locally on my machine.

Both went through successfully, which means the OBS build is nearly available for anyone to use.

There is still a little more to do: the packages that depend on gecko still need to be rebuilt. Since I'd not yet pushed any tags to my repository the gecko package was built with version number 60.9.1. That won't be accepted by any of the other packages, so I've pushed the tag `sailfishos/91.9.1+git1` to the repository and set it building again. Once that's complete (probably tomorrow) I can trigger the other packages to rebuild against it, which should be relatively swift.

In the meantime, a huge thank you goes to direct85, keto and mal for their efforts getting this working. It makes gecko that much more accessible to other developers and testers and is therefore an essential step in getting more people involved and getting it out and installed on to users' phones.

I'll be back writing again in just over a week unless there's something else specific to share in the meantime. Until then feel free to catch up on all the previous entries on my Gecko-dev Diary page.

Comment

12 Nov 2023 : Playing Draugen in the Autumn #

My computer game activities have taken a bit of a hit recently due to time spent on gecko, but I nevertheless did finally manage to complete Metro Exodus recently after 48 hours of play time spread over three months of calendar time (according to Steam). It was a great game: atmospheric, gameplay-heavy and unexpectedly diverse in its play styles and environments. But while it's in no way an afterthought, story isn't the central focus.

That contrasts nicely with Draugen which is the game I started playing this weekend as a replacement. Draugen is a 'walking simulator' where story takes precedence over gameplay mechanics.

My desk with a laptop and screen; on the screen is a scene from Draugen showing a view out of the window beyond which a silver birch with orange leaves; behind my monitor the view out of my window showing trees with autumnal leaves as well (more yellow than orange).

I've been a fan of Red Thread and games by Ragnar Tørnquist ever since The Longest Journey and Dreamfall games, with their captivating characters and brilliantly realised worlds.

So far I'm loving Draugen too. It's atmospheric and I find the setting of a small town in Norway hemmed in on all sides by fjord really draws me in. After just a few hours of play I'm only just getting into some of the more psychological aspects of the game so I'm holding off judgement on these story elements for now. But from what I've seen so far the story is holding together, despite the limited cast of on-set characters.

But what I'm loving most about the game is the autumnal setting. I've written before about how the switch between seasons shows nature at its best. The shift from Summer to Autumn as the leaves pass through their various hues enthrals me.

The yellow trees visible outside my window, which I see daily peaking above my monitor, don't quite yet match the vibrant orange colours from Draugen's game world. But they're getting there. The similarity helps make the game world more absorbing. It feels like I could almost — almost — be there in the nature-filled solitude of the Norwegian fjords.

Comment

7 Nov 2023 : Day 83 #

Over the last few of days I've been struggling with the EglLibrary initialisation routines. It's been a bit of an uphill climb, but I'm glad to say that things are working out better today.

In celebration of the fact, Thigg, who is now the sailfish-gecko artist in residence, very generously created the following wonderful image capturing the actual process of gecko learning to render.

A gecko holding a blue colouring pencil standing next to an easel drawing a pig with wings.

But we'll come on to that. First back to the usual rundown of my day's development.

As I mentioned I've been struggling with the EglLibrary initialisation routines. These were changed upstream to remove some crucial piece of functionality. Adding this back in wasn't too much of a problem, except my progress was hindered by the application failing to start at all for two days. I got to the bottom of that and now I'm back trying to figure out how to get EGL working correctly.

From stepping through the code I can see that this line is returning false:

  bool result = lib.operator->()->Init(false, &failureId, aDisplay);

There's only one part of this Init() method I've made edits to and that's to add in the aDisplay parameter and the call to CreateDisplay() that users it. So it seems reasonable to imagine that this is what's causing the problem. The new code looks like this:

  mDefaultDisplay = CreateDisplay(forceAccel, out_failureId, aDisplay);
  if (!mDefaultDisplay.lock()) {
    return false;
  }

Stepping through the code confirms my suspicion: the condition we see here is being entered, resulting in a return value of false. That's a failure state and so guaranteed to prevent anything else from working.

However the reason it's happening isn't what I was expecting. In theory the CreateDisplay() method should be pretty simple: because we're passing in an aDisplay parameter it's supposed to create an EglDisplay with the aDisplay stored inside it and just return that. It should be pretty straightforward, but I suspected I'd made some logic error in this call.

That turned out not to be the case. The debugger confirmed that CreateDisplay() was returning a value, rather than null.

So if the mDefaultDisplay value isn't null, why is the condition above being entered into?

The reason turned out to be a misunderstanding on my part about how the std::weak_ptr pointer wrapper works. Calling lock() on this is supposed to return a usable std::shared_ptr on which I had expected the null check ought to work.

I'm not exactly sure why the logic above doesn't work, but in practice !mDefaultDisplay.lock() equates to true. Changing the code to the following, however, works as expected:

  std::shared_ptr defaultDisplay = CreateDisplay(forceAccel, out_failureId, aDisplay);
  if (!defaultDisplay) {
    return false;
  }
  mDefaultDisplay = defaultDisplay;

It's a small change, but one that makes all the difference. Making this change allows the initialisation process to complete correctly. Or at least, it looks correct to me.

But there's still no rendering. Thankfully I can now step through and try to find out what's going on.

My efforts to step through the code are hampered somewhat by the fact that when using partial builds the debug source gets out-of-sync with the code, which means the debugger will tell you which line you're on, but not the source code on the line itself. It's not the end of the world as it's possible to refer back to the source in a text editor to check the line, but it's a bit of a cumbersome approach compared to just having the debugger tell you.

Nevertheless, stepping through this way and comparing against the same steps debugging ESR 78 shows that the EGLDisplay is now correct. I can even see that the GLContextEGL::SwapBuffers() method is being called to update the display.

But there's still nothing rendering to the screen.

I do notice one difference between the two versions though. The SwapBuffers() method is identical for them both. Here's what it looks like:

bool GLContextEGL::SwapBuffers() {
  EGLSurface surface =
      mSurfaceOverride != EGL_NO_SURFACE ? mSurfaceOverride : mSurface;
  if (surface) {
    if ((mEgl->IsExtensionSupported(
             EGLExtension::EXT_swap_buffers_with_damage) ||
         mEgl->IsExtensionSupported(
             EGLExtension::KHR_swap_buffers_with_damage))) {
      std::vector rects;
      for (auto iter = mDamageRegion.RectIter(); !iter.Done(); iter.Next()) {
        const IntRect& r = iter.Get();
        rects.push_back(r.X());
        rects.push_back(r.Y());
        rects.push_back(r.Width());
        rects.push_back(r.Height());
      }
      mDamageRegion.SetEmpty();
      return mEgl->fSwapBuffersWithDamage(surface, rects.data(),
                                          rects.size() / 4);
    }
    return mEgl->fSwapBuffers(surface);
  } else {
    return false;
  }
}

Even though stepping through is a bit of a pain, I can see that on ESR 78 it's the mEgl->fSwapBuffers() branch that's being executed, whereas on ESR 91 it's the mEgl->fSwapBuffersWithDamage() branch that's used.

The reason for this is that the mEgl->mAvailableExtensions flags are set differently. Crucially, on ESR 78 the EXT_swap_buffers_with_damage flag (that's bit 19 of the bitfield) is cleared:

Thread 38 "Compositor" hit Breakpoint 1, mozilla::gl::GLContextEGL::SwapBuffers
    (this=0x7ed41a27b0)
    at gfx/gl/GLContextProviderEGL.cpp:643
643	bool GLContextEGL::SwapBuffers() {
(gdb) p mEgl.mRawPtr->mAvailableExtensions
$4 = std::bitset = {[0] = 1, [2] = 1, [3] = 1, [5] = 1, [6] = 1, [7] = 1,
                    [13] = 1, [21] = 1, [22] = 1}
(gdb) p (int)mEgl.mRawPtr->mAvailableExtensions & (1 << (19 - 1))
$5 = 0
(gdb)

Whereas on ESR 91 it's set:

Thread 34 "Compositor" hit Breakpoint 1, mozilla::gl::GLContextEGL::SwapBuffers
    (this=0x7eb8115b30)
    at gfx/gl/GLContextProviderEGL.cpp:538
538	bool GLContextEGL::SwapBuffers() {
(gdb) p mEgl->mAvailableExtensions
$1 = std::bitset = {[0] = 1, [2] = 1, [4] = 1, [5] = 1, [6] = 1, [7] = 1,
                    [8] = 1, [11] = 1, [16] = 1, [17] = 1, [19] = 1, [20] = 1,
                    [22] = 1}
(gdb) p (int)mEgl.get()->mAvailableExtensions & (1 << ((int)EGLExtension::EXT_swap_buffers_with_damage - 1))
$2 = 262144

That may sound like a small difference, but while I've updated the browser code, one thing I've not done is update the underlying EGL implementation to offer different capabilities. So these flags ought to be the same.

Looking through the code I eventually realise that the flag has been forcefully cleared in the ESR 78 version. Which I track down to this patch (which I probably should have remembered):

From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Raine Makelainen 
Date: Tue, 28 Sep 2021 09:42:30 +0300
Subject: [PATCH] [sailfishos][egl] Drop swap_buffers_with_damage extension
 support. Fixes JB#55571 OMP#JOLLA-396

At least on "Adreno (TM) 508" swapping with damage triggered a hang on the
compositor thread i.e. call never returns. Conceptually I think we should
bind eglSwapBuffersWithDamage on hybris side to the eglSwapBuffers.

Co-authored-by: David Llewellyn-Jones 
Signed-off-by: Raine Makelainen 
---
 gfx/gl/GLLibraryEGL.cpp | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gfx/gl/GLLibraryEGL.cpp b/gfx/gl/GLLibraryEGL.cpp
index 13b69f2427c1..a07b8d983ab7 100644
--- a/gfx/gl/GLLibraryEGL.cpp
+++ b/gfx/gl/GLLibraryEGL.cpp
@@ -713,7 +713,7 @@ bool GLLibraryEGL::DoEnsureInitialized(bool forceAccel,
         {(PRFuncPtr*)&mSymbols.fSwapBuffersWithDamage,
          {{"eglSwapBuffersWithDamageEXT"}}},
         END_OF_SYMBOLS};
-    if (!fnLoadSymbols(symbols)) {
+    if (!fnLoadSymbols(symbols) || true) {
       NS_ERROR(
           "EGL supports EXT_swap_buffers_with_damage without exposing its "
           "functions!");
@@ -726,7 +726,7 @@ bool GLLibraryEGL::DoEnsureInitialized(bool forceAccel,
         {(PRFuncPtr*)&mSymbols.fSwapBuffersWithDamage,
          {{"eglSwapBuffersWithDamageKHR"}}},
         END_OF_SYMBOLS};
-    if (!fnLoadSymbols(symbols)) {
+    if (!fnLoadSymbols(symbols) || true) {
       NS_ERROR(
           "EGL supports KHR_swap_buffers_with_damage without exposing its "
           "functions!");
-- 
2.31.1

The way the flags are set has completely changed in ESR 91 so I can't reapply the patch. But I can find some other way to mask out these flags. This seems like a good idea given they clearly caused problems for ESR 78. I achieve this by adding in a couple of calls like this to the Init() method:

  MarkExtensionUnsupported(EGLExtension::EXT_swap_buffers_with_damage);
  MarkExtensionUnsupported(EGLExtension::KHR_swap_buffers_with_damage);

Stepping through without the source has become quite onerous, so I've decided to do a full rebuild this time, rather than the partial rebuilds I've been doing until now. This will put the source and symbols back in sync, but it will also take a while. I'll test it as soon as it's built. [...] It's later and I finally have my new build. There are a lot of symbols in the library and it takes the debugger a minute or so to load them, so as is usual I do my first test without the debugger. And to my astonishment... it renders!

Screenshots of the very first renders coming from ESR 91 on Sailfish OS

Honestly, I was expecting this change to be the one that made the difference. I was also expecting there to be render or layout artefacts, but as you can see from the screenshots it's not looking bad either.

The external browser checker picks it up as an ESR 91 version. As you can see from the screenshot it passes slightly fewer tests than the ESR 78 version, but that's to be expected because I had to disable some capabilities (e.g. WebRTC). More significantly, touch isn't working, so there's currently no way to interact with the site. That's also not too surprising and will be much easier to look into that now that it's rendering. I'm also pretty surprised to see that even WebGL is working fine on my website.

Side-by-side screenshot comparisons of ESR 78 and ESR 91. The ESR 78 version scores 451 in the HTML 5 Test, whereas ESR 91 scores 440.

I'm pretty satisfied with this to be honest. It's certainly not a usable replacement for ESR 78 yet, but once touch is working it will be getting there.

This also seems like a good place to pause.

You'll recall that on Day 77 I mentioned I'd be taking a pause from gecko development while I participate in a Turing Way Book Dash. With the rendering now working, it seems like the right time for this break. I'm anticipating a gap of a couple of weeks, but if I do get any time to work on gecko in the meantime, I'll be certain to post about it. During the pause, feel free to catch up on all the previous entries on my Gecko-dev Diary page.

Comment

6 Nov 2023 : Day 82 #

Okay, so I managed to get to the bottom of why the browser was failing to start at all.

This was the problem I was experiencing yesterday. To recap, I'd made edits to the code to adjust the way the EglDisplay is handled. I created a package overnight, transferred it over to my phone and ran the updated version. The browser dropped out before it had even reached the code I edited.

The browser was exiting cleanly so there was no backtrace to dig into. I tried stepping through the code but the debugger seemed to exit before the main had properly even started. I thought it might be the code that attempts to load the libxul.so library, but the debugger showed that wasn't being reached either.

Eventually I gave up and decided to do a full recompile of the library and the libraries that depend on it. As always the full rebuild took hours. I started it last night and it was still running well into my work day this morning.

It did eventually complete without errors. Installing and running it on my device... had no effect. Same problem.

So I rebuilt qtmozembed and installed that. Same problem.

I thought it might have been a dirty configuration folder, so I deleted ~/.local/share/org.sailfishos/browser. Not that.

So my last ditch attempt was to rebuild sailfish-browser as well. But that didn't fix it either.

Eventually I realised the reason: there was an old sailfish-browser process still running in the background. If you try to run the browser and it finds a copy of itself already running, rather than open a second browser showing a URL it will send a DBus message to the incumbent executable asking it to open the URL instead, then bail.

So that's what was happening. I found this out when I attempted to use the debugger on main() again. This time I was able to step into the application and got as far as the debug message being sent in on line 121 of the sailfish-browser/apps/browser/main.cpp file:

    BrowserService *service = new BrowserService(app.data());
    // Handle command line launch
    if (!service->registered()) {

       [..]

        QDBusConnection::sessionBus().asyncCall(message);
        if (QCoreApplication::hasPendingEvents()) {
            QCoreApplication::processEvents();
        }

        return 0;
    }

After using kill on the process the new version started up just fine. So I probably could have avoided all that rebuilding if I'd just noticed this first.

No matter, at least it's running now and I'll know for the future.

Having said that, it's still exiting early. But at least it is now getting to the code I changed.

You may recall from previous posts that the key method I'm editing looks like this:

already_AddRefed GLContextProviderEGL::CreateWrappingExisting(
    void* aContext, void* aSurface, void* aDisplay) {
  if (!aContext || !aSurface) return nullptr;

  nsCString failureId;
  const auto lib = gl::DefaultEglLibrary(&failureId);
  if (!lib) {
    gfxCriticalNote << "Failed[3] to load EGL library: " << failureId.get();
    return nullptr;
  }
  bool result = lib.operator->()->Init(false, &failureId, aDisplay);
  if (!result) {
    gfxCriticalNote << "Failed[3] initialise display: "
                      << failureId.get();
    return nullptr;
  }
  const std::shared_ptr egl = 
    lib.operator->()->DefaultDisplay(&failureId);

  if (!egl) {
    gfxCriticalNote << "Failed[3] to create EGL library  display: "
                    << failureId.get();
    return nullptr;
  }

  [...]

  return gl.forget();
}

I've been fixing up the Init() call you can above, so that it now calls GLLibraryEGL::CreateDisplay(). I've changed it so that you can enter an aDisplay parameter of type EglDisplay which it will store, rather than trying to create its own. This is important because the display used for rendering needs to be created by qtmozembed and passed in, rather than being created by gecko.

What I hadn't realised is that gl::DefaultEglLibrary(), which you can also see above, also calls this Init() method. But because it doesn't pass in any aDisplay value, this gets set to zero and the code I added now returns early with the failure flag set.

To address this I've now edited the method above to avoid it calling gl::DefaultEglLibrary(). Instead it constructs it directly, like this:

already_AddRefed GLContextProviderEGL::CreateWrappingExisting(
    void* aContext, void* aSurface, void* aDisplay) {
  if (!aContext || !aSurface) return nullptr;

  nsCString failureId;
  const RefPtr lib = new GLLibraryEGL;
  if (!lib) {
    gfxCriticalNote << "Failed[3] to load EGL library";
    return nullptr;
  }
  [...]

I'm hoping that by making this change the Init() call that happens a few lines later will now complete without error.

It's nearly rebuilt, so soon I'll be able to check.

[...]

This time copying the 2770 MiB library over to my phone took longer than actually building it. But when I run the browser with the new library I experience the same problem. So this will need a little more work tomorrow to get it working.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

5 Nov 2023 : Day 81 #

This morning the new version had fully built. That means it incorporates the changes from yesterday to try to get the use of EglDisplay working correctly. Before I get into the results of the build I want to take you on a brief detour. It was Bonfire Night here in the UK last night, which is the annual celebration of the foiling of a plot to blow up parliament in 1605 in the hope of sparking a Catholic uprising. As celebrations go it's a weirdly parochial and unusually barbaric. But the fireworks and bonfires are completely mesmerising and it's part of the spirit of Autumn that makes me love the season so much.

I mention this for two reasons. First because the entire town where I live was out last night enjoying communal fireworks. Second because I wanted an excuse to share this wonderful image created by Thigg with the help of some AI (or is it the other way around?) that captures some more of what makes Autumn so wonderful.

A woman and two children walk on a path alongside a pig with wings in an autumnal forest. The pig is taller than the people, the path is surrounded by treas and mushrooms and many smaller winged pigs fly in the air.

If you've been sticking with the diary you may recall that Thigg also provided a wonderful image to capture activity on Day 46. And also produced some superb images posted to the Sailfish Forum. Such amazing stuff!

Let's return to gecko progress. Unfortunately after installing the packages I built overnight it will no longer run. It doesn't crash, it just stops, apparently dropping out cleanly. If I put breakpoints on the methods that I changed they don't trigger before the app closes, so it's probably not something wrong with the changes I made per se. Instead I suspect it's some broader issue with loading the library.

Let's take a look.

I know that ResolveXPCOMPath inside EmbedInitGlue.cpp gets called early in the startup routine, so I've put a breakpoint on that.

Even that isn't getting called. There are three layers to the browser. The main user-interface wrapper that provides all of the chrome is sailfish-browser. This wraps qtmozembed used for interfacing between Qt and gecko. Finally there's the gecko/xulrunner library itself (libxul.so). That last breakpoint was part of qtmozembed, so I'm going to go one step closer to the application entry point by adding a breakpoint to the sailfish-browser executable instead. The sailfish-browser executable is essentially the first entry point for everything (although there's also the booster, but let's leave that aside).

I've put a breakpoint on main() in sailfish-browser. That is, as a general rule, pretty early when it comes to application start-up.

Finally a hit! The breakpoint on main() works.

(gdb) b main
Breakpoint 4 at 0x555557a868 (2 locations)
(gdb) r
Starting program: /usr/bin/sailfish-browser 
warning: File "/lib64/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

Breakpoint 4, 0x000000555557a868 in main ()
(gdb)

The warnings are normal, nothing that we need to be concerned about. From this point I can now step through to figure out why it's closing so early. But although the breakpoint does fire, stepping inside the main() doesn't get us anywhere as the executable seems to exit before it even gets that far:

(gdb) n
Single stepping until exit from function main,
which has no line number information.
[D] unknown:0 - Using Wayland-EGL
library "libGLESv2_adreno.so" not found
library "eglSubDriverAndroid.so" not found
[New LWP 2484]
[New LWP 2485]
[LWP 2484 exited]
0x0000007fb77ea11c in __libc_start_main () from /lib64/libc.so.6
(gdb) n
Single stepping until exit from function __libc_start_main,
which has no line number information.
[LWP 2485 exited]
[Inferior 1 (process 1896) exited normally]
(gdb)

Unhelpful.

Well it's rather late now and I'm not sure what's going on here, so I've decided to do a complete rebuild from scratch. The build I performed last night was an incremental build, meaning I didn't delete all the partially built material before performing the build. I've no wiped everything so that the code is in the same state as if I'd just closed it from the repository. It's building now.

Sadly this means no more work on this for today; the build won't complete until the morning at which point I'll try again and see whether anything has changed.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

4 Nov 2023 : Day 80 #

I've updated the code in GLContextProviderEGL::CreateWrappingExisting() and GLLibraryEGL::Init() as I described yesterday. I've had to make more change to the upstream code than I was hoping for, but practically speaking I have to make getting something to render on screen my priority over minimising code changes right now.

I'm not quite there yet though. In ESR 78 the DoEnsureInitialized() method — which is the equivalent of ESR 91's Init() method — stored the EGLDisplay in a class variable of GLLibraryEGL. This was the route out of the method for the display value. In ESR 91 that doesn't make so much sense any more. For a start we're not returning an EGLDisplay but rather an EglDisplay that wraps it.

I could update the Init() method so that it returns this as the result of the method, but currently it returns a true/false success/failure condition and changing that is going to get messy.

However, the GLLibraryEGL class does have an mDefaultDisplay field. I wonder if I could use that?

My train is arriving in to London, so it's a question I'll have to try to answer later. That looks like a promising approach though.

[...]

I'm on the train back home and after debugging and debugging and debugging for what feels like weeks, I'm finally building something again. I've made the changes discussed earlier in relation to EglDisplay and EGLDisplay and the updated code is now merrily compiling away.

The code compiles. Once I'm home this evening I'll transfer it over to my phone to see whether it has the positive effect I've been hoping for. If not, I will also hopefully have time to step through it to see whether or not it's doing what I planned.

[...]

I'm home now. The code did compile but the partial build got stuck and refused to produce a library. So I've had to do a full build instead. Which means no testing tonight, but hopefully there'll be a build to test tomorrow morning.

Back on Day 77 I mentioned the fact that from today onwards I'll be taking a pause from these developer diaries while I work on a Turing Way Book Dash. I still plan to have this break but I want to persevere a little longer. I feel like things are close to the point where a render may be possible and I don't want to pause before having tested out these changes. So please don't be surprised if there's another update tomorrow after all.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

3 Nov 2023 : Day 79 #

I'm back at things feeling a lot fresher today after my head ended up swimming last night.

This new clarity allowed me to dig through the code a bit more and I'm a lot clearer about how things are supposed to fit together as a result.

In ESR 91 the key entry point method where the graphics initialisation from qtmozembed meets the gecko library is called GLContextProviderEGL::CreateWrappingExisting(). This should accept an aDisplay parameter that's created externally and passed in by qtmozembed.

Currently the code then goes off and performs some minimal initialisation by calling EglDisplay::Create(). But there are two versions of this Create() method. The method I really want called should then go on to call GLLibraryEGL::Init(), but instead we have it inserting the display into a list and returning. I do want the display in that list, but I don't want it to skip all the other initialisation steps.

The Create() method creates a copy of the library and uses it to call Init(). But we actually already have our library because we created it in GLContextProviderEGL::CreateWrappingExisting(). So we can short-circuit the call to Create() and call GLLibraryEGL::Init() directly instead.

So that's our first requirement. From that point on we're no longer in a static method, our lib becomes our this and we don't need to keep track of it anymore.

However there's another critical change that happened from ESR 78 to ESR 91. The GLLibraryEGL::DoEnsureInitialized method in ESR 78 (which has been renamed to this Init() we were talking about in ESR 91) calls GLLibraryEGL::CreateDisplay(). This looks like a key method, and it's also the method that I've noticed isn't getting called in ESR 91 (it was this realisation that confirmed my suspicions that this part of the code is broken).

So we're going to have to add it back in to the ESR 91 code. It looks like if we do this, the code will end up calling our Create() method (the same one we're calling now) but with a lot of other initialisation happening as well.

That should lead us to a better place than we are now.

So the steps are:

Have GLContextProviderEGL::CreateWrappingExisting() call GLLibraryEGL::Init() rather than EglDisplay::Create() as it is now.
Adjust GLLibraryEGL::Init() so that it calls GLLibraryEGL::CreateDisplay(). We can use the ESR 78 code as a guide for this.
The call to GLLibraryEGL::CreateDisplay() results in a call to a static GetAndInitDisplay() method, which needs to consume an aDisplay parameter.
Check that the call to GetAndInitDisplay() will eventually result in a call to the correct version of EglDisplay::Create().

In ESR 78 the GLLibraryEGL::CreateDisplay() method was returning an instance of EGLDisplay. Now it's returning an instance of EglDisplay (note the capitalisation). I think that's okay: the latter is essentially a wrapper for holding multiple of the former.

But it means we'll have to get our updated version of GLLibraryEGL::Init() to return the EglDisplay instance rather than the Boolean it's returning now. In ESR 78 the value was stored in the mEGLDisplay class member, but this has been removed. We need to return it instead.

It's not clear to me that this is the only problem to fix, but this definitely looks broken as the code stands right now. Fixing it can't do any harm.

It's an early start for me tomorrow so too late for me to make the changes tonight. It'll be my task for tomorrow.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

2 Nov 2023 : Day 78 #

So, I'm back today looking now at the other end of the rendering pipeline. At the end of the day working on this yesterday I concluded I needed to do two tasks:

Step through GLContextProviderEGL::CreateWrappingExisting() and the changes made in commit 51e18eb9b16ee94a.
Check the unapplied patches to check whether there anything crucial missing.

Right now the first is much more appealing, so I'm going to go with that.

Putting a breakpoint on both versions works: both are triggered from nsWindow::GetGLContext() with the aDisplay set to 0x1 as expected. From this point on the actual internals of the two respective methods differ quite considerably.

The ESR 78 code passes the display to a call to GLLibraryEGL::EnsureInitialized(false, &discardFailureId, aDisplay). It then creates a GLContextEGL which it configures and returns.

The ESR 91 code gets the DefaultEglLibrary and uses it to create an EglDisplay structure, which involves passing in the aDisplay value: EglDisplay::Create(*lib.operator->(), (EGLContext)aDisplay, false). It finishes up creating a new GLContextEGL instance similar to the way ESR 78 does.

The initial step through of both each version throws up nothing unusual. For ESR 78:

1098	  if (!GLLibraryEGL::EnsureInitialized
            (false, &discardFailureId, aDisplay)) {
(gdb) n
1103	  if (!aContext || !aSurface) return nullptr;
(gdb) n
1105	  const auto& egl = GLLibraryEGL::Get();
(gdb) n
1106	  SurfaceCaps caps = SurfaceCaps::Any();
(gdb) n
1110	                       (EGLSurface)aSurface, (EGLContext)aContext);
(gdb) n
1106	  SurfaceCaps caps = SurfaceCaps::Any();
(gdb) n
1112	  gl->mOwnsContext = false;
(gdb) n
1106	  SurfaceCaps caps = SurfaceCaps::Any();
(gdb) n
1097	  nsCString discardFailureId;
(gdb) n
mozilla::embedlite::nsWindow::GetGLContext (this=0x7fb8ce4970)
    at mobile/
    sailfishos/embedshared/nsWindow.cpp:416
416	      if (!mozContext || !mozContext->Init()) {

And for ESR 91:

(gdb) n
1000      if (!aContext || !aSurface) return nullptr;
(gdb) n
1002      nsCString failureId;
(gdb) n
1003      const auto lib = gl::DefaultEglLibrary(&failureId);
(gdb) n
1004      if (!lib) {
(gdb) n
1008      const auto egl = EglDisplay::Create(*lib.operator->(),
            (EGLContext)aDisplay, false);
(gdb) n
1009      if (!egl) {
(gdb) n
1016      const auto desc = GLContextDesc{{flags}, false};
(gdb) n
1021                           (EGLSurface)aSurface, (EGLContext)aContext);
(gdb) n
1022      gl->SetIsDoubleBuffered(true);
(gdb) n
1023      gl->mOwnsContext = false;
(gdb) n
1008      const auto egl = EglDisplay::Create(*lib.operator->(),
            (EGLContext)aDisplay, false);
(gdb) n
1003      const auto lib = gl::DefaultEglLibrary(&failureId);
(gdb) n
1002      nsCString failureId;
(gdb) n
mozilla::embedlite::nsWindow::GetGLContext (this=this@entry=0x7f88580b50)
    at mobile/
    sailfishos/embedshared/nsWindow.cpp:421
421           if (!mozContext || !mozContext->Init()) {

They both bounce around a bit, but none of them drop out early due to failures or null pointers.

One obvious difference is how GLLibraryEGL::CreateDisplay() is called. In ESR 78 this happens inside the call to GLLibraryEGL::DoEnsureInitialized(). In ESR 91 this method has been renamed to GLLibraryEGL::Init() and is quite similar, apart from the fact that there is no call to GLLibraryEGL::CreateDisplay() inside it. It's notably absent.

The question I'm therefore wondering is: where, if at all, does this get called in ESR 91. I'm going to find out.

Here it is in ESR 78:

Thread 38 "Compositor" hit Breakpoint 2, mozilla::gl::GLLibraryEGL::CreateDisplay
    (this=this@entry=0x7ed81a25f0, forceAccel=forceAccel@entry=false, 
    gfxInfo=..., out_failureId=out_failureId@entry=0x7f2200e298,
    aDisplay=aDisplay@entry=0x1)
    at gfx/gl/GLLibraryEGL.cpp:760
760	                                       EGLDisplay aDisplay) {
(gdb) bt
#0  mozilla::gl::GLLibraryEGL::CreateDisplay (this=this@entry=0x7ed81a25f0,
    forceAccel=forceAccel@entry=false, gfxInfo=..., 
    out_failureId=out_failureId@entry=0x7f2200e298, aDisplay=aDisplay@entry=0x1)
    at gfx/gl/GLLibraryEGL.cpp:760
#1  0x0000007ff29106c4 in mozilla::gl::GLLibraryEGL::DoEnsureInitialized
    (this=0x7ed81a25f0, forceAccel=false, out_failureId=0x7f2200e298,
    aDisplay=0x1)
    at gfx/gl/GLLibraryEGL.cpp:578
#2  0x0000007ff2910e3c in mozilla::gl::GLLibraryEGL::DoEnsureInitialized
    (this=<optimized out>, forceAccel=<optimized out>, 
    out_failureId=<optimized out>, aDisplay=<optimized out>)
    at gfx/gl/GLLibraryEGL.cpp:403
#3  0x0000007ff2911028 in mozilla::gl::GLContextProviderEGL::
    CreateWrappingExisting (aContext=0x7ed8004e70, aSurface=0x5555ba2d70, 
    aDisplay=<optimized out>) at gfx/gl/GLContextProviderEGL.cpp:1098

But in ESR 91 it's not getting called at all. This is a potentially important difference and definitely worth digging in to further.

From the code it's possible to see how this has happened. I've even discussed it in previous posts. Essentially there was a change that meant all of the displays are created internally, rather than allowing them to be passed in externally. This is a problem for us because when it comes to the Sailfish Browser the display is created in qtmozembed and passed in from there.

In ESR 78 this display gets passed in to GLLibraryEGL::DoEnsureInitialized() (you can see it in the backtrace just earlier). Towards the end of this method there's a call like this:

  mEGLDisplay = CreateDisplay(forceAccel, gfxInfo, out_failureId, aDisplay);
  if (!mEGLDisplay) {
    return false;
  }

If aDisplay isn't set then the CreateDisplay() method does some work. But if it is set — which is the case we're interested in — then the method just returns the value that was passed in without doing anything with it.

In ESR 91 this all changed. Here things would have happened via the GLContextEGLFactory::CreateImpl() method, but this removed the ability to pass in a display.

In the code now, the EglDisplay still goes into a "create" method like this:

Thread 34 "Compositor" hit Breakpoint 4, mozilla::gl::EglDisplay::EglDisplay
    (this=0x555596e700, lib=..., disp=0x1, isWarp=false)
    at gfx/gl/GLLibraryEGL.cpp:681
681     EglDisplay::EglDisplay(const PrivateUseOnly&, GLLibraryEGL& lib,
(gdb) bt
#0  mozilla::gl::EglDisplay::EglDisplay (this=0x555596e700, lib=..., disp=0x1,
    isWarp=false)
    at gfx/gl/GLLibraryEGL.cpp:681
#1  0x0000007fba5c4260 in __gnu_cxx::new_allocator<mozilla::gl::EglDisplay>::
    construct<mozilla::gl::EglDisplay, mozilla::gl::EglDisplay::PrivateUseOnly,
    mozilla::gl::GLLibraryEGL&, void* const&, bool const&> (__p=0x555596e700,
    this=<optimized out>)
    at /srv/mer/toolings/SailfishOS-4.5.0.18/opt/cross/aarch64-meego-linux-gnu/
    include/c++/8.3.0/new:169
[...]
#9  mozilla::gl::EglDisplay::Create (lib=..., display=<optimized out>,
    display@entry=0x1, isWarp=isWarp@entry=false)
    at gfx/gl/GLLibraryEGL.cpp:676
#10 0x0000007fba5c4420 in mozilla::gl::GLContextProviderEGL::
    CreateWrappingExisting (aContext=0x7eb80041a0, aSurface=0x55559676b0,
    aDisplay=0x1)
    at gfx/gl/GLContextProviderEGL.cpp:1008

The call to the EglDisplay constructor is somewhat masked by this line:

  const auto ret =
      std::make_shared(PrivateUseOnly{}, lib, display, isWarp);

So in ESR 78 we call GLLibraryEGL::CreateDisplay() whereas in ESR 91 we call EglDisplay::Create(). Both are defined in GLLibraryEGL.cpp, but they are from different classes and do different things. What we really want is for the ESR 91 to also call GLLibraryEGL::CreateDisplay(). That method exists, although it changed so that it can't take in display values. So that's what we're going to have to change I think.

This has all become rather confusing and the different methods are swimming around in my head. I'm fairly certain that means I've reached the limit of my ability to work on this tonight. I'll have to pick it up again tomorrow.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

1 Nov 2023 : Day 77 #

During the week of November 13th I'll be taking part in a Book Dash for The Turing Way. A book dash is a bit like a design sprint, where a collection of users commit to making a concerted effort to work on contributions for a short period of time; in this case a week.

The Turing Way is a bit like the bible of reproducible science and data science in particular. It calls itself the "handbook to reproducible, ethical and collaborative data science" and I'm a bit of a fan.

Until now I've not made the effort to try to contribute to it, but there are a number of areas where I'd like to.

In particular I've always had a bit of a problem with code review. I've been involved with code reviews for most of my adult life (since 2002 at least) and it's always been a rocky relationship. Jolla had a particular approach to code review that combined elements from both open source and commercial approaches. That's also helped shaped my opinion on it.

Given how challenging I've always found code review (as both reviewer and recipient) I really wanted to add something to The Turing Way about the power structures of code review. I'd like to think it might help others negotiate this tricky activity.

So that I can give the Book Dash the focus it deserves I've decided I'm going to have to pause my Gecko work while the dash is ongoing. I also need some time for preparation, so that means there are unlikely to be any Gecko Dev Diary entries between 4th and 19th November. On the weekend of the 18th I'll write a post about my experience on the Book Dash, and then after that my focus will be back on Gecko again.

This might slow down ESR 91 development. But it might also speed it up: I think it will be good for me to take a step back and mull things over. If I do come up with anything during the gap I'll be certain to write about it here. But if there's some silence for a bit, please don't be surprised!

And if anyone is particularly interested in this, feel free to join the celebrations on the last day of the book dash.

[...]

But let's get back to the code. Yesterday we followed the ESR 78 build down into the rendering depths of the rectangle fill algorithm. We eventually reached the following method which essentially blits a set of values into memory to represent the rectangle.

    template 
    static void rect_memsetT(T buffer[], T value, int count, size_t rowBytes, int height) {
        while (height --> 0) {
            memsetT(buffer, value, count);
            buffer = (T*)((char*)buffer + rowBytes);
        }
    }

The question I'm asking today is: does the same thing happen on ESR 91? This is as far as we can dig in this direction, so if we find it's doing the same thing, it would suggest that the rendering problem is happening at the other end of the process: the part where the surface gets rendered to the screen rather than the part where items get rendered to the surface.

For ESR 78 I had to step through the code quite carefully. That's because we didn't know where we were heading. Armed with the knowledge of what happens with ESR 78 I'm hoping we can short-circuit the process on ESR 91 by simply sticking a breakpoint on the rect_memsetT method to see whether it gets hit.

Unfortunately when I try to create the breakpoint there's nothing for it to attach to, maybe because it's template code. I've added a breakpoint to SkOpts::rect_memset32() instead. This is the method that calls rect_memsetT. This sticks, but will it be triggered?

It does get triggered and gives us an incredibly lengthy but otherwise clean backtrace.

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, neon::rect_memset32
    (buffer=0x7ee07e6010, value=4294967295, count=1080, rowBytes=4320, height=1080)
    at gfx/skia/skia/src/opts/
    SkUtils_opts.h:57
57              rect_memsetT(buffer, value, count, rowBytes, height);
(gdb) bt
#0  neon::rect_memset32 (buffer=0x7ee07e6010, value=4294967295, count=1080,
    rowBytes=4320, height=1080)
    at gfx/skia/skia/src/opts/SkUtils_opts.h:57
#1  0x0000007fbc7078b4 in antifilldot8 (L=0, T=, R=276480,
    B=276480, blitter=0x7f9f3ceda8, fillInner=fillInner@entry=true)
    at gfx/skia/skia/src/core/SkScan_Antihair.cpp:683
#2  0x0000007fbc7079c4 in antifillrect (xr=..., blitter=)
    at gfx/skia/skia/src/core/SkScan_Antihair.cpp:697
#3  0x0000007fbc707ad0 in antifillrect (r=..., blitter=blitter@entry=0x7f9f3ceda8)
    at gfx/skia/skia/src/core/SkScan_Antihair.cpp:768
#4  0x0000007fbc708a54 in SkScan::AntiFillRect (origR=..., clip=0x7f889ae190,
    blitter=0x7f9f3ceda8)
    at gfx/skia/skia/src/core/SkScan_Antihair.cpp:791
#5  0x0000007fbc709090 in SkScan::AntiFillRect (r=..., clip=...,
    blitter=)
    at gfx/skia/skia/src/core/SkRasterClip.h:76
#6  0x0000007fbc754b20 in SkDraw::drawRect (this=this@entry=0x7f9f3cfb78,
    prePaintRect=..., paint=..., paintMatrix=paintMatrix@entry=0x0, 
    postPaintRect=postPaintRect@entry=0x0) at gfx/skia/skia/src/core/SkDraw.cpp:677
#7  0x0000007fbc719800 in SkDraw::drawRect (paint=..., rect=..., this=0x7f9f3cfb78)
    at gfx/skia/skia/src/core/SkDraw.h:42
#8  SkBitmapDevice::drawRect (this=, r=..., paint=...)
    at gfx/skia/skia/src/core/SkBitmapDevice.cpp:365
#9  0x0000007fbc72903c in SkCanvas::onDrawRect (this=, r=...,
    paint=...)
    at gfx/skia/skia/src/core/SkCanvas.cpp:2202
#10 0x0000007fbc720dec in SkCanvas::drawRect (this=0x7f889f0c60, r=...,
    paint=...)
    at gfx/skia/skia/include/core/SkRect.h:632
#11 0x0000007fba508ce8 in mozilla::gfx::DrawTargetSkia::FillRect
    (this=0x7f88978a00, aRect=..., aPattern=..., aOptions=...)
    at gfx/2d/DrawTargetSkia.cpp:852
#12 0x0000007fba53dd9c in mozilla::gfx::DrawTargetTiled::FillRect
    (this=this@entry=0x7f88c22350, aRect=..., aPattern=..., aDrawOptions=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#13 0x0000007fbc15245c in nsDisplayCanvasBackgroundColor::Paint (this=0x7f88d5d6c0,
    aBuilder=, aCtx=)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/gfx/2D.h:124
#14 0x0000007fbc3a5584 in mozilla::FrameLayerBuilder::PaintItems
    (this=this@entry=0x7f88c1d6e0, aItems=std::vector of length 60,
    capacity 64 = {...}, aRect=..., aContext=aContext@entry=0x7f8856bbe0,
    aBuilder=aBuilder@entry=0x7f9f3d1268, aPresContext=aPresContext@entry=0x7f88a3d310, aOffset=..., 
    aXScale=1, aYScale=) at layout/painting/FrameLayerBuilder.cpp:7112
#15 0x0000007fbc3a59c4 in mozilla::FrameLayerBuilder::DrawPaintedLayer
    (aLayer=0x7f889802b0, aContext=0x7f8856bbe0, aRegionToDraw=...,
    aDirtyRegion=..., aClip=mozilla::layers::DrawRegionClip::DRAW,
    aRegionToInvalidate=..., aCallbackData=0x7f9f3d1268)
    at layout/painting/FrameLayerBuilder.cpp:7270
[...]
#62 0x0000007fb78b389c in ?? () from /lib64/libc.so.6

The question now is how does this compare with the "manual" backtrace we collected yesterday while following the ESR 78 code. Here's what we're comparing it to. This is greatly simplified because I only noted down a selection of the methods that were called.

static void rect_memsetT(T buffer[], T value, int count, size_t rowBytes, int height)
SkOpts::rect_memset32(device, color, width, rowBytes, height)
SkARGB32_Blitter::blitRect (height=45, width=777, y=0, x=256, this=0x7fde8d2038)
SkAAClipBlitter::blitRect (this=0x7fde8d1a58, x=29, y=0, width=1008, height=1080)
static void antifillrect(const SkXRect& xr, SkBlitter* blitter)
SkDraw::drawRect()
nsDisplayList::PaintRoot()
nsLayoutUtils::PaintFrame()
PresShell::Paint()

We see all of these there, apart from the calls to blitRect()

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, neon::rect_memset32
    (buffer=0x7ee07e6010, value=4294967295, count=1080, rowBytes=4320,
    height=1080)
    at gfx/skia/skia/src/opts/SkUtils_opts.h:57
57              rect_memsetT(buffer, value, count, rowBytes, height);
(gdb) bt
#0  neon::rect_memset32 (buffer=0x7ee07e6010, value=4294967295, count=1080,
    rowBytes=4320, height=1080)
#2  0x0000007fbc7079c4 in antifillrect (xr=..., blitter=)
#6  0x0000007fbc754b20 in SkDraw::drawRect (this=this@entry=0x7f9f3cfb78,
    prePaintRect=..., paint=..., paintMatrix=paintMatrix@entry=0x0, 
#26 0x0000007fbc3a2acc in nsDisplayList::PaintRoot (this=this@entry=0x7f9f3d3078,
    aBuilder=aBuilder@entry=0x7f9f3d1268, aCtx=aCtx@entry=0x0, 
#27 0x0000007fbc12ee28 in nsLayoutUtils::PaintFrame
    (aRenderingContext=aRenderingContext@entry=0x0,
    aFrame=aFrame@entry=0x7f88bd0cd0, aDirtyRegion=..., 
#28 0x0000007fbc0b99ac in mozilla::PresShell::Paint
    (this=this@entry=0x7f88b8c360, aViewToPaint=aViewToPaint@entry=0x7f887beb20,
    aDirtyRegion=...,

This is just one case and it may be that one of the other traces has a different call stack. So I worked through them all. There were cases that went a different route and ended up calling a method from the SKBlitter class. No calls to the blitRect() method though.

Thread 8 "GeckoWorkerThre" hit Breakpoint 2, neon::rect_memset32
    (buffer=0x7ee08a3f90, value=4293190837, count=824, rowBytes=4320,
    height=57)
    at gfx/skia/skia/src/opts/SkUtils_opts.h:57
57              rect_memsetT(buffer, value, count, rowBytes, height);
(gdb) bt
#0  neon::rect_memset32 (buffer=0x7ee08a3f90, value=4293190837, count=824,
    rowBytes=4320, height=57)
    at gfx/skia/skia/src/opts/SkUtils_opts.h:57
#1  0x0000007fbc6eb43c in SkBlitter::blitAntiRect (this=0x7f9f3cec28,
    x=, y=180, width=824, height=57, leftAlpha=, 
    rightAlpha=0 '\000') at gfx/skia/skia/src/core/SkBlitter.cpp:137
#2  0x0000007fbc792c58 in aaa_walk_convex_edges
    (prevHead=prevHead@entry=0x7f9f3cddc0, blitter=blitter@entry=0x7f9f3cde50,
    start_y=start_y@entry=179, 
    stop_y=stop_y@entry=238, leftBound=leftBound@entry=0,
    riteBound=riteBound@entry=70778880, isUsingMask=isUsingMask@entry=false)
    at gfx/skia/skia/include/private/SkTo.h:19
#3  0x0000007fbc79b980 in aaa_fill_path (forceRLE=false, isUsingMask=false,
    pathContainedInClip=true, stop_y=238, start_y=179, blitter=0x7f9f3cde50, 
    clipRect=..., path=...) at gfx/skia/skia/src/core/SkScan_AAAPath.cpp:1926
#4  SkScan::AAAFillPath (path=..., blitter=blitter@entry=0x7f9f3cec28, ir=...,
    clipBounds=..., forceRLE=forceRLE@entry=false)
    at gfx/skia/skia/src/core/SkScan_AAAPath.cpp:1988
#5  0x0000007fbc706b58 in SkScan::AntiFillPath (path=..., origClip=...,
    blitter=blitter@entry=0x7f9f3cec28, forceRLE=forceRLE@entry=false)
    at gfx/skia/skia/include/core/SkRegion.h:144
#6  0x0000007fbc707150 in SkScan::AntiFillPath (path=..., clip=...,
    blitter=0x7f9f3cec28)
    at gfx/skia/skia/src/core/SkRasterClip.h:76
#7  0x0000007fbc753dec in SkDraw::drawDevPath (this=this@entry=0x7f9f3cfb08,
    devPath=..., paint=..., drawCoverage=drawCoverage@entry=false, 
    customBlitter=customBlitter@entry=0x0, doFill=doFill@entry=true)
    at gfx/skia/skia/src/core/SkDraw.cpp:881
#8  0x0000007fbc754200 in SkDraw::drawPath (this=this@entry=0x7f9f3cfb08,
    origSrcPath=..., origPaint=..., prePathMatrix=prePathMatrix@entry=0x0, 
    pathIsMutable=pathIsMutable@entry=false,
    drawCoverage=drawCoverage@entry=false, customBlitter=customBlitter@entry=0x0)
    at gfx/skia/skia/src/core/SkTLazy.h:195
#9  0x0000007fbc719ec0 in SkDraw::drawPath (pathIsMutable=false,
    prePathMatrix=0x0, paint=..., path=..., this=0x7f9f3cfb08)
    at gfx/skia/skia/src/core/SkDraw.h:56
#10 SkBitmapDevice::drawPath (this=, path=..., paint=...,
    pathIsMutable=false)
    at gfx/skia/skia/src/core/SkBitmapDevice.cpp:401
#11 0x0000007fbc72a398 in SkCanvas::onDrawPath (this=0x7f889f0c60, path=...,
    paint=...)
    at gfx/skia/skia/src/core/SkCanvas.cpp:2365
#12 0x0000007fbc7213c4 in SkCanvas::drawPath (this=0x7f889f0c60, path=...,
    paint=...)
    at gfx/skia/skia/src/core/SkCanvas.cpp:1878
#13 0x0000007fba5094d8 in mozilla::gfx::DrawTargetSkia::Fill
    (this=0x7f88978a00, aPath=0x7f889aa920, aPattern=..., aOptions=...)
    at gfx/2d/PathSkia.h:84
#14 0x0000007fba538e5c in mozilla::gfx::DrawTargetTiled::Fill
    (this=0x7f88c22350, aPath=0x7f889aa920, aPattern=..., aDrawOptions=...)
    at ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/RefPtr.h:313
#15 0x0000007fba53a6d0 in mozilla::gfx::DrawTarget::FillRoundedRect
    (this=this@entry=0x7f88c22350, aRect=..., aPattern=..., aOptions=...)
    at gfx/2d/DrawTarget.cpp:190
[...]
#65 0x0000007fb78b389c in ?? () from /lib64/libc.so.6

In practice at this depth the code is identical between the two, so I'm going to put the differences down to compiler optimisation. At any rate I can't see any obvious way the rect_memset32() calls could be made without going through them.

This is enough for me. I'm going to call this a wrap and say that the sharp end of the rendering process — the point at which the individual items are rendered onto the backbuffer surface — appears to be pretty much identical between the two executables.

So it's time to switch focus and look at the other end of the process.

For the next steps I plan to check two things. First I want to step through GLContextProviderEGL::CreateWrappingExisting(), added in the following commit:

commit 51e18eb9b16ee94a775d3d4f1f333c11e306368e
Author: David Llewellyn-Jones 
Date:   Sat Sep 16 15:30:19 2023 +0100

    Restore GLContextProviderEGL::CreateWrappingExisting()
    
    Restores a method for creating a GLContexst for an existing EglDisplay.
    This is useful because on Sailfish OS the EglDisplay is created in
    QtMozEmbed and passed to xulrunner, rather than being created by
    xulrunner directly.

This was a change I made to allow the display surface to be passed from qtmozembed to the gecko code, since it had been removed from upstream. This is an obvious place where things might be going wrong.

So I'll step through both versions and see whether they differ.

Second, I need to work through all of the patches that have yet to be applied. It's quite possible there's something important in there that's missing and may be causing the rendering problems.

All of this is for the future though.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.

Comment

flypig.co.uk

Location

List items

Blog

All items from November 2023

Navigate

Actions

Archives