flypig.co.uk

Blog

All items from August 2023

31 Aug 2023 : Day 15
You may recall that last night I had to leave the build running after fixing a font structure error. I was happy to hear during the Sailfish Community meeting on IRC this morning that there was some anticipation for the result (thanks ExTechOp!). So let's get to it...

When I returned to my laptop this morning, I found the following at the bottom of my console output:
128:15.56 ipc/chromium
128:19.37 make[4]: *** No rule to make target 'moc_message_pump_qt.cc',
          needed by 'moc_message_pump_qt.o'.  Stop.
128:19.37 make[3]: *** [${PROJECT}/gecko-dev/config/recurse.mk:72: ipc/chromium/target-objects] Error 2
128:19.38 make[2]: *** [${PROJECT}/gecko-dev/config/recurse.mk:34: compile] Error 2
128:19.38 make[1]: *** [${PROJECT}/gecko-dev/config/rules.mk:355: default] Error 2
128:19.38 make: *** [client.mk:65: build] Error 2
128:19.38 0 compiler warnings present.
Well, that's not a built library, but it is further than it was before! The build ran for over two hours before failing this time. As it happens, we've already seen this error, back on Day 10. Grepping the patches, we can see that moc_message_pump_qt.cc actually appears in two patches: 0002, which we've already discussed at some length, and 0007, which is new:
$ grep -rIn "moc_message_pump_qt.cc" *
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:2392:+        '!moc_message_pump_qt.cc',
0007-sailfishos-gecko-Disable-MOC-code-generation-for-mes.patch:25:- '!moc_message_pump_qt.cc',
The latter has the title "Disable MOC code generation for message_pump_qt". That's pretty encouraging, because it's precisely this MOC code generation that's causing difficulty.
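For anyone wanting to script this kind of lookup, here's a minimal Python sketch of the same search. The function name find_in_patches is made up for illustration, and it assumes the patches live in a single directory as *.patch files; it only mimics what the grep above does.

```python
from pathlib import Path


def find_in_patches(patch_dir, needle):
    """Return (file name, line number, line) for every patch line containing needle."""
    hits = []
    for patch in sorted(Path(patch_dir).glob("*.patch")):
        # Patches can contain odd bytes, so decode leniently rather than fail.
        text = patch.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((patch.name, number, line))
    return hits
```

Calling find_in_patches("rpm", "moc_message_pump_qt.cc") from the project root would be the equivalent of the grep, with the results available as a list for further filtering.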

In case you're curious and don't already know, MOC stands for "Meta Object Compiler". This is a Qt-thing, because Qt doesn't exactly use pure C++. More specifically, it uses a bunch of extensions (e.g. new keywords) that the C++ compiler doesn't understand. Before being compiled, Qt code is passed through the Meta Object Compiler in order to convert these extensions into your common-or-garden standard C++.

The files it generates are generally of the form moc_something.cc which is what we have here. Based on this, and looking at the error, it looks very much like the build makefile is asking for the object file moc_message_pump_qt.o to be generated. The make file is specifying that in order to generate the object file, it should build the C++ file moc_message_pump_qt.cc. But that file doesn't exist (or can't be found).

And that's because the file isn't currently being generated. Patch 0007 should help with this, not by generating the code needed, but rather by ensuring the object file isn't asked for at all.
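To make the failure mode concrete, here's a toy Python model of it. This is not how make or moz.build work internally. The function unbuildable_sources is hypothetical, the plain source path is illustrative, and the convention that a leading '!' marks a file expected to be generated during the build is an assumption based on the '!moc_message_pump_qt.cc' entry that the two patches add and remove.

```python
def unbuildable_sources(sources, generated):
    """List '!'-prefixed sources that no build step actually generates."""
    return [src for src in sources if src.startswith("!") and src[1:] not in generated]


# Before patch 0007: the source list asks for a MOC output that nobody produces.
before = ["src/base/message_pump_qt.cc", "!moc_message_pump_qt.cc"]
# After patch 0007: the generated file is no longer requested at all.
after = ["src/base/message_pump_qt.cc"]

# MOC isn't being run, so nothing is generated.
generated_files = set()

print(unbuildable_sources(before, generated_files))
print(unbuildable_sources(after, generated_files))
```

The first call reports the entry that would leave make without a rule, and the second reports nothing. That's the effect the patch is after: it avoids the request rather than satisfying it.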

And the patch applies first time without glitches!
$ patch -d gecko-dev -p1 < rpm/0007-sailfishos-gecko-Disable-MOC-code-generation-for-mes.patch 
patching file ipc/chromium/moz.build
Hunk #1 succeeded at 119 (offset 8 lines).
patching file ipc/chromium/src/base/message_pump_qt.cc
patching file ipc/chromium/src/base/message_pump_qt.h
Looking at the patch, there's some other non-MOC related stuff happening in there too, but it all looks pretty sensible and worth keeping.

The question now is, as always: will it build?

Sadly, the answer is not exactly. Even though it's an incremental build on top of the previous one it still runs for over two hours. But then bails out with this:
134:28.06 ipc/chromium
134:34.13 In file included from Unified_cpp_ipc_chromium0.cpp:65:
134:34.13 ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:
          In constructor ‘MessageLoop::MessageLoop(MessageLoop::Type, nsIEventTarget*)’:
134:34.13 ${PROJECT}/gecko-dev/ipc/chromium/src/base/message_loop.cc:248:17: error:
          expected type-specifier
134:34.13      pump_ = new base::MessagePumpForUI();
134:34.13                  ^~~~
The full section of code causing the problem looks like this:
233 #if defined(OS_WIN)
234   // TODO(rvargas): Get rid of the OS guards.
235   if (type_ == TYPE_DEFAULT) {
236     pump_ = new base::MessagePumpDefault();
237   } else if (type_ == TYPE_IO) {
238     pump_ = new base::MessagePumpForIO();
239   } else {
240     DCHECK(type_ == TYPE_UI);
241     pump_ = new base::MessagePumpForUI();
242   }
243 #elif defined(OS_POSIX)
244   if (type_ == TYPE_UI) {
245 #  if defined(OS_MACOSX)
246     pump_ = base::MessagePumpMac::Create();
247 #  elif defined(OS_LINUX) || defined(OS_BSD)
248     pump_ = new base::MessagePumpForUI();
249 #  endif  // OS_LINUX
250   } else if (type_ == TYPE_IO) {
251     pump_ = new base::MessagePumpLibevent();
252   } else {
253     pump_ = new base::MessagePumpDefault();
254   }
255 #endif    // OS_POSIX
That's a lot of preprocessor directives.

Line 248 is the one that's failing, and we can see it's already neatly wrapped in a check for Linux. So this is almost certainly the line we need to address. But why isn't it building? Once again, it looks like the solution is in the existing patches, this one being patch 0009 "Backport Embed MessageLoop constructor back". The obvious next step is to apply this and restart the build.

Once again the patch applies cleanly.
$ patch -d gecko-dev -p1 < rpm/0009-sailfishos-gecko-Backport-Embed-MessageLoop-contruct.patch 
patching file ipc/chromium/src/base/message_loop.cc
Hunk #1 succeeded at 28 (offset 3 lines).
Hunk #2 succeeded at 178 (offset 9 lines).
patching file ipc/chromium/src/base/message_loop.h
patching file ipc/chromium/src/base/message_pump_qt.h
So off we go again with another build, hopefully not a two hour wait this time.

By this point you might be wondering "why don't you just apply all the patches first, and only then check the build?" As it happens, this is pretty much the strategy that was initially attempted for the jump from ESR 68 to ESR 78. The problem is there are a lot of patches. Getting through them all would take a very long time, and changes from one patch to the next are much harder to test if you don't already have a working build.

Moreover, there are far fewer patches that are necessary to get the code to build, compared to getting it running well on Sailfish OS. The upshot is that picking out only those patches that are absolutely necessary to get the build to pass turns out to be the better strategy.

There's a bunch of assumptions and suppositions in this reasoning, but even if the claim turns out not to be true, it's the path I've started down already. So I'm sticking to it. Sunk costs and all that.

Two hours later and the build reaches its conclusion again with the following error:
133:43.62 In file included from PEmbedLiteViewChild.cpp:14,
133:43.62                  from UnifiedProtocols14.cpp:20:
133:43.62 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/embedlite
          /EmbedLiteViewChild.h:17:10  fatal error: nsIIdleServiceInternal.h:
          No such file or directory
133:43.62  #include "nsIIdleServiceInternal.h"
133:43.62           ^~~~~~~~~~~~~~~~~~~~~~~~~~
133:43.62 compilation terminated.
Let's see what that error relates to by checking the git log.
$ git log -- widget/nsIdleService.h
commit 8f54209ef6191bf7f25082ec10cf81786c04aeff
Author: Doug Thayer 
Date:   Mon Jul 20 16:06:59 2020 +0000

    Bug 1651165 - Rename idle service r=Gijs,geckoview-reviewers,snorp
    
    Differential Revision: https://phabricator.services.mozilla.com/D83413Id
From the diff we can see that the nsIdleService and nsIdleServiceInternal classes have been renamed to nsUserIdleService and nsUserIdleServiceInternal respectively. To fix this we need to update all references in the EmbedLite code to match these changes. Happily it's a pretty straightforward change.
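A rename like this is mechanical enough to script. Here's a minimal sketch, assuming only the two class renames mentioned above. The function name rename_identifiers is hypothetical, and any header or interface names (such as the nsIIdleServiceInternal.h from the error message) would need their own entries once their new names have been checked against the upstream diff.

```python
import re

# Old name -> new name, taken from the class renames described above.
RENAMES = {
    "nsIdleService": "nsUserIdleService",
    "nsIdleServiceInternal": "nsUserIdleServiceInternal",
}


def rename_identifiers(text, renames=RENAMES):
    """Replace whole identifiers only, leaving longer names that contain them alone."""
    # Longest names first so the alternation can't stop early on a prefix.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], text)


print(rename_identifiers('#include "nsIdleService.h"'))
```

The word boundaries matter here: without them the shorter name would also match inside the longer one and inside unrelated identifiers. Applying this across the EmbedLite tree would just be a matter of reading each file, passing its text through the function and writing it back.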

However, since these changes included the renaming of generated header files, it was necessary to re-run the build from scratch. This time it takes longer (two hours fifty minutes) before it fails. That's good. The next error message is the following.
169:45.72 In file included from ${PROJECT}/obj-build-mer-qt-xr/dist/include
          /mozilla/embedlite/EmbedLiteViewChild.h:18,
169:45.72                  from PEmbedLiteViewChild.cpp:14,
169:45.72                  from UnifiedProtocols14.cpp:20:
169:45.72 ${PROJECT}/obj-build-mer-qt-xr/dist/include/mozilla/embedlite
          /BrowserChildHelper.h:16:10  fatal error: nsIWebBrowserChrome3.h:
          No such file or directory
169:45.72  #include "nsIWebBrowserChrome3.h"
169:45.72           ^~~~~~~~~~~~~~~~~~~~~~~~
169:45.72 compilation terminated.
Looking at the ESR 78 code I can see that there is an nsIWebBrowserChrome3.h file in this version. It's in the toolkit/components/browser directory. A quick check of the git log shows it was removed for Bugzilla Bug 1701668 (with the descriptive title of "Remove nsIWebBrowserChrome3 interface"). The commit message also helpfully points out the relevant upstream diff.

So nsIWebBrowserChrome3 is gone from upstream. The fix has to be one of two things. Either we revert the upstream change to bring nsIWebBrowserChrome3.h back, or we remove all references to nsIWebBrowserChrome3.h in the EmbedLite code as well. I need to figure out which is the right approach.

Grepping the code for instances of its use gives the following.
$ grep -rIn "nsIWebBrowserChrome3" *
embedding/embedlite/utils/BrowserChildHelper.cpp:728:
    BrowserChildHelper::GetWebBrowserChrome(nsIWebBrowserChrome3** aWebBrowserChrome)
embedding/embedlite/utils/BrowserChildHelper.cpp:735:
    BrowserChildHelper::SetWebBrowserChrome(nsIWebBrowserChrome3* aWebBrowserChrome)
embedding/embedlite/utils/BrowserChildHelper.h:16:
    #include "nsIWebBrowserChrome3.h"
embedding/embedlite/utils/BrowserChildHelper.h:148:
    nsCOMPtr<nsIWebBrowserChrome3> mWebBrowserChrome;
This includes both our EmbedLite code and upstream's gecko code, and essentially means that only the Getter and Setter are left. That's a strong indication that the right move is to get rid of it entirely by just ripping out the Getters and Setters and being done with it.

There are still a bunch of calls to GetWebBrowserChrome() and SetWebBrowserChrome(), but they relate to a different type and so I'm expecting those shouldn't interact badly with the removal from BrowserChildHelper. But another build should make that clear.

It'll be another long build, so I won't now be finding out the answer to this until tomorrow.

I can't finish this post without mentioning the awesome work that direc85 has been doing. You may recall that two days ago I mentioned that he'd tracked down a plausible candidate for the gcc bug causing the swgl build to fail. I'm pleased to say that direc85 has continued his work, applying the patch to the gcc source and working through the steps needed to get it to rebuild. As I write this it's still a work-in-progress, but nevertheless progressing well. As part of my work for tomorrow I hope to test it out with the swgl build to see what happens.

Let's hope this validates direc85's impressive detective work. It'd be really great if this fixes things, so something to look forward to!

For all the other posts, check out my full Gecko Dev Diary.
30 Aug 2023 : Day 14
Let's quickly recap on what happened yesterday. We looked at the Rust components and style component in particular. It was generating a segfault which we eventually tracked down to the debug symbols.

Building without symbols seemed to fix the problem in the isolated context of building the single component, so after making the relevant changes to the build process, I'd set a complete rebuild to run overnight.

This morning all the indications are that style went through okay. The only error now showing up in the output is the font error we saw back on Day 10. Here's what it shows:
164:53.88 ${PROJECT}/gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp: In lambda function:
164:53.88 ${PROJECT}/gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72: error: 
          could not convert ‘false’ from ‘bool’ to ‘mozilla::WeightRange’
164:53.88                                               weight,     stretch, style};
164:53.89                                                                         ^
164:53.89 ${PROJECT}/gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72: error: 
          could not convert ‘weight’ from ‘gfxPlatformFontList::WeightRange’
          {aka ‘mozilla::WeightRange’} to ‘mozilla::StretchRange’
164:53.89 ${PROJECT}/gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72: error:
          could not convert ‘stretch’ from ‘gfxPlatformFontList::StretchRange’
          {aka ‘mozilla::StretchRange’} to ‘mozilla::SlantStyleRange’
164:53.89 ${PROJECT}/gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72: error:
          could not convert ‘style’ from ‘gfxPlatformFontList::SlantStyleRange’
          {aka ‘mozilla::SlantStyleRange’} to ‘RefPtr<gfxCharacterMap>’
164:55.60 make[4]: *** [${PROJECT}/gecko-dev/config/rules.mk:676: gfxFcPlatformFontList.o] Error 1
This is the kind of issue I like: a nice, clear error in the code, rather than an obscure compiler bug or failure hidden deep in the build pipeline. The output tells us exactly where the issue is, and the job is to go through the code and match up the function prototype with the calling code.

So, here is the calling code taken from line 1725 of gfxFcPlatformFontList.cpp:
    auto initData = fontlist::Face::InitData{descriptor, 0,       size, false,
                                             weight,     stretch, style};
And looking carefully, it turns out it's not a call at all, but rather a struct initialiser (it has curly braces rather than parentheses). The principle is the same though: the input parameters need to match up with what the structure is expecting.

Looking a few lines up in the code we can see that the weight, stretch and style variables do have the types it's complaining about.
    WeightRange weight(FontWeight::Normal());
    StretchRange stretch(FontStretch::Normal());
    SlantStyleRange style(FontSlantStyle::Normal());
Comparing these to the error it looks like the initialiser values have been shifted along by one relative to the structure's members, so either the false has been added to the initialiser but not the structure, or something else has been added before it.

A quick search shows the fontlist::Face::InitData structure is defined in the SharedFontList.h file, in which we find it defined like this:
struct Face {
  // Data required to initialize a Face
  struct InitData {
    nsCString mDescriptor;  // descriptor that can be used to instantiate a
                            // platform font reference
    uint16_t mIndex;        // an index used with descriptor (on some platforms)
#ifdef MOZ_WIDGET_GTK
    uint16_t mSize;  // pixel size if bitmap; zero indicates scalable
#endif
    bool mFixedPitch;                  // is the face fixed-pitch (monospaced)?
    mozilla::WeightRange mWeight;      // CSS font-weight value
    mozilla::StretchRange mStretch;    // CSS font-stretch value
    mozilla::SlantStyleRange mStyle;   // CSS font-style value
    RefPtr<gfxCharacterMap> mCharMap;  // character map, or null if not loaded
  };
From this the problem is immediately clear: the mSize parameter is being masked out by the MOZ_WIDGET_GTK pre-processor directive. For our build this is unset, but we still want to include the line that's being masked out.

It's a little baffling that there's no similar pre-processor condition in gfxFcPlatformFontList.cpp. It looks like the code is guaranteed to fail if MOZ_WIDGET_GTK hasn't been defined. But there it is. Probably there's something higher up the process that's preventing that bit of code from being built.

For our part, we can fix this by extending the pre-processor condition to include MOZ_WIDGET_QT which is our special "Sailfish OS" define, like this:
#if defined(MOZ_WIDGET_GTK) || defined(MOZ_WIDGET_QT)
    uint16_t mSize;  // pixel size if bitmap; zero indicates scalable
#endif
There are a few other places in the same file — and also in SharedFontList.cpp — that make use of the size variable. These also need amending, but all the changes are along similar lines.
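To see why one missing member produces four errors, it helps to line the initialiser values up against the members that survive preprocessing. The following is a toy Python model of that pairing, not anything the compiler actually does. The names mismatches and convertible are made up for illustration, the types given for descriptor and size are assumptions, and the conversion rule is deliberately crude (integer-like values may fill integer or bool slots, everything else must match exactly).

```python
# (member name, type, guarded by the widget #ifdef?) in declaration order,
# copied from the InitData struct above.
MEMBERS = [
    ("mDescriptor", "nsCString", False),
    ("mIndex", "uint16_t", False),
    ("mSize", "uint16_t", True),
    ("mFixedPitch", "bool", False),
    ("mWeight", "WeightRange", False),
    ("mStretch", "StretchRange", False),
    ("mStyle", "SlantStyleRange", False),
    ("mCharMap", "RefPtr", False),
]

# (expression, type) for each value in the initialiser at line 1725.
# The types of descriptor and size are assumed; the rest come from the text.
INITIALISERS = [
    ("descriptor", "nsCString"),
    ("0", "uint16_t"),
    ("size", "uint16_t"),
    ("false", "bool"),
    ("weight", "WeightRange"),
    ("stretch", "StretchRange"),
    ("style", "SlantStyleRange"),
]

INTEGER_LIKE = {"uint16_t", "bool"}


def convertible(source, target):
    """Crude stand-in for C++ conversions: same type, or both integer-like."""
    return source == target or {source, target} <= INTEGER_LIKE


def mismatches(defines, guard):
    """Pair initialisers with the members that survive preprocessing."""
    guard_active = bool(set(defines) & set(guard))
    active = [(n, t) for n, t, guarded in MEMBERS if guard_active or not guarded]
    return [
        (expr, expr_type, member_type)
        for (expr, expr_type), (_, member_type) in zip(INITIALISERS, active)
        if not convertible(expr_type, member_type)
    ]


# Sailfish OS build with the original guard: mSize vanishes and the values shift.
for expr, source, target in mismatches({"MOZ_WIDGET_QT"}, {"MOZ_WIDGET_GTK"}):
    print(f"could not convert '{expr}' from '{source}' to '{target}'")

# With the guard extended to MOZ_WIDGET_QT nothing is reported.
print(mismatches({"MOZ_WIDGET_QT"}, {"MOZ_WIDGET_GTK", "MOZ_WIDGET_QT"}))
```

With the original guard the model reports the same four conversions as the compiler output: false, weight, stretch and style each land one slot too early. With the extended guard every value lines up again.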

Setting the build off after making these changes, it seems to be progressing a lot further than previously. That's a good sign.

After 88 minutes 2.08 seconds (according to the log output) it's still working away without having hit an error. That's definitely the furthest yet. But it's also pushed me way past my bed time. So I'll leave it running overnight and see where it's got to in the morning.

For all the other posts, check out my full Gecko Dev Diary.
29 Aug 2023 : Day 13
Yesterday we discussed a mysterious compiler segfault which we worked around by reducing the optimisation level of the build from O2 to O1. We also ended up with a Rust error that also triggered a segfault.

Before we look at that Rust error, a couple of readers made really good suggestions that deserve a mention.

First Fabrice Desré suggested on Mastodon that it would be worth switching from using gcc to clang for the build:
 
@flypig you're not building with clang instead of gcc? [...] I'm just wondering if clang would not fare better than gcc. Gecko prebuilt toolchain that you can install with ./mach bootstrap is clang based these days.

That's a really useful point to make. I double checked the build and it's definitely using gcc and it sounds like we really should be using clang. At the outset of this journey I did try using ./mach bootstrap but I didn't get good results and ended up using ./mach create-mach-environment instead.

I don't plan to look into this now as changing the optimisation seems to have done the job. But this is definitely something to look into, either once the build is working, or in case there are later failures that don't seem to have another solution.

The reason I'm not looking into this now is because, in spite of all evidence to the contrary, there is a plan here, which I need to stick to: get a successful build working as quickly as possible. Once the build is working, there will then be more time to try to do things properly. I'm really grateful for the input from Fabrice and it's important not to forget along the way that this needs investigation.

I also got a great suggestion from Nephros on the Sailfish Forum.
 
I'm sure I'm telling nothing new here, but -O2 is just a shorthand for a set of optimization switches. Maybe it's useful to try specifying them manually, and "bisect" the list to find which precise switch is behind this.
    gcc -Q --help=optimizers

This is also a really neat idea. If it's possible to narrow this down to a single compiler option, not only will it help avoid unnecessary de-optimisation, it might also clarify what the underlying compiler bug is (if that's what it is) that's causing the segfault.
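The bisection itself is simple to express. Here's a minimal sketch in Python, assuming exactly one flag is responsible and that the failure shows up whenever that flag is present. The fails callback is a placeholder: in practice it would run the failing compile with -O1 plus the given flags and report whether it crashed. The flags listed are ones gcc documents as part of -O2, but the choice of culprit in the demo is purely illustrative.

```python
def bisect_flags(flags, fails):
    """Return the single flag that triggers the failure, assuming exactly one does."""
    candidates = list(flags)
    if not fails(candidates):
        raise ValueError("the failure doesn't reproduce with these flags")
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        if fails(first_half):
            candidates = first_half
        else:
            candidates = candidates[len(first_half):]
    return candidates[0]


# A few of the switches gcc enables at -O2 but not at -O1.
o2_flags = [
    "-fgcse",
    "-fexpensive-optimizations",
    "-fstrict-aliasing",
    "-ftree-vrp",
    "-finline-small-functions",
    "-fcaller-saves",
]


def pretend_compile_fails(flags):
    """Stand-in for a real compile; the culprit here is made up for the demo."""
    return "-fexpensive-optimizations" in flags


print(bisect_flags(o2_flags, pretend_compile_fails))
```

Each round halves the list, so even a few dozen switches need only a handful of compiles. One caveat worth keeping in mind is that the gcc documentation notes not every optimisation is controlled by its own flag, so -O1 plus the full list may not behave identically to -O2.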

What's more direc85 has very generously offered to try this out:
 
That sounds like a lot of grunt work that needs some horsepower... No promises yet, but that’s something I may be able to help with.

Having now read the daily nerd snipe over lunch, one thought popped into my mind: what if it is architecture specific? If it compiles for armv7hl but not aarch64, that would indeed indicate a compiler bug... Something that may have been fixed already, somewhat likely... I’ll do some research after work, let’s see what I can find.

I think nephros and direc85 are onto something here. And once again, these interactions have really highlighted to me how important it is to have external input on all this.

Let's now move on to the Rust components that we were looking at yesterday and try to figure out why they're failing to build. First we had glslopt, which was being built for the wrong architecture. Then we had a strange optimisation failure building webrender. The last thing we saw was the style component failing to build.

The error generated was this:
 6:29.21 fatal runtime error: Rust cannot catch foreign exceptions
 6:29.35 warning: `style` (lib) generated 5 warnings
 6:29.35 error: could not compile `style`; 5 warnings emitted
 6:29.35 Caused by:
 6:29.35   process didn't exit successfully: `/usr/bin/rustc --crate-name [...]`
           (signal: 6, SIGABRT: process abort signal)
This is slightly different from the previous errors — it's not immediately obvious that it's an optimisation error — but it's also not obvious that it's a genuine coding or configuration error. It could be a compiler bug, or something else entirely. It's also not something we saw when working on ESR 78. Finally, it's not something that appears to be well-documented on the Web.

Things I've not seen before make me nervous, especially if a Web search isn't throwing up good leads.

The good news is that the error output includes the final command that caused the problem. The entire command is 9712 characters long (that's nearly enough text to fill up the available memory on a BBC B running in Mode 2) so I'm not going to copy it out here. But here's an abridged version with the most important parts retained.
/usr/bin/rustc --crate-name style --edition=2018 servo/components/style/lib.rs \
  --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat \
  --crate-type lib --emit=dep-info,metadata,link -C opt-level=2 -C panic=abort \
  -C embed-bitcode=no --cfg 'feature="bindgen"' --cfg 'feature="gecko"' \
  [...]
  debuginfo=2 -C force-frame-pointers=yes --cap-lints warn -Cembed-bitcode=yes \
  -C codegen-units=1
By setting the OUT_DIR environment variable appropriately I'm able to execute this command inside the scratchbox2 target in order to generate the same error.
fatal runtime error: Rust cannot catch foreign exceptions
Aborted (core dumped)
This is really helpful because it converts a multi-hour build odyssey into a few-minute build excursion. My immediate impulse is to try this with a lower optimisation setting. Checking the rustc options I see that -C opt-level=2 is setting the optimisation level for the build:
$ rustc -C help | grep "opt-level"
    -C                opt-level=val -- optimization level (0-3, s, or z; default: 0)
When I try to build with the optimisation level set to zero, a slightly different error occurs.
LLVM ERROR: out of memory
Allocation failed
Aborted (core dumped)
Maybe memory is the underlying issue here? We've had memory problems with previous gecko builds, so this is definitely a possibility. They seem to come in two flavours. First there are memory errors that come from a lack of physical memory in the machine doing the build. From previous experience I know the build will require at least 16GiB of memory. The machine I'm using has 32GiB of RAM and 16GiB of swap memory, which really should be enough. Second there are memory errors caused by Sailfish SDK (scratchbox2 and QEMU) restrictions.
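The first flavour is easy to rule out with a quick check of the host. The sketch below parses text in the format of Linux's /proc/meminfo, where the values are labelled kB but are counted in units of 1024 bytes. The function names are made up for illustration, and the sample figures simply mirror the 32GiB of RAM and 16GiB of swap mentioned above. It says nothing about the second flavour, since limits imposed inside scratchbox2 or QEMU won't show up here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in KiB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def total_gib(meminfo):
    """Combined RAM and swap in GiB."""
    kib = meminfo.get("MemTotal", 0) + meminfo.get("SwapTotal", 0)
    return kib / (1024 * 1024)


sample = "MemTotal:       33554432 kB\nSwapTotal:      16777216 kB\n"
info = parse_meminfo(sample)
print(total_gib(info), "GiB available;", "enough" if total_gib(info) >= 16 else "too little")
```

On a real Linux host the sample string would be replaced with the contents of /proc/meminfo.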

You may recall Thigg highlighted this second issue a while back, flagging up a problem experienced with QEMU related to high memory usage. Upgrading QEMU was seen as the solution in that case, but that's going to be a long way around for us.

With previous gecko builds one of the challenges was building using debug symbols, as this in particular seemed to push the build over the edge at the final linking step. We're nowhere near final linking yet (I wish!), but it makes me think that dropping debug symbols might have some beneficial effect here.

Checking the rustc options again I see that this is controlled by the debuginfo option in the command we're running.
$ rustc -C help | grep " debuginfo"
    -C                debuginfo=val -- debug info emission level (0 = no debug
                      info, 1 = line tables only, 2 = full debug info with
                      variable and type information; default: 0)
If I drop this down from 2 to 0, I discover the command builds without error.
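With a command this long, editing one option by hand is error-prone, so here's a small sketch of doing it programmatically. The helper set_codegen_option is hypothetical. It assumes the command has already been split into an argument list and it handles both spellings that appear in the command above, the separated form (-C opt-level=2) and the joined form (-Cembed-bitcode=yes).

```python
def set_codegen_option(args, key, value):
    """Return a copy of args with rustc codegen option `key` set to `value`."""
    result = []
    replaced = False
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-C" and i + 1 < len(args) and args[i + 1].split("=", 1)[0] == key:
            # Separated form: "-C" followed by "key=value".
            result += ["-C", f"{key}={value}"]
            replaced = True
            i += 2
        elif arg.startswith("-C") and arg[2:].split("=", 1)[0] == key:
            # Joined form: "-Ckey=value".
            result.append(f"-C{key}={value}")
            replaced = True
            i += 1
        else:
            result.append(arg)
            i += 1
    if not replaced:
        # The option wasn't present, so add it at the end.
        result += ["-C", f"{key}={value}"]
    return result


command = ["rustc", "-C", "opt-level=2", "-C", "debuginfo=2", "-Cembed-bitcode=yes"]
print(set_codegen_option(command, "debuginfo", "0"))
```

The same helper covers the earlier experiment of dropping opt-level to zero, which makes it easy to try the combinations one at a time.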

That's great, but now I have to get that into the build process to see if it will have the same effect there.

In build/moz.configure/rust.configure I find this code:
    debug_info = None
	[...]
    if debug_symbols:
        debug_info = "2"
I thought this might be being controlled by the MOZ_DEBUG_SYMBOLS environment variable (set in mozconfig.merqtxulrunner). But a quick test showed this not to be the case. So instead I just adjusted this piece of code to fix the debuginfo level for Rust.

Ideally I'd like to filter based on the component to be built, to minimise the extent to which the debug information is removed. But I'll start with this and see if it at least helps the build run further.
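As a rough idea of what that per-component filter might look like, here's a sketch. It's entirely hypothetical: the function rust_debuginfo_level isn't part of the Mozilla build system, and whether the configure step even knows which crate is being built at this point is something I haven't checked. It just mirrors the shape of the snippet above, with an exception list bolted on.

```python
def rust_debuginfo_level(crate, debug_symbols, no_debuginfo_crates=frozenset({"style"})):
    """Pick a rustc debuginfo level for one crate (hypothetical helper)."""
    if not debug_symbols:
        # Mirrors debug_info = None when debug symbols are switched off.
        return None
    if crate in no_debuginfo_crates:
        # Crates known to break with full debug info get none at all.
        return "0"
    return "2"


for crate in ("style", "webrender"):
    print(crate, rust_debuginfo_level(crate, debug_symbols=True))
```

The point of the design is that only the crates on the exception list lose their debug information, while everything else keeps the level it has today.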

Changing the build like this means scrapping my current incremental build, so after making the edit and kicking off a full build, it's going to be at least an hour before the results come in.

So it looks like that's it until tomorrow.

Today is the last day of my two week holiday, so I'll be back to work tomorrow and things may slow down a bit as a result. Let's see what happens as it feels like I have some momentum right now.

Don't forget, my Gecko Dev Diary has all the other posts.
28 Aug 2023 : Day 12
Yesterday we hit an error with the Rust build process which I tried to fix by reworking an existing patch. If you've been following along you'll know I left the build running overnight.

I'm up bright and early to check the build (after also one quick check in the middle of the night when I couldn't sleep; but it was still crunching away). The results this morning are a little hard to determine.

Somewhere in the middle of the build I spot this.
47:04.13   cargo:warning=during RTL pass: expand
47:04.13   cargo:warning=src/glsl.h: In function ‘glsl::vec2_scalar
           glsl::sign(glsl::vec2_scalar)’:
47:04.13   cargo:warning=src/glsl.h:662:39: internal compiler error:
           Segmentation fault
47:04.13   cargo:warning= float sign(float a) { return copysignf(1.0f, a); }
47:04.13   cargo:warning=                              ~~~~~~~~~^~~~~~~~~
47:04.13   cargo:warning=Please submit a full bug report,
47:04.13   cargo:warning=with preprocessed source if appropriate.
47:04.14   cargo:warning=See  for instructions.
47:04.14   exit status: 1
47:04.14   --- stderr
47:04.14   error occurred: Command "/usr/bin/g++" "-O2" "-ffunction-sections"
           "-fdata-sections" "-fPIC" "-std=gnu++17"
           "-Iobj-build-mer-qt-xr/dist/stl_wrappers"
           "-Iobj-build-mer-qt-xr/dist/system_wrappers" "-include"
           "gecko-dev/config/gcc_hidden.h" "-U_FORTIFY_SOURCE"
           "-D_FORTIFY_SOURCE=2" "-fstack-protector-strong" "-DNDEBUG=1"
           "-DTRIMMED=1" "-Igecko-dev/toolkit/library/rust"
           "-Iobj-build-mer-qt-xr/toolkit/library/rust"
           "-Iobj-build-mer-qt-xr/dist/include" "-I/usr/include/nspr4"
           "-I/usr/include/nss3" "-I/usr/include/nspr4"
           "-Iobj-build-mer-qt-xr/dist/include/nss" "-I/usr/include/pixman-1"
           "-DMOZILLA_CLIENT" "-include" "obj-build-mer-qt-xr/mozilla-config.h"
           "-Wall" "-Wempty-body" "-Wignored-qualifiers" "-Wpointer-arith"
           "-Wsign-compare" "-Wtype-limits" "-Wunreachable-code"
           "-Wno-invalid-offsetof" "-Wduplicated-cond" "-Wimplicit-fallthrough"
           "-Wno-error=maybe-uninitialized"
           "-Wno-error=deprecated-declarations" "-Wno-error=array-bounds"
           "-Wno-error=coverage-mismatch" "-Wno-error=free-nonheap-object"
           "-Wno-multistatement-macros" "-Wno-error=class-memaccess"
           "-Wno-error=unused-but-set-variable" "-Wformat"
           "-Wformat-overflow=2" "-Wno-psabi" "-fno-sized-deallocation"
           "-fno-aligned-new" "-O3" "-I/usr/include/freetype2"
           "-DUSE_ANDROID_OMTC_HACKS=1" "-DUSE_OZONE=1"
           "-DMOZ_UA_OS_AGNOSTIC=1" "-Wno-psabi" "-Wno-attributes" "-Wno-psabi"
           "-Wno-attributes" "-fno-exceptions" "-fno-strict-aliasing" "-fPIC"
           "-ffunction-sections" "-fdata-sections" "-fno-exceptions"
           "-fno-math-errno" "-pthread" "-pipe" "-gdwarf-4" "-freorder-blocks"
           "-O2" "-fno-omit-frame-pointer" "-funwind-tables"
           "-DMOZILLA_CONFIG_H" "-I" "gecko-dev/gfx/wr/webrender/res" "-I"
           "src" "-I"
           "obj-build-mer-qt-xr/aarch64-unknown-linux-gnu/release/build
           /swgl-c7fddee6f1578b80/out"
           "-std=c++17" "-fno-exceptions" "-fno-rtti" "-fno-math-errno"
           "-UMOZILLA_CONFIG_H" "-D_GLIBCXX_USE_CXX11_ABI=0" "-o"
           "obj-build-mer-qt-xr/aarch64-unknown-linux-gnu/release/build
           /swgl-c7fddee6f1578b80/out/src/gl.o"
           "-c" "src/gl.cc" with args "g++" did not execute successfully (status
           code exit status: 1).
47:04.14 warning: build failed, waiting for other jobs to finish...
Unravelling this error is a challenge, but in the middle of it there's a "Segmentation fault". That's not an error that the compiler ought to be generating. This sort of thing can happen if, for example, the compiler runs out of memory.

Further along I see this:
50:42.73 gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72:
         error: could not convert ‘false’ from ‘bool’ to ‘mozilla::WeightRange’
50:42.74                                               weight,     stretch, style};
50:42.74                                                                         ^
This suggests that not all of the errors from earlier have been fixed. Then finally there is also this:
47:02.31 warning: during RTL pass: expand
47:02.32 warning: src/glsl.h: In function ‘glsl::vec2_scalar
         glsl::sign(glsl::vec2_scalar)’:
47:02.32 warning: src/glsl.h:662:39: internal compiler error: Segmentation fault
47:02.33 warning:  float sign(float a) { return copysignf(1.0f, a); }
47:02.33 warning:                               ~~~~~~~~~^~~~~~~~~
47:02.33 warning: Please submit a full bug report,
47:02.33 warning: with preprocessed source if appropriate.
47:02.33 warning: See  for instructions.
47:02.33 error: failed to run custom build command for `swgl v0.1.0
         (gecko-dev/gfx/wr/swgl)`
This is also a segmentation fault. Not encouraging.

Part of the challenge here is disentangling the separate jobs. With up to sixteen running simultaneously, the original source error can be tricky to determine. Probably these are three separate errors, but on a fresh run we might hit one and not another.
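One thing that helps when the jobs are interleaved is that every log line carries a timestamp, so the earliest error can be picked out mechanically. Here's a minimal sketch, assuming the minutes:seconds prefix seen in the output above and treating any line containing "error:" as an error. The name first_error is made up, and the timestamp records when the line was printed rather than when the job started, so it's a hint rather than proof of which failure came first.

```python
import re

# Matches the "47:02.32 message" prefix used in the build output.
LOG_LINE = re.compile(r"^\s*(\d+):(\d+\.\d+)\s(.*)$")


def first_error(lines):
    """Return the earliest-stamped line containing 'error:', or None."""
    earliest = None
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "error:" not in match.group(3):
            continue
        seconds = int(match.group(1)) * 60 + float(match.group(2))
        if earliest is None or seconds < earliest[0]:
            earliest = (seconds, line)
    return earliest[1] if earliest else None


log = [
    "50:42.73 gecko-dev/gfx/thebes/gfxFcPlatformFontList.cpp:1725:72: error: could not convert",
    "47:04.13   cargo:warning=src/glsl.h:662:39: internal compiler error: Segmentation fault",
    "47:02.32 warning: src/glsl.h:662:39: internal compiler error: Segmentation fault",
    "47:04.14 warning: build failed, waiting for other jobs to finish...",
]
print(first_error(log))
```

On the abridged lines above this picks out the swgl segmentation fault, which is also what the single-job build ends up stumbling over.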

To get around this I've dropped the build down to using just a single job by editing the spec file. Maybe this will make things clearer.
#./mach build -j$RPM_BUILD_NCPUS
./mach build -j1
The build very quickly stumbles on the swgl segmentation fault error. Running the build a third time I get the same result. So it looks like this is where we're at, and the next thing I need to fix.

The error occurs building swgl, but I am a little concerned the underlying issue is still webrender related, since this is also mentioned in the debug output.

The error itself is happening while building gl.o on line 662 of src/glsl.h. This line looks like this:
float sign(float a) { return copysignf(1.0f, a); }
The line is unremarkable and the error "Segmentation fault" is nothing to do with the code we see there. It could be the compiler is running out of resources; it could be a compiler versioning issue; even something more esoteric like a compiler bug.

The debug output does at least include the full command being executed when the error occurred, so I can rerun and tweak this manually to see the results. This is always a helpful and time-saving technique.
$ sfdk engine exec
$ sb2 -R -m sdk-install -t SailfishOS-devel-aarch64.default
$ /usr/bin/g++ -O2 -ffunction-sections -fdata-sections -fPIC -std=gnu++17 \
  -I${PROJECT}/obj-build-mer-qt-xr/dist/stl_wrappers \
  -I${PROJECT}/obj-build-mer-qt-xr/dist/system_wrappers \
  -include ${PROJECT}/gecko-dev/config/gcc_hidden.h -U_FORTIFY_SOURCE \
  -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 \
  -I${PROJECT}/gecko-dev/toolkit/library/rust \
  -I${PROJECT}/obj-build-mer-qt-xr/toolkit/library/rust \
  -I${PROJECT}/obj-build-mer-qt-xr/dist/include -I/usr/include/nspr4 \
  -I/usr/include/nss3 -I/usr/include/nspr4 \
  -I${PROJECT}/obj-build-mer-qt-xr/dist/include/nss -I/usr/include/pixman-1 \
  -DMOZILLA_CLIENT -include ${PROJECT}/obj-build-mer-qt-xr/mozilla-config.h \
  -Wall -Wempty-body -Wignored-qualifiers -Wpointer-arith -Wsign-compare \
  -Wtype-limits -Wunreachable-code -Wno-invalid-offsetof -Wduplicated-cond \
  -Wimplicit-fallthrough -Wno-error=maybe-uninitialized \
  -Wno-error=deprecated-declarations -Wno-error=array-bounds \
  -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object \
  -Wno-multistatement-macros -Wno-error=class-memaccess \
  -Wno-error=unused-but-set-variable -Wformat -Wformat-overflow=2 -Wno-psabi \
  -fno-sized-deallocation -fno-aligned-new -O3 -I/usr/include/freetype2 \
  -DUSE_ANDROID_OMTC_HACKS=1 -DUSE_OZONE=1 -DMOZ_UA_OS_AGNOSTIC=1 -Wno-psabi \
  -Wno-attributes -Wno-psabi -Wno-attributes -Wno-psabi -Wno-attributes \
  -Wno-psabi -Wno-attributes -Wno-psabi -Wno-attributes -Wno-psabi \
  -Wno-attributes -fno-exceptions -fno-strict-aliasing -fPIC \
  -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread \
  -pipe -gdwarf-4 -freorder-blocks -O2 -fno-omit-frame-pointer -funwind-tables \
  -DMOZILLA_CONFIG_H -I ${PROJECT}/gecko-dev/gfx/wr/webrender/res -I src \
  -I ${PROJECT}/obj-build-mer-qt-xr/aarch64-unknown-linux-gnu/release/build/swgl-c7fddee6f1578b80/out \
  -std=c++17 -fno-exceptions -fno-rtti -fno-math-errno -UMOZILLA_CONFIG_H \
  -D_GLIBCXX_USE_CXX11_ABI=0 \
  -o ${PROJECT}/obj-build-mer-qt-xr/aarch64-unknown-linux-gnu/release/build/swgl-c7fddee6f1578b80/out/src/gl.o \
  -c src/gl.cc
Soon after I've issued this command the fans on my laptop whir up. It's taking a while and stressing the processor. Something is happening.

Using this abbreviated build process the error is triggered at exactly the same spot:
during RTL pass: expand
src/glsl.h: In function ‘glsl::vec2_scalar glsl::sign(glsl::vec2_scalar)’:
src/glsl.h:662:39: internal compiler error: Segmentation fault
 float sign(float a) { return copysignf(1.0f, a); }
                              ~~~~~~~~~^~~~~~~~~
Here "RTL" means "Register Transfer Language". I'm no compiler expert and the internals of gcc are mysterious to me, but a quick skim across the Internet suggests that "expand" is an RTL optimisation pass. As we can see in the command above the compiler is currently using O2 level optimisation. If this really is an optimisation problem, or even just a compiler bug, then changing the optimisation level might help.

And indeed, running the abbreviated build again using O1 (noting that the flag has to be replaced twice in the command) yields good results. Now the command goes through without the segmentation fault error. Ordinarily I'd look in the mozconfig.merqtxulrunner file to adjust the optimisation level, but there it's stated as O3, so the O2 must be coming from somewhere else.
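The flag substitution described above can be sketched in Rust, the language of the build script examined next. This is only an illustration: `lower_optimisation` is a hypothetical helper rather than anything in the Gecko build, and it assumes the command has already been split into separate arguments.

```rust
/// Replaces every occurrence of the optimisation flag `from` with `to` and
/// reports how many were swapped. Hypothetical helper, for illustration only.
fn lower_optimisation(args: &[&str], from: &str, to: &str) -> (Vec<String>, usize) {
    let mut replaced = 0;
    let rewritten = args
        .iter()
        .map(|arg| {
            if *arg == from {
                // Count each swap so a missed duplicate is easy to spot.
                replaced += 1;
                to.to_string()
            } else {
                arg.to_string()
            }
        })
        .collect();
    (rewritten, replaced)
}
```

Applied to the arguments of the command above with `from` set to `-O2` and `to` set to `-O1`, the count would be 2: the flag appears once near the start and once after `-freorder-blocks`, while the `-O3` in between is left alone.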

Inside the gfx/wr/swgl/build.rs I find the following:
        // SWGL relies heavily on inlining for performance so override -Oz with -O2
        if tool.args().contains(&"-Oz".into()) {
            build.flag("-O2");
        }
So switching this for O1 might be a good thing to try.

I do that and build again. Now a different error is coming up, so maybe that did the trick? We won't know for sure until further down the line. Now we have this error:
 6:29.21 fatal runtime error: Rust cannot catch foreign exceptions
 6:29.35 warning: `style` (lib) generated 5 warnings
 6:29.35 error: could not compile `style`; 5 warnings emitted
 6:29.35 Caused by:
 6:29.35   process didn't exit successfully: `/usr/bin/rustc --crate-name [...]`
           (signal: 6, SIGABRT: process abort signal)
Again, we have a full command to test, so I'll run it manually. But as I've reached the end of my available coding time today, the results of that will have to wait until tomorrow.

As always, you can find all the other posts on my Gecko Dev Diary.
27 Aug 2023 : Day 11 #
For various reasons I had to clean out the repository and kick off a completely clean build. It takes a while, but is otherwise an unremarkable thing.

However, now it no longer fails on the errors I hit at the end of yesterday's session. There could be one of two reasons for this. It could be that the previous issue has been fixed by the cleanout. This is always a possibility if, due to the incremental build, a change doesn't have all the cascading effects it ought to have. Alternatively it could be because of the parallel build process. We're building with sixteen parallel jobs, meaning up to sixteen components can be building simultaneously. A small change can therefore cause a re-ordering of the failures.

Instead I now get the following error (abridged for clarity).
10:27.28 error: linking with `i686-unknown-linux-gnu-gcc` failed: exit status: 1
10:27.28   |
10:27.28   = note: "i686-unknown-linux-gnu-gcc" "-m32"
           "obj-build-mer-qt-xr/release/build/webrender-e0dea14bb1724795
           /build_script_build-e0dea14bb1724795.12luv6p7bzvqqwun.rcgu.o"
[...]
           "SailfishOS-devel-aarch64.default/usr/lib/rustlib
           /i686-unknown-linux-gnu/lib/libcore-a463ad6716e26c15.rlib"
           "-Wl,--end-group" "SailfishOS-devel-aarch64.default/usr/lib/rustlib
           /i686-unknown-linux-gnu/lib/libcompiler_builtins-1e7df035dceb93b6.rlib"
           "-Wl,-Bdynamic" "-lstdc++" "-lgcc_s" "-lutil" "-lrt" "-lpthread"
           "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L"
           "SailfishOS-devel-aarch64.default/usr/lib/rustlib/i686-unknown-linux-gnu/lib"
           "-o" "obj-build-mer-qt-xr/release/build/webrender-e0dea14bb1724795
           /build_script_build-e0dea14bb1724795"
           "-Wl,--gc-sections" "-pie" "-Wl,-zrelro,-znow" "-nodefaultlibs"
10:27.29   = note: /usr/bin/ld: obj-build-mer-qt-xr/release/deps
           /libglslopt-e6e81689b2825cc3.rlib(glsl_optimizer.o): 
           Relocations in generic ELF (EM: 183)
[...]
10:27.32           /usr/bin/ld: obj-build-mer-qt-xr/release/deps
                   /libglslopt-e6e81689b2825cc3.rlib: error adding symbols:
                   file in wrong format
10:27.32           collect2: error: ld returned 1 exit status
This looks like a Rust issue: that is, the rust compiler building an object file for x86 when it should be building it for aarch64. We had to deal with these during the ESR 78 build as well and they were a real pain to fix (maybe because of my poor understanding of the Rust toolchain). But maybe this will turn out to be the same as something we saw back then. If that's the case, knowing the details could make this a whole lot easier to fix. So I need to check the patches.

A quick grep -rIn "glslopt" rpm/* throws up something that looks promising. Patch 0037 has the title "Patch glslopt to build on arm". That definitely looks like the same problem, so the next step is to try out the same patch.

A quick visual inspection suggests the patch will fail, but let's try it anyway.
$ patch -d gecko-dev -p1 < rpm/0037-sailfishos-configure-Patch-glslopt-to-build-on-arm.patch
patching file third_party/rust/glslopt/.cargo-checksum.json
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file third_party/rust/glslopt/.cargo-checksum.json.rej
patching file third_party/rust/glslopt/build.rs
Hunk #1 FAILED at 1.
Hunk #2 succeeded at 26 with fuzz 2 (offset 7 lines).
Hunk #3 FAILED at 140.
2 out of 3 hunks FAILED -- saving rejects to file third_party/rust/glslopt/build.rs.rej
I'm not too worried about the checksum patch; that was always going to fail. The code in build.rs has changed a bit, so I'm not surprised the patch didn't apply completely, but the single hunk that did apply saved me a bit of time manually copying it over. Applying the remaining two failed hunks manually was pretty straightforward.

Kicking the build off I find I'm following it far more intensely than usual. This is a critical patch; it would be really great if it succeeded, and given the unpredictable ordering of the process I'd really like to know when (if) it's gone through. That is, when the webrender component, the one that needs this library, goes through.

Sadly, it doesn't. It seems the patch will need some more work after all.
21:03.89    Compiling glslopt v0.1.9
21:03.95 error[E0433]: failed to resolve: use of undeclared crate or module `env`
21:03.95   --> ${PROJECT}/gecko-dev/third_party/rust/glslopt/build.rs:36:11
21:03.95    |
21:03.95 36 |     match env::var("TARGET") {
21:03.95    |           ^^^ use of undeclared crate or module `env`
21:03.96 error[E0433]: failed to resolve: use of undeclared crate or module `env`
21:03.96   --> ${PROJECT}/gecko-dev/third_party/rust/glslopt/build.rs:45:11
21:03.96    |
21:03.96 45 |     match env::var("SB2_TARGET") {
21:03.96    |           ^^^ use of undeclared crate or module `env`
21:03.96 error[E0433]: failed to resolve: use of undeclared crate or module `env`
21:03.96   --> ${PROJECT}/gecko-dev/third_party/rust/glslopt/build.rs:53:19
21:03.96    |
21:03.96 53 |     let out_dir = env::var("OUT_DIR").unwrap();
21:03.96    |                   ^^^ use of undeclared crate or module `env`
21:03.96 error[E0433]: failed to resolve: use of undeclared crate or module `env`
21:03.96    --> ${PROJECT}/gecko-dev/third_party/rust/glslopt/build.rs:101:19
21:03.96     |
21:03.96 101 |     let out_dir = env::var("OUT_DIR").unwrap();
21:03.96     |                   ^^^ use of undeclared crate or module `env`
21:03.96 error[E0433]: failed to resolve: use of undeclared crate or module `env`
21:03.96    --> ${PROJECT}/gecko-dev/third_party/rust/glslopt/build.rs:129:19
21:03.97     |
21:03.97 129 |     let out_dir = env::var("OUT_DIR").unwrap();
21:03.97     |                   ^^^ use of undeclared crate or module `env`
21:03.99    Compiling whatsys v0.1.2
21:03.99 For more information about this error, try `rustc --explain E0433`.
21:04.00    Compiling audioipc v0.2.5 (https://github.com/mozilla/audioipc-2
            ?rev=7537bfadad2e981577eb75e4f13662fc517e1a09#7537bfad)
21:14.57    Compiling mozglue-static v0.1.0 (${PROJECT}/gecko-dev/mozglue/static/rust)
21:14.58    Compiling rand v0.7.3
21:14.62 dom/media/mediasession
21:15.26    Compiling rust_cascade v0.6.0
21:15.34    Compiling mp4parse v0.11.5 (https://github.com/mozilla/mp4parse-rust
            ?rev=1bb484e96ae724309e3346968e8ffd4c25e61616#1bb484e9)
21:15.45    Compiling tokio-udp v0.1.1
21:15.48    Compiling tokio-tcp v0.1.1
21:15.52    Compiling tokio-uds v0.2.5
21:15.53 error: could not compile `glslopt` due to 5 previous errors
From the error the issue is immediately clear though. The std::env module was previously imported in the original code, so there was no need for the patch to add it. But that's changed: the original code no longer uses it. Adding the import back in should help.
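For reference, the shape of the fix is a single import at the top of build.rs. The sketch below is not the patched file; `var_or` is a hypothetical helper used only to show `env::var` resolving once `std::env` is in scope.

```rust
// Without this line every `env::var(...)` call fails with E0433,
// as in the log above.
use std::env;

/// Looks up an environment variable, returning `fallback` when it is unset
/// or not valid Unicode. Hypothetical helper, for illustration only.
fn var_or(name: &str, fallback: &str) -> String {
    env::var(name).unwrap_or_else(|_| fallback.to_string())
}
```

Inside a build script, `var_or("TARGET", "unknown")` would return the target triple, since Cargo sets `TARGET` for build scripts.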

It's suddenly started to rain heavily here in Cambridgeshire. Storm Betty was in the news earlier today, so it's not a surprise and personally I've been looking forward to it. My window is open, I can hear the rain outside and the air is cooling after a hot day. This has nothing to do with the code of course, except that the local environment has a big impact on my coding frame of mind.

As the rain continues the build now throws up a new error.
 2:06.58 error: the listed checksum of `/home/flypig/Documents/Development/jolla
         /gecko-dev-esr91/gecko-dev/gecko-dev/third_party/rust/glslopt/build.rs` has changed:
 2:06.58 expected: 8bcf41e15f780dcda7da689faf08d65ef4973827a0ca17faecce98dbc404b270
 2:06.58 actual:   1efe2b58d7281c5c8cb5d9a912871015625523281e9c38d0d0a8d89f33c918fb
New errors (or, more specifically, not the old errors) are always encouraging. The .cargo-checksum.json file that's causing the error has at least one enormously long line in it, which causes massive slowdown in gedit, the editor I'm using to make these changes. So I fire it up in vim, make the checksum edit from there, and set the build off again.
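The edit itself amounts to swapping the "expected" hash for the "actual" one reported in the error. A minimal sketch of that substitution follows, assuming the file's text has already been read into a string; `swap_checksum` is a hypothetical helper, and it does not compute any hash itself.

```rust
/// Swaps one recorded hash for another in the text of a `.cargo-checksum.json`.
/// Returns `None` unless the old hash occurs exactly once, so an unexpected
/// file is never silently rewritten. Hypothetical helper, for illustration only.
fn swap_checksum(json: &str, expected: &str, actual: &str) -> Option<String> {
    if json.matches(expected).count() == 1 {
        Some(json.replacen(expected, actual, 1))
    } else {
        None
    }
}
```

Refusing to act when the hash is missing or duplicated is a deliberate choice here: with a single enormously long line, a stray replacement would be hard to notice by eye.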

Sadly this isn't enough though. The custom glslopt build code we just changed has further problems.
11:02.28 error: failed to run custom build command for `glslopt v0.1.9`
11:02.31 Caused by:
11:02.31   process didn't exit successfully: `obj-build-mer-qt-xr/release/build
           /glslopt-1c70b99116f08314/build-script-build` (exit status: 101)
11:02.31   --- stdout
11:02.31   ####################### Compiling: glsl-optimizer/src/compiler/glsl
           /glcpp/glcpp-lex.c
11:02.31   Build command: "host-cc" "-isystem" "obj-build-mer-qt-xr/release
           /build/glslopt-3bb4f22cadf6e656/out/../../../../include"
           "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g"
           "-fno-omit-frame-pointer" "-m32" "-march=i686" "-I"
           "glsl-optimizer/include" "-I" "glsl-optimizer/src/mesa" "-I"
           "glsl-optimizer/src/mapi" "-I" "glsl-optimizer/src/compiler" "-I"
           "glsl-optimizer/src/compiler/glsl" "-I"
           "glsl-optimizer/src/gallium/auxiliary" "-I"
           "glsl-optimizer/src/gallium/include" "-I" "glsl-optimizer/src" "-I"
           "glsl-optimizer/src/util" "-D__STDC_FORMAT_MACROS" "-DHAVE_ENDIAN_H"
           "-DHAVE_PTHREAD" "-DHAVE_TIMESPEC_GET" "-o"
           "obj-build-mer-qt-xr/release/build/glslopt-3bb4f22cadf6e656/out
           /glsl-optimizer/src/compiler/glsl/glcpp/glcpp-lex.o"
           "-c" "glsl-optimizer/src/compiler/glsl/glcpp/glcpp-lex.c"
11:02.32   Compile status: exit status: 0
11:02.32   ######## stdout:
11:02.32   ######## stderr:
11:02.32   #######################
[...]
11:02.49   ####################### Compiling: glsl-optimizer/src/mesa/main/imports.c
11:02.49   Build command: "host-cc" "-isystem"
           "obj-build-mer-qt-xr/release/build/glslopt-3bb4f22cadf6e656/out/..
           /../../../include"
           "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g"
           "-fno-omit-frame-pointer" "-m32" "-march=i686" "-I"
           "glsl-optimizer/include" "-I" "glsl-optimizer/src/mesa" "-I"
           "glsl-optimizer/src/mapi" "-I" "glsl-optimizer/src/compiler" "-I"
           "glsl-optimizer/src/compiler/glsl" "-I"
           "glsl-optimizer/src/gallium/auxiliary" "-I"
           "glsl-optimizer/src/gallium/include" "-I" "glsl-optimizer/src" "-I"
           "glsl-optimizer/src/util" "-D__STDC_FORMAT_MACROS" "-DHAVE_ENDIAN_H"
           "-DHAVE_PTHREAD" "-DHAVE_TIMESPEC_GET" "-o"
           "obj-build-mer-qt-xr/release/build/glslopt-3bb4f22cadf6e656/out
           /glsl-optimizer/src/mesa/main/imports.o"
           "-c" "glsl-optimizer/src/mesa/main/imports.c"
11:02.49   Compile status: exit status: 1
11:02.49   ######## stdout:
11:02.49   ######## stderr:
11:02.50   #######################
11:02.50   --- stderr
11:02.50   In file included from glsl-optimizer/src/util/u_queue.h:43,
11:02.50                    from glsl-optimizer/src/mesa/main/glthread.h:48,
11:02.50                    from glsl-optimizer/src/mesa/main/mtypes.h:42,
11:02.50                    from src/compiler/glsl/glcpp/glcpp-parse.y:32:
11:02.50   glsl-optimizer/src/util/u_thread.h: In function ‘u_thread_setname’:
11:02.51       pthread_setname_np(pthread_self(), name);
11:02.51       ^~~~~~~~~~~~~~~~~~
11:02.51       u_thread_setname
11:02.51   In file included from glsl-optimizer/src/util/u_queue.h:43,
11:02.52                    from glsl-optimizer/src/mesa/main/glthread.h:48,
11:02.52                    from glsl-optimizer/src/mesa/main/mtypes.h:42,
11:02.52                    from glsl-optimizer/src/compiler/glsl/glcpp
                            /pp_standalone_scaffolding.h:34,
11:02.52                    from glsl-optimizer/src/compiler/glsl/glcpp
                            /pp_standalone_scaffolding.c:30:
11:02.52   glsl-optimizer/src/util/u_thread.h: In function ‘u_thread_setname’:
11:02.53       pthread_setname_np(pthread_self(), name);
11:02.53       ^~~~~~~~~~~~~~~~~~
11:02.53       u_thread_setname
11:02.53   In file included from glsl-optimizer/src/util/u_queue.h:43,
11:02.53                    from glsl-optimizer/src/mesa/main/glthread.h:48,
11:02.53                    from glsl-optimizer/src/mesa/main/mtypes.h:42,
11:02.53                    from glsl-optimizer/src/compiler/glsl/glcpp/pp.c:28:
11:02.54   glsl-optimizer/src/util/u_thread.h: In function ‘u_thread_setname’:
11:02.54       pthread_setname_np(pthread_self(), name);
11:02.54       ^~~~~~~~~~~~~~~~~~
11:02.54       u_thread_setname
11:02.54   In file included from glsl-optimizer/src/util/u_queue.h:43,
11:02.54                    from glsl-optimizer/src/mesa/main/glthread.h:48,
11:02.54                    from glsl-optimizer/src/mesa/main/mtypes.h:42,
11:02.55                    from glsl-optimizer/src/mesa/main/extensions.h:39,
11:02.55                    from glsl-optimizer/src/mesa/main/extensions_table.c:26:
11:02.55   glsl-optimizer/src/util/u_thread.h: In function ‘u_thread_setname’:
11:02.55       pthread_setname_np(pthread_self(), name);
11:02.55       ^~~~~~~~~~~~~~~~~~
11:02.55       u_thread_setname
11:02.55   cc: error: glsl-optimizer/src/mesa/main/imports.c:
           No such file or directory
11:02.56   cc: fatal error: no input files
11:02.56   compilation terminated.
This looks like there's at least a problem with the glsl-optimizer/src/mesa/main/imports.c file I'm trying to import in the code. Going through the removed code and the new code I can see that yes, indeed, some of the includes have changed. I just applied the patch and hoped, but now I'll have to go through it line-by-line and compare.

It turns out it's not so much code, so not really such a big undertaking after all. But it does require some care and attention.

There are quite a few changes to both the source and include files, so I've been through them all and tried to make them align manually once again. Once that's done and after updating the checksum once again, I start the build. Time to listen to the rain once more in the background.

While it's chugging away it might be worth taking a moment to consider why this patch is needed. The problem here is to do with the target the Rust code is being built for. The code that gets patched is a build.rs. That's a file written in Rust that's used to build something else. As such the binary that's output gets linked in to the build chain, rather than being linked in to the final executable.

When this code is executed, it itself executes various parts of the build tooling in order to build something that does get linked into the final executable.

The problem is, which tooling should it be built with and which should it invoke? We want it to be built with the tooling for the x86 target so that the binary output can be run without emulation on the host. But we want that code to invoke the aarch64 target tooling. To get this latter part right, we have to amend the code to use the correct tools.

The changes needed aren't huge, but they are scratchbox2-specific. For example, rather than calling gcc for the build, we want to call host-cc instead. That ensures we use the correct toolchain. We also need to point to the relevant system header files and so on. And we need to do different things depending on whether we're building for an x86, armv7hl or aarch64 target, so we need some code to check this too.

If you have a look at the patch you'll see this is exactly what it's intended to achieve.
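To make the idea concrete, here is a rough sketch of the kind of decision such a build script has to make. It is not the code from the patch: both function names and the triple prefixes are assumptions for illustration, and how the script detects scratchbox2 (perhaps via the `SB2_TARGET` variable seen in the earlier error log) is left to the caller.

```rust
/// Maps a Rust target triple to a short architecture name.
/// The prefixes checked here are assumptions, for illustration only.
fn classify_target(triple: &str) -> &'static str {
    if triple.starts_with("aarch64") {
        "aarch64"
    } else if triple.starts_with("armv7") {
        "armv7hl"
    } else if triple.starts_with("i686") || triple.starts_with("i486") {
        "x86"
    } else {
        "unknown"
    }
}

/// Chooses the C compiler the build script should invoke: `host-cc` when
/// building inside scratchbox2, plain `gcc` otherwise.
fn compiler_for(inside_scratchbox2: bool) -> &'static str {
    if inside_scratchbox2 {
        "host-cc"
    } else {
        "gcc"
    }
}
```

A build script would typically feed `classify_target` the value of the `TARGET` variable that Cargo provides, then use the result to pick architecture-specific flags and header paths.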

I'm now back to watching the build as it gradually progresses. Often slowly, but then with a sudden surge of text rushing past the screen. I'm really hoping webrender goes through this time. But it's also late, so I've decided to let the build complete overnight. Knowing whether or not this fixed this step of the build will have to wait until morning.

So that's it for today. See my Gecko Dev Diary page for all the other posts on this topic.
26 Aug 2023 : Day 10 #
Today I'm travelling back from the Isle of Arran to Cambridge. It's a ten hour journey by train, which is exciting because it gives me a good long stretch to do some coding. The middle leg from Glasgow to London is the highlight as it gives me a full five hours of uninterrupted opportunity to code. That means it's a bit of a longer write-up today.

When we reached Glasgow there was a point when it looked like we might not be making the journey after all, given our train had been cancelled. But due to the quick action and quick thinking of my wife, we managed to get booked on the next train with seats and all. So my coding can continue!

During the trip on the way here a week ago I was able to apply patch 0002 to return the Qt layer. That was a lot of manual drudgery. This time I've not got a single long monotonous task; instead I'll be continuing to work through as many of the build errors that come up as I can and figuring out the tweaks needed to get past them.

So let's get to it. Recall that yesterday we finished the day with an error about a missing header file.
18:06.85 In file included from ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:44:
18:06.85 ${PROJECT}/gfx/thebes/gfxQtPlatform.h:11:10: fatal error:
         nsDataHashtable.h: No such file or directory
18:06.85  #include "nsDataHashtable.h"
18:06.85           ^~~~~~~~~~~~~~~~~~~
18:06.85 compilation terminated.
Thanks go to mal for looking into this in the meantime and noting that the header file isn't needed any more. He messaged me on IRC:
 
<mal> flypig_: afaik that nsDataHashtable.his not needed anymore
<mal> and based on other headers which included it nothing is needed as replacement
<mal> if you check "git log -- xpcom/ds/nsDataHashtable.h" you see that the type defined in that header was made an alias which made the header empty

So to tackle the error I just removed the include line. Even if there is something else needed, I figured the resulting error would indicate the structures and methods needed. It turns out that in this particular instance there is — just as mal suggested — no need to add anything. We'll discover later (spoiler alert!) this issue is going to return and then we'll need to do something slightly different.

Building the project now throws up the following:
 2:17.49 ${PROJECT}/gfx/cairo/cairo/src/cairo-qt-surface.cpp:790:14: error:
         ‘class QRegion’ has no member named ‘unite’; did you mean ‘united’?
 2:17.49       qr = qr.unite(r);
 2:17.49               ^~~~~
 2:17.50               united
This is confusing. This issue was introduced by the changes made to update Cairo to 1.17.4, Bugzilla bug 739096. We can see all of these changes in the associated pull request.

The use of QRegion::united() here was replaced with QRegion::unite(). From the code it looks like the intention is still to combine the regions, but both Qt 5 and Qt 6 seem to use the same interface with united() rather than the call it's being replaced with.

So the change appears to be breaking whichever version of Qt is intended to be used. I'll need to figure out what's really going on here, and for that I'll need to check why Cairo was updated.

I'm going to leave this investigation for a future time though. In the meantime changing the code back to using united() allows it to build. The build now throws up the following error, which seems to be unrelated.
 2:18.03 ${PROJECT}/image/decoders/icon/qt/nsIconChannel.cpp: In function
         ‘nsresult moz_qicon_to_channel(QImage*, nsIURI*, nsIChannel**)’:
 2:18.03 ${PROJECT}/image/decoders/icon/qt/nsIconChannel.cpp:101:35: error:
         ‘NS_LITERAL_CSTRING’ was not declared in this scope
 2:18.03                                    NS_LITERAL_CSTRING(IMAGE_ICON_MS));
 2:18.03                                    ^~~~~~~~~~~~~~~~~~
You may recall that we saw this error yesterday, but then it went away. That can happen: usually it's because of the multiple compilation jobs running in parallel. Fixing an error can change the timing and order of things so that we get some other error appearing on subsequent runs.

This one has come back, so we now have to tackle it. The file causing the error was pulled in as part of the changes from 0002 "Bring back Qt layer", so it's not so surprising if something has changed and broken the code. But first, I need to check that it wasn't some later patch that introduced this change. If it was, it may be that I've not pulled in all of the associated changes, which might also explain why this error is occurring.

A quick grep of the filename in the rpm folder shows it's not touched by any later patches.
$ grep -rIn "nsIconChannel.cpp" *
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:278: image/decoders/icon/qt/nsIconChannel.cpp      | 137 +++++
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:355: create mode 100644 image/decoders/icon/qt/nsIconChannel.cpp
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:2125:+    'nsIconChannel.cpp',
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:2157:diff --git a/image/decoders/icon/qt/nsIconChannel.cpp b/image/decoders/icon/qt/nsIconChannel.cpp
0002-sailfishos-qt-Bring-back-Qt-layer.-JB-50505.patch:2161:+++ b/image/decoders/icon/qt/nsIconChannel.cpp
That's enough to convince me that it does need a new fix. What's more, looking through the ESR 78 code and comparing it to the ESR 91 code, I can see that all except one instance of NS_LITERAL_CSTRING usage has been removed, and that one remaining case is wrapped in a preprocessor conditional so may not be used in practice. Could it be that this macro has been removed in its entirety? There must be some way to define string literals, so hopefully it's just a case of finding out how.

The answer comes from Fabrice Desré who got in contact via Mastodon to share his experience. Fabrice is working on Capyloon, a mobile OS which is based on Boot to Gecko (the fork of Firefox OS which eventually became the basis for KaiOS). As such Fabrice is familiar with upstream Gecko code refactorings and "other fun changes" (as he puts it!). This is what Fabrice said:
 
About the NS_LITERAL_CSTRING macro, they were replaced in https://bugzilla.mozilla.org/show_bug.cgi?id=1648010

Indeed they were. We can see this directly in the code by comparing the old ESR 78 code with our new ESR 91 code. Here's an example line from ESR 78:
browser/components/shell/nsGNOMEShellService.cpp:431:
    gsettings->GetCollectionForSchema(NS_LITERAL_CSTRING(kDesktopBGSchema),
And here's the same line in ESR 91:
    gsettings->GetCollectionForSchema(nsLiteralCString(kDesktopBGSchema),
As you can see, the NS_LITERAL_CSTRING macro has been replaced with nsLiteralCString. So I've made an identical change to the nsIconChannel.cpp file that triggered the error. Thanks Fabrice! Let's kick the build off again.

Next up are these errors:
 1:42.03 In file included from ${PROJECT}/gfx/thebes/gfxQtPlatform.cpp:12:
 1:42.03 ${PROJECT}/gfx/thebes/gfxQtPlatform.h:35:22: error:
         ‘virtual nsresult gfxQtPlatform::UpdateFontList()’ marked ‘override’, 
         but does not override
 1:42.04      virtual nsresult UpdateFontList() override;
 1:42.04                       ^~~~~~~~~~~~~~
 1:42.04 ${PROJECT}/gfx/thebes/gfxQtPlatform.h:37:34: error:
         conflicting return type specified for
         ‘virtual gfxPlatformFontList*gfxQtPlatform::CreatePlatformFontList()’
 1:42.04      virtual gfxPlatformFontList* CreatePlatformFontList() override;
 1:42.04                                   ^~~~~~~~~~~~~~~~~~~~~~
 1:42.04 In file included from ${PROJECT}/gfx/thebes/gfxQtPlatform.h:9,
 1:42.04                  from ${PROJECT}/gfx/thebes/gfxQtPlatform.cpp:12:
 1:42.05 ${PROJECT}/gfx/thebes/gfxPlatform.h:380:16: note: overridden function is
         ‘virtual bool gfxPlatform::CreatePlatformFontList()’
 1:42.05    virtual bool CreatePlatformFontList() = 0;
 1:42.05                 ^~~~~~~~~~~~~~~~~~~~~~
My immediate reaction to this is to just remove the override annotation from gfxQtPlatform::UpdateFontList(). But the gfx/thebes/gfxQtPlatform.h file this comes from is one of the new files we added ourselves, and something in the back of my mind makes me think this needs closer investigation.

Checking the code out, the class definition being inherited from has changed. Let's do a bit of software archaeology.

Here's the ESR 78 version:
  /**
   * Rebuilds the any cached system font lists
   */
  virtual nsresult UpdateFontList();
Here's the ESR 91 version:
  /**
   * Rebuilds the system font lists (if aFullRebuild is true), or just notifies
   * content that the list has changed but existing memory mappings are still
   * valid (aFullRebuild is false).
   */
  nsresult UpdateFontList(bool aFullRebuild = true);
Using git blame and git log I can see that the change happened in commit 809ac3660845f, the log entry for which references Bugzilla bug 1676966.

The changes were made to optimise page rendering by removing unnecessary font re-initialisation. So there's some functional change here that we'll need to try to retain or replicate. The good news is that our overrides are actually just calling the underlying implementation, so it should just be a case of redirecting the methods and parameters appropriately.

The relevant commits to look at seem to be 809ac3660845f and 18310539bfe25.

There are some more related errors that follow.
 2:43.13 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp: In member function ‘virtual nsresult gfxFcPlatformFontList::InitFontListForPlatform()’:
 2:43.13 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1445:3: error:
         ‘ClearSystemFontOptions’ was not declared in this scope
 2:43.14    ClearSystemFontOptions();
 2:43.14    ^~~~~~~~~~~~~~~~~~~~~~
 2:43.25 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1445:3: note:
         suggested alternative: ‘PrepareFontOptions’
 2:43.25    ClearSystemFontOptions();
 2:43.25    ^~~~~~~~~~~~~~~~~~~~~~
 2:43.25    PrepareFontOptions
 2:43.26 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1514:3: error: 
         ‘UpdateSystemFontOptions’ was not declared in this scope
 2:43.26    UpdateSystemFontOptions();
 2:43.26    ^~~~~~~~~~~~~~~~~~~~~~~
 2:43.38 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1514:3: note: 
         suggested alternative: ‘PrepareFontOptions’
 2:43.38    UpdateSystemFontOptions();
 2:43.39    ^~~~~~~~~~~~~~~~~~~~~~~
 2:43.39    PrepareFontOptions
 2:43.46 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp: In lambda function:
 2:43.46 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1721:72: error: 
         could not convert ‘false’ from ‘bool’ to ‘mozilla::WeightRange’
 2:43.46                                               weight,     stretch, style};
 2:43.46                                                                         ^
 2:43.47 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1721:72: error:
         could not convert ‘weight’ from ‘gfxPlatformFontList::WeightRange’
         {aka ‘mozilla::WeightRange’} to ‘mozilla::StretchRange’
 2:43.47 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1721:72: error:
         could not convert ‘stretch’ from ‘gfxPlatformFontList::StretchRange’
         {aka ‘mozilla::StretchRange’} to ‘mozilla::SlantStyleRange’
 2:43.49 ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:1721:72: error:
         could not convert ‘style’ from ‘gfxPlatformFontList::SlantStyleRange’
         {aka ‘mozilla::SlantStyleRange’} to ‘RefPtr
From the blame of the file, the error here seems to be related to a different change; that in commit e2df872ab83b5 which references Bugzilla bug 1708285.

The fix for these turns out to be less formulaic than most of the issues we've seen so far. To address this I had to essentially update the Qt versions of the GetFontList() and CreatePlatformFontList() methods to match those made for other platforms. This meant passing in different parameters and doing slightly different things with them.

This highlights one of the difficulties in working without a working build. I've made the changes I think look appropriate, but I have no way to test them at this stage. I'll just have to come back to them later and perform the checks once the build is working fully.

The final error in the above log output relates to the change in commit 2f1aa020c3d4c, Bugzilla bug 1714282.

While working on this I also noticed the following error. I might have missed it before because it doesn't generate a red error message. I'll need to follow this up too.
 2:13.63 make[4]: *** No rule to make target 'moc_message_pump_qt.cc', needed by
         'moc_message_pump_qt.o'.  Stop.
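Since a make failure like this doesn't contain the word "error:", it's easy to scroll past. As a rough sketch, assuming the build output has been saved to a file with the same timestamp prefix as the logs shown here, something like the following could pull out both compiler errors and make failures. The function name and the patterns are my own choices for illustration, not part of the gecko tooling.

```python
import re

# Timestamp prefix used by the build output, e.g. " 2:13.63 " or "128:19.37 "
TIMESTAMP = re.compile(r"^\s*\d+:\d{2}\.\d{2}\s+")

# Compiler and cargo diagnostics contain "error:", while make failures
# look like "make[4]: *** ..." or "make: *** ..."
FAILURE = re.compile(r"\berror:|^make(\[\d+\])?: \*\*\*")


def find_failures(log_lines):
    """Return (line_number, message) pairs for lines that look like failures."""
    failures = []
    for number, line in enumerate(log_lines, start=1):
        # Strip the timestamp so the patterns can anchor on the message itself
        message = TIMESTAMP.sub("", line.rstrip("\n"))
        if FAILURE.search(message):
            failures.append((number, message))
    return failures
```

Called on the lines of a saved log, this returns the failing lines in the order they appear, which at least makes the colourless make errors harder to miss.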
During the journey I'm able to fix the font method failures. The parameter conversion and error relating to the message pump will have to wait until tomorrow.

Reflecting back on my train journey, it wasn't quite as productive as it could have been. The earlier cancellation meant the train we ended up on was super-busy, which made it trickier to concentrate. But I did make progress, which is the aim of the game here. Slowly but surely.

If you want to read my other posts on this topic, check out my Gecko Dev Diary.
25 Aug 2023 : Day 9 #
In yesterday's post we talked about a range of different issues, from checksums to defines, ultimately finishing off with a discussion about software archaeology. We'll need to dig through more software detritus in the future too.

A few errors follow in the build this time.
15:26.27 ${PROJECT}/gfx/cairo/cairo/src/cairo-qt-surface.cpp:65:10: fatal error:
         QWidget: No such file or directory
15:26.27  #include <QWidget>
15:26.27           ^~~~~~~~~
15:26.27 compilation terminated.
Since QWidget is definitely a thing, this should be straightforward to fix. But there are more errors later in the output.
17:14.68 In file included from ${PROJECT}/../obj-build-mer-qt-xr/dist/include/cairo/cairo-ft.h:46,
17:14.68                  from ${PROJECT}/../obj-build-mer-qt-xr/dist/system_wrappers/cairo-ft.h:3,
17:14.68                  from ${PROJECT}/../obj-build-mer-qt-xr/dist/include/mozilla/gfx/UnscaledFontFreeType.h:10,
17:14.68                  from ${PROJECT}/gfx/thebes/gfxFT2FontBase.h:13,
17:14.69                  from ${PROJECT}/gfx/thebes/gfxFT2FontBase.cpp:6:
17:14.69 ${PROJECT}/../obj-build-mer-qt-xr/dist/system_wrappers/ft2build.h:3:15: fatal error:
         ft2build.h: No such file or directory
17:14.69  #include_next <ft2build.h>
17:14.69                ^~~~~~~~~~~~
17:14.69 compilation terminated.
Another apparently missing header file. It's worth noting that with my configuration the build is running with 16 threads at a time. If these errors came from the same thread then they could be related (meaning that one error is a consequence of the other). But equally they may have come from different threads, in which case they're likely to be totally unrelated. I'm using 16 threads because it makes the build faster, but it does also add some additional uncertainty to the debugging. There's a third error as well.
 2:26.53 ${PROJECT}/image/decoders/icon/qt/nsIconChannel.cpp: In function
         ‘nsresult moz_qicon_to_channel(QImage*, nsIURI*, nsIChannel**)’:
 2:26.54 ${PROJECT}/image/decoders/icon/qt/nsIconChannel.cpp:101:35: error:
         ‘NS_LITERAL_CSTRING’ was not declared in this scope
 2:26.54                                    NS_LITERAL_CSTRING(IMAGE_ICON_MS));
 2:26.54                                    ^~~~~~~~~~~~~~~~~~
Not directly header related, but it could be if, for example, this is a define that's coming from one of the missing headers. This will need investigation.

Since it's not immediately clear whether they're related, I figure it's probably best to look at them separately, starting with the first of them as they appeared in the logs.

To reiterate, the error suggests that the QWidget header file can't be found. Looking at the Qt5 include structures, the header is contained in the QtWidgets folder, and indeed changing the include to use QtWidgets/QWidget instead of just QWidget does seem to address this particular issue.

It doesn't fix the ft2build.h error though; the compiler still can't find this header. A quick search of the Web shows this header comes from freetype2. Somewhere in the code we should be seeing a freetype2 include folder, something like /usr/include/freetype2 being added to the build flags.

In the ESR 78 code this is referenced in the following file:
gecko-dev/media/webrtc/trunk/peerconnection_client.target.mk
Oddly this isn't a file that exists in the ESR 91 codebase. It's not clear why — maybe it'll become clear later — but in the meantime we need to find another way to add this include as a build parameter.

A quick and somewhat dirty solution is to add it to the CFLAGS export in embedding/embedlite/config/mozconfig.merqtxulrunner. Adding that in seems to have a positive effect. There is almost certainly a better place to put this, but that'll do for now, it can all be finessed later.
export CFLAGS="-O3 -I/usr/include/freetype2"
export CXXFLAGS="-O3 -I/usr/include/freetype2"
After building with these changes the errors now also change slightly, so that the next failure is the following:
18:06.85 In file included from ${PROJECT}/gfx/thebes/gfxFcPlatformFontList.cpp:44:
18:06.85 ${PROJECT}/gfx/thebes/gfxQtPlatform.h:11:10: fatal error:
         nsDataHashtable.h: No such file or directory
18:06.85  #include "nsDataHashtable.h"
18:06.85           ^~~~~~~~~~~~~~~~~~~
18:06.85 compilation terminated.
In the ESR 78 codebase the nsDataHashtable.h header used to be found at gecko-dev/xpcom/ds/nsDataHashtable.h. This no longer seems to be the case for ESR 91: the file doesn't exist anywhere. Is the problem that the file should exist, or that it should no longer be being included?

Well, the file is included from gecko-dev/gfx/thebes/gfxQtPlatform.h, a file which was entirely added by our "bring back Qt" patch. So probably it's referencing something that it shouldn't; viz this include that no longer exists. We could just remove it and see what happens, but it may be worthwhile digging through the history a bit to see why it was removed. That should be our first port of call for tomorrow.

That's it for today. As usual, the rest of these posts can be found on my Gecko Dev Diary page.
24 Aug 2023 : Day 8 #
Yesterday we were tackling IPDL syntax changes. The next build failure today seems to be of the more substantial variety. The error looks like this:
 2:11.87 ./application.ini.h.stub
 3:06.31 ${PROJECT}/netwerk/protocol/gio/PGIOChannel.ipdl:21: error: |manager|
         declaration in protocol `PGIOChannel' does not match any |manages| 
         declaration in protocol `PNecko'
 3:06.31 Specification is not well typed.
 3:07.75 make[4]: *** [Makefile:30: ipdl.track] Error 1
 3:07.76 make[3]: *** [${PROJECT}/config/recurse.mk:99: ipc/ipdl/export] Error 2
 3:07.76 make[2]: *** [${PROJECT}/config/recurse.mk:34: export] Error 2
 3:07.76 make[1]: *** [${PROJECT}/config/rules.mk:355: default] Error 2
 3:07.76 make: *** [client.mk:65: build] Error 2
Searching the previous patches for PGIOChannel and PNecko doesn't throw anything up. This looks like a new error and at this point in time I've absolutely no idea what the underlying reason for it might be. Some digging is in order.

The PGIOChannel.ipdl file isn't part of the EmbedLite changes as far as I'm aware, so this is a bit confusing. But looking in the PNecko.ipdl file we can see the following:
#ifdef MOZ_WIDGET_GTK
  manages PGIOChannel;
#endif
That looks like a smoking gun. We want this manages to be defined, but we don't have MOZ_WIDGET_GTK defined because we're using the MOZ_WIDGET_QT define instead. Probably the right thing to do is to extend this condition to include the Qt case as well.

So, this:
#ifdef MOZ_WIDGET_GTK
gets switched for this:
#if defined(MOZ_WIDGET_GTK) || defined(MOZ_WIDGET_QT)
in a few places.

To keep things neat I've merged these changes in with the patch 0002 changes described earlier. And now it's building again.

As the build progresses things are getting even more exciting: C++ code is quite clearly being compiled, with associated warnings being spat out. Warnings are warnings, but the fact we're onto the C++ build is nevertheless a very good sign.

Then this happens:
 4:25.81 dom/broadcastchannel
 4:30.46 error: the listed checksum of `${PROJECT}/third_party/rust/cc/src/lib.rs` has changed:
 4:30.46 expected: 20f6fce88058fe2c338a8a7bb21570c796425a6f0c2f997cd64740835c1b328c
 4:30.46 actual:   1ee1bc9318afd044e5efb6df71cb44a53ab6c5166135d645d4bc2661ce6fecce
 4:30.46 directory sources are not intended to be edited, if modifications are 
         required then it is recommended that `[patch]` is used with a forked 
         copy of the source
 4:30.48 make[4]: *** [${PROJECT}/config/makefiles/rust.mk:405: force-cargo-library-build] Error 101
We're running with 16 threads and it takes a while for the other threads to complete. But this is clearly an error of the build-failing variety.

The problem is the checksum in the file gecko-dev/third_party/rust/cc/.cargo-checksum.json. If a .cargo-checksum.json file is missing it should get automatically regenerated, so in this case I just deleted the file and kicked the build off again. Let's see what happens now.
 2:11.19 error: failed to load source for dependency `cc`
 2:11.20 Caused by:
 2:11.20   Unable to update https://github.com/alexcrichton/cc-rs/
           ?rev=b2f6b146b75299c444e05bbde50d03705c7c4b6e#b2f6b146
 2:11.20 Caused by:
 2:11.20   failed to update replaced source https://github.com/alexcrichton/cc-rs/
           ?rev=b2f6b146b75299c444e05bbde50d03705c7c4b6e#b2f6b146
 2:11.20 Caused by:
 2:11.20   failed to load checksum `.cargo-checksum.json` of cc v1.0.71
 2:11.20 Caused by:
 2:11.21   failed to read `${PROJECT}/third_party/rust/cc/.cargo-checksum.json`
 2:11.21 Caused by:
 2:11.21   No such file or directory (os error 2)
 2:11.21 make[4]: *** [${PROJECT}/config/makefiles/rust.mk:405: force-cargo-library-build] Error 101
Oh, okay, so maybe the part about it being automatically regenerated isn't true after all! I restore the file and make the checksum change manually.
git checkout third_party/rust/cc/.cargo-checksum.json
sed -i -e 's/20f6fce88058fe2c338a8a7bb21570c796425a6f0c2f997cd64740835c1b328c/1ee1bc9318afd044e5efb6df71cb44a53ab6c5166135d645d4bc2661ce6fecce/g' \
  third_party/rust/cc/.cargo-checksum.json
I do hope there aren't too many incorrect checksums or this will take a long time.
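If there had been more of them, a small script would beat doing the sed dance by hand each time. Here's a minimal sketch, assuming the .cargo-checksum.json file has a top-level "files" object mapping relative paths to SHA-256 hex digests, which is how the vendored crates I've looked at are laid out. The function name is made up for illustration, and the compact JSON formatting on write is a guess rather than anything cargo requires.

```python
import hashlib
import json
from pathlib import Path


def update_cargo_checksum(crate_dir, relative_path):
    """Recompute the SHA-256 of one vendored file and store it in .cargo-checksum.json.

    Returns the (previous, new) digests so the change can be reviewed.
    """
    crate_dir = Path(crate_dir)
    checksum_file = crate_dir / ".cargo-checksum.json"

    # Hash the patched file exactly as it is on disk
    digest = hashlib.sha256((crate_dir / relative_path).read_bytes()).hexdigest()

    data = json.loads(checksum_file.read_text())
    previous = data["files"].get(relative_path)
    data["files"][relative_path] = digest

    # Compact separators and sorted keys are an assumption about the layout
    checksum_file.write_text(json.dumps(data, separators=(",", ":"), sort_keys=True))
    return previous, digest
```

For the case above that would be something like update_cargo_checksum("third_party/rust/cc", "src/lib.rs"), leaving every other entry in the file untouched.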

Well, the good news is that there are no other immediate checksum errors. The next error comes from some C++ code in an area where I know from previous experience there are changes needed for Sailfish OS rendering. So this is promising.
 2:38.08 In file included from <command-line>:
 2:38.08 ${PROJECT}/../obj-build-mer-qt-xr/mozilla-config.h:128:25: error: 
         redefinition of ‘class mozilla::gl::GLContextProviderEGL’
 2:38.08  #define MOZ_GL_PROVIDER GLContextProviderEGL
 2:38.08                          ^~~~~~~~~~~~~~~~~~~~
It looks very much like this error is fixed by patch 0011 "Fix GLContextProvider defines". Let's apply the patch:
$ patch -d gecko-dev/ -p1 < rpm/0011-sailfishos-compositor-Fix-GLContextProvider-defines.patch 
patching file gfx/gl/GLContextProvider.h
Hunk #1 FAILED at 48.
Hunk #2 succeeded at 83 (offset 7 lines).
1 out of 2 hunks FAILED -- saving rejects to file gfx/gl/GLContextProvider.h.rej
Okay, let's apply the patch manually... The patches may not apply directly as they are, but having these patches from the previous version sure does make things a lot easier than they would otherwise be.

It feels like I'm making good progress, because after making these changes the build is making good progress too. It's running through quite a few files and I'm seeing lots of rather endorphin-releasing green lines of console output. The next error pops up.
 3:19.99 In file included from Unified_cpp_dom_ipc0.cpp:119:
 3:19.99 ${PROJECT}/dom/ipc/ContentParent.cpp: In member function ‘bool 
         mozilla::dom::ContentParent::InitInternal(mozilla::dom::
         PContentParent::ProcessPriority)’:
 3:19.99 ${PROJECT}/dom/ipc/ContentParent.cpp:2931:46: error:
         ‘mozilla::components::GfxInfo’ has not been declared
 3:20.00    nsCOMPtr<nsIGfxInfo> gfxInfo = components::GfxInfo::Service();
 3:20.00                                               ^~~~~~~
 3:20.45 dom/media/webrtc/libwebrtcglue
This took a bit of digging to figure out. By using git blame I was able to track down the change that caused the error to Bugzilla bug 1686616.

That makes this a good opportunity to talk about Software Archaeology.

Software archaeology is a method of debugging that was explained to me by Raine a couple of years back when we were working on ESR 68. As he explained at the time, in order to get gecko to build, software archaeology is one of the most important skills needed.

The tools of the software archaeologist are git log, git blame, grep, find and Bugzilla search. The objective is not to understand the code per se but rather to follow the history of what led to the code change. This can require a lot of metaphorical digging.

The result of this digging is ideally a Phabricator diff that can be applied directly to our codebase. More often it's a diff that at least shows what changed in the past, in a way that explains the error happening now.
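The hop from a commit to its Bugzilla entry is the bit that gets repeated most, so it's worth a small illustration. This is only a sketch, assuming the upstream convention of putting "Bug NNNNNNN" in the commit subject; the helper names and the example subject wording are hypothetical, although the bug number is the one from this post.

```python
import re

# Upstream commit subjects conventionally mention "Bug NNNNNNN"
BUG_PATTERN = re.compile(r"\bbug\s+(\d+)", re.IGNORECASE)


def bug_numbers(commit_subject):
    """Return every Bugzilla bug number mentioned in a commit subject line."""
    return [int(number) for number in BUG_PATTERN.findall(commit_subject)]


def bugzilla_url(bug_number):
    """Build the URL of the bug page for a given bug number."""
    return f"https://bugzilla.mozilla.org/show_bug.cgi?id={bug_number}"


# Hypothetical subject line, as might be copied from git log or git blame
subject = "Bug 1686616 - Example subject line. r=reviewer"
urls = [bugzilla_url(number) for number in bug_numbers(subject)]
```

From the bug page it's then usually a short step to the Phabricator diff.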

For this GfxInfo bug, the software archaeology approach seems to have worked. Looking at the diff on Phabricator for this bug, it becomes clear that the fix is to ensure GfxInfo is properly named in the Qt version of the components.conf file, to replicate the same change that was made upstream for the Gtk widgets.

After making this change and setting the build running, that takes us to the next step. Since the next step involves considering a triplicate of errors, this is a good place to stop for today.

As always, don't forget you can check out the Gecko Dev Diary page for previous days.
23 Aug 2023 : Day 7 #
Last night we were left with a bit of a mystery: is 14 greater than 5? If so, why does our build error claim our version of libclang is too old when it's asking for at least version 5 and we have version 14 installed?

You'll be happy to hear 14 is greater than five and maths isn't broken. The answer to the mystery should have occurred to me earlier because back in August 2021 I hit exactly the same problem building ESR 78. We can see that in the old patch 0036 which is also designed to circumvent this exact same clang version check.

I'll go into the details shortly, but first I must tip my hat to mal who figured this out and let me know on IRC:
 
<mal> flypig: there is a patch in gecko to disable clang version check
<mal> patch 36

Mal is correct of course. Kudos also to filip.k on the Sailfish Forum who also gave a nice suggestion for what might be going wrong, saying that it might be a single digit check.

I'm immediately reminded of the legendary tale that Microsoft skipped Windows 9 for precisely this reason. The claim was that third party software was breaking because it tested for Windows 95 or 98 by just checking the first digit of the version number. The story is no doubt apocryphal, but also deliciously plausible.
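To make the single-digit idea concrete, here's a small sketch of how such a bug could arise and how comparing parsed version components avoids it. This illustrates the hypothesis only; it's not what the gecko check does, and all of the function names are my own.

```python
def first_digit_at_least(version, minimum):
    """Flawed check: compares only the leading character of each version string."""
    return version[0] >= minimum[0]


def parse_version(version):
    """Turn a dotted version string such as '14.0.0' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def version_at_least(version, minimum):
    """Compare versions component by component, padding the shorter with zeros."""
    have, need = parse_version(version), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# The flawed check decides that 14 is older than 5, because "1" sorts before "5"
flawed = first_digit_at_least("14.0.0", "5.0")
correct = version_at_least("14.0.0", "5.0")
```

Here flawed ends up False while correct is True, which is exactly the sort of mismatch that would produce a "too old" complaint about a perfectly recent library.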

In our case it's slightly different in that it's not checking the version number at all. Instead it's checking whether a particular function is present in the library (a function that's present in version 5, but not in version 4). This is failing for us for very specific reasons, which is why we have to remove the version check entirely.
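A presence check of this kind can be sketched with ctypes as below. This is my own minimal version, not the code from the gecko build scripts, and the helper name is an assumption. Any real symbol name would need to be taken from the build script itself.

```python
import ctypes


def library_provides(path, symbol):
    """Return True if the shared library at `path` loads and exports `symbol`."""
    try:
        library = ctypes.CDLL(path)
    except OSError:
        # Load failures (missing file, wrong architecture) all end up here
        return False
    # ctypes looks the symbol up on attribute access, so hasattr is False
    # when the dynamic loader can't find it
    return hasattr(library, symbol)
```

Note the design weakness: a library that can't be loaded at all is indistinguishable from one that's missing the function, so both would be reported the same way.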

The reason it's failing is because of the cross-compiling dance we do, building for multiple platforms (armv7hl, aarch64, i486) all on an i486 machine (which is what OBS and my local Sailfish SDK are both running on). We need to use i486 clang for the build process, but one of the other variants (in the case of the arm builds) for the actual compiled code.

As such the Python build process is expecting to find an arm version of the library, picks up the i486 version and chokes on it. We can see this if we negotiate ourselves inside the build engine and check the file it's complaining about:
$ sfdk engine exec
[mersdk@f1c636fdd533]$ sb2 -t SailfishOS-devel-aarch64.default
[SB2 sdk-install] I have no name!@f1c636fdd533 # file gecko-dev/../obj-build-mer-qt-xr/lib/libclang.so.10
gecko-dev/../obj-build-mer-qt-xr/lib/libclang.so.10: ELF 32-bit LSB shared object,
  Intel 80386, version 1 (GNU/Linux), dynamically linked,
  BuildID[sha1]=53fda5b6446347702ed3b6163ba0f59b69a0c53e, stripped
We can even perform the same python steps that are in the build script manually and see that they fail. The code is trying to dynamically load the library and check whether a particular function exists, but this assumes that the library is in the right format, which it's not.
$ sfdk engine exec
[mersdk@f1c636fdd533]$ sb2 -t SailfishOS-devel-aarch64.default
[SB2 sdk-build] I have no name!@f1c636fdd533 gecko-dev $ python3
Python 3.8.11 (default, Aug 26 2022, 00:00:00) 
[GCC 8.3.0 20190222 (Sailfish OS gcc 8.3.0-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ctypes import CDLL
>>> lib = CDLL("obj-build-mer-qt-xr/lib/libclang.so.10")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: obj-build-mer-qt-xr/lib/libclang.so.10: wrong ELF class: ELFCLASS32
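The same information that file reported can be read straight from the first few bytes of the library, without trying to load it. The following is a rough sketch based on the standard ELF header layout (class at byte 4, byte order at byte 5, machine type at offset 18); the function names are mine and the machine table only covers the architectures mentioned in this post plus x86-64.

```python
import struct

ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELF_MACHINES = {3: "Intel 80386", 40: "ARM", 62: "x86-64", 183: "AArch64"}


def describe_elf(header):
    """Decode the class and machine from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    elf_class = ELF_CLASSES.get(header[4], "unknown")
    # EI_DATA: 1 means little-endian, 2 means big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18 for both 32-bit and 64-bit files
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return elf_class, ELF_MACHINES.get(machine, f"machine {machine}")


def describe_elf_file(path):
    """Read just enough of the file at `path` to describe it."""
    with open(path, "rb") as handle:
        return describe_elf(handle.read(20))
```

Run against the libclang.so.10 above I'd expect this to report ELFCLASS32 and Intel 80386, matching both the file output and the ELFCLASS32 complaint from the loader.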
The good news is that we're already enforcing the correct library availability through our spec file, so we can safely skip this check. The check makes sense for the general build process, but we don't really need it. At least the error highlighted the fact we need to apply the same patch as for ESR 78 again.

So no need to rebuild libclang. And the other libraries we need have all now been rebuilt.

However I hit a bit of a problem with icu 69.1. Upgrading icu has a cascading effect on other packages, all of which need to be rebuilt against it. This is what OBS is great at, but I'm doing all this locally right now. So I'm going to continue using the older version until it becomes absolutely necessary.

Happily the NSPR upgrade works without a hitch.

So now we're off with the build again and very quickly there's a problem with the generation of our custom EmbedLite IPDL files. It seems the file format has actually changed since ESR 78. Not significantly, but enough to trigger an error.

IPDL is gecko's "Inter-process/thread-communication Protocol Definition Language". It's used by gecko to pass messages between different parts of itself and can be thought of a little like internal RPC.

The developer creates an IPDL file which describes classes and methods. These are processed to generate C++ header files for two classes: a parent and a child class which represent the parent and child process respectively. Glued together with some internal gecko magic, the parent can now communicate safely with the child (and vice versa) by calling the methods defined in the IPDL file.

Anyhow, the issue seems to be that the IPDL syntax was changed so that method annotations such as compress, priority and tainted are now prefixes rather than suffixes, as explained in Bugzilla bug #1689147 and which we can see in the upstream change set D103368.

Happily it's just a syntax change, so fixing the EmbedLite IPDL files is just a case of updating the syntax, with no logic or API changes needed.
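As an illustration of how mechanical the change is, here's a sketch that rewrites just the compress case. It assumes the old form is a trailing compress keyword and the new form is a leading [Compress] attribute, which matches my reading of the upstream change but should be checked against the change set. The message name in the example is hypothetical and other annotations aren't handled.

```python
import re

# Matches a message declaration that ends with the old-style "compress" suffix
OLD_COMPRESS = re.compile(r"^(\s*)((?:async|sync)\s+.*?)\s+compress\s*;\s*$")


def move_compress_to_prefix(line):
    """Rewrite 'async Foo(...) compress;' as '[Compress] async Foo(...);'.

    Expects a single line without its trailing newline; lines that don't
    match the old form are returned unchanged.
    """
    return OLD_COMPRESS.sub(r"\1[Compress] \2;", line)


# Hypothetical message declaration in the old syntax
old_line = "    async ExampleEvent(int32_t aValue) compress;"
new_line = move_compress_to_prefix(old_line)
```

With only a handful of IPDL files to update, doing it by hand is just as quick, but the sketch shows there's nothing more to it than moving the annotation.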

That seems like a good place to stop for today. Tomorrow I'll run the build to find out if the IPDL changes fixed things, and move on to the next error we're up against.

If you want to catch up on any previous entries, check out the full Gecko-dev Diary page.
22 Aug 2023 : Day 6 #
Before I get into the building of things, I want to first shout out to a few people who've taken an interest in this stuff.

First off to attah, direc85 and mal on IRC. I don't know how, but attah managed to discover my GitHub changes before I'd even started posting anything about them and yesterday managed to announce the Day 5 blog post before I'd even posted it. Thanks to you all for your encouragement and generous offers of help.

Nephros has been building an updated NSPR and NSS, amongst other things, which I hope we'll see in the Jolla repositories in due course.

Thigg has provided really useful feedback via the Sailfish Forum on the particular issue I hit yesterday. We're actually taking a bit of a detour around it today, but I'll talk more about it in future posts.

On Mastodon I'm grateful to Thaodan for his always-useful discussion and feedback.

Thanks to Peter for featuring this work in the LINux on MOBile weekly news update. I'm chuffed to have been included and recommend LINMOB's updates as a great way to stay informed about all things mobile Linux related.

Thanks to throwaway69 for posting about this on the Sailfish Forum and to the lovely community there for all the nice comments!

Last but definitely not least thanks to Raine for suggesting this in the first place, but also to him and the IRC posse for the nice discussion at the most recent Community Meeting.

I hope I didn't miss anyone and apologies if I did. It's always encouraging to get feedback and I appreciate all of the help and offers that everyone's been giving.

Let's talk now about where things are at. Yesterday we hit a "stack overflow" error. It's not quite clear where this is coming from. Thigg pointed out that this could be similar to an issue he's experienced using GraalVM to build using the Sailfish SDK.

That's definitely a possibility, but it's also the case that we're still using versions of the tooling (libclang and cbindgen) that are technically too old for the code we're building. I'd like to eliminate this as a possibility first.

Upgrading the various dependencies is something I'm going to have to do sooner or later. So updating NSPR, cbindgen, libclang and icu-i18n are going to be my next steps. This feels like a bit of a side quest, but it might turn out to be an essential achievement for progressing in the game.

If you recall I previously hacked around these requirements, so it's already clear that we'll need the following:
  1. nspr >= 4.32
  2. cbindgen >= 0.19.0
  3. libclang >= 5.0
  4. icu-i18n >= 69.1
First up is NSPR. As mentioned above Nephros has already done the hard work here and posted the changes needed to GitHub.

Nephros took the Sailfish OS specific patches from version 4.29 and re-worked them to apply to 4.35. The package then builds nicely with these changes, and I'm able to use it in my build using the nice task-oriented capabilities of the Sailfish SDK.

Next up is libicu. This was a bit tricky, mostly because the build (and the unit tests in particular) takes so long to run. But eventually I got it to build some packages by disabling only a few of the unit tests. The related pull request is still in draft because I've not had a chance to test it with anything other than aarch64. Previous versions required different unit tests to be masked out for different architectures, so this is one of those times when it needs testing for all available targets.

Now on to cbindgen. This was distressingly straightforward. No patches, no changes to the spec file, just update the submodule to the appropriate tag and rebuild. Maybe I did something wrong?

Finally (hopefully finally) libclang. Finding the correct package for this was a little trickier than I'd expected, but eventually I dug out the clang.spec file hiding inside the LLVM repository.

Having found it, looking through the spec file and code gave me pause for thought. What's there is clearly clang version 14, whereas the gecko build script was complaining (before I hacked it) about needing clang greater than 5. My maths may be a bit rusty, but I'm pretty sure 14 is greater than five.

For full disclosure I must admit that I do know the answer to the mystery of what's going wrong here. These posts are lagging a little behind reality as I've not yet had the chance to write everything up. But as I write this it's late, so the mystery will have to wait until tomorrow.

If you want to catch up on any previous entries, check out the full Gecko-dev Diary page.
21 Aug 2023 : Day 5 #
Yesterday we left things at the following error:
 0:29.39 checking for rust host triplet...
 0:29.39 ERROR: The rust compiler host (i686-unknown-linux-gnu) is not suitable 
         for the configure host (aarch64-unknown-linux-gnu).
 0:29.39 You can solve this by:
 0:29.39 * Set your configure host to match the rust compiler host by editing your
 0:29.39 mozconfig and adding "ac_add_options --host=i686-unknown-linux-gnu".
 0:29.39 * Or, install the rust toolchain for aarch64-unknown-linux-gnu, if 
         supported, by running
 0:29.39 "rustup default stable-aarch64-unknown-linux-gnu"
So now we have to tackle it. As I mentioned yesterday, this looks likely due to the peculiar cross-compilation-inside-a-cross-compilation environment that we're using for our build.

The good news is that we have a patch 0032 "Read rustc host from environment" which might help with this. Unfortunately it doesn't apply cleanly:
patching file build/moz.configure/rust.configure
Hunk #1 FAILED at 91.
Hunk #2 FAILED at 99.
2 out of 2 hunks FAILED -- saving rejects to file build/moz.configure/rust.configure.rej
patching file third_party/rust/cc/.cargo-checksum.json
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file third_party/rust/cc/.cargo-checksum.json.rej
patching file third_party/rust/cc/src/lib.rs
Hunk #1 FAILED at 1977.
Hunk #2 succeeded at 2639 (offset 381 lines).
Hunk #3 succeeded at 2683 (offset 381 lines).
1 out of 3 hunks FAILED -- saving rejects to file third_party/rust/cc/src/lib.rs.rej
However it's not a big patch, so maybe it can be fixed up manually? A quick glance at the changes suggests the problems are caused by the same quoting issue we saw yesterday, i.e. that all of the single quotes in the Python build code have been switched for double quotes.
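To see why a quoting change alone is enough to make a hunk fail, consider that the context lines of a hunk have to be found in the target file. The sketch below is a much simplified stand-in for what patch does (there's no fuzz or offset handling), and the example lines are hypothetical rather than taken from rust.configure.

```python
def normalise_quotes(line):
    """Naively treat single and double quotes as equivalent."""
    return line.replace("'", '"')


def context_found(context, target, normalise=lambda line: line):
    """Return True if the context lines appear consecutively in the target lines."""
    wanted = [normalise(line) for line in context]
    lines = [normalise(line) for line in target]
    span = len(wanted)
    return any(lines[i:i + span] == wanted for i in range(len(lines) - span + 1))


# Hypothetical hunk context written against the old single-quoted code
context = ["@depends('--host')", "def host(value):"]
# The same code after the upstream switch to double quotes
target = ['@depends("--host")', "def host(value):", "    return value"]

exact_match = context_found(context, target)
quote_blind_match = context_found(context, target, normalise_quotes)
```

The exact comparison fails while the quote-blind one succeeds, which suggests that the underlying code hasn't moved much and the hunks can be transplanted by hand.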

That at least means that adding the remaining failed hunks turns out to be straightforward, and I was pretty astonished that this does in fact fix the issue. So we can swiftly move on to the next error:
 0:30.07 checking for cbindgen...
 0:30.07 DEBUG: trying cbindgen: /usr/bin/cbindgen
 0:30.07 DEBUG: Executing: `/usr/bin/cbindgen --version`
 0:30.07 DEBUG: /usr/bin/cbindgen has version 0.17.0
 0:30.07 DEBUG: trying cbindgen: /usr/bin/cbindgen
 0:30.07 DEBUG: Executing: `/usr/bin/cbindgen --version`
 0:30.07 DEBUG: /usr/bin/cbindgen has version 0.17.0
 0:30.07 DEBUG: trying cbindgen: /usr/bin/cbindgen
 0:30.07 DEBUG: Executing: `/usr/bin/cbindgen --version`
 0:30.08 DEBUG: /usr/bin/cbindgen has version 0.17.0
 0:30.08 ERROR: cbindgen version 0.17.0 is too old. At least version 0.19.0 is required.
 0:30.08 Please update using 'cargo install cbindgen --force' or running
 0:30.08 './mach bootstrap', after removing the existing executable located at
 0:30.08 /usr/bin/cbindgen.
I'm able to work around this by hacking the version check from 0.19.0 to 0.17.0 in build/moz.configure/bindgen.configure, but that just leads to the next similar error.
 0:30.34 checking for clang for bindgen... /usr/bin/clang++
 0:30.36 checking for libclang for bindgen... ${PROJECT}/../obj-build-mer-qt-xr/lib/libclang.so.10
 0:30.36 checking that libclang is new enough...
 0:30.37 ERROR: The libclang located at ${PROJECT}/../obj-build-mer-qt-xr/lib/libclang.so.10
         is too old (need at least 5.0).
 0:30.37 Please make sure to update it or point to a newer libclang using
 0:30.37 --with-libclang-path.
The libclang that it's checking against is ultimately coming from the clang-libs package in the Sailfish OS repositories. I'm able to hack around the check (once again to be found in bindgen.configure), but it's getting increasingly clear that performing the underlying package upgrades is going to be necessary, and that maybe I'd be better off biting the bullet and just doing the work to upgrade them.

It turns out there are more too:
 0:30.33 checking for icu-i18n >= 69.1... no
 0:30.33 ERROR: Requested 'icu-i18n >= 69.1' but version of icu-i18n is 68.2
Working around this involves hacking away at gecko-dev/js/moz.configure.

That's now NSPR, cbindgen, libclang and icu-i18n that all need upgrading.

Hacking around the last of these gets the build further...
 0:54.13 Reticulating splines...
...which is actually a bit of a milestone, since the crucial reticulating splines step signifies the shift from configuring the build process to actually starting the build.
 
Console build output: reticulating splines

With the build actually running, the next failure is something more interesting.
 0:54.13 Reticulating splines...
 0:59.17  0:05.38 File already read. Skipping: ${PROJECT}/intl/components/moz.build
 1:02.48  0:08.69 File already read. Skipping: ${PROJECT}/gfx/angle/targets/angle_common/moz.build
 1:20.38 Traceback (most recent call last):
 1:20.38   File "${PROJECT}/configure.py", line 226, in <module>
 1:20.39     sys.exit(main(sys.argv))
 1:20.39   File "${PROJECT}/configure.py", line 80, in main
 1:20.39     return config_status(config)
 1:20.39   File "${PROJECT}/configure.py", line 221, in config_status
 1:20.39     return config_status(args=[], **sanitized_config)
 1:20.39   File "${PROJECT}/python/mozbuild/mozbuild/config_status.py", line 174, in config_status
 1:20.39     definitions = list(definitions)
 1:20.39   File "${PROJECT}/python/mozbuild/mozbuild/frontend/emitter.py", line 167, in emit
 1:20.39     objs = list(emitfn(out))
 1:20.39   File "${PROJECT}/python/mozbuild/mozbuild/frontend/emitter.py", line 1397, in emit_from_context
 1:20.40     for obj in self._handle_linkables(context, passthru, generated_files):
 1:20.40   File "${PROJECT}/python/mozbuild/mozbuild/frontend/emitter.py", line 1041, in _handle_linkables
 1:20.40     raise SandboxValidationError(
 1:20.40 mozbuild.frontend.reader.SandboxValidationError:
 1:20.40 ==============================
 1:20.40 FATAL ERROR PROCESSING MOZBUILD FILE
 1:20.40 ==============================
 1:20.40 The error occurred while processing the following file or one of the files it includes:
 1:20.41     ${PROJECT}/widget/qt/moz.build
 1:20.41 The error occurred when validating the result of the execution. The reported error is:
 1:20.41     File listed in SOURCES does not exist: '${PROJECT}/widget/qt/nsNativeThemeQt.cpp'
From the error and file affected, it looks like this needs patch 0016 "Provide checkbox/radio renderer for Sailfish" to be applied to fix it. As well as providing a Sailfish OS checkbox/radio renderer, this patch also creates the widget/qt/nsNativeThemeQt.cpp file. My instinct tells me this is going to be a bit of work, it's late, and the sensible thing to do would be to wait until tomorrow to start this.

Except... I don't want to leave this alone. Let's continue.

So let's apply patch 0016 and create the widget/qt/nsNativeThemeQt.cpp file.

Surprisingly this time applying the patch works, albeit with a couple of the changes having already been applied:
$ patch -d gecko-dev -p1 < rpm/0016-sailfishos-qt-Provide-checkbox-radio-renderer-for-Sa.patch 
patching file layout/style/res/forms.css
Hunk #1 succeeded at 479 with fuzz 1 (offset -2 lines).
Hunk #2 succeeded at 489 with fuzz 1 (offset -1 lines).
patching file widget/qt/QtColors.h
patching file widget/qt/moz.build
Reversed (or previously applied) patch detected!  Assume -R? [n] n
Apply anyway? [n] n
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file widget/qt/moz.build.rej
patching file widget/qt/nsNativeThemeQt.cpp
patching file widget/qt/nsNativeThemeQt.h
Checking the git blame I can see that the duplicate changes were added when I applied the 0002 "Bring back Qt layer" patch. Why did that happen? It turns out it's because, rather than apply the patch for the new files added by patch 0002, instead I just copied the files over from the fully patched ESR78 build tree. It didn't occur to me that this would pull in changes that were applied to these files by later patches. D'oh! At least the reason is clear.

With the full patch 0016 applied, I kick off the build again. Now it actually, genuinely, appears to be building. I can tell because my legs are getting hot from my laptop fans!

It gets a little way in before generating an error I wasn't expecting. And one I'm as yet not at all sure how to tackle.
 2:42.29 config/system-header.sentinel.stub
 2:45.63 WARN: Skip mp4parse::MIN_SIZE - (not `pub`).
 2:45.63 WARN: Skip mp4parse::CONFIG_OBUS_OFFSET - (not `pub`).
 2:45.63 WARN: Skip mp4parse::MIN_PROPERTIES - (not `pub`).
 2:45.63 WARN: Skip mp4parse::MAX - (not `pub`).
 2:45.63 thread 'main' has overflowed its stack
 2:45.64 fatal runtime error: stack overflow
 2:45.64 qemu: uncaught target signal 6 (Aborted) - core dumped
 2:47.49 make[3]: *** [backend.mk:188: media/mp4parse-rust/.deps/mp4parse_ffi_generated.h.stub] Error 250
 2:47.49 make[3]: *** Waiting for unfinished jobs....
 3:45.71   in file included from `${PROJECT}/mobile/sailfishos/PEmbedLiteApp.ipdl', line 6:
 3:45.71 ${PROJECT}/mobile/sailfishos/PEmbedLiteView.ipdl:63: error: bad syntax near `compress'
 3:45.71 Specification could not be parsed.
 3:46.94 make[4]: *** [Makefile:30: ipdl.track] Error 1
 3:46.94 make[3]: *** [${PROJECT}/config/recurse.mk:99: ipc/ipdl/export] Error 2
 3:46.95 make[2]: *** [${PROJECT}/config/recurse.mk:34: export] Error 2
 3:46.95 make[1]: *** [${PROJECT}/config/rules.mk:355: default] Error 2
 3:46.95 make: *** [client.mk:65: build] Error 2
 3:46.95 0 compiler warnings present.
The key part here appears to be:
 2:45.64 fatal runtime error: stack overflow
The fans whirr down and the build finishes with this error. This will need more investigation. But I really will need to leave this one until tomorrow.

You can read all of my gecko-dev diary posts on my Gecko-dev Diary page.
20 Aug 2023 : Day 4 #
The majority of development yesterday was taken up with reapplying patch 0002 to "Bring back the Qt layer". I also adjusted a couple of build flags. Triggering a rebuild we immediately hit a similar issue today.
 0:18.59 checking whether cross compiling... no
 0:20.19 Traceback (most recent call last):
 0:20.19   File "${PROJECT}/configure.py", line 226, in <module>
 0:20.19     sys.exit(main(sys.argv))
 0:20.20   File "${PROJECT}/configure.py", line 50, in main
 0:20.20     sandbox.run(os.path.join(os.path.dirname(__file__), "moz.configure"))
 0:20.20   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 554, in run
 0:20.20     raise InvalidOptionError(msg)
 0:20.20 mozbuild.configure.options.InvalidOptionError: Unknown option: --enable-dconf
Here "dconf" is a reference to GNOME's low-level configuration system. This is a key-value pair database (similar to the Windows registry) that also happens to be heavily used for configuration settings on Sailfish OS. Part of the reason for its appeal on Sailfish OS is that there are really nice QML bindings for it, but there's actually been a move away from it in recent releases.

For gecko, enable-dconf is a Sailfish-specific flag that enables dconf access, which is used by patch 0061 to get the 12/24 hour time format setting from dconf. Without the patch there is no flag, so when we try to set it in the build configuration (having assumed that the patch is applied when it's not) the build complains and fails.

We'll need to add this patch back later, but adding it before the build is working would again be premature. So I've simply taken the code that introduced the flag from the patch and added it back in to build/moz.configure/old.configure even though as yet it won't be doing anything.

That gets the build moving again. An alternative would have been to remove the use of the flag from embedding/embedlite/config/mozconfig.merqtxulrunner, but right now the result is the same and I think I'm more likely to remember if I do things this way around.

Slowly but surely the build is progressing. Now we have the following error:
 0:20.33 Traceback (most recent call last):
 0:20.33   File "${PROJECT}/configure.py", line 226, in <module>
 0:20.34     sys.exit(main(sys.argv))
 0:20.34   File "${PROJECT}/configure.py", line 50, in main
 0:20.34     sandbox.run(os.path.join(os.path.dirname(__file__), "moz.configure"))
 0:20.34   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 554, in run
 0:20.34     raise InvalidOptionError(msg)
 0:20.34 mozbuild.configure.options.InvalidOptionError: Unknown option: --enable-system-sqlite
Yet another missing option, again something that would ordinarily be added by a patch we're not yet applying. This is for patch 0088 with the title "Drop support for --enable-system-sqlite". The patch looks relatively straightforward, so I thought it might be worth checking whether it just applies as-is. Here's what I get when I give this a go:
$ patch -d gecko-dev -p1 < rpm/0088-Revert-Bug-1611386-Drop-support-for-enable-system-sq.patch 
patching file browser/installer/package-manifest.in
Hunk #1 succeeded at 142 (offset 1 line).
patching file build/moz.configure/old.configure
Hunk #1 FAILED at 228.
1 out of 1 hunk FAILED -- saving rejects to file build/moz.configure/old.configure.rej
patching file config/external/sqlite/moz.build
Hunk #1 FAILED at 4.
1 out of 1 hunk FAILED -- saving rejects to file config/external/sqlite/moz.build.rej
patching file old-configure.in
Hunk #1 succeeded at 57 with fuzz 1 (offset -2 lines).
Hunk #2 succeeded at 2128 (offset 88 lines).
patching file storage/SQLiteMutex.h
patching file storage/moz.build
Hunk #1 succeeded at 99 with fuzz 2 (offset 1 line).
patching file storage/mozStorageConnection.cpp
Hunk #1 succeeded at 877 (offset 113 lines).
patching file storage/mozStorageService.cpp
Hunk #1 succeeded at 35 with fuzz 2 (offset 4 lines).
Hunk #2 succeeded at 166 (offset -17 lines).
patching file third_party/sqlite3/src/moz.build
Hunk #1 succeeded at 79 (offset -1 lines).
Hunk #2 succeeded at 89 with fuzz 1 (offset -1 lines).
patching file third_party/sqlite3/src/sqlite.symbols
The results aren't quite as bad as I'd feared they might be: only two failed hunks. The reason these hunks are failing is that Mozilla have switched from using single quotes to double quotes as their default Python string delimiter throughout the Python build scripts. That's going to cause havoc for pretty much all the patches that touch the build process.

Nevertheless, armed with this knowledge, adding the two failing hunks is straightforward. I apply the patch, commit the changes, and try again.
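
Here's a small sketch of why a quote change alone is enough to make a hunk fail. Everything in it is made up for illustration (demo.configure, the --enable-alpha/beta/gamma options); the real hunks were fixed by hand, and rewriting the quotes in a patch with sed as done here is just a crude way to show the effect.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir a b tree

# Older upstream style: single-quoted strings.
printf "options = [\n    '--enable-alpha',\n    '--enable-beta',\n]\n" > a/demo.configure
# The downstream patch adds one more option.
printf "options = [\n    '--enable-alpha',\n    '--enable-beta',\n    '--enable-gamma',\n]\n" > b/demo.configure
diff -u a/demo.configure b/demo.configure > add-option.patch

# Newer upstream style: the same file, but with double-quoted strings.
sed "s/'/\"/g" a/demo.configure > tree/demo.configure

# The context lines no longer match, so the hunk is rejected.
patch -d tree -p1 --dry-run < add-option.patch
failed=$?

# Give the patch the same quote style and it applies again.
sed "s/'/\"/g" add-option.patch > add-option-dq.patch
patch -d tree -p1 < add-option-dq.patch
fixed=$?

echo "original patch: $failed, requoted patch: $fixed"
```

A blanket search-and-replace like this would be risky on a real patch, since it would also rewrite quotes that upstream left alone.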

The enable-system-sqlite flag now goes through. What's next? The configure flag steps now all go through, which means the configure script is getting a lot further. But it does still fail, seemingly because the required version of NSPR isn't available.

NSPR stands for Netscape Portable Runtime and is one of the gecko build dependencies. According to the packaging, it:
 
“provides platform independence for non-GUI operating system facilities. These facilities include threads, thread synchronization, normal file and network I/O, interval timing and calendar time, basic memory management (malloc and free) and shared library linking.”

That's all pretty crucial stuff and as the error makes clear, the version installed in my SDK is too old:
 0:29.31 ERROR: Requested 'nspr >= 4.32' but version of NSPR is 4.29.0
Checking the version in the repositories I see the error is correct:
$ sfdk engine exec
[mersdk@f1c636fdd533 gecko-dev]$ sb2 -t SailfishOS-devel-aarch64.default
[SB2 sdk-build SailfishOS-devel-aarch64.default] gecko-dev $ zypper search --details nspr
Loading repository data...
Reading installed packages...

S  | Name             | Type       | Version                 | Arch    | Repository
---+------------------+------------+-------------------------+---------+-----------
i  | nspr             | package    | 4.29.0+git2-1.3.3.jolla | aarch64 | oss
   | nspr             | srcpackage | 4.29.0+git2-1.3.3.jolla | noarch  | oss
   | nspr-debuginfo   | package    | 4.29.0+git2-1.3.3.jolla | aarch64 | oss
   | nspr-debugsource | package    | 4.29.0+git2-1.3.3.jolla | aarch64 | oss
i+ | nspr-devel       | package    | 4.29.0+git2-1.3.3.jolla | aarch64 | oss
The spec file has a build requirement for NSPR, but it's set to be >= 4.25.0. It may be that we'll need to build an updated version of NSPR to get this to work. The requirement is coming from gecko-dev/build/moz.configure/nspr.configure. It's likely we won't be able to skip this requirement, but we can hack down the version requirement to see if we can still make progress in the meantime.

Changing the 4.32 to 4.29 in nspr.configure is nasty, but gets us past. We've still not reached the compilation step yet and the workaround will surely have to be addressed properly by the time we reach that point. But one step at a time. And the commit I've added makes clear what's going on, so I won't forget to fix this in the future.
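
To make the three version numbers concrete, here's a rough sketch of the comparison involved. It isn't how configure performs the check; version_at_least is a hypothetical helper, and it relies on the -V (version sort) option of GNU sort.

```shell
# version_at_least INSTALLED REQUIRED: succeeds when INSTALLED >= REQUIRED.
version_at_least() {
    # The smaller of the two versions sorts first; if that's REQUIRED,
    # then INSTALLED is at least as new.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

installed=4.29.0   # the version zypper reports in the SDK target
for required in 4.25.0 4.32; do
    if version_at_least "$installed" "$required"; then
        echo "nspr $installed satisfies >= $required"
    else
        echo "nspr $installed is too old for >= $required"
    fi
done
```

The installed package satisfies the spec file's requirement (4.25.0) but not the one in nspr.configure (4.32), which is why the failure only shows up at the configure step.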

Now we hit something more awkward:
 0:29.39 checking for rust host triplet...
 0:29.39 ERROR: The rust compiler host (i686-unknown-linux-gnu) is not suitable
         for the configure host (aarch64-unknown-linux-gnu).
 0:29.39 You can solve this by:
 0:29.39 * Set your configure host to match the rust compiler host by editing your
 0:29.39 mozconfig and adding "ac_add_options --host=i686-unknown-linux-gnu".
 0:29.39 * Or, install the rust toolchain for aarch64-unknown-linux-gnu, if
         supported, by running
 0:29.39 "rustup default stable-aarch64-unknown-linux-gnu"
Both scratchbox2 and rustc have cross-compilation capabilities. Getting them to work together for ESR78 was... let's just say it was challenging. We had to introduce some quite nasty workarounds. There's some nice advice from the error which will be worth trying, but it's quite late now, so something to try another day.
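
Before trying either suggestion it can help to confirm which host triple rustc reports. This is a small sketch that pulls the "host:" line out of the verbose version output of rustc -vV; rust_host is a hypothetical helper name, and the result will depend on which environment (inside or outside the build target) it's run in.

```shell
# Print the host triple from `rustc -vV` style output read on stdin.
rust_host() {
    awk '$1 == "host:" { print $2 }'
}

if command -v rustc >/dev/null 2>&1; then
    echo "rustc host: $(rustc -vV | rust_host)"
else
    echo "rustc not found on PATH"
fi
```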

The fact it gets this far already is positive. Things are moving forwards.

If you've read through all the other entries in my diary up to now, I can only thank you and tell you I'm impressed with your perseverance. If not, feel free to check them out on my Gecko-dev Diary page.

There'll be more tomorrow.
20 Aug 2023 : Reflections on Leaving Twitter #
Last month I left Twitter (just a week prior to it being renamed as X). I shut my account completely, posting on my blog about my reasons. As I explain there, although I approve of many of the recent changes made by X, such as the shift away from advertising revenue and towards a subscription-based funding model, there were several other changes that I couldn't accept. Chief among them was the closing off of access to users who aren't logged in with an account.
 
Twitter profile @flypigahoy: "This account doesn't exist"

That was before the news broke this week that X had been failing to act on accounts that it had itself identified as posting antisemitic content for "months". Be aware that the content is quite shocking before clicking on the Media Matters link, but this stood out to me from the original Media Matters report:
 
“The suspension came only after the company verified the account; allowed it to repeatedly post antisemitic content; and monetized it by placing advertisements for major brands on the account. X's monetization of the account also happened even though the company had reportedly acknowledged that the antisemitic account engaged in ‘violent speech.’”

I didn't talk about freedom of expression in my previous post, despite it being one of the big discussions that comes up around Twitter. That's because at the time it wasn't a major factor in my decision. But I want to make my position clear on this.

If I hadn't closed my account already, then the recent news would have made me close my account.

Freedom of expression is an important right. I believe in it as a principle. But it's also important to understand what freedom of expression actually means, at least to me. Freedom of thought entitles you to hold any view or opinion; freedom of expression entitles you to express these views. And they both protect you from persecution from the state based on these views.

To clarify, this is how freedom of expression is codified in Article 10 of the European Convention on Human Rights.
 
“Everyone has the right to freedom of expression. This right shall include freedom to hold opinions and to receive and impart information and ideas without interference by public authority and regardless of frontiers. This Article shall not prevent States from requiring the licensing of broadcasting, television or cinema enterprises.”

This isn't the full article and I recommend reading it in its entirety. Nevertheless it makes clear that the protection is aimed at preventing interference from the state. This is a legal and political document, not an ethical one, and as with all things of this nature it's more nuanced than this. But it serves as a useful illustration.

Crucially, freedom of expression doesn't enshrine the right for anyone to post any opinion on any platform. This would be absurd. It would prevent platforms being topic-specific (e.g. a forum dedicated to cakes, or policing), it would block any form of content moderation and it would forbid publications from having a political slant.

That doesn't mean that any particular platform must moderate content either. Twitter, or X, should be free to moderate content, or free not to. And I should be free to participate in such a platform, or not. That's my position.

However, I do accept that until recently Twitter's reach and the way it was being used meant it had acquired a public-service-like role in sharing information from government, commercial and social organisations, as well as from individuals. This role arguably attracted additional responsibilities to protect freedom of expression. In fact, that's a big part of the reason why I felt it was so important for Twitter to be accessible even without an account. But by restricting access it's now made clear that it doesn't want this role, so I don't feel the responsibility applies in quite the same way any more.

But I digress. The point I want to make is that if X had been taking its obligation to protect freedom of expression seriously, then I could understand its desire not to block accounts, even if they share content that I personally find repugnant. The fact is, X hasn't been taking this role seriously at all. There are multiple actions Twitter has taken recently that make this clear.

For example, in December Twitter started blocking tweets containing links to rival services. Earlier this month X filed a lawsuit against the Center for Countering Digital Hate in an attempt to silence criticism of X's content moderation policies. Just last week X started adding a five-second delay to hyperlinked external resources expressing views the company disagrees with. And most egregiously, in May Twitter also acquiesced to censorship requirements from the government of the Republic of Türkiye. These are all actions aimed at curtailing expression.

At the same time the company has been reinstating the accounts of those with known extreme views.

When juxtaposed like this, it looks far more like X is pushing a particular viewpoint than championing freedom of expression.

Personally I don't want any part of a platform that is openly promoting views antithetical to mine, while restricting views I agree with. I'd be happy contributing to a platform hosting views I disagree with if it were to uphold freedom of expression. But promoting extreme views while blocking moderate ones: that's something else entirely.
19 Aug 2023 : Day 3 #
Yesterday we continued to set up the repository and spec file to get the build properly going. But that means there are only two possibilities today:
  1. The build generated an error and stopped.
  2. The build completed successfully.
You might think we'd be nowhere near approaching 2 yet. And you'd be right.

Which means we've now hit our first real error during the build process. Here's what it looks like.
 0:18.65 checking whether cross compiling... no
 0:20.22 Traceback (most recent call last):
 0:20.22   File "${PROJECT}/configure.py", line 226, in <module>
 0:20.22     sys.exit(main(sys.argv))
 0:20.22   File "${PROJECT}/configure.py", line 50, in main
 0:20.22     sandbox.run(os.path.join(os.path.dirname(__file__), "moz.configure"))
 0:20.22   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 507, in run
 0:20.23     self._value_for(option)
 0:20.23   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 612, in _value_for
 0:20.23     return self._value_for_option(obj)
 0:20.23   File "${PROJECT}/python/mozbuild/mozbuild/util.py", line 1050, in method_call
 0:20.23     cache[args] = self.func(instance, *args)
 0:20.23   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 647, in _value_for_option
 0:20.23     value, option_string = self._helper.handle(option)
 0:20.23   File "${PROJECT}/python/mozbuild/mozbuild/configure/options.py", line 593, in handle
 0:20.23     ret = option.get_value(arg, origin)
 0:20.23   File "${PROJECT}/python/mozbuild/mozbuild/configure/options.py", line 482, in get_value
 0:20.24     raise InvalidOptionError(
 0:20.24 mozbuild.configure.options.InvalidOptionError: 'cairo-qt' is not one of 'cairo-gtk3', 'cairo-gtk3-wayland'
The reason for this is immediately clear to me, because this is something that applied to the ESR78 build as well. Basically, there are several widget targets that gecko can be built for: Gtk3, Windows, Darwin, Android and so on. In the past this list also included Qt, but sadly the option was removed back in 2016 (version 50, as you can see in Bugzilla bug 1282866). Here's the reasoning that's given in the Bugzilla entry:
 
“We need not just active code peers, but also active bug triagers and some commitment to continuous integration. Because nobody is committing to that, we're going to remove this code from the main tree and interested parties can maintain it separately if desired.”

But the Sailfish OS version needs this code, so since then Jolla has been reapplying the changes with a patch. As it happens the changes propagated beyond just this, and so there are a whole swathe of changes that are now reapplied in order to "bring back the Qt layer" as the patch is succinctly described.

The patch, which you can see in the repository for ESR78, makes significant changes: 110 files changed, 4888 insertions, 53 deletions, including 52 entirely new files. It's probably the biggest single change needed to get gecko working for Sailfish OS.

It also means that applying the patch is going to be tricky and, because of changes to the upstream code, it also has to be done manually.

The good news is that I'm on holiday and currently travelling from Cambridge to the Isle of Arran in Scotland. The London to Glasgow leg of the journey is an uninterrupted five-hour train journey. I have a comfortable seat, airplane-style table for my laptop and a power socket to keep the battery topped up. That should give me enough opportunity to apply the patch.

[...time passes...]

Well, reapplying the patch was laborious, unglamorous, but also uneventful. It took the entire five hour journey, but now at the end of my train trip the reapplication seems to have done the trick. Phew.

Arriving at the guest house where we're staying I still have a couple of hours in the evening to squash a few more bugs. So continuing on from here, the build now throws up the following.
 0:18.59 checking whether cross compiling... no
 0:20.20 Traceback (most recent call last):
 0:20.20   File "${PROJECT}/configure.py", line 226, in <module>
 0:20.20     sys.exit(main(sys.argv))
 0:20.20   File "${PROJECT}/configure.py", line 50, in main
 0:20.20     sandbox.run(os.path.join(os.path.dirname(__file__), "moz.configure"))
 0:20.20   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 554, in run
 0:20.20     raise InvalidOptionError(msg)
 0:20.20 mozbuild.configure.options.InvalidOptionError: Unknown option: --with-embedlite
The with-embedlite flag is coming from embedding/embedlite/config/mozconfig.merqtxulrunner and only appears to be needed for patch 0089 that adds a video decoder based on gecko-camera, used for WebRTC video calling. We'll hit this patch eventually, but trying to fix this before we have a working build is premature. To work around it, we just take the part of the patch that introduces the flag and add that back in. We'll address the rest of the patch later.

Next we hit this:
 0:18.59 checking whether cross compiling... no
 0:20.19 Traceback (most recent call last):
 0:20.19   File "${PROJECT}/configure.py", line 226, in <module>
 0:20.19     sys.exit(main(sys.argv))
 0:20.20   File "${PROJECT}/configure.py", line 50, in main
 0:20.20     sandbox.run(os.path.join(os.path.dirname(__file__), "moz.configure"))
 0:20.20   File "${PROJECT}/python/mozbuild/mozbuild/configure/__init__.py", line 554, in run
 0:20.20     raise InvalidOptionError(msg)
 0:20.20 mozbuild.configure.options.InvalidOptionError: Unknown option: --disable-marionette
The disable-marionette flag seems to have been removed. Marionette is useful for running unit tests, but we're not planning on doing that any time soon, so it's good to disable it. It seems that now the flag may have been superseded by the disable-webdriver flag instead. We add this in to mozconfig.merqtxulrunner; hopefully it will give an equally effective result.
#ac_add_options --disable-marionette
ac_add_options --disable-webdriver
No errors after adding in this disable-webdriver flag, so we can continue from here. Okay, that's it for today. It's been a long day of coding and travelling, but with positive results. There'll be more tomorrow.

For all the other entries in my developer diary, check out the Gecko-dev Diary page.
18 Aug 2023 : Day 2 #
Yesterday we set up our gecko-dev repository submodule to point to the ESR 91 code and updated the spec file with related parameters. We even tried building the project, although that didn't get very far.

So, now we're on to the work proper and have to try to address some of these errors. The very first error that comes up for me is the following:
This mach command requires /home/mersdk/.mozbuild/_virtualenvs/mach/bin/python, which wasn't found on the system!
Consider running 'mach bootstrap' or 'mach create-mach-environment' to create the mach virtualenvs, or set MACH_USE_SYSTEM_PYTHON to use the system Python installation over a virtualenv.
error: Bad exit status from /var/tmp/rpm-tmp.ho9Gok (%build)
This seems to be a change from the previous version used by Sailfish OS. Previously we set up python and its dependencies directly without use of a virtual environment. The Sailfish SDK builds inside a docker container with snapshots to ensure the main target doesn't get tainted by package installs, so using a virtual environment doesn't gain us much. Similarly on OBS, the build environment is recreated each build. But gecko wants to create one, so we may as well let it.

So, first we add python3-devel to the spec file as a build dependency (it gets used during the build, but not when the browser is actually running). We already have python3 there, but the development package is needed for the virtual environment. Next, we just follow the advice in the error, by adding the following into the build section of the spec file:
./mach create-mach-environment
I've added it just before the main build call:
./mach create-mach-environment
./mach build -j$RPM_BUILD_NCPUS
Maybe this new addition could go in the prep section instead? For now, it works in the build section and running it afresh for each build doesn't seem to do any harm, other than add some additional time to the build. If there are ways to optimise this, we can come back to it later.

I notice that further up in the build section we have the following:
# hack for when not using virtualenv
ln -sf "%BUILD_DIR"/config.status $PWD/build/config.status
Now I'm wondering whether this should be removed. Again, keeping it doesn't seem to do any harm, but it may now be creating the link unnecessarily. This will be easy to check by just removing the line and building to see what happens. However the testing will be much easier and more robust when more of the build process is working. If I do the test now, problems evident later on in the build process might not get exposed.

So, let's stick a pin in this and return to it later.

I've created the virtual environment, so the build should get a bit further. Let's see.
sfdk build -d --with git_workaround
Note how I've dropped the -p flag. This flag is used to trigger the prepare step of the build process as defined in the spec file. We already successfully completed the prepare step and we've not made any changes to it, so we can skip it for future runs. In the future skipping the prepare step will be essential in order to avoid applying the patches on top of already patched code (which will result in errors). We don't have any patches right now, so we don't actually have to worry about this, but it's good practice anyway.

Now the build does get further... but not by much. We now have an error during the creation of the virtual environment.
Collecting glean_sdk==36.0.0
  Using cached glean-sdk-36.0.0.tar.gz (2.2 MB)
    ERROR: Command errored out with exit status 1:
     command: /home/mersdk/.mozbuild/_virtualenvs/mach/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-1xv51c_9
         cwd: /tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/setup.py", line 16, in <module>
        from setup import *
      File "/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/glean-core/python/setup.py", line 47, in <module>
        FROM_TOP = PYTHON_ROOT.relative_to(Path.cwd())
      File "/home/flypig/Programs/sailfish-sdk/sailfish-sdk/mersdk/targets/SailfishOS-devel-aarch64.default/usr/lib64/python3.8/pathlib.py", line 908, in relative_to
        raise ValueError("{!r} does not start with {!r}"
    ValueError: '/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5/glean-core/python' does not start with '/tmp/sb2-mersdk-20230809-221116.wepDIc/tmp/pip-install-xlikpl2b/glean-sdk_9db1f9a0519b4c968d1e22841ac4a8f5'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Could not install glean_sdk, so telemetry will not be collected. Continuing.
As it happens this error looks pretty bad, but doesn't seem to be fatal. We're not worried about build telemetry at this stage (maybe not ever?) so can we get away with ignoring this? It certainly doesn't look like a priority to fix it right now as the build continues irrespective, but maybe this will come back to haunt us.

I'm going to stick a pin in this one too. Let's hope it won't cause problems later, but if it does it should become pretty clear. Even if it doesn't cause problems we should also figure out a way to avoid the error by configuring the build to not even attempt to install glean_sdk.

Hopefully you're already getting a feel for my modus operandi. The objective is to get the build to complete as quickly as possible and by whatever means necessary. This means not worrying too much about the details or aesthetics of what we're doing. All of this can be neatly ironed out later when things are building and when it's easier to work in parallel.

The next error is the following.
 0:12.85 checking for vcs source checkout... no
 0:12.97 ERROR: Cannot find project mobile/sailfishos
This looks serious, but the good news is that this is easily fixed by adding a symlink from mobile/sailfishos pointing to the ../../embedding/embedlite directory.

How do I know this? It's because this is actually the action of patch 0001 in the original project from the ESR78 build. The patch already works as-is, so I can just apply it to restore the symlink. After restoring the link the build continues.
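
For anyone wanting to see how that relative link resolves, here's a minimal sketch in a scratch directory. The layout is an assumption inferred from the ../../ in the link target: the embedding directory sitting alongside the gecko-dev submodule, with the link itself living inside the submodule. The moz.build file is just a placeholder to prove the link works.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Assumed layout: embedding/ next to the gecko-dev submodule.
mkdir -p embedding/embedlite gecko-dev/mobile
touch embedding/embedlite/moz.build

# The target is resolved relative to the directory holding the link
# (gecko-dev/mobile), so ../../ climbs back to the top of the checkout.
ln -s ../../embedding/embedlite gecko-dev/mobile/sailfishos

ls -l gecko-dev/mobile/sailfishos
ls gecko-dev/mobile/sailfishos/
```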

It's worth noting that I'm not going to add the patch back in at this stage. Instead, I'll record it as a commit in my personal mirror of the upstream code (so strictly speaking it will no longer be a mirror). This is something we discussed on Day 1, in case you want a refresher on why I'm doing it this way.

After making this change, things are looking even better again, so this seems like a good place to pause for the day. The build gets further and then triggers an error, which I'll look in to tomorrow.

Don't forget you can read all the other developer diary entries on my Gecko-dev Diary page.
17 Aug 2023 : Day 1 #
So this is Day 1 of my Gecko-dev Diary. It's a developer diary about upgrading the Gecko rendering engine in the Sailfish OS browser from ESR 78 to ESR 91. I've posted a Preamble to give some background in case you've not already read it.

The Sailfish Browser is actually split across multiple repositories:
  1. gecko-dev: this is the one I'll initially be mostly interested in. It's the wrapper around the upstream gecko code.
  2. gecko-dev mirror: this is just the sailfish mirror that's submoduled into gecko-dev. It's identical to the upstream mozilla repository of the same name.
  3. sailfish-browser: the main application wrapper for the Sailfish Browser
  4. qtmozembed: the EmbedLite wrapper for using gecko-dev with Qt
  5. embedlite-components: integration code (mostly JavaScript) that runs in gecko to support the browser
  6. sailfish-components-webview: A Silica component for using the engine in other apps
To begin with I'm just interested in the first two of these: the gecko-dev engine itself. If things go well, we'll need to move on to making changes in the other repositories as well.

Initially I'll actually be working in my own gecko-dev and gecko-dev-mirror forks. Although I'll be committing to the mirror, eventually these will all get converted to patches in the gecko-dev repo.

For the first steps I just pulled in the latest ESR78.1 changes to my local machine and checked everything builds. This takes some time. I'm never exactly sure how long it takes to build because there are some frustrating synchronisation issues that we never managed to fix, which means that the build can get stuck and hang. You can never be certain whether the build is still running or has just halted indefinitely.

As a result, to get through a full build it generally needs to be cancelled and restarted multiple times. Gecko supports incremental builds, so it's not as painful as it would be otherwise, but it still makes it hard to determine exactly how long a full build takes. In my estimation it's between five and eight hours on my development machine (14 cores, 16 threads, 32 GiB RAM).

The build works, so my development environment is set up correctly.

Next up I have to find the correct version of the next Extended Support Release from the upstream mirror. ESR91.9 has been neatly tagged; it looks like the latest ESR91 tag, so I'll start with that.

All I have to do initially is update the submodule to point to this tag. But since I don't want to be working with patches just yet, instead I'm going to add commits to the mirror and update the submodule commit to pull them in.

Committing changes to the mirror might sound like a terrible idea, but it will make things a lot easier during development. Instead of creating patches, which are horrible to work with, I'll instead commit my changes and update the submodule. Once things have settled I can turn all the commits into patches using git format-patch.
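
As a rough sketch of that last step, here's the commit-then-export workflow on a throwaway repository. The repository, the upstream-base tag and the commit messages are all stand-ins; on the real mirror the range would start from the upstream ESR tag instead.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
git init -q mirror
cd mirror || exit 1

# Identity is set per command so the demo doesn't rely on global git config.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

echo "upstream" > file.txt
git add file.txt
commit -m "Upstream state"
git tag upstream-base          # hypothetical stand-in for the ESR tag

echo "sailfish change" >> file.txt
commit -am "Add Sailfish OS change"

# Writes one numbered patch file per commit made since the tag.
git format-patch -o ../patches upstream-base..HEAD
```

Each commit after the tag becomes its own numbered file, so keeping commits small and well described during development pays off when they're turned into patches.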

I created a new branch called FIREFOX_ESR_91_9_X_RELBRANCH_patches where all of my changes can live, and have created a sailfishos-esr91 gecko-dev branch with the submodule pointing to it.

The existing patches won't apply to this new codebase, so the first thing to do is disable them all in the spec file.

And now I just check that the build... well, check that it starts. Without any of the patches it's not going to get very far before it crashes out, but it needs to get some way so I can check the errors and apply any patches that are needed. I'll potentially have to apply them manually.

An important thing to note at this point is that, when building locally, we use the --with git_workaround flag to avoid confusing the Rust toolchain:
sfdk build -d -p --with git_workaround
What this flag does is rename the .git folder inside the gecko-dev submodule to .git-disabled. When the build is successful we can ignore the consequences of this since it's renamed back again at the end.

However, if the build fails it will leave our local repository in an inconsistent state: the git status of the submodule will be missing. We therefore have to remember to rename the .git-disabled folder back to .git before doing any git operations on the submodule, which we will be doing because we're committing our changes there. It's just another thing to bear in mind.
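A small helper can make that recovery step harder to forget. The sketch below is only an assumption about how one might script it and is not part of the sfdk tooling: the function name restore_git_dir is invented, and it expects the path to the submodule as its only argument.

```shell
#!/bin/sh
# Hypothetical helper: undo the rename left behind by a failed build.
# Usage: restore_git_dir path/to/submodule
restore_git_dir() {
    dir=$1
    # Only act if the disabled folder exists and .git is genuinely missing,
    # so that a healthy checkout is never overwritten.
    if [ -e "$dir/.git-disabled" ] && [ ! -e "$dir/.git" ]; then
        mv "$dir/.git-disabled" "$dir/.git"
        echo "restored $dir/.git"
    else
        echo "nothing to restore in $dir"
    fi
}
```

Running it after a failed build, with the submodule path as the argument, puts the folder back; running it again is harmless because it does nothing when .git is already present.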
 
The gecko-dev build fails

Well, the build starts. It doesn't get very far, but it does start. That's enough for today; tomorrow I'll try to tackle this first build error I'm hitting.

You can read all the other developer diary entries on my Gecko-dev Diary page.
16 Aug 2023 : Preamble #
In my previous job working for Jolla one of the projects I worked on was the browser upgrade. The Sailfish Browser uses Mozilla's Gecko Web engine — the same engine used in Firefox — but optimised for embedded devices. The browser is a key component on any consumer-oriented phone. It's likely one of the most used applications, and on a phone that doesn't have the same level of reach for application development as Android or iOS, it can be a critical way of accessing services that might otherwise need an app. On Sailfish OS the browser is really important and it was a genuine privilege to get to work on it.

To be effective the browser needs to be fast at rendering and compatible with the widest possible pool of websites. Basing the Sailfish Browser on the same engine as Firefox helps achieve both of these. Compatibility isn't perfect because of the crazy hoops websites will jump through to try to block anything even slightly non-standard, but it's as close as we can hope for.

But to retain this compatibility we also need to use an up-to-date version of the engine. That means keeping it updated to a recent release.

Beyond this it also needs a decent array of features, things that users have grown to expect such as password management, tab support, privacy features and a whole lot more. Providing these requires deep integration with the gecko engine, but most of them won't be fully implemented by the engine alone.

So maintaining a browser is a lot of work, and requires extended commitment.

During my time at Jolla we upgraded gecko first from ESR 60 to ESR 68, then from ESR 68 to ESR 78. Just to explain briefly, ESR stands for "Extended Support Release". These releases retain support for a longer period of time compared to the standard releases, which means that they get bug fixes but not new features. Adding new features also tends to add bugs and introduce regressions, so the idea is that the ESR releases will acquire fewer bugs and greater stability over time, compared to what you might expect when tracking the latest release.

ESR releases tend to be supported for at least a year. It would be great if the Sailfish Browser could be updated on every gecko release, but unfortunately this just isn't realistic. A good compromise is to aim to keep it up-to-date with at least the most recent stable ESR.

Since ESR 78 Mozilla has released ESR 91, ESR 102 and ESR 115 versions of Gecko. The last of these is still in development, so we'd really want the Sailfish Browser to be at 102 right now. So we're actually two ESR releases behind.

Upgrading gecko for Sailfish OS is a big job. Just checking the git history I can see that the upgrade from ESR 68 to ESR 78 took in the region of nine months' work from a team of contributors. Like I said, it's a big job.

While I was at Jolla the browser team was headed up by Raine. Raine has a boatload of experience working with the gecko engine and was a superb team leader. Throughout my time working with him he went out of his way to share his knowledge of the browser and to encourage us to share ours with others. He was motivated to ensure we didn't get too reliant on any one developer. In my view, this is the mark of an excellent software dev lead, which he is. His view was that upgrading more than one ESR at a time was bound to fail: as a task it's just too complex. Nothing I've seen of gecko makes me doubt his assessment.

As such, the next step for the Sailfish Browser should be to update it from the current version to ESR 91. After that we can think about moving it to ESR 102 which is where we'd like it ideally to be right now.

Patches

You might reasonably wonder why building gecko for Sailfish OS is quite such a big job. After all, gecko runs on Linux and Sailfish OS is really just a Linux variant. Shouldn't it just compile straight away?

That's true up to a point, but there are three significant catches with this.

First, gecko has quite a lot of toolkit-specific code, whether that be for Gtk, Windows, Cairo or Android. Historically there was also the EmbedLite API that allowed the renderer to be embedded in a native user interface, including using Qt in the form of QtMozEmbed. The Sailfish Browser is built on this. Sadly Gecko dropped official support for this and Jolla has been the de facto maintainer of the code since 2015 or thereabouts. Maintaining this involves keeping the Sailfish OS changes in line with upstream gecko changes, which is a big chunk of the work required when updating to a new version for Sailfish OS.

Second, Gecko uses its own build scaffolding written in Python called Mach. Different parts of Gecko use different languages and tooling, mostly C++, Rust and JavaScript using gcc, clang and rustc. If you're a Sailfish OS developer you'll know the standard Sailfish OS development toolkit is a cross-platform SDK based on scratchbox2. Getting the two to work together can be intricate. For example, some Rust components form part of the build process while others get linked into the final executable. The former must be built for the host platform whilst the latter must be built for the target platform. Getting the gecko build process and sb2 to agree on which is which can be troublesome. On top of all that, Gecko has a bunch of build and runtime dependencies which need updating each release as well.

Third, there are a large number of Sailfish-specific integrations that go beyond the rendering engine. These include integrations for searching within pages, saving pages to PDF, playing audio, rendering video, capturing audio and video, and storing history, bookmarks and passwords. The browser is more than just the rendering engine, and front-end functionality invariably has deep links into the backend gecko engine.

So, to cut a long story short, just getting an updated gecko to compile requires time and effort. Another piece of wisdom that Raine taught me is that the first task of upgrading the engine should always be to get it to compile. Once it's compiling, getting it to actually run on a phone, patching all of the regressions and fixing up all the integrations can follow. But without a compiling build there's no point in spending time on these other parts.

My plan is to follow his advice. I'm therefore going for a three-stage process with this upgrade:
  1. Apply a minimal set of changes and patches to get ESR 91 to build using the Sailfish SDK.
  2. Apply any remaining patches where possible and other changes to get it to run and render.
  3. Handle the Sailfish OS specific integrations.
The first stage will be a laborious step-by-step process: attempt a build, hit a build error, fix the build error (without testing how it affects the runtime execution), then rinse and repeat until the build fully succeeds. Just running the build can take several hours (between five and eight on my development machine) so it's a slow process, even with incremental builds.

So this could take some time, but my plan is to write about each of the steps as I go through them.

These might not make for the most fascinating posts, but they'll serve as a useful personal record of my progress and maybe they'll be of interest to others too.

So that's the scene set. In the next post I'll start to do some actual real work, test out the build and see where that gets us. I'll aim to write as often as possible and if you're following along I hope you find it interesting.

One last thing: the Sailfish OS community is one of the best. I know there will be Sailfish OS users and developers itching to help and get this project across the line. And if this is going to work, I'll need that help. Once there's a working build that others can try out, it'll become a whole lot easier to cooperate on the code and test things out. If this is something you're interested in contributing to, please watch this space.

To read all of my posts on this topic, please see my Gecko-dev Diary page.