Channel: Planet GStreamer

Víctor Jáquez: GStreamer-VAAPI 1.16 and libva 2.6 in Debian


Debian has migrated libva 2.6 into testing. This release includes a pull request that changes how drivers are selected for loading and use. As the pull request mentions:

libva will try to load iHD firstly, if it failed. then it will load i965.

Also, Debian testing has imported the iHD driver in two flavors: intel-media-driver and intel-media-driver-non-free. So basically iHD is now the main VAAPI driver for Intel platforms, though it only supports newer chips; older ones still require i965-va-driver.

Sadly, for the current GStreamer-VAAPI stable release, the iHD driver is not included in its driver white list. This poses a problem for users who have installed either of the intel-media-driver packages because, by default, that driver is ignored and the VAAPI GStreamer elements won’t be registered.

There are three temporary workarounds (mutually exclusive) for those users (updated):

  1. Uninstall intel-media-driver* and install (or keep) the old i965-va-driver-shaders/i965-va-driver.
  2. Export LIBVA_DRIVER_NAME=i965 by default in your session. Normally this is done by adding the export to your $HOME/.profile file. This environment variable forces libva to load the i965 driver.
  3. And finally, export GST_VAAPI_ALL_DRIVERS=1 by default in your sessions. This is not advised, since many applications, such as Epiphany, might fail.

We prefer not to include iHD in the stable white list because most of the work on that driver has happened after the 1.16 release.

In the case of the GStreamer-VAAPI master branch (actively in development), we have merged iHD into the white list, since the Intel team has been working hard to make it work. That change will be released with GStreamer 1.18.


Guillaume Desmottes: Rust/GStreamer paid internship at Collabora


Collabora is offering various paid internship positions for 2020. We have a nice range of very cool projects involving kernel work, Panfrost, Monado, etc.

I'll be mentoring a GStreamer project aiming to write a Chromecast sink element in Rust. It would be a great addition to GStreamer and would give the student a chance to learn not only about our favorite multimedia framework but also about bindings between C GObject code and Rust.

So if you're interested, don't hesitate to apply, or contact me if you have any questions.

Gustavo Orrillo: Special collections’ design process


The visualization we developed for the Network of Libraries of the Bank of the Republic of Colombia is a web tool that allows users to create their own search paths through the special documents and collections available in the libraries. This project took around 6 months of work, from initial research and sketching to the final product that is currently in use. It gave us a unique opportunity to apply novel frameworks and technologies for web development, such as p5.js, to make the rich cultural heritage deposited at the Network of Libraries more easily accessible to a wide range of users, from occasional visitors to expert researchers. This post shows some of the visual materials and concepts that inspired the tool, as well as design sketches and prototypes.

Early UI sketch

Background

The Network of Libraries of the Bank of the Republic is the depository of more than thirty-five historical archives that constitute a primary source for the reconstruction of the history of Colombia. Researchers and historians can consult most of these archives in the Room of Rare Books and Manuscripts of the Luis Ángel Arango Library; however, some of these archives are available at other cultural centers of the Bank of the Republic around the country.

When we started working on this project, several of the materials were available in digital form through a web portal where the information was organized in various thematic groupings and by document type. Analysis of this portal revealed the following features:

  • Most of the data had already been digitized
  • A fraction of the contents were available through a “pre-made” timeline visualization implemented with an existing JavaScript library.
  • Another part of the contents were available through different interfaces (lists, maps, etc.)
  • There was little connection between the contents; each topic had to be visualized on its own separate page

The major aims of the interactive visualization to be developed were the following:

  • to give more holistic access to the contents of the site
  • to emphasize relationships between separate themes and types of materials, and to make it easier to find contents and understand their context

Time-based data

As the special documents and collections had a strong temporal dimension, and timelines were used in the original version of the website, we started exploring timeline-based visualizations of the data. The slideshow below contains some previous projects we considered in our research:

  • A timeline of history - http://histography.io/
  • World Digital Library Timelines - https://www.wdl.org/en/timelines/
  • Interactive Blog Calendar - https://eagereyes.org/blog-calendar
  • Summit on the Summit
  • Annual Reports - https://fathom.info/reports

Visualizations using the timeline metaphor.

Two issues with using timelines to visualize the collections from the Network of Libraries were that the data is sparsely populated, and that some items could cover a very wide time range, for example 50 years or even more. So the problem became how to handle such large intervals properly in a traditional timeline. Because of this issue, we looked into more dynamic approaches. The interactive timelines in the New Cooper Hewitt Experience at the Smithsonian Cooper Hewitt Museum in New York:

and the Timeline of Modern Art at Tate Modern in London:

are great examples where a museum’s collections form a flowing stream from which visitors can select and manipulate the items they find interesting.

Search processes

Another concept that we considered early on was that of serendipitous search, as suggested by the following picture of a library user browsing the shelves:

Browsing the shelves

To us, this idea of serendipitous search resonated with Psychogeography, the term coined in the 1950s by the Situationists and denoting the “exploration of urban environments that emphasizes playfulness and drifting”. Can we transfer this practice into the visualization of cultural materials to create some kind of “Librageography” or “Bibliogeografía”, which prioritizes playful search? Some visual inspiration related to these concepts:

In particular, the psychogeographic concept of drifting led to the idea of a visual exploration where each user constructs their own map of the data by wandering through the connections between the data elements, defined by the common tags shared by the collections. The following early sketch offers a glimpse of these schemes:

First hand-drawn sketch

While a drifting navigation through the collections could provide an engaging and playful experience for occasional users, the visualization should also offer tools for the more directed search that advanced users may need when researching the data or looking for specific information. A first mockup of the web viewer incorporates such tools:

First mockup of the web viewer

Refining the design

Once we identified an initial visual metaphor and navigation mechanism, it was time to start iterating on the early sketches while discussing progress with the team from the Virtual Library of the Bank of the Republic, which was in charge of the project.

The prototype supported mobile phones from the beginning, as it was very important that the web visualizer be accessible on both desktop and mobile:

Mobile prototype

In parallel with the design of the main visualization screen, we also started sketching the intro screen. This is also very important, as the intro screen is the entry point to the visualization, and it may dissuade users, especially newcomers, from staying on the page and exploring the data:

Sketches of intro screen

We developed a working prototype based on those initial designs and sketches, which allowed us to test the basic user flows and iterate the designs to obtain feedback quickly:

Once we agreed on the overall modes of introduction, presentation and interaction, we started refining the visual appearance of the prototype through color, textures, text, animation, and the mutual relationships of all the elements:

Work on the UI was also ongoing at this stage:

After a few additional rounds of design iterations, the color palette and UI improved significantly, and at that stage we were approaching the final version, with only a few minor tweaks left:

Responsive design

With a great majority of users navigating the web on their phones, a big priority for us was to ensure that the viewer worked well on mobile browsers. In the end, we were able to keep the functionality and appearance consistent across desktop, iOS, and Android, while dealing with the differences in screen size and interaction modalities (i.e., touch vs. mouse):

Conclusions

The Virtual Library of the Bank of the Republic released the new Special Documents and Collections portal, including the interactive viewer, in November 2019. We are very satisfied with the design process described here as well as with the final product. We believe that we created a useful tool for cultural promotion and research in the humanities, and we hope to gain insight into user engagement with the viewer based on the analytics collected by the portal.

Andy Wingo: lessons learned from guile, the ancient & spry


Greets, hackfolk!

Like just about every year, last week I took the train up to Brussels for FOSDEM, the messy and wonderful carnival of free software and of those that make it. Mostly I go for the hallway track: to see old friends, catch up, scheme about future plans, and refill my hacker culture reserves.

I usually try to see if I can get a talk or two in, and this year was no exception. First on my mind was the recent release of Guile 3. This was the culmination of a 10-year plan of work and so obviously there are some things to say! But at the same time, I wanted to reflect back a bit and look at the past with a bit of distance.

So in the end, my one talk was two talks. Let's start with the first one. (I'm trying a new thing where I share my talks as blog posts. We'll see how this goes. I know the rendering can be a bit off relative to the slides, but hopefully it's good enough. If you prefer, you can just watch the video instead!)

Celebrating Guile 3

FOSDEM 2020, Brussels

Andy Wingo | wingo@igalia.com

wingolog.org | @andywingo

So yeah let's celebrate! I co-maintain the Guile implementation of Scheme. It's a programming language. Guile 3, in summary, is just Guile, but faster. We added a simple just-in-time compiler as well as a bunch of ahead-of-time optimizations. The result is that it runs faster -- sometimes by a lot!

In the image above you can see Guile 3's performance on a number of microbenchmarks, relative to Guile 2.2, sorted by speedup. The baseline is 1.0x as fast. You can see that besides the first couple microbenchmarks where things are a bit inconclusive (click for full-size image), everything gets faster. Most are at least 2x as fast, and one benchmark is even 32x as fast. (Note the logarithmic scale on the Y axis.)

I only took a look at microbenchmarks at the end of the Guile 3 series; before that, I was mostly going by instinct. It's a relief to find out that in this case, my instincts did align with improvement.

mini-benchmark: eval

(primitive-eval '(let fib ((n 30))
    (if (< n 2)
        n
        (+ (fib (- n 1)) (fib (- n 2))))))

Guile 1.8: primitive-eval written in C

Guile 2.0+: primitive-eval in Scheme

Taking a look at a more medium-sized benchmark, let's compute the 30th fibonacci number, but using the interpreter instead of compiling the procedure. In Guile 2.0 and up, the interpreter (primitive-eval) is implemented in Scheme, so it's a good test of an important small Scheme program.

Before 2.0, though, primitive-eval was actually implemented in C. This had a number of disadvantages, notably that it prevented tail calls between interpreted and compiled code. When we switched to a Scheme implementation of primitive-eval, we knew we would have a performance hit, but we thought that we would gain it back eventually as the compiler got better.

As you can see, it took a while before the compiler and run-time improved to the point that primitive-eval in Scheme reached the speed of its old hand-tuned C implementation, but for Guile 3, we finally got there. Note again the logarithmic scale on the Y axis.

macro-benchmark: guix

guix build libreoffice ghc-pandoc guix \
  --dry-run --derivation

7% faster

guix system build config.scm \
  --dry-run --derivation

10% faster

Finally, taking a real-world benchmark, the Guix package manager is implemented entirely in Scheme. All ten thousand packages are defined in Scheme, the building scripts are in Scheme, the initial RAM disk is in Scheme -- you get the idea. Guile performance in Guix can have an important effect on user experience. As you can see, Guile 3 lowered elapsed time for some operations by around 10 percent or so. Of course there's a lot of I/O going on in addition to computation, so Guile running twice as fast will rarely make Guix run twice as fast (Amdahl's law and all that).
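To make that Amdahl's-law remark concrete, here is a rough back-of-the-envelope sketch (the 30% compute share is an invented figure for illustration, not a measurement of Guix):

# Back-of-the-envelope Amdahl's law: if only a fraction of the elapsed time
# is Guile computation, doubling Guile's speed yields a much smaller
# end-to-end speedup.  The 0.3 compute share is invented for illustration.
def overall_speedup(compute_fraction, compute_speedup):
    # Total time before = 1.0; the I/O part is unchanged, the compute part shrinks.
    return 1.0 / ((1.0 - compute_fraction) + compute_fraction / compute_speedup)

print(overall_speedup(0.3, 2.0))  # ~1.18x overall, even with a 2x faster Guile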

spry /sprī/

  • adjective: active; lively

So, when I was thinking about words that describe Guile, the word "spry" came to mind.

spry /sprī/

  • adjective: (especially of an old person) active; lively

But actually when I went to look up the meaning of "spry", Collins Dictionary says that it especially applies to the agèd. At first I was a bit offended, but I knew in my heart that the dictionary was right.

Lessons Learned from Guile, the Ancient & Spry

FOSDEM 2020, Brussels

Andy Wingo | wingo@igalia.com

wingolog.org | @andywingo

That leads me into my second talk.

guile is ancient

2010: Rust

2009: Go

2007: Clojure

1995: Ruby

1995: PHP

1995: JavaScript

1993: Guile (27 years before 3.0!)

It's common for a new project to be lively, but Guile is definitely not new. People have been born, raised, and earned doctorates in programming languages in the time that Guile has been around.

built from ancient parts

1991: Python

1990: Haskell

1990: SCM

1989: Bash

1988: Tcl

1988: SIOD

Guile didn't appear out of nothing, though. It was hacked up from the pieces of another Scheme implementation called SCM, which itself was initially based on Scheme in One Defun (SIOD), back before the Berlin Wall fell.

written in an ancient language

1987: Perl

1984: C++

1975: Scheme

1972: C

1958: Lisp

1958: Algol

1954: Fortran

1930s: λ-calculus (90 years ago!)

But it goes back further! The Scheme language, of which Guile is an implementation, dates from 1975, before I was born; and you can, if you choose, trace the lines back to the lambda calculus, created in the mid-30s as a notation for computation. I suppose at this point I should say mid-1930s, to disambiguate.

The point is, Guile is old! Statistically, most software projects from olden times are now dead. How has Guile managed to survive and (sometimes) thrive? Surely there must be some lesson or other that can be learned here.

ancient & spry

Men make their own history, but they do not make it as they please; they do not make it under self-selected circumstances, but under circumstances existing already, given and transmitted from the past.

The tradition of all dead generations weighs like a nightmare on the brains of the living. [...]

Eighteenth Brumaire of Louis Bonaparte, Marx, 1852

I am no philosopher of history, but I know that there are some ways of looking at the past that do not help me understand things. One is the arrow of enlightened progress, in which events exist in a causal chain, each producing the next. It doesn't help me understand the atmosphere, tensions, and possibilities inherent at any particular point. I find the "progress" theory of history to be an extreme form of selection bias.

Much more helpful to me is the Hegelian notion of dialectics: that at any given point in time there are various tensions at work. In our field, an example could be memory safety versus systems programming. These tensions create an environment that favors actions that lead towards resolution of the tensions. It doesn't mean that there's only one way to resolve the tensions, and it's not an automatic process -- people still have to do things. But the tendency is to ratchet history forward to a new set of tensions.

The history of a project, to me, is then a process of dialectic tensions and resolutions. If the project survives, as Guile has, then it should teach us something about the way this process works in practice.

ancient & spry

Languages evolve; how to remain minimal?

Dialectic opposites

  • world and guile

  • stable and active

  • ...

Lessons learned from inside Hegel’s motor of history

One dialectic is the tension between the world's problems and what tools Guile offers to understand and solve them. In 1993, the web didn't really exist. In 2033, if Guile doesn't run well in a web browser, probably it will be dead. But this process operates very slowly, for an old project; Guile isn't built on CORBA or something ephemeral like that, so we don't have very much data here.

The tension between being a stable base for others to build on, and in being a dynamic project that improves and changes, is a key tension that this talk investigates.

In the specific context of Guile, and for the audience of the FOSDEM minimal languages devroom, we should recognize that for a software project, age and minimalism don't necessarily go together. Software gets features over time and becomes bigger. What does it mean for a minimal language to evolve?

hill-climbing is insufficient

Ex: Guile 1.8; Extend vs Embed

One key lesson that I have learned is that the strategy of making only incremental improvements is a recipe for death, in the long term. The natural result is that you reach what you perceive to be the most optimal state of your project. Any change can only make it worse, so you stop moving.

This is what happened to Guile around version 1.8: we had taken the paradigm of the interpreter as language implementation strategy as far as it could go. There were only around 150 commits to Guile in 2007. We were stuck.

users stay unless pushed away

Inertial factor: interface

  • Source (API)

  • Binary (ABI)

  • Embedding (API)

  • CLI

  • ...

Ex: Python 3; local-eval; R6RS syntax; set!, set-car!

So how do we make change, in such a circumstance? You could start a new project, but then you wouldn't have any users. It would be nice to change and keep your users. Fortunately, it turns out that users don't really go away; yes, they trickle out if you don't do anything, but unless you change in an incompatible way, they stay with you, out of inertia.

Inertia is good and bad. It does conflict with minimalism as a principle; if you were to design Scheme in 2020, you would not include mutable variables or even mutable pairs. But they are still with us because if we removed them, we'd break too many users.

Users can even make you add back things that you had removed. In Guile 2.0, we removed the capability to evaluate an expression at run-time within the lexical environment of an expression, as we didn't know how to implement this outside an interpreter. It turns out this was so important to users that we had to add local-eval back to Guile, later in the 2.0 series. (Fortunately we were able to do it in a way that layered on lower-level facilities; this approach reconciled me to the solution.)

you can’t keep all users

What users say: don’t change or remove existing behavior

But: sometimes losing users is OK. Hard to know when, though

No change at all == death

  • Natural result of hill-climbing

Ex: psyntax; BDW-GC mark & finalize; compile-time; Unicode / locales

Unfortunately, the need to change means that sometimes you will lose users. It's either a dead project, or losing users.

In Guile 1.8, for example, the macro expander ran lazily: it would only expand code the first time it ran it. This was good for start-up time, because not all code is evaluated in the course of a simple script. Lazy expansion allowed us to start doing important work sooner. However, this approach caused immense pain to people that wanted "proper" Scheme macros that preserved lexical scoping; the state of the art was to eagerly expand an entire file. So we switched, and at the same time added a notion of compile-time. This compromise kept good start-up time while allowing fancy macros.

But eager expansion was a change. Users that relied on side effects from macro expansion would see them at compile-time instead of run-time. Users of old "defmacros" that could previously splice in live Scheme closures as literals in expanded source could no longer do that. I think it was the right choice but it did lose some users. In fact I just got another bug report related to this 10-year-old change last week.

every interface is a cost

Guile binary ABI: libguile.so; compiled Scheme files

Make compatibility easier: minimize interface

Ex: scm_sym_unquote, GOOPS, Go, Guix

So if you don't want to lose users, don't change any interface. The easiest way to do this is to minimize your interface surface. In Go, for example, they mostly haven't had dynamic-linking problems because that's not a thing they do: all code is statically linked into binaries. Similarly, Guix doesn't define a stable API, because all of its code is maintained in one "monorepo" that can develop in lock-step.

You always have some interfaces, though. For example, Guix can't change its command-line interface from one day to the next, because users would complain. But it's been surprising to me the extent to which Guile has interfaces that I didn't consider. Recently for example in the 3.0 release, we unexported some symbols by mistake. Users complained, so we're putting them back in now.

parallel installs for the win

Highly effective pattern for change

  • libguile-2.0.so

  • libguile-3.0.so

https://ometer.com/parallel.html

Changed ABI is new ABI; it should have a new name

Ex: make-struct/no-tail, GUILE_PKG([2.2]), libtool

So how does one do incompatible change? If "don't" isn't a sufficient answer, then parallel installs is a good strategy. For example in Guile, users don't have to upgrade to 3.0 until they are ready. Guile 2.2 happily installs in parallel with Guile 3.0.

As another small example, there's a function in Guile called make-struct (old doc link), whose first argument is the number of "tail" slots, followed by initializers for all slots (normal and "tail"). This tail feature is weird and I would like to remove it. Unfortunately I can't just remove the argument, so I had to make a new function, make-struct/no-tail, which exists in parallel with the old version that I can't break.

deprecation facilitates migration

__attribute__ ((__deprecated__))

(issue-deprecation-warning
 "(ice-9 mapping) is deprecated."
 "  Use srfi-69 or rnrs hash tables instead.")

scm_c_issue_deprecation_warning
  ("Arbiters are deprecated.  "
   "Use mutexes or atomic variables instead.");

begin-deprecated, SCM_ENABLE_DEPRECATED

Fortunately there is a way to encourage users to migrate from old interfaces to new ones: deprecation. In Guile this applies to all of our interfaces (binary, source, etc). If a feature is marked as deprecated, we cause its use to issue a warning, ideally at compile-time when users responsible for the package can fix it. You can even add __attribute__((__deprecated__)) on C types!

the arch-pattern

Replace, Deprecate, Remove

All change is possible; question is only length of deprecation period

Applies to all interfaces

Guile deprecation period generally one stable series

Ex: scm_t_uint8; make-struct; Foreign objects; uniform vectors

Finally, you end up in a situation where you have replaced the old interface and issued deprecation warnings to help users migrate. The next step is to remove the old interface. If you don't do this, you are failing as a project maintainer -- your project becomes literally unmaintainable as it just grows and grows.

This strategy applies to all changes. The deprecation period may last a while, and it may be that the replacement you built doesn't serve the purpose. There is still a dialog with the users that needs to happen. As an example, I made a replacement for the "SMOB" facility in Guile that allows users to define new types, backed by C interfaces. This new "foreign object" facility might not actually be good enough to replace SMOBs; since I haven't formally deprecated SMOBs, I don't know yet because users are still using the old thing!

change produces a new stable point

Stability within series: only additions

Corollary: dependencies must be at least as stable as you!

  • for your definition of stable

  • social norms help (GNU, semver)

Ex: libtool; unistring; gnulib

In my experience, the old management dictum that "the only constant is change" does not describe software. Guile changes, then it becomes stable for a while. You need an unstable series to escape hill-climbing; then, once you've found your new hill, you start climbing again in the stable series.

Once you reach your stable point, the projects you rely on need to exhibit the same degree of stability that you envision for your project. You can't build a web site that you expect to maintain for 10 years on technology that fundamentally changes every 6 months. But stable dependencies isn't something you can ensure technically; rather it relies on social norms of who makes the software you use.

who can crank the motor of history?

All libraries define languages

Allow user to evolve the language

  • User functionality: modules (Guix)

  • User syntax: macros (yay Scheme)

Guile 1.8 perf created tension

  • incorporate code into Guile

  • large C interface “for speed”

Compiler removed pressure on C ABI

Empowered users need less from you

A dialectic process does not progress on its own: it requires actions. As a project maintainer, some of my actions are because I want to do them. Others are because users want me to do them. The user-driven actions are generally a burden and as a lazy maintainer, I want to minimize them.

Here I think Guile has to a large degree escaped some of the pressures that weigh on other languages, for example Python. Because Scheme allows users to define language features that exist on par with "built-in" features, users don't need my approval or intervention to add (say) new syntax to the language they work in. Furthermore, their work can still compose with the work of others, even if the others don't buy in to their language extensions.

Still, Guile 1.8 did have a dynamic whereby the relatively poor performance of having to run all code through primitive-eval meant that users were pushed towards writing extensions in C. This in turn pushed Guile to expose all of its guts for access from C, which obviously has led to an overbloated C API and ABI. Happily the work on the Scheme compiler has mostly relieved this pressure, and we may therefore be able to trim the size of the C API and ABI over time.

contributions and risk

From maintenance point of view, all interface is legacy

Guile: Sometimes OK to accept user modules when they are more stable than Guile

In-tree users keep you honest

Ex: SSAX, fibers, SRFI

It can be a good strategy to "sediment" solutions to common use cases into Guile itself. This can improve the minimalism of an entire ecosystem of code. The maintenance burden has to be minimal, however; Guile has sometimes adopted experimental code into its repository, and without active maintenance, it soon becomes stale relative to what users and the module maintainers expect.

I would note an interesting effect: pieces of code that were adopted into Guile become a snapshot of the coding style at that time. It's useful to have some in-tree users because it gives you a better idea about how a project is seen from the outside, from a code perspective.

sticky bits

Memory management is an ongoing thorn

Local maximum: Boehm-Demers-Weiser conservative collector

How to get to precise, generational GC?

Not just Guile; e.g. CPython __del__

There are some points that resist change. The stickiest of these is the representation of heap-allocated Scheme objects in C. Guile currently uses a garbage collector that "automatically" finds all live Scheme values on the C stack and in registers. It was the right choice at the time, given our maintenance budget. But to get the next bump in performance, we need to switch to a generational garbage collector. It's hard to do that without a lot of pain to C users, essentially because the C language is too weak to express the patterns that we would need. I don't know how to proceed.

I would note, though, that memory management is a kind of cross-cutting interface, and that it's not just Guile that's having problems changing; I understand PyPy has had a lot of problems regarding changes on when Python destructors get called due to its switch from reference counting to a proper GC.

future

We are here: stability

And then?

  • Parallel-installability for source languages: #lang

  • Sediment idioms from Racket to evolve Guile user base

Remove myself from “holding the crank”

So where are we going? Nowhere, for the moment; or rather, up the hill. We just released Guile 3.0, so let's just appreciate that for the time being.

But as far as next steps in language evolution, I think in the short term they are essentially to further enable change while further sedimenting good practices into Guile. On the change side, we need parallel installability for entire languages. Racket did a great job facilitating this with #lang and we should just adopt that.

As for sedimentation, we should step back and see whether any common Guile use patterns built by our users should be included in core Guile, and widen our gaze to Racket as well. It will take some effort, both from a technical perspective and in building social/emotional consensus about how much change is good and how bold versus conservative to be: putting the dialog into dialectic.

dialectic, boogie woogie woogie

https://gnu.org/s/guile

https://wingolog.org/

#guile on freenode

@andywingo

wingo@igalia.com

Happy hacking!

Hey that was the talk! Hope you enjoyed the writeup. Again, video and slides available on the FOSDEM web site. Happy hacking!

Andy Wingo: state of the gnunion 2020


Greetings, GNU hackers! This blog post rounds up GNU happenings over 2019. My goal is to celebrate the software we produced over the last year and to help us plan a successful 2020.

Over the past few months I have been discussing project health with a group of GNU maintainers and we were wondering how the project was doing. We had impressions, but little in the way of data. To that end I wrote some scripts to collect dates and versions for all releases made by GNU projects, as far back as data is available.

In 2019, I count 243 releases, from 98 projects. Nice! Notably, on ftp.gnu.org we have the first stable releases from three projects:

GNU Guix
GNU Guix is perhaps the most exciting project in GNU these days. It's a package manager! It's a distribution! It's a container construction tool! It's a package-manager-cum-distribution-cum-container-construction-tool! Hearty congratulations to Guix on their first stable release.
GNU Shepherd
The GNU Daemon Shepherd is a modern dependency-based init service, written in Guile Scheme, and used in Guix. When you install Guix as an operating system, it actually stages Scheme programs from the operating system definition into the Shepherd configuration. So cool!
GNU Backgammon
Version 1.06.002 is not GNU Backgammon's first stable release, but it is the earliest version which is available on ftp.gnu.org. Formerly hosted on the now-defunct gnubg.org, GNU Backgammon is a venerable foe, and has used neural networks since before they were cool. Welcome back, GNU Backgammon!

The total release counts above are slightly above what Mike Gerwitz's scripts count in his "GNU Spotlight", posted on the FSF blog. This could be because in addition to files released on ftp.gnu.org, I also manually collected release dates for most packages that upload their software somewhere other than gnu.org. I don't count alpha.gnu.org releases, and there were a handful of packages for which I wasn't successful at retrieving their release dates. But as a first approximation, it's a relatively complete data set.

I put my scripts in a git repository if anyone is interested in playing with the data. Some raw CSV files are there as well.

where we at?

Hair toss, check my nails, baby how you GNUing? Hard to tell!

To get us closer to an answer, I calculated the active package count per year. There can be other definitions, but my reading is that an active package is one that has had a stable release within the preceding 3 calendar years. So for 2019, for example, a GNU package is considered active if it had a stable release in 2017, 2018, or 2019. What I got was a graph that looks like this:

What we see is nothing before 1991 -- surely pointing to lacunae in my data set -- then a more or less linear rise in active package count until 2002, some stuttering growth rising to a peak in 2014 at 208 active packages, and from there a steady decline down to 153 active packages in 2019.
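For the record, here is roughly how such an "active package" count can be computed (with invented sample data; the actual scripts and CSV files in the repository mentioned above may be organized differently):

# Sketch of the "active package" metric: a package counts as active in a
# given year if it made a stable release that year or in the two preceding
# years.  The release data below is invented for illustration.
from collections import defaultdict

releases = [("guile", 2017), ("guile", 2019), ("ed", 2016), ("backgammon", 2019)]

release_years = defaultdict(set)
for package, year in releases:
    release_years[package].add(year)

def active_count(year, window=3):
    recent = set(range(year - window + 1, year + 1))  # e.g. {2017, 2018, 2019}
    return sum(1 for years in release_years.values() if years & recent)

print(active_count(2019))  # -> 2: guile and backgammon; ed's 2016 release is too old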

Of course, as a metric, active package count isn't precisely the same as project health; GNU ed is indeed the standard editor but it's not GCC. But we need to look for measurements that indirectly indicate project health and this is what I could come up with.

Looking a little deeper, I tabulated the first and last release date for each GNU package, and then grouped them by year. In this graph, the left blue bars indicate the number of packages making their first recorded release, and the right green bars indicate the number of packages making their last release. Obviously a last release in 2019 indicates an active package, so it's to be expected that we have a spike in green bars on the right.

What this graph indicates is that GNU had an uninterrupted growth phase from its beginning until 2006, with more projects being born than dying. Things are mixed until 2012 or so, and since then we see many more projects making their last release and above all, very few packages "being born".

where we going?

I am not sure exactly what steps GNU should take in the future but I hope that this analysis can be a good conversation-starter. I do have some thoughts but will post in a follow-up. Until then, happy hacking in 2020!

Jean-François Fortin Tam: Revival of Getting Things GNOME: survey results and first status update


Ever since my previous blogging frenzy where I laid bare the secret to my productivity, formulated my typology of workers, and published a survey to evaluate the revival potential for Getting Things GNOME, I’m sure y’all have been dying to know what were the outcomes of that survey, and how the GTG project is doing.

Well, you can find out in this video where I present my findings and the path forward for the project:

I hope you like it. Much more fun than reading a wall of text, isn’t it?

I’ll have another video (on a different open-source project/community management topic) coming up very soon, and it will be of absolutely epic proportions—it took me months to produce it to the level of quality I desired. If you don’t want to miss it, I suggest you subscribe to my YouTube channel (and/or mailing list).

The post Revival of Getting Things GNOME: survey results and first status update appeared first on The Open Sourcerer.

Jean-François Fortin Tam: The Ultimate Free and Open Source conference explanation video


Have you ever wondered what the best community-oriented open source conference events look like? Ever wanted to attend one, but never dared to? Or need something to convince your boss to support you in attending as part of your work?

For many veteran FLOSS contributors who are part of big established projects, it is easy to take things for granted and just go to those events without hesitation; we forget how mysterious and intimidating this can be for casual or new contributors. We don’t typically spend the time to articulate what makes these events great, and why we spend so much effort organizing and attending them.

It also seems quite mysterious to our non-technical friends and family members. They sometimes know that we’re travelling to some mythical “computer conference” event in some faraway land, suspiciously held in a different city every year (as is the case with GNOME’s GUADEC), but it’s hard to explain why we’re mostly going there for a few days to spend time “indoors in some auditorium” instead of sipping margaritas on the beach.

Well, I have the solution for this longstanding communication problem.

After weeks of preparation, a few days of shooting, and over 13 days of full-time editing, I have produced the Ultimate FLOSS conference explanation video:

Click the image above to view the video.

It is a dynamic, cinematic, professional-grade short documentary, meant to serve as evergreen material that you can point people to. I hope you will appreciate the level of attention to detail present in this edit!

It is longer than the typical “2-minutes conference highlights” video, but I believe the quality and depth of topics being discussed, combined with my tight script and editing, will make it a pleasure for you to watch from beginning to end. It’s shorter than a TV episode yet still makes for good nighttime entertainment!

  • In order to make the topic accessible, the cinematic part is preceded by a narrated introduction to establish the context in terms anyone can understand. This is so that you can safely share the video with newcomers, friends, family, new acquaintances you meet for years to come—no matter their knowledge level.
  • After the educational introduction, it then continues to the “cinematic” documentary part.

“What if I’m already a Lv.70 geek?”

If you’re in a big hurry and you already know all about Free & Open Source software, you can skip to the 5:05 mark, and if you already know about GStreamer and don’t care to know why I made the video “around GStreamer” in the first place, you can jump directly to 8:14… But if you have a few minutes extra, it’s certainly worth watching from the start (you’d be missing some jokes otherwise).

I recommend listening to this with good speakers or headphones. While I am no musician, my editing style is centered around sound, rhythm and (e)motion:

  • I adapt motion, beats and flow to fit the desired atmosphere and impact. In terms of editing style, I primarily “cut to the music”, but also sometimes rearrange the music itself to fit the motion.
  • I tweak all the sound levels & frequencies to ensure you can always hear speech effortlessly—whether you are on my studio monitoring speakers or headphones, or on some crappy 1w laptop speakers (which I obviously do not recommend).

While I’m at it, I might as well mention that I’m open for contractual video production or editing work (in addition to being available as a long-term specialized marketer ;)


P.s.: if you’re French, no need to challenge me to a duel after watching the intro! I actually like your transportation system. Especially the fact that you actually have trains. We don’t have that here.

The post The Ultimate Free and Open Source conference explanation video appeared first on The Open Sourcerer.

Víctor Jáquez: Review of the Igalia Multimedia team Activities (2019/H2)


This blog post is a review of the various activities the Igalia Multimedia team was involved along the second half of 2019.

Here are the previous 2018/H2 and 2019/H1 reports.

GstWPE

Succinctly, GstWPE is a GStreamer plugin which allows rendering web pages as a video stream whose frames are GL textures.

Phil, its main author, wrote a blog post explaining in detail what GstWPE is and its possible use cases. He wrote a demo too, which grabs and previews a live stream from a webcam session and blends it with an overlay from wpesrc, which displays HTML content. This composited live stream can be broadcast through YouTube or Twitch.

These concepts are better explained by Phil himself in the following lightning talk, presented at the last GStreamer Conference in Lyon:

Video Editing

After implementing a deep integration of the GStreamer Editing Services (a.k.a GES) into Pixar’s OpenTimelineIO during the first half of 2019, we decided to implement an important missing feature for the professional video editing industry: nested timelines.

Toward that goal, Thibault worked with the GSoC student Swayamjeet Swain to implement a flexible API to support nested timelines in GES. This means that users of GES can now decouple each scene into different projects when editing long videos. This work is going to be released in the upcoming GStreamer 1.18 version.

Henry Wilkes also implemented support for nested timelines in OpenTimelineIO, making the GES integration one of the most advanced ones, as you can see in the following table:

Feature               | OTIO | EDL | FCP7 XML | FCP X | AAF | RV  | ALE | GES
Single Track of Clips | ✔    | ✔   | ✔        | ✔     | ✔   | W-O | ✔   | ✔
Multiple Video Tracks | ✔    | ✖   | ✔        | ✔     | ✔   | W-O | ✔   | ✔
Audio Tracks & Clips  | ✔    | ✔   | ✔        | ✔     | ✔   | W-O | ✔   | ✔
Gap/Filler            | ✔    | ✔   | ✔        | ✔     | ✔   | ✔   | ✖   | ✔
Markers               | ✔    | ✔   | ✔        | ✔     | ✖   | N/A | ✖   | ✔
Nesting               | ✔    | ✖   | ✔        | ✔     | ✔   | W-O | ✔   | ✔
Transitions           | ✔    | ✔   | ✖        | ✖     | ✔   | W-O | ✖   | ✔
Audio/Video Effects   | ✖    | ✖   | ✖        | ✖     | ✖   | N/A | ✖   | ✔
Linear Speed Effects  | ✔    | ✔   | ✖        | ✖     | R-O | ✖   | ✖   | ✖
Fancy Speed Effects   | ✖    | ✖   | ✖        | ✖     | ✖   | ✖   | ✖   | ✖
Color Decision List   | ✔    | ✔   | ✖        | ✖     | ✖   | ✖   | N/A | ✖

Along these lines, Thibault delivered a 15 minutes talk, also in the GStreamer Conference 2019:

After detecting a few regressions and issues in GStreamer related to frame accuracy, we decided to make sure that we can seek in a perfectly frame-accurate way using GStreamer and the GStreamer Editing Services. In order to ensure that, an extensive integration testsuite has been developed, mostly targeting the most important container formats and codecs (namely mxf, quicktime, h264, h265, prores, jpeg), and issues have been fixed in different places. On top of that, new APIs are being added to GES to allow expressing times as frame numbers instead of nanoseconds. This work is still ongoing but should be merged in time for GStreamer 1.18.

GStreamer Validate Flow

GstValidate has been turning into one of the most important GStreamer testing tools to check that elements behave as they are supposed to do in the framework.

Along with our MSE work, we found that another way to specify tests, based on the buffers and events produced through specific pads, was needed. Thus, Alicia developed a new plugin for GstValidate: Validate Flow.

Alicia gave an informative 30 minutes talk about GstValidate and the new plugin in the last GStreamer Conference too:

GStreamer VAAPI

Most of the work during the second half of 2019 was maintenance tasks and code reviews.

We worked mainly on memory restrictions per backend driver, and we reviewed a big refactor: internal encoders now use GstObject instead of the custom GstVaapiObject. We also reviewed patches for new features such as video rotation and cropping in vaapipostproc.

Servo multimedia

Last year we worked on integrating media playback in Servo. We finally delivered hardware-accelerated video playback on Linux and Android. We also worked on the Windows and Mac ports, but they were not finished. Naturally, most of the work was in the servo/media crate, pushing code and reviewing contributions. The major tasks were to rewrite the media player example and to make the internal source element handle playbin‘s download flag properly.

We also added WebGL integration support with <video> elements, thus webpages can use video frames as WebGL textures.

Finally we explored how to isolate the multimedia processing in a dedicated thread or process, but that task remains pending.

WebKit Media Source Extension

We did a lot of downstream and upstream bug fixing and patch review, both in WebKit and GStreamer, for our MSE GStreamer-based backend.

Along this line, we improved WebKitMediaSource to use playbin3, and compatibility with older GStreamer versions was also added.

WebKit WebRTC

Most of the work in this area was maintenance and fixing regressions uncovered by the layout tests. Besides that, support for the Raspberry Pi was improved by handling encoded streams from v4l2 video sources, with some explorations with the Minnowboard on top of that.

Conferences

GStreamer Conference

Igalia was a Gold sponsor of this last GStreamer Conference, held in Lyon, France.

The whole team attended and five talks were delivered. Thibault alone presented, besides the video editing talk we already mentioned, two more: one about the GstTranscoder API and the other about the new documentation infrastructure based on Hotdoc:

We also had a productive hackfest after the conference, where we worked on an AV1 Rust decoder, an HLS Rust demuxer, a hardware-decoder flag in playbin, and other stuff.

Linaro Connect

Phil attended the Linaro Connect conference in San Diego, USA. He delivered a talk about WPE/Multimedia which you can enjoy here:

Demuxed

Charlie attended Demuxed, in San Francisco. The conference is heavily focused on streaming and codec engineering and validation. Sadly there is not much interest in GStreamer there, as the main focus is on FFmpeg.

RustFest

Phil and I attended the last RustFest in Barcelona. Basically we went to meet with the Rust community and we attended the “WebRTC with GStreamer-rs” workshop presented by Sebastian Dröge.


Jean-François Fortin Tam: 2020: the fecal matter is colliding with the rotary oscillator


Many friends of mine, including a significant portion of GNOME contributors, are in the United States, and I’m personally worried they (or those around them) will face particularly deep trouble this year and beyond. It seems nobody dares talk openly about it, so what the heck, I’m sharing my concern here and getting it out of my chest (then, after worrying about death, I can move on to worrying about taxes). Maybe I’ll be able to sleep a bit better.

As you most probably know, Europe is taking a serious beating and is struggling as we speak… but if you thought the US will fare any better, just wait. Shiitake is about to hit the fan, and the case of the United States of America is particularly concerning because of the many reasons I extensively documented here a couple of days ago. Not only is the US’ preparation for this pandemic very much insufficient and its safety net for citizens essentially nonexistent, but it also has very unique societal factors that, compared to all the other countries in the world, put it at risk of suffering extremely deep social disruption and pervasive hardship.

I wish you the best of luck in the fight against the SARS-coronavirus-2, just as I am wishing good luck to the rest of the world. I hope I will be incredibly wrong (so far the trends seem to be confirming my predictions, however) and that some unforeseen radical solutions will turn the tide, but I’m not holding my breath here. The US needs more than band-aid quick-fixes.

Let’s hope that this time, the sheer scale of the problem will bring about real positive change in the system. Not just a bigger economic bubble at the expense of the people and planet. It would be about time.

The post 2020: the fecal matter is colliding with the rotary oscillator appeared first on The Open Sourcerer.

Andy Wingo: firefox's low-latency webassembly compiler


Good day!

Today I'd like to write a bit about the WebAssembly baseline compiler in Firefox.

background: throughput and latency

WebAssembly, as you know, is a virtual machine that is present in web browsers like Firefox. An important initial goal for WebAssembly was to be a good target for compiling programs written in C or C++. You can visit a web page that includes a program written in C++ and compiled to WebAssembly, and that WebAssembly module will be downloaded onto your computer and run by the web browser.

A good virtual machine for C and C++ has to be fast. The throughput of a program compiled to WebAssembly (the amount of work it can get done per unit time) should be approximately the same as its throughput when compiled to "native" code (x86-64, ARMv7, etc.). WebAssembly meets this goal by defining an instruction set that consists of similar operations to those directly supported by CPUs; WebAssembly implementations use optimizing compilers to translate this portable instruction set into native code.

There is another dimension of fast, though: not just work per unit time, but also time until first work is produced. If you want to go play Doom 3 on the web, you care about frames per second but also time to first frame. Therefore, WebAssembly was designed not just for high throughput but also for low latency. This focus on low-latency compilation expresses itself in two ways: binary size and binary layout.

On the size front, WebAssembly is optimized to encode small files, reducing download time. One way in which this happens is to use a variable-length encoding anywhere an instruction needs to specify an integer. In the usual case where, for example, there are fewer than 128 local variables, this means that a local.get instruction can refer to a local variable using just one byte. Another strategy is that WebAssembly programs target a stack machine, reducing the need for the instruction stream to explicitly load operands or store results. Note that size optimization only goes so far: it's assumed that the bytes of the encoded module will be compressed by gzip or some other algorithm, so sub-byte entropy coding is out of scope.
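Concretely, those variable-length integers use the LEB128 encoding; the following minimal decoder sketch (illustration only, not the actual Firefox code) shows why indices below 128 fit in a single byte:

# Minimal unsigned LEB128 decoder, the variable-length integer encoding used
# by the WebAssembly binary format.  Each byte carries 7 payload bits plus a
# continuation bit, so any value below 128 needs only one byte.
def decode_uleb128(data, offset=0):
    result, shift = 0, 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:        # continuation bit clear: last byte
            return result, offset
        shift += 7

print(decode_uleb128(bytes([0x05])))              # (5, 1): one byte suffices
print(decode_uleb128(bytes([0xE5, 0x8E, 0x26])))  # (624485, 3)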

On the layout side, the WebAssembly binary encoding is sorted by design: definitions come before uses. For example, there is a section of type definitions that occurs early in a WebAssembly module. Any use of a declared type can only come after the definition. In the case of functions which are of course mutually recursive, function type declarations come before the actual definitions. In theory this allows web browsers to take a one-pass, streaming approach to compilation, starting to compile as functions arrive and before download is complete.

implementation strategies

The goals of high throughput and low latency conflict with each other. To get best throughput, a compiler needs to spend time on code motion, register allocation, and instruction selection; to get low latency, that's exactly what a compiler should not do. Web browsers therefore take a two-pronged approach: they have a compiler optimized for throughput, and a compiler optimized for latency. As a WebAssembly file is being downloaded, it is first compiled by the quick-and-dirty low-latency compiler, with the goal of producing machine code as soon as possible. After that "baseline" compiler has run, the "optimizing" compiler works in the background to produce high-throughput code. The optimizing compiler can take more time because it runs on a separate thread. When the optimizing compiler is done, it replaces the baseline code. (The actual heuristics about whether to do baseline + optimizing ("tiering") or just to go straight to the optimizing compiler are a bit hairy, but this is a summary.)

This article is about the WebAssembly baseline compiler in Firefox. It's a surprising bit of code and I learned a few things from it.

design questions

Knowing what you know about the goals and design of WebAssembly, how would you implement a low-latency compiler?

It's a question worth thinking about so I will give you a bit of space in which to do so.

.

.

.

After spending a lot of time in Firefox's WebAssembly baseline compiler, I have extracted the following principles:

  1. The function is the unit of compilation

  2. One pass, and one pass only

  3. Lean into the stack machine

  4. No noodling!

In the remainder of this article we'll look into these individual points. Note, although I have done a good bit of hacking on this compiler, its design and original implementation comes mainly from Mozilla hacker Lars Hansen, who also currently maintains it. All errors of exegesis are mine, of course!

the function is the unit of compilation

As we mentioned, in the binary encoding of a WebAssembly module, all definitions needed by any function come before all function definitions. This naturally leads to a partition between two phases of bytestream parsing: an initial serial phase that collects the set of global type definitions, annotations as to which functions are imported and exported, and so on, and a subsequent phase that compiles individual functions in an essentially independent manner.

The advantage of this approach is that compiling functions is a natural task unit of parallelism. If the user has a machine with 8 virtual cores, the web browser can keep one or two cores for the browser itself and farm out WebAssembly compilation tasks to the rest. The result is that the compiled code is available sooner.

Taking functions to be the unit of compilation also allows for an easy "tier-up" mechanism: after the baseline compiler is done, the optimizing compiler can take more time to produce better code, and when it is done, it can swap out the results on a per-function level. All function calls from the baseline compiler go through a jump table indirection, to allow for tier-up. In SpiderMonkey there is no mechanism currently to tier down; if you need to debug WebAssembly code, you need to refresh the page, causing the wasm code to be compiled in debugging mode. For the record, SpiderMonkey can only tier up at function calls (it doesn't do OSR).

This simple approach does have some down-sides, in that it leaves interprocedural optimizations on the table (inlining, contification, custom calling conventions, speculative optimizations). This is mitigated in two ways, the most obvious being that LLVM or whatever produced the WebAssembly has ideally already done whatever inlining might be fruitful. The second is that WebAssembly is designed for predictable performance. In JavaScript, an implementation needs to do run-time type feedback and speculative optimizations to get good performance, but the result is that it can be hard to understand why a program is fast or slow. The designers and implementers of WebAssembly in browsers all had first-hand experience with JavaScript virtual machines, and actively wanted to avoid unpredictable performance in WebAssembly. Therefore there is currently a kind of détente among the various browser vendors, that everyone has agreed that they won't do speculative inlining -- yet, anyway. Who knows what will happen in the future, though.

Digressing, the summary here is that the baseline compiler receives an individual function body as input, and generates code just for that function.

one pass, and one pass only

The WebAssembly baseline compiler makes one pass through the bytecode of a function. Nowhere in all of this are we going to build an abstract syntax tree or a graph of basic blocks. Let's follow through how that works.

Firstly, emitFunction simply emits a prologue, then the body, then an epilogue. emitBody is basically a big loop that consumes opcodes from the instruction stream, dispatching to opcode-specific code emitters (e.g. emitAddI32).

The opcode-specific code emitters are also responsible for validating their arguments; for example, emitAddI32 is wrapped in an assertion that there are two i32 values on the stack. This validation logic is shared by a templatized codestream iterator so that it can be re-used by the optimizing compiler, as well as by the publicly-exposed WebAssembly.validate function.

A corollary of this approach is that machine code is emitted in bytestream order; if the WebAssembly instruction stream has an i32.add followed by an i32.sub, then the machine code will have an addl followed by a subl.

WebAssembly has a syntactically limited form of non-local control flow; it's not goto. Instead, instructions are contained in a tree of nested control blocks, and control can only exit nonlocally to a containing control block. There are three kinds of control blocks: jumping to a block or an if will continue at the end of the block, whereas jumping to a loop will continue at its beginning. In either case, as the compiler keeps a stack of nested control blocks, it has the set of valid jump targets and can use the usual assembler logic to patch forward jump addresses when the compiler gets to the block exit.
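To make that bookkeeping concrete, here is a toy model of one-pass branch patching (a simplified sketch, not SpiderMonkey's actual assembler or data structures):

# Toy model of one-pass branch handling: a branch to a `block` or `if`
# targets the end of the construct, which isn't known yet, so the operand
# offset is recorded and patched when the block is closed; a branch to a
# `loop` targets its beginning, which is already known.  Illustration only.
class ControlBlock:
    def __init__(self, is_loop, here):
        self.is_loop = is_loop
        self.entry = here        # loops jump back here
        self.pending = []        # operand offsets awaiting the exit label

code = bytearray()
control_stack = []

def emit_branch(depth):
    target = control_stack[-1 - depth]
    code.append(0xEB)                      # placeholder jump opcode
    if target.is_loop:
        code.append(target.entry & 0xFF)   # backward jump: destination known
    else:
        target.pending.append(len(code))   # forward jump: remember operand offset
        code.append(0x00)

def end_block():
    block = control_stack.pop()
    for offset in block.pending:
        code[offset] = len(code) & 0xFF    # patch now that the block end is known

control_stack.append(ControlBlock(is_loop=False, here=len(code)))
emit_branch(0)   # forward branch to the enclosing block's end
end_block()      # the branch destination gets patched here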

lean into the stack machine

This is the interesting bit! So, WebAssembly instructions target a stack machine. That is to say, there's an abstract stack onto which evaluating i32.const 32 pushes a value, and if followed by i32.const 10 there would then be i32(32) | i32(10) on the stack (where new elements are added on the right). A subsequent i32.add would pop the two values off, and push on the result, leaving the stack as i32(42). There is also a fixed set of local variables, declared at the beginning of the function.

The easiest thing that a compiler can do, then, when faced with a stack machine, is to emit code for a stack machine: as values are pushed on the abstract stack, emit code that pushes them on the machine stack.

The downside of this approach is that you emit a fair amount of code just to read and write values from the stack. Machine instructions generally take arguments from registers and write results to registers; going to memory is a bit superfluous. We're willing to accept suboptimal code generation for this quick-and-dirty compiler, but isn't there something smarter we can do for ephemeral intermediate values?

Turns out -- yes! The baseline compiler keeps an abstract value stack as it compiles. For example, compiling i32.const 32 pushes nothing on the machine stack: it just adds a ConstI32 node to the value stack. When an instruction needs an operand that turns out to be a ConstI32, it can either encode the operand as an immediate argument or load it into a register.

Say we are evaluating the i32.add discussed above. After the add, where does the result go? For the baseline compiler, the answer is always "in a register" via pushing a new RegisterI32 entry on the value stack. The baseline compiler includes a stupid register allocator that spills the value stack to the machine stack if no register is available, updating value stack entries from e.g. RegisterI32 to MemI32. Note, a ConstI32 never needs to be spilled: its value can always be reloaded as an immediate.

The end result is that the baseline compiler avoids lots of stack store and load code generation, which speeds up the compiler, and happens to make faster code as well.
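
Here is a minimal sketch of such an abstract value stack, restricted to i32 values and with a deliberately naive spill-everything policy; the names are illustrative, not SpiderMonkey's.

#include <cstdint>
#include <vector>

struct Stk {
  enum Kind { ConstI32, RegisterI32, MemI32 };
  Kind kind;
  int32_t payload;  // constant value, register number, or spill offset
};

class ValueStack {
  std::vector<Stk> stk_;
  int32_t nextSpillOffset_ = 0;

public:
  // i32.const: record the constant; emit no machine code at all.
  void pushConstI32(int32_t v) { stk_.push_back({Stk::ConstI32, v}); }

  // Results of operations always land in a register.
  void pushRegisterI32(int32_t reg) { stk_.push_back({Stk::RegisterI32, reg}); }

  // Called when no register is free: turn register entries into memory
  // entries; this is the only point at which stack stores are emitted.
  void spillAll() {
    for (Stk& e : stk_) {
      if (e.kind == Stk::RegisterI32) {
        // emit: store register e.payload to the machine stack here
        e.kind = Stk::MemI32;
        e.payload = nextSpillOffset_++;
      }
      // ConstI32 entries are never spilled; they reload as immediates.
    }
  }

  Stk pop() {
    Stk top = stk_.back();
    stk_.pop_back();
    return top;
  }
};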

Note that there is one limitation, currently: control-flow joins can have multiple predecessors and can pass a value (in the current WebAssembly specification), so the allocation of that value needs to be agreed-upon by all predecessors. As in this code:

(func $f (param $arg i32) (result i32)
  (block $b (result i32)
    (i32.const 0)
    (local.get $arg)
    (i32.eqz)
    (br_if $b) ;; return 0 from $b if $arg is zero
    (drop)
    (i32.const 1))) ;; otherwise return 1
;; result of block implicitly returned

When the br_if branches to the block end, where should it put the result value? The baseline compiler effectively punts on this question and just puts it in a well-known register (e.g., $rax on x86-64). Results for block exits are the only place where WebAssembly has "phi" variables, and the baseline compiler allocates all integer phi variables to the same register. A hack, but there we are.

no noodling!

When I started to hack on the baseline compiler, I did a lot of code reading, and eventually came on code like this:

void BaseCompiler::emitAddI32() {
  int32_t c;
  if (popConstI32(&c)) {
    // The top of the value stack is a known constant: fold it into an
    // add-immediate, so only the other operand needs a register.
    RegI32 r = popI32();
    masm.add32(Imm32(c), r);
    pushI32(r);
  } else {
    // General case: pop both operands into registers, add in place,
    // free the second register, and push the result.
    RegI32 r, rs;
    pop2xI32(&r, &rs);
    masm.add32(rs, r);
    freeI32(rs);
    pushI32(r);
  }
}

I said to myself, this is silly, why are we only emitting the add-immediate code if the constant is on top of the stack? What if instead the constant was the deeper of the two operands, why do we then load the constant into a register? I asked on the chat channel if it would be OK if I improved codegen here and got a response I was not expecting: no noodling!

The reason is, performance of baseline-compiled code essentially doesn't matter. Obviously let's not pessimize things but the reason there's a baseline compiler is to emit code quickly. If we start to add more code to the baseline compiler, the compiler itself will slow down.

For that reason, changes are only accepted to the baseline compiler if they are necessary for some reason, or if they improve latency as measured using some real-world benchmark (time-to-first-frame on Doom 3, for example).

This to me was a real eye-opener: a compiler optimized not for the quality of the code that it generates, but rather for how fast it can produce the code. I had seen this in action before but this example really brought it home to me.

The focus on compiler throughput rather than compiled-code throughput makes it pretty gnarly to hack on the baseline compiler -- care has to be taken when adding new features not to significantly regress the old. It is much more like hacking on a production JavaScript parser than your traditional SSA-based compiler.

that's a wrap!

So that's the WebAssembly baseline compiler in SpiderMonkey / Firefox. Until the next time, happy hacking!

Bastien Nocera: PAM testing using pam_wrapper and dbusmock

On the road to libfprint and fprintd 2.0, we've been fixing some long-standing bugs, including one that required porting our PAM module from dbus-glib to sd-bus, systemd's D-Bus library implementation.

As you can imagine, I have confidence in my ability to write bug-free code at the first attempt, but the foresight to know that this code will be buggy if it's not tested (and to know there's probably a bug in the tests if they run successfully the first time around). So we will have to test that PAM module, thoroughly, before and after the port.

Replacing fprintd

First, to make it easier to run and instrument, we needed to replace fprintd itself. For this, we used dbusmock, which is both a convenience Python library and a way to write instrumentable D-Bus services, and wrote a template. There are a number of existing templates for a lot of session and system services, in case you want to test the integration of your code with NetworkManager, low-memory-monitor, or any number of other services.

We then used this to write tests for the command-line utilities, so we can both test our new template and test the command-line utilities themselves.

Replacing gdm

Now that we've got a way to replace fprintd and a physical fingerprint reader, we should write some tests for the (old) PAM module to replace sudo, gdm, or the login authentication services.

Co-workers Andreas Schneider and Jakub Hrozek worked on pam_wrapper, an LD_PRELOAD library to mock the PAM library, and Python helpers to write simple PAM services. This LWN article explains how to test PAM applications, and PAM modules.

After fixing a few bugs in pam_wrapper, and combining with the fprintd dbusmock work above, we could wrap and test the fprintd PAM module like it never was before.

Porting to sd-bus

Finally, porting the PAM module to sd-bus was pretty trivial, a loop of 1) writing tests that work against the old PAM module, 2) porting a section of the code (like the fingerprint reader enumeration, or the timeout support), and 3) testing against the new sd-bus based code. The result was no regressions that we could test for.

Conclusion

Both dbusmock, and pam_wrapper are useful tools in your arsenal to write tests, and given those (fairly) easy to use CIs in GNOME and FreeDesktop.org's GitLabs, it would be a shame not to.

You might also be interested in umockdev, to mock a number of device types, and mocklibc (which, combined with dbusmock, powers polkit's unattended CI).

Andy Wingo: multi-value webassembly in firefox: from 1 to n


Greetings, hackers! Today I'd like to write about something I worked on recently: implementation of the multi-value future feature of WebAssembly in Firefox, as sponsored by Bloomberg.

In the "minimum viable product" version of WebAssembly published in 2018, there were a few artificial restrictions placed on the language. Functions could only return a single value; if a function would naturally return two values, it would have to return at least one of them by writing to memory. Loops couldn't take parameters; any loop state variables had to be stored to and loaded from indexed local variables at each iteration. Similarly, any block that would naturally return more than one result would also have to do so via locals.

This restriction is lifted with the multi-value proposal. Function types now map from result type to result type, where a result type is a sequence of value types. That is to say, just as functions can take multiple arguments, they can return multiple results. Similarly, with the multi-value proposal, block types are now the same as function types: loops and blocks can take arguments and return any number of results. This change improves the expressiveness of WebAssembly as a compilation target; a C++ program compiled to multi-value WebAssembly can be encoded in fewer bytes than before. Multi-value also establishes a base for other language extensions. For example, the exception handling proposal builds on multi-value to pass multiple values to catch blocks.

So, that's multi-value. You would think that relaxing a restriction would be easy, but you'd be wrong! This task took me 5 months and had a number of interesting gnarly bits. This article is part one of two about interesting aspects of implementing multi-value in Firefox, specifically focussing on blocks. We'll talk about multi-value function calls next week.

multi-value in blocks

In the last article, I presented the basic structure of Firefox's WebAssembly support: there is a baseline compiler optimized for low latency and an optimizing compiler optimized for throughput. (There is also Cranelift, a new experimental compiler that may replace the current implementation of the optimizing compiler; but that doesn't affect the basic structure.)

The optimizing compiler applies traditional compiler techniques: SSA graph construction, where values flow into and out of graphs using the usual defs-dominate-uses relationship. The only control-flow joins are loop entry and (possibly) block exit, so the addition of loop parameters means in multi-value there are some new phi variables in that case, and the expansion of block result count from [0,1] to [0,n] means that you may have more block exit phi variables. But these compilers are built to handle these situations; you just build the SSA and let the optimizing compiler go to town.

The problem comes in the baseline compiler.

from 1 to n

Recall that the baseline compiler is optimized for compiler speed, not compiled speed. If there are only ever going to be 0 or 1 result from a block, for example, the baseline compiler's internal data structures will use something like a Maybe<ValType> to represent that block result.

If you then need to expand this to hold a vector of values, the naïve approach of using a Vector<ValType> would mean heap allocation and indirection, and thus would regress the baseline compiler.

In this case, and in many other similar cases, the solution is to use value tagging to represent 0 or 1 value type directly in a word, and the general case by linking out to an external vector. As block types are function types, they actually appear as function types in the WebAssembly type section, so they are already parsed; the BlockType in that case can just refer out to already-allocated memory.

In fact this value-tagging pattern applies all over the place. (The jit/ links above are for the optimizing compiler, but they relate to function calls; will write about that next week.) I have a bit of pause about value tagging, in that it's gnarly complexity and I didn't measure the speed of alternative implementations, but it was a useful migration strategy: value tagging minimizes performance risk to existing specialized use cases while adding support for new general cases. Gnarly it is, then.
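
For readers unfamiliar with the pattern, here is a small value-tagging sketch in the spirit of what's described above; it is not the real BlockType, and the exact tag layout is an assumption for illustration.

#include <cassert>
#include <cstdint>

enum class ValType : uintptr_t { I32 = 1, I64 = 2, F32 = 3, F64 = 4 };

struct FuncType;  // lives in the module's type section, already parsed

class BlockType {
  // Low bit set:   inline encoding for zero or one results.
  // Low bit clear: pointer to an already-allocated FuncType.
  uintptr_t bits_;

public:
  static BlockType voidType() { return BlockType{1}; }  // tag only, no payload
  static BlockType single(ValType t) {
    return BlockType{(uintptr_t(t) << 1) | 1};
  }
  static BlockType fromFuncType(const FuncType* ft) {
    assert((uintptr_t(ft) & 1) == 0);  // pointers are at least 2-aligned
    return BlockType{uintptr_t(ft)};
  }

  bool isInline() const { return (bits_ & 1) != 0; }
  bool isVoid() const { return bits_ == 1; }
  ValType inlineResult() const {
    assert(isInline() && !isVoid());
    return ValType(bits_ >> 1);
  }
  const FuncType* funcType() const {
    assert(!isInline());
    return reinterpret_cast<const FuncType*>(bits_);
  }

private:
  explicit BlockType(uintptr_t bits) : bits_(bits) {}
};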

control-flow joins

I didn't mention it in the last article, but there are two important invariants regarding stack discipline in the baseline compiler. Recall that there's a virtual stack, and that some elements of the virtual stack might be present on the machine stack. There are four kinds of virtual stack entry: register, constant, local, and spilled. Locals indicate local variable reads and are mostly like registers in practice; when registers spill to the stack, locals do too. (Why spill to the temporary stack instead of leaving the value in the local variable slot? Because locals are mutable. A local.get captures a local variable value at its point of execution. If future code changes the local variable value, you wouldn't want the captured value to change.)

Digressing, the stack invariants:

  1. Spilled values precede registers and locals on the virtual stack. If u and v are virtual stack entries and u is older than v, then if u is in a register or is a local, then v is not spilled.

  2. Older values precede newer values on the machine stack. Again for u and v, if they are both spilled, then u will be farther from the stack pointer than v.

There are five fundamental stack operations in the baseline compiler; let's examine them to see how the invariants are guaranteed. Recall that before multi-value, targets of non-local exits (e.g. of the br instruction) could only receive 0 or 1 value; if there is a value, it's passed in a well-known register (e.g. %rax or %xmm0). (On 32-bit machines, 64-bit values use a well-known pair of registers.)

  • push(v): Results of WebAssembly operations never push spilled values, neither onto the virtual nor the machine stack. v is either a register, a constant, or a reference to a local. Thus we guarantee both (1) and (2).
  • pop() -> v: Doesn't affect older stack entries, so (1) is preserved. If the newest stack entry is spilled, you know that it is closest to the stack pointer, so you can pop it by first loading it to a register and then incrementing the stack pointer; this preserves (2). Therefore if it is later pushed on the stack again, it will not be as a spilled value, preserving (1).
  • spill(): When spilling the virtual stack to the machine stack, you first traverse stack entries from new to old to see how far you need to spill. Once you get to a virtual stack entry that's already on the stack, you know that everything older has already been spilled, because of (1), so you switch to iterating back towards the new end of the stack, pushing registers and locals onto the machine stack and updating their virtual stack entries to be spilled along the way. This iteration order preserves (2). Note that because known constants never need to be on the machine stack, they can be interspersed with any other value on the virtual stack.
  • return(height, v): This is the stack operation corresponding to a block exit (local or nonlocal). We drop items from the virtual and machine stack until the stack height is height. In WebAssembly 1.0, if the target continuation takes a value, then the jump passes a value also; in that case, before popping the stack, v is placed in a well-known register appropriate to the value type. Note however that v is not pushed on the virtual stack at the return point. Popping the virtual stack preserves (1), because a stack and its prefix have the same invariants; popping the machine stack also preserves (2).
  • capture(t): Whereas return operations happen at block exits, capture operations happen at the target of block exits (the continuation). If no value is passed to the continuation, a capture is a no-op. If a value is passed, it's in a register, so we just push that register onto the virtual stack. Both invariants are obviously preserved.

Note that a value passed to a continuation via return() has a brief instant in which it has no name -- it's not on the virtual stack -- but only a location -- it's in a well-known place. capture() then gives that floating value a name.

Relatedly, there is another invariant, that the allocation of old values on block entry is the same as their allocation on block exit, so that all predecessors of the block exit flow all values via the same places. This is preserved by spilling on block entry. It's a big hammer, but effective.

So, given all this, how do we pass multiple values via return()? We don't have unlimited registers, so the %rax strategy isn't going to work.

The answer for the baseline compiler is informed by our lean into the stack machine principle. Multi-value returns are allocated in such a way that a capture() can push them onto the virtual stack. Because spilled values must precede registers, we therefore allocate older results on the stack, and put the last result in a register (or register pair for i64 on 32-bit platforms). Note that it's possible in theory to allocate multiple results to registers; we'll touch on this next week.

Therefore the implementation of return(height, v1..vn) is straightforward: we first pop register results, then spill the remaining virtual stack items, then shuffle stack results down towards height. This should result in a memmove of contiguous stack results towards the frame pointer. However because const values aren't present on the machine stack, depending on the stack height difference, it may mean a split between moving some values toward the frame pointer and some towards the stack pointer, then filling in by spilling constants. It's gnarly, but it is what it is. Note that the links to the return and capture implementations above are to the post-multi-value world, so you can see all the details there.

that's it!

In summary, the hard part of multi-value blocks was reworking internal compiler data structures to be able to represent multi-value block types, and then figuring out the low-level stack manipulations in the baseline compiler. The optimizing compiler on the other hand was pretty easy.

When it comes to calls though, that's another story. We'll get to that one next week. Thanks again to Bloomberg for supporting this work; I'm really delighted that Igalia and Bloomberg have been working together for a long time (coming on 10 years now!) to push the web platform forward. A special thanks also to Mozilla's Lars Hansen for his patience reviewing these patches. Until next week, then, stay at home & happy hacking!

Andy Wingo: multi-value webassembly in firefox: a binary interface


Hey hey hey! Hope everyone is staying safe at home in these weird times. Today I have a final dispatch on the implementation of the multi-value feature for WebAssembly in Firefox. Last week I wrote about multi-value in blocks; this week I cover function calls.

on the boundaries between things

In my article on Firefox's baseline compiler, I mentioned that all WebAssembly engines in web browsers treat the function as the unit of compilation. This facilitates streaming, parallel compilation of WebAssembly modules, by farming out compilation of individual functions to worker threads. It also allows for easy tier-up from quick-and-dirty code generated by the low-latency baseline compiler to the faster code produced by the optimizing compiler.

There are some interesting Conway's Law implications of this choice. One is that division of compilation tasks becomes an opportunity for division of human labor; there is a whole team working on the experimental Cranelift compiler that could replace the optimizing tier, and in my hackings on Firefox I have had minimal interaction with them. To my detriment, of course; they are fine people doing interesting things. But the code boundary means that we don't need to communicate as we work on different parts of the same system.

Boundaries are where places touch, and sometimes for fluid crossing we have to consider boundaries as places in their own right. Functions compiled with the baseline compiler, with Ion (the production optimizing compiler), and with Cranelift (the experimental optimizing compiler) are all able to call each other because they actively maintain a common boundary, a binary interface (ABI). (Incidentally the A originally stands for "application", essentially reflecting division of labor between groups of people making different components of a software system; Conway's Law again.) Let's look closer at this boundary-place, with an eye to how it changes with multi-value.

what's in an ABI?

Among other things, an ABI specifies a calling convention: which arguments go in registers, which on the stack, how the stack values are represented, how results are returned to the callers, which registers are preserved over calls, and so on. Intra-WebAssembly calls are a closed world, so we can design a custom ABI if we like; that's what V8 does. Sometimes WebAssembly may call functions from the run-time, though, and so it may be useful to be closer to the C++ ABI on that platform (the "native" ABI); that's what Firefox does. (Incidentally here I think Firefox is probably leaving a bit of performance on the table on Windows by using the inefficient native ABI that only allows four register parameters. I haven't measured though so perhaps it doesn't matter.) Using something closer to the native ABI makes debugging easier as well, as native debugger tools can apply more easily.

One thing that most native ABIs have in common is that they are really only optimized for a single result. This reflects their heritage as artifacts from a world built with C and C++ compilers, where there isn't a concept of a function with more than one result. If multiple results are required, they are represented instead as arguments, typically as pointers to memory somewhere. Consider the AMD64 SysV ABI, used on Unix-derived systems, which carefully specifies how to pass arbitrary numbers of arbitrary-sized data structures to a function (§3.2.3), while only specifying what to do for a single return value. If the return value is too big for registers, the ABI specifies that a pointer to result memory be passed as an argument instead.
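
As a small illustration of that last point (my own example, not taken from the ABI document):

#include <cstdint>

// Illustrative only: under the AMD64 SysV ABI, a struct this large (32 bytes)
// does not fit in the return registers, so the caller allocates space for it
// and passes a hidden pointer to that space as an implicit argument.
struct BigResult {
  int64_t a, b, c, d;
};

BigResult make_big() {
  // In the generated code, the four stores go through the hidden result
  // pointer, and that pointer is also returned in %rax per the ABI.
  return BigResult{1, 2, 3, 4};
}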

So in a multi-result WebAssembly world, what are we to do? How should a function return multiple results to its caller? Let's assume that there are some finite number of general-purpose and floating-point registers devoted to return values, and that if the return values will fit into those registers, then that's where they go. The problem is then to determine which results will go there, and if there are remaining results that don't fit, then we have to put them in memory. The ABI should indicate how to address that memory.

When looking into a design, I considered three possibilities.

first thought: stack results precede stack arguments

When a function needs some of its arguments passed on the stack, it doesn't receive a pointer to those arguments; rather, the arguments are placed at a well-known offset to the stack pointer.

We could do the same thing with stack results, either reserving space deeper on the stack than stack arguments, or closer to the stack pointer. With the advent of tail calls, it would make more sense to place them deeper on the stack. Like this:

The diagram above shows the ordering of stack arguments as implemented by Firefox's WebAssembly compilers: later arguments are deeper (farther from the stack pointer). It's an arbitrary choice that happens to match up with what the native ABIs do, as it was easier to re-use bits of the already-existing optimizing compiler that way. (Native ABIs use this stack argument ordering because of sloppiness in a version of C from before I was born. If you were starting over from scratch, probably you wouldn't do things this way.)

Stack result order does matter to the baseline compiler, though. It's easier if the stack results are placed in the same order in which they would be pushed on the virtual stack, so that when the function completes, the results can just be memmove'd down into place (if needed). The same concern dictates another aspect of our ABI: unlike calls, registers are allocated to the last results rather than the first results. This is to make it easy to preserve stack invariant (1) from the previous article.

At first I thought this was the obvious option, but I ran into problems. It turns out that stack arguments are fundamentally unlike stack results in some important ways.

While a stack argument is logically consumed by a call, a stack result starts life with a call. As such, if you reserve space for stack results just by decrementing the stack pointer before a call, probably you will need to load the results eagerly into registers thereafter or shuffle them into other positions to be able to free the allocated stack space.

Eager shuffling is busy-work that should be avoided if possible. It's hard to avoid in the baseline compiler. For example, a call to a function with 10 arguments will consume 10 values from the temporary stack; any results will be pushed on after removing argument values from the stack. If there are any stack results, it's almost impossible to avoid a post-call memmove, to move stack results to where they should be before the 10 argument values were pushed on (and probably spilled). So the baseline compiler case is not optimal.

However, things get gnarlier with the Ion optimizing compiler. Like many other optimizing compilers, Ion is designed to compute the necessary stack frame size ahead of time, and to never move the stack pointer during an activation. The only exception is for pushing on any needed stack arguments for nested calls (which are popped directly after the nested call). So in that case, assuming there are a number of multi-value calls in a stack frame, we'll be shuffling in the optimizing compiler as well. Not great.

Besides the need to shuffle, stack arguments and stack results differ as regards ownership and garbage collection. A callee "owns" the memory for its stack arguments; it is responsible for them. The caller can't assume anything about the contents of that memory after a call, especially if the WebAssembly implementation supports tail calls (a whole 'nother blog post, that). If the values being passed are just bits, that's one thing, but with the reference types proposal, some result values may be managed by the garbage collector. The callee is responsible for making stack arguments visible to the garbage collector; the caller is responsible for the results. The caller will need to emit metadata to allow the garbage collector to see stack result references. For this reason, a stack result actually starts life just before a call, because it can become initialized at any point and thus needs to be traced during the entire callee activation. Not all callers can easily add garbage collection roots for writable stack slots, so the need to place stack results in a fixed position complicates calling multi-value WebAssembly functions in some cases (e.g. from C++).

second thought: pointers to individual stack results

Surely there are more well-trodden solutions to the multiple-result problem. If we encoded a multi-value return in C, how would we do it? Consider a function in C that has three 64-bit integer results. The idiomatic way to encode it would be to have one of the results be the return value of the function, and the two others to be passed "by reference":

int64_t foo(int64_t* a, int64_t* b) {
  *a = 1;
  *b = 2;
  return 3;
}
void call_foo(void) {
  int64_t a, b, c;
  c = foo(&a, &b);
}

This program shows us a possibility for encoding WebAssembly's multiple return values: pass an additional argument for each stack result, pointing to the location to which to write the stack result. Like this:

The result pointers are normal arguments, subject to normal argument allocation. In the above example, given that there are already stack arguments, they will probably be passed on the stack, but in many cases the stack result pointers may be passed in registers.

The result locations themselves don't even need to be on the stack, though they certainly will be in intra-WebAssembly calls. However the ability to write to any memory is a useful form of flexibility when e.g. calling into WebAssembly from C++.

The advantage of this approach is that we eliminate post-call shuffles, at least in optimizing compilers. But, having to make an argument for each stack result, each of which might itself become a stack argument, seems a bit offensive. I thought we might be able to do a little better.

third thought: stack result area, passed as pointer

Given that stack results are going to be written to memory, it doesn't really matter where they will be written, from the perspective of the optimizing compiler at least. What if we allocated them all in a block and just passed one pointer to the block? Like this:

Here there's just one additional argument, no matter how many stack results. While we're at it, we can specify that the layout of the stack arguments should be the same as how they would be written to the baseline stack, to make the baseline compiler's job easier.
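
For intuition, the third approach corresponds roughly to the following hand-written C++ (illustrative only; the struct layout and names are assumptions): the last result travels in the return register, and all earlier results are written through a single pointer to one caller-allocated block.

#include <cstdint>

// Hypothetical hand-encoding of a function with three i64 results under the
// "stack result area" scheme: one pointer argument for the whole area.
struct StackResultArea {
  int64_t a;
  int64_t b;
};

int64_t foo(StackResultArea* results) {
  results->a = 1;
  results->b = 2;
  return 3;  // the register result
}

void call_foo() {
  StackResultArea area;  // one block, typically on the caller's stack
  int64_t c = foo(&area);
  // area.a == 1, area.b == 2, c == 3
  (void)c;
}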

As I started implementation with the baseline compiler, I chose this third approach, essentially because I was already allocating space for the results in a block in this way by bumping the stack pointer.

When I got to the optimizing compiler, however, it was quite difficult to convince Ion to allocate an area on the stack of the right shape.

Looking back on it now, I am not sure that I made the right choice. The thing is, the IonMonkey compiler started life as an optimizing compiler for JavaScript. It can represent unboxed values, which is how it came to be used as a compiler for asm.js and later WebAssembly, and it does a good job on them. However it has never had to represent aggregate data structures like a C++ class, so it didn't have support for spilling arbitrary-sized data to the stack. It took a while staring at the register allocator to convince it to allocate arbitrary-sized stack regions, and then to allocate component scalar values out of those regions. If I had just asked the register allocator to give me one appropriate-sized stack slot for each scalar, and hacked out the ability to pass separate pointers to the stack slots to WebAssembly calls with stack results, then I would have had an easier time of it, and perhaps stack slot allocation could be more dense because multiple results wouldn't need to be allocated contiguously.

As it is, I did manage to hack it in, and I think in a way that doesn't regress. I added a layer over an argument type vector that adds a synthetic stack results pointer argument, if the function returns stack results; iterating over this type with ABIArgIter will allocate a stack result area pointer, either as a register argument or a stack argument. In the optimizing compiler, I added a kind of value allocation corresponding to a variable-sized stack area (using pointer tagging again!), and extended the register allocator to allocate LStackArea, and the component stack results. Interestingly, I had to add a kind of definition that starts life on the stack; previously all Ion results started life in registers and were only spilled if needed.

In the end, a function will capture the incoming stack result area argument, either as a normal SSA value (for Ion) or stored to a stack slot (baseline), and when returning will write stack results to that pointer as appropriate. Passing in a pointer as an argument did make it relatively easy to implement calls from WebAssembly to and from C++; getting the variable-shape result area known to the garbage collector for C++-to-WebAssembly calls was simple in the end, but took me a while to figure out.

Finally I was a bit exhausted from multi-value work and ready to walk away from the "JS API", the bit that allows multi-value WebAssembly functions to be called from JavaScript (they return an array) or for a JavaScript function to return multiple values to WebAssembly (via an iterable) -- but then when I got to thinking about this blog post I preferred to implement the feature rather than document its lack. Avoidance-of-document-driven development: it's a thing!

towards deployment

As I said in the last article, the multi-value feature is about improved code generation and also making a more capable base for expressing further developments in the WebAssembly language.

As far as code generation goes, things are progressing but it is still early days. Thomas Lively has implemented support in LLVM for emitting return of C++ aggregates via multiple results, which is enabled via the -experimental-multivalue-abi cc1 flag. Thomas has also been implementing multi-value support in the binaryen WebAssembly toolchain component, used by the emscripten C++-to-WebAssembly toolchain. I think it will be a few months though before everything lands in a way that end users can take advantage of.

On the specification side, the multi-value feature is now at phase 4 since January, which basically means things are all done there.

Implementation-wise, V8 has had experimental support since 2017 or so, and the feature was staged last fall, although V8 doesn't yet support multi-value in their baseline compiler. WebKit also landed support last fall.

Unlike V8 and SpiderMonkey, JavaScriptCore (the JS and wasm engine in WebKit) actually implements a WebAssembly interpreter as their solution to the one-pass streaming compilation problem. Then on the compiler side, there are two tiers that both operate on basic block graphs (OMG and BBQ; I just puked a little in my mouth typing that). This strategy makes the compiler implementation quite straightforward. It's also an interesting design point because JavaScriptCore's garbage collector scans the stack conservatively; there's no need for the compiler to do bookkeeping on the GC's behalf, which I'm sure was a relief to the hacker. Anyway, multi-value in WebKit is done too.

The new thing of course is that finally, in Firefox, the feature is now fully implemented (woo) and enabled by default on Nightly builds (woo!). I did that! It took me a while! Perhaps too long? Anyway it's done. Thanks again to Bloomberg for supporting this work; large ups to y'all for helping the web move forward.

See you next time with a more general article rounding up compile-time benchmarks on a variety of WebAssembly implementations. Until then, happy hacking!

Andy Wingo: understanding webassembly code generation throughput


Greets! Today's article looks at browser WebAssembly implementations from a compiler throughput point of view. As I wrote in my article on Firefox's WebAssembly baseline compiler, web browsers have multiple wasm compilers: some that produce code fast, and some that produce fast code. Implementors are willing to pay the cost of having multiple compilers in order to satisfy these conflicting needs. So how well do they do their jobs? Why bother?

In this article, I'm going to take the simple path and just look at code generation throughput on a single chosen WebAssembly module. Think of it as X-ray diffraction to expose aspects of the inner structure of the WebAssembly implementations in SpiderMonkey (Firefox), V8 (Chrome), and JavaScriptCore (Safari).

experimental setup

As a workload, I am going to use a version of the "Zen Garden" demo. This is a 40-megabyte game engine and rendering demo, originally released for other platforms, and compiled to WebAssembly a couple years later. Unfortunately the original URL for the demo was disabled at some point in late 2019, so it no longer has a home on the web. A bit of a weird situation and I am not clear on licensing either. In any case I have a version downloaded, and have hacked out a minimal set of "imports" that the WebAssembly module needs from the host to allow the module to compile and link when run from a JavaScript shell, without requiring WebGL and similar facilities. So the benchmark is just to instantiate a WebAssembly module from the 40-megabyte byte array and see how long it takes. It would be better if I had more test cases (and would be happy to add them to the comparison!) but this is a start.

I start by benchmarking the various WebAssembly implementations, firstly in their standard configuration and then setting special run-time flags to measure the performance of the component compilers. I run these tests on the core-rich machine that I use for browser development (2 Xeon Silver 4114 CPUs for a total of 40 logical cores). The default-configuration numbers are therefore not indicative of performance on a low-end Android phone, but we can use them to extract aspects of the different implementations.

Since I'm interested in compiler throughput, I'm not particularly concerned about how well a compiler will use all 40 cores. Therefore when testing the specific compilers I will set implementation-specific flags to disable parallelism in the compiler and GC: --single-threaded on V8, --no-threads on SpiderMonkey, and --useConcurrentGC=false --useConcurrentJIT=false on JSC. To further restrict any threads that the implementation might decide to spawn, I'll bind these to a single core on my machine using taskset -c 4. Otherwise the machine is in its normal configuration (nothing else significant running, all cores available for scheduling, turbo boost enabled).

I'll express results in nanoseconds per WebAssembly code byte. Of the 40 megabytes or so in the Zen Garden demo, only 23 891 164 bytes are actually function code; the rest is mostly static data (textures and so on). So I'll divide the total time by this code byte count.

I tested V8 at git revision 0961376575206, SpiderMonkey at hg revision 8ec2329bef74, and JavaScriptCore at subversion revision 259633. The benchmarks can be run using just a shell; see the pull request. I timed how long it took to instantiate the Zen Garden demo, ensuring that a basic export was callable. I collected results from 20 separate runs, sleeping a second between them. The bars in the charts below show the median times, with a histogram overlay of all results.

results & analysis

We can see some interesting results in this graph. Note that the Y axis is logarithmic. The "concurrent tiering" results in the graph correspond to the default configurations (no special flags, no taskset, all cores available).

The first interesting conclusions that pop out for me concern JavaScriptCore, which is the only implementation to have a baseline interpreter (run using --useWasmLLInt=true --useBBQJIT=false --useOMGJIT=false). JSC's WebAssembly interpreter is actually structured as a compiler that generates custom WebAssembly-specific bytecode, which is then run by a custom interpreter built using the same infrastructure as JSC's JavaScript interpreter (the LLInt). Directly interpreting WebAssembly might be possible as a low-latency implementation technique, but since you need to validate the WebAssembly anyway and eventually tier up to an optimizing compiler, apparently it made sense to emit fresh bytecode.

The part of JSC that generates baseline interpreter code runs slower than SpiderMonkey's baseline compiler, so one is tempted to wonder why JSC bothers to go the interpreter route; but then we recall that on iOS, we can't generate machine code in some contexts, so the LLInt does appear to address a need.

One interesting feature of the LLInt is that it allows tier-up to the optimizing compiler directly from loops, which neither V8 nor SpiderMonkey support currently. Failure to tier up can be quite confusing for users, so good on JSC hackers for implementing this.

Finally, while baseline interpreter code generation throughput handily beats V8's baseline compiler, it would seem that something in JavaScriptCore is not adequately taking advantage of multiple cores; if one core compiles at 51ns/byte, why do 40 cores only do 41ns/byte? It could be my tests are misconfigured, or it could be that there's a nice speed boost to be found somewhere in JSC.

JavaScriptCore's baseline compiler (run using --useWasmLLInt=false --useBBQJIT=true --useOMGJIT=false) runs much more slowly than SpiderMonkey's or V8's baseline compiler, which I think can be attributed to the fact that it builds a graph of basic blocks instead of doing a one-pass compile. To me these results validate SpiderMonkey's and V8's choices, looking strictly from a latency perspective.

I don't have graphs for code generation throughput of JavaScriptCore's optimizing compiler (run using --useWasmLLInt=false --useBBQJIT=false --useOMGJIT=true); it turns out that JSC wants one of the lower tiers to be present, and will only tier up from the LLInt or from BBQ. Oh well!

V8 and SpiderMonkey, on the other hand, are much of the same shape. Both implement a streaming baseline compiler and an optimizing compiler; for V8, we get these via --liftoff --no-wasm-tier-up or --no-liftoff, respectively, and for SpiderMonkey it's --wasm-compiler=baseline or --wasm-compiler=ion.

Here we should conclude directly that SpiderMonkey generates code around twice as fast as V8 does, in both tiers. SpiderMonkey can generate machine code faster even than JavaScriptCore can generate bytecode, and optimized machine code faster than JSC can make baseline machine code. It's a very impressive result!

Another conclusion concerns the efficacy of tiering: for both V8 and SpiderMonkey, their baseline compilers run more than 10 times as fast as the optimizing compiler, and the same ratio holds between JavaScriptCore's baseline interpreter and compiler.

Finally, it would seem that the current cross-implementation benchmark for lowest-tier code generation throughput on a desktop machine would then be around 50 ns per WebAssembly code byte for a single core, which corresponds to receiving code over the wire at somewhere around 160 megabits per second (Mbps). If we add in concurrency and manage to farm out compilation tasks well, we can obviously double or triple that bitrate. Optimizing compilers run at least an order of magnitude slower. We can conclude that to the desktop end user, WebAssembly compilation time is indistinguishable from download time for the lowest tier. The optimizing tier is noticeably slower though, running at more like 10-15 Mbps per core, so time-to-tier-up is still a concern for faster networks.
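
As a quick sanity check on that conversion (my arithmetic, not a number taken from any implementation):

$1~\textrm{byte} / 50~\textrm{ns} = 2 \times 10^{7}~\textrm{bytes/s} = 1.6 \times 10^{8}~\textrm{bits/s} = 160~\textrm{Mbps}$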

Going back to the question posed at the start of the article: yes, tiering shows a clear benefit in terms of WebAssembly compilation latency, letting users interact with web sites sooner. So that's that. Happy hacking and until next time!

Sebastian Pölsterl: scikit-survival 0.12 Released


Version 0.12 of scikit-survival adds support for scikit-learn 0.22 and Python 3.8 and comes with two noticeable improvements:

  1. sklearn.pipeline.Pipeline will now be automatically patched to add support for predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it (see first example).
  2. The regularization strength of the ridge penalty in sksurv.linear_model.CoxPHSurvivalAnalysis can now be set per feature (see second example).

For a full list of changes in scikit-survival 0.12, please see the release notes.

Pre-built conda packages are available for Linux, macOS, and Windows via

 conda install -c sebp scikit-survival

Alternatively, scikit-survival can be installed from source via pip:

 pip install -U scikit-survival

Using pipelines

You can now create a scikit-learn pipeline and directly call predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it, such as RandomSurvivalForest below.

from sklearn.pipeline import make_pipeline
from sksurv.datasets import load_breast_cancer
from sksurv.ensemble import RandomSurvivalForest
from sksurv.preprocessing import OneHotEncoder
X, y = load_breast_cancer()
pipe = make_pipeline(OneHotEncoder(), RandomSurvivalForest())
pipe.fit(X, y)
surv_fn = pipe.predict_survival_function(X)

Per-feature regularization strength

If you want to fit Cox’s proportional hazards model to a large set of features, but only shrink the coefficients for a subset of features, previously, you had to use CoxnetSurvivalAnalysis and set the penalty_factor parameter accordingly. This release adds a similar option to CoxPHSurvivalAnalysis, which only uses ridge regression.

For instance, consider the breast cancer data, which comprises 4 established markers (age, tumor size, tumor grade, and estrogen receptor status) and 76 genetic markers. It is sensible to fit a model where the established markers enter unpenalized and only the coefficients of the genetic markers get penalized. We can achieve this by creating an array for the regularization strength $\alpha$ where the entries corresponding to the established markers are zero.

import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
X, y = load_breast_cancer()
# the last 4 features are: age, er, grade, size
num_genes = X.shape[1] - 4
# add 2, because after one-hot encoding grade becomes three features
alphas = np.ones(X.shape[1] + 2)
# do not penalize established markers
alphas[num_genes:] = 0.0
# fit the model
pipe = make_pipeline(OneHotEncoder(), CoxPHSurvivalAnalysis(alpha=alphas))
pipe.fit(X, y)

Christian Schaller: A bold new chapter for Fedora Workstation


So you have probably seen the announcement that Lenovo are launching a set of Fedora Workstation based laptops. I am so happy and proud of this effort as it comes as the culmination of our hard effort over the last 6 years to drain the swamp and make Linux a more viable desktop operating system.
I am also so happy and proud that Lenovo was willing to work with us on this effort, as they provide us with an incredible opportunity to reach both new and old Linux users around the globe with these systems, being the world's biggest laptop maker with the widest global reach. One important aspect of this is that Lenovo will provide these laptops through all their sales channels in all their markets. This means you can of course order them online through their website, but it also means companies can order them through Lenovo's business-to-business channels, and it means that in any country where Lenovo is present you can order them. So this is not a North America-only or Europe-only offering; this is truly a global one.

A lot of people have been involved in helping to make this happen, but special thanks goes to Egbert Gracias from Lenovo, who was critical to getting this done, and also to Alberto Ruiz, who spearheaded the effort from our side.

Our engineering team here at Red Hat has also been hard at work ensuring we can support these models very well, be that through bugfixes to kernel drivers or by polishing up things like the Linux fingerprint support. As we go forward we hope to build on this relationship to take Linux laptops to the next level, and I am also very happy to say that we now have Jared Dominguez on the team to help us develop better work practices and closer relationships with our hardware partners and original device manufacturers.


Also a special thanks to Jakub Steiner for putting together the little sizzle video above; it was supposed to be used at our booth at Red Hat Summit next week, but with that going virtual we repurposed it for this announcement.

Seungha Yang: Windows DXVA2 (via Direct3D 11) Support in GStreamer 1.17


DXVA2 based hardware accelerated decoding is now supported on Windows, as of GStreamer 1.17.

This is a list of supported codecs for now

  • H.264 (d3d11h264dec)
  • HEVC (d3d11h265dec)
  • VP9 (d3d11vp9dec)
  • VP8 (d3d11vp8dec)

What should I do to use them?

Indeed, no special steps or extra dependencies are required to build these new elements.

The above-listed new decoder elements are part of the d3d11 plugin in GStreamer. The plugin doesn't require any special build-time dependencies or libraries, as everything is already provided by the Windows SDK. Once it has been built, the only requirement is that your hardware (i.e., your GPU) supports hardware decoding.

NOTE: This is a hardware decoding feature, so if the VM does not provide a way to pass-through the GPU, it will not work inside the VM.

When you run gst-inspect-1.0 it will show a list of available decoder elements. This is an example of what you might see with gst-inspect-1.0:

[gst-master] PS C:\Work\gst-build> gst-inspect-1.0.exe d3d11
Plugin Details:
  Name                     d3d11
  Description              Direct3D11 plugin
  Filename                 C:\Work\GST-BU~1\build\SUBPRO~1\GST-PL~3\sys\d3d11\gstd3d11.dll
  Version                  1.17.0.1
  License                  LGPL
  Source module            gst-plugins-bad
  Binary package           GStreamer Bad Plug-ins git
  Origin URL               Unknown package origin

d3d11vp8dec: Direct3D11 VP8 Intel(R) Iris(R) Plus Graphics Decoder
d3d11vp9dec: Direct3D11 VP9 Intel(R) Iris(R) Plus Graphics Decoder
d3d11h265dec: Direct3D11 H.265 Intel(R) Iris(R) Plus Graphics Decoder
d3d11h264dec: Direct3D11 H.264 Intel(R) Iris(R) Plus Graphics Decoder
d3d11videosink: Direct3D11 video sink bin
d3d11videosinkelement: Direct3D11 video sink
d3d11colorconvert: Direct3D11 Colorspace converter
d3d11download: Direct3D11 downloader
d3d11upload: Direct3D11 uploader

The output might be slightly different in your case. For instance, the device name might be something different than “Intel(R) Iris(R) Plus Graphics”. That’s expected :) It will vary based on your hardware vendor and device naming.

Also, if the list doesn’t contain elements for some codecs (for instance, d3d11vp8dec), it’s very likely that your hardware doesn’t support decoding the codec (for example, some Nvidia GPUs don’t support VP8 decoding).

Moreover, if you have multiple GPUs on your device, you will see separate per-GPU decoder elements, with longer names for instance d3d11h264device1dec or so.

Why do I need new D3D11 decoders?

GStreamer already ships with two vendor-specific decoder implementations: one is the Nvidia (aka NVCODEC) plugin and the other is the Intel MSDK plugin. So what’s the benefit of this new implementation?

The main advantage of vendor-specific APIs is that they are expected to perform better than generic APIs like DXVA2. In most cases that is true, but not always: performance and reliability can vary depending on how well the API is integrated into a framework. Moreover, not just the decoder implementation itself, but how the media pipeline is configured in an application is a very important factor for performance and reliability.

In summary, the strengths of the new d3d11 decoders are:

  • Zero-copy playback with d3d11videosink
  • Vendor-independent implementation
  • UWP support

Zero-copy playback

On Windows, both NVCODEC and MSDK plugins will copy decoded data into a new memory space (cuda, gl, or sysmem depending on the details), which consumes more memory and will make applications slower.

However, when d3d11 decoder elements are configured and at the same time d3d11videosink element is selected for rendering, the decoded data will be passed to d3d11videosink without any copy operation.

The only memcpy-like operation will be color space conversion (YUV to RGB format), but it's often an unavoidable operation, because YUV is not supported as the render format by most renderers (Windows DirectComposition seems to support it, but that's a special case).
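
To make the pairing concrete, here is a minimal sketch of a playback pipeline in which a d3d11 decoder feeds d3d11videosink directly (my own example, not from this post; the file name and the qtdemux/h264parse front end are assumptions and must of course match your input file).

#include <gst/gst.h>

int main(int argc, char** argv) {
  gst_init(&argc, &argv);

  // d3d11h264dec ! d3d11videosink keeps decoded frames in GPU memory,
  // which is the zero-copy path described above.
  GError* error = nullptr;
  GstElement* pipeline = gst_parse_launch(
      "filesrc location=sample.mp4 ! qtdemux ! h264parse "
      "! d3d11h264dec ! d3d11videosink",
      &error);
  if (!pipeline) {
    g_printerr("Failed to build pipeline: %s\n", error->message);
    g_clear_error(&error);
    return 1;
  }

  gst_element_set_state(pipeline, GST_STATE_PLAYING);

  // Block until end-of-stream or an error, then shut down.
  GstBus* bus = gst_element_get_bus(pipeline);
  GstMessage* msg = gst_bus_timed_pop_filtered(
      bus, GST_CLOCK_TIME_NONE,
      static_cast<GstMessageType>(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
  if (msg)
    gst_message_unref(msg);
  gst_object_unref(bus);

  gst_element_set_state(pipeline, GST_STATE_NULL);
  gst_object_unref(pipeline);
  return 0;
}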

Vendor-independent implementation

This is a very useful aspect of this new implementation. Due to the fact that DXVA2 and D3D11 are standard APIs on Windows and provided by the OS, in theory NO vendor-specific consideration is needed (in reality, app-specific workarounds would likely be needed due to buggy vendor-specific driver behavior).

So there would be no reason for the application to write hardware-specific code in the general case. Moreover, these new elements also work on AMD GPUs since we previously only supported Intel and Nvidia on Windows. They should also work with other GPUs supported by Windows 10 such as Qualcomm, but I have not tested that yet.

UWP support

When running on UWP (Universal Windows Platform), most (possibly all?) hardware-specific operations are required to be handled via the native Windows graphics layer such as Direct3D11/12. Due to this, these new d3d11 decoder elements are a requirement for hardware decoding of video on UWP. When I tested this with a UWP application on my laptop, it worked quite well (as I expected)!

Note that UWP is not officially supported by GStreamer yet, but I am expecting it will be possible soon thanks to the efforts of Nirbheek Chauhan who is an active GStreamer maintainer and also maintains Cerbero, the build system of GStreamer. Related to UWP, a very interesting talk from him is available here: https://gstconf.ubicast.tv/videos/gstreamer-windows-uwp-and-firefox-on-the-hololens-2/

We’re excited to see new Windows specific features coming to GStreamer more and more. Stay tuned for more news! :)

Christian Schaller: Fedora Workstation : Swamp draining for 6 years


As Fedora Workstation 32 was released today, I ended up looking back at our efforts to drain the swamp over the last 6 years. In April of 2014 I wrote a blog post outlining our vision for the Fedora Workstation effort and what we wanted to achieve with it. I hadn't looked at that blog post in years, but it was interesting going back to it and realizing that while some of the details have changed, it is still the vision we are pursuing today: to keep draining the swamp and make Fedora Workstation a top notch operating system for developers and makers in general. Which I guess is one of the hallmarks of a decent vision, that it allows for the details to change without invalidating it.

One of my pet peeves at the time with Linux as a desktop operating system was that so many of the so-called efforts to make Linux user friendly were essentially duct-taping over the problems, creating fragile solutions that often made it harder for us to really move forward. In the years since, we addressed a lot of major swamp issues with our efforts around HiDPI & Bolt (getting ahead of hardware enablement for new monitors and Thunderbolt devices respectively), Flatpaks, GNOME Software and AppStream (making applications discoverable, deployable and maintainable), Wayland (making your desktop secure and future proof), LVFS and firmware handling (making them easily available for Linux users), the fingerprint reader standard (ensuring your hardware is fully supported) and coming up with ways to improve the lives of developers with improvements to the terminal or Fedora Toolbox, our developer pet container tool.

Working on these and other issues, we realized early on that a model where hardware gets enabled in a reactive manner, in response to new laptops being sold, was never going to yield a good result for our users. As long as we followed that model, people were bound to hit issues with laptops as they came out and then have to deal with those issues for the first 6-12 months of a laptop's life. This is why I am so excited about our new partnership with Lenovo that we pre-announced on Friday, as it is both the culmination of our efforts over the last 6 years and the starting point of a new era in terms of how we work with hardware makers. Instead of spending a ton of time trying to reverse engineer basic drivers, we can now rely on our hardware partner and their component vendors to provide them, and we can instead focus on what I call high-level hardware enablement. Meaning that as we see new features coming into laptops and computers, we can try to improve the infrastructure in the operating system to be able to take full advantage of said hardware, and we can do so in collaboration with the hardware makers, knowing that once we provide the infrastructure they will provide drivers and similar pieces that fit into it. Our work on fingerprint readers and Thunderbolt support, for instance, are two great early examples of that.

Anyway, you are probably interested to know some of the new things coming in Fedora Workstation 32, so here are some of my personal highlights:

New lock screen

This is more of a cosmetic change, but one that every user will see upon logging into their Fedora system after a new install or upgrade. The new design features a faded version of your desktop background image, and it should also feel smoother as the password dialog now appears on the lock screen page, as opposed to before where it sort of replaced it. The dialog also informs you more discreetly than before if you're trying to type in the password while the lock screen is on. A big thanks to Allan Day and the GNOME design team for their work here polishing this part of the user interface.

GNOME extension app

GNOME Shell extensions are little tweaks and additional features for the desktop that our users have gotten accustomed to and enjoy greatly. Extensions are also the technology that powers the GNOME Classic session that provides those of our users who want it with a more traditional desktop experience. GNOME Shell extensions have gradually evolved in how we work with them since their inception, going from something you install through your web browser to now being handled through GNOME Software. With Fedora Workstation 32 we are making the new GNOME Shell extensions management app available as the next step in the evolution of GNOME Shell extensions, making it simple to turn any given extension on or off, or to quickly see which extensions you have installed.

The GNOME Extensions handling app
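If you prefer the command line, the same operations can also be scripted. Here is a minimal sketch, assuming the gnome-extensions command-line tool that ships with recent GNOME Shell releases; the extension UUID below is a made-up placeholder:

  # list the extensions installed for your user
  gnome-extensions list
  # turn a given extension on or off by its UUID (placeholder UUID)
  gnome-extensions enable example-extension@example.org
  gnome-extensions disable example-extension@example.org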

Fedora Toolbox

Fedora Toolbox is our helper for making working with containers for development and testing as easy as it possibly can be. Debarshi Ray and Ondřej Míchal have been hard at work porting Fedora Toolbox from shell script to Go for this release. For those wondering why we chose Go as the language, there were basically two reasons. One, we felt that the toolbox had gone as far as it could as a shell script; and two, Go is the language used by all the components we rely on and interact with in the container space, like buildah and podman. We also wanted to make it easy for developers on those projects to contribute by using the same language they use in their own projects.

Fedora Toolbox running on Fedora Workstation 32
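For those who have not tried it yet, the typical workflow looks roughly like this (a sketch only; the package names are just examples):

  # create the default toolbox container matching your Fedora release
  toolbox create
  # enter it and get a shell inside the container, with your home directory available
  toolbox enter
  # inside the container, install development packages without touching the host
  sudo dnf install gcc make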

Performance improvements

Another area we always try to give some love is general performance. This time around, Christian Hergert identified some really bad behavior in GNOME Shell when running on a system under very high I/O. On the face of it, GNOME Shell didn't look like it should have been affected, but during some intensive debugging sessions Christian discovered that I/O was being triggered by various API calls doing things like string translation. So he put together a set of patches to resolve the high I/O stalls, and can now report that GNOME Shell keeps running smooth as silk even under high disk I/O.

PipeWire

Wim Taymans keeps making great strides forward with PipeWire, our tool for creating a unified media handler for audio, pro-audio and video. In Fedora Workstation 32 we will be shipping the 0.3 version, which has quite complete JACK support. In fact we are hoping to team up with the Fedora Jam team to finalize the JACK support during the Fedora 32 lifecycle by testing it extensively. We already have a lot of JACK apps working with PipeWire, including a series of important JACK apps that we have put into Flatpaks in Fedora, like Carla. While the support is there in PipeWire in Fedora 32 right now, there is some convenience work we still need to do, but we hope to get that pushed out by next week so that replacing JACK with PipeWire becomes very simple to both do and undo for testing purposes.
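If you want to experiment already, one way to point an existing JACK client at PipeWire is the pw-jack wrapper that PipeWire provides; treat this as a sketch rather than the polished setup mentioned above, and substitute your own JACK application for the placeholder name:

  # run a JACK client against PipeWire's JACK replacement library
  pw-jack my-jack-app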

The PulseAudio support is the last piece that is still in progress. It works for simple music playback, but it is not a drop-in replacement for PulseAudio yet, so while we had hoped to encourage widespread testing in F32, we will aim to delay that to F33 in order to polish the PulseAudio support further first. Once it is ready we will make it available for testing in a simple manner, just like the JACK support.

There has also been further work on the video side of PipeWire, adding support for zero-copy video capture. This has reduced the overhead of things like screen capturing significantly and should be a nice performance and resource-usage improvement for everyone.

Firefox on Wayland

Martin Stransky and Jan Horak have been working hard to improve how Firefox runs and works as a native Wayland application, fixing a truckload of bigger and smaller bugs this cycle. We feel that we have now turned the corner, with the Wayland version being just as stable and good as the X11 one. In fact we could move beyond just fixing bugs to actually adding features this time around: for instance, Martin Stransky worked on WebGL hardware acceleration support, enabling us to have that turned on by default for the first time. We also made sure to take advantage of the PipeWire zero-copy support to improve video conferencing applications running under Firefox, which turned out to be even more important than we expected considering Covid-19 has everyone working from home.

Looking forward

We spent a lot of time and energy over the last 6 years to get to where we are now, putting in place a lot of the basic building blocks needed to make Linux a great desktop operating system. And it feels great that just as we kick off the new line of Lenovo laptops running Fedora, we are also entering a new phase of development where we can move beyond getting our basic infrastructure in place and really start taking advantage of it to rapidly improve the experience we are providing. A good example is the Firefox work mentioned above, where we could finally move on from 'make it work with Wayland and PipeWire' to 'let's take advantage of these new pieces to make Firefox on Linux better'. Another example is that Adam Jackson is currently investigating how we can improve how Fedora Workstation performs for remote usage. This work includes looking at things like VNC, RDP and commercial offerings, and figuring out how we can make our stack work better with such tools, on top of the improvements that PipeWire brings for such use cases.

There is some more heavy lifting needed before our next-generation OS architecture, Silverblue, is ready to be our default offering, but it is improving by leaps and bounds each release and already has a loyal following. Personally, I am very excited that we are quickly moving closer to the point where we can make it our default and, through that, offer features like bulletproof OS updates, factory resets and solid version rollbacks.

On the Flatpak side, Owen Taylor and Alex Larsson are putting the final touches on our Red Hat infrastructure. For RHEL 8.2 we will finally be able to build Flatpaks in RHEL infrastructure and provide a runtime and SDK for our RHEL customers to use. But equally exciting is that we will be able to offer these to the community at large, meaning that we can offer a high-quality, long-term-support Flatpak runtime and SDK that ISVs can use to target RHEL users, but also Fedora and other Linux distributions, in a similar vein to how the Red Hat UBI works. We will also be looking at ways to make getting access to these on Fedora very simple for developers, so that developing against this runtime becomes quick and easy on your Fedora system. Alex and Owen are also working on an incremental-updates feature to be shared between Kubernetes containers and OCI Flatpaks, making both technologies better and updates a lot smaller.

We are also looking at a host of smaller improvements, many of them in collaboration with our friends at Lenovo, like lap detection (so you can be sure the laptop doesn't burn you), privacy features (like making it harder to read your screen from an angle) and far-field microphones. There are also things like Lennart's HomeD idea, which we will be looking at as a way to improve the end-user experience.

So the future is looking bright, and I hope to see many new faces in the Fedora community going forward, whether you download Fedora Workstation 32 to install on your own system or join us by buying a Fedora laptop from Lenovo this summer.

Jean-François Fortin Tam: Overhauling your Open Source project’s “Developer Experience” and redefining the workflow


This started out as a simple status report following my first report on the revival of the Getting Things GNOME project, but it turned into a full-fledged article that, I believe, would be relevant to many community managers and FLOSS project maintainers out there. Particularly if you have an established open-source project looking for sustainable development but don't have the luxury of paid developers, it should be worth investing the 7-9 minutes it takes to read this.

As the world came to a standstill and as I finished my tax season accounting (two unrelated things, really), this month I have completed a major overhaul of the “developer experience” for GTG. The objective is to make it easier and more exciting for people to contribute to the project, by having:

  • a very clear workflow, objectives, and set of rules;
  • helpful & up-to-date reference documentation (particularly when it comes to building, testing and developing the core application).
This arguably depicts my efforts to clean up the cruft and pick up the missing pieces.

Indeed, from a community management standpoint, the project was suffering from two fundamental problems:

  1. It was completely unclear what was critical or not, and therefore what actually needed to be done to make a release. This would lead any potential contributor to feel overwhelmed and discouraged from working on the project. It is impossible to take action if you don’t know where you stand, don’t know how far you need to go, and everything is vying for your attention.
  2. The documentation for contributors was mixed up with user documentation, and both were outdated and spread across four—or even five—websites. There were a gazillion things on LaunchPad, GitHub, ReadTheDocs, a defunct website/blog, and on the GTG wiki—which had at least 55 documentation pages, plus 50 pages of past Google Summer of Code projects, totalling somewhere over 105 pages, two thirds of which had broken links. When information was not “just” scattered, it was also often duplicated, conflicting, or so outdated that it was downright misleading. So, yeah.

Not everything is black and white in this world, but when you combine these two problems, these two polarities, together, you end up with a “mottled dove”: the Ikaruga.

Why yes, I am totally using a bipolar shoot-em-up bullet hell as the analogy for what the potential contributor’s developer experience must have felt like.

What the project probably looked like from an outsider’s perspective. Actually applies to many open-source projects out there.

Part 1: Fixing the workflow, redefining the objectives and policies

I am addressing the 1st problem mentioned above with a clearer set of issue labels, milestone rules and contribution policies, described below.

Let’s take a minute to explain my philosophy.

To have a clear sense of direction, as a maintainer or core developer, you need to be able to know what is “critical” and what is better left for new contributors to tackle. This is why I created a dynamic list of issue labels and their descriptions, two of which are extremely important: “low-hanging-fruit” and “patch-or-wont-happen”. See CONTRIBUTING.md and the bug reporting & triage guide for further explanations.

Then, just as you must only assign “critical” (or “necessary”) issues to yourself, you must also be ruthless about the “minimum viable product”. If the release can be functional without a particular issue being solved, then that issue is not to be targeted at the milestone, unless a fix/patch is already being proposed or worked on somehow. That way your developers and maintainers can look at the milestone alone as their guiding star and have a very clear sense of progression and of “when” it is done:

Milestones progression
“That sounds like a reference to the Rebuild of Evangelion”, you say?
Well of course we’re Eva nerds, what did you expect?

Obviously this is meant for an atomic “release early and often” development model, not the “time-based releases” model (which I don’t think makes much sense for independent projects).

This is what setting expectations is all about. By clearly documenting the above, I am essentially establishing a “social contract” between users and contributors. This is not about being lazy; it’s about being brutally honest about the resources you have to contend with.

Part 2: Separating the documentation for contributors

To solve the 2nd fundamental problem, I spent some time analyzing the existing pages and documentation.

I decided that the wiki would now serve only for “Introducing/marketing the project” to users, acting as a website/landing page for the project. Other than historical documents, anything “documentation” would be relegated either to the official user manual, or to files in the development forge (both GitHub and GitLab automatically render Markdown files as nice HTML, so there is no need to use a wiki for that nowadays). This avoids everything becoming a giant kitchen sink mess, and makes it pleasant to read again.

To make that happen, this is what I’ve done in the past two weeks:

“Burning the Brushwood” (1893), by Eero Järnefelt
  1. Fixed all the broken links;
  2. Migrated any relevant contents to nicely rendered Markdown files into a central place (the main GTG Git repository on GitHub), then split, merged or rewrote a ton of “cornerstone” documentation including the new README, the new CONTRIBUTING file, and most of the stuff you see in docs/contributors/;
  3. Deleted the migrated wiki pages and associated links, archived the rest that remains there for “hysterical raisins“, by moving it to the bottom of the page;
  4. Wrote a new introduction and list of features & benefits for users, at the top of the wiki homepage;
  5. Rewrote remaining “cornerstone” wiki pages (new download/install page, new roadmap page, etc.);
  6. Archeologically recovered the epic lost manifesto page;
  7. Used some more archeology to create the press coverage page;
  8. Ordered the GTG.ReadTheDocs.io website to be destroyed and its remains cremated with the brushwood.

Behold: 37 wiki front page revisions later, the front page now does a decent job at answering the #1 question for people hearing about GTG for the first time: “Why would I use GTG? Why is it magical?” The wiki’s front page used to look like this; it now looks like this. Some might say it is now a very nice shrubbery.

On the other side, in the Git repository, my 23 commits involved 155 files, with 1399 line insertions and 888 line deletions.

The git commit timestamps don’t reflect the spread-out, multi-week nature of this work. Good thing I’m not invoicing GTG for that work, because it would cost more than a Nissan Micra.

Remaining GTG dev docs you can help with

Some documents in the new contributors docs folder are things that I have migrated but not actually reviewed for up-to-dateness or accuracy, such as the DBus API documentation or the plugins documentation. If there are outdated parts, I welcome you to contribute suggestions and ideally patches to address any remaining issues, as those areas are a bit out of my area of focus and expertise (I am many things, but I am not an API architect nor data structures specialist).


Bastien Nocera: Dual-GPU support: Launch on the discrete GPU automatically

*reality TV show deep voice guy*

In 2016, we added a way to launch apps on the discrete GPU.

*swoosh effects*

In 2019, we added a way for that to work with the NVidia drivers.

*explosions*

In 2020, we're adding a way for applications to launch automatically on the discrete GPU.

*fast cuts of loads of applications being launched and quiet*




Introducing the (badly-named-but-if-you-can-come-up-with-a-better-name-you're-ready-for-computers) “PrefersNonDefaultGPU” desktop entry key.

From the specifications website:
If true, the application prefers to be run on a more powerful discrete GPU if available, which we describe as “a GPU other than the default one” in this spec to avoid the need to define what a discrete GPU is and in which cases it might be considered more powerful than the default GPU. This key is only a hint and support might not be present depending on the implementation. 
And support for that key is coming to GNOME Shell soon.

TL;DR

Add “PrefersNonDefaultGPU=true” to your application's .desktop file if it can benefit from being run on a more powerful GPU.
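As a minimal sketch of what that looks like in context (the Name and Exec values below are made-up placeholders; only the PrefersNonDefaultGPU key itself comes from the spec):

  [Desktop Entry]
  Type=Application
  # placeholder name and command, for illustration only
  Name=My Game
  Exec=my-game
  # hint that this application prefers the more powerful, non-default GPU
  PrefersNonDefaultGPU=true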

We've also added a switcherooctl command to recent versions of switcheroo-control so you can launch your apps on the right GPU from your scripts and tweaks.
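As a quick sketch of how that can look in a script, assuming the switcherooctl shipped with recent switcheroo-control releases (the launched application is just a placeholder):

  # show the GPUs known to switcheroo-control
  switcherooctl list
  # launch an application on the discrete (non-default) GPU
  switcherooctl launch glxgears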