#26 - Link Roundup, 29 May 2020

Hi, all:

Some jurisdictions are starting to make plans for people returning to offices. For those of us who can readily work from home, that return may not be for a while; but planning is still worthwhile.

Return will likely look like “split offices” for a while - a few people going in in “shifts” while others work from home. The challenge then is to get the best of both worlds, convenience of WFH and serendipitous interactions, rather than the worst of isolation and cliquing up.

Have you started thinking about how this would work for your team? I’d like to have our on-site team members be on staggered shifts so everyone gets a chance to be in-person with everyone else over time, but I don’t know a good way to do that while still giving people some predictability. I think it’ll also be important for everyone to call into meetings independently from their own computer — physical distancing will probably demand that, but I don’t want there to be an “on-site” experience radically different from those still distributed.

On a sadder note, this week, research computing lost an enthusiastic reporter on HPC topics and a pioneer of some of the earliest research-computing publications: Rich Brueckner. Research computing discussions online would be much poorer today if he hadn’t been such a driving force.

This week’s link roundup follows:

Managing Teams

The New Science of Building Great Teams - Alex “Sandy” Pentland, HBR
“Bursty” Communication Can Help Remote Teams Thrive

Christoph Riedl and Anita Williams Woolley, Behavioural Scientist

These two articles circulated independently this week, expressing related ideas about how communication works in high performing teams.

The first emphasizes how communication works between team members. It’s worth reading, but two key points:

Everyone on the team talks and listens in roughly equal measure, keeping contributions short and sweet.
Members connect directly with one another—not just with the team leader.

The first is also mentioned by Google among many others. Preconditions for that rough equality of contribution are that it’s safe to speak up without having someone jump down your throat and that no one person dominates conversation.

The second is just as important. We don’t want to be bottlenecks for our team! One-on-ones are great ways to spur teammate-to-manager conversation. Stand-ups and staff meetings are great forums for encouraging team members to talk and work together without our intervention.

The second emphasizes another point:

Peer communication is bursty, rather than constant

Bursty is a feature, not a bug, here. Alternating flurries of peer discussion and relative quiet as each works diligently away in silence is a good and healthy sign. It’s a signal that communication is achieving a vital goal - helping the peers get meaningful work done.

Remote brainstorming for regular humans - Bartek Ciszkowski

Whiteboarding and brainstorming are harder to do when the team is distributed. Here are some suggestions for Ciszkowski on how to do distributed brainstorming:

Do it in ~20 minute chunks with 5 minute breaks
Use a simple white boarding tool (Ciszkowski suggests excalidraw which I hadn’t seen before) or even just a screenshared google doc to record responses. That way people can visualize connections between ideas to trigger new ideas.
Periodically restate to your objectives to keep the brainstorming on track.
Make sure everyone is chiming in.

I’ll just add something that’s not unique to distributed brainstorming: ban commenting on the ideas raised (especially critiques!) You are aiming for lots of ideas first. Only after idea-gathering is complete is it time for distillation and evaluation.

How to Search and Find Layoff Lists Online - Jonathan Kidder, Wizard Sourcer

For those with open positions — with layoffs everywhere, are you considering taking a more active approach to recruitment this time ‘round? Many opt-in lists of laid-off workers, especially in tech, are out there. This post pointed me to a list for Canada, and I’m weighing the possibility of contacting promising candidates to let them know about current and upcoming positions.

Product Management and Working with Research Communities

Product for Internal Platforms - Camille Fournier

This is an article written for tech companies about how easy it is to go off the rails developing the enterally-used tech platform for developers. It holds a lot of lessons for research computing (software, systems, or data) though. The traps you can fall into are the same, because you are developing tools for a small, captive audience. It’s too easy to lose track of what a broad range of “customers” need to succeed:

When platform teams build to be building, especially when they have grand visions of complex end goals with few intermediary states, you end up with products that are confusing, overengineered, and far from beloved.

Recommendations:

You still need to be customer focussed
Partner with diverse users as often as possible
When making changes, plan user migration strategies early
Don’t build when you don’t have to

A month long conference is a neat concept - Matt Webb

We’re all learning quickly what does and doesn’t work for online conferences and webinars. Here Webb mentions three things that seems to work well; the first I’ve seen a lot of, but the last two are interesting:

Long conferences (days! weeks! months!) with many short sessions
Feedback from audience - for instance wave quickly for yes, slowly for no
Prerecorded talks, with the speaker then participating with the audience to answer questions in chat

Technical discussions are hard; a few tips - Gaël Varoquaux

The challenges of maintaining community software as seen by a well known neuroscience and machine learning software developer and manager at INRIA. Varoquaux discusses maintainer’s anxiety, contributor’s fatigue, the difficulty of communication. Varoquaux also describes things he’s found that helped:

Hear the other: exchange
Foster multiway discussions
Don’t seek victory
Convey ideas well: pedagogy
Cater for emotions: tone
Give your understanding

Research Software Development

Critique software, but understand the constraints it’s written under - Neil Chue Hong and Simon Hettrick, Research Professional News
Why you can ignore reviews of scientific code by commercial software developers - Phill Bull
An open letter to software engineers criticizing Neil Ferguson’s epidemics simulation code - Konrad Hinsen
BCS says software for scientific modelling needs standards - Mark Say, UK Authority

If you follow research software discussions at all you’re aware of the mess around Neil Ferguson’s epidemic simulation software. The code was a mess and had bugs but was mostly fine. The results were highly politicized. Soon enough people with no experience in research software or epidemiology but strong opinions about what the modelling software should be showing wrote scathing and unfair (and irrelevant) critiques. Some of the posts above are backlash to the backlash. It’s all just a mess.

Rather than rehash the debate, let’s highlight some points that can get drowned out in the loud back-and-forth:

There’s not a lot of evidence that research software is worse from a correctness point of view than non-research software — and correctness bugs are the only ones that really matter. Even when there’s no (e.g.) CI unit testing, there is a very strict kind of testing going on - the answers have to make sense to those in a field. That means well-known results have to be replicated, answers have to line up with intuition and those from other tools. When you’re dealing with the sort of complex systems that require modelling as vs closed form solution, that is a very high bar to clear; accidentally getting the right answer in a number of situations doesn’t happen very often. This is the difference between verification/validation and testing.
Clearly wrong results are generally not worrisome in research software. Subtly wrong results can be catastrophic.
There’s very few values of X for which the blanket statement “Research Software Should be [X]” makes any sense. “Research Software” isn’t a single thing. There is a difference between what is desirable for software used in research depending on its technology readiness level:
- The code is a research output - e.g., methods development, etc. This code honestly shouldn’t be very “software engineered”; a method is being developed.
- The code is a research input - Maybe it started as a research output itself, but now it’s being used in production to produce scientific results by researchers: the original researcher and/or others. Now it has to be more robust (easy to configure, doesn’t crash left and right, more testing) - but the constraints on maintainability, etc. can and should still be quite different from commercial software.
- The code outputs are policy inputs - The code is past the point where it’s informing papers, it’s now a “what if” tool for decision makers. Now the code needs to be not only correct but seen to be correct, because the results will 100% be politicized. The climate science community has much to teach the rest of us about this.
It is perfectly reasonable to have discussions about regulations and standards for code producing policy-shaping results. Consider all the discussion about responsible use of “AI” and machine learning models in the public and even private sphere.

I don’t know what if anything this situation teaches us, honestly. The government should have previously had some kind of trusted modelling ready as part of pandemic preparedness. Ferguson’s code, which his group was using for papers, shouldn’t have been such a mess. Conscripting code only intended for research into the policy-making loop in a time of crisis put everyone in an untenable position.

New users generate more exceptions than existing users (in one dataset - Derek Jones, The Shape Of Code

Not surprising for us in research computing but nice to have it validated with data: new users of software find new ways to trigger software faults. This is one of the reasons why the transitions that research software goes through — from being used by the creator to being used by friendly users, and then again to being used by a wider community — is so challenging and requires so much retooling.

Hypermodern Python - Claudio Jolowicz

As long as you’re porting that Python2 code to Python3, maybe it’s time to revisit package management, testing, CI/CD, and documentation with current python ecosystem approaches.

Research Computing Systems

The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems - Kumar et al.
Research Computing Team Studies Supercomputer Reliability - HPCWire

A really interesting paper from job history of two HPC clusters - and the authors made the data set they used publicly available! Some findings:

Node-sharing doesn’t translate to a higher rate of job failure.
Memory-intensive applications can fail even before the rated memory of the node is reached, which suggests that close monitoring of the memory usage of applications may be necessary.
Careful allocation and scaling up of “remote” resources (such as parallel file systems and network connections to storage systems) is important as a cluster grows in size.
User-history based models can predict likelihood of jobs failing.

Very interesting stuff. Results will be presented at the IEEE Conference on Dependable Systems and Networks.

Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers - David Brayford and Sofia Vallercorsa

A paper describing the work done at Leibniz Supercomputing Center using containers (LANL’s CharlieCloud for rootless containers) to run a large tensorflow model training within Slurm on their system.

Choosing 2FA authenticator apps can be hard. Ars did it so you don’t have to - Dan Goodin & Mark Gamache, Ars Technica

If you’re considering rolling out 2FA for your systems, this review at Ars Technica may be of interest - they liked Duo, Auth, or LastPass. They really didn’t like Google or Microsoft’s offerings as being too hard or too easy to deal with in the case of a lost device.

Emerging Data & Infrastructure Tools

A war story about COVID, cloud, and cost - Forrest Brazeal

A reminder (again) that provisioning things in the cloud as if you’re building a static cluster is not a guarantee you’re doing things the cheapest or most performant way.

Events: Conferences, Training

ISC 2020 Digital - 22-25 June, Free Registration required

ISC 2020’s virtual agenda is out: it’ll be a mix of prerecorded and live-streamed talks. Videos will be available solely to registered “attendees” for two weeks after the event and then publicly available. The workshops on Thursday June 25th — HPC containers, I/O, LLVM, monitoring — look particularly interesting.

IEEE/IFIP Intl. Conference on Dependable Systems and Networks - June 29-Jul 2

The conference Kumar et al.’s work above will be presented at. Work on dependable systems, software, networks, security & privacy, and fault injection tools. Looks really interesting. Registration not yet open.

RustConf 2020 - 20 Aug, $105

If you or one of your team is interested in Rust, this is an inexpensive one day live-streamed conference which includes at least two pretty research-computing relevant talks: Rust for Computational Biology, and Rust for telescope control.

Random

Wow, the new $75USD Raspberry Pi 4 has 8GB of RAM?

I still keep getting surprised a ~decade into the data science phenomenon how data-intensive research computing is everywhere now. A big US trucking company, DAT, now has a Vice-President responsible for data science (amongst other things).

You can now make your web page look like a default LaTeX style with CSS. (Controversial take: I like what LaTeX can do but I really dislike the default styles and fonts).

A good reminder about why saving numerical data in binary formats is the right way to go: even integer parsing, never mind floating point, is surprisingly hard to do efficiently.

Interesting take on GUIs vs CLIs - the argument is that CLIs are what happen when you prioritize and make explicit (reify) the interactions between the user and the system, as opposed to trying to make those interactions easy/“frictionless”.

A very deep dive into the simplest possible C++ program.

That’s it…

And that’s it for another week.

To those who knew Rich Brueckner, and to everyone who’s lost someone, take care of yourself this weekend. And my best of luck in the coming week with your research computing team and with everything else going on,

Jonathan