Hi, research computing team managers and leaders:
In our team there’s been a lot of passages lately - a paper for our original work is (finally!) coming out, as our new version is (finally!) coming together; we’re gearing up for a new batch of co-ops as our current co-ops are starting to document and getting ready to present their work; a project manager is joining the team for the first time now that the effort has reached a size and scope that it needs one (well, it needed one a year ago, but here we are).
These passages - and especially the influx of new people, new tasks, new scope - are really important for a team’s well being. Stasis isn’t stable; systems, including systems of people, are either growing or stagnating.
In academia sometimes it’s far too easy for groups to become very comfortable with “the way we do things”, and set in their ways. As Boulanger points out in the first article in the roundup, that can quickly lead to problems not being addressed - or even really noticed any more - and eventually people both within the team and “clients” of the team starting to drift away. In fact, I was talking to a colleague this week about one group’s services becoming ossified to the point where consumers of those services started moving to those of a different and newer group - the first group didn’t take feedback or feature requests seriously, and now there’s a real chance it will simply be disbanded (or, maybe worse, left to go on indefinitely with less and less actual purpose).
Stagnation isn’t inevitable, but it takes active and continual effort to avoid. In research, and technology, and certainly at the intersection of research and technology, it should be easy - there’s a constant influx of new ideas available. But ideas don’t adopt themselves, and practices don’t adapt themselves. They’re adopted and adapted in units of teams, and one of our important jobs as managers and leads is balancing real needs for degrees of stability and certainty with equally real needs for change and growth. Balancing openness with continuity is hard, but vital.
And now, the roundup:
Why The Status Quo Is So Hard To Change In Engineering Teams - Antoine Boulanger
Boulanger here points out a situation that is especially common in academia, with slow-growing teams where individual team members have long tenure. The issue is that a team gets so used to the way things are they don’t even see it any more, and forget that things don’t have to be this way. There can be a sort of learned helplessness to the procedural, technical, and complexity problems within an organization.
Having new people come in regularly - even short term team members like interns - can be very helpful for this, as long as they are comfortable making comments like “why is X? Isn’t that bad?” and the team takes the points they raise seriously.
Boulanger has recommendations for us managers or leads:
There are a lot of strengths that can come from a long-lived stable team, if you’re careful, but the default outcome is stagnation. The manager and the team have to be constantly and actively looking for things to improve and areas in which to grow to prevent the default.
You can be directive without being a jerk - Lara Hogan
Being Nice and Effective - Subbu Allamaraju
I think one of the hardest things for new managers - especially those coming from the very hands-off collegial culture of research - is determining the right amount of directiveness appropriate for a given situation. The usual failure modes, in order of the frequency with which I see them, are the very common laissez-faire absence of direction and the less common tech-lead-becomes-manager “do this, this, then that, and my way of doing it is exactly like this. In fact, why don’t I just…”
Hogan’s article is a followup to an earlier one on fixing a team going in circles. So here it’s a big topic of setting some direction for an entire team at once. But the approach works for a particular team member, too - being specific about whose job is what, focussing on the important thing (the team’s work) and not on individuals, and firmly but kindly applying direction at the level needed - whether that’s on tasks or goals or somewhere in between.
Allamaraju points out that being effective doesn’t necessarily mean not being nice, and being “nice” isn’t necessarily an end in itself anyway; we want to be kind, and sometimes “nice” gets used as meaning sort of inoffensive. Letting someone trudge aimlessly in circles while just smiling and not saying anything directive may seem from a distance like “nice” but it’s certainly not kind.
Stop Looking For Mentors - Stay SaaSy
We could all use a bit more mentorship, but searching for A Mentor may make it harder to get the input we need. This article suggests making it easier on yourself:
Instead of looking for a mentor, just find somebody who can answer some questions you have. Then, if you think they can answer some more, ask them again. In reality, a mentor is mostly just somebody that answers questions more than once. That’s it. It’s not cinematic.
Using Amazon Service Workbench for Remote Training - Ann Gledson, Danielle Owen, Anthony Evans, and Peter Crowther, Manchester Research IT blog
So AWS Service Workbench is a free offering I hadn’t heard of before that lets you do what you might have done previously with CloudFormation or a bunch of home-grown scripts - spool up individual environments for researchers or (in this case) for a training course - but with a nice self-service UI that lets the IT staff approve requests. (“Free” in the sense of no extra cost - don’t worry, they still charge you for the resources that are being used!)
We’ve all tried having students use their own laptops and requiring them to pre-install packages, and know how challenging that is. The Manchester RIT team used Service Workbench to support an all-virtual Python course that would normally be done in a computer lab they control. Interestingly, the blog reports the feedback of the course instructor, an RIT team developing a service to handle restricted data, a TA for the course, and some feedback from participants.
In this case, the RIT staff liked the control, the instructor and TA liked how they could get started teaching the material right away, and the participants seemed happy.
The downside of this approach of course is that if the students are to continue on using the material on their own, they now still have to go through the install process - but certainly the teaching is easier.
And of course this sort of tooling could be made available for on-prem systems, but in practice it never is; cloud providers have an incentive to make their systems as easy to use for these kinds of use cases as possible, because it means more revenue, while typically fixed on-prem systems generally have different (frankly, the opposite) incentives.
Bring Legacy Code under tests by handling global variables - Nicolas Carlo
When trying to implement component testing of legacy code with global variables, Carlo has a simple suggestion - don’t overthink it, just pass the global variables in as parameters. It may look ugly, but it’s not new ugliness, it’s just revealing existing ugliness; and that’s the first, necessary, step in defining refactoring plans.
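As a minimal sketch of that idea (in Python rather than the language of Carlo's article, and with a hypothetical `CONFIG` global and `total_price` function invented purely for illustration), the refactor might look something like this:

```python
# Hypothetical legacy module state; every name here is illustrative only.
CONFIG = {"tax_rate": 0.2}


def total_price_legacy(amount):
    # Original style: silently reads the global, so a test has to
    # mutate CONFIG (and remember to restore it) to exercise this code.
    return amount * (1 + CONFIG["tax_rate"])


def total_price(amount, config):
    # Same logic, but the dependency is now visible in the signature,
    # so a test can pass in whatever configuration it needs.
    return amount * (1 + config["tax_rate"])


def total_price_compat(amount):
    # Thin wrapper so existing callers keep working during the refactor.
    return total_price(amount, CONFIG)
```

A test can now call `total_price(100, {"tax_rate": 0.5})` without touching any shared state, while the wrapper keeps the old call sites untouched until they can be migrated.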
Well-researched advice on software team productivity - Ari-Pekka Koponen, Swarmia
Management is hard, management of something as complex and ambiguous as software development is especially hard, but that doesn’t mean we don’t know anything. There has been a lot of research on what works for making teams work well, and recently particularly in the area of software development. It doesn’t mean there are cookie-cutter solutions for anything, but we do have good guidelines. Koponen walks us through several well-supported (and in some cases ongoing) reports, many of which RCT readers will have already known about.
And most importantly, retrospectives - learning and adapting practices based on what is actually happening on your team - allow you to tune.
Embedded malware in NPM package coa - GitHub advisory, RW Overdijk
Another reminder of how vulnerable software supply chains are - coa (command-option-argument, a command line argument parser), used in 200 other packages and a gillion repositories, had malicious releases with malware inserted:
The npm package coa had versions published with malicious code. Users of affected versions (2.0.3 and above) should downgrade to 2.0.2 as soon as possible and check their systems for suspicious activity. See this issue for details as they unfold. Any computer that has this package installed or running should be considered fully compromised. All secrets and keys stored on that computer should be rotated immediately from a different computer. The package should be removed, but as full control of the computer may have been given to an outside entity, there is no guarantee that removing the package will remove all malicious software resulting from installing it.
The good news is that it seems to have been spotted quickly if I’m understanding what’s been happening.
Choosing good chunk sizes in Dask - Genevieve Buckley
As with any kind of parallel or distributed computing, choosing granularities over which to calculate is complicated. Too small, and you end up spending too much time on coordination/communication and too little time on computation; too large, and you have too little flexibility in scheduling or can even run out of memory. In simulation, it’s usually pretty clear what size over which to run; for data analysis, which is normally a lot less computationally intensive, it’s often less so.
In this article Buckley gives some rough rules of thumb:
and shows how the Dask dashboard can help provide some guidance.
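The trade-off described above is mostly arithmetic, and a rough sketch of it needs nothing beyond the standard library. The array shape, the candidate chunk shapes, and the helper names below are assumptions chosen for illustration; they are not Buckley's numbers and this is not Dask's API:

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    # Memory taken by one chunk: number of elements times bytes per element.
    return math.prod(chunk_shape) * itemsize


def n_chunks(array_shape, chunk_shape):
    # Number of chunks (and so, roughly, tasks per operation) needed to
    # tile the whole array, rounding up along each axis.
    return math.prod(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))


array_shape = (20_000, 20_000)  # hypothetical 2-D array
itemsize = 8                    # bytes per element, assuming float64

for chunk_shape in [(100, 100), (4_000, 4_000), (20_000, 20_000)]:
    size_mb = chunk_nbytes(chunk_shape, itemsize) / 1e6
    count = n_chunks(array_shape, chunk_shape)
    print(f"chunks of {chunk_shape}: {size_mb:,.2f} MB each, {count:,} chunks")
```

The smallest choice gives tens of thousands of tiny chunks for the scheduler to coordinate, and the largest gives a single multi-gigabyte chunk with no parallelism at all; the useful range lies somewhere in between and depends on the memory available per worker.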
Scaling a read-intensive, low-latency file system to 10M+ IOPs - Randy Seamans, AWS HPC Blog
This is an AWS blog post, but it’s relevant more broadly - it’s a pretty direct use of NVMe-oF, NVMe over Fabrics.
Here Seamans describes a very high-speed, read-nearly-only filesystem: a Gluster file system is replicated onto multiple instances with very fast NVMe drives, and those drives are then exposed read-only over NVMe-oF to provide extremely fast read access to files. It suits use cases where a large number of nodes are doing a read-intensive analysis of a directory full of data.
The yearly backup restore test - Remy van Elst
Backups are useless, restores are invaluable. van Elst walks us through his personal annual backup restore test, marked on his calendar, including file integrity checks:
Have you done your backup restore test recently? An untested / unverified backup is the same as no backup, so doing a restore test is a major part in your backup scheme.
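The file integrity part of a restore test can be sketched in a few lines. This is not van Elst's procedure, just one plausible way to do it: it assumes the originals are still available to compare against, and the function names `sha256_of` and `verify_restore` are made up for this example:

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    # Hash the file in blocks so large files never have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing from, or differ in, the restore."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for original in sorted(source_dir.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(original) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list from `verify_restore("/data", "/tmp/restore-test")` (both paths hypothetical) means every source file came back byte-for-byte identical; it says nothing about files that changed after the backup was taken, so it works best against a snapshot or a quiet directory.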
Five-P factors for root cause analysis - Lydia Leong
Rather than “root cause analysis” or “five whys”, both of which have long since fallen out of favour in areas that take incident analysis seriously like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine:
Running 20k simulations in 3 days to accelerate early stage drug discovery with AWS Batch - Christian Kniep
Following up on earlier Gromacs benchmarking posts, in this post Kniep describes their final use case - running a large suite of simulations for early stage drug discovery. By choosing their instance types based on the previous work, they could tune turnaround time and cost, and by using spot instances and Batch they could fan out 20k simulations over multiple regions relatively straightforwardly:
For our binding affinity study, we completed 20,000 jobs over the course of three days. By using benchmarks and choosing optimal Spot Instances, we were able to achieve a cost as low as $16 per free energy difference (∆∆G value). As we chose to broaden the set of instances for a shorter time-to-solution, we achieved an average of $40/∆∆G value. With AWS Batch, we were able to create pools of resources in different AWS Regions around the globe and handle orchestration within the region. By the end of this, it was clear that we could achieve both a really fast wall-clock time (and hence time-to-result) as well as a low overall cost.
International Super Computing (ISC22) - 29 May - 2 June, Hamburg, Papers due 29 Nov
SC isn’t even here yet and papers for ISC are coming due. ISC of course covers almost everything in HPC:
EUROSIS Industrial Simulation Conference 2022 (ISC 22) - 1-3 June Dublin, Papers due 21 February
The aim of the conference “is to give a complete overview of this year’s industrial simulation related research and to provide an annual status report on present day industrial simulation research within the European Community and the rest of the world in line with European industrial research projects.” Tracks include:
51st International Conference on Parallel Processing - 29 Aug-1 Sept, Bordeaux, Workshop proposals due 28 Nov, Papers due 14 Apr.
ICPP2022 is interested in “the latest research on all aspects of parallel processing”. Topics of interest include algorithms, applications, architecture, performance, software, and multidisciplinary work.
Oak Ridge National Centre for Computational Sciences Virtual Career Fair - 11 Nov
Four hours of talks, tours, and career tables staffed by 11 teams that are hiring.
If you’ve wanted to start messing around with functional languages, OCaml is a reasonable pragmatic choice that does get used in the wild. Here’s a getting-started-with-the-tooling guide to OCaml.
Or you could use a lisp that fits entirely into 512 bytes.
Use bash functions! They make bash scripts less crummy!
The nice thing about the internet is you can find nice resources about incredibly obscure things. Want a good annotated bibliography and sample code for tree edit distances, the minimal number of edits you can make to transform one tree to another? Great news!
A complete embedded USB stack in Ada.
Learn how X window managers work by writing one.
The anatomy of a terminal emulator.
Lovely explanation of Bezier curves, splines, and smooth surfaces.
Fascinating look at the data infrastructure around Python environments that has grown over time in some banks: Bank Python.
Unicode attacks on source code.
Debugging stories are always good! Here’s one deep within the bowels of the Linux TCP stack.
Causing data leaks via maliciously-crafted log messages.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.