Hi!
If you follow HPC twitter at all, you will have seen a heartfelt thread by a well-known research software developer - a key contributor to the Singularity project, among others - lamenting the frankly appalling state of developer productivity in HPC, both in what tools exist and in the support for them (and for other developer tools) at academic centres. A lot of people chimed in on the discussion, including one of the lead developers of the PETSc project, embedded software developers, and some key people at big computing centres, all agreeing that there was a problem, but typically zooming in on one particular technical or procedural issue or another, and not coming to any conclusion.
I think the issue is a lot bigger than HPC software development - it comes up in too many contexts to be about specific technical issues of running CI/CD pipelines on fixed infrastructure. The only people to identify the correct underlying issue, in my opinion, were people from the private sector, such as Brendan Bouffler from AWS:
Too much reliance on ‘free’ labour - postgrads and postdocs, who, invariably, decide that burning their time being mechanical turks for their ‘superiors’ just sucks, so they come and work for us. And since we pay $, we’re not gonna waste them on things software can do.
The same argument got made by R&D research staff in the private sector. Their time actually has value; as a result, it gets valued.
In academic research computing, partly because of low salaries - especially for the endless stream of trainees - but also because we typically provide research computing systems for free, we tend to put roughly zero value on people’s time. If researchers have to jump through absurd hoops to get or renew their accounts, or have to distort their workflows to fit one-size-fits-all clusters and queueing systems, or postdocs have to spend hours of work by hand every month because tools to automate that work would cost $500, well, what do they expect, right?
It’s not that this is an indefensible position to take, but one can’t take this position and then act surprised when researchers who can afford to are seriously investigating taking their projects to the commercial cloud, even though it costs 2x as much. It turns out that people’s time is worth a lot to them, and is certainly worth some money. If we were to let researchers spend their research computing and data money wherever they pleased, I think we’d find that significantly less than 100% of researchers would use “lowest price possible” as their sole criterion for choosing providers. Certainly core facilities like animal facilities, sequencing centres, and microscopy centres compete on dimensions other than being the cheapest option available.
To be sure, there are process issues in academia which exacerbate the tendency to see people’s time as valueless - rules about capital vs operating costs, for instance - but those rules aren’t a law of nature. If we were paying people in academia what they pay in tech, we’d suddenly discover that finance departments were willing to cut some slack on small capital expenses if it meant we could be a bit more parsimonious with people’s time.
Until then, one also can’t be too surprised when the most talented and ambitious staff or researchers get poached routinely by vendors and the private sector.
Delegation is a superpower - Caitlin Hudon, Lead Dev
It’s true! But superpowers sometimes take some practice to use effectively.
I personally am pretty good at the mechanics of delegating when I think to do it, but too often I find myself taking on a responsibility so as not to bother anyone else with it, or because only I can do it (well, yes - if no one else ever gets the chance to learn how, I suppose I will stay the only one who can do it).
The end of the year is a good time to reflect on the things you’ve spent time on, and to ask whether they could usefully be handed to someone else as a welcome growth opportunity. Hudon encourages us to see tasks as delegatable by default:
Invisible Output: Measuring the Behind-the-Scenes Work of a People Manager - Samantha Rae Ayoub, Fellow
One of the problems of being a manager is that the work we do is pretty invisible. Our teams’ accomplishments are pretty clear, but the work we do to support them can be hard to see. That makes it hard to demonstrate to our bosses - but even more importantly, to ourselves - the good work we’re doing, and where we need some support.
Ayoub tells us it doesn’t need to be that way; we can set clear goals, and even some quantifiable ones, for ourselves. We managers just need to:
On goal setting, Ayoub urges us to set goals for ourselves, not just for the output of our team, that focus on overall outcome or impact - whether that’s internal (like staff engagement numbers) or external (more collaborations or partnerships) - not just input activities. That will help focus our activities on what matters, not just activities for their own sake.
Ayoub also tells us to make sure we’re including team member metrics - things like retention and rate of promotion - but those are less relevant in our world, where job tenure is generally quite long.
Managing Up - Lessons From Scaling Teams at Credit Karma and Lyft - Matt Greenberg, Valerie Wagoner, Dor Levi, Anne Lewandowski
Managing upwards isn’t that different from managing our own team members; and it’s very similar to managing relationships with peers and external stakeholders like collaborators, situations where we also lack the ability to be directive.
Greenberg, Wagoner, Levi, and Lewandowski suggest focussing on three areas:
They also talk about three common failure modes that can cause problems when working with your boss, in both the short and the long term:
The missing millions: Democratizing computation and data to bridge digital divides and increase access to science for underrepresented communities - A. Blatecky et al
Research computing and data are increasingly important for STEM fields, so if we want STEM - and R&D careers - to be available to all, we need to make sure there are as few barriers as possible to being fluent with computing and digital research infrastructure[*], and to having it accessible.
More selfishly - we readers of this newsletter are all pretty familiar with how hard it is to hire in research computing and data. So we have a pretty strong and direct interest in making sure as many people as possible have access to and build skills and understanding in research computing and data technologies.
We know, though, that research computing and data as a community is less diverse than academia or tech as a whole - and those communities are not stellar examples of diversity, equity, and inclusion.
This is a long and frankly somewhat dispiriting read. It’s not great out there. But there are some pretty solid recommendations in this report - and it’s NSF-funded, so there’s real hope the recommendations will be taken up. Recommendations that those of us on the ground can actually engage in include:
By the way, one finding from this report stuck out, as it’s one of the tenets of this newsletter:
It is computing, and software, and data
I cannot agree with this enough. Focusing on research software development, or systems, or data management in isolation is a mistake. In 2021 those lines are so blurred as to be meaningless, and drawing them builds silos where there ought to be none.
[*] sorry, but I just hate the NSF term “cyberinfrastructure”.
An IC’s guide to roadmap planning - David Noël-Romas, Increment
A good introduction to product roadmapping, aimed at a software developer IC trying to figure out how to contribute to product planning. It applies just as well to research computing, whether systems, software, or data products:
42 things I learned from building a production database - Mahesh Balakrishnan
Relatedly - I really do think R&D at Big Tech, or the work of startups, is pretty close to research computing. It’s certainly a lot closer than University IT! They’re working on things that may or may not work, firming up their understanding of the problem at the same time that they’re developing solutions, and reading (and writing) papers as they go.
Here an academic, Balakrishnan, describes what he learned building a production database (Facebook’s equivalent of Chubby - a central state store which is a key piece of infrastructure that other foundational services depend on). They are classic product development lessons, broken down into categories of customers, project management, design, code review, strategy, observability, and research.
Go Does Not Need a Java Style GC - Erik Engheim
Engheim gives a good, short overview of why garbage collection is typically much less of an issue in languages like Go or Julia than in Java or even C#, even though all four absolutely do rely on garbage collection. This matters a lot, even in web services: with modern systems typically composed of a large number of small processes, GC pauses can ripple through the system, causing large amounts of jitter. In desktop technical computing codes, GC pauses can be very bad for performance.
This is a pretty opinionated article, but Java GC is genuinely a harder problem than GC in other languages, and Engheim outlines why. There’s more here than I can comfortably summarize - but in short, Java bet heavily on GC in the early 1990s, allocating essentially everything on the heap and avoiding value types, meaning that even trivial arrays of simple structs turn into very large numbers of small heap objects:
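To make the layout difference concrete, here’s a minimal sketch - my own illustration, in C++ rather than the article’s Go and Java, but the contrast is the same. Value types live inline in one contiguous allocation; boxed objects are scattered across the heap, each one a separate item for a collector to trace:

```cpp
#include <memory>
#include <vector>

struct Point { double x, y; };  // a trivial struct of simple values

int main() {
    // Value semantics (what Go slices and C# value-type arrays give you):
    // one contiguous allocation holding a million Points.
    std::vector<Point> inline_points(1'000'000);

    // Reference semantics (how a Java Point[] is laid out): an array of
    // pointers to a million separately heap-allocated objects, each of
    // which a garbage collector has to track and trace individually.
    std::vector<std::unique_ptr<Point>> boxed_points;
    boxed_points.reserve(1'000'000);
    for (int i = 0; i < 1'000'000; ++i)
        boxed_points.push_back(std::make_unique<Point>());

    return 0;
}
```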
JVM developers have worked heroically, designing runtimes and garbage collectors to mitigate this issue, but it’s a fundamental one, not easily walked back (although Project Valhalla is trying to introduce the benefits of value types, which would bring things closer to C#). Engheim walks us through generational GCs, bump allocation, compacting GCs, and escape analysis, demonstrating the problems each is trying to solve and the tradeoffs made. Those tradeoffs are different than they were three decades ago - multicore is now the default, for example, so concurrent GC is very much desirable.
Sketching for set comparison in bioinformatics - Camille Marchet
In the last few years, bioinformatics has seen a lot of progress on sketching methods for calculating the similarity of large sets of data. In the bioinformatics case the large sets are sets of strings, and applications include “alignment-free” mapping of reads to references, metagenomic classification, and more; but the problem is quite general. If you’re curious about how this kind of work - MinHash, etc. - relates to other sketching methods you may have heard of, such as HyperLogLog, Marchet has a nice, short, well-referenced summary.
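For the flavour of the simplest of these methods, here’s a toy MinHash example - my own minimal illustration, not code from Marchet’s post, and it salts one hash function k ways rather than using a proper hash family. Keep each set’s minimum hash value under each of k hashes; the fraction of minima two sets share estimates their Jaccard similarity:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <limits>
#include <set>
#include <string>
#include <vector>

// One signature entry per simulated hash function: the minimum hash
// value seen over the whole set.
std::vector<std::size_t> minhash(const std::set<std::string>& s, int k) {
    std::vector<std::size_t> sig(k, std::numeric_limits<std::size_t>::max());
    std::hash<std::string> h;
    for (const auto& item : s)
        for (int i = 0; i < k; ++i)
            sig[i] = std::min(sig[i], h(item + "#" + std::to_string(i)));
    return sig;
}

int main() {
    // Two small sets of k-mers (strings, as in the bioinformatics case).
    std::set<std::string> a = {"ACGT", "CGTA", "GTAC", "TACG"};
    std::set<std::string> b = {"ACGT", "CGTA", "GTAA", "TAAC"};

    const int k = 512;
    auto sa = minhash(a, k), sb = minhash(b, k);

    int matches = 0;
    for (int i = 0; i < k; ++i) matches += (sa[i] == sb[i]);

    // The match fraction estimates |A intersect B| / |A union B|
    // (truly 2/6 here).
    std::cout << "estimated Jaccard: " << double(matches) / k << "\n";
}
```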
What is AMD ROCm? - woachk
In #100 we covered some pretty cool looking AMD GPUs for compute. So how to use these beasts? AMD’s revitalized software development ecosystem for their GPU systems is ROCm. Here we have another opinionated - but I think fair - article, discussing AMD ROCm, or in particular HIP.
(ROCm also includes OpenCL, which is a known enough quantity now that it doesn’t really need “What is OpenCL” articles. I’ll add that OpenCL is certainly being used productively by many teams, but the fact that AMD chose to add HIP to their supported frameworks suggests that OpenCL may have been something less than widely beloved.)
From the article:
HIP is a wholesale clone of the CUDA APIs, including the driver, runtime and libraries’ APIs. That’s not a bad thing, it acknowledges what the industry standard is, making portability easier.
There are differences - the biggest being that you more or less have to literally s/cuda/hip/ throughout your code; a tool called hipify helps with this. But the convergence is genuinely good, meaning that software developers really only have to learn one mental model for developing for either NVIDIA or AMD GPUs.
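To show how mechanical the rename is, here’s a minimal hypothetical snippet - my illustration, not code from the article, and it assumes a working ROCm install with hipcc. Each hip* call is the identically-named CUDA call with the prefix swapped, which is exactly the transformation hipify automates:

```cpp
#include <hip/hip_runtime.h>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> h_x(n, 1.0f);

    // Allocate device memory and copy the host buffer to the GPU.
    float *d_x = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_x), n * sizeof(float));  // was: cudaMalloc
    hipMemcpy(d_x, h_x.data(), n * sizeof(float),
              hipMemcpyHostToDevice);                              // was: cudaMemcpy
    hipFree(d_x);                                                  // was: cudaFree
    return 0;
}
```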
The article points to some real limitations of the ROCm/HIP ecosystem to suggest why it hasn’t taken off. There are a few; none of them are insurmountable, but they do add real friction:
In a world where people are doing their development on cloud instances or codespaces or the like, these may not matter. But an awful lot of research computing software development is still done on local laptops and workstations, where it takes rather more effort to get started with ROCm programming than with CUDA.
Learning Containers From The Bottom Up - Ivan Velichko
A well-sourced learning-about-containers landing page, with a lot of links to dig into things more deeply.
Minisymposia on Humanities and Social Sciences at the Platform for Advanced Scientific Computing conference (PASC22) - Basel, Switzerland, 27-29 Jun, EoIs due 15 Dec
If you’re interested in arranging minisymposia for social sciences and humanities topics in HPC/scientific computing, PASC 2022 has a special call out with expressions of interest due 15 Dec, with full proposals due 15 Jan. PASC is an ACM SIGHPC sponsored conference and covers HPC and scientific computing topics broadly.
The 14th Workshop on General Purpose Processing using GPU - 12 or 13 Feb, Seoul, papers due 23 Dec
An OG GPGPU conference, covering everything from compilers and programming languages for GPUs to GPU reliability, applications, containerization, and serverless for GPUs.
RSE Survey - US-RSE and other RSE organizations - due 14 Jan
RSE organizations have put together the 2021 RSE survey - it would be great if those of us in research computing and data broadly who might not consider ourselves primarily software developers could participate. You can complete the 2021 RSE survey in English, French, German or Spanish.
The 17th International Workshop on Automatic Performance Tuning - 3 June, Lyon, part of IPDPS 2022, papers due 31 Jan
Topics include:
ScaDL 2022: Scalable Deep Learning over Parallel And Distributed Infrastructure - 30 May or 3 June, Lyon, part of IPDPS 2022, Papers due 24 Jan
Topics include:
Want your awk script to run faster? Transcompile it to Go.
A free undergraduate cryptography textbook - the Joy of Cryptography.
As with good debugging stories, good “dramatic performance improvement” stories almost always make it into the newsletter. Dramatically speeding up spell checking.
Useful tmux configuration examples, including much better vertical/horizontal split chars ( | and - ), window swapping, and tmux plugins.
Using awk with csv files - yeah yeah, sure, “-F,” , right? Except for all the commas that aren’t delimiters, and… woah, why did none of you ever tell me about csvquote?
Relatedly - latexrun, a latex runner for use within build tools/workflows that handles multiple running, cleanup, etc.
Most of us in this game have had to absorb a lot of background information about open source licenses without realizing it, and the differences between contracts, IP, etc… but if you have a new team member who’s less familiar, this is an unusually good and comprehensive primer, written by actual technologist lawyers.
A nice from-scratch introduction to transformers, a recent (2017) deep learning architecture used for e.g. translation.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.