Hi, everyone:
We have articles below on us and our teams - how to onboard, communicate, and develop our team members - our teams and their place in our institutions, maintaining software, filesystems, and a lot of upcoming training and conference events.
There’s several recent papers, not just blog posts, in this roundup, thanks in no small part to a reader and co-worker’s recommendation to get started with Zotero, so that my “to read” papers are now sitting somewhere more organized than browser tabs or in Notes (thanks, Aman)!
So, onwards:
Improve Your Team’s Remote Communications Now - Paige Meekison
We’re all settled into our new largely-remote routines now, but not so settled that bad habits are ingrained yet; this is a good chance to take a look at communications in our teams and make sure things are going well or see what can be tweaked.
In this article, Meekison talks about push vs pull communication. I’ve written before about how we should be encouraging push, especially remotely - this can be quite asynchronous, which works well. People can write up short design documents describing their decisions, they can report on their work via slack or email, etc. It works well for structured communications (like stands), and now can be a good time to revisit expectations about such work.
Pull is more or less necessarily synchronous, but it has its place too. You don’t want to tamp it down too much - you don’t want people toiling away on problems other people could have helped them with. Meekison suggests a “30 minute rule”, something we’ve seen here before - there should be clarity around how long people should struggle with something before asking coworkers.
Who’s Thinking? - Alex Martynov
I’ve written earlier about the benefit of coaching employees — not just giving them answers to problems (which is a hard reflex to suppress given that we’ve gotten to our positions by being knowledgeable and having answers) but coaching them to find the answers themselves. They’ll probably find better answers (they’ve been wrestling with their problems for longer than you have and understand the nuances better), you’re developing their problem-solving skills, and you spend less time doing other people’s jobs.
This article makes the same argument, but casts it in a very simple light, to maybe help make it clearer whether you’re answering questions or coaching them along:
So the distinction is: are you as a manager thinking or is the other person thinking.
A guide to onboarding remote employees - Ashley Priebe Brown, Zapier
I’ve mentioned before that we’ll be onboarding a new trainee entirely remotely for the first time in May; many of us are likely in a similar situation. So this article seemed particularly timely. Brown’s four recommendations, which she expands on in the article, is:
We have a really long 90-day onboarding checklist for new hires that includes some of this - the frequent checkins, the asking for feedback (especially at the end of the 90 days — it’s their job to make suggestions to update the process and to prepare the checklist for the next new hire) so I think we’re ok on the last two items, but we really need to up our game in the first two - getting everyone a buddy (or a slowly rotating list of buddies), and provide more materials that they can read at their own pace.
I think combining that with the earlier push-pull approach (and giving them an expectation that they have ~30 minutes of wrestling with a problem before asking for help) will put us in a good position, if not for this hire than for our next.
Staff engineer archetypes - Will Larson
A lot of - by no means all - research computing managers tend to be what in the tech industry would look like Staff or Principal developers more than Technical Mangers or Directors — extremely senior technical staff which lead the technical work in conjunction with more administrative leaders (they might be Scientific Directors, or VPRs, or something along those lines) who set strategy and direction.
In this article, Larson describes four archetypes of Staff Developers that he’s seen — what they do and their relationship to their administrative partners — and for each describes their leadership styles, and the sorts of tasks they are particularly drawn to and capable at. It’s an interesting read to think about how those roles apply to our own management style and the strengths we have.
Larson’s four archetypes are:
Operational and Fiscal Management of Core Facilities: A Survey of Chief Research Officers - Carter et al.
A lot of people who work in research are unaware of the fact that there’s a lot of research about research, how it’s done, how it’s funded, what seems to work and what doesn’t. A lot of that research is necessarily qualitative, not quantitative, which initially seems wishy-washy to STEM-trained folks, but those methods can be just as rigorous and are solid practice for getting insight into most systems involving people.
This paper gathers information from 58 “chief research officers” (so Vice Presidents Research, Provosts, Vice Chancellors Research, etc.) at research institutions and asks about their core facilities for research - six common ones were microscopy/imaging centres, animal labs, HPC centres, microanalytic centres, fabrication centres/machine shops, aquatic cores, and agricultural cores. Not there are bioinformatics cores or software development teams, which are probably still too new to be common.
What’s interesting is how common the issues are across these very different kinds of centres - the breakdown of funding (internal/external user fees vs institutional funding vs grants), the operations, and the focus on equipment over people (ugh). Across all types of cores, there was statistically significantly differences between how effective they were when run by professional staff vs by tenure-track faculty (ahem). It’s also worth knowing that in general the CROs would narrowly prefer to contact out these services if possible (except for animal care).
Building a Shared Resource HPC Center Across University Schools and Institutes: A Case Study - MacLachlan et al.
Here the authors describe the history of an HPC centre at George Washington University; it’s interesting to read this in the light of the broader study above. We see some of the same themes; “The budget did not include operating budget line items for staff and operating expenses in the initial budget” and yet “New staff resources was one of the most critical success factors as well as the most difficult to fund long-term.”
The centre was able to leverage XSEDE training and XSEDE campus champions program, which is a big help.
And an interesting tidbit: “The services of an HPC management company, R Systems NA Inc.[9], have been retained to facilitate issues resolution during periods when [the facility] is short-staffed.”
Meet the Research Impact Canvas - Benedikt Fecher and Christian Kobsda
One vital but underlooked piece of managing technical work like research computing efforts is making sure we’re providing as much utility to our organization or research community as possible.
In the business and especially startup world there’s the “business model canvas”, a framework for writing out and thinking through how you would sustainably “provide value” for customers — in this article, Fecher and Kobsda present one for research efforts. What are the things you and your team could do that would have the biggest impacts, given the capabilities you have and the needs your organization and communities have?
This is a first version of the canvas, and I think it could use some tightening up, but this provides a way to think about the efforts of research computing teams as a whole but even individual projects or even papers. Whether you use this framework or one of your own creation, having a systematic way of thinking about what you could be doing and what would be most useful is a valuable way to spend some time.
Maintainers Drive Software Sustainability - Steve Smith, Better Scientific Software
Everyone who uses open-source software knows how important maintainers (preferably plural!) are. Maintainers are effectively the product owners of a software effort, coordinating work, deciding what features to add and what not to add, and interfacing with the research community.
In this article on the Better Scientific Software blog, Smith describes the experiences of the Parflow project and its evolution with and without an active maintainer. Simply having a maintainer isn’t enough, there has to be processes and infrastructure (like CI) in place to support the maintainers work.
Focus refactoring on what matters with Hotspots Analysis - Nicolas Carlo, Understanding Legacy Code
We’ve talked before about using hotspots metrics to investigate problem areas in code. Here Carlo goes one step further. There are two sets of metrics that are easy to generate on a code base - stats on the code as it is, like complexity metrics; and stats on the code’s history from git logs or the equivalent.
Here Carlo suggests doing both, and finding areas of the code that are both complex and change frequently. Those are likely targets for significant refactoring efforts. They’re complex, so that they take time to understand, and yet they’re changing a lot, so people are having to go in there, understand complex logic, and make changes despite the complexity. That takes a lot of peoples time, and is error prone.
On the other hand, simple chunks of code that are changed a lot are fairly harmless, and complex chunks of code which aren’t being changed aren’t causing you any problems, so leave it alone.
GitHub is now free for Teams - GitHub
You’ve probably seen this already if it’s of interest - github organizations, which has for some time allowed for unlimited private repos for academic groups (but you had to renew periodically), now has free private repos for all organizations. You can still pay for other features, like code owners or mandatory reviewers, SSO integration with your credentials, etc — but for most research computing teams, this means that free GitHub is more than enough for hosted code repositories.
High-Performance Python Communication with UCX-Py - Peter Andreas Entschev, Rapids
UCX is a low-level communications library which supports many high-performance networks and allows both messaging and RDMA primitives; it comes in part from the runtimes of some MPI implementations, so that it gives MPI-type performance without MPIs very restrictive API.
Dask, a distributed task and DAG workflow package for Python, has had support for UCX directly for a few months, meaning that for instance scikit-learn could take advantage of high-performance networks. But it looks like RAPIDS has now expanded that into UCX-Py, is a(rapidly evolving general package for UCX support in the Python data science ecosystem, allowing GPU-GPU communication over NVLink. This blogpost describes the work and some resulting benchmarking.
NERSC Rolls Out New Community File System for Next-Gen HPC - Inside HPC
Storage 2020: A Vision for the Future of HPC Storage - Lockwood et al
In 2017, participants from the DOE computing community put together the paper by Lockwood et al. taking a look at storage systems for large-scale simulation science and how they should evolve, by 2020 and 2025. Rather than the usual /home, /scratch, /project, and /archive, which reflect centre and allocation process, the argument is that the different file systems should reflect their purpose — temporary storage for data in activity use, campaign storage for (e.g.) a suite of simulations going into a particular paper that will be analyzed together, community storage for files that will be made available to a broader community, and forever archival storage for valuable or irreplaceable datasets. Temporary and campaign storage will be closely coupled to the compute, and community and forever storage much less so. Excitingly (to me), all of these systems will eventually be object stores, with POSIX middleware exposed for temporary and campaign storage. (This is a needed change that HPC-style research computing has been far too slow to adopt).
Today the Inside HPC article reports NSERC has it’s new /community file system, closely following what is outlined in the 2017 paper. This is an exciting development because it shows that NERSC remains committed to that blueprint. NERSC is a leader in several areas of research computing, and I’m hoping that this model expands to other centres.
NVMe and PCIe SSD Monitoring in Hyperscale Data Centers - Khatri and Chakrabarti
Speaking of filesystems, this recent paper by authors at LinkedIn describe their experience with SSD storage, both failure rates and their attempt to monitor their drives. This came up because of inadequacies in existing monitoring; unlike HDDs where monitoring is extremely mature:
Further, there are some classes of failures for which there are no indicators provided by the manufacturer. These go entirely undetected, until the application is directly impacted.
NVMe drives failed at a much lower rate than earlier PCIe-attached cards (although that might reflect the fact that the PCIe drives were significantly older and may have been custom)
On each host, they gathered a subset of available metrics and logged them to see if they could understand and predict failure modes for their drives. Interestingly, in doing this they found a misconfiguration of some fan policies which led to some systems running significantly hotter than they had believed - 40 vs 33 degrees; they fixed this policy immediately. (You always find something new when you start measuring something new in complex systems!) They also uncovered SSD-specific behaviour, where “garbage collection” of copy-on-write blocks led to significant increases in latencies of some applications.
Enabling Quantum Computing Platform Access for National Science Foundation Researchers - NSF
This is interesting to see — two big cloud players and IBM now have some kind of quantum offering - AWS Braket (full points for the name, by the way), Microsoft Quantum, and IBM Quantum Experience. NSF is announcing that they’re willing to entertain support supplemental funding requests for active awards to incorporate these resources into ongoing research efforts. Really curious to see what comes out of this.
Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries - Maciej et al.
Nebula Graph RC4
Terminus Server
Graph Databases are coming of age, and have a number of use cases in research computing- in the humanities and social sciences where they can directly model people systems, and in fields like biology where they can model “knowledge graphs” of dizzyingly complex interconnections between entities. While graphs can be modelled in relational or document databases, having a database specialized for graphs lets you study, analyze, and change the relations and connections between entities (which document stores don’t help you much with) while allowing you to traverse significant sized neighbourhoods around entities which in a relational database would involve numbers of joins which would bring performance to a crawl.
The revised paper by Maciej et al. is a really nice review of graph databases - their uses (from analytics on one side to streaming then to keeping track of changes on the other, very much reflecting the OLAP vs OLTP divide in traditional relational databases) and their underlying implementation. It’s a pretty all-encompassing review, and I can’t meaningfully summarize it here - if it’s of interest, take a look.
This week there were also two new graph databases which tackle different target cases; there’s the larger Nebula Graph which looks like it aims to be an open-source core with commercial “pro” offerings, with clients in Python, Go, and Java and is more on the OLTP side (keeping track of changes), and the in-memory Terminus server, written in Prolog (!!), aimed at analytics and queries. Both support JSON-LD for expressing links, a JSON-like expression of links which seems like it is slowly replacing RDF for such applications.
Podman for MacOS (sort of) - Rarm Nagalingam, RedHat
So the good news is that a lot of HPC centres which were very hesitant to run containers will now more or less be forced into supporting them as they eventually update their RHEL OSes to 8.0 or 8.1 which supports root-less podman. The bad news is that Podman only runs on linux.
But just as Docker originally didn’t run on Mac and there was a separate “docker-machine” toolchain, podman now has a podman-machine which will run on macs and windows and let you build and run podman containers. Hopefully there will be tighter integration for these other platforms shortly.
Hedge Your Bets: Optimizing Long-term Cloud Costs by Mixing VM Purchasing Options - Ambati et al.
When moving research computing workflows to the commercial cloud, it’s not quite clear what the best strategy is. The “build an old-school HPC cluster, but on AWS” crowd wants to just buy 3 years worth of reserved instances and run it like a bare-metal in-house cluster; the “take advantage of the cloud flexibility” crowd wants to do everything on demand, which is substantially more expensive.
This paper builds a hybrid strategy, mixing reserved and on-demand instances based on some aggregate information about workloads, and then tests how that strategy works by modelling the running of a real-world 4-year set of jobs on it, across all three major cloud providers. They use
a batch trace spanning 4 years from a large shared cluster for a major state University system that includes 14k cores and 60 million job submissions
And find their approach substantially less expensive than either naive extreme:
Our results show that our policies incur a cost within 41% of an optimistic offline optimal approach, are 50% less than solely using on-demand VMs, and 79% less than using reserved VMs.
I’m actually surprised there isn’t more data-driven work like this out there, given the long log histories of many research computing centres.
U.S. Department of Energy’s INCITE Program Seeks Proposals for 2021 - June 19
INCITE, which makes available resources on the DOE’s “leadership-class” machines for a small number of high-impact academic projects, including those led by international researchers, has a new call out. As in the last few years, there is increasingly an emphasis on data-driven rather than solely simulation-based projects.
Udacity - Courses free for one month
Besides the usual developer (C++, as well as web service development), cloud, and AI/ML courses. many of which receive high praise, there are courses on digital marketing which might be useful for those of us who need to up our public (and internal!) communications games for our efforts, and product management courses for keeping what we’re building lined up with what our users actually need.
Failover Conf: A Virtual Event on Reliability - Apr 21, 8am-5pm PT
In research computing our implicit or explicit service level agreements for uptime and availability are generally much lower than in the enterprise — at the hospital research institute where I currently work, there’s rightly a huge difference between a research service being unavailable and a clinical service being unavailable — but that’s not to say they’re nonexistent. If users can’t count on our systems being reliable at the level at which they need, they’ll start finding alternatives. Failover is a conference on reliability in the web-services world, but many of the topics - the people system side of reliability and writing good incident reports/postmortems - are just as relevant to us.
Lead Through Effective Communications: Five Key Ways Leaders Communicate Trust in the Laboratory- Laboratory Manager, Apr 28th, 1pm ET
AWS Summit Online: US, Canada - AWS, May 13, 9am-1:30pm PT
The purely virtual AWS summit is more about enterprise computing than reserach computng, but it still has a number of sessions relevant to our work, even if we’re not using AWS in particular. Some sessions that may be of interest:
Developer First Tech Leadership Conference - June 10-12
Finally, in June there’s a (not free: $250 USD) conference specifically on technical leadership - primarily in the tech industry but many of the issues apply equally to research computing (arguably more so than enterprise computing). You can see if there are sessions that would be of interest; the schedule is available here.
Via the always interesting RSE society slack: increasing the width of integers being printed in an error message caused automation tasks to shut down in two data centres supporting the north american power grid. It turns out NERC publishes these Lessons Learned incident reports and they’re both really interesting and a good model to follow if you don’t have one already.
A really cool setup for single-sign on for SSH, with short-lived SSH certificates. Secure and convenient!
An argument that urban planning, not architecture, is the right model for large scale computational systems.
A short twitter thread with several perspectives on when software becomes “legacy”.
I used to work at a place where I had to print forms out, sign them, and scan them back in and send them to finance/HR/wherever because digital signatures “just weren’t the same.” Now, I could never condone a flagrant flouting of such vital and sacred rules, but just FYI there’s a tool for adding a signature to a PDF and degrading the quality so it looks like it’s been scanned.
A tmux replacement “with a smaller learning curve and more sane defaults.”
ArtStor, a resource of images for educational and scholarly use, has a nice collection of images and artwork suitable for Zoom backgrounds.
Are you tired of matching regex’s with single-purpose command line tools or programming language libraries like a sucker? Don’t you wish you could see if “my-us3r_n4m3” matched “^[a-z0-9-]{3,16}$” by abusing the file system interface, just testing to see if a “file” “/m/y/-/u/s/3/r//n/4/m/3” existed after compiling a regexp to a FAT32 driver like a boss? Well, I’ve got fantastic news.