Roundup - Building trust; Internal engineering conferences; Don’t get stuck on finding a mentor; SlackLog; Single Decision Makers; Incident Response and Postmortems
Hi, all!
The holiday season is starting! I don’t have an essay here for you this week, but I do have lots to share with you in the roundup - so let’s get started!
How to Build Trust - Jacob Kaplan-Moss
Kaplan-Moss has a great article on building trust, say in a new team, even if the team has trust issues from bad experiences in the past.
It’s a great article, and I won’t try to summarize it - do go and read it. I greatly appreciate his two fundamental points: (1) it will take time, and (2) a table-stakes prerequisite for everything else is to do our frickin’ job as a manager first. Once that’s done, you can consistently work through the steps he lays out (again, read the article for the details).
Deploying an internal engineering conference - Dave Mangot
Institution-wide conferences are a great way of sharing expertise, and demonstrating both to our teams and to our institutions that we’re centres of excellence, not temp agencies (#133), or, relatedly, that we’re expert professional service teams not utilities (#127).
Mangot gives a quick overview of how engineering conferences are organized in his organization. For us, the equivalent would be to include not just our own teams and peer teams, but teams from (say) core facilities, the stats help desk, etc. The conference can be broadly technical, or highly topical around a scientific technique or a research field. I’ve seen all of these work really well, and help to kick-start technical relationships and collaborations between far-flung teams in the same institution.
Activating peers and advisors instead of getting stuck on “finding a mentor” - Ami Vora
This comes up periodically, but it’s worth emphasizing, especially as our attention starts being drawn to the coming year and future plans.
As Vora says, don’t get stuck on “finding a mentor”. Look for advice and mentorship, not A Mentor who will Guide You. It may turn out that someone you get advice or mentorship from on a specific question turns out over time to become one (of hopefully many) long-term mentor, but don’t expect or aim for that. Besides, your mentorship needs will change over time. In the short term, just ask someone for some advice.
Embracing SlackLog - Arjit
In our one-on-ones we probably keep track of the accomplishments of our team members, but we often don’t do that for ourselves. And that’s unnecessarily rough on us, because it’s hard enough keeping track of our impact on our teams (#117 - managerial sensory deprivation) without keeping notes.
Most of us spend a lot of our day in Slack or Teams or some such tool. If that’s the case for you, the author suggests making a point of keeping track right there. There’s almost always a “me” DM channel in these tools where we can dump notes - what’s more, as explained in this article, we can easily link out to conversations or documents elsewhere in the tool. Further, we can pretty easily set reminders to ping ourselves to log our accomplishments or things that went well.
Then we can periodically flip through the channel and pull out key wins for our stakeholders or for our own performance review (if we’re lucky enough to get one).
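If you want to go one step further and script the logging itself, here’s a minimal sketch - my own, not from Arjit’s post - using the slack_sdk Python library; the token, member ID, and message format are all placeholder assumptions:

```python
# A hypothetical helper for dropping a quick "win" note into a Slack DM.
# With a user token this is the DM-with-yourself ("me") channel; with a bot
# token it's a DM from the app to you. Token and member ID are placeholders.
from slack_sdk import WebClient

client = WebClient(token="xoxp-your-token-here")   # assumed token with chat scopes
MY_MEMBER_ID = "U0123456789"                       # your own Slack member ID (assumed)

def log_win(text: str) -> None:
    """Post a short accomplishment note into the DM conversation."""
    # conversations_open returns (or creates) the IM conversation with a user,
    # and chat_postMessage drops the note into it.
    im = client.conversations_open(users=[MY_MEMBER_ID])
    client.chat_postMessage(channel=im["channel"]["id"], text=f"Win: {text}")

log_win("Unblocked the imaging group's pipeline migration")
```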
Navigators - Will Larson
This article by Larson is about assigning technical decision making for an area to a named individual, and, when the scope of a decision involves other people’s areas, having those people confer, escalating to more senior decision makers only when agreement can’t be reached.
Our individual teams are small enough that we often don’t need to explicitly do this. But the opportunity for this comes up a lot when we need cross-team technical work.
Too often, we structure cross-team efforts as if they were research collaborations, with huge committees meeting together to discuss and decide everything, and the result is that nothing gets done, and nothing improves:
It’s not that [such collaborative decision making processes] can’t be good, I’ve worked with a number of very good ones, but even the best are slowed by consensus, and suffer the downsides of consensus-driven decision making (particularly, optimizing for acceptable decisions rather than best decisions). I’ve also found that consensus-driven groups are slow to improve, even with direct coaching, because it’s generally difficult to figure out who to coach, and accountability can be a bit amorphous.
I think that second point is actually the most important. Any one slowly-made, suboptimal decision is fine - maybe not great, but fine. The real issue is that, because of the lack of accountability, nothing ever really gets better. There’s no amount of coaching or advocacy that can realistically be done to improve things from one deliberation to the next, to speed decision making or make better decisions.
Whenever we’re setting up these sorts of cross-team efforts, especially ad-hoc “short term” efforts (you know how that goes), it’s best to have clear accountability and decision making, and that never comes from a 12+ person advisory body or “work group”. It comes from a single named person being responsible for decisions. That doesn’t mean they can’t be consultative and aim for consensus, but it means it ultimately needs to be their call.
Research Data Management in the Canadian Context: A Guide for Practitioners and Learners - Thompson, Hill, Carlisle-Johnston, Dennie, Fortin
Just passing this along - we’re starting to see research data management practices codified in texts (here, a collection of essays), which is a terrific sign of growing maturity of a practice.
While this open textbook is written for the Canadian context, it would largely apply in the US as well, and the basic ideas would be relevant beyond North America. It could be useful for sharing with researchers, or for starting up a data management practice within a school or department.
Building a culture of incident response - Jess Chang
Gaming a disaster recovery test - Jenny Gray
One of my least popular takes is that, across disciplines of technical research support, our research software development and research data teams tend to be closer to modern best practice than our research computing systems teams.
As an example, research software teams tend to have a routine practice of learning from what’s gone well and what’s gone poorly, if only through sprint retrospectives. It’s relatively uncommon for research computing systems teams to have any corresponding practice. But retrospectives are Management 201 (#137); they are to teams what one-on-ones are to individuals. They’re how teams, not just individuals, get feedback to grow their capabilities.
One way to develop this practice in systems teams is to adopt a routine of incident response and postmortems for when things go wrong (or nearly go wrong).
Chang writes about what goes into building an incident response culture. None of it needs to be an enormously complicated, heavyweight process. It’s just about noticing that something isn’t running to your standards, having a set of steps you go through so that someone is in charge of making sure information is gathered and things are fixed, and then learning from all of the above. Chang also writes about holding blameless postmortems (essentially, retrospectives) afterwards.
One can go beyond waiting for incidents to happen, of course. Anything that can go wrong eventually will; why wait to see if our teams will be ready? We can simulate them, or even create incidents on our own terms to make sure we’re able to handle them (which could be full-on chaos engineering, or just “disasterpiece theatre” events).
Creating or simulating events sounds dramatic, but it’s no more dramatic than testing every so often to make sure your backups can in fact be cleanly restored; that’s a simple simulation of a data loss event.
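That drill can be as small as a script that restores last night’s backup into scratch space and verifies it against a checksum manifest. Here’s a minimal sketch (mine, not from either article) - the backup path and the manifest format are assumptions for illustration:

```python
# A small "restore drill": unpack a backup tarball into a temporary directory
# and verify every file against a sha256 manifest written at backup time.
# Paths and the "checksum  relative/path" manifest format are assumed.
import hashlib
import tarfile
import tempfile
from pathlib import Path

BACKUP = Path("/backups/projectdata-latest.tar.gz")    # assumed location
MANIFEST = Path("/backups/projectdata-latest.sha256")  # lines of "hash  path"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as scratch:
    with tarfile.open(BACKUP) as tar:
        tar.extractall(scratch)                         # the "restore"
    failures = []
    for line in MANIFEST.read_text().splitlines():
        expected, relpath = line.split(maxsplit=1)
        restored = Path(scratch) / relpath
        if not restored.exists() or sha256(restored) != expected:
            failures.append(relpath)
    print("restore drill:", "OK" if not failures else f"{len(failures)} files failed")
```

Even something this simple, run on a schedule, turns “we think the backups work” into “we checked this week”.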
Gray walks us through a disaster recovery test (which is a lot simpler to run when there’s already an incident process in place!) where they took down a test system:
We gave engineers advance warning that a test would happen, but only within a 2 week window, and we didn’t tell them what we would break. Instead we created a fake incident alert for the on-call engineer, and then allowed the team to do what it would normally do, using all our standard monitoring tools.
We don’t often get to hear details about how companies run these processes, so it’s really nice to see how this worked at Kooth. Gray echoes something I’ve heard repeatedly from teams that run exercises like this: people learn not just about the particular system/failover being simulated, but about everything around it:
One of the biggest learnings for me personally was how worthwhile the exercise was for the engineers. […] Engineers who haven’t seen that large an incident learned a lot about our process, our infrastructure and monitoring tools. That could be invaluable knowledge the next time a serious incident happens.
Yeah, sure, Quantum Computing sounds cool, but I’m not sure I can take it seriously until there’s a library written in a real scientific programming lang… oh, wait! Here we go. quAPL - a quantum computing library in APL (video).
We all think we have excellent reasons for the hoops our users have to jump through, and yet they will absolutely work around them the second they interfere with getting their work done. Here’s a healthcare example: a blog on “You want my password, or a dead patient?”
Flappy Bird, implemented in the macOS Finder by playing tricks with symlinks.
Making a font for the first time.
Yeah, we all have some embarrassing technical debt we work around, but I bet you haven’t had to ban daylight savings time from an entire organization because of a poor data modelling choice.
This might be interesting - process-compose, a docker-compose for non-containerized applications.
I’ve said in the past that Python’s advantage as a research software language is that one can incrementally add safety or performance (or whatever else) as one wants, e.g. as the software moves up technology readiness levels (#91). Here’s a story of speeding up some Python code 170,000x while staying in Python.
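As a tiny illustration of that incremental-performance point (my example, not from the linked article): the same calculation written as a plain Python loop and then as a vectorized NumPy version - same language, same answer, orders of magnitude apart in runtime:

```python
# Mean absolute pairwise distance, first as a naive double loop, then
# vectorized with NumPy - adding performance without leaving Python.
import numpy as np

def mean_pairwise_distance_loop(xs):
    total, n = 0.0, len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(xs[i] - xs[j])
    return total / (n * (n - 1) / 2)

def mean_pairwise_distance_numpy(xs):
    a = np.asarray(xs, dtype=float)
    diffs = np.abs(a[:, None] - a[None, :])       # full n x n distance matrix
    return diffs.sum() / (len(a) * (len(a) - 1))  # each pair counted twice

xs = np.random.default_rng(0).random(2000)
assert np.isclose(mean_pairwise_distance_loop(xs), mean_pairwise_distance_numpy(xs))
```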
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.