Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!
This week, Evan Floden (@evanfloden) will tell us all about the new Nextflow Tower command line client.
Video transcription
The content has been edited to make it reader-friendly.
0:01 (host) Hello, everyone. Thank you for joining today’s bytesize talk. I’d like to begin by thanking the Chan Zuckerberg Initiative for supporting all nf-core events. Some information about today’s bytesize talk: the talk is being recorded, and the video will be uploaded to nf-core’s YouTube playlist and shared on our Slack. After the 15-minute talk, we’ll have a session for questions and answers; feel free to post your question in the chat box, or unmute and ask the speaker directly. For today’s talk, we are honored to have Evan Floden, the CEO and co-founder of Seqera Labs, who will be presenting on Nextflow Tower, a centralized command post for launching Nextflow pipelines. Over to you, Evan.
0:56 All right. Thank you very much. Thanks for the intro, and thanks for having me here again. Just a bit of clarification: we’re going to talk about Nextflow Tower, and specifically the CLI, which we released towards the end of last year. It’s been a work in progress and something we’ve been wanting to do for a while now. It tries to solve some key problems people have had. You have Tower itself, which has a graphical user interface and also an API, and we wanted to solve the problem for people who were used to working with Nextflow from the command line, but also for people who wanted an infrastructure-as-code setup.
1:36 Maybe as a brief introduction, I can start with a little background on it and on Seqera. If you don’t know us, we’re the main maintainers of the Nextflow project, and the company is about four years old now. We’re primarily solving problems around workflows becoming more complicated, in terms of the number of steps, but also larger amounts of data and the cloud and cluster configuration side. This ties in with what we’ve been doing with the CLI, because it allows you to define your whole research environment. Nextflow, and particularly nf-core, have done a fantastic job of defining the workflow. If you think of that workflow as typically a repository, maybe with some profiles and some test ways of running those pipelines, there’s a bigger problem outside the workflow definition itself: where does that workflow run? How was that infrastructure created? How can I create all of that in a reproducible way? And how can I start to do this using long-running services, without having to rely on my laptop being the key provider of all of it? That’s what we’re going to go into a little bit today.
2:43 This should really be a bit of a show-and-tell session with some examples of how you can do this from the command line. Please forgive me if there are any typos as we go through, but hopefully it gives you a flavor of what’s possible, and you can start to think about how to apply it in your situation. I’m going to go into the meat of this first. This is the architecture of Nextflow Tower. You don’t need to know tons about it specifically, but the point is that it’s a full-stack web application, and it connects with these computing platforms, all of the different cloud providers, and obviously with the data and the associated repositories.
3:21 How do you interact with that yourself? There are typically three ways of doing so. The first is through a web browser: you connect in through Tower Cloud and use the user interface as normal. Then there’s the command line, which I’m going to show you today. And all of this is built on top of the API layer. Tower has a published API following the OpenAPI 3.0 standard; you can go into the documentation and explore it, and we have a lot of people who build purely on top of that API layer. But there was a need to be able to interact with Tower from the command line as well. Here’s a quick example of what we’re going to see later on: the idea is that you can start to launch pipelines, monitor them, understand a little of what’s going on, and define the infrastructure this way. I’m going to go through a couple of things. The first is the very basics of the installation and setup, and how that works through the repository. Then we’re going to launch some pipelines and see how we can set up the infrastructure as code.
4:29 Now, the way the CLI is written, it’s at parity with the API. This means you can do a lot of things, probably many more than we’ve even considered for most applications. When you see this, I want you to consider other use cases; if you’ve got some, it would be fantastic to hear about them. We had people reach out to us the other day who were using this for uploading data sets and triggering pipelines, for example; we wrote a blog post on that, and it’s a fantastic piece of work. We see a whole bunch of other use cases outside this direct setup, so please shout out if you want to see how that works. The first thing I’m going to point you to is Tower itself and the way it connects to the CLI. You can go to cloud.tower.nf and log in with your credentials; typically people use GitHub or Google to log in. The first thing you see is the community showcase, which is where we’re going to be working for parts of today, and it’s really a collection of pipelines. There’s some free compute, you can jump in and start launching pipelines, and you can interact with it via the CLI as well.
5:39 The point that I want to make here is how the CLI connects to Tower: it connects via your token. You can create as many access tokens as you want here. You can see that I’ve created a token for the nf-core demo. When you create a token, let’s just do a demo example, it gives you the key right there, and that’s how it links in. I’ll quickly remove that one, since it was just for demonstration purposes. I’ve got one which is linked in, and I’m going to export that into my environment. How does it all work? The way it works is that we have the Tower CLI, which is a standalone open-source piece of software you can download from GitHub. We build it for all the different environments: if you go to the tags for each version, we build for Linux, for macOS, and for Windows as well. You simply download the file, make it executable, and run it. I’m running a Mac today, so I’ve downloaded the latest Mac version into my terminal, and then you can start to run it.
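For reference, a minimal install looks something like this; the release URL and asset name below are illustrative, so check the project’s GitHub releases page for the file that matches your platform:

```bash
# Download the tw binary for your platform (asset name is illustrative;
# see https://github.com/seqeralabs/tower-cli/releases for real assets)
curl -L -o tw "https://github.com/seqeralabs/tower-cli/releases/latest/download/tw-linux-x86_64"
chmod +x ./tw    # make it executable
./tw --version   # confirm it runs
```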
6:50 There’s a little bit of background on how this works and how you can set it up, but I’ll leave that for later so we can jump into a demo and see how it works. I’m now going to switch over to my terminal and we can run a little bit from there. Okay, here’s my command line. The first thing you can see is that, instead of calling it Tower or anything like that, the command we call is ./tw. And you can see all the different commands that you have here. Tower itself can interact with all the different aspects of the system: we have things like actions for automation, organizations and collaborators, compute environments, credentials, data sets, and so on. The other nice thing you can do here is to run ./tw info, which is like a health check. It connects to the Tower instance, checks which version we’re using, shows which user I’m authenticated as, and checks a few other details. This can be quite useful for verifying your setup. The way you set this up is you typically export an access token. In this case, I’ve got my access token, which I copied from before; I paste and export it there, and then you’re good to go and you’re connected.
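In practice that setup looks something like this; the token value is a placeholder:

```bash
# Authenticate the CLI with an access token created in the Tower web UI
export TOWER_ACCESS_TOKEN=eyJ...   # placeholder; paste your own token
./tw info                          # health check: connectivity, version, user
```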
8:15 There are a couple of other options to consider when you’re setting up. The first is which workspace you want to work in. I showed you the community workspace before, but if you have your own workspaces set up for your organization, you can simply select one via an environment variable, in this case the Tower workspace ID, or change it on the command line when running a command. I’m going to export the ID of the community showcase here, so anything I do now you’ll be able to follow along live: if you log into Tower, you’ll see all of these actions, the pipelines triggered off, the things that were created, and you can follow them as we go through. I’m going to export this Tower workspace ID, then just confirm that everything is working and it’s all fine. I’ve loaded my credentials, connected to the right workspace, and we’re ready to go.
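A minimal sketch of that, with a placeholder ID:

```bash
# Point the CLI at a specific workspace (the ID below is a placeholder;
# workspace IDs are shown in the Tower web UI)
export TOWER_WORKSPACE_ID=1234567890
./tw info   # confirm the connection and the active workspace
```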
9:11 The first use case for Tower to consider is the primary thing we all do, which is running pipelines, and how we can trigger them off. In Tower, we have a special concept of what a pipeline is; it’s almost a reserved name we’ve created for the system. A pipeline is a combination of the workflow repository, think of this as the git repository where the source code is hosted, combined with the compute environment, which is where the execution of that pipeline will take place, plus some parameters: the defaults, and the inputs you want to use. Those three things together are what we call a pipeline.
10:00 And if we go into the Tower showcase, you should see the exact same thing. If I say tw pipelines list, you’ll see that it lists all of the pipelines that we have inside this community showcase. These are the nf-core pipelines, plus some other ones we’re adding all the time. They’re pre-configured and good to go; again, each is a combination of a repository, a compute environment, and some parameters. If I want to launch one of these pipelines, I’ve got a couple of different ways to do it. The first is to say tw pipelines launch and choose the name of the pipeline I want to launch, or I could choose an ID. Let’s take a default one and launch nf-core-chipseq. You can see I’ve made a typo here somewhere; let me just copy-paste my example instead. Okay, like that. This is the interaction taking place: we’re submitting that pipeline and telling Tower to launch it.
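For reference, the listing and launch steps look something like this; the command forms follow the demo, but the exact syntax can vary between CLI versions, and the pipeline name assumes it is already saved in the workspace:

```bash
# List the pipelines saved in the active workspace
./tw pipelines list

# Launch a saved pipeline by name; Tower runs it in the pipeline's
# configured compute environment, not on this machine
./tw launch nf-core-chipseq
```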
11:34 It’s important to note why this is different from nextflow run. The primary difference is that when we launch this pipeline, we’re directing Tower to launch it. There’s nothing running now on my local machine: no Nextflow instance, no head job or anything. Everything has been delegated to Tower, which itself submits the work into the compute environment. That has a lot of advantages. Firstly, I can shut down my laptop if I need to run out or head home at the end of the day. Secondly, all of the records, the logs, and the information about the execution of the pipeline are now managed by Tower, which keeps a full history. That’s very important: you don’t want to be reliant on your laptop, or wherever you launched the pipeline, to manage that information. It also makes things collaborative. If you go into the workspace now and click on this URL, you can follow along live, and this allows people to work together interactively. Maybe there’s an issue someone can fix and then launch the pipeline again, so we can work in a much more collaborative manner as well.
12:47 That was a basic launch where we didn’t actually change any of the parameters; it was a default launch. But what if you want to run a pipeline where you’re actually changing things up? You’ve got a couple of options. We can define things a little like the Nextflow command line in terms of what we run. Say I now want to run the viralrecon pipeline from nf-core, and instead of running it by default, I want to use a specific profile. That profile might be some test data you’ve got, or something specific to what you want to find. I can use the exact same profiles here to launch it as well. That’s going to trigger off and be launched into the community workspace too.
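That might look like this; the profile name follows the test-data example from the talk:

```bash
# Launch a saved pipeline with a specific configuration profile
./tw launch nf-core-viralrecon --profile test
```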
13:35 Another, more common use case is changing not so much the profile but the input data. What you can do here is use what we call a params file, which you can write in YAML or JSON. Here you can see I’ve defined some input data, with some different options for those inputs, and saved it into a file called params.yaml. I can then launch the nf-core-rnaseq pipeline with tw launch nf-core-rnaseq, and now I have the choice of defining some profiles, but I can also add in the params file itself. Let’s say a profile for test, and I also want to pass the params file, exactly like we would with Nextflow; at this point I just specify the location of that params file. Obviously, this allows you to predefine a lot of things, maybe sample sheets that you’ve got predefined, and trigger off those as well. Okay, that’s my params file; we can see that it works and launch it. That’s the basic use case of launching pipelines. You can, of course, monitor them and follow along, whether from the GUI or from the command line; there’s a whole bunch of different endpoints there.
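A sketch of that flow; the file contents and bucket paths below are illustrative, not from the demo:

```bash
# params.yaml: pipeline parameters in YAML (example values)
cat > params.yaml <<'EOF'
input: 's3://my-bucket/samplesheet.csv'   # hypothetical sample sheet location
outdir: 's3://my-bucket/results'          # hypothetical output directory
EOF

# Launch with a profile and a params file, much like nextflow run
./tw launch nf-core-rnaseq --profile test --params-file params.yaml
```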
15:16 I want to switch gears a little bit now and think about how we can define the infrastructure around this. So far we’ve just launched pipelines and monitored them. What if I now want to set up pipelines for other people, or define my research environment in a more generalized way? I’m going to quickly change over to a different workspace; I don’t want to populate the community workspace with different things. I’ll switch workspaces and then look at the pipelines I have inside this one. It’s a private workspace that you can’t see, but the principles are exactly the same. I’m just going to say tw pipelines list, and it shows all of the pipelines in that workspace. You can see a whole bunch of things we’ve been populating in here: the repository each one is associated with, a lot of the nf-core pipelines, and the name of each pipeline.
16:09 What I want to do now is imagine that I have a test or development version of a pipeline and I want to copy it, or give it to you so you could capture the whole thing in your environment. Put another way, I want to define exactly what that pipeline is made up of, beyond just the workflow itself. One way I can do that with Tower is to take the pipelines command and export a particular pipeline. The way this works is I give it a name; let’s take the first one from the top. I’ll copy-paste this time rather than trusting myself too much. When I export this, you’ll see that it’s exported entirely as this JSON file. This is fantastic functionality because it means I can import and export things using this command and really define all of my pipelines as code. Those pipelines may have different configurations and setups for different environments; you can now define all of that inside the JSON file, and you also have a pretty nice way to interact with it. That means I could go and create a new pipeline, maybe change one or two things, and all of that infrastructure is captured there as well. This principle of importing, exporting, and defining resources as stored files works for all of the resources. I’m showing you pipelines here, but the same thing applies to, for example, credentials: I can list all the credentials inside this workspace, and you can see I’ve got credentials for Google, for GitHub, for Azure, et cetera. All of that becomes available for me to see.
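A minimal sketch of that export/import round trip; the pipeline names are placeholders, and the exact flag spelling may differ between CLI versions:

```bash
# Export a saved pipeline definition to JSON (name is a placeholder)
./tw pipelines export --name my-rnaseq-pipeline > pipeline.json

# Re-create it, e.g. under a new name or in another workspace
./tw pipelines import --name my-rnaseq-pipeline-copy pipeline.json

# The same pattern applies to other resources, such as credentials
./tw credentials list
```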
17:53 Then I can think about how to link that into actual compute environments and generate this on the fly. For example, I can export my compute environment, which is where the compute takes place; in this example it’s AWS Batch, with some credentials attached and a particular setup. If you look at one of those exports, you can see it’s the whole definition that’s required: the region it runs in, the working directory, and how it’s all set up. There’s obviously a whole bunch of these things that you can run. Just to give you a full view, and it probably makes a little more sense now, these are all the commands we’ve seen here.
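The compute-environment export follows the same pattern; the name is a placeholder, and the flag spelling may differ between versions:

```bash
# Export a compute environment definition, e.g. an AWS Batch setup,
# including region, work directory and other settings
./tw compute-envs export --name my-aws-batch-ce > compute-env.json
```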
18:42 If you run tw by itself, you can see you can also interact with creating the workspaces themselves, managing participants, generating pipelines, creating data sets, and so on. I’ll point out one more thing to give you some inspiration on what’s possible: we recently released a blog post that took many of these ideas and put them all into play. The concept was that we wanted people to be able to essentially drop a sample sheet, or have a sample come off a sequencer, and have that trigger the execution of a pipeline all the way through. The Tower CLI is obviously perfect for this. We set this up first on AWS: we defined a Lambda function that takes a sample sheet, which gets generated when data lands in an S3 bucket, then uses the Tower CLI to create a data set, deposit it into Tower, and invoke the job with Tower itself. It allows you to integrate with many different services and create this whole automated flow. There’s a walkthrough if you’re interested in seeing how this can be done; we provide all the files, and there’s also a Git repository if you want to follow it up yourself. But I’ll end with that.
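Stripped of the Lambda plumbing, the core of that automation comes down to two CLI calls; the dataset name and file below are hypothetical:

```bash
# Register the incoming sample sheet as a Tower dataset
./tw datasets add --name run-2022-03-01 samplesheet.csv

# Trigger the pipeline; in the blog post's setup, the params file
# references the newly created dataset as the pipeline input
./tw launch nf-core-rnaseq --params-file params.yaml
```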
20:13 We’re really excited to see what people build with this. We’re expanding it as we add more functionality to Tower, keeping everything aligned in that respect.
(host) Thank you, Evan. That was a really interesting talk on Nextflow Tower. I don’t know if there’s anyone who has any questions to follow up on this one.
20:41 (question) Maybe I can start the questions off. I haven’t used Nextflow Tower before, but are there any supplementary dependencies you need on your local machine when you’re using Tower?
(answer) When you’re set up with Tower, there are a couple of ways to use it. One is just running with Tower from your Nextflow command: you can say nextflow run -with-tower. This still runs the head job on your laptop. To run fully externally, you log into Tower Cloud, and then you can essentially create your compute environment there. They’re two different ways of working, and to really get the full power of this, you should log in; as I say, it runs as a service there. And to provide full clarity around this: this is Tower Cloud, which you can go and log in and use. Our business model is primarily around deploying Tower in customers’ own environments, where customers run their own version of Tower. This is the public one I’m showing you, which you’re free to use.
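For reference, the lightweight, laptop-hosted mode looks like this; the token value is a placeholder:

```bash
# Report a locally-launched run to Tower: the Nextflow head job stays on
# your machine, but execution is monitored in Tower
export TOWER_ACCESS_TOKEN=eyJ...   # placeholder; create one in the Tower UI
nextflow run nf-core/rnaseq -profile test -with-tower
```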
21:42 (question) Okay. We have a question from Paul, and then we’ll go to other people who have their hands up. Paul asks: I have access to an HPC supercomputer at my institution. What is my use case for Nextflow Tower? I’ve only used it on a limited basis and don’t see the need for it. Who are the main users of Nextflow Tower?
(answer) We can think of a couple of use cases; I can jump in and maybe demo a little. One thing in terms of HPC: I’m showing you AWS Batch and different cloud environments here, but if we go down to that same workspace we were working in before, these are the same compute environments I showed you, and you can connect in with all the different platforms, including your own PBS or Slurm cluster. This organizes the infrastructure side of it; you can connect those pieces in as well. There are three primary use cases. If you are creating pipelines and you want to make them available to anyone who maybe doesn’t even have Nextflow or command-line expertise, experimentalists, et cetera, you can define your pipeline in a way that makes it super easy for them to come in. You can create your own customized user interfaces: as a user, I just need to select my input data. I come in and say I want to run my RNA-seq sample sheet, set a few options, maybe save some particular files, and then trigger the execution of that job. It simplifies the whole launch process for them. As a bioinformatician, you probably want to create those pipelines and make them available, with compute, to your users. Maybe you want a long-running service so you’re not relying on your laptop; you have a full history of execution, you can follow pipelines as they run, and you can automate things as well. And from the system-admin side, you can define the compute environments once, and there’s not a lot of extra work around the collaboration side of it. So there’s a whole bunch of use cases there.
23:58 (host) Okay. Apparently everyone was appreciating your talk; I don’t see anyone else with a question. Seems like everyone is satisfied. Thank you so much, Evan, for the talk, and I’m sure people can catch you on Slack if they have any questions.
(speaker) Absolutely. Thanks so much for the time, everyone. And yeah, reach out if you have any other questions. Always happy to take them.
(host) Sure. Thanks, everyone, for joining today’s bytesize talk. See you next week.
(speaker) Thanks so much, folks.