Announcing ahab v2.0 — Easily Scale and Automate Bioinformatics Pipelines in the Cloud

--

Back in May (as if that were so long ago), Tuple was excited to announce the release of ahab v2.0 at the London Biotech Show and at the Nextflow Summit in Boston. The updates include a new CLI and improvements that make the system easier to use than ever. If you’re unfamiliar with ahab, we’ll tell you all about it!

What is ahab?

ahab is our Kubernetes-based framework built for scaling bioinformatics workloads in the cloud. It lets you deploy automated, reproducible analysis pipelines in the cloud with ease.

  • Use your existing pipelines and favorite tools.
  • Comes with an easy-to-use API and task scheduler (+ CLI and Python library; see the sketch below).
  • Fully integrates with cloud data lakes.
  • Elastic scalability and robust monitoring.
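
As a quick taste of the Python library, here’s a minimal sketch of queuing up a job. The names below (Client, submit_job) are illustrative stand-ins, not the exact API; see the tuplexyz/ahab-lib repo for the real interface.

    # A minimal sketch of queuing a job with the ahab Python library.
    # Client and submit_job are hypothetical stand-ins for the real API.
    from ahab import Client  # hypothetical import

    client = Client(api_url="https://ahab.example.com/api")  # your ahab API endpoint

    # Point the job at data that already lives in your data lake
    job = client.submit_job(
        pipeline="bwa-alignment",
        inputs={
            "fastq_r1": "/mnt/datalake/samples/S001_R1.fastq.gz",
            "fastq_r2": "/mnt/datalake/samples/S001_R2.fastq.gz",
            "reference_dir": "/mnt/datalake/references/GRCh38",
        },
    )
    print(job.id, job.status)  # e.g., "pending"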

Kuber-what?

Kubernetes (K8s) is an open-source container orchestration system. Inside K8s, your workloads run in Docker containers, and compute nodes spin up as needed.

Example (general) Kubernetes cluster components.

This system is a great fit for bioinformatics because it lets us specify resource needs per node pool. Different job types require different amounts of RAM, CPU, and GPU, and we can accommodate each individually by pipeline.
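
For example, a memory-hungry alignment job can target a high-memory node pool while a lightweight QC job runs on cheap general-purpose nodes. Here’s a minimal sketch using the official kubernetes Python client; the image name, pool label, and resource sizes are illustrative:

    # Sketch: a K8s Job that requests specific resources and targets a node
    # pool, using the official kubernetes Python client. Image name, pool
    # label, and sizes are illustrative.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() in-cluster

    container = client.V1Container(
        name="aligner",
        image="myregistry.example.com/bwa-pipeline:latest",  # hypothetical image
        resources=client.V1ResourceRequirements(
            requests={"cpu": "8", "memory": "32Gi"},
            limits={"cpu": "8", "memory": "32Gi"},
        ),
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="align-sample-s001"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",
                    node_selector={"agentpool": "highmem"},  # illustrative label
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)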

Learn more about Kubernetes: https://kubernetes.io/

Cloud providers offer managed platform services (Azure AKS, AWS EKS, Google GKE) that are fully compatible with open-source K8s.

With cloud K8s services, we take advantage of the “limitless” compute capacity of the cloud, plus best-in-class networking and security. And if your data is already in the cloud, we can analyze it right where it lives.

How does it work?

ahab is designed to “bookend” your existing pipelines, scripts, and code, with minimal modifications:

  1. Grab your code (pipeline, scripts, etc.).
  2. Add ahab CLI commands to the beginning and end of your code. (This gets the pending job information from the ahab API.)
  3. Slightly modify your code to accept the information from the API, such as the paths to your input .fastq files, reference directory, and other parameters (see the sketch after these steps).
  4. Build a Docker image. 🐳
  5. Push the Docker image to a container registry.
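
Put together, a bookended script might look like the sketch below. The ahab calls (get_pending_job, report_status) are hypothetical placeholders for the real CLI/library calls; the middle is your existing logic, untouched apart from reading its inputs from the job:

    # Sketch of a "bookended" pipeline script. The ahab calls shown here
    # (get_pending_job, report_status) are hypothetical placeholders.
    import subprocess
    from ahab import Client  # hypothetical import

    client = Client()
    job = client.get_pending_job()  # bookend #1: fetch job info from the API

    # Your existing pipeline logic, now parameterized by the job's inputs
    with open(f"{job.inputs['output_dir']}/aligned.sam", "w") as out:
        subprocess.run(
            [
                "bwa", "mem",
                f"{job.inputs['reference_dir']}/GRCh38.fa",
                job.inputs["fastq_r1"],
                job.inputs["fastq_r2"],
            ],
            check=True,
            stdout=out,
        )

    client.report_status(job.id, "complete")  # bookend #2: report back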

BYO Docker Containers and Pipelines

With ahab, you’re not locked into using our pipelines (though we have a couple of open-source examples you can use). Simply use your favorite tools and any existing logic you’ve already written.

Some example pipeline tools.

Whether your pipelines are robust, fancy Nextflow masterpieces or just a set of shell scripts duct-taped together, if they run in a Docker container, you can likely run them in ahab!

Example Analyses

No Copying Data Around!

If you already have a cloud data lake housing your -omics data, we can mount it as a Volume in the K8s cluster (see the sketch after this list). This makes your data available to the containers running in the cluster as if it were part of the local filesystem.

  • No need to copy data around, saving time and reducing the chance of data-corruption errors.
  • Get reference data and input files directly from /mnt/datalake/…, for example.
  • Write results directly back to the data lake.
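
Under the hood, that mount can be expressed as an ordinary Kubernetes volume. Here’s a minimal sketch with the kubernetes Python client, assuming a PersistentVolumeClaim named datalake-pvc backed by your provider’s CSI driver (e.g., Azure Blob CSI); the names are illustrative:

    # Sketch: exposing the data lake to a container at /mnt/datalake via a
    # PersistentVolumeClaim. The claim name and image are illustrative; the
    # PVC itself would be backed by your cloud's CSI driver.
    from kubernetes import client

    volume = client.V1Volume(
        name="datalake",
        persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
            claim_name="datalake-pvc"
        ),
    )
    mount = client.V1VolumeMount(name="datalake", mount_path="/mnt/datalake")

    pod_spec = client.V1PodSpec(
        containers=[
            client.V1Container(
                name="pipeline",
                image="myregistry.example.com/bwa-pipeline:latest",  # hypothetical
                volume_mounts=[mount],
            )
        ],
        volumes=[volume],
        restart_policy="Never",
    )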

Scale Up. Scale Down. Automatically.

As new data comes in, your bioinformatician shouldn’t have to run things manually. With ahab, jobs get added to the queue and the K8s service scales up to process them automatically.

Then, when the jobs finish, their statuses are updated in the database and the nodes spin down. You save money by not running virtual machines when they’re not needed. Plus, this frees up time for your bioinformatics team to do something more interesting, like analyzing the results!
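
As a rough sketch of that automation, imagine a small watcher that enqueues a job whenever a new run folder lands in the data lake. The ahab calls are hypothetical; the K8s cluster autoscaler handles the scale-up while jobs are pending and the scale-down once the queue drains.

    # Sketch: enqueue a job for each new run folder in the data lake.
    # The ahab calls are hypothetical placeholders.
    from pathlib import Path
    from ahab import Client  # hypothetical import

    client = Client()
    seen = set()

    for run_dir in sorted(Path("/mnt/datalake/incoming").iterdir()):
        if run_dir.is_dir() and run_dir.name not in seen:
            client.submit_job(
                pipeline="bwa-alignment",
                inputs={"run_dir": str(run_dir)},
            )
            seen.add(run_dir.name)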

How do I get my tentacles on ahab? 🦑

Most of the ahab framework is open source, built on existing cloud platform services, a Python library, and common pipeline frameworks.

ahab CLI/Python library repo: tuplexyz/ahab-lib

We charge a flat consulting fee to deploy the services that run ahab in your environment (like Kubernetes and a database). This also gives us a chance to review your existing data lake infrastructure and help modify your pipelines to work within it.

Watch our Nextflow Summit Talk ▶

Want to learn more?

Send us an email and we’ll gladly show you a live demo of how it all works! Plus, we can learn more about the pipelines you’re looking to scale and automate.

✉: contact@tuple.xyz

🦑: https://tuple.xyz/solutions/ahab/

Stay Curious…

--

Tuple, The Cloud Genomics Company

A Microsoft and Databricks partner consulting company providing expert Azure cloud solutions for genomics and AI.