2025 Biohackathon Projects

Comprehensive PacBio Iso-Seq Analysis Pipeline

Title:

Comprehensive PacBio Iso-Seq Analysis Pipeline

Challenge Summary:

For long-read RNA sequencing (RNA-seq) data generated by PacBio Iso-Seq, two primary software tools—IsoQuant and TAGET—enable sample-wise comparisons, such as differential gene and isoform analysis. However, to facilitate rapid identification of differentially expressed genes or isoforms, it is essential to evaluate their performance and integrate these tools into a streamlined pipeline.

docs.omicsbox.biobam.com

We propose developing a Conda-based pipeline that simplifies the installation and execution of these tools. This pipeline will enable researchers to efficiently identify differentially expressed genes or isoforms and visualize the results, thereby aiding biologists and clinicians in interpreting complex transcriptomic data.

Number of Team Members Needed:

3-5 team members are needed. We would like to recruit additional members

Useful Tools/Packages/Software:

Python package development, conda familiarity, and data visualization skills using R or matlab.

Submitter:

Zhongshan Cheng, Center for Applied Bioinformatics, St Jude

Optimized HPC Resource Allocation for CryoEM Processing

Title:

Optimized HPC Resource Allocation for CryoEM Processing

Challenge Summary:

This project will address the challenge of optimizing the HPC job submission for CryoEM jobs. We plan to develop an intermediate layer that allows the specification of the cluster resources and automatically proposes the best parameters for specific jobs that typically run for hours or days. We may also develop web tools to enhance the tracking of resource utilization and facilitate job submissions.

Number of Team Members Needed:

This team (4-6 members) is set and has all the skills necessary to complete our project.

Useful Tools/Packages/Software:

Programming languages: Python, Bash
Web and visualization: HTML, JavaScript, Bootstrap, Matplotlib, Highcharts, Plotly, EMhub
Database and files: SQLite, JSON, Redis
HPC systems: Linux, SLURM, LSF, parallel filesystems, scratch
Version control: Git

Submitter:

Jose Miguel de la Rosa Trevin, Principal Scientific Computing Engineer, St. Jude

Automatic Detection of Range Shifter Board Prior to Proton Beam Delivery

Title:

Automatic Detection of Range Shifter Board Prior to Proton Beam Delivery

Challenge Summary:

Some proton treatments call for the use of a range shifter board, a block of material which patients lay on during proton beam delivery. This board has significant dosimetric impact, shifting the range of the proton beam in the patient shallower by about 4 cm, allowing structures nearer to the surface of the skin to be targeted by the radiation beam. Currently, there is no automatic method to detect whether the range shifter board has been put on the treatment couch, leaving room for human error. This project aims to automatically detect whether or not the range shifter has been placed on the treatment couch using the already in-use in-room cameras. From the MOSAIQ database, we will be able to determine whether the plan calls for a range shifter board. Combining these two systems, we will automatically determine if there is a discrepancy between the plan and the patient setup regarding the range shifter board.

Number of Team Members Needed:

This team (3 members) is set and has all the skills necessary to complete our project.

Useful Tools/Packages/Software:

Python, image analysis, radiation oncology

Submitter:

Samuel Pelletier, St. Jude

Integrated Pipeline for mtDNA Analysis

Title:

Integrated Pipeline for mtDNA Analysis

Challenge Summary:

This project aims to integrate and streamline two distinct Nextflow workflows: (1) a Whole Genome Sequencing (WGS) pipeline for mitochondrial DNA (mtDNA) variant calling and annotation, and (2) a MitoEDIT tool, which predicts base edits, identifies associated bystander edits, and assesses their functional impact. The unified workflow will be embedded into a Planet 9 ecosystem that is a web-based platform with cloud execution capabilities. Specifically, the project will focus on:

Developing an end-to-end Nextflow pipeline that integrates MitoEDIT as a module within the WGS workflow.
Building a user-friendly web interface that supports BAM file uploads, allows for both default and customizable input sequences, and enables user-defined variant allele fraction (VAF) thresholds.
Embedding the integrated workflow as modular applications within the Planet 9 ecosystem for cloud-based execution, ensuring seamless user interaction and computational efficiency.

Number of Team Members Needed:

This team would (4-5 members) would like to recruit additonal members.

For developing the Nextflow pipeline, we are looking for bioinformaticians who

are comfortable working with a terminal Linux environment on the cloud
have strong basic bioinformatics skills
are familiar with Nextflow (nf-core), Docker, and Conda
have experience developing bioinformatics pipelines that have multiple inter-dependent steps

For developing the Planet9 app, we are looking for frontend engineers who

are comfortable working with a terminal Linux environment on the cloud
are familiar with TypeScript, ReactJS, and NextJS
are familiar JS schema validators like Zod

We are also looking for backend engineers who

are comfortable working with a terminal Linux environment on the cloud
are familiar with TypeScript, Python and PostgresSQL
are familiar with Terraform and Kubernetes

Useful Tools/Packages/Software:

A standard virtual machine instance on Azure or potentially AWS, TBD closer to the date.

Submitter:

Kelly McCastlain, Scientist, St. Jude

AI-Driven Clinical Data Exploration Platform

Title:

AI-Driven Clinical Data Exploration Platform

Challenge Summary:

I propose developing an AI-assisted interactive data exploration platform that enhances the functionality of the St. Jude Survivor Portal. While the current Portal provides summary statistics, it lacks dynamic visualizations and natural language interaction. This project introduces a local AI engine that combines general biomedical knowledge with specific insights from curated datasets (e.g., CCSS or public cancer datasets). Users will be able to ask questions in natural language, receive relevant visualizations and summaries, and explore data more intuitively. This approach addresses the challenge of accessibility and interpretability of complex biomedical data, offering a more user-friendly, intelligent interface for researchers and clinicians.

Number of Team Members Needed:

This team would like to recruit additional members. I believe the project can be successfully prototyped in 72 hours with a team of 3 to 4 members. I will provide pre-processed, non-PHI clinical data and plan to begin some preliminary model training prior to the Bio-Hackathon. To complement this, I hope to collaborate with at least 2–3 individuals who bring expertise in AI/ML methods and some researchers have broad familiarity with CCSS/SJLIFE datasets. Their contributions would be invaluable in refining the AI-assisted components and ensuring the clinical relevance of the platform’s features.

Useful Tools/Packages/Software:

Required Skills:

Machine Learning / AI Development – experience with natural language processing, retrieval-augmented generation (RAG), or fine-tuning local language models
Python Proficiency – especially in data visualization, web app development (e.g., Streamlit, Flask), and AI libraries (e.g., Hugging Face Transformers, LangChain)
Data Visualization – ability to create interactive plots and dashboards
Biomedical Domain Knowledge – familiarity with CCSS, SJLIFE, or cancer survivorship datasets
Statistics/Biostatistics – understanding of clinical data structure and survival outcomes
Optional: Experience with integrating large language models into user-facing tools

Submitter:

Zhuo Qu, Biostatistician, St. Jude

Molsnap

Title:

Molsnap

Challenge Addressed:

Image-to-SMILES Conversion Tool

Challenge Summary:

The accurate conversion of chemical structure images into SMILES (Simplified Molecular Input Line Entry System) is a critical task in cheminformatics, yet existing tools often face significant limitations in practical setups. Current solutions are frequently plagued by issues such as low accuracy, poor usability, and restricted access due to paywalls, creating barriers for researchers who rely on these tools for chemical data extraction and analysis.

To address these challenges, the Center for Data Driven Discovery proposes a collaborative initiative at the upcoming St. Jude Biohackathon to develop innovative solutions for image-to-SMILES conversion. This initiative aims to create a tool that is accurate, user-friendly, and accessible to the broader scientific community

Number of Team Members Needed:

This team (5-7 members) would like to recruit additional members.

Submitter:

Vyoma Sheth, Senior Computational Researcher, St. Jude

A Browser-Based Interface for Chemically-Aware Database Queries

Title:

A Browser-Based Interface for Chemically-Aware Database Queries

Challenge Addressed:

Chemical Registration Database Search Interface

Challenge Summary:

Many of the details of the project will of course have to be decided by the eventual assembled team, but roughly speaking the app can be divided into two components.

On the server-side, the app will use a pared-down, static version of an existing chemical registration database. The server-side team will need to determined what data should be calculated and cached for the stored compounds, and build a set of scripts to process substructure, similarity, and exact match searches. These scripts will need to use various hashing and fingerprint filters to speed up search, and would make use of open source cheminformatics tools such as RDKit or OpenBabel. The scripts will listen for and serve responses to to requests coming from the client side.

The client-side team will need to build a working interface for initiating, editing, submitting, and viewing the results of various chemical queries. This interface should be simple, intuitive, and responsive. The team can decide at the outset of the hackathon what client-side libraries they want to use to build this interface, as well as coordinate with the server-side team about the formats and parameters of the requests and responses that would be required.

Number of Team Members Needed:

This team (6-8 members) would like to recruit additional members..

Useful Tools/Packages/Software:

Familiarity with chemistry and cheminformatics is an obvious plus, but the project will require a range of backgrounds, including some totally new to cheminformatics. Comfort with server side scripting languages (like Python or PHP) and/or web app languages (HTML, CSS, Javascript, Node.js, React, Elm, etc.) will also come in handy. I don't think a decision about specific languages should be made until the team is chosen.

Submitter:

Nathaniel R. Twarog, Senior Informatics Scientist, St. Jude

Simple, Flexible Pipeline Execution via OnDemand

Title:

Simple, Flexible Pipeline Execution via OnDemand

Challenge Addressed:

Standardized and Simple Pipeline Execution via OnDemand

Challenge Summary:

This team will implement the ability to run arbitrary nextflow (and hopefully WDL) pipelines backed by St. Jude high performance research computing cluster via OnDemand, a browser interface. Users will be able to upload a sample sheet, configure pipeline parameters, and launch their pipeline easily. Given the parameters for most pipelines can be specified in a YAML/JSON config file, we will set up as many broadly useful pipelines as possible for the community.

The ability to run these pipelines through OnDemand would provide broad access to easily runnable pipelines, properly configured for the St. Jude HPC, using consistent reference data. This would streamline data harmonization efforts, decrease the barrier to entry for generalized data processing, and pave the way for additional internal and external pipelines to be run through OnDemand.

It will save St. Jude researchers time, reducing the effort spent on establishment of well-defined best practice pipelines, and thus enabling more specialized and novel development work.

Number of Team Members Needed:

This team (3-5 members) is set and has all the skills necessary to complete our project.

Useful Tools/Packages/Software:

OnDemand, nextflow, familiarity with nf-core pipelines, WDL

Submitter:

Jared Andrews, Senior Bioinformatics Research Scientist, St. Jude

DL-Image-Lab: A Napari Plugin for Deep Learning - Training, Inference, and Fine-Tuning

Title:

DL-Image-Lab: A Napari Plugin for Deep Learning - Training, Inference, and Fine-Tuning

Challenge Summary:

The goal of this project is to develop a Napari plugin that enables scientists to train, run inference, and fine-tune/adapt deep learning models without writing any code. The core idea is to encapsulate powerful deep learning architectures within a simple, intuitive user interface, making advanced image analysis accessible.

The plugin will leverage HPC infrastructure automatically for training tasks, removing the need for manual setup or scripting. In addition to deep learning capabilities, we plan to incorporate traditional machine learning-based fine-tuning methods (e.g., random forest pixel classifiers) to adapt existing models to new imaging setups or domains with ease.

Problem it solves: This would enable scientists to be comfortable using deep learning models into their day to day analysis. The projects intends to lower the coding and set up entry barrier that exists in people from being hesistant to train models. Annotation and training happens on the same UI . So it is easier to make changes and retrain.

Existing solutions: Currently, building a deep learning solution such as a U-Net-based segmentation pipeline typically requires a strong background in Python and coding. While some tools attempt to reduce this complexity, environment set up and training still remains a hurdle. By providing a no-code, UI-driven plugin that performs advanced deep learning operations, this project would be helpful and alao make it accessible for people to set up their own AI pipelines.

Number of Team Members Needed:

This team (4 members) is set and has all the skills necessary to complete our project.

Useful Tools/Packages/Software:

Machine learning , python and software development

Submitter:

Krishnan Venkataraman, Senior Image Data Scientist, St. Jude

Project FINDIT: Facilitating Intra-Departmental Navigation of Data and Information Transfer

Title:

Project FINDIT: Facilitating Intra-Departmental Navigation of Data and Information Transfer

Challenge Summary:

This is a continuation of a project from last year's BioHackathon. Project FINDIT is a data management project for the various studies in the Department of Psychology and Biobehavioral Science (PBS). The goal is to develop a series of "crawlers" (for this project, I will call them "finders"): a piece of code that "crawls" a directory, recursively searching that directory for data, parsing the metadata from each file, and based onthat metadata does a particular form of processing. We plan to develop finders for more basic data management functions (for example, taking a simple inventory of what data we have and where it is stored), and, eventually, we plan to implement more advanced finders intended for automated processing of data (for example, a preprocessing algorithm that can identify EEG data and perform automated artifact correction). At last year's Hackathon, we developed a simple "inventory" finder that provides a basic profile of the data stored in a given directory. For this year's Hackathon, our goal is to develop a "participant-first" finder, which, given a particular subject ID number, will search a parent directory and provide a report showing what data we have available for that participant. Thus, the goal is for someone to be able to determine what data we have available (imaging, clinical, sleep tracking, questionnaires, etc) for a given research participant.

Number of Team Members Needed:

This team (2 members) is set and has all the skills necessary to complete our project.would like to recruit additional members..

Useful Tools/Packages/Software:

Ability to code in R and/or Python, a familiarity with the research being conducted in PBS, and some foundational knowledge regarding best practices in data science

Submitter:

Kyla Gibney, Postdoctoral Research Associate, St. Jude

Shared Resource Form Generator

Title:

Shared Resource Form Generator

Challenge Addressed:

Reusable Form Modules for Shared Resource Management

Challenge Summary:

We propose developing a set of reusable form modules within SRM2 to streamline form creation across shared resources. Currently, each core independently collaborates with IS to build custom forms from scratch, resulting in duplicated efforts, inconsistent design, and increased maintenance. There is no mechanism for sharing common components, which hinders efficiency and usability.

Our solution will address this by creating standardized, modular templates that can be easily customized and reused across cores. This will reduce development time, enhance user experience through consistency, and simplify maintenance. For the broader St. Jude community, this approach will enable faster onboarding, more reliable data collection, and improved access to services, ultimately supporting more efficient and collaborative research workflows.

We will include input from the IS team, Cores who are in need of custom forms, as well as users of forms.

Number of Team Members Needed:

This team (6-8 members) would like to recruit additional members..

Useful Tools/Packages/Software:

This should be a St. Jude specific project and include users familiar with the different aspects of SRM2. It does not rely on heavy coding knowledge for most roles.

Submitter:

Susanna Downing, Bioinformatics Research Scientist, St. Jude

CAR-T Superheroes: A Comic Book App to Explain CAR-T Cell Therapy to Pediatric Patients

Title:

CAR-T Superheroes: A Comic Book App to Explain CAR-T Cell Therapy to Pediatric Patients

Challenge Summary:

Goal:

Chimeric Antigen Receptor (CAR) T cell therapy is a promising treatment for children with relapsed or refractory leukemia and other cancers. However, explaining this complex therapy to young patients and their families can be challenging and can impact treatment acceptance and emotional resilience. This project aims to develop an interactive comic book app that simplifies CAR-T therapy into a fun, engaging superhero narrative, helping children understand their treatment journey.

Background:

When volunteering for the St. Baldrick’s Foundation, I met an extraordinary individual named Carlos Sandi. He is the father of Phenius Sandi, who was diagnosed with Acute Lymphocytic Leukemia (ALL) cancer at an early age and one of the first pediatric CAR-T Cell Therapy patients. Carlos Sandi’s trials and challenges led me to write a profile narrative titled “Childhood Cancer: The Phenius Sandi Success Story.” Many pediatric cancer patients present at preschool age, so it’s challenging for them to have even a basic understanding of what cancer is doing to their bodies. While researching and interviewing for this story, I conceived the idea of merging the science of CAR-T Cell Therapy with a superhero comic book. Mainly to help patients like Phenius Sandi understand their cancer and treatment. I hope to expand upon this science-art comic idea and apply it to many St. Jude programs.

Objectives:

Educate pediatric patients, siblings, and families about CAR-T cell therapy using simple, engaging language and storytelling.
Promote understanding and empowerment by demystifying complex cancer treatments.
Create a digital resource that hospitals and caregivers can use as part of pre-treatment counseling or educational programs.

Key Features

Interactive Comic Book: This colorful, story-driven comic features superhero characters (CAR-T cell agonists) fighting cancer villain cells (antagonists).
Simple Science:
1. Animations and pop-up explanations that introduce core concepts:
2. What is a CAR-T cell?
3. How do they fight cancer?
4. What does the treatment involve?
5. Why is it special for some patients?
Personalization: Option for kids to add their superhero avatar or name. Include personal milestones (e.g., treatment start date, “hero badge” for every milestone reached).
Multi-Language Support: To make it accessible to kids from different backgrounds.
Gamified Learning: Mini-games (e.g., “Help the CAR-T team find the cancer cell hideouts!”). Achievements to motivate kids during their treatment journey.

Team Roles:

Storyboarding Writers: Create the engaging superhero storyline and dialogue.
Graphic Designers/Illustrators: Design characters, scenes, and comic panels.
Front-End Developers: Build an interactive comic app (web/mobile).
Pediatric Oncology Advisors: Ensure accurate, age-appropriate explanations of CAR-T therapy.
UX/UI Designers: Optimize the interface for children and caregivers (use of App builders or similar tools).
Gamification Specialists: Integrate fun mini-games that reinforce learning.

Impact:

Patient-Centric: Supports kids’ emotional well-being by helping them understand their treatment.
Innovative: Bridges the gap between science and compassion, turning complex cell therapies into an accessible story.
Scalable: Can be adapted for other cell therapies or expanded for different age groups.

Deliverables:

Interactive comic book prototype (web or mobile app)
Storyline script and artwork samples
Demo video
Pitch deck for future development or funding

Number of Team Members Needed:

This team (6-8 members) would like to recruit additional members..

Useful Tools/Packages/Software:

App building expertise.

Submitter:

Eric Dixon, Sr. Medical Writer, St. Jude

Open-Source Innovation for the Future of Cryo-ET

Title:

Open-Source Innovation for the Future of Cryo-ET

Challenge Summary:

Cryo-electron tomography (cryo-ET) is a transformative imaging technique that enables high-resolution, three-dimensional visualization of macromolecular structures directly within cells. As this field advances, there is a growing need for modern, flexible computational tools to handle increasingly complex data and enable innovative analysis approaches.

Currently, data analysis infrastructure for cryo-ET is built on large, legacy codebases that are functional but highly complex and monolithic. This architecture makes it difficult for researchers to modify, extend, or integrate new algorithms, limiting the community’s ability to innovate and collaborate effectively. As a result, progress in developing new methods or improving existing pipelines is unnecessarily slow and fragmented.

In this 72h hackathon a team of developers will build a pipeline for structural cryo-ET based on TeamTomo (https://teamtomo.org/) infrastructure..

This pipeline will:

estimate and correct local motion in the raw images
estimate electron-optical aberrations in the raw images
estimate 3D reconstruction geometry
perform accurate 3D reconstructions
infer particle positions by 3D template matching
exporting particle data and metadata for subsequent refinement and further analysis in existing software packages
give Python developers full access to intermediate data/results

As tangible outcomes, we will deliver a configurable program demonstrating subtomogram averaging on in situ apoferritin tilt series data from our work at St. Jude, the data and pipeline will be made publicly available along with a blog post documenting what was achieved during the hackathon.

Having a complete workflow for cryo-ET built on extensible, modular infrastructure will drive future improvements by creating a plug and play environment inside which developers can develop and validate new methodology, thus removing a key bottleneck for innovation in the field.

Number of Team Members Needed:

This team (9-10 members) would like to recruit additional members..

Useful Tools/Packages/Software:

Python programming, especially with experience in scientific computing and data pipelines
PyTorch framework skills for GPU-accelerated computations and machine learning integration
Cryo-ET data knowledge, including tilt series, tomograms, and subtomogram averaging workflows
Image processing and 3D reconstruction techniques, such as motion correction and template matching
Open-source development practices, including version control (Git) and collaborative coding

Submitter:

Abhay Kotecha, Principle Scientist, Senior Director of Technologies, Center of Excellence for Structural Cell Biology, St. Jude Children's Research Hospital

User-friendly interface for probing proteins

Title:

User-friendly interface for probing proteins

Challenge Addressed:

Proteomic Enrichment Analysis Interface

Challenge Summary:

This project aims to develop a user-friendly tool—or adapt existing ones—to systematically query AlphaFold-predicted structural features for a list of proteins of interest. Specifically, we will assess whether certain structural features (e.g., domain types, disorder regions, secondary structure content) are statistically enriched in a protein list compared to a reference set (e.g., the full mouse proteome or a control protein list).

As a test case, we will analyze a dataset derived from aging mouse muscle, where proteins show significant changes in insoluble/soluble ratios. The dataset includes two protein lists: those with increased insolubility with aging ("up") and those with decreased insolubility ("down"). For each protein, we have calculated log2 fold changes and p-values representing differential solubility. We seek to determine whether certain AlphaFold-derived features are enriched in the “up” versus “down” list, or compared to the whole proteome, to gain insights into the structural underpinnings of age-associated protein aggregation.

This tool would enable researchers to explore structural enrichment in custom protein sets and could be adapted to diverse biological questions involving proteostasis, aggregation, or protein function.

Number of Team Members Needed:

This team (4 members) would like to recruit additional members..

Useful Tools/Packages/Software:

gui programming, AlphaFold database API,

Submitter:

Fabio Demontis, Associate Member, St. Jude

Jude-E

Title:

Jude-E

Challenge Addressed:

AI Chat Agent for Family Resource Navigation

Challenge Summary:

Jude-E will be a locally hosted AI chatbot that provides answers and handles a wide range of questions for patient families at St. Jude. This will assist with understanding campus navigation, diagnoses and treatments, document explanations, and logistical support, such as meals and transportation. We will use Gemma 3 running through Ollama, deployed on a Jetson Nano device, with chatbot responses displayed on a connected screen. This setup enables local processing of AI responses, eliminating reliance on cloud services and making it ideal for sensitive healthcare environments, while ensuring fast, real-time access.

Number of Team Members Needed:

This team (4 members) would like to recruit additional members..

Useful Tools/Packages/Software:

Experience working with ChatGPT APIs or any LLMs
Experience working with Raspberry Pi
Frontend (React or any)
Backend (Python or Node)

Submitter:

Sagar Pathak, Senior Software Engineer, St. Jude

Image correlation platform for in situ Cryo-ET

Title:

Image correlation platform for in situ Cryo-ET

Challenge Summary:

In combination with fluorescent microscopy (FLM) and sample thinning (cryo-focused ion beam milling, cryo-FIB), in situ cryo-electron tomography (cryo-ET) allows us to study macromolecular assemblies in their native cellular environment at nanometer to near-atomic resolution. However, a significant bottleneck is the efficient 2D/3D correlation across diverse microscope systems.

Previously, we addressed this challenge with CorRelator, an open-source Java-based software. CorRelator streamlines the process by directly transforming 2D region of interest identified on fluorescent images to live TEM microscope XY stage positions. We implemented 3D correlation strategies to precisely target macromolecules in 3D across FLM, Cryo-FIB, and Cryo-ET. However, users still face hurdles, including external image analysis, manual registration, and the need to switch between several packages. These inefficiencies often lead to correlation errors, reduced throughput, and increased costs.

We propose to develop a comprehensive 2D/3D open-source correlation platform. The key goals for this BioHackathon are: 1) Improve CorRelator’s ability to efficiently process and handle various aspect of digital images, e.g. image format, metadata, memory management; 2) Integrate CorRelator with processing and analysis functions towards a community-driven collaborative bioimaging platform, e.g. Fiji/ImageJ; 3) Develop automated- or semi-automated 2D and 3D registration process where manual pairing can be completely replaced in the end.

Number of Team Members Needed:

This team (2 members) would like to recruit additional members..

Useful Tools/Packages/Software:

Java and/or Python efficiency, computational image processing including handling image formats, metadata, and large data performance (memory management and compression/decompression)

Submitter:

Jae Yang, Scientist, University of Wisconsin-Madison (Will be onsite at St. Jude)

Omics Dataset Validation Prior to Download

Title:

Omics Dataset Validation Prior to Download

Challenge Summary:

I want to make it much easier to evaluate publicly available omics datasets—like single-cell RNA-seq, spatial transcriptomics, and DNA methylation arrays—before we actually download anything. The main source for these datasets will be the Gene Expression Omnibus (GEO). The goal is to avoid spending time and resources on downloading large amounts of data, only to find out that most of it isn’t useful for our research. For example, we want to quickly spot datasets that are missing gene names, lack spatial coordinates, or have their data split across multiple files instead of being neatly organized in a single h5ad object.

To do this, we’re planning to build a tool that can cross-check datasets by looking at their metadata and file formats. This will involve some web scraping to automatically collect and organize all the relevant metadata from GEO. The main output will be comprehensive metadata tables that summarize the key features and file details of each dataset. We’ll also include some visualization options—like bar plots, scatter plots, and filtered tables or SQL queries—to help us explore and filter the datasets more effectively. Once we’ve curated a solid list of promising datasets, we can then use existing tools like GEOquery to download and assess only those that are most likely to meet our needs.

I think the biggest challenge will be building a good web scraping module for GEO. I already have a pretty clear idea of several key variables we can extract and where to find them on GEO DataSets web site, but I have no experience with web scraping. That part will probably be the most difficult, and it’s something I’ll need to learn as we go.

There are some existing solutions out there, like Polly (Elucidata) and GEO2R, which are designed to offer a complete end-to-end analysis pipeline. However, they don’t really address this initial step of getting a clear overview of the data format availability for every queried dataset. Our focus is specifically on making this first stage of dataset evaluation much more transparent and efficient.

Number of Team Members Needed:

This team ('3 to 6 members) would like to recruit additional members..

Useful Tools/Packages/Software:

Web scraping (e.g. Selenium), data visualization (R, Python), familiarity with GEO datasets, API development, and experience with OMICs data formats (e.g., RNA-seq, spatial OMICs, etc)

Submitter:

Maycon Marção, Scientist, St. Jude Children's Research Hospital

Exploring machine learning methods to predict gene dependency from methylation data

Title:

Exploring machine learning methods to predict gene dependency from methylation data

Challenge Summary:

Understanding which genes are dependencies in cancer cells is critical. While CRISPR dependency screens have been applied in many cancer cell lines (via DepMap), several cancer subtypes (as well as Human samples) are not represented. The goal of this project is to use large-scale methylation data to predict gene dependencies. We have curated a dataset of cells that have been profiled via both methylation array and CRISPR dependency screens. The aim is to use this to construct a machine learning model trained on the methylation data to accurately predict the gene dependency. Many models can theoretically accomplish this, so we want to explore several of the latest techniques in a benchmarking approach to find out which method is the most accurate. Ideally, we'd also love to explore the importance of the CpG 'features' contributing to each model.

Number of Team Members Needed:

No membership specification.

Useful Tools/Packages/Software:

Machine learning, coding (primarily R but also some Python for running certain models). Familiarity with concepts related to the project like statistics, methylation, cancer biology would be nice.

Submitter:

Charlie Wright, Scientist, St. Jude Children's Research Hospital

Genesis: Multi-Modal Agentic AI for Cancer Variant Effect Prioritization

Title:

Genesis: Multi-Modal Agentic AI for Cancer Variant Effect Prioritization

Challenge Summary:

Develop an agentic AI system that autonomously integrates genomic data (germline or somatic variants) to carry out variant prioritization for cancer-related variants. Starting with a VCF file, the front-end LLM agent will parse the VCF variants to dynamically select from a variety of MCP servers (such as BioMCP) and biomedical APIs (reactome, gget, etc.), summarize their output, and identify top priority variants given a disease context.

Number of Team Members Needed:

A team of 4-5 members with diverse skills, primarily rooted in python coding or 'vibe coding' to integrate MCP protocols into front end of LLM will be needed. Can work well with both in person or virtual participation.

Useful Tools/Packages/Software:

Experience working with LLM models such as Ollama, Claude, etc.
Python proficiency to work with existing LLM models
Domain knowledge- variant prioritization pipelines for germline and cancer variants

Submitter:

Ninad Oak, Scientist, St. Jude Children's Research Hospital

LabAssist

Title:

LabAssist

Challenge Summary:

LabAssist would be a chatbot trained on lab specific SOPs to assist lab personal with quickly referencing lab procedures. It would help great if this could be incorporated into Teams.

Number of Team Members Needed:

This team ('3 to 4 members) would like to recruit additional members..

Useful Tools/Packages/Software:

MS Teams apps integration, LLM training.

Submitter:

Peter Hall, Lead Researcher, St. Jude Children's Research Hospital

Shared Resource Email Generator

Title:

Shared Resource Email Generator

Challenge Summary:

Our goal is to deliver an email generation app that can be utilized by all Shared Resources at St. Jude. We will be reaching out to individual shared resource labs (over 20 at St. Jude) to inquire how this could be tailored to their respective needs. CAGE and Hartwell are already on-board and we hope to include as many additional shared resources as possible. The app will be built in R (Shiny) and hosted on Posit Connect. The scope will be an internal facing app only (no external utilization). This team is open to anyone with a range of coding experience from none to advanced.

Number of Team Members Needed:

This team (2 to 4 members) would like to recruit additional members..

Useful Tools/Packages/Software:

R, SQL, HTML

Submitter:

Daniel Darnell, Senior Bioinformatics Analyst, St. Jude Children's Research Hospital

WDL Importing and Modules Repository

Title:

WDL Importing and Modules Repository

Challenge Summary:

The goal of this project is to (a) propose changes to the WDL specification that support formal importing of packages and (b) build a repository of reusable modules for WDL (similar to the nf-core modules library at https://nf-co.re/modules) that can be used within externally developed WDL pipelines. If successfully, this will result in a more modular, reusable package ecosystem for WDL, enabling users to write and run their own workflows with composable modules, and it will cover some of the more popular tools in bioinformatics..

Number of Team Members Needed:

This team as 5 members and woul dlike to recruit members.

Useful Tools/Packages/Software:

Bioinformatics, software engineering, workflow development.

Submitter:

Clay McLeod, Director, Product Development & Engineering, St. Jude Children's Research Hospital

St. Jude KIDS25 BioHackathon Projects

Comprehensive PacBio Iso-Seq Analysis Pipeline

Optimized HPC Resource Allocation for CryoEM Processing

Automatic Detection of Range Shifter Board Prior to Proton Beam Delivery

Integrated Pipeline for mtDNA Analysis

AI-Driven Clinical Data Exploration Platform

Molsnap

A Browser-Based Interface for Chemically-Aware Database Queries

Simple, Flexible Pipeline Execution via OnDemand

DL-Image-Lab: A Napari Plugin for Deep Learning - Training, Inference, and Fine-Tuning

Project FINDIT: Facilitating Intra-Departmental Navigation of Data and Information Transfer

Shared Resource Form Generator

CAR-T Superheroes: A Comic Book App to Explain CAR-T Cell Therapy to Pediatric Patients

Open-Source Innovation for the Future of Cryo-ET

User-friendly interface for probing proteins

Jude-E

Image correlation platform for in situ Cryo-ET

Omics Dataset Validation Prior to Download

Exploring machine learning methods to predict gene dependency from methylation data

Genesis: Multi-Modal Agentic AI for Cancer Variant Effect Prioritization

LabAssist

Shared Resource Email Generator

WDL Importing and Modules Repository