Title:
Omics Dataset Validation Prior to Download
Challenge Summary:
I want to make it much easier to evaluate publicly available omics datasets—like single-cell RNA-seq, spatial transcriptomics, and DNA methylation arrays—before we actually download anything. The main source for these datasets will be the Gene Expression Omnibus (GEO). The goal is to avoid spending time and resources on downloading large amounts of data, only to find out that most of it isn’t useful for our research. For example, we want to quickly spot datasets that are missing gene names, lack spatial coordinates, or have their data split across multiple files instead of being neatly organized in a single h5ad object.
To do this, we’re planning to build a tool that can cross-check datasets by looking at their metadata and file formats. This will involve some web scraping to automatically collect and organize all the relevant metadata from GEO. The main output will be comprehensive metadata tables that summarize the key features and file details of each dataset. We’ll also include some visualization options—like bar plots, scatter plots, and filtered tables or SQL queries—to help us explore and filter the datasets more effectively. Once we’ve curated a solid list of promising datasets, we can then use existing tools like GEOquery to download and assess only those that are most likely to meet our needs.
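As a rough illustration of what one row of such a metadata table might look like, here is a minimal Python sketch. It assumes the supplementary file names for each dataset have already been collected somehow; the accession labels, file names, suffix-to-format mapping, and table columns below are all hypothetical placeholders, not actual GEO records or an official GEO vocabulary. It classifies each file by extension, records whether a dataset ships as a single h5ad file, and loads the summary into an in-memory SQLite table so it can be filtered with SQL.

```python
import sqlite3
from pathlib import PurePosixPath

# Hypothetical suffix-to-format mapping (an assumption for illustration only).
FORMAT_BY_SUFFIX = {
    ".h5ad": "h5ad",
    ".h5": "hdf5",
    ".mtx": "matrix_market",
    ".rds": "rds",
    ".csv": "csv",
    ".tsv": "tsv",
    ".txt": "text",
    ".idat": "idat",
    ".tar": "tar_archive",
}
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".zip"}


def classify_file(name: str) -> str:
    """Return a coarse format label for a file name, ignoring compression suffixes."""
    suffixes = [s.lower() for s in PurePosixPath(name).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return "other"
    return FORMAT_BY_SUFFIX.get(suffixes[-1], "other")


def summarize(datasets: dict[str, list[str]]) -> list[tuple[str, int, int, str]]:
    """Build one summary row per dataset: accession, file count, h5ad flag, formats."""
    rows = []
    for accession, files in datasets.items():
        formats = sorted({classify_file(f) for f in files})
        rows.append((accession, len(files), int("h5ad" in formats), ",".join(formats)))
    return rows


def build_table(rows: list[tuple[str, int, int, str]]) -> sqlite3.Connection:
    """Load the summary rows into an in-memory SQLite table named 'datasets'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets ("
        "accession TEXT PRIMARY KEY, n_files INTEGER, has_h5ad INTEGER, formats TEXT)"
    )
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", rows)
    return conn


def single_h5ad_accessions(conn: sqlite3.Connection) -> list[str]:
    """Return accessions whose data ship as exactly one h5ad file."""
    cur = conn.execute(
        "SELECT accession FROM datasets "
        "WHERE has_h5ad = 1 AND n_files = 1 ORDER BY accession"
    )
    return [row[0] for row in cur]


# Made-up example input: placeholder accessions and file names, not real GEO entries.
EXAMPLE = {
    "GSE_EXAMPLE_A": ["GSE_EXAMPLE_A_counts.h5ad"],
    "GSE_EXAMPLE_B": ["barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"],
}

if __name__ == "__main__":
    conn = build_table(summarize(EXAMPLE))
    print(single_h5ad_accessions(conn))
```

Checks that need the file contents, such as missing gene names or missing spatial coordinates, cannot be inferred from file names alone, so a real version of this table would need additional columns filled in from other metadata fields.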
I think the biggest challenge will be building a good web scraping module for GEO. I already have a pretty clear idea of several key variables we can extract and where to find them on the GEO DataSets website, but I have no experience with web scraping. That part will probably be the most difficult, and it’s something I’ll need to learn as we go.
There are some existing solutions out there, like Polly (Elucidata) and GEO2R, which are designed to offer a complete end-to-end analysis pipeline. However, they don’t really address this initial step of getting a clear overview of the data format availability for every queried dataset. Our focus is specifically on making this first stage of dataset evaluation much more transparent and efficient.
Number of Team Members Needed:
This team (3 to 6 members) would like to recruit additional members.
Useful Tools/Packages/Software:
Web scraping (e.g., Selenium), data visualization (R, Python), familiarity with GEO datasets, API development, and experience with omics data formats (e.g., RNA-seq, spatial omics)
Submitter:
Maycon Marção, Scientist, St. Jude Children's Research Hospital