Transcriptomics

The transcriptome is the complete repertoire of RNA transcripts produced in an organism. The human genome comprises about 3 billion bases in each of the (on average) 10^14 cells in one body, and each cell may contain up to 300k RNA molecules. The full transcriptome may therefore contain approximately 8.423 RNA bases... in one body! Translated into bioinformatics data, one cell line or condition could imply 30 Mb of microarray data or 30 Gb of RNA-seq data, and all cells in one body anywhere from petabytes to exabytes. Transcriptomics studies have traditionally been carried out using microarray technologies and, more recently, the next-generation sequencing techniques known as RNA-seq. The main problems and topics addressed in my transcriptomics research are summarized below.

Single-cell RNA-seq Data Analysis - Cell Type Identification


Single-cell RNA sequencing (scRNA-seq) is an emerging technology that captures cell-level information, so that individual cells can be analyzed separately. While many computational methods have been devised for analyzing scRNA-seq data, many problems in this research area remain open. One of the main analytical challenges in scRNA-seq is finding cell types, highly differentially expressed tissue-specific gene sets, or gene-gene interactions, all of which we are currently working on. Among these, we focus on identifying different cell types by combining manifold learning with clustering techniques on scRNA-seq data. Our proposed two-step approach reveals that genes with similar expression patterns are grouped in highly-scored clusters, achieving very high performance in most cases. Efficient nonlinear dimensionality reduction and manifold learning techniques based on modified locally linear embedding significantly improve the clustering step; the addition of independent component analysis enhances visualization in a reduced space.
The method has been tested on a scRNA-seq dataset from NCBI's Gene Expression Omnibus (GEO accession no. GSE148729), which includes 27,072 gene expression profiles of 48,890 human lung cell lines sequenced using Illumina NextSeq 500. In this dataset, different cell lines were infected with SARS-CoV-1 and SARS-CoV-2 and sequenced at different time points.
Gene set enrichment analysis of the highly variable genes (HVGs) obtained from each cluster reveals biomarker genes involved in different gene ontology terms. Pathways enriched by marker genes are shown in the figure (right). Numbers denote the clusters, and edges show the links between clusters and pathways. Nodes highlighted in yellow show the SARS-CoV-2 cell-specific pathway. Most of the other nodes, shown in green, reveal shared and cluster-specific functional pathways in the immune system. Further details can be found in our publication and presentation at the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
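As a rough illustration of how a pathway can be called "enriched" for a cluster's marker genes, the sketch below computes a hypergeometric over-representation p-value. This is one common enrichment test and is an assumption here: the text above does not state which enrichment statistic was used. The function name, parameter names, and toy numbers are all hypothetical.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_markers, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    X is the number of pathway genes found among n_markers genes drawn
    without replacement from n_background genes, n_pathway of which
    belong to the pathway.
    """
    total = comb(n_background, n_markers)
    largest_overlap = min(n_pathway, n_markers)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_markers - k)
        for k in range(n_overlap, largest_overlap + 1)
    )
    return tail / total


# Toy numbers (made up for illustration): 20,000 background genes,
# a 100-gene pathway, and 50 cluster marker genes, 5 of them in the pathway.
p = enrichment_pvalue(20000, 100, 50, 5)
print(f"over-representation p-value: {p:.3g}")
```

In practice the p-values for all tested pathways would also need a multiple-testing correction before a pathway is linked to a cluster.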

Relevant publications:



Single-cell RNA-seq Data Analysis - Prediction of cell-cell interactions


Cell-cell interactions regulate organismal development, homeostasis, and single-cell functions. Disease can occur when cells do not interact properly or decode molecular messages incorrectly. Identifying and quantifying intercellular signaling pathways has therefore become a common analysis across a variety of fields. We introduce a pipeline that identifies cell-cell interactions using graph convolutional networks. The pipeline consists of three steps: pre-processing of the data, construction of the cell graph, and identification of cell-cell interactions with graph convolutional networks.
The datasets can be found at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133 and the code is publicly available on GitHub at https://github.com/sheenahora/SEGCECO and on Code Ocean (DOI: 10.24433/CO.7099053.v1).
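To make the graph-construction and graph-convolution steps more concrete, here is a minimal sketch under stated assumptions. It is not the SEGCECO code (see the repository above for that). The k-nearest-neighbour graph, the function names `knn_graph` and `gcn_layer`, and the toy data are assumptions chosen for illustration; the propagation rule is the standard symmetric-normalized graph convolution, ReLU(D^-1/2 (A + I) D^-1/2 X W).

```python
import math


def knn_graph(cells, k):
    """Symmetric k-nearest-neighbour adjacency matrix (Euclidean distance)."""
    n = len(cells)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted(
            (math.dist(cells[i], cells[j]), j) for j in range(n) if j != i
        )
        for _, j in neighbours[:k]:
            adj[i][j] = adj[j][i] = 1.0
    return adj


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops, then normalize symmetrically by node degree.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    n_out = len(weights[0])
    # Linear transform of each cell's features (X W).
    xw = [[sum(x[f] * weights[f][h] for f in range(len(x)))
           for h in range(n_out)] for x in features]
    # Aggregate over graph neighbours and apply ReLU.
    return [[max(0.0, sum(norm[i][j] * xw[j][h] for j in range(n)))
             for h in range(n_out)] for i in range(n)]


# Toy expression profiles for four cells (two obvious groups).
cells = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
adj = knn_graph(cells, k=1)
identity = [[1.0, 0.0], [0.0, 1.0]]  # untrained placeholder weights
embedded = gcn_layer(adj, cells, identity)
print(embedded)
```

With untrained identity weights the layer simply averages each cell with its graph neighbours; a real model would learn the weights and stack several such layers.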

SEGCECO Pipeline

Relevant publications:


Network Biomarkers

We have proposed a machine learning approach for identifying network biomarkers. We have applied this approach to different subtypes of breast cancer, achieving excellent results in predicting the subtypes and identifying network biomarkers and driver genes specific to each subtype. Our results show that the resulting network biomarkers can separate one subtype from the others with very high accuracy. In more recent work, we have used the same machine learning approach to identify network biomarkers of breast cancer survivability. Given a breast cancer dataset of patients with different subtypes, we devised a novel network-based approach that integrates a protein-protein interaction (PPI) network with gene expression data (1) to identify the network biomarkers (metagenes) of breast cancer survivability and (2) to predict the survivability of breast cancer patients based on subtypes. Our method uses the concept of a seed gene to identify network biomarkers, ADASYN to address class imbalance, and random forest to predict patient survivability. We obtained the best classification performance at distance three from the seed gene protein, where the G-mean, F1-measure and accuracy are 0.900, 0.800 and 90.34%, respectively, using a maximum of 34 genes to predict the survivability of patients. The dataset can be downloaded from this link. Work in collaboration with A. Ngom.
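The idea of collecting candidate genes "at distance three from the seed gene" can be sketched as a breadth-first search over the PPI network. The sketch below only illustrates that neighbourhood step, assuming distance means the number of interaction edges; the gene names, the toy network, and the function name are hypothetical, and the ADASYN and random forest steps are not shown.

```python
from collections import deque


def genes_within_distance(ppi, seed, max_dist):
    """Breadth-first search: map each gene within max_dist edges of the
    seed to its shortest-path distance from the seed."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        gene = queue.popleft()
        if dist[gene] == max_dist:
            continue  # do not expand beyond the requested radius
        for neighbour in ppi.get(gene, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[gene] + 1
                queue.append(neighbour)
    return dist


# Toy undirected PPI network as an adjacency dict (placeholder gene names).
ppi = {
    "G1": ["G2"],
    "G2": ["G1", "G3", "G6"],
    "G3": ["G2", "G4"],
    "G4": ["G3", "G5"],
    "G5": ["G4"],
    "G6": ["G2"],
}
neighbourhood = genes_within_distance(ppi, "G1", max_dist=3)
print(sorted(neighbourhood))
```

The expression values of the genes returned for each seed could then be aggregated into one metagene feature per patient before classification.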

Relevant publications:

ChIP-Seq and RNA-Seq Data Analysis

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) is a technique that provides quantitative, genome-wide mapping of target protein binding events. In ChIP-Seq, a protein is first cross-linked to DNA, and the DNA is subsequently sheared into fragments. Detecting protein binding sites from massive sequence-based datasets with millions of short reads is a bioinformatics challenge that requires considerable computational resources and specialized methods, in spite of the availability of efficient tools for ChIP-chip analysis.

One of the problems we are working on is detecting biologically significant peaks in ChIP-Seq data using an algorithm for optimal multi-level thresholding (OMT). Most existing methods rely on a set of parameters, which may cause the results to vary across datasets. Our method addresses this issue with a new peak finder algorithm based on optimal multi-level thresholding, coupled with a model that finds the best number of peaks using clustering techniques from pattern recognition. The algorithm can also be extended to find significantly enriched regions in RNA-seq data. Some applications, such as detecting alternative splicing sites in RNA-seq data for prostate cancer, are currently being studied (see below).
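To give a flavour of multi-level thresholding on a read-count histogram, the sketch below uses dynamic programming to split the bins into a fixed number of contiguous classes while minimizing the total within-class squared deviation. This criterion is an Otsu-style assumption chosen for illustration; it is not necessarily the criterion used in OMT, and the model that selects the number of peaks is not shown. The function name and toy histogram are hypothetical.

```python
def optimal_thresholds(hist, n_classes):
    """Return the bin indices at which each class after the first starts,
    minimizing the summed within-class squared deviation of bin positions
    weighted by their counts."""
    n = len(hist)
    # Prefix sums of counts, count * position and count * position^2.
    p0, p1, p2 = [0.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i, h in enumerate(hist):
        p0[i + 1] = p0[i] + h
        p1[i + 1] = p1[i] + h * i
        p2[i + 1] = p2[i] + h * i * i

    def cost(a, b):
        """Within-class squared deviation of bins a .. b-1."""
        weight = p0[b] - p0[a]
        if weight == 0:
            return 0.0
        total = p1[b] - p1[a]
        return (p2[b] - p2[a]) - total * total / weight

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    back = [[0] * (n + 1) for _ in range(n_classes + 1)]
    best[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for b in range(c, n + 1):
            for a in range(c - 1, b):
                candidate = best[c - 1][a] + cost(a, b)
                if candidate < best[c][b]:
                    best[c][b] = candidate
                    back[c][b] = a
    # Walk the back-pointers to recover the class boundaries.
    cuts, b = [], n
    for c in range(n_classes, 1, -1):
        b = back[c][b]
        cuts.append(b)
    return sorted(cuts)


# Toy read counts per genomic bin: three enriched regions separated by gaps.
coverage = [3, 3, 0, 0, 3, 3, 0, 0, 3, 3]
cuts = optimal_thresholds(coverage, n_classes=3)
print(cuts)
```

This version runs in O(k n^2) time for k classes and n bins, which is adequate for a sketch; the boundaries it returns fall in the empty gaps between the enriched regions.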

We have tested our algorithm on a variety of datasets, including the FoxA1 dataset, which contains experiment and control samples of 24 chromosomes, and four transcription factors with a total of 6 antibodies for Drosophila melanogaster (available at the GEO database, accession no. GSE20369). OMT has been able to find important binding sites such as the RAD51C promoter in the FoxA1 dataset (figure on the right), where the enriched region (peak) is clearly detected. Our recent publications and presentations at BIBM 2012 and ACM-BCB 2012 provide more details on these results.

Relevant publications:

The Role of Alternative Splicing in Prostate Cancer

Prostate cancer is a very complex disease, and its diagnosis is becoming progressively more common. Worldwide, prostate cancer is the second most common cancer in men. In Canada, this form of cancer claimed the lives of about 4,000 men in 2012. Prostate cancer is among the four most common types of cancer in Ontario, which collectively account for 54 percent of all cancers in the province.

Studying prostate cancer at the molecular level helps researchers uncover the genetic regulatory mechanisms involved in the tumour biology. One of the important tasks that prostate cancer researchers face is to discover biomarkers that help distinguish between benign and malignant tumours, different subtypes, and progression. This is of great significance, since the lack of reliable biomarkers to distinguish, at early stages, tumours that are not likely to grow from those that are most likely to grow is a major challenge in prostate cancer treatment. As such, men with low-risk prostate cancer are often unnecessarily over-treated.

Next-generation sequencing techniques such as RNA-seq can read the transcriptome at a remarkable single-nucleotide resolution, generating millions or billions of short sequences, or reads, which have to be assembled and analyzed. A traditional way to study the transcriptome is to find the role of certain genes (as biomarkers). However, due to alternative splicing mechanisms, each gene can be expressed in many different ways, giving rise to different protein products or isoforms. These can be detected in the RNA-seq data produced, but only if found in the blender.

Our current research focuses on studying the transcriptional mechanisms involved in prostate cancer, with particular emphasis on known and de novo cis and trans RNA alternative splicing and their associated noncoding RNAs that differentiate between benign tumors of the prostate and prostate cancer. Using machine learning algorithms on RNA-seq public datasets from the most recent studies in prostate cancer, we aim to (i) identify alternative splicing events associated with prostate cancer (localized and metastatic) and its differentiation between high-risk and low-risk progression, and (ii) understand the functional mechanisms of the tumor biology implied by the identified splicing events, isoforms and noncoding RNAs, and their related processes like transcription factors, signaling pathways and cellular proteins.

The main approach for data collection, analysis, visualization and interpretation is based on current methods for RNA-seq data analysis, machine learning techniques for prediction, and unsupervised clustering and feature selection methods for biomarker detection. A simplified view of the model is shown in the figure on the right. We are also incorporating data integration approaches for comparing, integrating and validating biomarkers across different types and datasets. For splice junction detection, alternative splicing analysis and enrichment, we mostly use TopHat2, PASSion, and OMT, our recently developed tool for ChIP-seq and RNA-seq data analysis (more details above). An approach to compare and integrate data from different sources and cases (pathways, diseases, drugs, organs and tissues) will help validate the new biomarkers found along with existing transcriptomic biomarkers.

We have proposed a new algorithm for finding differential splice junction events on a two-dimensional histogram. We have identified a small subset of differential junctions for which we are currently investigating the corresponding protein isoforms, function and pathways associated with them. We have been awarded a Seeds4Hope grant by the Windsor-Essex County Cancer Centre Foundation. Latest story in the Windsor Star.

Biomarker Discovery in Breast Cancer Subtypes and Survivability

Recent studies in breast cancer show that tumour tissues can be grouped into at least 5 different subtypes based on different genome-wide and pathological studies: normal breast-like, basal, luminal A, luminal B and ERBB2+. These subtypes have been demonstrated to be predictive of overall prognosis and response to specific chemotherapy regimens. Accurately classifying patient populations into these subtypes in a clinical setting helps significantly in guiding diagnosis, treatment and prognosis. For such a classifier to become a widely used clinical tool, it must rely on a small subset of genes, which reduces the cost of monitoring and screening patients. However, a small set of genes may reduce the accuracy of prediction and hence of diagnosis and follow-up.

In one of our recent studies, we have proposed a hierarchical model (see figure on the right) in which each patient is assigned to the corresponding subtype based on the set of genes that have been optimally assigned to each particular node in the tree. We have used high-throughput microarray data from a previous study by Hu et al., which contains 137 samples, each pre-classified into one of the five subtypes. First, a gene selection method is used to find the subset of genes with the best ratio of accuracy to number of genes. Using these genes, the samples are classified, and the subtype with the best prediction accuracy is selected as the target node on that level (e.g. basal for the root). After removing the samples corresponding to that class, the process is repeated on the remaining samples, step by step, to determine the subsequent nodes and their gene subsets.
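The level-by-level construction described above can be sketched as a loop that, at each level, picks the subtype that is best separated from the remaining samples and then removes it. The sketch makes several simplifying assumptions that are not part of the published model: it skips the gene selection step, uses a nearest-centroid rule as a stand-in classifier, and scores by resubstitution accuracy on made-up one-dimensional data. All function names and labels are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def one_vs_rest_accuracy(samples, labels, target):
    """Accuracy of a nearest-centroid rule separating target from the rest."""
    pos = [s for s, l in zip(samples, labels) if l == target]
    neg = [s for s, l in zip(samples, labels) if l != target]
    c_pos, c_neg = centroid(pos), centroid(neg)
    hits = 0
    for s, l in zip(samples, labels):
        predicted_target = math.dist(s, c_pos) < math.dist(s, c_neg)
        hits += predicted_target == (l == target)
    return hits / len(samples)


def build_hierarchy(samples, labels):
    """Order the subtypes from the root of the tree down to the last leaf."""
    samples, labels = list(samples), list(labels)
    order = []
    while len(set(labels)) > 1:
        # Pick the subtype that is currently easiest to separate.
        best = max(sorted(set(labels)),
                   key=lambda t: one_vs_rest_accuracy(samples, labels, t))
        order.append(best)
        # Remove its samples and repeat on the remainder.
        keep = [i for i, l in enumerate(labels) if l != best]
        samples = [samples[i] for i in keep]
        labels = [labels[i] for i in keep]
    order.append(labels[0])
    return order


# Toy one-gene expression values for three made-up subtypes.
samples = [[0.0], [1.0], [10.0], [11.0], [12.0], [13.0]]
labels = ["A", "A", "B", "B", "C", "C"]
order = build_hierarchy(samples, labels)
print(order)
```

In the toy data, subtype "A" is far from the other two and is therefore chosen for the root, mirroring how the best-predicted subtype is placed first in the tree.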

Following this approach, all samples are classified into one of the subtypes. We have been able to reach very high prediction accuracy while relying on only a very small subset of genes. Our results show that, using only 18 genes, the model is able to classify the cancer subtype of each patient with accuracy higher than 95%. Moreover, 15 of the 18 selected genes have been related to the cancer subtypes in previous studies. The work has also been funded by a Seeds4Hope grant from the Windsor-Essex County Cancer Centre Foundation.

Relevant publications:

Prediction of microRNA

MicroRNAs are a class of small non-coding RNAs that play a crucial role in gene regulation by binding, perfectly or imperfectly, to the three prime untranslated regions (3' UTR) of messenger RNAs, causing either repression of translation of the mRNAs into proteins or their cleavage. Researchers have estimated that about one third of human genes are regulated by microRNAs. MicroRNAs perform many cellular tasks, including controlling cell developmental timing, cell death and stem cell characterization. Many studies have shown that malfunction of microRNAs may have serious effects on cell life and may cause different types of cancer, heart disease and nervous system disorders. Thus, identification of microRNAs is an essential step in discovering microRNA functions and their role in cellular processes.

Two types of methods can be used to predict microRNAs: experimental and computational approaches. Experimental approaches rely mostly on the most recent sequencing techniques and hence tend to be costly. Various computational approaches have been proposed for microRNA prediction. One of the main difficulties encountered in this task is the imbalance between the positive and negative classes. We have proposed an approach, miLDR-EM, which combines linear dimensionality reduction (LDR) with explicit mapping onto higher dimensions to distinguish precursor microRNAs from both pseudo hairpins and other non-coding RNAs. LDR+EM combined with feature selection has been shown to yield excellent performance, better than previous methods, on a dataset of 691 non-redundant human pre-microRNA and 8,494 human pseudo hairpin sequences. By using only three features for normalized/ensemble free energy, miLDR-EM reaches very high prediction accuracy and a geometric mean of 92.20%, which is an excellent performance for a class imbalance problem.
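Two ideas in the paragraph above lend themselves to a small worked example: explicitly mapping features into a higher-dimensional space so that a linear rule can separate the classes, and scoring with the geometric mean of sensitivity and specificity, which is not inflated by a large negative class. The sketch below is an illustration under assumptions and not the miLDR-EM method: the quadratic map, the nearest-class-mean classifier (a simple linear rule standing in for LDR), and the toy data are all hypothetical.

```python
from math import dist, sqrt


def explicit_map(x):
    """Map a scalar feature to (x, x^2), an explicit quadratic mapping."""
    return (x, x * x)


def class_means(points, labels):
    """Mean vector of each class."""
    means = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        means[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return means


def predict(means, point):
    """Assign the class whose mean is nearest (a linear decision rule)."""
    return min(means, key=lambda label: dist(point, means[label]))


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    n_pos = sum(t == positive for t in y_true)
    n_neg = len(y_true) - n_pos
    return sqrt((tp / n_pos) * (tn / n_neg))


# Imbalanced toy data: 2 positives at the extremes, 8 negatives near zero.
xs = [-2.0, 2.0, -0.5, -0.25, 0.0, 0.25, 0.5, 0.1, -0.1, 0.3]
ys = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

raw = [(x,) for x in xs]
mapped = [explicit_map(x) for x in xs]

raw_means = class_means(raw, ys)
mapped_means = class_means(mapped, ys)
g_raw = g_mean(ys, [predict(raw_means, p) for p in raw])
g_mapped = g_mean(ys, [predict(mapped_means, p) for p in mapped])
print(f"G-mean without mapping: {g_raw:.2f}, with mapping: {g_mapped:.2f}")
```

In the original one-dimensional space the two class means nearly coincide, so no linear rule works well; after the quadratic mapping the positives sit far from the negatives along the x^2 axis and the same linear rule separates them.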

Relevant publications and presentations:

Microarray Data Analysis

We have studied the main aspects of microarray data analysis, with emphasis on DNA microarray image gridding and segmentation, as well as gene selection, biomarker detection and clustering of time-series gene expression data. I have contributed to the field of microarray image and data analysis for more than ten years. The most recent contributions and results are listed below.

DNA Microarray gridding: We have proposed a hill-climbing approach for gridding DNA microarray images, and a new polynomial-time algorithm for optimal multilevel thresholding for solving the same problem with almost perfect accuracy, while being free of parameters. My recent book on Microarray Image and Data Analysis by CRC Press includes some of these contributions. Relevant publications:

Clustering of Microarray Time-series Data: I have pioneered the model for pairwise and multiple profile alignment for time-series gene expression data. In this work, we proposed a universal alignment model used to represent time-series gene expression data, which is independent of the clustering algorithm being used. Relevant publications:

Other contributions: We have proposed pattern recognition models based on clustering and supervised classification for DNA microarray image segmentation. We have also proposed heuristics for oligonucleotide selection, new approaches for gene selection and missing values, and optimizing the parameters of fuzzy k-means for microarray data clustering.

Relevant publications: