About GoLizard

GoLizard is a bioinformatics analysis pipeline designed to pre-annotate unknown or mutated protein sequences. It leverages the Gene Ontology (GO) framework and integrates multiple inference strategies, including domain analysis, sequence similarity, structural homology, and motif detection.

Introduction

The growing number of newly discovered and mutated proteins has outpaced functional annotation, leaving a critical gap. Traditional approaches often rely on sequence-based or structure-based inference alone, which limits their effectiveness. GoLizard addresses this challenge by combining multiple complementary strategies to enhance GO-based functional interpretation.

Analysis Modes

GoLizard offers two distinct analysis modes:

Functional Analysis: Predicts per-sequence functionality, generating BP (Biological Process), MF (Molecular Function), and CC (Cellular Component) graphs for each protein sequence.
Similarity Analysis: Clusters multiple sequences by the semantic similarity of their GO terms, computed with GOGOsim (Zhao & Wang, 2018) and a weighted best-match average (BMA) algorithm. Produces dendrograms showing sequence relationships, along with ranked GO terms and their evidence scores.
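
To illustrate the best-match average (BMA) idea behind the Similarity Analysis mode, here is a minimal Python sketch. It is not GoLizard's implementation: the weighting scheme GoLizard applies is not described here, so the sketch shows the plain, unweighted BMA. The function name best_match_average, the toy_term_sim lookup, and the placeholder term IDs (T1, T2, T3) are all assumptions made up for the example; in a real run the term-to-term scores would come from GOGOsim.

```python
from typing import Callable, Iterable

# Made-up term-to-term similarities for placeholder terms (NOT real GO data).
TOY_SIM = {
    frozenset(("T1", "T2")): 0.5,
    frozenset(("T2", "T3")): 0.25,
}


def toy_term_sim(x: str, y: str) -> float:
    """Hypothetical stand-in for a GO term similarity measure."""
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.0)


def best_match_average(
    terms_a: Iterable[str],
    terms_b: Iterable[str],
    term_sim: Callable[[str, str], float],
) -> float:
    """Unweighted best-match average between two sets of GO terms."""
    a, b = list(terms_a), list(terms_b)
    if not a or not b:
        return 0.0
    # For every term, keep only its best-scoring partner in the other set.
    best_a = [max(term_sim(x, y) for y in b) for x in a]
    best_b = [max(term_sim(x, y) for x in a) for y in b]
    # Average the best matches from both directions.
    return (sum(best_a) + sum(best_b)) / (len(a) + len(b))


score = best_match_average(["T1", "T2"], ["T2", "T3"], toy_term_sim)
print(score)  # 0.6875
```

A weighted variant would multiply each best match by a per-term weight before averaging; which weights to use is a design decision this sketch leaves open.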

Core Tools and Methodology

InterProScan: Identifies annotated domains from multiple member databases.
BLAST: Identifies sequence similarity to known proteins for GO transfer.
ESMFold: Predicts protein structure via BioLM.ai.
Foldseek: Detects structural homology for function prediction.
ELM (Eukaryotic Linear Motif): Searches for functional motifs in disordered regions.
GOGOsim: Computes semantic similarity between proteins from their GO terms (Zhao & Wang, 2018).

Input and Output

Input:

FASTA file containing one or more protein sequences.
Similarity analysis requires multiple sequences for clustering.
Optional: PDB structure files for structural analysis.
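
The input requirements above can be checked before a run. The following Python sketch parses FASTA text and verifies that enough sequences are present for the chosen mode. The helper names parse_fasta and check_mode and the mode labels "functional" and "similarity" are hypothetical and chosen for illustration only; they are not part of GoLizard's interface.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier")
            current_id = tokens[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequence lines may be wrapped; concatenate them.
            records[current_id] += line
    return records


def check_mode(records: Dict[str, str], mode: str) -> None:
    """Raise ValueError if the input does not fit the (hypothetical) mode."""
    if not records:
        raise ValueError("no sequences found")
    if mode == "similarity" and len(records) < 2:
        raise ValueError("similarity analysis needs at least two sequences")


example = [">seqA example protein", "MKT", "AYI", "", ">seqB", "GSH"]
records = parse_fasta(example)
check_mode(records, "similarity")
print(records)  # {'seqA': 'MKTAYI', 'seqB': 'GSH'}
```

To read from disk, pass an open file handle, for example parse_fasta(open("proteins.fasta")), where the file name is a placeholder.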

Output:

Functional Analysis: BP, MF, CC graphs for each sequence showing predicted GO terms and their relationships.
Similarity Analysis:
- Ranked GO terms with associated evidence scores for each sequence.
- Dendrograms of sequence clusters based on semantic similarity.
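
As a rough illustration of how a dendrogram can be derived from pairwise semantic similarities, the sketch below performs agglomerative clustering in plain Python. Several points are assumptions rather than documented GoLizard behaviour: average linkage is used as the merge rule, distance is taken as 1 minus similarity, and the sequence names and similarity values are made up. The function name average_linkage is likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Cluster = Tuple[str, ...]


def average_linkage(
    names: List[str],
    sim: Dict[FrozenSet[str], float],
) -> List[Tuple[Cluster, Cluster, float]]:
    """Return the merge order as (cluster_a, cluster_b, distance) tuples.

    `names` must be unique; `sim` maps each unordered pair of names
    to a similarity in [0, 1].
    """

    def dist(a: Cluster, b: Cluster) -> float:
        # Average pairwise distance between members (distance = 1 - similarity).
        total = sum(1.0 - sim[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters: List[Cluster] = [(n,) for n in names]
    merges: List[Tuple[Cluster, Cluster, float]] = []
    while len(clusters) > 1:
        # Merge the two closest clusters; each merge is one dendrogram node.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up pairwise similarities for three placeholder sequences.
toy_sim = {
    frozenset(("seq1", "seq2")): 0.75,
    frozenset(("seq1", "seq3")): 0.25,
    frozenset(("seq2", "seq3")): 0.25,
}
for a, b, d in average_linkage(["seq1", "seq2", "seq3"], toy_sim):
    print(a, b, d)
# ('seq1',) ('seq2',) 0.25
# ('seq3',) ('seq1', 'seq2') 0.75
```

Each merge corresponds to one branch point in the dendrogram, with the distance giving its height. For larger inputs, a dedicated clustering library would be a more practical choice than this quadratic-per-step loop.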

Applications

Pre-annotation of unknown protein sequences.
Functional annotation of proteins in non-model organisms.
Mutational impact analysis in proteomics studies.
Characterization of hypothetical or novel proteins.
Predictive insights into pathogen protein function.

Contact

For questions or support, please contact: