The Gulbenkian Training Programme in Bioinformatics

GTPB has run at the Instituto Gulbenkian de Ciência every year since 1999.
More than 1000 course attendees in 9 years
Pedro Fernandes, GTPB organizer

Powerful Search and Mining in Biological Databases


PSMBD08 - Course Website

Deadline for applications: September 8th 2008
Course date: September 15th - 19th 2008


Ophir Frieder is the Royden B. Davis, S.J., Chair in Interdisciplinary Studies at Georgetown University. He is also the IITRI Chair Professor of Computer Science and the Director of the Information Retrieval Laboratory at the Illinois Institute of Technology, from which he is currently on leave. He frequently consults for industry and government, and on key intellectual property litigation. His research focuses on scalable information retrieval systems, spanning search, retrieval, and communications issues. His systems are deployed in commercial and governmental production environments worldwide. He is a Fellow of the AAAS, ACM, and IEEE, and the recipient of the 2007 ASIS&T Research Award.
Georgetown University, Washington DC, USA

Jay Urbain completed his PhD in Computer Science at the Illinois Institute of Technology, where he was a member of the Information Retrieval Laboratory. Dr. Urbain is currently an Assistant Professor in the Department of Electrical Engineering and Computer Science at the Milwaukee School of Engineering, where he has taught since 2003. His research interests include information retrieval, text and data mining, machine learning, and bioinformatics. Dr. Urbain has taught for Learning Tree International and has consulted for Blue Cross Blue Shield of Northeastern Pennsylvania, Intelligent Medical Objects, Emersion Electric, and American Honda Motor Co. Previously, he was VP of Software Development for ThinkMed LLC, a health care data mining company, and Sr. Engineering Manager for Patient Monitoring Software at Marquette Medical Systems.
Milwaukee School of Engineering, Milwaukee WI, USA

Course Description

We introduce the foundations of information systems as they relate to Bioinformatics. We assume no prior knowledge and begin with an overview of the principal methods in relational database management systems, information retrieval, and data mining. We describe evaluation approaches and support instruction with hands-on laboratory assignments. This course serves as an introduction to work in the field and gives participants the fundamental understanding needed to conduct research in this domain.
Target audience: The course is designed for a mixed audience: newcomers to Bioinformatics, with or without a biological background, and practitioners who wish to deepen their knowledge of the foundations of information systems as they apply to Bioinformatics. It will also suit power users and software developers. Special attention will be given to Search and Data Mining techniques in the design, construction and exploration of biological databases. Some particular aspects of their implementation on high performance platforms will also be covered.

Course Timetable:

PSMBD08 - Powerful Search and Mining in Biological Databases
Sept 15th Bioinformatics: scope and purpose
Day 1
09:30 - 11:00 Introduction to the course
A brief Molecular Biology primer (for non-Biologists)
Gene sequences -> amino acids -> proteins -> function
Topics in Bioinformatics
- Sequence alignments (find similarity between DNA / protein - amino acid sequences)
- Genome assembly (combine genomic fragments to form whole genome)
- Gene identification & annotation (classify)
- Microarrays & gene expression analysis (use DNA microarray to measure mRNA)
- Protein folding (compute 3D protein structure from protein sequence and vice versa)
- Phylogeny (infer common ancestry relationships and assign confidence levels)
11:00 - 11:30 Coffee Break
11:30 - 12:30 Bioinformatics: grand challenges
Find genomes of all organisms
Identify and annotate all genes
Compute sequence from 3D structure for all proteins, and vice versa
Identify protein function and gene regulatory networks
Compare DNA / protein sequences for similarity
Compare families of DNA / protein sequences
12:30 - 14:00 Lunch Break
14:00 - 16:00 Bioinformatics hands on lab and algorithms
Understanding the complexity of bioinformatics problems
Introduction to basic bioinformatics algorithms
Pairwise sequence alignment
Multiple sequence alignment
Phylogeny analysis
Protein structure prediction & alignment
16:00 - 16:30 Tea Break
16:30 - 18:00 Hands on lab
Pairwise sequence alignment with dynamic programming, BLAST
Sept 16th DBMS and the relational data model
Day 2
09:30 - 11:00 Introduction to Database Management Systems
Relational data model
11:00 - 11:30 Coffee Break
11:30 - 12:30 Relational operators and SQL
Relational operators
Parallel database principles
12:30 - 14:00 Lunch Break
14:00 - 16:00 Data modeling
Data modeling: 1:1, 1:m, m:m relationships
Hands on lab: ERD data modeling
16:00 - 16:30 Tea Break
16:30 - 18:00 Hands on lab: SQL query lab
Sept 17th Information Retrieval and Online biological databases
Day 3
09:30 - 11:00 Information Retrieval - Part 1
Introduction to IR
Measuring effectiveness: Precision/Recall
IR System architecture
11:00 - 11:30 Coffee Break
11:30 - 12:30 Application development
Query processing
12:30 - 14:00 Lunch Break
14:00 - 16:00 Introduction to NCBI databases
Entrez Gene
Hands on Lab 1: Searching online biological databases
16:00 - 16:30 Tea Break
16:30 - 18:00 Application development
NCBI database application development
Hands on Lab 2: Using Java to interface to NCBI databases
Sept 18th Information Retrieval / Introduction to Data Mining
Day 4
09:30 - 11:00 Retrieval Models and utilities
Vector Space
Stemming, synonyms, term proximity
11:00 - 11:30 Coffee Break
11:30 - 12:30 Scalable Information Retrieval Techniques
Parallel information retrieval principles
12:30 - 14:00 Lunch Break
14:00 - 16:00 Hands on IR lab and Introduction to Data Mining
Hands on IR lab
Evaluate different retrieval functions with IR system
Introduction to Data Mining
Data preprocessing
Data warehousing
Machine learning
16:00 - 16:30 Tea Break
16:30 - 18:00 Applications
Gene co-occurrence clustering
Identifying biological pathways
Hands on DNA chip data clustering lab
Sept 19th Data Mining / Biological Literature Search
Day 5
09:30 - 11:00 Machine learning
Principles of machine learning
Machine learning algorithms
Naïve Bayes
Decision trees
Association rules
Probabilistic graphical models
11:00 - 11:30 Coffee Break
11:30 - 12:30 Hands on machine learning lab
12:30 - 14:00 Lunch Break
14:00 - 16:00 Biological Literature Search and Mining
Explosion of biological data and literature from research using high throughput devices
Need to annotate gene expression, protein function, regulatory networks
Gene/protein named entity identification, term normalization, synonymy, polysemy
Integration of structured data and text
Dimensional data model
16:00 - 16:30 Tea Break
16:30 - 18:00 Application development
Multi-evidentiary retrieval models
Document, passage retrieval
Wrap up

Pre-requisites: Elementary computing skills.


Instituto Gulbenkian de Ciência, Apartado 14, 2781-901 Oeiras, Portugal

Last updated: August 25th 2008