11. Genomics and Proteomics – Essentials of Molecular Biology


Genomics And Proteomics

  • Introduction
  • Genomics
    • Classification of genomics
  • Structural Genomics
    • Genome mapping
    • Genome sequencing
    • Genome sequence assembly
    • Genome annotation
    • GenBank
    • Gene ontology
  • Functional Genomics
    • Expressed sequence tags (ESTs)
    • Serial analysis of gene expression (SAGE)
  • Microarray or Gene Chip
    • Applications of microarray technology


  • Proteomics
  • Classification of Proteomics
    • Expression proteomics
    • Structural proteomics
    • Functional proteomics
  • Proteomics Tools
  • Proteomics and Drug Discovery
    • Proteomics and new drug discovery
    • Proteomics and diseases
    • Proteomics and drug designing
  • Summary
  • References

‘Bioinformatics’ is the branch of research science that involves the development of computational tools and databases for a better understanding of living organisms. Bioinformatics is limited to sequence, structural and functional analyses of genes, genomes and their products. It differs from the related field of ‘computational biology’, which includes all biological areas that involve computation.

‘Genomics’ is the study of the genome of an organism involving the simultaneous analysis of a large number of genes using automated data gathering tools. Genomics includes various fields such as genome mapping, sequencing, functional genomics and comparative genomics. The advent of genomics and the rapid explosion of sequence information have greatly driven the development of bioinformatics. Genomics involves the systematic use of genome information, in association with other data, to provide solutions in biology, medicine and industry. Genomics has the potential to offer new therapeutics and diagnostics.

‘Proteomics’ is the study of the proteome. ‘Proteome’ refers to the entire set of proteins that are expressed in a cell. Proteomics involves the simultaneous analysis of all the translated proteins in a cell and includes their identification, quantification, localization, modifications, interaction and functions.


Classification of Genomics

Genomics can be divided into structural genomics, functional genomics and comparative genomics.

  • Structural genomics is the initial phase of genome analysis and includes construction of genetic and physical maps of a genome, identification of genes, annotation of gene features and comparison of genome structures. It also involves the determination of three-dimensional (3D) structures of all proteins in a cell (structural proteomics).
  • Functional genomics refers to the analysis of gene expression and gene functions of a genome.
  • Comparative genomics involves the analysis and comparison of genomes of different species.

Structural Genomics

Structural genomics is also called ‘classical genomics’. The first step in understanding the structure of a genome is genome mapping.

Genome Mapping

Genome maps describe the locations of genes on a chromosome. Genome maps are of three types, namely genetic linkage maps or genetic maps, physical maps and cytologic maps.

Genetic linkage maps or genetic maps: These identify the relative positions of genetic markers on a chromosome. Genetic markers are regions of chromosomes whose inheritance pattern can be followed. For many eukaryotes, genetic markers represent morphologic phenotypes. Genetic maps also reveal how frequently the markers are inherited together. The closer two genetic markers are, the more likely it is that they are inherited together, because they are less likely to be separated by a crossing-over event. The distance between two genetic markers is measured in centiMorgans (cM). The centiMorgan, or map unit (m.u.), is a unit of recombination frequency for measuring genetic linkage; 1 cM corresponds to a recombination frequency of 1% (in the human genome, 1 cM is approximately 1,000 Kb on average).
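
The relationship between recombination frequency and map distance can be illustrated with a minimal sketch. The function name and the offspring counts below are hypothetical, and the direct conversion only holds as an approximation for closely linked markers, where double crossovers are rare.

```python
def map_distance_cm(recombinant_offspring: int, total_offspring: int) -> float:
    """Return the genetic distance in centiMorgans (map units).

    Assumes the markers are close enough that the recombination
    frequency (in percent) approximates the map distance.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must lie between 0 and total_offspring")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical test cross: 18 recombinant offspring among 200 scored.
distance = map_distance_cm(18, 200)
print(f"{distance:.1f} cM")  # prints "9.0 cM"
```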

Physical maps: These are maps of identifiable regions on the genomic DNA. The distance between genetic markers is measured directly in kilobases (Kb) or megabases (Mb). As the distance in this case is expressed in physical units, it is more accurate and reliable than the cM used in genetic maps. Physical maps are constructed using chromosome walking techniques. In ‘chromosome walking’, a number of radio-labelled probes are hybridized to a library of DNA clones. By identifying the overlapping clones detected by the probes, the relative order of the cloned fragments can be established.

Cytologic maps: These refer to the banding patterns of stained chromosomes, which can be directly observed under a microscope. The observed light and dark bands are the markers in this case, i.e., a genetic marker can be associated with a specific chromosomal region or band. The banding patterns are, however, not constant and vary according to the degree of chromosomal contraction. The distance between two bands is expressed in units called ‘Dustin units’.

Genome Sequencing

DNA sequencing is carried out using the Sanger method (refer to the section ‘DNA Isolation and Sequencing’ of Chapter 9). The fluorescent traces of the DNA sequences are read by a computer program that assigns bases for each peak in a chromatogram. This process is called ‘base calling’. There are two approaches for whole genome sequencing, namely the ‘shotgun approach’ and the ‘hierarchical approach’.

Shotgun approach

This method randomly sequences clones from both ends of cloned DNA. The various steps involved in the process can be discussed as follows (Figure 11.1).

  • The genomic DNA of the organism to be sequenced is isolated.
  • It is then randomly sheared and restriction digested to yield DNA fragments of about 2 Kb and 10 Kb.
  • The smaller (2 Kb) and larger (10 Kb) fragments are then ligated to plasmid vectors and transformed into bacterial cells and cultured. These two collections of plasmids containing the 2-Kb and 10-Kb DNA fragments are known as plasmid libraries.
  • The plasmid libraries are then sequenced. The cloned DNA fragments are sequenced from both ends. Every sequencing reaction generates about 500 bp of sequence data. Thus, millions of sequence reads are generated.
  • Overlapping sequence data are identified and the regions of contiguous sequences are assembled.
  • Computer algorithms are used to assemble the millions of sequenced fragments into a continuous stretch representing the complete genome.
  • Remaining gaps are identified, and the coding regions and regulatory regions are predicted.


Figure 11.1 Whole genome shotgun sequencing method


Hierarchical shotgun approach

This is also known as clone-by-clone or BAC-to-BAC sequencing. This method is slow, but the results are more accurate. The steps involved in the process are as follows (Figure 11.2).

  • DNA is cut into pieces of about 150 Kb and inserted into bacterial artificial chromosome (BAC) vectors, which are transformed into E. coli, where they are replicated and stored. This collection of BAC clones is known as a BAC library.
  • The BAC inserts are isolated and mapped to determine the order of each cloned 150-Kb fragment. This is referred to as the Golden Tiling Path.
  • Each BAC fragment in the Golden Path is fragmented randomly into smaller pieces (1.5 Kb) and each piece is cloned into an M13 vector. An M13 library is thus generated.
  • The M13 libraries are sequenced. These sequences are aligned so that identical sequences overlap. These contiguous pieces are then assembled into finished sequence, once each strand has been sequenced about four times to produce 8X coverage of high-quality data.
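
The idea of sequence coverage mentioned in the last step can be sketched with simple arithmetic: coverage is the total number of sequenced bases divided by the length of the target. The function names below are hypothetical, and the insert size of 150,000 bp and read length of 500 bp are illustrative assumptions rather than fixed values of the method.

```python
import math


def coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average number of times each base of the target is sequenced."""
    return num_reads * read_length / target_length


def reads_needed(target_coverage: float, read_length: int, target_length: int) -> int:
    """Smallest number of reads that reaches the requested average coverage."""
    return math.ceil(target_coverage * target_length / read_length)


# Illustrative assumption: one cloned insert of 150,000 bp, reads of 500 bp.
insert_length = 150_000
read_length = 500

n = reads_needed(8, read_length, insert_length)
print(n, coverage(n, read_length, insert_length))
```

Average coverage says nothing about how evenly the reads are spread, so gaps can remain even when the average figure looks sufficient.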


Figure 11.2 Hierarchical shotgun sequencing


Genome Sequence Assembly

The genome sequencing reaction generates short sequences of about 500 bp. These short fragments are joined to form larger fragments after removing the overlaps. These longer merged sequences are called contigs. These are usually about 5,000–10,000 bp long. Overlapping contigs are then merged to form ‘scaffolds’ (30,000–50,000 bp). These are also called ‘super contigs’. Overlapping scaffolds are then connected to create the map of the genome.
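
The merging of overlapping reads into contigs can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another); the function names and the short reads are hypothetical, and real assemblers use far more elaborate methods.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical short reads that overlap end to end.
reads = ["ATGGCGTAC", "GTACGGATT", "GATTCCGA"]
print(greedy_assemble(reads))  # prints ['ATGGCGTACGGATTCCGA']
```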

Assembling all shotgun fragments into a full genome is a computationally challenging step. A variety of programs are available for processing the raw sequence data.

Genome Annotation

Before the assembled sequence is deposited into a database, it has to be analysed for useful biological features. The genome annotation provides comments for such features. Annotation in simple terms means the process of identifying the coding regions of genes, their respective locations in a genome and determining the functions of these genes after the genome has been sequenced. Annotation is of two types namely:

  1. Structural annotation identifies genes on the genome and is also called gene finding. This can be done by computer analysis using automatic annotation tools. Examples include ORF (open reading frame) Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and Glimmer (Gene Locator and Interpolated Markov ModelER), a system for finding genes in microbial DNA (http://www.cbcb.umd.edu/software/glimmer/).
  2. Functional annotation is the process of determining the biological information associated with the sequences, including their function and the regulation of their expression.
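
The simplest form of gene finding, scanning for open reading frames, can be sketched as follows. This is a minimal illustration and not the method used by ORF Finder or Glimmer: it assumes the standard stop codons, searches only the three forward-strand frames, and uses a hypothetical function name and test sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the 0-based index of the ATG, `end` is the index just past
    the stop codon, and `min_codons` counts codons before the stop.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in the third reading frame.
dna = "CCATGAAATTTGGGTAACC"
for start, end, frame in find_orfs(dna):
    print(start, end, frame, dna[start:end])  # prints "2 17 2 ATGAAATTTGGGTAA"
```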

Gene annotation is a combination of theoretical prediction and experimental verification. Gene structures are first predicted by programmes such as GenScan or FgenesH. These predictions are then verified by BLAST (Basic Local Alignment Search Tool) searches against a sequence database. The predicted genes are further compared with experimentally determined sequences using pairwise alignment programmes such as GeneWise and Spidey. Once all predictions are checked and the ORFs are determined, the functional assignment of the encoded proteins is carried out by homology searching using BLAST searches against a protein database (a database is an organized collection of data for one or more purposes, usually in digital form). Functional descriptions are then added by searching protein motif and domain databases, for example, Pfam and InterPro.


GenBank

GenBank is a DNA sequence database maintained by the NCBI (National Center for Biotechnology Information), a division of the National Library of Medicine at the National Institutes of Health in Bethesda (Maryland). It is an annotated collection of all publicly available DNA sequences.

DNA sequences can be submitted to a database prior to publication in journals, so that an accession number may appear in the paper. The various options for submitting data to GenBank are:

  • BankIt, a WWW-based submission tool for convenient and quick submission of sequence data.
  • Sequin, NCBI’s stand-alone submission software.
  • tbl2asn, a command-line program that automates the creation of sequence records for submission to GenBank. It is used primarily for the submission of complete genomes and large batches of sequences.
  • Barcode submission tool, a WWW-based tool for the submission of GenBank sequences and trace data for barcode of life projects.

There are several ways to search data from GenBank:

  • Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided into three divisions, namely CoreNucleotide (the main collection), dbEST (expressed sequence tags) and dbGSS (genome survey sequences).
  • Search and align GenBank sequences to a query sequence using BLAST. BLAST searches CoreNucleotide, dbEST and dbGSS independently.

The GenBank database is designed to provide access within the scientific community to the most updated and comprehensive DNA sequence information.

Gene Ontology

The description of gene functions uses natural language, which is often imprecise. Scientists working on different organisms tend to apply different terms to the same types of genes or proteins. Therefore, protein functional descriptions must be standardized. This necessitated the development of the ‘gene ontology project’, which uses a standard vocabulary to describe molecular functions, biological processes and cellular components. This standardization provides consistency in describing protein functions. The standard vocabulary is organized such that a protein function is linked to the cellular function through a hierarchy of descriptions with increasing specificity. The top of the hierarchy gives a broad picture of the functional class, while the lower levels specify the functional role.
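
The hierarchy of descriptions can be pictured as a small graph in which each term points to its broader parent terms. The sketch below is a toy model: the term labels, the dictionary and the function name are illustrative assumptions, and the real vocabulary uses stable identifiers and several relationship types that are not modelled here.

```python
# Toy hierarchy (illustrative labels only): each term maps to its broader terms.
TOY_PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def broader_terms(term, parents):
    """Return every broader term reachable from `term`, most specific first."""
    found = []
    queue = list(parents.get(term, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(parents.get(current, []))
    return found


# Walking up from a specific functional role to the general functional class.
for label in broader_terms("protein kinase activity", TOY_PARENTS):
    print(label)
```

Annotating a protein with the most specific term therefore implies all of its broader terms, which is what allows consistent comparison of annotations made at different levels of detail.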


Functional Genomics

Functional genomics determines the functions of genes on a large scale using new ‘high-throughput technologies’. High-throughput analysis involves the simultaneous analysis of all genes of a genome. It is also termed ‘transcriptome analysis’, which is the expression analysis of the full set of RNA molecules produced by a cell under a given set of conditions. Functional genomics is a general approach to assigning biological functions to genes with currently unknown roles in all organisms. It also finds a role in novel drug discovery. Functional genomics is mostly experiment based. Transcriptome analysis facilitates the understanding of metabolic, regulatory and signalling pathways within the cell.

Expressed Sequence Tags (ESTs)

This is one of the high-throughput approaches to genome-wide profiling of gene expression. ESTs are short sequences obtained from complementary DNA (cDNA) clones and they help in the identification of full-length genes. They are about 200–400 nucleotides in length and are obtained from the 5′ end or 3′ end of the cDNA of interest. Libraries of cDNA clones are prepared. To generate EST data, clones in the cDNA library are randomly selected for sequencing from either end of the inserts. The EST data provide a rough estimate of the genes that are actively expressed in a genome. This is because the frequencies of particular ESTs reflect the abundance of the corresponding mRNA in a cell and hence give a picture of gene expression. By random sequencing of cDNA clones, EST analysis also helps to discover new genes. TIGR Gene Indices (www.tigr.org/tdb/tgi.shtml) is an EST database and dbEST (http://www.ncbi.nlm.nih.gov/dbEst) is the EST database of GenBank (Figure 11.3).

Serial Analysis of Gene Expression (SAGE)

This is another high-throughput, sequence-based approach for analysing gene expression. SAGE is more quantitative in determining the mRNA expression in a cell. In this method, short DNA fragments of about 15 bp are excised from cDNA sequences and used as unique markers of gene transcripts. The sequence fragments are called tags. They are subsequently linked together, cloned and sequenced. The transcripts are analysed computationally in a serial manner. Once gene tags are identified, their frequency indicates the level of gene expression. This approach is more efficient than EST analysis, as it uses a short nucleotide tag to represent each gene transcript and allows sequencing of multiple tags in a single clone. SAGE analysis has a better chance of detecting weakly expressed genes (Figure 11.4).


Figure 11.3 EST analysis



Figure 11.4 SAGE-serial analysis of gene expression


The procedure involved can be discussed as follows:

  • First, a cDNA strand of each transcript in the cell must be generated.
  • The mRNA of eukaryotes is polyadenylated, i.e., a poly(A) tail is added to the 3′ end of the final transcript.
  • A primer consisting of multiple ‘T’s can be made that will base pair with the complementary poly(A) tail of every mRNA in a cell.
  • Once the primer has bound to the mRNA, the enzyme reverse transcriptase can make a DNA strand that is complementary to the RNA.
  • This DNA strand will then be converted to a double-stranded DNA molecule.
  • The cDNA that has been created is then cleaved using an ‘anchoring enzyme’.
  • The anchoring enzyme is a restriction endonuclease that recognizes and cuts specific four bp DNA sequences. Since this enzyme requires only four specific nucleotides, it cleaves DNA molecules often, resulting in every cDNA that has been generated being cut at least once.
  • The cut cDNA is then bound to streptavidin beads through its 3′ end (in the standard protocol, the oligo(dT) primer carries a biotin label for this purpose) and is thereby immobilized.
  • The sample of bound cDNAs is then divided in half and ligated to either linker A or B. These linkers are designed to contain a Type IIS restriction site.
  • Type IIS restriction endonucleases cut at a defined distance up to 20 bp away from their recognition sites. The Type IIS restriction endonuclease, also called the ‘tagging enzyme’, cleaves the cDNA to release it from its bound bead.
  • Blunt ends are then created, so that neither the 3′ nor 5′ end has overhanging single-stranded sequences.
  • Once this is achieved, the cDNA tags bound to linker A and B are ligated to each other to create ditags.
  • These ditags have linker A on one end, linker B on the other and both transcript tags are adjacent to one another in the middle.
  • These ditags are then amplified by PCR, using primers that are complementary to sequence in either linker.
  • Once the ditags have been amplified, they are then cleaved using the anchoring enzyme again.
  • This has two effects: first, it releases the linkers from either end of the ditag, leaving only sequence from the two tags. Second, it creates sticky ends, or 3′ and 5′ ends that have overhanging, single-stranded DNA that can complementarily base pair with single-stranded DNA of another ditag.
  • In this way, all of the ditags generated are linked, or concatenated to produce one long string of tags. This collection of tags is then introduced into a vector to be cloned and sequenced.
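
The computational side of the final step, recovering tags from a sequenced concatemer and counting them, can be sketched as below. Several details are assumptions made for illustration: the anchoring-enzyme site is taken to be CATG, the tags are shortened to 5 bp so that the example stays readable, the second tag of each ditag is read as a reverse complement, and the function names and sequences are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_sage_tags(concatemer: str, anchor_site: str = "CATG", tag_length: int = 10):
    """Split a concatemer at the anchoring site and count the tags in each ditag.

    Toy assumption: no tag contains the anchoring site internally.
    """
    counts = Counter()
    for ditag in concatemer.upper().split(anchor_site):
        if len(ditag) < 2 * tag_length:
            continue  # skip empty pieces and incomplete ditags
        counts[ditag[:tag_length]] += 1  # tag read in the forward direction
        counts[reverse_complement(ditag[-tag_length:])] += 1  # tag on the other strand
    return counts


# Hypothetical concatemer holding two ditags separated by CATG sites.
concatemer = "CATG" + "AAGGTTCAGG" + "CATG" + "AAGGTACCTT" + "CATG"
for tag, count in count_sage_tags(concatemer, tag_length=5).most_common():
    print(tag, count)
```

The tag that appears most often corresponds to the most abundant transcript in this toy sample, which is the sense in which tag frequency indicates expression level.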

Microarray or Gene Chip

The DNA microarray is a high-throughput technique that allows rapid measurement and visualization of differential gene expression at the whole-genome scale. In a single microarray experiment, thousands of genes can be analysed and their expression can be measured quantitatively (Figure 11.5).

The major steps involved in this process:

  • Target preparation,
  • Oligonucleotide probe preparation,
  • Hybridization,
  • Slide scanning,
  • Data analysis and
  • Expression profile clustering.


Figure 11.5 DNA microarray


A microarray, gene chip or biochip is a slide carrying an array of immobilized high-density DNA oligomers (sometimes cDNAs) representing the entire genome of the species under study. Each oligomer is spotted on the slide and serves as a probe for binding to a unique cDNA. The entire cDNA population, labelled with fluorescent dyes, is allowed to hybridize with the oligo probes on the chip. The amount of fluorescent or radioactive label at each spot reflects the amount of the corresponding mRNA in the cell. Thus, the expression of genes across the entire genome can be examined, and genes involved in the same regulatory or metabolic pathways can be identified.

Target preparation

The cDNAs of the genes are first obtained by extracting total mRNA from tissues and cells. During the cDNA synthesis, fluorescent dyes can be incorporated in the DNA strands.

Oligonucleotide probe preparation

DNA microarrays are generated by fixing oligonucleotides onto a solid support such as a glass slide using a robotic device. These oligonucleotides, which are called probes, vary in length between 25 and 70 bp and represent thousands of pre-selected genes from an organism. The probes hybridize to labelled cDNA. The probes should be highly specific: they should not cross-hybridize and should not form stable internal secondary structures. Each oligonucleotide probe should be designed close to the 3′ end of the gene. All probes should have approximately the same melting temperature, and their GC content should be 45%–65%. OligoWiz (www.cbs.dtu.dk/services/OligoWiz), a Java program, and OligoArray (http://berry.engin.umich.edu/oligoarray2/), a Java client-server program, help to design oligonucleotide probes for microarray construction.
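
Two of the design criteria above, probe length and GC content, are easy to check programmatically, and a rough melting temperature can be estimated alongside them. The sketch below is not how OligoWiz or OligoArray work: the function names and the example probe are hypothetical, and the melting temperature uses a common GC-based approximation for oligonucleotides longer than about 13 nt, which ignores salt and probe concentration.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def melting_temperature(seq: str) -> float:
    """Rough Tm estimate in degrees Celsius (basic GC-based approximation)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def passes_basic_filters(seq: str, min_len: int = 25, max_len: int = 70,
                         min_gc: float = 45.0, max_gc: float = 65.0) -> bool:
    """Check only the length and GC-content criteria; specificity and
    secondary structure need dedicated tools and are not tested here."""
    return min_len <= len(seq) <= max_len and min_gc <= gc_content(seq) <= max_gc


# Hypothetical 30-mer candidate probe.
probe = "ACGTTGCAAGCTTGACCGTAGCTAGGCTAA"
print(len(probe), gc_content(probe), round(melting_temperature(probe), 1))
print(passes_basic_filters(probe))
```

Comparing the estimated melting temperatures of all candidate probes is one simple way to select a set that hybridizes under the same conditions.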


Hybridization

The cDNAs are made to hybridize with the oligo probes attached to the gene chip (glass slide). By labelling the cDNAs with different fluorescent dyes and allowing them to hybridize to the oligo probes on the chip, the expression patterns of many genes can be measured simultaneously. The most common type of microarray is the two-colour microarray, which involves labelling one set of cDNA (test) with one dye (Cy5, red fluorescence) and another set from a reference condition (control) with the other dye (Cy3, green fluorescence).

Slide scanning

The image of the hybridized array is captured using a laser scanner. The scanner scans every spot on the microarray. Two wavelengths of laser beams are used to excite the red and green fluorescent dyes, which thus produce the red and green fluorescence. A photomultiplier tube detects the fluorescence. Thus, for each spot on the microarray, red and green signals are recorded. The two fluorescent images from the scanner are then overlaid to create a composite image which indicates the relative expression of each gene. The colour intensity is a measure of the gene expression levels. For example, if a gene is expressed at a higher level in the experimental condition (red) than in the control (green), then the spot displays a reddish colour. On the other hand, if the test gene is expressed at a lower level than the control, then the spot appears green. If both the test and control genes are expressed in equal amounts, red and green colours appear equally and thus the spot appears yellow in colour.

Data analysis

‘Image processing’ helps to locate and quantitate the spots. It also discriminates true hybridization signals from background signals contributed by non-specific hybridization, an uneven slide surface and the presence of contaminants such as dust on the slide. Computer programs are used to correctly locate the spots and measure the intensities after subtracting the background pixels. The array signals are then converted into numbers and are reported as ratios between the two colours. This ratio is a measure of the gene expression changes in the experimental versus the control conditions.

Microarray scanners are normally provided with software programs to carry out microarray image analysis. There are also a number of free processing software programs available on the internet. For example, ArrayDB (www.genome.nhgri.nih.gov/arraydb/), ScanAlyze (http://rana.ibl.gov/EisenSoftware.html) and TIGR Spotfinder (http://www.tigr.org/softlab).

After image processing, the digitized gene expression is processed further. This processing is referred to as ‘data normalization’. This serves to correct the bias due to variations in microarray data collection rather than intrinsic biological differences. There are various ways to normalize the data. One way is to make an ‘intensity-ratio plot’ where the data is plotted horizontally (Figure 11.6). The log ratios of the Cy5/Cy3 are plotted against the average log intensities. In this way of representation, the data are distributed symmetrically about the horizontal axis. The differentially expressed genes can be visualized more easily.
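
The values behind an intensity-ratio plot can be computed as in the following sketch. The function names and spot intensities are hypothetical, the intensities are assumed to be positive after background subtraction, and median centring is used only as one simple example of a normalization, not as the method prescribed here.

```python
import math
from statistics import median


def intensity_ratio_values(cy5, cy3):
    """Return (average log2 intensity, log2 ratio Cy5/Cy3) for each spot."""
    points = []
    for red, green in zip(cy5, cy3):
        log_ratio = math.log2(red / green)
        mean_log_intensity = 0.5 * (math.log2(red) + math.log2(green))
        points.append((mean_log_intensity, log_ratio))
    return points


def median_center(points):
    """Shift the log ratios so that their median is zero (one simple normalization)."""
    shift = median(ratio for _, ratio in points)
    return [(intensity, ratio - shift) for intensity, ratio in points]


# Hypothetical background-subtracted intensities for four spots.
cy5 = [512, 1024, 2048, 256]   # test sample (red)
cy3 = [256, 512, 1024, 512]    # control sample (green)

raw = intensity_ratio_values(cy5, cy3)
print(raw)                 # log ratios are 1, 1, 1 and -1
print(median_center(raw))  # log ratios become 0, 0, 0 and -2
```

In this toy data set most spots share the same offset, so centring removes it and leaves only the fourth spot standing out as differentially expressed.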

A Windows program called Arrayplot (www.biologie.ens.fr/fr/geneticqu/puces/publications/arrayplot/index.html) helps in visualization, filtering and normalization of raw microarray data.

Expression profile clustering

Based on the computed distances between genes in an expression profile, genes with similar expression patterns can be grouped. This is referred to as clustering. Clustering analysis helps in the identification of co-expressed and co-regulated genes. Genes that are co-regulated usually have related functions. Thus, through gene clustering, the functions of previously uncharacterized genes may be identified. Clustering is of two types, namely:


Figure 11.6 Intensity-ratio plot


  1. Hierarchical clustering: It produces a tree-like structure that represents a hierarchy or relatedness of data groups. In the tree leaves, similar gene expression profiles are placed more closely together than dissimilar gene expression profiles. The branches of the tree illustrate the relationship between the related gene groups. A Windows program capable of hierarchical clustering is available at http://www.rana.ibl.gov/Eisensoftware.html.
  2. Partitioning clustering: An example of this type of clustering is ‘k-means clustering’. In this type, the data are divided into a pre-set number (k) of clusters in a single partition, without building a hierarchy.
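
Partitioning clustering can be illustrated with a bare-bones k-means sketch. The expression profiles and function names below are hypothetical, squared Euclidean distance is assumed as the distance measure, and the first k profiles are used as starting centroids purely to keep the example deterministic (practical implementations usually start from random or specially chosen centroids).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean_profile(profiles):
    """Element-wise mean of a list of profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def k_means(profiles, k, iterations=20):
    """Assign each profile to one of k clusters; returns a list of cluster labels."""
    centroids = [list(p) for p in profiles[:k]]  # deterministic seeding (assumption)
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in profiles]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = mean_profile(members)
    return labels


# Hypothetical log-ratio profiles of four genes across three conditions.
genes = ["geneA", "geneB", "geneC", "geneD"]
profiles = [[2.0, 2.1, 1.9], [-1.0, -1.2, -0.9], [1.8, 2.2, 2.0], [-1.1, -0.8, -1.0]]

for gene, label in zip(genes, k_means(profiles, k=2)):
    print(gene, label)
```

Genes that receive the same label have similar profiles and are candidates for co-regulation.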

Applications of Microarray Technology

Microarray technology is a powerful tool for gene profiling. It is widely used in many areas of fundamental and applied biological research.

  1. In gene expression analysis: Microarray has been widely used for the expression analysis of the genome. Thousands of gene expression patterns can be studied simultaneously.
  2. In mutation analysis: Point mutations/single nucleotide polymorphisms can be detected by microarrays by strictly regulating the hybridization conditions. Under these conditions, a single base mismatch prevents the target from hybridizing to the probe. Thus, the abnormalities present in the genome can be detected using a microarray. Genomic DNA (e.g., from tumours and normal tissues) is cleaved using a common restriction endonuclease such as DpnII or BglII. Adapter oligonucleotides are ligated onto the cleaved products and the DNA is subjected to PCR amplification. This sampling of the genome is a representation of the genetic configuration of an individual source (e.g., tumour) and can be compared to a similar representation from another source (e.g., normal). By quantifying the level of each PCR-amplified fragment using a microarray, deletions, insertions and alterations of the restriction sites can be determined.
  3. In pharmacogenomics: Pharmacogenomics is a science that combines medicine, pharmacology and genomics for developing drug therapies according to the genetic differences in patients, which are responsible for varied responses to a particular therapeutic regimen. By using microarrays, the individual genetic profile of a patient can be studied and suitable therapeutic regimens can be planned.
  4. In disease diagnosis: Microarray helps to identify the changes in the gene expression that might contribute to the development of a disease. Thus, the technology plays a role in disease diagnosis.

Proteomics

Proteomics is the study of the expression of genetic information at the protein level (proteome). It also deals with the assessment of the 3D structures of proteins and their interactions.

Though the RNA/cDNA microarray chips help in the study of expression levels of transcripts, it is necessary to understand that:

  • Not all mRNAs will be translated into the protein.
  • The level of transcription of specific protein-coding RNA may not always correspond to the level of expression and further to the activity of the coded protein due to many factors (mRNA, RNA splicing, post-translational protein modifications, etc.).

This necessitates the development of the branch of bioinformatics called proteomics, which involves the study of the protein complement of a genome (proteome).


Classification of Proteomics

Proteomics can be broadly classified into three types, namely:

  1. Expression proteomics,
  2. Structural proteomics and
  3. Functional proteomics.

Expression Proteomics

This is the quantitative characterization of protein expression at the whole proteome level. It involves the quantitative measurement of proteins in a cell at a particular metabolic state. Before proceeding for expression analysis, the expressed proteins in a proteome are determined. The proteins are separated, identified and quantified. The comparative profiling of proteins is often performed after separating the proteins by two-dimensional (2D) gel electrophoresis and identifying them by mass spectrometry (MS).

The various steps involved can be outlined as follows:

  • Protein separation on 2D gel electrophoresis,
  • Protease digestion,
  • Mass spectrometry and
  • Peptide identification.

Protein separation on 2D gel electrophoresis

The 2D protein gel electrophoresis is a method of protein separation that can resolve up to 10,000 proteins. First, proteins are separated according to their isoelectric point (the pH at which the net charge of the protein equals zero). The proteins are loaded onto a pH gradient and are made to migrate under the influence of an electric field. Each protein migrates through the gradient towards the anode or cathode until it reaches the pH equal to its isoelectric point, where its migration stops.
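
The definition of the isoelectric point can be made concrete with a small calculation that searches for the pH at which the estimated net charge is zero. This is a rough sketch: the pKa values are approximate figures of the kind found in standard tables (published scales differ), the function names and test sequences are hypothetical, and effects of protein folding and modifications are ignored.

```python
# Approximate pKa values (assumed for illustration; published scales differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Estimated net charge of a peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # N-terminal amino group
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # C-terminal carboxyl group
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 0.001) -> float:
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical peptides: a basic one and an acidic one.
for peptide in ("KKKK", "DDDD"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection works here because the net charge only decreases as the pH rises, so there is a single crossing point.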

The proteins are then separated by conventional polyacrylamide gel electrophoresis (PAGE). The electric current is now applied perpendicular to the original orientation of the electrodes. The proteins now migrate through the gel according to their size only. After 2D electrophoresis, the gel is visualized by suitable staining or labelling methods.

The resulting protein profile can be compared, for example, between experimental and control samples. The differentially expressed proteins are identified, cut out from the gel and subjected to subsequent analysis by MS.

Mass spectrometry

This method enables precise measurement of the molecular weight of a broad spectrum of substances. As the studied substance has to be intact in the gas phase, MS for protein analysis was made possible by the development of ‘soft’ ionization techniques such as matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI).

Protein identification is generally performed in two ways:

  1. The protein is digested by trypsin or another proteolytic enzyme into smaller peptides, and their precise molecular weights are measured using MS. The spectrum of molecular weights is then compared with theoretical spectra calculated from the protein sequences in available databases (using bioinformatics tools).
  2. Tandem MS allows a single peptide to be selected and then fragmented by collision with an inert gas. The fragmentation pattern gives either full or partial information about the peptide sequence, which is then used to search the databases.
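
The theoretical side of the first approach, digesting a protein sequence in silico and computing peptide masses, can be sketched as follows. The sketch assumes the usual trypsin rule (cleavage after K or R unless followed by P), no missed cleavages, no modifications and standard monoisotopic residue masses; the function names and the example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the free termini


def trypsin_digest(protein: str):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    protein = protein.upper()
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide.upper()) + WATER_MASS


# Hypothetical sequence: the K before P is not cleaved.
for peptide in trypsin_digest("AGKPLRMSK"):
    print(peptide, round(peptide_mass(peptide), 4))
```

The list of masses produced this way for every database protein is the ‘theoretical spectrum’ against which the measured masses are compared.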

MS also helps in the analysis of protein post-translational modifications, because it can localize a given modification within the protein and also determine the nature of the modification.

Peptide identification

Once the peptide mass fingerprints or peptide sequences are determined, bioinformatics programmes can be used to search for the identity of a protein in a database of theoretically digested proteins. For example, ExPASy (www.expasy.ch/tools/) is a proteomics web server with programs for searching peptide information in protein databases such as SWISS-PROT. Mascot (www.matrixscience.com/search_form_select.html) is another web server that identifies proteins based on peptide mass fingerprints and sequence entries.
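
The principle of searching a database of theoretically digested proteins can be sketched by counting how many observed peptide masses fall within a tolerance of each candidate's theoretical masses. This is a deliberately naive illustration: servers such as Mascot use probability-based scoring rather than a simple count, and the protein names, masses, tolerance and function names below are all hypothetical.

```python
def count_matches(observed, theoretical, tolerance=0.2):
    """Number of observed masses lying within `tolerance` Da of any theoretical mass."""
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed
    )


def rank_candidates(observed, database, tolerance=0.2):
    """Return (protein name, match count) pairs, best match first."""
    scores = {
        name: count_matches(observed, masses, tolerance)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical database of theoretical tryptic peptide masses (Da).
database = {
    "protein_A": [842.51, 1045.56, 1479.80, 2211.10],
    "protein_B": [905.45, 1045.56, 1638.87],
}
# Hypothetical masses measured from one gel spot.
observed = [842.50, 1479.79, 2211.12, 1045.60]

for name, score in rank_candidates(observed, database):
    print(name, score)
```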

Structural Proteomics

This involves the determination of the 3D structure of proteins. Structural proteomics identifies all the proteins within an organelle, determines their locations and characterizes their interactions.

Post-translational modifications play a very important role in proteome analysis. These modifications have a great impact on protein function by altering the size, hydrophobicity and overall conformation of the proteins. Further, the modifications can directly influence protein–protein interactions and the distribution of proteins to different subcellular locations. Various bioinformatics tools predict sites for post-translational modifications based on specific protein sequences. To minimize false positive results, a statistical learning method called the support vector machine can be used, which increases the specificity of such predictions. AutoMotif (http://automotif.bioinfo.pl/) is a web server for predicting protein sequence motifs.

Some of the other sites for online analysis of protein structures are:

  • TMpred: Prediction of Trans-membrane Regions and Orientation (http://www.ch.embnet.org/software/TMPRED_form.html).
  • TMHMM: Prediction of transmembrane helices in proteins
  • DAS: Transmembrane Prediction Server (http://www.sbc.su.se/~miklos/DAS/).
  • SPLIT: The Trans-membrane Protein Topology Prediction Server provides clear and colourful output including beta preference and modified hydrophobic moment index. (http://split.pmfst.hr/split/4/).
  • OCTOPUS: Predicts the correct topology for 94 percent of the dataset of 124 sequences with known structures. (http://octopus.cbr.su.se/).
  • SLEP (Surface Localization Extracellular Protein): For predicting the localization of bacterial proteins starting from genome sequences (http://bl210.caspur.it/slep/slep_newJob.php).
  • SignalP: Predicts the presence and location of signal peptide cleavage sites in Gram-positive, Gram-negative and eukaryotic proteins (http://www.cbs.dtu.dk/services/SignalP/).
  • pTARGET is a computational method to predict the subcellular localization of eukaryotic proteins only, from fungal and metazoan (animal) species. Predictions are carried out based on the occurrence patterns of protein functional domains and the amino acid compositional differences in proteins from different subcellular locations. This method can predict proteins targeted to nine distinct subcellular locations: cytoplasm, endoplasmic reticulum, extracellular/secreted, Golgi bodies, lysosomes, mitochondria, nucleus, peroxisomes and plasma membrane (http://bioapps.rit.albany.edu/pTARGET/).

Prediction of disulphide bridges

The disulphide bridge is a unique post-translational modification. Disulphide bonds are essential for maintaining the stability of proteins. Prediction of disulphide bonds may help to predict the 3D structure of proteins. For example,

DiANNA: a web server for disulphide connectivity prediction. The web server bc.edu/~clotelab/DiANNA/ outputs a disulphide connectivity prediction for an input protein sequence.

DBCP: a web server for disulphide bonding connectivity pattern prediction without the prior knowledge of the bonding state of cysteines.
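To give a feel for the search space a connectivity predictor faces, the sketch below enumerates every way of pairing a set of cysteines into disulphide bridges, under the simplifying assumption that all cysteines are bonded. For 2n cysteines there are (2n - 1)!! such patterns, that is 3 for four cysteines and 15 for six. This is not the method used by DiANNA or DBCP; the function name and the residue positions are hypothetical.

```python
def disulphide_patterns(cysteines):
    """Yield every complete pairing of the given cysteine positions.

    An odd number of cysteines cannot be fully paired and yields nothing.
    """
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pattern in disulphide_patterns(remaining):
            yield [(first, partner)] + pattern


positions = [5, 12, 30, 47]  # hypothetical cysteine positions
for pattern in disulphide_patterns(positions):
    print(pattern)
# → [(5, 12), (30, 47)]
# → [(5, 30), (12, 47)]
# → [(5, 47), (12, 30)]
```

A predictor has to pick the most plausible pattern out of this rapidly growing set, which is why scoring individual cysteine pairs from sequence features is useful.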

Prediction of protein-protein interactions

Proteins must interact with each other to carry out biochemical functions. Thus, the prediction of protein-protein interactions is an important aspect of proteomics. A number of computational approaches have been developed for the prediction of protein-protein interactions. These methods utilize the structural, genomic and biological contexts of proteins and genes in complete genomes to predict protein interaction networks and functional linkages between proteins. STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations, and they are derived from four sources: genomic context, high-throughput experiments, conserved co-expression and previous knowledge.


Figure 11.7 Phylogenetic profile


Protein-protein interactions can be predicted by different methods, namely:

  • Gene co-expression,
  • Gene cluster and gene neighbour,
  • Phylogenetic profile (Figure 11.7),
  • Rosetta stone,
  • Bayesian networks,
  • Sequence evolution and
  • Random decision forests.

Rosetta stone

This method is based on gene fusion events. If A and B exist as interacting domains in a fusion protein in one proteome, the gene encoding the protein is a fusion gene. Their homologous gene sequences A’ and B’, which exist separately in another genome, most likely encode proteins that interact to perform a common function. Conversely, if ancestral genes A and B encode interacting proteins, they may tend to be fused together in other genomes during evolution to enhance their functionality. This method of predicting protein-protein interactions is called the Rosetta stone method (Figure 11.8).
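The Rosetta stone logic can be sketched as a simple search. The sketch assumes that each protein has already been annotated with the set of domain families it contains (a hypothetical input, for example from a domain database). It reports pairs of separate proteins in one genome whose domains are found together in a single fusion protein of another genome. All names below are made up for illustration.

```python
from itertools import combinations


def rosetta_stone_pairs(separate_proteins, fusion_proteins):
    """Predict interacting pairs from domain fusion events.

    Both arguments map a protein name to the set of domain families
    it contains.
    """
    predictions = []
    for name_a, name_b in combinations(sorted(separate_proteins), 2):
        domains_a = separate_proteins[name_a]
        domains_b = separate_proteins[name_b]
        if not domains_a or not domains_b or domains_a & domains_b:
            continue  # skip unannotated proteins and pairs sharing a domain
        for fusion_name in sorted(fusion_proteins):
            fused = fusion_proteins[fusion_name]
            if domains_a <= fused and domains_b <= fused:
                predictions.append((name_a, name_b, fusion_name))
    return predictions


genome_1 = {"A_prime": {"domA"}, "B_prime": {"domB"}, "C_prime": {"domC"}}
genome_2 = {"AB_fusion": {"domA", "domB"}, "D": {"domD"}}
print(rosetta_stone_pairs(genome_1, genome_2))
# → [('A_prime', 'B_prime', 'AB_fusion')]
```

In practice the domain assignments come from sequence similarity searches, and very common domains are usually filtered out because they generate spurious fusion links.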

Gene cluster and gene neighbour

If a certain gene linkage is found to be conserved across divergent genomes, it is a strong indicator that the genes form an operon and that their products are functionally linked. This method of prediction is more reliable for prokaryotes; in eukaryotes, gene order is a less potent predictor of protein interactions (Figure 11.9).
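A conserved gene neighbourhood can be detected with a simple count. The sketch below assumes each genome is given as an ordered list of ortholog-group labels along a linear chromosome (hypothetical input) and reports the gene pairs that are adjacent in at least a chosen number of genomes. The function name, the labels and the threshold are assumptions for illustration.

```python
from collections import Counter


def conserved_neighbours(genomes, min_genomes=2):
    """Return gene pairs that are adjacent in at least `min_genomes` genomes."""
    counts = Counter()
    for gene_order in genomes.values():
        # A set, so that each pair is counted once per genome.
        adjacent = {
            frozenset(pair)
            for pair in zip(gene_order, gene_order[1:])
            if pair[0] != pair[1]
        }
        counts.update(adjacent)
    return sorted(
        tuple(sorted(pair)) for pair, n in counts.items() if n >= min_genomes
    )


genomes = {
    "genome_1": ["a", "b", "c", "d"],
    "genome_2": ["c", "b", "a", "x"],
    "genome_3": ["d", "a", "b"],
}
print(conserved_neighbours(genomes, min_genomes=3))
# → [('a', 'b')]
```

Only the pair a and b stays adjacent in all three genomes, so it is the only candidate for a functional link at this threshold.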


Figure 11.8 Rosetta stone


Figure 11.9 Gene cluster


Functional Proteomics

Functional proteomics is an emerging research area that focuses on identifying the biological functions of unknown proteins and on defining cellular mechanisms at the molecular level. Understanding protein functions and unraveling molecular mechanisms within the cell both depend on the identification of interacting protein partners. The association of an unknown protein with partners belonging to a specific protein complex involved in a particular mechanism is strongly suggestive of its biological function. Such protein–protein interaction studies also detail the cellular signaling pathways. Because functional proteomics can help define prognosis and predict pathologic complete response in patients, it is often referred to as ‘clinical proteomics’. A variety of MS-based approaches allow the characterization of cellular protein assemblies under near-physiological conditions and the subsequent assignment of individual proteins to specific molecular machines, pathways and networks.

Protein microarrays (protein chips)

These are similar to DNA microarrays and allow a large number of proteins to be analysed at once. A protein chip can contain an entire immobilized proteome. Unlike DNA microarrays, protein chips are not used to bind and quantitate complementary molecules but to study protein function.

Protein arrays are solid-phase ligand-binding assay systems using immobilized proteins on surfaces which include glass, membranes, microtiter wells, mass spectrometer plates and beads or other particles. They are rapid, automatable, highly sensitive and economical, and they give an abundance of data from a single experiment (Figure 11.10).

Types of protein arrays

Protein arrays are of three types, namely:

  1. Large-scale functional chips (target protein arrays): These are constructed by immobilizing large numbers of purified proteins. This type of protein array is used to assay biochemical functions such as protein–protein, protein–DNA and protein–small molecule interactions and enzyme activity, and to detect antibodies and their specificity.
  2. Analytical capture arrays: These contain affinity reagents, primarily antibodies. They are used to detect and quantitate analytes in plasma/serum or tissue extracts.
  3. Lysate (reverse protein) arrays: In this type of array, complex samples, such as tissue lysates, are coated on the surface and target proteins are then detected with antibodies overlaid on the coated surface.


Figure 11.10 Manufacturing of protein microarrays

Protein sources

Sources of proteins for the construction of arrays include cell-based expression systems for recombinant proteins, proteins purified from natural sources, proteins produced in vitro by cell-free translation systems, and peptides prepared by synthetic methods. Many of these methods can be automated for high-throughput production.

Solid surfaces

Protein arrays are essentially miniature versions of familiar immunoassay methods such as ELISA and dot blotting. They use fluorescent readout, robotics and high-throughput detection systems, which enables multiple assays to be carried out in parallel. The commonly used physical supports for protein arrays include glass slides, silicon, microwells, nitrocellulose membranes, magnetic beads and microbeads. Micro-drops of protein are delivered onto planar surfaces.

Protein immobilization

A good protein array support surface should have the following features:

  • It should be chemically stable before and after the coupling procedures.
  • It should allow good spot morphology.
  • It should display minimal non-specific binding.
  • It should not contribute a background in detection systems.
  • It must be compatible with different detection systems.
  • The immobilization method used should be reproducible.
  • It should be applicable to proteins of different properties (size, hydrophilicity and hydrophobicity).
  • It should be amenable to high throughput and automation.
  • It should be compatible with the retention of fully functional protein activity.

Proteins are immobilized either covalently or non-covalently. Examples include diffusion into porous surfaces (which allows non-covalent binding of unmodified protein within a hydrogel structure), passive adsorption to surfaces, covalent coupling to chemically activated surfaces, and specific affinity binding through tags such as biotin/avidin on the protein.

Protein chips provide a solid support for assaying enzyme activity, protein-protein interactions, protein-DNA/RNA interactions, protein-ligand interactions, etc. Antibodies can be fixed on a solid support for assaying thousands of proteins simultaneously.



Protein arrays find application in the following areas:

  1. Diagnostics: Detection of antigens and antibodies in blood samples; profiling of sera to discover new disease markers; environment and food monitoring. Applications also extend to autoimmunity, allergy and cancer.
  2. Proteomics: Protein expression profiling.
  3. Protein functional analysis: Protein-protein interactions; ligand-binding properties of receptors; enzyme activities.
  4. Antibody characterization: Cross reactivity and specificity, epitope mapping.

Proteomics Tools

A variety of tools are available online for studying different aspects of proteins, such as:

  1. Protein identification and characterization

    FindMod: Predict potential protein post-translational modifications and potential single amino acid substitutions in peptides.

  2. DNA → Protein

    Translate: Translates a nucleotide sequence to a protein sequence

  3. Post-translational modification prediction

    ChloroP: Prediction of chloroplast transit peptides

    LipoP: Prediction of lipoproteins and signal peptides in Gram-negative bacteria

    MITOPROT: Prediction of mitochondrial targeting sequences

  4. Topology prediction

    NetNES: Leucine-rich nuclear export signals (NES) in eukaryotic proteins

    PSORT: Prediction of protein sub-cellular localization

  5. Primary structure analysis

    ProtParam: Physico-chemical parameters of a protein sequence (amino acid and atomic compositions, isoelectric point, extinction coefficient, etc.)

  6. Secondary structure prediction

    AGADIR: An algorithm to predict the helical content of peptides

    APSSP: Advanced Protein Secondary Structure Prediction Server

  7. Tertiary structure prediction-Homology modelling

    SWISS-MODEL: An automated knowledge-based protein modelling server.

  8. Molecular modelling and visualization tools

    Swiss-PdbViewer: A programme to display, analyse and superimpose protein 3D structures

    SwissDock: Docking of small ligands into protein active sites with EADock DSS

    SwissParam: Topology and parameters for small molecules

  9. Prediction of disordered regions

    DisEMBL: Protein disorder prediction

  10. Alignment analysis

    AMAS: Analyse Multiply Aligned Sequences

  11. Phylogenetic analysis

    BIONJ: Server for NJ phylogenetic analysis

  12. Biological text analysis

    AcroMed: A computer-generated database of biomedical acronyms and the associated long forms extracted from the recent Medline abstracts

  13. Statistical tools

    pROC: A package to visualize, smooth and compare receiver operating characteristic (ROC) curves


Proteomics and Drug Discovery

Drug discovery is a process that uses a variety of tools from diverse fields such as genomics and proteomics. The structural and functional analysis of proteins expressed in cells and/or tissues helps in the identification of therapeutically applicable proteins for various diseases. Thus, pharmaco-proteomic-based drug development for protein therapies is a currently developing field. Proteomics facilitates the detection and quantification of thousands of proteins from complex biological samples in a single analysis. Comparing the data between the healthy and diseased state, in the presence and absence of a drug, or between responders and non-responders to a drug enables qualitative and quantitative assessment of the changes that occur. Thus, proteomic studies will be critical for developing the most effective diagnostic techniques and disease treatments in the future.
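The comparison between two states can be illustrated with a log2 fold-change calculation, a common first step in quantitative proteomics. The sketch below is only a minimal example under stated assumptions: the protein names and abundance values are invented, and the two-fold threshold (a log2 change of 1) is an arbitrary illustrative choice rather than a recommended cut-off.

```python
import math


def differential_proteins(healthy, diseased, min_log2_change=1.0):
    """Classify proteins as up, down or unchanged in the diseased state.

    `healthy` and `diseased` map protein names to abundance values;
    only proteins quantified in both states are compared.
    """
    result = {"up": [], "down": [], "unchanged": []}
    for protein in sorted(healthy.keys() & diseased.keys()):
        h, d = healthy[protein], diseased[protein]
        if h <= 0 or d <= 0:
            continue  # a log ratio is undefined for missing signal
        change = math.log2(d / h)
        if change >= min_log2_change:
            result["up"].append(protein)
        elif change <= -min_log2_change:
            result["down"].append(protein)
        else:
            result["unchanged"].append(protein)
    return result


# Hypothetical abundances (arbitrary units), for illustration only.
healthy = {"P1": 100.0, "P2": 80.0, "P3": 50.0}
diseased = {"P1": 400.0, "P2": 82.0, "P3": 12.5}
print(differential_proteins(healthy, diseased))
# → {'up': ['P1'], 'down': ['P3'], 'unchanged': ['P2']}
```

A real study would use replicate measurements and a statistical test before calling a protein differentially expressed.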

Proteomics and New Drug Discovery

New drugs for the treatment of disease can be identified by proteomic studies. For example, if a protein is implicated in a disease, the 3D structure of that protein provides the information for a computer program that designs drugs to interfere with the action of the protein. A molecule that fits the active site of an enzyme, but cannot be released by the enzyme, will inactivate the enzyme. This is the basis of new drug-discovery tools, which aim to find new drugs to inactivate proteins involved in disease. Pharmacogenetics will use these same techniques to develop personalized drugs.

Virtual ligand screening is a computer technique that attempts to fit millions of small molecules to the 3D structure of a protein. The computer rates the quality of the fit to various sites in the protein, with the aim of finding molecules that either enhance or disable the function of the protein, depending on its role in the cell. An example of this is the identification of new drugs to target and inactivate the HIV-1 protease. The HIV-1 protease is an enzyme that cleaves a very large HIV protein into smaller, functional proteins. The virus cannot survive without this enzyme; therefore, it is one of the most effective protein targets for killing HIV.
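The rate-and-rank loop at the heart of virtual screening can be sketched with a deliberately crude scoring function. The sketch assumes each ligand is already placed in the binding pocket as a list of atom coordinates (in Å), rewards ligand–pocket atom pairs at contact distance, and penalizes steric clashes. The cut-offs, the penalty, the coordinates and the function names are all illustrative assumptions. Real docking programs also search over ligand poses and use far more detailed energy functions.

```python
import math


def fit_score(ligand_atoms, pocket_atoms, clash=3.0, contact=4.5,
              clash_penalty=3.0):
    """Toy fit score: +1 per contact pair, minus a penalty per clash."""
    score = 0.0
    for ligand_atom in ligand_atoms:
        for pocket_atom in pocket_atoms:
            distance = math.dist(ligand_atom, pocket_atom)
            if distance < clash:
                score -= clash_penalty
            elif distance <= contact:
                score += 1.0
    return score


def screen(library, pocket_atoms):
    """Return (ligand name, score) pairs, best fit first."""
    scored = [(name, fit_score(atoms, pocket_atoms))
              for name, atoms in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical pocket and ligand coordinates, for illustration only.
pocket = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
library = {
    "good_fit": [(2.0, 3.0, 0.0)],   # contacts both pocket atoms
    "too_far": [(20.0, 0.0, 0.0)],   # no contacts at all
    "clashing": [(0.5, 0.0, 0.0)],   # overlaps a pocket atom
}
print(screen(library, pocket))
# → [('good_fit', 2.0), ('too_far', 0.0), ('clashing', -2.0)]
```

The top-ranked molecules from such a screen are only candidates; their binding still has to be confirmed experimentally.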

Proteomics and Diseases


Scientists have demonstrated the use of proteome analysis in studying insulin-dependent diabetes mellitus. In rat models of the disease, the pancreatic islets were challenged by cytokines, which strongly regulated protein expression (up or down). These islets were grafted into diabetes-prone mice, re-isolated, labelled and analysed by 2D gel electrophoresis. The selected proteins were identified by MALDI-MS. Such studies have been extended to human pancreatic islet cells. Though the genes involved in humans are different, the protein pathways involved are the same and play an important role in the genesis of diabetes.

Infection biology

Proteomics serves as a tool for the study of infection biology. Several immunologically relevant proteins have been detected by comparing virulent strains with attenuated strains, studying secreted proteins, outer-surface proteins and analysing the immuno-proteome. Several vaccine candidates for Mycobacterium tuberculosis and antigens for Helicobacter pylori have been identified by proteomic approaches.

Toxicology and drug-safety assessment

Proteomics has been applied to toxicology and drug-safety assessment. Various studies have been conducted to study toxicity mechanisms and to assess the safety of new drugs by comparison with fingerprints of reference compounds with known toxicity. The molecular fingerprint of a drug is its gene-regulation pattern in response to the perturbations evoked by drug action, and it is visualized by gene-expression profiling at the mRNA or protein level.

Biomarkers and proteomic studies

Biomarkers are indicators of a biological process. They may be genes, proteins, small molecules or metabolites. Biomarkers play a vital role in the drug development process. Mass spectrometry-based proteomic technologies are best suited for the discovery of protein biomarkers in the absence of any prior knowledge of quantitative changes in protein levels. Biomarkers have the potential to speed up the process of drug development, as they may provide indications of drug action at earlier stages than clinical endpoints. When successfully applied, biomarker analysis can reduce the length and cost of clinical trials.

Proteomics and Drug Designing

Proteins are functional molecules in cells and are the major targets for drug action. To design a drug rationally, we must first find out which proteins can serve as drug targets in pathogenesis. Proteomics helps in the identification of protein targets and biochemical pathways involved in disease processes. Thus, proteomics plays an important role in the multi-step drug-development process. The steps of this process include target identification and validation, lead selection, small-molecule screening and optimization, and toxicity testing (Figure 11.11). Various sub-disciplines of proteomics such as computational proteomics, chemical proteomics, structural proteomics and topological proteomics offer significant contributions, especially in ‘computer-aided drug design’.

Structure-based drug designing

In structure-based drug design, the 3D structure of a drug target interacting with small molecules is used to guide drug discovery. Structural information is obtained by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

Because many proteins undergo considerable conformational change upon ligand binding, it is important to design drugs based on the crystallographic structures of protein-ligand complexes rather than on the unliganded structure. Crystallography has been successfully used in the de novo design of drugs; the steps involved in drug designing may be summarized as follows:


Figure 11.11 Drug designing


  1. The protein of interest is cloned, expressed and purified. The protein is then crystallized in the presence of a ligand, which can be a non-hydrolysable substrate or can come from a biochemical or a cell-based screen. Ligands can also be low-affinity compound fragments or scaffolds. The scaffolds are generally a collection of basic chemical building blocks, each with a molecular weight of less than 200 Da. If the screen identifies several promising ligands with unique scaffolds, the structure of the drug target should be determined with as many of the ligands as possible.
  2. Once one or more liganded structures have been determined and refined, the bound ligands are analysed structurally and the potential sites for drug docking are identified. The ligand is then modified to give greater hydrophobic, hydrogen-bonding and electrostatic complementarity to the molecular target for effective drug docking. Many of these modifications can be proposed from previous knowledge or derived by computer modelling. Numerous commercial and academic computer programs are available to aid in the analysis and design of new ligands. However, it is important to note that computational methods alone are not sufficient and the binding mode must be confirmed experimentally.
  3. After the ligands have been designed, they are synthesized chemically. Around 5 to 10 compounds closer to the proposed ligand structure are synthesized to obtain structure–activity relationship (SAR) data. The synthesized compounds are then purified and are tested in a relevant biochemical or cell-based assay to determine whether or not the design was successful.

Drug design software and tools

Various drug designing software and tools are available online. To mention a few:

  • Sanjeevini: It is a complete drug design software suite.
  • Binding Affinity Prediction of Protein-Ligand Server (BAPPL): It computes the binding free energy of a protein–ligand complex.
  • Drug-DNA Interaction Energy (PreDDICTA): It calculates the Drug–DNA interaction energy.
  • ParDOCK (Automated Server for Rigid Docking): It predicts the binding mode of the ligand in the receptor target site.
  • Molecular Volume Calculator: It calculates the volume of a molecule.
  • DNA Sequence to Structure: It generates double helical secondary structure of DNA using conformational parameters taken from experimental fibre-diffraction studies.
  • Hydrogen Addition to Protein: It adds the hydrogen co-ordinates to the X-ray crystal structures of protein.
  • DNA Ligand Docking: It predicts the binding mode of the ligand in the minor groove of DNA.
  • RASPD for Preliminary Screening of Drugs: This tool is useful for preliminary screening of ligand molecules based on physico-chemical properties of the ligand and the active site of the protein. This tool predicts binding energy of drug/target at a preliminary stage.
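Several of the tools above report a binding free energy. To put such numbers in context, the standard thermodynamic relation ΔG° = RT ln(Kd) links the binding free energy to the dissociation constant Kd of the complex, expressed relative to a 1 M standard state. The sketch below applies this relation. It is not the scoring method of BAPPL, RASPD or any other server listed; the function name and the example Kd values are assumptions for illustration.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol*K)


def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from Kd given in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT * temperature_k * math.log(kd_molar)


for label, kd in [("1 uM binder", 1e-6), ("1 nM binder", 1e-9)]:
    print(f"{label}: {binding_free_energy(kd):.1f} kcal/mol")
```

A nanomolar binder comes out at roughly -12.3 kcal/mol and a micromolar binder at roughly -8.2 kcal/mol at 25 °C, so a thousand-fold gain in affinity corresponds to about 4 kcal/mol.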

The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and the European Bioinformatics Institute (http://www.ebi.ac.uk/services/index.html) websites in particular provide access to basic tools and services that are used in laboratories every day. These tools include nucleotide and protein database searching tools, genome maps, structural databases and pattern recognition tools.

Summary

  • Computational biology includes all biological areas that involve computation.
  • Bioinformatics is the branch of research science that involves the development of computational tools and databases for a better understanding of living organisms. Bioinformatics is limited to sequence, structural and functional analyses of genes, genomes and their products.
  • Genomics is the study of the genome of an organism involving the simultaneous analysis of a large number of genes using automated data gathering tools.
  • Proteomics is the study of the proteome. Proteome refers to the entire set of proteins that are expressed in a cell.
  • ESTs are short sequences obtained from cDNA clones and help in the identification of full-length genes.
  • SAGE is another high-throughput, sequence-based approach for gene expression analysis and is more quantitative in determining mRNA expression in a cell.
  • DNA microarray is a high-throughput technique that allows for rapid measurement and visualization of differential expression of genes at the whole genome scale.
  • A microarray, gene chip or biochip is a slide carrying an array of immobilized high-density DNA oligomers (sometimes cDNAs) representing the entire genome of the species under study.
  • Protein arrays are solid-phase ligand-binding assay systems using immobilized proteins on surfaces which include glass, membranes, microtiter wells, mass spectrometer plates and beads or other particles that enable the analysis of a large number of proteins at a time.
  • Proteins are functional molecules in cells and are the major targets for drug action. Proteomics as a whole increasingly plays an important role in the multi-step drug-development process.
  • Computational proteomics, chemical proteomics, structural proteomics and topological proteomics offer significant contributions especially in computer-aided drug design.

  1. Define the terms bioinformatics, genomics and proteomics.

  2. How is genomics classified?

  3. What is meant by genome sequencing?

  4. Explain the protocol for shotgun sequencing.

  5. How is genome sequencing assembly performed?

  6. Briefly describe the scope and significance of proteomics and genomics.

  7. Describe the importance of expressed sequence tags.

  8. Explain in detail about SAGE.

  9. List the steps involved in generating a microchip.

  10. Bring out the biotechnological application of microarray technology.

  11. Define structural proteomics.

  12. Describe a few online tools for protein analysis.

  13. List the applications of proteomics.

  14. What is the role of proteomics in curing diseases? Support your answer with suitable examples.

  15. What is the scope and significance of drug designing? Name a few drug designing tools.

  1. Which of the following is a web-based program that helps to detect contaminating bacterial vector sequences?

    1. TIGR
    2. VecScreen
    3. ARACHNE
    4. Phred
  2. Which of the following online tools is used for topology prediction?

    1. AGADIR
    2. SwissDock
    3. PSORT
    4. FindMod
  3. ———predicts the correct topology for 94% of the dataset of 124 sequences with known structures.

    1. SLEP
    2. SignalP
    3. DAS
    4. OCTOPUS
  4. Rosetta stone is a method used for studying———.

    1. protein-protein interactions
    2. protein-DNA interactions
    3. protein folding
    4. prediction of disulphide bridges
  5. Which of the following is not a drug designing software?

    1. RASPD
    2. PreDDICTA
    3. BAPPL
    4. BIONJ
  6. Expressed sequence tags (ESTs) are short sequences obtained from———clones.

    1. mRNA
    2. cDNA
    3. rRNA
    4. tRNA
