Category Archives: Machine Learning

A new R package for network-based biomarker discovery released

A new R package, netClass, has been released. netClass integrates network information, such as a protein-protein interaction network or KEGG, into mRNA classification, and it can also combine miRNA with mRNA data through a miRNA-mRNA interaction network for biomarker discovery. We call this method stSVM, and it has already been published in PLoS ONE (Cun et al. 2013). Apart from stSVM, we also implemented the following methods in netClass:

  1. AEP (average gene expression of pathway), Guo et al., BMC Bioinformatics 2005, 6:58.
  2. PAC (pathway activity classification), Lee E, et al., PLoS Comput Biol 4(11): e1000217.
  3. hubc (hub nodes classification), Taylor et al. (2009) Nat. Biotech.: doi: 10.1038/nbt.152
  4. frSVM (filter via top ranked genes), Cun et al. arXiv:1212.3214; Winter et al., PLoS Comput Biol 8(5): e1002511.
  5. stSVM (network smoothed t-statistic), Cun et al., PLoS ONE.

netClass can be downloaded from SourceForge (http://sourceforge.net/projects/netclassr/) or CRAN (http://cran.r-project.org/web/packages/netClass/). For more details on netClass, you can refer to these four papers:

Use prior information for prognostic biomarker discovery or not?

In our recent publication in BMC Bioinformatics, we compared a large number of feature selection methods for finding prognostic biomarkers in 6 breast cancer gene expression datasets. No method showed a significant advantage in prediction accuracy, feature selection stability, or biological interpretability. This goes against previous research results: current network-based approaches did not show much benefit in our analysis. Meanwhile, a group from NKI also reported similar results in PLoS ONE. The R code for the algorithms in our paper is available on request.

Prediction performance in terms of area under ROC curve (AUC)


Computational courses in PLoS Computational Biology

Short introductory papers on different areas of computational biology.

Fran Lewitter, Welcome to PLoS Computational Biology “Education”

Kenzie D MacIsaac, Ernest Fraenkel, Practical Strategies for Discovering Regulatory DNA Sequence Motifs, April 2006

Duncan Brown, Kimmen Sjölander, Functional Classification Using Phylogenomic Inference, June 2006

Philip E Bourne, Johanna McEntyre, Biocurators: Contributors to the World of Science, October 2006

FrSVM: A filter ranking feature selection algorithm

We use a simple filter feature selection algorithm, called FrSVM, which selects the top-ranked genes in a PPI network and then trains an L2-SVM on these top-ranked genes. FrSVM integrates protein-protein interaction (PPI) network information into the feature/gene selection algorithm for prognostic biomarker discovery.

As the L2-SVM cannot perform feature selection by itself, the ranking of genes is used as the feature selection step. Central genes often play an important role in biological processes, so we use GeneRank to select central genes with large differences in their expression.
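The ranking step can be illustrated with a minimal sketch of GeneRank followed by a quantile filter. This is not the code from FrSVM.R: the toy network, the expression values, the function name `gene_rank` and the 0.5 quantile cut-off are all hypothetical and only serve to show the idea. GeneRank is written here as the linear system r = (1 - d) ex + d W D^-1 r, where W is the adjacency matrix and D the diagonal matrix of node degrees.

```r
# Hypothetical toy inputs: a 4-gene undirected network and per-gene expression changes
genes <- c("g1", "g2", "g3", "g4")
W <- matrix(0, 4, 4, dimnames = list(genes, genes))
W["g1", "g2"] <- W["g2", "g1"] <- 1
W["g2", "g3"] <- W["g3", "g2"] <- 1
W["g2", "g4"] <- W["g4", "g2"] <- 1
ex <- c(g1 = 2.0, g2 = 0.5, g3 = 1.0, g4 = 0.1)  # e.g. absolute fold changes

# GeneRank: solve (I - d * W %*% D^-1) r = (1 - d) * ex
gene_rank <- function(W, ex, d = 0.5) {
  ex <- abs(ex) / max(abs(ex))     # scale expression evidence to [0, 1]
  deg <- colSums(W)
  deg[deg == 0] <- 1               # isolated genes: avoid division by zero
  A <- diag(nrow(W)) - d * W %*% diag(1 / deg)
  r <- solve(A, (1 - d) * ex)
  setNames(as.vector(r), rownames(W))
}

r <- gene_rank(W, ex, d = 0.5)

# Filter step (assumed cut-off): keep genes ranked at or above the median
top_genes <- names(r)[r >= quantile(r, 0.5)]
```

The selected columns of the expression matrix (here `top_genes`) would then be passed to the SVM. With d = 0 the ranking reduces to the expression evidence alone, and larger d gives more weight to the network.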

We applied FrSVM to several cancer datasets, where it showed significantly better prediction performance and higher signature stability. The related manuscript has been posted to arXiv, and the R code for FrSVM is available at:

Code: https://sites.google.com/site/yupengcun/software/frsvm

Paper: http://arxiv.org/abs/1212.3214

Any comments and questions on FrSVM are welcome. The following shows how to run the program:


1. Getting the gene expression profiles (GEP) and the PPI network

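##############################################
# Getting the GEP
#----------------------------------------------
library(GEOquery)
library(limma)
# getGEO() returns a list of ExpressionSets for a GSE accession; take the first one
a <- getGEO("GSExxxxx", destdir = "/home/YOURPATH/")[[1]]
## Normalize the GEP with limma (samples end up in rows after transposing)
x <- t(normalizeBetweenArrays(exprs(a), method = "quantile"))
## Define your class labels, y, as a factor (placeholder: one of two labels per row of x)
y <- factor("Two Class")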

 

##############################################
# Mapping probeset IDs to Entrez IDs
# (taking the hgu133a platform as an example)
#----------------------------------------------
library("hgu133a.db")
mapped.probes <- mappedkeys(hgu133aENTREZID)
refseq <- as.list(hgu133aENTREZID[mapped.probes])
times <- sapply(refseq, length)
mapping <- data.frame(probesetID = rep(names(refseq), times = times), graphID = unlist(refseq), row.names = NULL, stringsAsFactors = FALSE)
mapping <- unique(mapping)

##############################################
# Summarize probesets to genes of x with limma
# ad.ppi: adjacency matrix of the PPI network
#----------------------------------------------
library(limma)  # provides avereps()
Gsub <- ad.ppi
mapping <- mapping[mapping[, "probesetID"] %in% colnames(x), ]
int <- intersect(rownames(Gsub), mapping[, "graphID"])

# Lookup vector probeset ID -> Entrez ID. map2entrez was not defined in the
# original listing; building it from `mapping` is an assumption.
map2entrez <- setNames(mapping$graphID, mapping$probesetID)

index <- intersect(mapping[, "probesetID"], colnames(x))
xn.m <- x[, index]
colnames(xn.m) <- map2entrez[index]
# Average all probesets that map to the same Entrez ID
ex.sum <- t(avereps(t(xn.m), ID = map2entrez[index]))

int <- intersect(int, colnames(ex.sum))
ex.sum <- ex.sum[, int]         ## GEP matched to the PPI network
Gsub <- Gsub[int, int]          ## PPI network matched to the GEP


2. Running the FrSVM program

##################################################
# You need to install the following packages to run the FrSVM.R program:
#    library(ROCR)
#    library(Matrix)
#    library(kernlab)
#
## If you want to run in parallel, you also need to load:
#    library(multicore)
#
## Here is an example of a 5 times repeated 10-fold cross-validation
source("../FrSVM.R")
res <- frSVM.cv(x = ex.sum, y = y, folds = 10, Gsub = Gsub, repeats = 5, parallel = FALSE, cores = 2, DEBUG = TRUE, d = 0.5, top.uper = 0.95, top.lower = 0.9)
## The AUC values for the 5 x 10-fold CV
AUC <- res$auc

 

Current approaches to finding biomarkers by means of machine learning

Finding robust biomarkers in genomics data is a first step towards personalized medicine. Here we take a short look at how machine learning is used to find biomarkers and at the current approaches in this area. For more details, please see the following paper.

Biomarker Gene Signature Discovery Integrating Network Knowledge

Bonn-Aachen International Center for IT (B-IT), Dahlmannstr. 2, 53113 Bonn, Germany
Abstract: Discovery of prognostic and diagnostic biomarker gene signatures for diseases, such as cancer, is seen as a major step towards a better personalized medicine. During the last decade various methods, mainly coming from the machine learning or statistical domain, have been proposed for that purpose. However, one important obstacle for making gene signatures a standard tool in clinical diagnosis is the typical low reproducibility of these signatures combined with the difficulty to achieve a clear biological interpretation. For that purpose in the last years there has been a growing interest in approaches that try to integrate information from molecular interaction networks. Here we review the current state of research in this field by giving an overview about so-far proposed approaches.

When Machine learning meets molecular evolution

A recent paper, Schwarz et al. 2010, used a kernel method to reconstruct phylogenetic trees, a task that is usually done by maximum likelihood estimation. They use finite-state transducers (FST) to create an alignment-free kernel for the evolutionary comparison of molecular sequences, and they call it a rational kernel approach. Their method avoids the problem of gaps in aligned sequences. As we know, gaps can influence the accuracy of a phylogenetic tree.

Kernel methods have proved to be a powerful tool for classification, and their method does help with classification in the twilight zone of sequence similarity (see the following picture). The result of their paper is a new and accurate way of determining evolutionary distances in the twilight zone of sequence alignments that is suitable for large datasets of homologous sequences.

Methods for phylogenetic/phylogenomic reconstruction are still a challenging problem in evolutionary biology. Schwarz et al.'s paper only evaluates misclassification; maybe in the future we will see kernel methods for estimating divergence time, effective population size, recombination rate and mutation rate in natural populations.
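To give a feeling for how an alignment-free kernel yields evolutionary distances, here is a minimal sketch in R. It does not implement the FST-based rational kernel of Schwarz et al.; it swaps in a much simpler k-mer spectrum kernel, and the sequences, function names and choice of k = 3 are hypothetical. The distance is the usual kernel-induced one, d(a, b)^2 = K(a, a) - 2 K(a, b) + K(b, b).

```r
# Count all overlapping k-mers of a sequence (assumes nchar(s) >= k)
kmer_counts <- function(s, k = 3) {
  starts <- seq_len(nchar(s) - k + 1)
  table(substring(s, starts, starts + k - 1))
}

# Spectrum kernel: inner product of the two k-mer count vectors
spectrum_kernel <- function(a, b, k = 3) {
  ca <- kmer_counts(a, k)
  cb <- kmer_counts(b, k)
  shared <- intersect(names(ca), names(cb))
  sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
}

# Kernel-induced distance between two sequences
kernel_distance <- function(a, b, k = 3) {
  sqrt(spectrum_kernel(a, a, k) - 2 * spectrum_kernel(a, b, k) + spectrum_kernel(b, b, k))
}

# Hypothetical toy sequences: s1 and s2 differ in one base, s3 is unrelated
seqs <- c(s1 = "ACGTACGTGA", s2 = "ACGTACGTGC", s3 = "TTTTGGGGCC")
idx <- seq_along(seqs)
D <- outer(idx, idx, Vectorize(function(i, j) kernel_distance(seqs[i], seqs[j])))
dimnames(D) <- list(names(seqs), names(seqs))

# A simple distance-based tree (average linkage) from the kernel distances
tree <- hclust(as.dist(D), method = "average")
```

No alignment is computed at any point, so gaps never enter the distance. Any distance-based tree method could replace the average-linkage clustering used here.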

(Phylogenetic trees of the Chlorophyceae, reconstructed by FST distance using the full kernel score (left), by F84 distance estimation on a Muscle alignment (top right), and by maximum likelihood on the same Muscle alignment (bottom right).)


A Book on Statistical Learning

I just read a book on statistical learning, The Elements of Statistical Learning (2nd ed.). The importance of this book needs no praise from me. The authors have kindly made an e-print of the book freely available online, and they have set up a website with supplementary material.

Here is their website: http://www-stat.stanford.edu/~tibs/ElemStatLearn/ . I hope you can find the beauty of statistical learning.
