Python: Query reference sequences using rpy2 and BSgenome

Yao Yao on August 18, 2017

Usually this job is done in R, as deescribed in Using R + Bioconductor to Get Flanking Sequence Given Genomic Coordinates. An R sample could be this simple:

# 'BSgenome.Hsapiens.UCSC.hg19' requires 'BSgenome'
# Then 'BSgenome' requires 'Biostrings'
#   `getSeq` is actually a 'Biostrings' function

# S4 method for signature 'BSgenome'
#   getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=FALSE)
# See
one_seq <- getSeq(Hsapiens, "chr1", 10001, 10005, NA, "+", TRUE)
two_seq <- getSeq(Hsapiens, c("chr1", "chr1"), c(10001, 20001), c(10005, 20005), c(NA, NA), c("+", "+"), TRUE)

Suppose you have all related R libraries installed (well, you can install bioconductor with rpy2 if you like). If you want to do this in python for various reasons, it is feasible although things may be a little bit complicated:

import rpy2.rinterface as rinterface
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

bs_genome = importr('BSgenome.Hsapiens.UCSC.hg19')
# You need to import 'Biostrings' explicitly
# Loading 'BSgenome.Hsapiens.UCSC.hg19' won't load 'Biostrings' automatically
bio_strings = importr('Biostrings')

# Note that:
#   - `getSeq` is actually a 'Biostrings' function
#   - `as.character` is not a legal parameter name in python
#   - `NA` in R, `rinterface.NA_Integer` in `rpy2`
#   - `robjects.XxxVector` accepts tuples, not lists
#   - `getSeq` returns a list even if there is only one element
one_seq = bio_strings.getSeq(bs_genome.Hsapiens, "chr1", 10001, 10005, rinterface.NA_Integer, "+", True) 
two_seq = bio_strings.getSeq(bs_genome.Hsapiens,
                             robjects.StrVector(("chr1", "chr1")),
                             robjects.IntVector((10001, 20001)),
                             robjects.IntVector((10005, 20005)),
                             robjects.IntVector((rinterface.NA_Integer, rinterface.NA_Integer)),
                             robjects.StrVector(("+", "+")),

print(one_seq[0])  # TAACC
print(two_seq[0])  # GGTTA
print(two_seq[1])  # CCTGG

blog comments powered by Disqus