Building Protein List using BeautifulSoup and BioPython

Programming 2006/12/27 20:32
1. BeautifulSoup[footnote 1]
 - Parsing an HTML document is incredibly easy.
from BeautifulSoup import BeautifulSoup
import urllib2
doc = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(doc)
 - Navigating the parsed tree is simple.
soup.contents[0].name
# u'html'
soup.head.next.name
# u'title'
soup.head.next.nextSibling.name
# u'body'
 - Finding a single tag or several tags is powerful.
import re
soup.findAll('div', id='sectionName')
soup.findAll('div', {'class': 'singleprotein'})
soup.find('p', align='center')
soup.find('p', align=re.compile('^b.*'))
 - You can modify the parsed tree without worrying about it: the sibling links are updated for you.
soup = BeautifulSoup('<a1></a1><a><b>Amazing content<c><d></a><a2></a2>')
soup.a1.nextSibling
# <a><b>Amazing content<c><d></d></c></b></a>
soup.a2.previousSibling
# <a><b>Amazing content<c><d></d></c></b></a>

subtree = soup.a
subtree.extract()

print soup
# <a1></a1><a2></a2>
soup.a1.nextSibling
# <a2></a2>
soup.a2.previousSibling
# <a1></a1>
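
For anyone on Python 3, the same kind of search and removal can be sketched with bs4, the successor package to the BeautifulSoup module imported above. This is only a minimal sketch: the HTML snippet and the second protein ID are made up for illustration, and the variable names are my own.

```python
import re

from bs4 import BeautifulSoup

# Made-up page fragment, only here so the searches have something to match.
html = """
<html><head><title>Proteins</title></head>
<body>
  <div id="sectionName">Channels</div>
  <div class="singleprotein">1KDX</div>
  <div class="singleprotein">2ABC</div>
  <p align="center">centered</p>
  <p align="bottom">bottom</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all is the bs4 spelling of findAll.
sections = soup.find_all("div", id="sectionName")
proteins = soup.find_all("div", {"class": "singleprotein"})
protein_ids = [div.get_text() for div in proteins]

# find returns the first match; attribute values may be regular expressions.
centered = soup.find("p", align="center")
b_aligned = soup.find("p", align=re.compile("^b"))

# extract() detaches a tag from the tree and returns it.
removed = centered.extract()
remaining_paragraphs = [p.get_text() for p in soup.find_all("p")]

print(protein_ids)
print(remaining_paragraphs)
```

Passing "html.parser" keeps the sketch on the parser that ships with Python, so nothing beyond bs4 itself has to be installed.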

2. BioPython[footnote 2]
 - BioPython is a collection of Python libraries for biology.
 - Searching biological databases and parsing the results is a piece of cake.
from Bio import db
from Bio.PDB import PDBParser
db
# db, exporting 'embl', 'embl-dbfetch-cgi', 'embl-ebi-cgi', 'embl-fast', 'embl-xembl-cgi', 'embl-xml', 'fasta', 'fasta-sequence-eutils', 'genbank-nucleotide', 'genbank-protein', 'interpro', 'interpro-ebi-cgi', 'medline', 'medline-eutils', 'nucleotide-genbank-eutils', 'pdb', 'pdb-ebi-cgi', 'pdb-rcsb-cgi', 'prodoc', 'prodoc-expasy-cgi', 'prosite', 'prosite-expasy-cgi', 'protein-genbank-eutils', 'swissprot', 'swissprot-expasy-cgi', 'swissprot-usmirror-cgi'

f = db['pdb']['1kdx']
p = PDBParser()
s = p.get_structure('1kdx', f)
s[0]
# <model id=0>
s[0].child_list
# [<Chain id=A>, <Chain id=B>]
len(s[0].child_list[0].child_list)
# 81 (81 residues)
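
The residue count can also be approximated without BioPython. The sketch below is a stand-in technique, not what Bio.PDB does internally: it reads the fixed-width ATOM records of PDB-format text (chain ID in column 22, residue number in columns 23-26, insertion code in column 27) and counts distinct residues per chain in the first model. The sample records are invented and count_residues is a hypothetical helper name. HETATM records are skipped here, so the numbers may differ from len(chain.child_list), which also includes hetero residues such as water.

```python
# Invented records, truncated after the residue number for brevity.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1
ATOM      2  CA  MET A   1
ATOM      3  N   GLY A   2
ATOM      4  CA  GLY A   2
ATOM      5  N   SER B   1
ATOM      6  CA  SER B   1
HETATM    7  O   HOH A 101
END
"""


def count_residues(pdb_text):
    """Count distinct residues per chain from ATOM records of the first model."""
    seen = {}
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # stop after the first model, like s[0] above
        if not line.startswith("ATOM") or len(line) < 26:
            continue  # skip HETATM, headers and malformed lines
        chain_id = line[21]        # column 22
        residue_key = line[22:27]  # columns 23-27: number + insertion code
        seen.setdefault(chain_id, set()).add(residue_key)
    return {chain_id: len(keys) for chain_id, keys in seen.items()}


counts = count_residues(SAMPLE_PDB)
print(counts)
```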
3. Building Protein List
 - The objective is to build a table of membrane proteins so we can get some insight into which ones to choose.
 - The protein list from http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html is used.
 - It doesn't provide size or image information, so I downloaded the PDB files to count how many residues each protein has and downloaded images from the OPM[footnote 3] website.

4. Protein List
 - First, I went through the original protein list, fetched the PDB files and image files, and then built the whole list.
 - protein_list.py, protein_list.html, protein_list.pdf
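
The last step, turning the collected records into an HTML list, might look roughly like this. It is not the attached protein_list.py, just a minimal Python 3 sketch: the column names, the placeholder rows and the render_table helper are all assumptions for illustration.

```python
from html import escape

# Assumed column layout: name, PDB ID, residue count, image file.
COLUMNS = ["name", "pdb_id", "residues", "image"]

# Placeholder rows; real rows would come from the fetched PDB and image files.
rows = [
    {"name": "Hypothetical channel <A>", "pdb_id": "xxx1",
     "residues": 120, "image": "xxx1.gif"},
    {"name": "Hypothetical pump B", "pdb_id": "xxx2",
     "residues": 450, "image": "xxx2.gif"},
]


def render_table(rows, columns):
    """Render a list of dicts as a plain HTML table, one row per dict."""
    header = "".join("<th>%s</th>" % escape(col) for col in columns)
    lines = ["<table>", "<tr>%s</tr>" % header]
    for row in rows:
        cells = "".join(
            "<td>%s</td>" % escape(str(row[col])) for col in columns
        )
        lines.append("<tr>%s</tr>" % cells)
    lines.append("</table>")
    return "\n".join(lines)


table_html = render_table(rows, COLUMNS)
print(table_html)
```

Escaping every cell keeps characters such as < or & in protein names from breaking the generated page.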

5. Inspected List
 - Then I removed the rows that didn't meet my needs.
 - inspect.py, inspected_list.html, inspected_list.pdf
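
The inspection step boils down to filtering rows. The post doesn't spell out the actual criteria, so the size cutoff and the image check below are assumptions, picked only because size and image are the two columns added to the list. inspect_rows is a hypothetical name, and this Python 3 sketch is not the attached inspect.py.

```python
# Placeholder rows in the same assumed layout as the full list.
rows = [
    {"name": "Hypothetical channel A", "pdb_id": "xxx1",
     "residues": 120, "image": "xxx1.gif"},
    {"name": "Hypothetical pump B", "pdb_id": "xxx2",
     "residues": 450, "image": "xxx2.gif"},
    {"name": "Hypothetical porin C", "pdb_id": "xxx3",
     "residues": 200, "image": None},
]


def inspect_rows(rows, max_residues, require_image=True):
    """Keep rows at or below the size cutoff and, optionally, with an image."""
    kept = []
    for row in rows:
        if row["residues"] > max_residues:
            continue  # too large (assumed criterion)
        if require_image and not row.get("image"):
            continue  # no image available (assumed criterion)
        kept.append(row)
    return kept


inspected = inspect_rows(rows, max_residues=300)
print([row["pdb_id"] for row in inspected])
```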

  1. http://www.crummy.com/software/BeautifulSoup/
  2. http://biopython.org/wiki/Main_Page
  3. http://opm.phar.umich.edu
