Building Protein List using BeautifulSoup and BioPython
Programming 2006/12/27 20:32
1. BeautifulSoup
- Parsing an HTML document is incredibly easy.
from BeautifulSoup import BeautifulSoup
import urllib2
doc = urllib2.urlopen('http://www.google.com').read()
soup = BeautifulSoup(doc)
- Navigating through the parsed tree is simple.
soup.contents[0].name
# u'html'
soup.head.next.name
# u'title'
soup.head.next.nextSibling.name
# u'body'
- Finding a tag or a set of tags is powerful.
import re
soup.findAll('div', id='sectionName')
soup.findAll('div', {'class': 'singleprotein'})
soup.find('p', align='center')
soup.find('p', align=re.compile('^b.*'))
- You don't need to worry about modifying the parsed tree: when a subtree is removed, the sibling links of the remaining tags are updated for you.
soup = BeautifulSoup('<a1></a1><a><b>Amazing content<c><d></a><a2></a2>')
soup.a1.nextSibling
# <a><b>Amazing content<c><d></d></c></b></a>
soup.a2.previousSibling
# <a><b>Amazing content<c><d></d></c></b></a>
subtree = soup.a
subtree.extract()
print soup
# <a1></a1><a2></a2>
soup.a1.nextSibling
# <a2></a2>
soup.a2.previousSibling
# <a1></a1>
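The snippets above use BeautifulSoup 3 on Python 2. As a rough sketch only, the same ideas (parse, find by attribute, match with a regular expression, extract a subtree) might look like this with the newer `bs4` package on Python 3, assuming it is installed. The sample HTML, the `note` class and the `p1`/`p2` ids are made up for illustration.

```python
import re

from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# Made-up page fragment, only for illustration.
html = """
<html><head><title>Proteins</title></head>
<body>
<div class="singleprotein" id="p1"><p align="center">protein A</p></div>
<div class="singleprotein" id="p2"><p align="bottom">protein B</p></div>
<div class="note">not a protein</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks down the tree.
title = soup.head.title.string

# Finding: find_all() is the bs4 spelling of findAll().
proteins = soup.find_all("div", {"class": "singleprotein"})
ids = [div["id"] for div in proteins]
centered = soup.find("p", align="center")
b_aligned = soup.find("p", align=re.compile("^b"))

# Modifying: extract() removes a subtree and returns it;
# sibling lookups on the remaining tags reflect the change.
second = soup.find("div", id="p2")
sibling_before = second.find_next_sibling("div")  # the "note" div
note = sibling_before.extract()
sibling_after = second.find_next_sibling("div")   # nothing left after p2
```

After `extract()`, `note` still holds the removed subtree, so it can be inspected or re-inserted elsewhere.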
2. BioPython
- BioPython is a collection of Python libraries for biology.
- Searching biological databases and parsing the results is a piece of cake.
from Bio import db
from Bio.PDB import PDBParser
db
# db, exporting 'embl', 'embl-dbfetch-cgi', 'embl-ebi-cgi', 'embl-fast', 'embl-xembl-cgi', 'embl-xml', 'fasta', 'fasta-sequence-eutils', 'genbank-nucleotide', 'genbank-protein', 'interpro', 'interpro-ebi-cgi', 'medline', 'medline-eutils', 'nucleotide-genbank-eutils', 'pdb', 'pdb-ebi-cgi', 'pdb-rcsb-cgi', 'prodoc', 'prodoc-expasy-cgi', 'prosite', 'prosite-expasy-cgi', 'protein-genbank-eutils', 'swissprot', 'swissprot-expasy-cgi', 'swissprot-usmirror-cgi'
f = db['pdb']['1kdx']
p = PDBParser()
s = p.get_structure('1kdx', f)
s[0]
# <model id=0>
s[0].child_list
# [<Chain id=A>, <Chain id=B>]
len(s[0].child_list[0].child_list)
# 81 (the first chain has 81 residues)
3. Building Protein List
- The objective is to build a table of membrane proteins so we can get some insight into which ones to choose.
- The protein list from http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html is used.
- Because that list doesn't provide size or image information, I downloaded each PDB file to count its residues and downloaded the images from the OPM website.
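The residue counts mentioned above came from BioPython's parser. As a dependency-free approximation, a sketch like the following counts distinct residues per chain straight from the fixed-column ATOM records of a PDB file. The names `count_residues` and `make_atom` are hypothetical, the sample records are synthetic, and HETATM records (waters, ligands) are ignored, so the numbers may differ from what `child_list` reports.

```python
def count_residues(pdb_lines):
    """Return {chain_id: number of distinct residues} for the first model.

    Uses the fixed PDB columns: chain ID in column 22, residue sequence
    number in columns 23-26, insertion code in column 27 (1-based).
    """
    residues_by_chain = {}
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # stop after the first model of a multi-model file
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER, ...
        chain_id = line[21:22]
        residue_key = (line[22:26], line[26:27])  # (resSeq, iCode)
        residues_by_chain.setdefault(chain_id, set()).add(residue_key)
    return {chain: len(keys) for chain, keys in residues_by_chain.items()}


def make_atom(serial, atom, res_name, chain, res_seq):
    """Build a truncated, synthetic ATOM record with correct column positions."""
    return "ATOM  %5d %-4s %3s %s%4d" % (serial, atom, res_name, chain, res_seq)


# Synthetic records: two residues in chain A, one residue in chain B.
sample = [
    make_atom(1, "N", "MET", "A", 1),
    make_atom(2, "CA", "MET", "A", 1),
    make_atom(3, "N", "LYS", "A", 2),
    make_atom(4, "N", "GLY", "B", 1),
]
counts = count_residues(sample)
```

For a downloaded file, something like `count_residues(open(path))` should work the same way, since the function only iterates over lines.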
4. Protein List
- First, I went through the original protein list, fetched the PDB files and image files, and then built the whole list.
- protein_list.py, protein_list.html, protein_list.pdf
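The actual script is protein_list.py. Purely as an illustration of the last step, assembling the collected values into an HTML table could be sketched as below, using only the standard library. The function name `build_table`, the column names and the sample rows are all assumptions, not taken from the original script.

```python
from html import escape


def build_table(rows, columns):
    """Render a list of dicts as a plain HTML table, one <tr> per row."""
    header = "".join("<th>%s</th>" % escape(col) for col in columns)
    body = []
    for row in rows:
        cells = "".join(
            "<td>%s</td>" % escape(str(row.get(col, ""))) for col in columns
        )
        body.append("<tr>%s</tr>" % cells)
    return "<table>\n<tr>%s</tr>\n%s\n</table>" % (header, "\n".join(body))


# Made-up rows; real values would come from the fetched PDB and image files.
proteins = [
    {"pdb_id": "demo1", "residues": 120, "image": "demo1.png"},
    {"pdb_id": "demo2", "residues": 340, "image": "demo2.png"},
]
table = build_table(proteins, ["pdb_id", "residues", "image"])
```

Escaping every cell keeps stray `<` or `&` characters in protein names from breaking the generated page.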
5. Inspected List
- Then I removed the rows that don't meet my needs.
- inspect.py, inspected_list.html, inspected_list.pdf
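The criteria used in inspect.py are not stated here. As a hedged example of the idea, the sketch below splits rows on a hypothetical residue-count limit; `split_rows`, `max_residues` and the sample data are assumptions made for illustration.

```python
def split_rows(rows, max_residues):
    """Split rows into (kept, dropped) by a residue-count limit."""
    kept, dropped = [], []
    for row in rows:
        if row["residues"] <= max_residues:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


# Made-up rows and a made-up threshold.
rows = [
    {"pdb_id": "demo1", "residues": 120},
    {"pdb_id": "demo2", "residues": 340},
    {"pdb_id": "demo3", "residues": 95},
]
kept, dropped = split_rows(rows, max_residues=200)
```

Returning the dropped rows as well makes it easy to review what was filtered out before regenerating the list.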