
How To Select Div By Text Content Using Beautiful Soup?

Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc. Imagine everyone takes 3-5 classes. One of them is always Biolog

Solution 1:

(1) To get only the Biology grade, it is almost a one-liner.

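import bs4, re
# html holds the page source; naming the parser avoids a warning
soup = bs4.BeautifulSoup(html, 'html.parser')
# match text nodes containing "Biology" (string= replaces the older text= argument)
scores_string = soup.find_all(string=re.compile('Biology'))
# the grade is the last whitespace-separated token
scores = [score_string.split()[-1] for score_string in scores_string]
print(scores_string)
print(scores)

The output looks like this:

['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']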
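Since the snippet above relies on an html variable from the question, here is a minimal self-contained sketch of the same idea. The sample markup is hypothetical (the real page may be structured differently), and it assumes a Beautiful Soup version that accepts the string= argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
html = """
<div class="student">
  <div class="score">Math C</div>
  <div class="score">Biology A+</div>
</div>
<div class="student">
  <div class="score">Biology B</div>
  <div class="score">History A</div>
  <div class="score">Art B-</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# With only string= given, find_all returns the matching text nodes.
matches = soup.find_all(string=re.compile(r"Biology"))

# The grade is the last whitespace-separated token of each matched string.
grades = [str(m).split()[-1] for m in matches]
print(grades)
```

Because the search is by text rather than by position, it does not matter whether the Biology entry is the first, second, or third div in each block.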

(2) If you need the tags themselves for further work, take the parent of each matched string:

import bs4, re
soup = bs4.BeautifulSoup(html, 'html.parser')
scores = soup.find_all(string=re.compile('Biology'))
# each match is a text node; .parent is the tag that encloses it
divs = [score.parent for score in scores]
print(divs)

Output looks like this:

[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]

*In conclusion, you can use find_next_siblings, find_previous_siblings, parent, and similar methods and attributes to move around the HTML tree.*

The Beautiful Soup documentation has more information about how to navigate the tree. Good luck with your work.
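As a rough illustration of moving around the tree from a matched string, the sketch below walks up to the enclosing block and sideways to sibling divs. The markup, including the name div and the student class, is an assumption made for the example and does not come from the original question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each student block has a name followed by score divs.
html = """
<div class="student">
  <div class="name">Alice</div>
  <div class="score">Math C</div>
  <div class="score">Biology A+</div>
</div>
<div class="student">
  <div class="name">Bob</div>
  <div class="score">Biology B</div>
  <div class="score">History A</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

biology_by_student = {}
other_scores = {}
for text_node in soup.find_all(string=re.compile(r"Biology")):
    score_div = text_node.parent        # the <div class="score"> holding the text
    student_div = score_div.parent      # the enclosing <div class="student">
    name = student_div.find("div", class_="name").get_text(strip=True)
    biology_by_student[name] = str(text_node).split()[-1]

    # Sibling navigation: the other score divs in the same student block.
    before = score_div.find_previous_siblings("div", class_="score")
    after = score_div.find_next_siblings("div", class_="score")
    other_scores[name] = [d.get_text(strip=True) for d in list(before) + list(after)]

print(biology_by_student)
print(other_scores)
```

Going up with parent and then back down with find keeps the lookup independent of where the Biology div sits within its block.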

Solution 2:

Another way, using a CSS selector, is:

divs = soup.select('div:contains("Biology")')

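EDIT:

This requires BeautifulSoup4 4.7.0+, which uses SoupSieve for select(). Newer SoupSieve releases (2.1+) deprecate :contains in favor of :-soup-contains.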
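One thing to watch with the selector approach is that the text match also considers descendant text, so wrapper divs can match too. The sketch below shows this on hypothetical markup; it assumes SoupSieve 2.1 or newer, where the pseudo-class is spelled :-soup-contains.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with a wrapper div around the score divs.
html = """
<div class="student">
  <div class="score">Math C</div>
  <div class="score">Biology A+</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches any div whose text, including descendants' text, contains "Biology",
# so the outer student div is selected along with the score div.
broad = soup.select('div:-soup-contains("Biology")')

# Adding the class to the selector keeps only the score div itself.
narrow = soup.select('div.score:-soup-contains("Biology")')

print(len(broad), len(narrow))
print([d.get_text(strip=True) for d in narrow])
```

If the score divs had no distinguishing class, the related :-soup-contains-own pseudo-class, which looks only at an element's own text, might be a better fit.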

Solution 3:

You can extract them by searching for every <div> element whose class attribute is score, then using a regular expression to pull out the Biology grade:

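from bs4 import BeautifulSoup
import sys
import re

# parse the file named on the command line with the built-in parser
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for div in soup.find_all('div', attrs={'class': 'score'}):
    # get_text() always returns a str; div.string is None when the div has more than one child
    t = re.search(r'Biology\s+(\S+)', div.get_text())
    if t:
        print(t.group(1))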

Run it like:

python3 script.py htmlfile

That yields:

A+
B
B
B
B
