Exploring a Python Script for Web Scraping & RDF Querying
In today’s data-driven world, the ability to collect, analyze, and interpret data is invaluable. Python, a versatile and powerful programming language, is often the tool of choice for data professionals. In this article, we’ll delve into a Python script that combines web scraping and data analysis to fetch and analyze information about meetings and their attendees. We’ll also examine the technologies and concepts underpinning the script.
The Code
The Python script [1] provided in this article employs several libraries and techniques to achieve its goals. Let’s break down its key components and functionality.
Libraries Used
The script utilizes the following Python libraries:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import numpy as np
import SPARQLWrapper as sp
import json
import matplotlib.pyplot as plt
Functionality
The primary goal of the script is to fetch data from web pages related to meetings and compare it with data from a database. Here’s a step-by-step breakdown of its functionality.
Fetching Web Pages:
def getHtmlPageFromLink(link):
    response = urlopen(link)
    soup = bs(response, 'html.parser')
    return soup
Data Extraction:
def getTagsFromProperty(html):
    Tags0 = html.find_all(property="besluit:heeftAanwezigeBijStart")
    # Additional code for Tags1 and Tags2
    Tags = list(set(Tags0 + Tags1 + Tags2))
    return Tags

def getTextFromInsideTag(tag):
    Persoonsnaam = tag.getText()
    # Additional code for name parsing and formatting
    return Persoonsnaam

def getTextFromList(listOfTags):
    mensen = []
    AantalAanwezige = len(listOfTags)
    if(AantalAanwezige >= 1):
        for Tag in listOfTags:
            TagText = getTextFromInsideTag(Tag)
            mensen.append(TagText)
    else:
        pass
    return mensen
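Put together, these helpers boil down to selecting tags by their RDFa property attribute and collecting their text. A minimal, self-contained sketch of that flow, where the HTML fragment and the names are invented stand-ins for a real meeting page:

```python
from bs4 import BeautifulSoup as bs

# Illustrative HTML fragment standing in for a real meeting page
html_snippet = """
<div>
  <span property="besluit:heeftAanwezigeBijStart">Jan Peeters</span>
  <span property="besluit:heeftAanwezigeBijStart">An De Vos</span>
</div>
"""

soup = bs(html_snippet, 'html.parser')

# Select every element carrying the attendance property, then extract its text
tags = soup.find_all(property="besluit:heeftAanwezigeBijStart")
names = [tag.getText().strip() for tag in tags]
print(names)  # ['Jan Peeters', 'An De Vos']
```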
Data Comparison:
def Comparing(NestedLijstVanNamenVanHtml, NestedLijstVanNamenVanDatabase, Everything):
    for sitting in list(Everything.keys()):
        temp = []
        if(len(NestedLijstVanNamenVanHtml[sitting]) != len(NestedLijstVanNamenVanDatabase[sitting])):
            temp = [x for x in NestedLijstVanNamenVanHtml[sitting] if x not in NestedLijstVanNamenVanDatabase[sitting]]
        # Additional code for finding missing attendees
        ontbrekendePersonen = getDictOfMandaFromListName(temp, sitting, Everything)
        Everything[sitting][2] = ontbrekendePersonen
    return Everything
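The core of this comparison is a list difference between the names scraped from the page and the names returned by the database. A simplified, self-contained sketch of that step, with a made-up sitting URL and name lists:

```python
# Hypothetical per-sitting name lists, keyed by a meeting URL
names_from_html = {"https://example.org/zitting/1": ["Jan Peeters", "An De Vos", "Piet Claes"]}
names_from_db = {"https://example.org/zitting/1": ["Jan Peeters", "An De Vos"]}

missing = {}
for sitting in names_from_html:
    # Only compare when the attendee counts differ, as in the original script
    if len(names_from_html[sitting]) != len(names_from_db[sitting]):
        missing[sitting] = [x for x in names_from_html[sitting]
                            if x not in names_from_db[sitting]]

print(missing)  # {'https://example.org/zitting/1': ['Piet Claes']}
```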
Querying RDF Data:
def GetAllAanwezigeQuery(zitting):
    # SPARQL query to retrieve attendees
    sparql.setQuery("""
    PREFIX ... # SPARQL query goes here
    """)
    result = sparql.query().convert()
    # Additional code for processing query results
    return AlleAanweizgen
Efficiency Analysis and Visualization:
def AnalyzeTiming(x, y):
    # Analysis code to measure execution time
    # Code for generating line plots using Matplotlib
    return 0
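The body of AnalyzeTiming is elided above, but the idea, timing each operation and plotting the durations, can be sketched with the standard time module. The helper below is a hypothetical illustration, not the script’s actual implementation; the Matplotlib call is shown commented out so the sketch runs without a display:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Example: time a trivial stand-in for a page fetch
result, elapsed = timed(sum, range(1000))
print(result)        # 499500
print(elapsed >= 0)  # True

# A list of such timings could then be plotted as a line chart:
# import matplotlib.pyplot as plt
# plt.plot(timings); plt.show()
```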
Example Output
Let’s take a look at an example of the script’s output. It returns a dictionary structure containing information about meetings, including URLs to notulen (minutes), URLs to associated governing bodies, and a dictionary of attendees with their names and corresponding URLs.
# Example output structure
{
    "Meeting URL": ["Notulen URL", "Governing Body URL", {
        "Attendee Name": "Attendee URL",
        # Additional attendees...
    }],
    # Additional meeting entries...
}
Use Cases
The script’s functionality can be valuable in various scenarios:
- Attendance Tracking: It can be used to track and verify the attendance of individuals at meetings or events.
- Data Analysis: The script can be extended to perform deeper data analysis, such as identifying trends in meeting attendance over time.
- Integration with RDF Data: By using SPARQL queries, the script can integrate with RDF data sources, making it suitable for working with linked data and semantic web technologies.
Now, let’s take a deeper dive into the technologies used in the script.
Web Scraping
HTML Parsing
Web pages are structured using HTML (Hypertext Markup Language). The script leverages the BeautifulSoup library to parse HTML documents. Parsing involves breaking down the HTML document into a structured tree-like format, known as the Document Object Model (DOM).
from bs4 import BeautifulSoup as bs
# Parse an HTML document
soup = bs(html_content, 'html.parser')
Selecting Elements
To interact with specific elements within the DOM, BeautifulSoup provides methods for element selection. For instance, we can find all elements with a particular HTML tag or retrieve elements with specific attributes.
# Find all <a> tags in the HTML
links = soup.find_all('a')
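Beyond selecting by tag name, find_all also filters on attribute values, which is how the script isolates tags by their RDFa property. The fragment below is an illustrative example, not taken from the scraped pages:

```python
from bs4 import BeautifulSoup as bs

html_snippet = '<p><a href="/a">A</a><a class="ext" href="/b">B</a><a>no href</a></p>'
soup = bs(html_snippet, 'html.parser')

# Only <a> tags that actually carry an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)  # ['/a', '/b']

# Filter on any attribute; class needs the class_ keyword in BeautifulSoup
external = [a.getText() for a in soup.find_all('a', class_='ext')]
print(external)  # ['B']
```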
Data Extraction
Once we’ve selected the desired elements, data can be extracted from them. This involves accessing the element’s attributes or text content. Functions like getTextFromInsideTag and getTextFromList are used in the script to extract text data from HTML elements.
# Extract text content from an HTML element
text = element.getText()
Navigating Web Pages
Web scraping often involves navigating through multiple web pages by following links or interacting with forms. The script utilizes the urlopen function from the urllib.request module to open web pages, fetch their content, and then parse them with BeautifulSoup.
from urllib.request import urlopen
# Open a web page and fetch its content
response = urlopen(link)
html_content = response.read()
RDF Data Querying
RDF Data Model
RDF (Resource Description Framework) represents data as a graph, with data structured as triples (subject, predicate, object). Triples describe relationships between resources.
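As a mental model, a tiny RDF graph can be represented in Python as a set of (subject, predicate, object) tuples. The URIs below are invented for illustration, reusing the besluit:heeftAanwezigeBijStart predicate the script scrapes for:

```python
# A toy graph: each triple links a subject to an object via a predicate
graph = {
    ("ex:zitting1", "besluit:heeftAanwezigeBijStart", "ex:persoon1"),
    ("ex:persoon1", "foaf:name", "Jan Peeters"),
}

# "Querying" the toy graph: who was present at the start of ex:zitting1?
present = [o for (s, p, o) in graph
           if s == "ex:zitting1" and p == "besluit:heeftAanwezigeBijStart"]
print(present)  # ['ex:persoon1']
```

A SPARQL engine does essentially this pattern matching, but over graphs with millions of triples and with joins across patterns.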
SPARQL Query Language
SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF data. It allows us to query RDF graphs to retrieve specific information. SPARQL queries are designed to match patterns in RDF data and return results in a structured format.
SELECT ?subject ?predicate ?object
WHERE {
    ?subject ?predicate ?object .
}
Query Execution
SPARQL queries are executed against RDF data stores or endpoints. In the script, the SPARQLWrapper library is used to send SPARQL queries to a specific RDF endpoint and retrieve results. The results are typically returned in a structured format like JSON.
import SPARQLWrapper as sp

sparql = sp.SPARQLWrapper("https://example.com/sparql_endpoint")
sparql.setReturnFormat(sp.JSON)  # request JSON results instead of the default XML
sparql.setQuery("SELECT ?subject ?predicate ?object WHERE { ?subject ?predicate ?object . }")
result = sparql.query().convert()
Author: Ibrahim El Kaddouri
References
- [1] MandaatChecker, GitHub repository. Accessed: 2022-08-20.