Web Scraper With RDF Querying

Exploring a Python Script for Web Scraping & RDF Querying

In today’s data-driven world, the ability to collect, analyze, and interpret data is invaluable. Python, a versatile and powerful programming language, is often the tool of choice for data professionals. In this article, we’ll delve into a Python script that combines data analysis and web scraping to fetch and analyze information related to meetings and attendees. We’ll also take a deeper dive into the technologies and concepts underpinning the script.

The Code

The Python script [1] provided in this article employs several libraries and techniques to achieve its goals. Let’s break down its key components and functionality.

Libraries Used

The script utilizes the following Python libraries:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import numpy as np
import SPARQLWrapper as sp
import json
import matplotlib.pyplot as plt

Functionality

The primary goal of the script is to fetch data from web pages related to meetings and compare it with data from a database. Here’s a step-by-step breakdown of its functionality.

Fetching Web Pages:


def getHtmlPageFromLink(link):
    # Fetch the page at the given URL and parse it into a BeautifulSoup tree
    response = urlopen(link)
    soup = bs(response, 'html.parser')
    return soup

Data Extraction:


def getTagsFromProperty(html):
    # Collect all elements annotated with the relevant RDFa properties
    Tags0 = html.find_all(property="besluit:heeftAanwezigeBijStart")
    # Additional code for Tags1 and Tags2
    Tags = list(set(Tags0 + Tags1 + Tags2))  # merge and deduplicate
    return Tags

def getTextFromInsideTag(tag):
    Persoonsnaam = tag.getText()
    # Additional code for name parsing and formatting
    return Persoonsnaam

def getTextFromList(listOfTags):
    # Extract the text of every tag in the list; an empty input yields []
    mensen = []
    for Tag in listOfTags:
        mensen.append(getTextFromInsideTag(Tag))
    return mensen
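
Chained together, these helpers turn a meeting page into a list of attendee names. A minimal usage sketch; the URL is a placeholder:

# Hypothetical usage of the helpers above; the URL is a placeholder
soup = getHtmlPageFromLink("https://example.org/zitting/123")
tags = getTagsFromProperty(soup)
names = getTextFromList(tags)
print(names)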

Data Comparison:


def Comparing(NestedLijstVanNamenVanHtml, NestedLijstVanNamenVanDatabase, Everything):
    for sitting in list(Everything.keys()):
        # Only investigate sittings where the attendee counts differ
        if len(NestedLijstVanNamenVanHtml[sitting]) != len(NestedLijstVanNamenVanDatabase[sitting]):
            # Names present on the page but missing from the database
            temp = [x for x in NestedLijstVanNamenVanHtml[sitting] if x not in NestedLijstVanNamenVanDatabase[sitting]]
            # Additional code for finding missing attendees
            ontbrekendePersonen = getDictOfMandaFromListName(temp, sitting, Everything)
            Everything[sitting][2] = ontbrekendePersonen
    return Everything
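
To make the comparison concrete, here is a tiny, self-contained illustration of the idea; the sitting key and names are invented for demonstration:

# Toy illustration of the comparison step; the data is made up
html_names = {"sitting-1": ["An Peeters", "Jan Jansen"]}
db_names = {"sitting-1": ["An Peeters"]}

for sitting in html_names:
    if len(html_names[sitting]) != len(db_names[sitting]):
        missing = [x for x in html_names[sitting] if x not in db_names[sitting]]
        print(sitting, "missing from the database:", missing)
# prints: sitting-1 missing from the database: ['Jan Jansen']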

Querying RDF Data:

def GetAllAanwezigeQuery(zitting):
    # SPARQL query to retrieve the attendees of a given sitting
    sparql.setQuery("""
        PREFIX ...  # SPARQL query goes here
    """)
    result = sparql.query().convert()
    # Additional code for processing query results
    return AlleAanwezigen
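
The query itself is elided in the script. As a sketch only, a query for the attendees of one sitting could look like the following; the endpoint URL and sitting URI are placeholders, and the besluit prefix matches the property used during scraping:

import SPARQLWrapper as sp

# Hedged sketch: the endpoint, sitting URI, and query are illustrative
sparql = sp.SPARQLWrapper("https://example.com/sparql_endpoint")
sparql.setReturnFormat(sp.JSON)
sparql.setQuery("""
    PREFIX besluit: <http://data.vlaanderen.be/ns/besluit#>
    SELECT ?aanwezige
    WHERE {
        <https://example.org/zitting/123> besluit:heeftAanwezigeBijStart ?aanwezige .
    }
""")
result = sparql.query().convert()
# Each binding holds one attendee URI
attendees = [row["aanwezige"]["value"] for row in result["results"]["bindings"]]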

Efficiency Analysis and Visualization:


def AnalyzeTiming(x, y):
    # Analysis code to measure execution time
    # Code for generating line plots using Matplotlib
    return 0
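
A minimal sketch of such an analysis, assuming we time an arbitrary step over a series of inputs; the function and inputs are placeholders:

import time
import matplotlib.pyplot as plt

def timeRuns(func, inputs):
    # Measure how long func takes on each input
    durations = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        durations.append(time.perf_counter() - start)
    return durations

# Example (placeholders): durations = timeRuns(getHtmlPageFromLink, links)
# plt.plot(durations)
# plt.xlabel("Run")
# plt.ylabel("Seconds")
# plt.show()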

Example Output

Let’s take a look at an example of the script’s output. It returns a dictionary structure containing information about meetings, including URLs to notulen (minutes), URLs to associated governing bodies, and a dictionary of attendees with their names and corresponding URLs.


# Example output structure
{
    "Meeting URL": ["Notulen URL", "Governing Body URL", {
        "Attendee Name": "Attendee URL",
        # Additional attendees...
    }],
    # Additional meeting entries...
}
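
Walking this structure in Python is straightforward; a sketch, assuming the dictionary above is stored in a variable named results:

# results is a placeholder name for the dictionary shown above
for meeting_url, (notulen_url, body_url, attendees) in results.items():
    print(meeting_url, notulen_url, body_url)
    for name, person_url in attendees.items():
        print("  ", name, "->", person_url)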

Use Cases

The script’s functionality can be valuable in various scenarios:

  • Attendance Tracking: It can be used to track and verify the attendance of individuals at meetings or events.

  • Data Analysis: The script can be extended to perform deeper data analysis, such as identifying trends in meeting attendance over time.

  • Integration with RDF Data: By using SPARQL queries, the script can integrate with RDF data sources, making it suitable for working with linked data and semantic web technologies.

Now, let’s take a deeper dive into the technologies used in the script.

Web Scraping

HTML Parsing

Web pages are structured using HTML (Hypertext Markup Language). The script leverages the BeautifulSoup library to parse HTML documents. Parsing involves breaking down the HTML document into a structured tree-like format, known as the Document Object Model (DOM).

from bs4 import BeautifulSoup as bs

# Parse an HTML document
soup = bs(html_content, 'html.parser')

Selecting Elements

To interact with specific elements within the DOM, BeautifulSoup provides methods for element selection. For instance, we can find all elements with a particular HTML tag or retrieve elements with specific attributes.


# Find all <a> tags in the HTML
links = soup.find_all('a')
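
The script relies on attribute-based selection to find elements carrying a given RDFa property. Both forms below are equivalent in BeautifulSoup:

# Keyword arguments act as attribute filters
tags = soup.find_all(property="besluit:heeftAanwezigeBijStart")

# The same filter, written explicitly
tags = soup.find_all(attrs={"property": "besluit:heeftAanwezigeBijStart"})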

Data Extraction

Once we’ve selected the desired elements, data can be extracted from them. This involves accessing the element’s attributes or text content. Functions like getTextFromInsideTag and getTextFromList are used in the script to extract text data from HTML elements.


# Extract text content from an HTML element
text = element.getText()
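
Attributes are read in much the same way; get() returns None when the attribute is absent:

# Extract an attribute value from an HTML element
url = element.get('href')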

Navigating Pages

Web scraping often involves navigating through multiple web pages by following links or interacting with forms. The script uses the urlopen function from the urllib.request module to open web pages, fetch their content, and then parse them with BeautifulSoup.


from urllib.request import urlopen

# Open a web page and fetch its content
response = urlopen(link)
html_content = response.read()

RDF Data Querying

RDF Data Model

RDF (Resource Description Framework) represents data as a graph of triples: each triple consists of a subject, a predicate, and an object, and describes a relationship between two resources.

SPARQL Query Language

SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF data. It allows us to query RDF graphs to retrieve specific information. SPARQL queries are designed to match patterns in RDF data and return results in a structured format.


SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object.
}

Query Execution

SPARQL queries are executed against RDF data stores or endpoints. In the script, the SPARQLWrapper library is used to send SPARQL queries to a specific RDF endpoint and retrieve results. The results are typically returned in a structured format like JSON.


import SPARQLWrapper as sp

sparql = sp.SPARQLWrapper("https://example.com/sparql_endpoint")
sparql.setReturnFormat(sp.JSON)  # request results as JSON
sparql.setQuery("SELECT ?subject ?predicate ?object WHERE {?subject ?predicate ?object.}")
result = sparql.query().convert()
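
With JSON results, each solution row is a dictionary of bindings keyed by the variable names in the SELECT clause:

# Iterate over the result bindings returned by the endpoint
for row in result["results"]["bindings"]:
    print(row["subject"]["value"], row["predicate"]["value"], row["object"]["value"])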

Author: Ibrahim El Kaddouri

References

  1. GitHub: MandaatChecker. Accessed: 2022-08-20.