{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NCRM Text Data Workshop - part 2: Scraping web pages\n", "#### Lewys Brace - l.brace@exeter.ac.uk\n", "\n", "### 1. Collecting textual data\n", "Before you can do any textual data analysis, you first need a data set containing text data to work with. Finding an appropriate data set is often the hardest part of a text data project. There are, however, a number of ways in which you can gain access to data.\n", "\n", "1. Finding a dataset that has already been collected. Without a doubt this is easiest option, but you are limited to what others have already collected (I'll point you to datasets from my own research that you are free to use for your project).\n", "2. Find and use an API. There are hundreds (if not thousands!) of APIs on the web that provide textual data in some way shape or form. These include relatively open social media APIs (such as tweepy for Twitter), partially open APIs such as Google News (https://newsapi.org/s/google-news-api), and paid services such as webhose (https://webhose.io).\n", "3. Find a UoE database with textual data. The obvious candidates are Nexus UK and Proquest News and Newspapers.\n", "4. Scrape your own data! This is the most versatile solution, but also has the steepest learning curve.\n", "\n", "This part of the workshop is going to focus on this last option. There are many different options for web scraping in Python. Here, we will be using the ``requests`` library to communicate (or send requests) to a web server and the ``lxml`` module to parse the returned HTML. As covered in part one, the first thing we need to do here is to do is import the necessary dependencies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from lxml import html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then use the ``requests`` in order to communicate with a website URL and get the content from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://socialsciences.exeter.ac.uk/sociology/staff/\"\n", "page = requests.get(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``page`` object is an example of a ``class`` object, which we covered in the last video. These ``class`` objects have a number of ``methods``, which allow us to interact with it. There are numerous arguments that allow you to carry out specific tasks; i.e. we can view the raw HTML content by specifying the ``content`` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(page.content[0:300])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now want to parse the raw HTML that we collected with the ``requests`` package into a format that will make it easy to navigate. We can do this by building a tree using the ``lxml`` module that we imported above. This essentially simulates a web browser in which you open a URL. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now want to parse the raw HTML that we collected with the ``requests`` package into a format that will make it easy to navigate. We can do this by building a tree using the ``lxml`` module that we imported above. This essentially simulates a web browser in which you open a URL. Just as your browser requests information from a web server and organises it into a tree, the ``requests`` library communicates with the server and returns the HTML, while the ``lxml`` package organises it into a tree:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree = html.fromstring(page.content)\n", "print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the ``page`` object, the ``tree`` object has a bunch of methods that allow us to search for elements in a web page that we want to extract. Perhaps the most useful way to navigate the page's HTML tree is with an XPath.\n", "\n", "### 2. XPaths\n", "\n", "It is worth spending some time familiarising ourselves with XPaths before we carry on building our web scraper. XPaths are a way of navigating our tree, and can be thought of as an address to a specific element (or set of elements) on a web page. As an example, let's say that we wanted to extract the text from the \"Our research\" menu option on the Exeter SPA's staff page. We can select this object by copying its XPath from Chrome:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(tree.xpath('//*[@id=\"main-menu\"]/div[2]/ul/li[4]/a/text()'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As another example, we can use our web browser to easily get the XPath to a specific staff member's profile:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(tree.xpath('//*[@id=\"left-col\"]/table[2]/tbody/tr[3]/td[1]/a'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wanted to get the URL of my staff profile? Well, we can just add the ``href`` attribute from the ``a`` tag." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(tree.xpath('//*[@id=\"left-col\"]/table[2]/tbody/tr[3]/td[1]/a/@href'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we can combine this with the base URL for the staff page in order to get the full URL for my staff profile page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(url + tree.xpath('//*[@id=\"left-col\"]/table[2]/tbody/tr[3]/td[1]/a/@href')[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Other XPath expressions\n", "There are a bunch of other XPath expressions that could prove useful in the future. We'll take a quick look at some of the most commonly used ones.\n", "\n", "#### Contains\n", "The ``contains`` expression is useful for selecting only those elements that \"contain\" certain text, ids, or classes. As an example, let's pretend I wanted to look for my colleague Chris:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look up any tags that include \"Chris\"\n", "chris = tree.xpath('//a[contains(., \"Chris\")]')\n", "print(chris)\n", "print('Number of links with Chris = %s' % len(chris))\n", "\n", "# The Chris I need is in location two in the returned list.\n", "print(chris[1].text_content())" ] },
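{ "cell_type": "markdown", "metadata": {}, "source": [ "``contains`` can also be applied to attributes rather than element text. For instance, the staff listings we will scrape in section 5 sit in tables whose ``class`` attribute is \"profile_list\", so a quick sketch of selecting elements by a partial class match looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select any elements whose class attribute contains \"profile_list\"\n", "profile_tables = tree.xpath('//*[contains(@class, \"profile_list\")]')\n", "print('Number of \"profile_list\" elements = %s' % len(profile_tables))" ] },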
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### XPath operators\n", "\n", "When using XPaths, you can also make use of a number of standard operators. These operators include:\n", "\n", "* ``=`` Equal-to comparison; can be used for numeric or text values\n", "* ``!=`` Not-equal-to comparison\n", "* ``>``, ``>=`` Greater than, greater than or equal to\n", "* ``<``, ``<=`` Less than, less than or equal to\n", "* ``or`` Boolean or\n", "* ``and`` Boolean and\n", "* ``not`` Boolean not\n", "\n", "For example, say we wanted to select the links that do not have \"Lewys\" in the text:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('The total number of links on the page = %s' % len(tree.xpath('//a')))\n", "\n", "# Find links without \"Lewys\"\n", "no_lewys = tree.xpath('//a[not(contains(., \"Lewys\"))]')\n", "\n", "print('Number of links without a \"Lewys\" = %s' % len(no_lewys))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As another example, say we wanted all of the \"Chris\" links, except for \"Chris Playford\":" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "no_playford = tree.xpath('//a[contains(., \"Chris\") and not(contains(., \"Playford\"))]')\n", "print(len(no_playford))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Other useful ``lxml`` methods\n", "\n", "In addition to `xpath`, `lxml.html` objects have a number of other useful methods. We have already used the `text_content()` method, which simply returns the text of an HTML object. Here are some other methods that could come in handy.\n", "\n", "#### `cssselect()`\n", "\n", "In addition to using XPaths to select objects, you can also use CSS (Cascading Style Sheets) selectors. As an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import cssselect\n", "nav = tree.cssselect('#main-menu')\n", "print(nav[0].text_content())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `find_class()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plist = tree.find_class('profile_list')\n", "print(plist)\n", "print(plist[0].text_content())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `get_element_by_id()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nav = tree.get_element_by_id('main-menu')\n", "print(nav.text_content())" ] },
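{ "cell_type": "markdown", "metadata": {}, "source": [ "#### `iterlinks()`\n", "\n", "Another handy method, not used in the rest of this notebook but worth knowing about, is `iterlinks()`, which walks the whole tree and yields every link it finds as an (element, attribute, URL, position) tuple. A quick sketch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# iterlinks() yields (element, attribute, link, position) for every link in the tree\n", "links = list(tree.iterlinks())\n", "print('Total number of links found by iterlinks() = %s' % len(links))\n", "\n", "# Show the first one as an example\n", "print(links[0])" ] },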
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Scraping our first page(s)\n", "\n", "Okay, so let's start scraping our web pages. The aim of this exercise is to work out how many staff members in the University of Exeter's SPA department are interested in quantitative methods. As such, we need to start by getting the textual data from the SPA staff pages. We'll break the scraping task into two steps: 1) getting the links to each staff member's profile and 2) extracting the profile content.\n", "\n", "By inspecting the page source in our browser, we see that the information that we want (staff member name, a link to their page, and their title) is located in 6 \"profile_list\" tables in the \"main-content\" div. Let's start by extracting all of the \"rows\" from these tables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# XPath to extract all of the rows in all of the\n", "# \"profile_list\" tables\n", "table = tree.xpath('//*[@class=\"profile_list\"]//tr')\n", "\n", "# How many rows do we have?\n", "print(len(table))\n", "print(table[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you look closely at the \"profile_list\" tables, you'll notice that these tables include a header at the top (e.g., \"Head of Department\") and these headers will now be \"sprinkled\" through our `table` object. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "table[1].text_content()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we are going to need to deal with this issue. Also, notice that the information that we actually want is located in a ``td`` tag and that the header rows don't have these tags." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Look for the td tag in a header row\n", "header = table[0].findall('./td')\n", "print(header)\n", "\n", "# And compare this to a non-header row\n", "non_header = table[1].findall('./td')\n", "print(non_header)\n", "\n", "# We can also look at the text in these \"Element td\"\n", "# objects\n", "print(non_header[0].text_content())\n", "print(non_header[1].text_content())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code illustrates something quite useful: we can use the `findall()` method to look for an XPath within another XPath. Here, we asked `lxml` to \"findall\" of the ``td`` tags -- i.e., both cells in the table for that row.\n", "\n", "OK, so we are already most of the way there -- the only thing remaining is to extract the profile link for each member of staff. Again, if you look at the source in your browser, you will see that the URL to each staff member's profile is located in the cell holding their name. Let's grab this for Prof. Michael:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(non_header[0].find('./a').attrib['href'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we use `find()` to select only the first \"a\" tag (as opposed to `findall()`, which could return more than one) and ask for its ``href`` attribute." ] },
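{ "cell_type": "markdown", "metadata": {}, "source": [ "One thing to be aware of: if a cell does not contain a link, ``find('./a')`` returns ``None`` and the ``attrib['href']`` lookup above would fail. A slightly more defensive sketch (reusing the same ``non_header`` row) checks for this first, and uses ``get()``, which returns ``None`` rather than raising an error when an attribute is missing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# find() returns None when there is no matching \"a\" tag,\n", "# so check before asking for the href attribute\n", "link = non_header[0].find('./a')\n", "\n", "if link is not None:\n", "    # get() returns None (rather than raising a KeyError) if the attribute is missing\n", "    print(link.get('href'))\n", "else:\n", "    print('No link found in this cell.')" ] },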
{ "cell_type": "markdown", "metadata": {}, "source": [ "Sweet -- now we really have everything that we need. The only thing remaining is to pull everything together: loop over each row in the table, extract the information that we need, and save it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the table rows\n", "table = tree.xpath('//*[@id=\"main-content\"]//tr')\n", "\n", "# Extract name, staff page URL, and position\n", "# of each staff member\n", "staff_profiles = []\n", "\n", "for row in table:\n", "    # Get the cells of the HTML table\n", "    cells = row.findall('./td')\n", "    # Header rows in the table do not have td tags.\n", "    # We will ignore these.\n", "    if len(cells) != 0:\n", "        # Get the URL to the staff member's webpage\n", "        page_url = cells[0].find('./a')\n", "        if page_url is not None:\n", "            page_url = cells[0].find('./a').attrib['href']\n", "        staff_profiles.append({'name': cells[0].text_content(),\n", "                               'page_url': page_url,\n", "                               'position': cells[1].text_content()})\n", "\n", "print(\"Extracted information for %s members of staff.\" % len(staff_profiles))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(staff_profiles[3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our second step, we need to extract the content of the profiles. Here, we have our profile URLs and a bit of useful metadata (name and position). The next step is to loop over the `staff_profiles` list, visit each profile page, and extract the content. Let's take it one step at a time. We start by visiting a webpage in the usual way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "page = requests.get(url + staff_profiles[0]['page_url'])\n", "print(\"We extracted the following page:\\n%s\" % page.url)\n", "\n", "# And build the tree in the usual way:\n", "tree = html.fromstring(page.content)" ] },
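{ "cell_type": "markdown", "metadata": {}, "source": [ "Before deciding what to extract, it can help to explore how the profile page is organised. As a quick exploratory sketch, we can list the ``id`` attributes of the divs on the page (the exact ids will vary from page to page):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List the id attributes of the divs on the profile page\n", "print(tree.xpath('//div/@id'))" ] },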
{ "cell_type": "markdown", "metadata": {}, "source": [ "After looking at the page, we can see that the profile content is located in paragraph (``<p>``) tags within the various divs in the ``class=\"tab-content clear-fix\"`` div. For instance, if we wanted to pull the \"overview\" text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all overview content located in paragraph tags\n", "overview = '\\n\\n'.join([p.text_content().strip() for p in\n", "                        tree.xpath('//*[@id=\"overview\"]//p')]).strip()\n", "\n", "print(overview)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pull other aspects of Mike's page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all research content located in paragraph tags\n", "research = '\\n\\n'.join([p.text_content().strip() for p in\n", "                        tree.xpath('//*[@id=\"research\"]//p')]).strip()\n", "\n", "print(research)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can just grab everything at once:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all content located in paragraph tags\n", "content = '\\n\\n'.join([p.text_content().strip() for p in\n", "                       tree.xpath('//*[@id=\"main-content\"]//p')]).strip()\n", "\n", "print(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll keep things simple and use all available page content moving forward. The last step is to loop over the entire `staff_profiles` list and get the profile content for each person:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the page text for each staff member\n", "for staff in staff_profiles:\n", "    # Skip staff members for whom we did not find a profile link\n", "    if staff['page_url'] is None:\n", "        continue\n", "\n", "    # Visit the staff member's landing page\n", "    page = requests.get(url + staff['page_url'])\n", "\n", "    # Parse the HTML tree. We will ignore any unicode errors for the\n", "    # moment.\n", "    tree = html.fromstring(page.content.decode('utf-8', errors='ignore'))\n", "\n", "    # Extract all content located in paragraph tags\n", "    content = '\\n\\n'.join([p.text_content().strip() for p in\n", "                           tree.xpath('//*[@id=\"main-content\"]//p')])\n", "\n", "    # Add the text content to the staff member's dictionary\n", "    staff['page_content'] = content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have added an additional field to each dictionary in our staff_profiles list, and we can call this information in the usual way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(staff_profiles[3]['page_content'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the page content in hand, we are in a position to offer a preliminary answer to our research question. Here, we'll keep things really simple. Let's look up the staff that have the word \"statistic\" or \"quant\" in their profile page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quantitative = []\n", "\n", "for row in staff_profiles:\n", "    if 'page_content' in row:\n", "        if ('quant' in row['page_content']) or ('statistic' in row['page_content']):\n", "            quantitative.append(row)\n", "\n", "print('Found %s staff members that reference quantitative work:' % len(quantitative))\n", "\n", "for row in quantitative:\n", "    print(row['name'])" ] },
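{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that the ``in`` checks above are case-sensitive, so a profile that only mentions, say, \"Quantitative\" with a capital letter would be missed. A small variant of the same search that lower-cases the page content first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Case-insensitive version of the same keyword search\n", "quantitative_ci = []\n", "\n", "for row in staff_profiles:\n", "    if 'page_content' in row:\n", "        text = row['page_content'].lower()\n", "        if ('quant' in text) or ('statistic' in text):\n", "            quantitative_ci.append(row)\n", "\n", "print('Case-insensitive search finds %s staff members.' % len(quantitative_ci))" ] },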
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "with open('C:/Users/lb690/OneDrive - University of Exeter/Teaching/NLP_workshop/profiles.json', 'w') as jfile:\n", " json.dump(staff_profiles, jfile, indent=4, separators=(',', ': '), sort_keys=True)\n", " # Add trailing newline for POSIX compatibility\n", " jfile.write('\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }