``) in the various divs in the ``class = \"tab-content clear-fix\"`` div. For instance, if we wanted to pull the \"overview\" text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all overview content located in paragraph tags\n", "overview = '\\n\\n'.join([p.text_content().strip() for p in \n", " tree.xpath('//*[@id=\"overview\"]//p')]).strip()\n", "\n", "print(overview)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also pull other aspects of Mike's page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all research content located in paragraph tags\n", "research = '\\n\\n'.join([p.text_content().strip() for p in \n", " tree.xpath('//*[@id=\"research\"]//p')]).strip()\n", "\n", "print(research)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can just grab everything at once:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract all content located in paragraph tags\n", "content = '\\n\\n'.join([p.text_content().strip() for p in \n", " tree.xpath('//*[@id=\"main-content\"]//p')]).strip()\n", "\n", "print(content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll keep things simple and use all available page content moving forward. The last step is to loop over the entire staff_profiles list and get the profile content for each person:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract page text\n", "for staff in staff_profiles:\n", " # Visit staff member landing page\n", " page = requests.get(url + staff['page_url'])\n", " \n", " # Parse the HTML tree. 
We will ignore any unicode errors for the\n", " # moment.\n", " tree = html.fromstring(page.content.decode('utf-8', errors='ignore'))\n", " \n", " # Extract all content located in paragraph tags\n", " content = '\\n\\n'.join([p.text_content().strip() for p in \n", " tree.xpath('//*[@id=\"main-content\"]//p')])\n", " \n", " # Add text content to dictionary\n", " staff['page_content'] = content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We added an additional field to our staff_profiles list of dictionaries, and we can access this information in the usual way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(staff_profiles[3]['page_content'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the page content in hand, we are in a position to offer a preliminary answer to our research question. Here, we'll keep things really simple. Let's look up the staff who have the word \"statistic\" or \"quant\" in their profile page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "quantitative = []\n", "\n", "for row in staff_profiles:\n", " if 'page_content' in row:\n", " if ('quant' in row['page_content']) or ('statistic' in row['page_content']):\n", " quantitative.append(row)\n", "\n", "print('Found %s staff members that reference quantitative work:' % len(quantitative))\n", "\n", "for row in quantitative: \n", " print(row['name'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we save our profiles to an output file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "with open('C:/Users/lb690/OneDrive - University of Exeter/Teaching/NLP_workshop/profiles.json', 'w') as jfile:\n", " json.dump(staff_profiles, jfile, indent=4, separators=(',', ': '), sort_keys=True)\n", " # Add trailing newline for POSIX compatibility\n", " jfile.write('\\n')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }