Created by: Chris Lindgren, PhD

Web Scraping with Python

I. Setting you up & understanding the legalities & ethics
Creative Commons License: BY-NC-ND

Workshop Series

  1. Overview & Setup
  2. HTML & XPath
  3. Python Basics
  4. Scraping with Beautiful Soup
  5. Scraping with Selenium

Overview & Setup

  1. What is scraping?
  2. Scraping Legalities & Ethics
  3. Requirements
  4. Setup
  5. Debrief

Setting the Terms

  • Spiders: Bots! An agent coded to crawl the web for you. You'll get to code one!
  • Data scraping: Extract already-structured data from a provided server resource.

Setting the Terms

  • Web scraping: Extract specific structured content (XML/HTML) from a specific site/page for a variety of purposes, such as creating structured data.
  • Web crawling: Akin to scraping, but it extracts structured content from multiple sites/pages.
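
A minimal sketch of the web scraping idea above, assuming a small hypothetical HTML fragment in place of a live page. It uses Python's built-in html.parser rather than the Beautiful Soup library covered later in the series, so it runs without installing anything. The names SAMPLE_HTML and TableScraper are made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a real table of emoticons.
SAMPLE_HTML = """
<table>
  <tr><th>Icon</th><th>Meaning</th></tr>
  <tr><td>:-)</td><td>Smiley</td></tr>
  <tr><td>:-(</td><td>Frown</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:       # header rows have no <td> cells, so they are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
rows = scraper.rows
print(rows)
```

The result is structured data (a list of rows) created from structured content (HTML), which is the core move in any scraping job.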

Why learn how to scrape programmatically?

Case: List of emoticons from Wikipedia.

Three Main Methods

  1. Application Programming Interfaces
  2. Modules & Scripts
  3. Software Tools

Method 1. API
(Application Programming Interface)

An API provides people (clients) access to resources from another application or service (server), e.g., a set of data features about TikTok videos, based on a search query.

Endpoints & Access Tokens

  • APIs provide "endpoints" for people/developers to use the provided data.
  • They often require an "access token".
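
A rough sketch of how an endpoint and an access token fit together in a request. Everything specific here is an assumption for illustration: the endpoint URL, the query parameter name q, the placeholder token, and the use of a "Bearer" authorization header (a common convention, but each API documents its own scheme). The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and placeholder token; a real API documents its own.
API_ENDPOINT = "https://api.example.com/v1/videos/search"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"


def build_search_request(query: str) -> Request:
    """Build (but do not send) a request to the endpoint for a search query."""
    url = f"{API_ENDPOINT}?{urlencode({'q': query})}"
    # Assumption: the API expects the token in a Bearer authorization header.
    return Request(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})


request = build_search_request("cat videos")
print(request.full_url)
```

Sending the request (for example with urllib.request.urlopen) would return whatever data the server chooses to provide for that endpoint.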

API Examples

Method 2. Modules & Scripts

  • Modules / Libraries: A set of code with methods and/or functions that fulfill a recurrent set of needs and goals.
  • Scripts: Smaller code files that complete a specific task. Like scraping a page! :-)
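
To make the distinction concrete, here is a small sketch of a script that leans on a module. The module is Python's built-in urllib.parse; the function name site_of is made up for this illustration.

```python
# The module: urllib.parse ships with Python and provides reusable URL functions.
from urllib.parse import urlparse


def site_of(url: str) -> str:
    """Return the host name portion of a URL."""
    return urlparse(url).netloc


# The script: a short file that completes one specific task when run directly.
if __name__ == "__main__":
    print(site_of("https://en.wikipedia.org/wiki/List_of_emoticons"))
```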

Examples of Modules & Scripts

  • Python:
    • General: Beautiful Soup, Scrapy, and Selenium (Selenium is actually a suite of tools that can be used across multiple languages.)
    • Specific: Pyktok scrapes video, text, and metadata from TikTok with no authentication.
  • R: rvest

Method 3. Existing Tools

Software tools exist that use APIs to automate the process, such as MassMine.

MassMine: Command line tool designed for researchers to simplify the collection and use of data from online sources such as social media networks

Scraping Legalities & Ethics

Can I / should I do it?

Yes, but it's tricky

Problem #1

IRB has been slow to understand and assess these matters.

Problem #2

Company Terms of Service are increasingly complex; there's no way NOT to breach their TOS.

Solution

Get acquainted with what computational social scientists are recommending and doing

Be informed about TOS cases

"Post-API Age"

"When companies can restrict or eliminate API access at any time, for any reason, and without any recourse, computational researchers and students need to seriously consider how to proceed. We find ourselves in a situation where heavy investment in teaching and learning platform-specific methods can be rendered useless overnight: This is what I mean by 'the post-API age.'" (Freelon, 2018, p. 665)
Short timeline of events leading to significant API changes across social media platforms
src: Freelon's (2018) talk at "Summer Institute in Computational Social Science"

"We know what you're doing"

Screencap of We know what you are doing site
src: Bindley, Huffington Post

Used Facebook "Search" API to aggregate embarrassing Facebook posts.

We await Musky volatile decisions

Be informed about legal cases

Internet Researcher Sandvig: Case No. 1:16-cv-1368

Although no adverse action has been taken against any of the plaintiffs, even though some have already engaged in the proposed information-gathering activity, they seek to raise a preenforcement challenge to the constitutionality of a provision of the Computer Fraud and Abuse Act (“CFAA”), 18 U.S.C. § 1030(a)(2)(C), under the theory that the provision runs afoul of the First and Fifth Amendments to the United States Constitution. (p. 8)

Webscraping == "Trespassing" & "Stealing"

"the government has argued that Web servers are private property, and that anyone who exceeds authorized access is trespassing “on” them" (Sandvig)

Webscraping == "Trespassing" & "Stealing"

"the CFAA was used to say that because Web servers are private, users are also wasting capacity on these servers, effectively stealing a server's processing cycles that the owner would rather use for other things." (Sandvig)

Sandvig: CFAA frames webscraping as a "cartoon thief with a bag of electrons."

Sandvig: "Are Internet researchers and data journalists “trespassing” and “stealing”? These are the wrong metaphors."

Isn't this TOS-problem a big ole "Nope. We can't do this."?

"... a cursory investigation of the history of web scraping reveals the area to be heavily contested and in many cases untested in legal process ... [J]udges have taken a lenient view where the scraping has been performed against user generated data that is publically viewable." (Perriam et al., 2020, p. 21)

Case: HiQ vs. LinkedIn

"HiQ violated Linkedin terms of service by scraping user details from the site, [but] judges ruled in favour of HiQ by supporting HiQ's counter claim that Linkedin's blocking of their scrapers amounted to anti-competitive practices" (p. 21)

Case: Whotargets.me

  • Scrapes political ads from users' social news feeds
  • Sends ads to a central location for analysis
  • Generates a report for the user about how political groups have been microtargeting them and compares them against geo/demographic groups

Whotargets.me violates the ToS. Yet, "the service claims 20,000 users in over 80 countries, with their work heavily cited in the press" (Perriam et al., 2020, p. 22).

Surprise!

¯\_(ツ)_/¯ ToS are designed to protect companies' commercial interests

Freelon's Recommendations

  1. Learn how to scrape the web, and
  2. Understand the potential consequences of violating platforms' TOS by doing so (p. 665).

1. Use Authorized Methods Whenever Possible (APIs)

2. Understand the Risks of Violating TOS

  • Companies can blacklist you
  • Companies can file lawsuits: Aaron Swartz's 2013 CFAA prosecution carried a maximum of 50 years and a $1 million fine

3. TOS Compliance != Human Subject Compliance

  • TOS Compliance == "respecting the business prerogatives" (Freelon, p. 667)
  • Human Subject Compliance == "respecting the dignity and privacy of the platform's users" (Freelon, p. 667)

Re: Human Subject Compliance

  • How does your scraping plan account for the collection and retention of sensitive information from vulnerable populations?
  • How can researchers apply and transform standard human-subjects compliance practices?
  • How can researchers continue to develop IRB channels to understand and assess exempt vs. non-exempt studies?
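
As one possible starting point for the first question, the sketch below replaces usernames with keyed hashes and drops fields a study does not need before anything is stored. This is an illustration, not a recommendation from the cited sources: the field names, the sample post, and SECRET_KEY are all hypothetical, and post text can still identify a person on its own.

```python
import hashlib
import hmac

# Hypothetical key; in practice it should be random and stored apart from the data.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize(username: str) -> str:
    """Return a keyed SHA-256 hash so the raw username is never stored."""
    digest = hmac.new(SECRET_KEY, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def strip_identifiers(post: dict) -> dict:
    """Keep only the fields the study needs, with the author pseudonymized."""
    return {"author": pseudonymize(post["username"]), "text": post["text"]}


# Hypothetical scraped record.
post = {"username": "example_user", "text": "Hello!", "email": "user@example.com"}
record = strip_identifiers(post)
print(sorted(record))
```

The same username always maps to the same pseudonym, so posts by one author can still be grouped during analysis.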

Breakout Session - Review a Case Together

Review my latest scraping job from Kaggle.com for a study. [Will share in chat]

  • Am I conducting ethical research?
  • Am I at risk of violation? How? See section 4 of Kaggle's TOS.
  • Other questions or insights?

Requirements

MAC Requirements

(src: docs.python-guide.org)

  1. Apple Developer account
  2. Xcode
  3. Homebrew
  4. Install Python 3 with Homebrew

WINDOWS Requirements

(src: realpython.com)

Follow the tutorial below and be sure to select the "Add to PATH" option.

  1. Install Python 3 with the "Full Installer"

Code Editor

I recommend installing VS Code.

Setup Breakout Sessions

  • MAC
  • WINDOWS

Apple Dev Account

Sign up for an Apple Developer account with your Apple ID.

Xcode

Download and install Xcode.

Install Homebrew

Open "Terminal" and copy/paste the following into it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install pyenv with Homebrew

pyenv will help you switch between multiple versions of Python on your computer.

brew install pyenv

Configure your Mac's environment

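Configure your environment variables so that pyenv manages your Python versions. The commands on this slide assume bash; if your Terminal uses zsh (the macOS default since Catalina), use ~/.zshrc in place of ~/.bash_profile.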

echo 'eval "$(pyenv init -)"' >> ~/.bash_profile

then

source ~/.bash_profile

Check Python 3 versions with pyenv

pyenv versions

Install and set Python 3 version with pyenv

pyenv install 3.9.8

then

pyenv global 3.9.8

Check the Python Version

Open "Terminal" and write the following:

python -V
Python 2.7

or

python3 -V
Python 3.9.8
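
The same check can be made from inside Python, which is handy once you start writing scripts. A minimal sketch using only the standard library; the function name is_python3 is made up for this illustration.

```python
import platform
import sys


def is_python3() -> bool:
    """Report whether the interpreter running this code is Python 3."""
    return sys.version_info.major == 3


# Prints the full version string of the interpreter that runs this file.
print(platform.python_version())
print("Python 3?", is_python3())
```

If the second line prints False, the command you used to run the file points at an older interpreter.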

Want a quick demo?

References

Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566. https://doi.org/10.1080/1369118X.2019.1637447
Cagle, L. E. (2019). Surveilling Strangers: The Disciplinary Biopower of Digital Genre Assemblages. Computers and Composition, 52, 67–78. https://doi.org/10.1016/j.compcom.2019.01.006
Freelon, D. (2018). Computational Research in the Post-API Age. Political Communication, 35(4), 665–668. https://doi.org/10.1080/10584609.2018.1477506
Hirschey, J. (2014). Symbiotic Relationships: Pragmatic Acceptance of Data Scraping. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2419167
Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., & Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062. https://doi.org/10.1126/science.aaz8170
Mancosu, M., & Vegetti, F. (2020). What You Can Scrape and What Is Right to Scrape: A Proposal for a Tool to Collect Public Facebook Data. Social Media + Society, 6(3), 2056305120940703. https://doi.org/10.1177/2056305120940703
Perriam, J., Birkbak, A., & Freeman, A. (2020). Digital methods in a post-API environment. International Journal of Social Research Methodology, 23(3), 277–290. https://doi.org/10.1080/13645579.2019.1682840
Sandvig, C. W. (n.d.). United States District Court for the District of Columbia, Case No. 1:16-cv-1368. 41.
Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating Survey Data and Digital Trace Data: Key Issues in Developing an Emerging Field. Social Science Computer Review, 38(5), 503–516. https://doi.org/10.1177/0894439319843669