Created by: Chris Lindgren, PhD

Web Scraping with Python

I. Setting you up & understanding the legalities & ethics

Workshop Series

Overview & Setup
HTML & XPath
Python Basics
Scraping with Beautiful Soup
Scraping with Selenium

Overview & Setup

What is scraping?
Scraping Legalities & Ethics
Requirements
Setup
Debrief

Setting the Terms

Spiders: Bots! An agent coded to crawl the web for you. You'll get to code one!
Data scraping: Extract already structured data from a provided server resource.

Setting the Terms

Web scraping: Extract specific structured content (XML/HTML) from a specific site or page for a variety of purposes, such as creating structured data.
Web crawling: Akin to scraping, but it extracts structured content from multiple sites or pages.
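
To make the scraping/crawling distinction concrete, here is a minimal sketch using only Python's standard library. The `PAGES` dictionary, the `TitleAndLinks` class, and the `scrape` and `crawl` functions are all hypothetical names made up for illustration; a real spider would fetch pages over HTTP rather than read them from memory.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL path -> HTML. A real spider would fetch
# these over HTTP; keeping them local makes the sketch runnable offline.
PAGES = {
    "/index.html": '<html><head><title>Home</title></head>'
                   '<body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>',
    "/a.html": '<html><head><title>Page A</title></head>'
               '<body><a href="/index.html">Home</a></body></html>',
    "/b.html": '<html><head><title>Page B</title></head><body></body></html>',
}


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> on one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(path):
    """Web scraping: extract structured content from one specific page."""
    parser = TitleAndLinks()
    parser.feed(PAGES[path])
    return {"path": path, "title": parser.title, "links": parser.links}


def crawl(start):
    """Web crawling: scrape a page, then follow its links to other pages."""
    seen, queue, results = set(), [start], []
    while queue:
        path = queue.pop(0)
        if path in seen or path not in PAGES:
            continue
        seen.add(path)
        record = scrape(path)
        results.append(record)
        queue.extend(record["links"])
    return results
```

Calling `scrape("/a.html")` returns one record for one page, while `crawl("/index.html")` returns a record for every page reachable from the start page.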

Why learn how to scrape programmatically?

Case: List of emoticons from Wikipedia.

Three Main Methods

Application Programming Interfaces
Modules & Scripts
Software Tools

Method 1. API
(Application Programming Interface)

An API provides people (clients) access to resources from another application or service (server), e.g., a set of data features about TikTok videos, based on a search query.

Endpoints & Access Tokens

APIs provide "endpoints" that people/developers use to request the provided data.
Endpoints often require an "access token."
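
As a rough sketch of how an endpoint and an access token fit together, the snippet below builds a request with Python's standard library. The URL, the `q` and `max_results` parameters, and the token value are placeholders invented for illustration, not any real platform's API. The "Bearer" header scheme is a common convention, but each platform documents its own.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values for illustration only; substitute the endpoint and
# token issued by the platform you are authorized to use.
BASE_URL = "https://api.example.com/v1/videos/search"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"


def build_request(query, max_results=10):
    """Build (but do not send) a GET request to the hypothetical endpoint."""
    params = urllib.parse.urlencode({"q": query, "max_results": max_results})
    request = urllib.request.Request(f"{BASE_URL}?{params}")
    # Many APIs expect the token in an Authorization header as a "Bearer" token.
    request.add_header("Authorization", f"Bearer {ACCESS_TOKEN}")
    return request


def fetch_json(request):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("web scraping")
print(request.full_url)
```

Only `build_request` runs here. `fetch_json` would need a real endpoint and a valid token before it could return data.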

API Examples

TikTok's "Video Query"
Twitter's "Filtered Stream"
Kaggle's "Notebooks"

Method 2. Modules & Scripts

Modules / Libraries: A set of code with methods and/or functions that fulfill a recurrent set of needs and goals.
Scripts: Smaller code files that complete a specific task. Like scraping a page! :-)
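
For a sense of how the two relate, the sketch below is a small script that imports two standard-library modules (`html.parser` and `csv`) to turn an HTML table into structured rows. The HTML snippet and the `TableRows` class are made up for illustration. Later sessions use Beautiful Soup for this kind of task, so treat this as a stand-in rather than the workshop's method.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up HTML standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Emoticon</th><th>Meaning</th></tr>
  <tr><td>:-)</td><td>Smile</td></tr>
  <tr><td>;-)</td><td>Wink</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableRows()
parser.feed(HTML)

# Write the rows as CSV text; a real script might write to a file instead.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
print(buffer.getvalue())
```

The modules supply the reusable parts (parsing, CSV writing), and the script strings them together for one specific job.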

Examples of Modules & Scripts

Python:
- General: Beautiful Soup, Scrapy, and Selenium (Selenium is actually a suite of tools that can be used across multiple languages.)
- Specific: Pyktok scrapes video, text, and metadata from TikTok with no authentication.
R: rvest

Method 3. Existing Tools

Software tools exist that use APIs to automate the process, such as MassMine.

MassMine: Command line tool designed for researchers to simplify the collection and use of data from online sources such as social media networks

Scraping Legalities & Ethics

Can I / should I do it?

Yes, but it's tricky

Problem #1

IRB has been slow to understand and assess these matters.

Problem #2

Company Terms of Service (TOS) are increasingly complex, and there's no way NOT to breach their TOS.

Solution

Get acquainted with what computational social scientists are recommending and doing

Be informed about TOS cases

"Post-API Age"

"When companies can restrict or eliminate API access at any time, for any reason, and without any recourse, computational researchers and students need to seriously consider how to proceed. We find ourselves in a situation where heavy investment in teaching and learning platform-specific methods can be rendered useless overnight: This is what I mean by 'the post-API age.'" (Freelon, 2018, p. 665)

Short timeline of events leading to significant API changes across social media platforms — src: Freelon's (2018) talk at "Summer Institute in Computational Social Science"

"We know what you're doing"

Screencap of the "We know what you are doing" site (src: Bindley, Huffington Post)

Used Facebook "Search" API to aggregate embarrassing Facebook posts.

We await Musky volatile decisions

Be informed about legal cases

Internet Researcher Sandvig: Case No. 1:16-cv-1368

Although no adverse action has been taken against any of the plaintiffs, even though some have already engaged in the proposed information-gathering activity, they seek to raise a preenforcement challenge to the constitutionality of a provision of the Computer Fraud and Abuse Act (“CFAA”), 18 U.S.C. § 1030(a)(2)(C), under the theory that the provision runs afoul of the First and Fifth Amendments to the United States Constitution. (p. 8)

Webscraping == "Trespassing" & "Stealing"

"the government has argued that Web servers are private property, and that anyone who exceeds authorized access is trespassing “on” them" (Sandvig)

Webscraping == "Trespassing" & "Stealing"

"the CFAA was used to say that because Web servers are private, users are also wasting capacity on these servers, effectively stealing a server's processing cycles that the owner would rather use for other things." (Sandvig)

Sandvig: CFAA frames webscraping as a "cartoon thief with a bag of electrons."

Sandvig: "Are Internet researchers and data journalists “trespassing” and “stealing”? These are the wrong metaphors."

Isn't this TOS-problem a big ole "Nope. We can't do this."?

"... a cursory investigation of the history of web scraping reveals the area to be heavily contested and in many cases untested in legal process ... [J]udges have taken a lenient view where the scraping has been performed against user generated data that is publically viewable." (Perriam et al., 2020, p. 21)

Case: hiQ vs. LinkedIn

"HiQ violated Linkedin terms of service by scraping user details from the site, [but] judges ruled in favour of HiQ by supporting HiQ's counter claim that Linkedin's blocking of their scrapers amounted to anti-competitive practices" (p. 21)

Case: Whotargets.me

Scrapes political ads from users' social news feeds
Sends ads to a central location for analysis
Generates a report for each user about how political groups have been microtargeting them and compares them against geo/demographic groups

Whotargets.me violates the ToS. Yet, "the service claims 20,000 users in over 80 countries, with their work heavily cited in the press" (Perriam et al., 2020, p. 22).

Surprise!

¯\_(ツ)_/¯ ToS are designed to protect the companies' commercial interests

Freelon's Recommendations

Learn how to scrape the web, and
Understand the potential consequences of violating platforms' TOS by doing so (p. 665).

1. Use Authorized Methods Whenever Possible (APIs)

2. Understand the Risks of Violating TOS

Companies can blacklist you
Companies can file lawsuits, and prosecutors can bring charges: Aaron Swartz's CFAA prosecution (2011 to 2013) carried a maximum of 50 years and a $1 million fine

3. TOS Compliance != Human Subject Compliance

TOS Compliance == "respecting the business prerogatives" (Freelon, p. 667)
Human Subject Compliance == "respecting the dignity and privacy of the platform's users" (Freelon, p. 667)

Re: Human Subject Compliance

How does your scraping plan account for the collection and retention of sensitive information from vulnerable populations?
How can researchers apply and transform standard human subject compliance practices?
How can researchers continue to develop IRB channels to understand and assess exempt vs. non-exempt studies?

Breakout Session - Review a Case Together

Review my latest scraping job from Kaggle.com for a study. [Will share in chat]

Am I conducting ethical research?
Am I at risk of violation? How? See section 4 of Kaggle's TOS.
Other questions or insights?

Requirements

MAC Requirements

(src: docs.python-guide.org)

Apple Developer account
Xcode
Homebrew
Install Python 3 with Homebrew

WINDOWS Requirements

(src: realpython.com)

Follow the tutorial below and be sure to select the "Add to PATH" option.

Install Python 3 with the "Full Installer"

Code Editor

I recommend installing VS Code.

Setup Breakout Sessions

MAC
WINDOWS

Apple Dev Account

Xcode

Download and install Xcode.

Install Homebrew

Open "Terminal" and copy/paste the following into it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install `pyenv` with Homebrew

pyenv will help you switch between multiple versions of Python on your computer.

brew install pyenv

Configure your Mac's environment

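Configure your environment variables so that pyenv manages which Python version your shell uses. (If your Terminal runs zsh, the default shell since macOS Catalina, use ~/.zshrc in place of ~/.bash_profile in both commands.)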

echo 'eval "$(pyenv init -)"' >> ~/.bash_profile

then

source ~/.bash_profile

Check Python 3 versions with pyenv

pyenv versions

Set Python 3 version with pyenv

If `pyenv versions` does not list 3.9.8, install it first:

pyenv install 3.9.8

pyenv global 3.9.8

Check the Python Version

Open "Terminal" and write the following:

python -V
Python 2.7

python3 -V
Python 3.9.8
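
The same check can also be made from inside Python, which helps when you are unsure which interpreter your code editor is using. This is a minimal sketch: the `MINIMUM` value and the `version_ok` helper are illustrative assumptions, not workshop requirements.

```python
import sys

# The workshop assumes Python 3; 3.9 here is only an example minimum.
MINIMUM = (3, 9)


def version_ok(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the running interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


print("Running Python", sys.version.split()[0])
print("Meets minimum:", version_ok())
```

Save this as a file and run it with `python3` to see which interpreter actually executes your scripts.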

Want a quick demo?

References

Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566. https://doi.org/10.1080/1369118X.2019.1637447

Cagle, L. E. (2019). Surveilling Strangers: The Disciplinary Biopower of Digital Genre Assemblages. Computers and Composition, 52, 67–78. https://doi.org/10.1016/j.compcom.2019.01.006

Freelon, D. (2018). Computational Research in the Post-API Age. Political Communication, 35(4), 665–668. https://doi.org/10.1080/10584609.2018.1477506

Hirschey, J. (2014). Symbiotic Relationships: Pragmatic Acceptance of Data Scraping. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2419167

Lazer, D. M. J., Pentland, A., Watts, D. J., Aral, S., Athey, S., Contractor, N., Freelon, D., Gonzalez-Bailon, S., King, G., Margetts, H., Nelson, A., Salganik, M. J., Strohmaier, M., Vespignani, A., & Wagner, C. (2020). Computational social science: Obstacles and opportunities. Science, 369(6507), 1060–1062. https://doi.org/10.1126/science.aaz8170

Mancosu, M., & Vegetti, F. (2020). What You Can Scrape and What Is Right to Scrape: A Proposal for a Tool to Collect Public Facebook Data. Social Media + Society, 6(3), 2056305120940703. https://doi.org/10.1177/2056305120940703

Perriam, J., Birkbak, A., & Freeman, A. (2020). Digital methods in a post-API environment. International Journal of Social Research Methodology, 23(3), 277–290. https://doi.org/10.1080/13645579.2019.1682840

Sandvig, C. W. (n.d.). Case No. 1:16-cv-1368. United States District Court for the District of Columbia.

Stier, S., Breuer, J., Siegers, P., & Thorson, K. (2020). Integrating Survey Data and Digital Trace Data: Key Issues in Developing an Emerging Field. Social Science Computer Review, 38(5), 503–516. https://doi.org/10.1177/0894439319843669
