Created by: Chris Lindgren, PhD
Beautiful Soup
selenium
Case: List of emoticons from Wikipedia.
An API provides people (clients) access to resources from another application or service (server), e.g., a set of data features about TikTok videos, based on a search query.
Software tools exist that use APIs to automate the process, such as MassMine.
MassMine: A command-line tool designed to help researchers collect and use data from online sources such as social media networks.
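To make the client/server idea concrete, here is a minimal sketch of how a client might phrase a search request to an API. Everything specific in it is an assumption for illustration: the endpoint `api.example.com`, the parameter names (`q`, `max_results`, `fields`), and the field names are hypothetical and do not belong to TikTok or any real service. The sketch only builds the request URL and reads it back; it does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: example.com is a placeholder, not a real video API.
BASE_URL = "https://api.example.com/v1/videos/search"


def build_search_url(query, max_results=20,
                     fields=("id", "description", "like_count")):
    """Return the URL a client would request to search for videos.

    The parameter names are assumptions; a real API documents its own.
    """
    params = {
        "q": query,                   # the search query
        "max_results": max_results,   # how many records to return
        "fields": ",".join(fields),   # which data features to include
    }
    return f"{BASE_URL}?{urlencode(params)}"


def read_search_params(url):
    """Decode the query string, roughly as a server would on receipt."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    url = build_search_url("cat videos", max_results=5)
    print(url)
    print(read_search_params(url))
```

Tools such as MassMine wrap this kind of request building, along with authentication and paging, so researchers do not have to write it by hand.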
Can I / should I do it?
Yes, but it's tricky
IRB has been slow to understand and assess these matters.
Company Terms of Service (ToS) are increasingly complex; there's no way NOT to breach their ToS.
Get acquainted with what computational social scientists are recommending and doing
"When companies can restrict or eliminate API access at any time, for any reason, and without any recourse, computational researchers and students need to seriously consider how to proceed. We find ourselves in a situation where heavy investment in teaching and learning platform-specific methods can be rendered useless overnight: This is what I mean by 'the post-API age.'" (Freelon, 2018, p. 665)
Used Facebook "Search" API to aggregate embarrassing Facebook posts.
Although no adverse action has been taken against any of the plaintiffs, even though some have already engaged in the proposed information-gathering activity, they seek to raise a preenforcement challenge to the constitutionality of a provision of the Computer Fraud and Abuse Act (“CFAA”), 18 U.S.C. § 1030(a)(2)(C), under the theory that the provision runs afoul of the First and Fifth Amendments to the United States Constitution. (p. 8)
"the government has argued that Web servers are private property, and that anyone who exceeds authorized access is trespassing “on” them" (Sandvig)
"the CFAA was used to say that because Web servers are private, users are also wasting capacity on these servers, effectively stealing a server's processing cycles that the owner would rather use for other things." (Sandvig)
Sandvig: CFAA frames webscraping as a "cartoon thief with a bag of electrons."
Sandvig: "Are Internet researchers and data journalists “trespassing” and “stealing”? These are the wrong metaphors."
"... a cursory investigation of the history of web scraping reveals the area to be heavily contested and in many cases untested in legal process ... [J]udges have taken a lenient view where the scraping has been performed against user generated data that is publically viewable." (Perriam et al., 2020, p. 21)
"HiQ violated Linkedin terms of service by scraping user details from the site, [but] judges ruled in favour of HiQ by supporting HiQ's counter claim that Linkedin's blocking of their scrapers amounted to anti-competitive practices" (p. 21)
Whotargets.me violates the ToS. Yet, "the service claims 20,000 users in over 80 countries, with their work heavily cited in the press" (Perriam et al., 2020, p. 22).
¯\_(ツ)_/¯ ToS are designed to protect companies' commercial interests.
Review my latest scraping job from Kaggle.com for a study. [Will share in chat]
(src: docs.python-guide.org)
(src: realpython.com)
Follow the tutorial below and be sure to select the "Add to PATH" option.
I recommend installing VS Code.
Sign up for an Apple Developer account with your Apple ID.
Download and install Xcode.
Open "Terminal" and copy/paste the following into it:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install pyenv with Homebrew.
pyenv will help you switch between multiple versions of Python on your computer.
brew install pyenv
Configure your environment variables so that pyenv manages which Python version your shell uses.
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
then
source ~/.bash_profile
pyenv install 3.9.8
pyenv versions
pyenv global 3.9.8
Open "Terminal" and type the following:
python -V
Python 2.7
or
python3 -V
Python 3.9.8
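The same check can be made from inside Python, which is handy when it is unclear which interpreter a script or editor is actually using. This is a minimal sketch; the helper names `describe_python` and `is_python3` are made up for illustration.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Format a version tuple the way `python -V` prints it."""
    major, minor, micro = version_info[:3]
    return f"Python {major}.{minor}.{micro}"


def is_python3(version_info=sys.version_info):
    """Return True when the major version is 3 or newer."""
    return version_info[0] >= 3


if __name__ == "__main__":
    print(describe_python())   # version of the interpreter running this file
    print(sys.executable)      # path to that interpreter
    if not is_python3():
        print("This is Python 2; try running the script with python3.")
```

If the printed path does not point to the version selected with pyenv, the shell configuration step above likely has not taken effect in the current Terminal window.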