The goal of LIHKGr is to scrape text data from LIHKG, the Hong Kong version of Reddit, for analysis. LIHKG gained popularity in 2016 and has become a popular research data source in recent years. Because LIHKG is currently protected by Google’s reCAPTCHA, this package builds on RSelenium and adopts a semi-manual approach to bypass it.
contains all the required functions. Please install the following packages: RSelenium, raster, magrittr, and purrr. Then follow this workflow:
For RSelenium to work, you need to specify the browser. If you are using Chrome, you also need to specify the version, for example create_lihkg(browser = "chrome", chromever = "83.0.4103.39"). If a version is not supplied, the most recent version is used by default. To see the Chrome driver versions currently available, run binman::list_versions("chromedriver").
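If you prefer to pin a Chrome driver version explicitly rather than rely on the default, the following is a minimal sketch of how one might pick the highest version from a list of version strings. The helper pick_latest_version is hypothetical and not part of LIHKGr, and the version strings are illustrative; on a real machine they would come from binman::list_versions("chromedriver").

```r
# Hypothetical helper (not part of LIHKGr): pick the highest version string.
pick_latest_version <- function(versions) {
  # numeric_version() compares dotted versions component by component,
  # so "83.0.4103.39" ranks above "9.0.1" (plain string sorting would not).
  as.character(max(numeric_version(versions)))
}

# Illustrative version strings, assumed for this example only.
available <- c("81.0.4044.138", "83.0.4103.39", "9.0.1")
chosen <- pick_latest_version(available)
print(chosen)
```

The resulting string could then be passed as the chromever argument, as in create_lihkg(browser = "chrome", chromever = chosen).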
## Creating a Firefox instance with a random port.
lihkg <- create_lihkg(browser = "firefox", port = sample(10000:60000, 1), verbose = FALSE)
# It can accept a single post id
# Or a vector
# Another way to do it
postids <- c(1610753, 2091171)
To obtain the dataframe:
To save as .RDS:
If you don’t want to save the data as RDS, you can save the bag in any format you like. It is just a regular data frame / tibble.
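As a minimal sketch of saving to another format, the example below writes a data frame to CSV with base R and reads it back. The bag object here is a made-up stand-in with assumed column names, not the actual structure returned by LIHKGr; substitute the data frame you obtained from the scraper.

```r
# Stand-in for the scraped bag; the column names are assumptions for illustration.
bag <- data.frame(
  post_id = c(1610753, 2091171),
  text = c("first example post", "second example post"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file; UTF-8 keeps Chinese text intact.
out_file <- tempfile(fileext = ".csv")
write.csv(bag, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm the round trip.
restored <- read.csv(out_file, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
```

Any other writer that accepts a data frame should work the same way, since no package-specific structure is involved.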
Ho, J.C. & Or, N.H.K. (2020). LIHKGr. An application for scraping LIHKG. Source code and releases available at