Dataset about open research data information in Water, Sanitation, and Hygiene • washopenresearch

The goal of washopenresearch is to provide an overview of open research data related to Water, Sanitation, and Hygiene (WASH). The current version contains two datasets from the following sources:

washdev: Open access journal Journal of Water, Sanitation and Hygiene for Development
uncnewsletter: Research section of the newsletter North Carolina Water News

Installation

You can install the development version of washopenresearch from GitHub with:

# install.packages("devtools")
devtools::install_github("openwashdata/washopenresearch")

Alternatively, you can download the individual datasets as a CSV or XLSX file from the table below.

dataset	CSV	XLSX
washdev	Download CSV	Download XLSX
uncnewsletter	Download CSV	Download XLSX

Data

The package provides access to two datasets, washdev and uncnewsletter. For each scientific article, both datasets record (1) article metadata (e.g. title, first author, correspondence author), (2) supplementary material information, (3) the data availability statement, and (4) semantic information (e.g. keywords).

library(washopenresearch)

washdev

The dataset washdev contains data on open access articles of the Journal of Water, Sanitation & Hygiene for Development (Vol.1 Issue 1 - Vol.13 Issue 11). It has 924 observations from March 2011 to November 2023.

washdev |> 
  head(3) |> 
  gt::gt() |>
  gt::as_raw_html()

paperid	volume	issue	paper_url	journal	title	published_year	is_supp	supp_file_type	supp_url	num_authors	first_author_name	first_author_affiliation	first_author_affiliation_country	first_author_email	first_author_orcid	correspondence_author_name	correspondence_author_affiliation	correspondence_author_affiliation_country	correspondence_author_email	correspondence_author_orcid	has_das	das	das_type	das_repo_url	keywords	url_source
28742	1	1	https://iwaponline.com/washdev/article/1/1/1/28742/Editorial	Journal of Water, Sanitation & Hygiene for Development	Editorial	2011	FALSE	NA	NA	6	Jamie Bartram	Journal of Water, Sanitation and Hygiene for Development	NA	NA	NA	NA	NA	NA	NA	NA	FALSE	NA	NA	NA	NA	iwaponline.com
28745	1	1	https://iwaponline.com/washdev/article/1/1/3/28745/The-sanitation-ladder-a-need-for-a-revamp	Journal of Water, Sanitation & Hygiene for Development	The sanitation ladder – a need for a revamp?	2011	FALSE	NA	NA	5	E. Kvarnström	Stockholm Environment Institute, Kräftriket 2B, SE-10691 Stockholm, Sweden	Sweden	elisabeth.kvarnstrom@sei.se	NA	E. Kvarnström	Stockholm Environment Institute, Kräftriket 2B, SE-10691 Stockholm, Sweden	Sweden	elisabeth.kvarnstrom@sei.se	NA	FALSE	NA	NA	NA	function-based, sanitation technologies, sustainability, the sanitation ladder	iwaponline.com
28743	1	1	https://iwaponline.com/washdev/article/1/1/13/28743/Vertical-flow-constructed-wetlands-as-an-emerging	Journal of Water, Sanitation & Hygiene for Development	Vertical-flow constructed wetlands as an emerging solution for faecal sludge dewatering in developing countries	2011	FALSE	NA	NA	6	I. M. Kengne	Laboratory of Plant Biotechnology and Environment, Faculty of Science, University Yaoundé I, PO Box 812, Yaoundé, Cameroon	Cameroon	NA	NA	E. Soh Kengne	Laboratory of Plant Biotechnology and Environment, Faculty of Science, University Yaoundé I, PO Box 812, Yaoundé, Cameroon	Cameroon	ives_kengne@yahoo.fr	NA	FALSE	NA	NA	NA	biosolid accumulation, Cyperus papyrus, Echinochloa pyramidalis, faecal sludge dewatering, pollutant removal efficiencies, vertical-flow constructed wetlands	iwaponline.com

For an overview of the variable names, see the following table.

variable_name	variable_type	description
paperid	integer	ID number of the paper on the journal website
volume	integer	Volume number of the journal
issue	integer	Issue number of the journal
paper_url	character	Official website url of the paper
journal	character	Full name of the journal
title	character	Title of the paper
published_year	integer	Year of publication
is_supp	logical	Whether the paper has supplementary materials
num_supp	integer	Number of supplementary material files
supp_file_type	list	File type of the supplementary materials
supp_url	character	Website url of the supplementary materials
num_authors	integer	Number of the authors
first_author_name	character	Name of the first author
first_author_affiliation	character	Academic affiliation of the first author
first_author_affiliation_country	character	Country or region of the first author, parsed from the first_author_affiliation variable
first_author_email	character	Email of the first author
first_author_orcid	character	ORCID of the first author
correspondence_author_name	character	Name of the correspondence author
correspondence_author_affiliation	character	Academic affiliation of the correspondence author
correspondence_author_affiliation_country	character	Country or region of the correspondence author, parsed from the correspondence_author_affiliation variable
correspondence_author_email	character	Email of the correspondence author
correspondence_author_orcid	character	ORCID of the correspondence author
has_das	logical	Whether the paper has a data availability statement
das	character	Original data availability statement of the paper. NA if it does not have a data availability statement.
das_type	factor	Type of the data availability statement, one of: “in paper” (data within the paper itself, such as supplementary material, an appendix, or the main content); “on request” (data available on request to the authors); “available in online repository” (data shared in a public online repository); “not shareable” (data is not shareable). NA if the paper does not have a data availability statement.
das_repo_url	list	Website url of the data if the relevant data of the paper is shared on a public repository
keywords	list	List of keywords of the paper
url_source	character	Publisher website of the paper

uncnewsletter

The dataset uncnewsletter contains data on a curated list of articles published in the Research section of the newsletter North Carolina Water News. It has 173 observations from 2020 to 2023.

uncnewsletter |> 
  head(3) |> 
  gt::gt() |>
  gt::as_raw_html()

paperid	issue_url	paper_url	url_source	journal	title	published_year	is_supp	num_supp	supp_file_type	supp_url	num_authors	first_author_name	first_author_affiliation	first_author_affiliation_country	first_author_email	first_author_orcid	correspondence_author_name	correspondence_author_affiliation	correspondence_author_affiliation_country	correspondence_author_email	correspondence_author_orcid	has_das	das	das_type	das_repo_url	citations	keywords
198	http://eepurl.com/hWz3Yf	https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/ep.13800	aiche.onlinelibrary.wiley.com	Environmental Progress & Sustainable Energy	Mitigation of PFAS in U.S. Public Water Systems: Future steps for ensuring safer drinking water	2022	TRUE	1	docx	https://aiche.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fep.13800&file=ep13800-sup-0001-Supinfo.docx	1	Alexis Voulgaropoulos	North Carolina State University	NA	anvoulga@ncsu.edu	0000-0002-5778-354X	NA	NA	NA	NA	NA	FALSE	NA	NA	NA	2	drinkingwater, environmentalpolicy, healthandsafety
89	http://eepurl.com/ieh0rf	https://ajph.aphapublications.org/doi/abs/10.2105/AJPH.2022.307108	ajph.aphapublications.org	American Journal of Public Health	Timing and Trends for Municipal Wastewater, Lab-Confirmed Case, and Syndromic Case Surveillance of COVID-19 in Raleigh, North Carolina	2023	TRUE	1	docx	https://ajph.aphapublications.org/doi/suppl/10.2105/AJPH.2022.307108/suppl_file/kotlarz_suppl-figures_tables.docx	17	Nadine Kotlarz	North Carolina State University	NA	nkotlar@ncsu.ede	NA	NA	NA	NA	NA	NA	FALSE	NA	NA	NA	3	NA
200	http://eepurl.com/hWz3Yf	https://aslopubs.onlinelibrary.wiley.com/doi/abs/10.1002/lom3.10469	aslopubs.onlinelibrary.wiley.com	Limnology and Oceanography: Methods	OpenOBS: Open-source, low-cost optical backscatter sensors for water quality and sediment-transport research	2022	TRUE	1	pdf	https://aslopubs.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Flom3.10469&file=lom310469-sup-0001-Supinfo.pdf	4	Emily F. Eidam	University of North Carolina	NA	efe@unc.edu	0000-0002-1906-8692	NA	NA	NA	NA	NA	TRUE	The code, wiring diagram, hardware bill of materials, and 3D-printed endcap design files are available at https://github.com/tedlanghorst/OpenOBS.	available in online repository	https://github.com/tedlanghorst/OpenOBS	4	NA

For an overview of the variable descriptions, see the following table.

variable_name	variable_type	description
paperid	integer	ID number of the paper on the journal website
issue_url	character	Website url of the newsletter issue in which the paper was listed
paper_url	character	Official website url of the paper
url_source	character	Publisher website of the paper
journal	character	Full name of the journal
title	character	Title of the paper
published_year	integer	Year of publication
is_supp	logical	Whether the paper has supplementary materials
num_supp	integer	Number of supplementary material files
supp_file_type	list	File type of the supplementary materials
supp_url	list	Website url of the supplementary materials
num_authors	integer	Number of the authors
first_author_name	character	Name of the first author
first_author_affiliation	character	Academic affiliation of the first author
first_author_affiliation_country	character	Country of the first author, parsed directly from the first_author_affiliation variable and encoded with United Nations names
first_author_email	character	Email of the first author
first_author_orcid	character	ORCID of the first author
correspondence_author_name	character	Name of the correspondence author
correspondence_author_affiliation	character	Academic affiliation of the correspondence author
correspondence_author_affiliation_country	character	Country or region of the correspondence author, parsed directly from the correspondence_author_affiliation variable and encoded with United Nations names
correspondence_author_email	character	Email of the correspondence author
correspondence_author_orcid	character	ORCID of the correspondence author
has_das	logical	Whether the paper has a data availability statement
das	character	Original data availability statement of the paper. NA if it does not have a data availability statement.
das_type	factor	Type of the data availability statement, one of: “in paper” (data within the paper itself, such as supplementary material, an appendix, or the main content); “on request” (data available on request to the authors); “available in online repository” (data shared in a public online repository); “not shareable” (data is not shareable). NA if the paper does not have a data availability statement.
das_repo_url	list	Website url of the data if the relevant data of the paper is shared on a public repository
keywords	list	List of keywords of the paper

Example

washdev

Which 10 countries (or regions) do first authors in the Journal of Water, Sanitation and Hygiene for Development most often come from?

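library(washopenresearch)
library(dplyr)    # filter(), group_by(), summarise(), arrange(), as_tibble()
library(ggplot2)  # plotting
library(stringr)  # str_to_lower()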

washdev |> 
  filter(!is.na(first_author_affiliation_country)) |>
  group_by(first_author_affiliation_country) |>
  summarise(count=n()) |>
  arrange(desc(count)) |>
  head(10) |>
  ggplot() +
    geom_col(aes(x = reorder(first_author_affiliation_country, count), 
                 y = count)) +
    labs(title = "Top 10 countries of first author",
        subtitle = "in the Journal of Water, Sanitation and Hygiene for Development",
        x = "First Author Country", y = "Count") +
    scale_x_discrete(labels = scales::label_wrap(15))+
    coord_flip() +
    theme_classic()

Which keywords are most common in WASH Dev?

Each publication may provide a list of keywords, typically 5-7, that summarize the topics of the article. Here we pool the keywords of all articles and count how often each one is used.

keywords_freq <- washdev$keywords |>
  unlist() |>
  str_to_lower() |>
  table() |>
  as.data.frame() |>
  as_tibble() |>
  arrange(desc(Freq))

# Top 20 keywords
ggplot(data = head(keywords_freq, 20)) +
  geom_bar(aes(x = reorder(Var1, Freq), y=Freq), stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Keywords in WASH Dev Journal", x = "Keywords", y = "Count") +
  theme_bw()

uncnewsletter

What are the top 10 source websites of the publications selected by the newsletter?

uncnewsletter |> 
  group_by(url_source) |>
  summarise(count=n()) |>
  arrange(desc(count)) |>
  head(10) |>
  ggplot() +
    geom_col(aes(x = reorder(url_source, count), 
                 y = count)) +
   labs(title = "Top 10 publication websites",
        subtitle = "in the selection of North Carolina Water News",
        x = "Website URL", y = "Count") +
   scale_x_discrete(labels = scales::label_wrap(15))+
   coord_flip() +
   theme_classic()

Method

We describe the raw data collection procedure of each dataset in this section. To reproduce the collection, you need Python 3 and the Python libraries listed in requirements.txt, which you can install with:

pip install -r requirements.txt

washdev

The washdev dataset is collected by web scraping with Python. The script can be found in inst/python/washdev_scraping.py. First, the link to each publication is scraped by iterating over the tables of contents of all volumes. This step produces a table containing the variables paper ID, volume number, issue number, publication url, journal title, publication title, and published year. This table is later merged with the variables retrieved for each publication to form the final dataset.
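The scraping script itself is not reproduced here. As a rough illustration of how the identifying variables could be recovered from a publication link, the sketch below assumes the URL layout `/washdev/article/<volume>/<issue>/<first page>/<paper ID>/<slug>`, which is inferred from the three sample rows shown in the Data section and may not hold for every article. The names `ARTICLE_URL`, `ArticleRef`, and `parse_article_url` are hypothetical and are not taken from inst/python/washdev_scraping.py.

```python
import re
from typing import NamedTuple, Optional

# Assumed URL layout (inferred from the sample rows, not confirmed by the script):
# https://iwaponline.com/washdev/article/<volume>/<issue>/<first page>/<paper id>/<slug>
ARTICLE_URL = re.compile(
    r"^https://iwaponline\.com/washdev/article/"
    r"(?P<volume>\d+)/(?P<issue>\d+)/(?P<page>\d+)/(?P<paperid>\d+)/"
)


class ArticleRef(NamedTuple):
    """Identifying fields of one publication link (hypothetical helper type)."""
    paperid: int
    volume: int
    issue: int
    paper_url: str


def parse_article_url(url: str) -> Optional[ArticleRef]:
    """Return the identifying fields of a link, or None if it does not match."""
    match = ARTICLE_URL.match(url)
    if match is None:
        return None
    return ArticleRef(
        paperid=int(match["paperid"]),
        volume=int(match["volume"]),
        issue=int(match["issue"]),
        paper_url=url,
    )


ref = parse_article_url(
    "https://iwaponline.com/washdev/article/1/1/13/28743/"
    "Vertical-flow-constructed-wetlands-as-an-emerging"
)
print(ref.paperid, ref.volume, ref.issue)
```

Returning None for links that do not match keeps navigation links and other non-article URLs found on a table-of-contents page out of the resulting table.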

Then, for each publication, we retrieve the needed variables from the publication’s HTML page using the publication url. The retrieval is rule-based: the script locates the relevant fields (e.g. supplementary materials) and extracts their values.
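To make "rule-based" concrete, here is a minimal sketch using only the Python standard library. The rule (treat any link whose text mentions "supplement" as supplementary material) and the sample markup are assumptions made for illustration; the actual rules and the real page structure are defined in the scraping script and on the journal website. `SupplementLinkParser` and `extract_supplement_fields` are hypothetical names.

```python
from html.parser import HTMLParser
from os.path import splitext
from urllib.parse import urlparse


class SupplementLinkParser(HTMLParser):
    """Collect hrefs of <a> tags whose link text mentions 'supplement'."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag
        self.supp_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Illustrative rule: the link text decides whether it is a supplement.
            if "supplement" in "".join(self._text).lower():
                self.supp_urls.append(self._href)
            self._href = None


def extract_supplement_fields(html: str) -> dict:
    """Derive the supplementary-material variables from one HTML page."""
    parser = SupplementLinkParser()
    parser.feed(html)
    urls = parser.supp_urls
    # File type is taken from the extension of the URL path, e.g. 'pdf'.
    file_types = [splitext(urlparse(u).path)[1].lstrip(".").lower() for u in urls]
    return {
        "is_supp": bool(urls),
        "num_supp": len(urls),
        "supp_file_type": file_types,
        "supp_url": urls,
    }


# Hypothetical markup, not copied from the journal website.
SAMPLE_HTML = """
<div class="article-body">
  <a href="/washdev/article/1/1/3/28745/some-article">Full text</a>
  <a href="https://example.org/files/appendix.pdf">Supplementary Material 1</a>
</div>
"""
print(extract_supplement_fields(SAMPLE_HTML))
```

Each variable would get its own rule of this kind; keeping the rules small and separate makes it easier to adjust one when the publisher changes its page layout.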

uncnewsletter

The uncnewsletter dataset is collected through a combination of web scraping and manual annotation. We first scrape all publication website links from the newsletter archive. The code can be found at inst/python/uncnewsletter_scraping.py. Two annotators then manually extracted the needed variables for these publications. For each publication, an annotator follows an annotation guide to fill in the values in a collaborative spreadsheet. The guide was converted into the data dictionary for this dataset.
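In the sample rows of uncnewsletter, url_source matches the host name of paper_url. Assuming that holds for the whole dataset (the source does not state how url_source is derived), the variable could be computed from the scraped links as sketched below. The function name `url_source` is chosen here for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse


def url_source(paper_url: str) -> str:
    """Return the host name of a publication link, or '' if there is none."""
    return urlparse(paper_url).hostname or ""


# Links taken from the three sample rows of the uncnewsletter dataset.
links = [
    "https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/ep.13800",
    "https://ajph.aphapublications.org/doi/abs/10.2105/AJPH.2022.307108",
    "https://aslopubs.onlinelibrary.wiley.com/doi/abs/10.1002/lom3.10469",
]

# Count publications per source website, most frequent first.
source_counts = Counter(url_source(link) for link in links)
for source, count in source_counts.most_common():
    print(source, count)
```

Counting by host name keeps subdomains such as aiche.onlinelibrary.wiley.com and aslopubs.onlinelibrary.wiley.com apart, which is also how they appear in the url_source column.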

License

Data are available as CC-BY.

Citation

Please cite this package using:

citation("washopenresearch")
#> To cite package 'washopenresearch' in publications use:
#> 
#>   Zhong M, Luz L, Schöbitz L (2024). "washopenresearch: Dataset about
#>   open research data information in Water, Sanitation, and Hygiene."
#>   doi:10.5281/zenodo.11185699
#>   <https://doi.org/10.5281/zenodo.11185699>,
#>   <https://github.com/openwashdata/washopenresearch>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Misc{zhong_etall:2024,
#>     title = {washopenresearch: Dataset about open research data information in Water, Sanitation, and Hygiene},
#>     author = {Mian Zhong and Ludwig Luz and Lars Schöbitz},
#>     year = {2024},
#>     doi = {10.5281/zenodo.11185699},
#>     url = {https://github.com/openwashdata/washopenresearch},
#>     abstract = {The goal of washopenresearch is to provide an overview of open research data related to Water Sanitation and Hygiene (WASH). The package provides access to two datasets `washdev` and `uncnewsletter`. Each dataset collects information on scientific articles about (1) article metadata (e.g. title, first author, correspondence author), (2) supplementary material information, (3) data availability statement, and (4) semantic information (e.g. keywords).},
#>     keywords = {open-data,open-research-data,open-science,openwashdata,sanitation,wash},
#>     version = {0.0.1},
#>   }