These functions provide access to the CategoryMembers endpoint of the Action API.
query_category_members() builds a generator query to return the members of a given category.
build_category_tree() finds all the pages and subcategories beneath the passed category, then recursively finds all the pages and subcategories beneath them, until it can find no more subcategories.
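The recursive walk described above can be pictured with a small offline sketch. This is not the package's implementation: `toy_wiki` and `walk_categories()` are hypothetical names invented for illustration, and the toy list stands in for the API calls that the real function makes.

```r
# Toy stand-in for Wikipedia: each category maps to its direct members.
# All names here are made up for illustration.
toy_wiki <- list(
  "Category:A" = c("Page 1", "Category:B"),
  "Category:B" = c("Page 2", "Category:C"),
  "Category:C" = c("Page 3")
)

walk_categories <- function(root, members_of) {
  visited <- character(0) # categories already expanded
  queue <- root           # categories still waiting to be expanded
  edges <- data.frame(
    source = character(0), target = character(0),
    stringsAsFactors = FALSE
  )
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% visited) next # skip categories seen before (guards against cycles)
    visited <- c(visited, current)
    children <- members_of[[current]] # NULL if the category has no entry
    if (length(children) == 0) next
    edges <- rbind(
      edges,
      data.frame(source = current, target = children, stringsAsFactors = FALSE)
    )
    # Only members that are themselves categories are expanded further
    queue <- c(queue, children[startsWith(children, "Category:")])
  }
  edges
}

toy_edges <- walk_categories("Category:A", toy_wiki)
toy_edges
```

The walk stops once the queue of unexpanded subcategories is empty, which mirrors the "until it can find no more subcategories" condition.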
Arguments
- .req
- category
The category to start from. query_category_members() accepts either a numeric pageid or the page title. build_category_tree() accepts a vector of page titles.
- namespace
Only return category members from the provided namespace.
- type
Alternative to namespace: the type of category member to return. Multiple types can be requested using a character vector. Defaults to all.
- limit
The number of category members to return in each batch. Max 500.
- sort
How to sort the returned category members. 'timestamp' sorts them by the date they were included in the category; 'sortkey' by the category member's unique hexadecimal code.
- dir
The direction in which to sort them.
- start
If sort == 'timestamp', only return category members included in the category after this date. The argument is parsed by lubridate::as_date().
- end
If sort == 'timestamp', only return category members included in the category before this date. The argument is parsed by lubridate::as_date().
- language
The language edition of Wikipedia to query.
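As a rough sketch of how these arguments combine, the request below asks for pages and subcategories added to a category within a date window. It assumes that "page" and "subcat" are accepted values for type (they match the Action API's own member types), and the dates are arbitrary. The native pipe |> is used only to avoid relying on a re-exported %>%. No output is shown, because the results depend on the live state of Wikipedia.

```r
library(wikkitidy)

# Build (but do not yet send) a request for recently added members of
# 'Category:Physics'. The type values and dates are illustrative assumptions.
recent_physics <- wiki_action_request() |>
  query_category_members(
    "Physics",
    type = c("page", "subcat"), # skip files; assumed valid type values
    limit = 50,                 # batch size, must not exceed 500
    sort = "timestamp",         # required for start and end to take effect
    start = "2024-01-01",       # parsed by lubridate::as_date()
    end = "2024-06-30"
  )

# Sending the request needs network access, so it is left commented out:
# first_batch <- next_batch(recent_physics)
```

Because start and end only apply when sort == 'timestamp', setting all three together keeps the intent of the query explicit.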
Value
query_category_members(): A request object of type generator/query/action_api/httr2_request, which can be passed to next_batch() or retrieve_all(). You can specify which properties to retrieve for each page using query_page_properties().
build_category_tree(): A list containing two dataframes. nodes lists all the subcategories and pages found underneath the passed categories. edges records the connections between them. The source column gives the pageid of the parent category, while the target column gives the pageid of any categories, pages or files contained within the source category. The timestamp records the moment when the target page or subcategory was included in the source category. The two dataframes in the list can be passed to igraph::graph_from_data_frame() for network analysis.
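To make the relationship between nodes and edges concrete, here is a minimal base R sketch that works offline. The two data frames are a hand-copied subset of the example output on this page, not a live result, and the helper columns parent, child and added are names introduced here for illustration.

```r
# Hand-copied subset of the nodes and edges from this page's example output
nodes <- data.frame(
  pageid = c(41181643L, 47888836L, 41148700L, 43770872L),
  title = c(
    "Category:Custard_(band)_albums",
    "Come Back, All Is Forgiven",
    "Category:Custard (band) compilation albums",
    "Goodbye Cruel World (Custard album)"
  ),
  type = c("root", "page", "subcat", "page"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  source = c(41181643L, 41181643L, 41148700L),
  target = c(47888836L, 41148700L, 43770872L),
  timestamp = c(
    "2015-09-21T10:58:43Z", "2013-11-21T14:38:43Z", "2015-04-26T23:42:41Z"
  ),
  stringsAsFactors = FALSE
)

# Translate the pageids in each edge into titles by looking them up in nodes
edges$parent <- nodes$title[match(edges$source, nodes$pageid)]
edges$child <- nodes$title[match(edges$target, nodes$pageid)]

# Parse the ISO 8601 timestamps so edges can be sorted or filtered by date
edges$added <- as.POSIXct(
  edges$timestamp, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"
)

# Count the direct members recorded for each parent category
members_per_parent <- table(edges$parent)
members_per_parent
```

Since source and target both hold pageids, the same match() lookup works for either column.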
Examples
# Get the first 10 pages in 'Category:Physics' on English Wikipedia
physics_members <- wiki_action_request() %>%
  query_category_members("Physics") %>%
  gracefully(next_batch)
physics_members
#> <complete/query_tbl>
#> ℹ There are more results on the server. Retrieve them with `next_batch()` or `retrieve_all()`
#> ✔ Data complete for all records
#> # A tibble: 10 × 3
#>      pageid    ns title
#>       <int> <int> <chr>
#>  1     6019     0 Computational chemistry
#>  2    22939     0 Physics
#>  3   844186     0 Modern physics
#>  4  1653925   100 Portal:Physics
#>  5 14647723     0 Disclination
#>  6 74609356     0 Force control
#>  7 74985603     0 Edge states
#>  8 75395346     0 Dynamic toroidal dipole
#>  9 75558170     0 Charge based boundary element fast multipole method
#> 10 75821836     0 Isoelectric (electric potential)
# Build the tree of all albums for the Brisbane band Custard
tree <- build_category_tree("Category:Custard_(band)_albums")
#> ⠙ Walking subcategories: 1 done (549/s) | 2ms
#> ⠹ Walking subcategories: 2 done (16/s) | 125ms
tree
#> $nodes
#> # A tibble: 11 × 4
#>      pageid    ns title                                      type
#>       <int> <int> <chr>                                      <chr>
#>  1 41181643    14 Category:Custard_(band)_albums             root
#>  2 47888836     0 Come Back, All Is Forgiven                 page
#>  3 59271122     0 The Common Touch (album)                   page
#>  4 30333352     0 Loverama                                   page
#>  5 63691299     0 Respect All Lifeforms                      page
#>  6 43770191     0 Wahooti Fandango                           page
#>  7 30333401     0 We Have the Technology                     page
#>  8 43769837     0 Wisenheimer                                page
#>  9 41148700    14 Category:Custard (band) compilation albums subcat
#> 10 43770688     0 Brisbane 1990–1993                         page
#> 11 43770872     0 Goodbye Cruel World (Custard album)        page
#>
#> $edges
#> # A tibble: 10 × 3
#>      source   target timestamp
#>       <int>    <int> <chr>
#>  1 41181643 47888836 2015-09-21T10:58:43Z
#>  2 41181643 59271122 2019-01-06T17:20:32Z
#>  3 41181643 30333352 2013-11-24T21:09:05Z
#>  4 41181643 63691299 2020-04-18T06:08:40Z
#>  5 41181643 43770191 2014-09-08T08:02:46Z
#>  6 41181643 30333401 2013-11-24T21:09:09Z
#>  7 41181643 43769837 2014-09-08T06:31:49Z
#>  8 41181643 41148700 2013-11-21T14:38:43Z
#>  9 41148700 43770688 2015-05-20T06:12:07Z
#> 10 41148700 43770872 2015-04-26T23:42:41Z
#>
# For network analysis and visualisation, you can pass the category tree
# to igraph
tree_graph <- igraph::graph_from_data_frame(tree$edges, vertices = tree$nodes)
tree_graph
#> IGRAPH f8a879c DN-B 11 10 --
#> + attr: name (v/c), ns (v/n), title (v/c), type (v/c), timestamp (e/c)
#> + edges from f8a879c (vertex names):
#> [1] 41181643->47888836 41181643->59271122 41181643->30333352 41181643->63691299
#> [5] 41181643->43770191 41181643->30333401 41181643->43769837 41181643->41148700
#> [9] 41148700->43770688 41148700->43770872