These functions provide access to the CategoryMembers endpoint of the Action API.

query_category_members() builds a generator query to return the members of a given category.

build_category_tree() finds all the pages and subcategories beneath the passed category, then recursively finds all the pages and subcategories beneath them, until it can find no more subcategories.
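The recursive walk described above can be pictured with a small offline sketch. This is not the package's implementation: `members` and `walk_categories()` are hypothetical names, and the toy list stands in for what the API would return for each category.

```r
# Toy category structure: each category maps to its direct members.
# Hypothetical data, used here instead of live API responses.
members <- list(
  "Category:A" = c("Page 1", "Category:B"),
  "Category:B" = c("Page 2", "Category:C"),
  "Category:C" = c("Page 3")
)

# Hypothetical helper: walk from `root` until no unvisited subcategories remain,
# recording one source -> target edge per category member.
walk_categories <- function(root, members) {
  queue <- root
  seen <- character()
  edges <- data.frame(source = character(), target = character())
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% seen) next  # guard against category cycles
    seen <- c(seen, current)
    children <- members[[current]]
    if (length(children) == 0) next
    edges <- rbind(edges, data.frame(source = current, target = children))
    # Only subcategories are queued for another round; pages are leaves.
    queue <- c(queue, children[startsWith(children, "Category:")])
  }
  edges
}

walk_categories("Category:A", members)
```

The result has one row per membership, which mirrors the shape of the edges dataframe described under Value.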

Usage

query_category_members(
  .req,
  category,
  namespace = NULL,
  type = c("file", "page", "subcat"),
  limit = 10,
  sort = c("sortkey", "timestamp"),
  dir = c("ascending", "descending", "newer", "older"),
  start = NULL,
  end = NULL,
  language = "en"
)

build_category_tree(category, language = "en")

Arguments

.req

A query request object

category

The category to start from. query_category_members() accepts either a numeric pageid or the page title. build_category_tree() accepts a vector of page titles.

namespace

Only return category members from the provided namespace

type

Alternative to namespace: the type of category member to return. Multiple types can be requested using a character vector. Defaults to all.

limit

The number of category members to return in each batch. The maximum is 500.

sort

How to sort the returned category members. 'timestamp' sorts them by the date they were included in the category; 'sortkey' sorts them by the category member's unique hexadecimal code.

dir

The direction in which to sort them

start

If sort == 'timestamp', only return category members included in the category after this date. The argument is parsed by lubridate::as_date().

end

If sort == 'timestamp', only return category members included in the category before this date. The argument is parsed by lubridate::as_date().

language

The language edition of Wikipedia to query
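As a sketch of how the sort, start and end arguments combine, the request below asks for pages added to 'Category:Physics' during 2023. It assumes the wikkitidy package is installed; the dates and the limit are arbitrary illustrative values, and the native pipe requires R 4.1 or later. The request is only built here, not sent.

```r
library(wikkitidy)

# Build (but do not yet send) a request for pages that were added to
# Category:Physics between the two dates, sorted by inclusion timestamp.
physics_2023 <- wiki_action_request() |>
  query_category_members(
    "Physics",
    type = "page",
    sort = "timestamp",
    start = "2023-01-01",
    end = "2023-12-31",
    limit = 50
  )

# Sending the request needs network access, so it is left commented out:
# first_batch <- next_batch(physics_2023)
```

Because start and end only apply when sort is 'timestamp', leaving sort at its default of 'sortkey' would not give a date-restricted result.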

Value

query_category_members(): A request object of type generator/query/action_api/httr2_request, which can be passed to next_batch() or retrieve_all(). You can specify which properties to retrieve for each page using query_page_properties().

build_category_tree(): A list containing two dataframes. nodes lists all the subcategories and pages found underneath the passed categories. edges records the connections between them. The source column gives the pageid of the parent category, while the target column gives the pageid of any category, page or file contained within the source category. The timestamp column records the moment when the target page or subcategory was included in the source category. The two dataframes in the list can be passed to igraph::graph_from_data_frame() for network analysis.
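Because edges stores only pageids, it is often useful to join it back to nodes. The sketch below does this in base R on a hand-built stand-in for the returned list, using a few rows copied from the Custard example on this page so that it runs offline; `titles` and `labelled` are illustrative names.

```r
# Hand-built stand-in for a build_category_tree() result (subset of the
# example output on this page), so no network access is needed.
tree <- list(
  nodes = data.frame(
    pageid = c(41181643L, 30333352L, 41148700L, 43770872L),
    ns     = c(14L, 0L, 14L, 0L),
    title  = c("Category:Custard_(band)_albums", "Loverama",
               "Category:Custard (band) compilation albums",
               "Goodbye Cruel World (Custard album)"),
    type   = c("root", "page", "subcat", "page")
  ),
  edges = data.frame(
    source    = c(41181643L, 41181643L, 41148700L),
    target    = c(30333352L, 41148700L, 43770872L),
    timestamp = c("2013-11-24T21:09:05Z", "2013-11-21T14:38:43Z",
                  "2015-04-26T23:42:41Z")
  )
)

# Named lookup vector: pageid (as character) -> title.
titles <- setNames(as.character(tree$nodes$title), tree$nodes$pageid)

# Replace the pageids on each edge with titles, and parse the date part
# of the ISO 8601 timestamp.
labelled <- data.frame(
  parent = unname(titles[as.character(tree$edges$source)]),
  child  = unname(titles[as.character(tree$edges$target)]),
  added  = as.Date(substr(tree$edges$timestamp, 1, 10))
)
labelled
```

The same lookup works on a real result, since source and target both refer to values in the pageid column of nodes.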

Examples

# Get the first 10 pages in 'Category:Physics' on English Wikipedia
physics_members <- wiki_action_request() %>%
  query_category_members("Physics") %>%
  gracefully(next_batch)
physics_members
#> <complete/query_tbl>
#>  There are more results on the server. Retrieve them with `next_batch()` or `retrieve_all()`
#>  Data complete for all records
#> # A tibble: 10 × 3
#>      pageid    ns title                                              
#>       <int> <int> <chr>                                              
#>  1     6019     0 Computational chemistry                            
#>  2    22939     0 Physics                                            
#>  3   844186     0 Modern physics                                     
#>  4  1653925   100 Portal:Physics                                     
#>  5 14647723     0 Disclination                                       
#>  6 74609356     0 Force control                                      
#>  7 74985603     0 Edge states                                        
#>  8 75395346     0 Dynamic toroidal dipole                            
#>  9 75558170     0 Charge based boundary element fast multipole method
#> 10 75821836     0 Isoelectric (electric potential)                   


# Build the tree of all albums for the Brisbane band Custard
tree <- build_category_tree("Category:Custard_(band)_albums")
#> ⠙ Walking subcategories: 1 done (549/s) | 2ms
#> ⠹ Walking subcategories: 2 done (16/s) | 125ms
tree
#> $nodes
#> # A tibble: 11 × 4
#>      pageid    ns title                                      type  
#>       <int> <int> <chr>                                      <chr> 
#>  1 41181643    14 Category:Custard_(band)_albums             root  
#>  2 47888836     0 Come Back, All Is Forgiven                 page  
#>  3 59271122     0 The Common Touch (album)                   page  
#>  4 30333352     0 Loverama                                   page  
#>  5 63691299     0 Respect All Lifeforms                      page  
#>  6 43770191     0 Wahooti Fandango                           page  
#>  7 30333401     0 We Have the Technology                     page  
#>  8 43769837     0 Wisenheimer                                page  
#>  9 41148700    14 Category:Custard (band) compilation albums subcat
#> 10 43770688     0 Brisbane 1990–1993                         page  
#> 11 43770872     0 Goodbye Cruel World (Custard album)        page  
#> 
#> $edges
#> # A tibble: 10 × 3
#>      source   target timestamp           
#>       <int>    <int> <chr>               
#>  1 41181643 47888836 2015-09-21T10:58:43Z
#>  2 41181643 59271122 2019-01-06T17:20:32Z
#>  3 41181643 30333352 2013-11-24T21:09:05Z
#>  4 41181643 63691299 2020-04-18T06:08:40Z
#>  5 41181643 43770191 2014-09-08T08:02:46Z
#>  6 41181643 30333401 2013-11-24T21:09:09Z
#>  7 41181643 43769837 2014-09-08T06:31:49Z
#>  8 41181643 41148700 2013-11-21T14:38:43Z
#>  9 41148700 43770688 2015-05-20T06:12:07Z
#> 10 41148700 43770872 2015-04-26T23:42:41Z
#> 

# For network analysis and visualisation, you can pass the category tree
# to igraph
tree_graph <- igraph::graph_from_data_frame(tree$edges, vertices = tree$nodes)
tree_graph
#> IGRAPH f8a879c DN-B 11 10 -- 
#> + attr: name (v/c), ns (v/n), title (v/c), type (v/c), timestamp (e/c)
#> + edges from f8a879c (vertex names):
#>  [1] 41181643->47888836 41181643->59271122 41181643->30333352 41181643->63691299
#>  [5] 41181643->43770191 41181643->30333401 41181643->43769837 41181643->41148700
#>  [9] 41148700->43770688 41148700->43770872