Hey, everyone, I’m Andrew Weatherman, creator of toRvik and lover of
college basketball analytics. The goal of toRvik is to expand access to
reliable, high-quality CBB statistics. While analogous packages exist to
pull data, like Saiem Gilani’s brilliant hoopR, toRvik requires no paid
subscription or set-up and can be immediately utilized by anyone with
just a few lines of code.
# You can install using {pacman} with the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("andreweatherman/toRvik", dependencies = TRUE, update = TRUE)
toRvik is a package of scrapers that pull data from Barttorvik, a
popular college basketball analytics website, and return it in tidy
format. Barttorvik splits its data on a number of variables and hosts
detailed player and game statistics, while serving as a reputable,
industry-recognized metric rating system. Generally speaking, all data
is available back to the 2007-08 season. More information about
Barttorvik, its data, and its metric rating system can be found here.
Package functions are syntactically structured to point to their data
source (e.g. by ‘player,’ ‘game,’ etc.) and should be considered get
functions by nature. As of toRvik version 1.0.1, the package exports
more than 20 functions covering the website and its data. Some
highlights include:
toRvik requires no set-up and can be instantly executed in any session.
To understand the package, the T-Rank functions, which pull and split
Barttorvik’s metric rating system, are an excellent place to start.
Let’s take a glance at the top teams in T-Rank using toRvik:
tictoc::tic()
toRvik::bart_ratings(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 × 19
#> team conf barthag barthag_rk adj_o adj_o_rk adj_d adj_d_rk adj_t adj_t_rk
#> <chr> <chr> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
#> 1 Gonzaga WCC 0.966 1 120. 4 89.9 9 72.6 5
#> 2 Houston Amer 0.959 2 117. 10 88.5 6 63.7 336
#> 3 Kansas B12 0.958 3 120. 5 91.3 13 69.1 71
#> 4 Texas … B12 0.951 4 111. 41 85.4 1 66.3 223
#> 5 Baylor B12 0.949 5 118. 8 91.3 14 67.6 149
#> 6 Duke ACC 0.944 6 123. 1 96.0 53 67.4 161
#> 7 Tennes… SEC 0.944 7 111. 34 87.1 3 67.4 164
#> 8 Villan… BE 0.935 8 117. 9 93.0 26 62.2 350
#> 9 Arizona P12 0.934 9 118. 7 93.7 35 72.3 9
#> 10 UCLA P12 0.932 10 116. 12 92.2 20 65.4 274
#> # … with 9 more variables: wab <dbl>, nc_elite_sos <dbl>, nc_fut_sos <dbl>,
#> # nc_cur_sos <dbl>, ov_elite_sos <dbl>, ov_fut_sos <dbl>, ov_cur_sos <dbl>,
#> # seed <int>, year <dbl>
tictoc::toc()
#> 3.384 sec elapsed
Here, the bart_ratings function returned the top ten teams in T-Rank
for the 2021-22 season (year=2022). We are also presented with each
team’s adjusted efficiencies, adjusted tempo, and two forms of strength
of schedule (documented in bart_ratings). But what if we want these same
measures in home games only? We would use bart_factors and input ‘home’
as venue:
tictoc::tic()
toRvik::bart_factors(venue='home') %>%
  utils::head(10)
#> # A tibble: 10 × 23
#> team conf barthag rec wins games adj_t adj_o off_efg off_to off_or
#> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Houston Amer 0.978 16–1 16 17 65.8 116. 54.3 16.5 39.1
#> 2 Baylor B12 0.968 15–2 15 17 69.3 118. 55.1 17.6 38.1
#> 3 Gonzaga WCC 0.966 16–0 16 16 73.1 121. 60 15.3 31.8
#> 4 Texas Tech B12 0.965 18–0 18 18 68.2 117. 57 19.4 38.3
#> 5 Auburn SEC 0.961 16–0 16 16 72.5 116. 52.6 16.3 32.5
#> 6 Tennessee SEC 0.959 16–0 16 16 68.9 113. 53 18.3 37.5
#> 7 Villanova BE 0.956 12–1 12 13 62.6 122. 57.2 14.5 29.4
#> 8 UCLA P12 0.952 14–1 14 15 69.4 116. 54.3 13.4 30.7
#> 9 Purdue B10 0.950 16–1 16 17 67.6 125. 58.3 16.8 38.5
#> 10 Texas B12 0.948 16–3 16 19 63.6 110. 51.1 18 33.8
#> # … with 12 more variables: off_ftr <dbl>, adj_d <dbl>, def_efg <dbl>,
#> # def_to <dbl>, def_or <dbl>, def_ftr <dbl>, wab <dbl>, year <dbl>,
#> # venue <chr>, type <chr>, top <dbl>, quad <chr>
tictoc::toc()
#> 2.188 sec elapsed
And now, we have four-factor data and metric ratings for home games
only. The bart_factors function and the analogous bart_conf_factors
both take venue, game type, date range, and opponent strength as
additional splits. Great, but what if we want to explore rating trends
over time? toRvik gives us that ability with bart_archive, a function
that pulls adjusted ratings and projected records from the morning of a
desired date:
tictoc::tic()
toRvik::bart_archive('20220113') %>%
  utils::head(10)
#> # A tibble: 10 × 16
#> rk team conf rec adj_o adj_o_rk adj_d adj_d_rk barthag proj_rec
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 Gonzaga WCC 13-2 124. 2 93.3 20 0.964 26-3
#> 2 2 Baylor B12 15-1 121. 5 91.3 13 0.962 26-5
#> 3 3 Houston Amer 14-2 120. 6 91 11 0.961 27-4
#> 4 4 Auburn SEC 15-1 115. 20 89.3 8 0.947 27-4
#> 5 5 LSU SEC 15-1 106. 117 82.4 1 0.946 27-4
#> 6 6 Arizona P12 13-1 117. 15 91 12 0.946 27-4
#> 7 7 Villanova BE 12-4 117. 13 91.6 16 0.942 24-6
#> 8 8 Kansas B12 13-2 122. 4 96.5 50 0.938 24-7
#> 9 9 Purdue B10 13-2 125 1 98.9 94 0.937 24-7
#> 10 10 Duke ACC 13-2 117. 11 93.8 23 0.926 26-5
#> # … with 6 more variables: proj_conf_rec <chr>, wab <dbl>, wab_rk <dbl>,
#> # cur_rk <dbl>, change <dbl>, date <date>
tictoc::toc()
#> 0.838 sec elapsed
At this time, bart_archive only takes a single date, but if you want to
track longer periods, I suggest looking into mapping packages such as
purrr.
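As a rough sketch of that idea, the snippet below builds a run of weekly dates in the YYYYMMDD format used above and maps a fetch function over them with purrr::map. To keep it runnable offline, fetch_archive is a hypothetical stand-in that returns a tiny made-up data frame; in a live session you would presumably swap in toRvik::bart_archive, on the assumption that it accepts each date string exactly as in the call above.

```r
library(purrr)

# Weekly dates from Jan 13 to Feb 10, 2022, formatted as YYYYMMDD strings
dates <- format(
  seq(as.Date("2022-01-13"), as.Date("2022-02-10"), by = "week"),
  "%Y%m%d"
)

# Hypothetical stand-in so the sketch runs without a network call.
# The teams and ranks are placeholders, not real archive data.
# In a live session, replace the body with toRvik::bart_archive(date).
fetch_archive <- function(date) {
  data.frame(
    team = c("Team A", "Team B"),
    rk   = c(1, 2),
    date = as.Date(date, format = "%Y%m%d")
  )
}

# Call the fetcher once per date, then stack the pieces into one data frame
archive_trend <- do.call(rbind, map(dates, fetch_archive))
```

Each element of the list returned by map is one day's pull, so stacking them gives one long table keyed by date, which is a convenient shape for plotting a team's rating over time.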
Perhaps the most valuable functions in toRvik concern granular
analysis. The package gives us the ability to explore advanced
statistics at a game-by-game level for every Division 1 player since the
2007-08 season using bart_player_game.
Please note: This function returns a large tibble with >100,000 rows for each completed season. If you will be performing analyses on this data, it is recommended to store a fresh tibble as a safety variable.
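One way to follow that advice is sketched below with a small made-up data frame standing in for the full pull (the rows are placeholders, not real Barttorvik data). The object names player_games_raw and duke_games are only illustrative. The point is to keep the original pull untouched and do all filtering and sorting on a copy, so a mistake never forces another slow scrape.

```r
# Placeholder rows standing in for the season-long pull; in a live session
# this object would hold the result of the bart_player_game call
player_games_raw <- data.frame(
  player = c("Player A", "Player B", "Player C", "Player D"),
  team   = c("Duke", "Duke", "Kansas", "Baylor"),
  net    = c(12.1, -3.4, 8.8, 5.0)
)

# Work on a copy: filter to one team and sort by net, highest first
duke_games <- player_games_raw
duke_games <- duke_games[duke_games$team == "Duke", ]
duke_games <- duke_games[order(-duke_games$net), ]

# The raw object is unchanged, so any new analysis can start from it again
nrow(player_games_raw)
```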
tictoc::tic()
toRvik::bart_player_game(year=2022, stat='adv') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(net)) %>%
  utils::head(10)
#> # A tibble: 10 × 24
#> date year player exp team opp result min pts usg ortg
#> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 2021-12-14 2022 AJ Griffin Fr Duke Sout… W 22 19 16.7 214.
#> 2 2021-11-12 2022 Wendell Mo… Jr Duke Army W 35 19 22.9 142.
#> 3 2021-11-19 2022 Wendell Mo… Jr Duke Lafa… W 29 23 25.2 159.
#> 4 2022-01-15 2022 Mark Willi… So Duke Nort… W 27 19 25 144.
#> 5 2022-03-18 2022 Mark Willi… So Duke Cal … W 32 15 19.5 156.
#> 6 2021-11-22 2022 Paolo Banc… Fr Duke The … W 31 28 29.3 157.
#> 7 2022-03-24 2022 Paolo Banc… Fr Duke Texa… W 37 22 23.6 146.
#> 8 2022-01-29 2022 AJ Griffin Fr Duke Loui… W 34 22 17.2 163.
#> 9 2022-03-01 2022 Trevor Kee… Fr Duke Pitt… W 34 27 25.9 175.
#> 10 2021-11-19 2022 AJ Griffin Fr Duke Lafa… W 21 18 16.4 188.
#> # … with 13 more variables: or_pct <dbl>, dr_pct <dbl>, ast_pct <dbl>,
#> # to_pct <dbl>, stl_pct <dbl>, blk_pct <dbl>, bpm <dbl>, obpm <dbl>,
#> # dbpm <dbl>, net <dbl>, poss <dbl>, id <dbl>, game_id <chr>
tictoc::toc()
#> 16.242 sec elapsed
Here, bart_player_game returned the 10 highest individual net BPMs by a
Duke player this season. The function takes ‘box,’ ‘shooting,’ and ‘adv’
as stat inputs, and I welcome you to explore each one in your own
session. But what if we want to investigate similar performance at a
season level? Well, bart_player_season gives us that option, and it also
takes ‘box,’ ‘shooting,’ and ‘adv’ as stat inputs.
tictoc::tic()
toRvik::bart_player_season(year=2022, stat='shooting') %>%
  dplyr::filter(team=='Duke') %>%
  dplyr::arrange(desc(mid_a)) %>%
  utils::head(5)
#> # A tibble: 5 × 32
#> player pos exp team conf g mpg ppg p_per usg ortg efg ts
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Paolo… Wing… Fr Duke ACC 39 33.0 17.2 20.9 27.2 111. 52 55.7
#> 2 Wende… Comb… Jr Duke ACC 39 33.9 13.4 15.8 20.3 121. 56.9 60.5
#> 3 AJ Gr… Wing… Fr Duke ACC 39 24.3 10.4 17.1 16.9 127. 61.3 63.0
#> 4 Jerem… Comb… So Duke ACC 39 29 8.62 11.9 17.7 105. 47.7 51.5
#> 5 Trevo… Comb… Fr Duke ACC 36 30.2 11.5 15.2 20.1 110. 49.6 52.0
#> # … with 19 more variables: ftm <dbl>, fta <dbl>, ft_pct <dbl>, two_m <dbl>,
#> # two_a <dbl>, two_pct <dbl>, three_m <dbl>, three_a <dbl>, three_pct <dbl>,
#> # dunk_m <dbl>, dunk_a <dbl>, dunk_pct <dbl>, rim_m <dbl>, rim_a <dbl>,
#> # rim_pct <dbl>, mid_m <dbl>, mid_a <dbl>, mid_pct <dbl>, id <dbl>
tictoc::toc()
#> 1.892 sec elapsed
And now, we have a tibble of season-long shooting data for Duke
players, sorted by number of mid-range attempts. Advanced metric data
can be pulled by team on a per-game basis using bart_team_schedule, and
total team shooting splits can be accessed using bart_team_shooting.
Game box data can be pulled with bart_game_total.
Lastly for this introductory vignette, we will explore toRvik functions
for scraping tournament data. Spend any time on social media in college
basketball circles in March, and you will undoubtedly hear about ‘team
sheets,’ detailed repositories of strength and quality metrics used by
the seeding and selection committee. With bart_tourney_sheets, you can
pull ‘quick-hit’ team sheets in tidy format with just a single line of
code:
tictoc::tic()
toRvik::bart_tourney_sheets(year=2022) %>%
  utils::head(10)
#> # A tibble: 10 × 16
#> team seed net kpi sor res_avg bpi kp sag qual_avg q1a q1
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Gonza… 1 1 5 7 6 1 1 1 1 5-2 10-3
#> 2 Arizo… 1 2 3 2 2.5 3 2 2 2.3 4-2 6-3
#> 3 Houst… 5 3 13 14 13.5 2 4 5 3.7 0-3 1-4
#> 4 Baylor 1 4 2 4 3 6 5 4 5 4-4 10-5
#> 5 Kentu… 2 5 9 5 7 4 3 6 4.3 3-6 9-7
#> 6 Kansas 1 6 1 1 1 8 6 3 5.7 4-4 12-5
#> 7 Tenne… 3 7 4 3 3.5 5 7 7 6.3 4-7 11-7
#> 8 Villa… 2 8 7 8 7.5 7 11 9 9 5-4 7-6
#> 9 Texas… 3 9 17 12 14.5 13 9 14 12 5-5 8-9
#> 10 UCLA 4 10 11 15 13 9 8 10 9 2-4 5-4
#> # … with 4 more variables: q2 <chr>, q1_2 <chr>, q3 <chr>, q4 <chr>
tictoc::toc()
#> 0.824 sec elapsed
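A quick note on two columns in the output above: judging by the printed rows, res_avg appears to be the plain mean of the two résumé rankings (kpi and sor), and qual_avg the mean of the three quality rankings (bpi, kp, and sag) rounded to one decimal. That reading is inferred from the numbers, not taken from the package documentation, and the short check below simply recomputes both columns for the first five rows, identified here by their net rank.

```r
# Values copied from the first five rows of the printed output above
sheets <- data.frame(
  net      = 1:5,
  kpi      = c(5, 3, 13, 2, 9),
  sor      = c(7, 2, 14, 4, 5),
  res_avg  = c(6, 2.5, 13.5, 3, 7),
  bpi      = c(1, 3, 2, 6, 4),
  kp       = c(1, 2, 4, 5, 3),
  sag      = c(1, 2, 5, 4, 6),
  qual_avg = c(1, 2.3, 3.7, 5, 4.3)
)

# Recompute both averages under the assumed definitions
res_check  <- (sheets$kpi + sheets$sor) / 2
qual_check <- round((sheets$bpi + sheets$kp + sheets$sag) / 3, 1)

# Compare with a small tolerance rather than exact floating-point equality
res_ok  <- all(abs(res_check - sheets$res_avg) < 1e-8)
qual_ok <- all(abs(qual_check - sheets$qual_avg) < 1e-8)
```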
Returned are sheets of top teams sorted by their NCAA NET ranking.
Because this function relies on NET data, it is only available back to
the 2018-19 season. In-season performance is valuable, but what if you
want to investigate just tournament data? Well, toRvik gives you two
options to do so: bart_tourney_odds and bart_tourney_results. The former
returns metric-adjusted round probabilities by split. Let’s explore
round odds for the 2022 NCAA Tournament:
tictoc::tic()
toRvik::bart_tourney_odds(year=2022, odds='pre') %>%
  dplyr::arrange(desc(s16)) %>%
  utils::head(10)
#> # A tibble: 10 × 11
#> seed region team conf r64 r32 s16 e8 f4 f2 champ
#> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 West Gonzaga WCC 100 96.6 81.9 69.6 52 38.5 27.5
#> 2 1 Midwest Kansas B12 100 96.3 73.7 48.7 32.5 17.7 8.5
#> 3 1 South Arizona P12 100 94.8 72.7 37.3 21.2 12 5.4
#> 4 1 East Baylor B12 100 94.9 72.5 42.9 25.2 11.1 5.8
#> 5 2 Midwest Auburn SEC 100 91.5 70 48.4 24.8 11.7 4.8
#> 6 2 West Duke ACC 100 94.1 69.8 38.9 15.5 8.2 4
#> 7 3 West Texas Tech B12 100 92.6 68.4 40.9 17.1 9.5 5
#> 8 3 South Tennessee SEC 100 92.3 67.5 41 20.8 11.6 5.2
#> 9 5 Midwest Iowa B10 100 84.3 64.5 32.2 19.3 9.2 3.7
#> 10 2 South Villanova BE 100 90.8 63.6 34.6 16.1 8.4 3.5
tictoc::toc()
#> 0.233 sec elapsed
With the ‘odds’ argument set to ‘pre,’ we returned pre-tournament odds
and sorted by likelihood to reach the second weekend (Sweet 16).
bart_tourney_odds also takes current odds (‘current’), odds based on
recent performance (‘recent’), and odds based on games against strong
opponents (‘t100’). This data is similarly available starting with the
2019 tournament. Now, what if we want to explore tournament results?
tictoc::tic()
toRvik::bart_tourney_results(min_year=2011, max_year=2021, type='conf') %>%
  utils::head(5)
#> # A tibble: 5 × 18
#> conf pake pase wins loss w_percent r64 r32 s16 e8 f4 f2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 P12 11.2 11.4 55 38 0.591 38 27 18 8 2 0
#> 2 SEC 10.9 15.5 78 48 0.619 49 33 21 14 7 2
#> 3 MVC 4.1 6.1 19 15 0.559 15 11 4 2 2 0
#> 4 ACC 3.6 -0.3 102 61 0.626 64 44 31 15 5 4
#> 5 Horz 2.6 3 5 10 0.333 10 1 1 1 1 1
#> # … with 6 more variables: champ <dbl>, top2 <dbl>, f4_percent <dbl>,
#> # champ_percent <dbl>, from <dbl>, to <dbl>
tictoc::toc()
#> 0.569 sec elapsed
With bart_tourney_results, we can return raw and adjusted outcomes by
split. Here, we returned aggregate conference results from 2011 to 2021,
sorted by PAKE, the number of wins attained above or below KenPom
expectation. The function splits by team (‘team’), conference (‘conf’),
coach (‘coach’), and seed (‘seed’) and includes data starting in 2000.
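Among the raw outcomes, the w_percent column looks like wins divided by total games (wins plus losses). That definition is an inference from the printed numbers, not something confirmed from the package documentation; the small check below recomputes it from the five conference rows shown above.

```r
# Wins, losses, and win percentage copied from the printed conference output
conf_results <- data.frame(
  conf      = c("P12", "SEC", "MVC", "ACC", "Horz"),
  wins      = c(55, 78, 19, 102, 5),
  loss      = c(38, 48, 15, 61, 10),
  w_percent = c(0.591, 0.619, 0.559, 0.626, 0.333)
)

# Recompute win percentage as wins over games played
conf_results$w_check <- conf_results$wins / (conf_results$wins + conf_results$loss)

# The printed values are shown to three decimals, so allow a rounding tolerance
w_ok <- all(abs(conf_results$w_check - conf_results$w_percent) < 0.0005)
```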
toRvik includes several additional functions and capabilities that I
did not describe here; take time to explore them and those detailed in
this introduction. If you have any questions, feel free to message me on
Twitter. If you run into any bugs, please open an issue on GitHub.
Happy exploring!