The HYDAT database is a massive hydrologic data resource. The functions in this package are designed to get the most out of the HYDAT database as quickly as possible, however in the process of reformatting the data to be more useful, the package modifies the original tables. This vignette is intended to demonstrate how to access tables within the database for your own custom HYDAT analysis, should additional information be needed. This should not be necessary for the vast majority of users, and is intended only for advanced R users.
Before loading the HYDAT database, the latest version of the database must be downloaded using hydat_download()
. This is a fairly lengthy operation (the download is around 1 GB) and may require several cups of coffee worth of your time.
hydat_download()
The HYDAT database is a SQLite database, which can be accessed in R using the dplyr and dbplyr packages. This package has simplified the connection process, so all you have to do to connect to the database is use hy_src()
.
<- hy_src() src
To list the tables, use src_tbls()
from the dplyr package.
src_tbls(src)
#> [1] "AGENCY_LIST" "ANNUAL_INSTANT_PEAKS" "ANNUAL_STATISTICS" "CONCENTRATION_SYMBOLS"
#> [5] "DATA_SYMBOLS" "DATA_TYPES" "DATUM_LIST" "DLY_FLOWS"
#> [9] "DLY_LEVELS" "MEASUREMENT_CODES" "OPERATION_CODES" "PEAK_CODES"
#> [13] "PRECISION_CODES" "REGIONAL_OFFICE_LIST" "SAMPLE_REMARK_CODES" "SED_DATA_TYPES"
#> [17] "SED_DLY_LOADS" "SED_DLY_SUSCON" "SED_SAMPLES" "SED_SAMPLES_PSD"
#> [21] "SED_VERTICAL_LOCATION" "SED_VERTICAL_SYMBOLS" "STATIONS" "STN_DATA_COLLECTION"
#> [25] "STN_DATA_RANGE" "STN_DATUM_CONVERSION" "STN_DATUM_UNRELATED" "STN_OPERATION_SCHEDULE"
#> [29] "STN_REGULATION" "STN_REMARKS" "STN_REMARK_CODES" "STN_STATUS_CODES"
#> [33] "VERSION"
To inspect any particular table, use the tbl()
function with the src
and the table name.
tbl(src, "STN_OPERATION_SCHEDULE")
#> # Source: table<STN_OPERATION_SCHEDULE> [?? x 5]
#> # Database: sqlite 3.37.2 [C:\work\_dev\GitHub_repos\tidyhydat\inst\test_db\tinyhydat.sqlite3]
#> STATION_NUMBER DATA_TYPE YEAR MONTH_FROM MONTH_TO
#> <chr> <chr> <int> <chr> <chr>
#> 1 01AP003 H 1923 <NA> <NA>
#> 2 01AP003 H 1924 <NA> <NA>
#> 3 01AP003 H 1925 <NA> <NA>
#> 4 01AP003 H 1926 <NA> <NA>
#> 5 01AP003 H 1927 <NA> <NA>
#> 6 01AP003 H 1928 <NA> <NA>
#> 7 01AP003 H 1929 <NA> <NA>
#> 8 01AP003 H 1930 <NA> <NA>
#> 9 01AP003 H 1931 <NA> <NA>
#> 10 01AP003 H 1932 <NA> <NA>
#> # ... with more rows
Working with SQL tables in dplyr is much like working with regular data frames, except no data is actually read from the database until necessary. Because some of these tables are large (particularly those containing the actual data), you will want to filter()
the tables before you collect()
them (the collect()
operation loads them into memory as a data.frame
).
tbl(src, "STN_OPERATION_SCHEDULE") %>%
filter(STATION_NUMBER == "05AA008") %>%
collect()
#> # A tibble: 103 x 5
#> STATION_NUMBER DATA_TYPE YEAR MONTH_FROM MONTH_TO
#> <chr> <chr> <int> <chr> <chr>
#> 1 05AA008 H 2012 JAN DEC
#> 2 05AA008 H 2013 JAN DEC
#> 3 05AA008 H 2014 JAN DEC
#> 4 05AA008 H 2015 JAN DEC
#> 5 05AA008 H 2016 JAN DEC
#> 6 05AA008 H 2017 JAN DEC
#> 7 05AA008 H 2018 JAN DEC
#> 8 05AA008 H 2019 JAN DEC
#> 9 05AA008 H 2020 JAN DEC
#> 10 05AA008 Q 1910 <NA> <NA>
#> # ... with 93 more rows
When you are finished with the database (i.e., the end of the script), it is good practice to close the connection (you may get a loud red warning if you don’t!).
hy_src_disconnect(src)