5.2 Where to find textual datasets?

There are various sources where one can easily find textual corpora; see this link for an overview.

This session is inspired by the data manipulation examples in the Quanteda Tutorials.

We have installed various packages that allow you to obtain textual data. Here are a few examples.
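
For reference, the examples below assume that the following packages are already installed and loaded (a minimal setup sketch; adjust it to the packages you actually use):

library(tidyverse)   # dplyr, tidyr, purrr, stringr, tibble, ggplot2
library(janeaustenr) # austen_books()
library(proustr)     # proust_books()
library(gutenbergr)  # gutenberg_metadata, gutenberg_works(), gutenberg_download()
library(textdata)    # catalogue, lexicon_afinn()
library(readtext)    # readtext()
library(rvest)       # read_html(), html_nodes(), html_text()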

5.2.1 janeaustenr

5.2.1.1 Look at books

This package includes 6 books by Jane Austen.

austen_books() %>% 
  glimpse()
## Rows: 73,422
## Columns: 2
## $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811)", "", …
## $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibility, Se…

5.2.1.2 Summary

The individual books are available as datasets named emma, mansfieldpark, northangerabbey, persuasion, prideprejudice and sensesensibility.

We use summary() to get the number of text lines in each book.

austen_books() %>%
  summary()
##      text                            book      
##  Length:73422       Sense & Sensibility:12624  
##  Class :character   Pride & Prejudice  :13030  
##  Mode  :character   Mansfield Park     :15349  
##                     Emma               :16235  
##                     Northanger Abbey   : 7856  
##                     Persuasion         : 8328
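
The same per-book line counts can be obtained with a dplyr verb (a minimal sketch using count()):

austen_books() %>% 
  count(book)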

5.2.1.3 Import into Global Environment

We load the data for the first book into the Global Environment; the other books can be loaded the same way.

data(sensesensibility)
#data(prideprejudice)
#data(mansfieldpark)
#data(emma)
#data(northangerabbey)
#data(persuasion)

5.2.1.4 Look into data

We can then look at the first 60 rows of “sensesensibility”.

sensesensibility %>% 
  head(n = 60)
##  [1] "SENSE AND SENSIBILITY"                                                  
##  [2] ""                                                                       
##  [3] "by Jane Austen"                                                         
##  [4] ""                                                                       
##  [5] "(1811)"                                                                 
##  [6] ""                                                                       
##  [7] ""                                                                       
##  [8] ""                                                                       
##  [9] ""                                                                       
## [10] "CHAPTER 1"                                                              
## [11] ""                                                                       
## [12] ""                                                                       
## [13] "The family of Dashwood had long been settled in Sussex.  Their estate"  
## [14] "was large, and their residence was at Norland Park, in the centre of"   
## [15] "their property, where, for many generations, they had lived in so"      
## [16] "respectable a manner as to engage the general good opinion of their"    
## [17] "surrounding acquaintance.  The late owner of this estate was a single"  
## [18] "man, who lived to a very advanced age, and who for many years of his"   
## [19] "life, had a constant companion and housekeeper in his sister.  But her" 
## [20] "death, which happened ten years before his own, produced a great"       
## [21] "alteration in his home; for to supply her loss, he invited and received"
## [22] "into his house the family of his nephew Mr. Henry Dashwood, the legal"  
## [23] "inheritor of the Norland estate, and the person to whom he intended to" 
## [24] "bequeath it.  In the society of his nephew and niece, and their"        
## [25] "children, the old Gentleman's days were comfortably spent.  His"        
## [26] "attachment to them all increased.  The constant attention of Mr. and"   
## [27] "Mrs. Henry Dashwood to his wishes, which proceeded not merely from"     
## [28] "interest, but from goodness of heart, gave him every degree of solid"   
## [29] "comfort which his age could receive; and the cheerfulness of the"       
## [30] "children added a relish to his existence."                              
## [31] ""                                                                       
## [32] "By a former marriage, Mr. Henry Dashwood had one son: by his present"   
## [33] "lady, three daughters.  The son, a steady respectable young man, was"   
## [34] "amply provided for by the fortune of his mother, which had been large," 
## [35] "and half of which devolved on him on his coming of age.  By his own"    
## [36] "marriage, likewise, which happened soon afterwards, he added to his"    
## [37] "wealth.  To him therefore the succession to the Norland estate was not" 
## [38] "so really important as to his sisters; for their fortune, independent"  
## [39] "of what might arise to them from their father's inheriting that"        
## [40] "property, could be but small.  Their mother had nothing, and their"     
## [41] "father only seven thousand pounds in his own disposal; for the"         
## [42] "remaining moiety of his first wife's fortune was also secured to her"   
## [43] "child, and he had only a life-interest in it."                          
## [44] ""                                                                       
## [45] "The old gentleman died: his will was read, and like almost every other" 
## [46] "will, gave as much disappointment as pleasure.  He was neither so"      
## [47] "unjust, nor so ungrateful, as to leave his estate from his nephew;--but"
## [48] "he left it to him on such terms as destroyed half the value of the"     
## [49] "bequest.  Mr. Dashwood had wished for it more for the sake of his wife" 
## [50] "and daughters than for himself or his son;--but to his son, and his"    
## [51] "son's son, a child of four years old, it was secured, in such a way, as"
## [52] "to leave to himself no power of providing for those who were most dear" 
## [53] "to him, and who most needed a provision by any charge on the estate, or"
## [54] "by any sale of its valuable woods.  The whole was tied up for the"      
## [55] "benefit of this child, who, in occasional visits with his father and"   
## [56] "mother at Norland, had so far gained on the affections of his uncle, by"
## [57] "such attractions as are by no means unusual in children of two or three"
## [58] "years old; an imperfect articulation, an earnest desire of having his"  
## [59] "own way, many cunning tricks, and a great deal of noise, as to outweigh"
## [60] "all the value of all the attention which, for years, he had received"

5.2.2 proustr

5.2.2.1 Look at books

This package includes 7 books by Marcel Proust, written in French.

proust_books() %>% 
  glimpse()
## Rows: 4,690
## Columns: 4
## $ text   <chr> "Longtemps, je me suis couché de bonne heure. Parfois, à peine …
## $ book   <chr> "Du côté de chez Swann", "Du côté de chez Swann", "Du côté de c…
## $ volume <chr> "Première partie : Combray", "Première partie : Combray", "Prem…
## $ year   <dbl> 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 191…

5.2.2.2 Summary

We use summary() to get the number of rows per book; book is first converted to a factor so that summary() tabulates it.

proust_books() %>%
  mutate(book = factor(book)) %>% 
  summary()
##      text                                              book     
##  Length:4690        À l’ombre des jeunes filles en fleurs: 792  
##  Class :character   Albertine disparue                   : 259  
##  Mode  :character   Du côté de chez Swann                :1004  
##                     La Prisonnière                       : 365  
##                     Le Côté de Guermantes                :1610  
##                     Le Temps retrouvé                    : 248  
##                     Sodome et Gomorrhe                   : 412  
##     volume               year     
##  Length:4690        Min.   :1913  
##  Class :character   1st Qu.:1919  
##  Mode  :character   Median :1921  
##                     Mean   :1919  
##                     3rd Qu.:1921  
##                     Max.   :1927  
##                     NA's   :159

5.2.2.3 Import into Global Environment

We load the data for the first book into the Global Environment; the other books can be loaded the same way.

data(ducotedechezswann)
#data(alombredesjeunesfillesenfleurs)
#data(lecotedeguermantes)
#data(sodomeetgomorrhe)
#data(laprisonniere)
#data(albertinedisparue)
#data(letempsretrouve)

5.2.2.4 Look into data

We can then look at the first 60 rows of the first book, ducotedechezswann.

ducotedechezswann %>%
  head(n = 60)
## # A tibble: 60 × 4
##    text                                                       book  volume  year
##    <chr>                                                      <chr> <chr>  <dbl>
##  1 "Longtemps, je me suis couché de bonne heure. Parfois, à … Du c… Premi…  1913
##  2 "J'appuyais tendrement mes joues contre les belles joues … Du c… Premi…  1913
##  3 "Je me rendormais, et parfois je n'avais plus que de cour… Du c… Premi…  1913
##  4 "Quelquefois, comme Ève naquit d'une côte d'Adam, une fem… Du c… Premi…  1913
##  5 "Un homme qui dort tient en cercle autour de lui le fil d… Du c… Premi…  1913
##  6 "Peut-être l'immobilité des choses autour de nous leur es… Du c… Premi…  1913
##  7 "Puis renaissait le souvenir d'une nouvelle attitude ; le… Du c… Premi…  1913
##  8 "Ces évocations tournoyantes et confuses ne duraient jama… Du c… Premi…  1913
##  9 "Certes, j'étais bien éveillé maintenant : mon corps avai… Du c… Premi…  1913
## 10 "À Combray, tous les jours dès la fin de l'après-midi, lo… Du c… Premi…  1913
## # ℹ 50 more rows

5.2.3 gutenbergr

The gutenbergr package allows you to search for and download public domain texts from Project Gutenberg.

To use gutenbergr you must know the Gutenberg id of the work you wish to analyse. A text search of the works can be done using the gutenberg_metadata dataset.

5.2.3.1 Search for available work

gutenberg_metadata
## # A tibble: 79,491 × 8
##    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
##           <int> <chr>    <chr>                <int> <fct>    <chr>              
##  1            1 "The De… Jeffe…                1638 en       Politics/American …
##  2            2 "The Un… Unite…                   1 en       Politics/American …
##  3            3 "John F… Kenne…                1666 en       Browsing: History …
##  4            4 "Lincol… Linco…                   3 en       US Civil War/Brows…
##  5            5 "The Un… Unite…                   1 en       United States/Poli…
##  6            6 "Give M… Henry…                   4 en       American Revolutio…
##  7            7 "The Ma… <NA>                    NA en       Browsing: History …
##  8            8 "Abraha… Linco…                   3 en       US Civil War/Brows…
##  9            9 "Abraha… Linco…                   3 en       US Civil War/Brows…
## 10           10 "The Ki… <NA>                    NA en       Banned Books List …
## # ℹ 79,481 more rows
## # ℹ 2 more variables: rights <fct>, has_text <lgl>

5.2.3.2 Filter available text

We can filter to keep only works for which the full text is available.

gutenberg_works(only_text = TRUE)
## # A tibble: 61,693 × 8
##    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
##           <int> <chr>    <chr>                <int> <fct>    <chr>              
##  1            1 "The De… Jeffe…                1638 en       Politics/American …
##  2            2 "The Un… Unite…                   1 en       Politics/American …
##  3            3 "John F… Kenne…                1666 en       Browsing: History …
##  4            4 "Lincol… Linco…                   3 en       US Civil War/Brows…
##  5            5 "The Un… Unite…                   1 en       United States/Poli…
##  6            6 "Give M… Henry…                   4 en       American Revolutio…
##  7            7 "The Ma… <NA>                    NA en       Browsing: History …
##  8            8 "Abraha… Linco…                   3 en       US Civil War/Brows…
##  9            9 "Abraha… Linco…                   3 en       US Civil War/Brows…
## 10           10 "The Ki… <NA>                    NA en       Banned Books List …
## # ℹ 61,683 more rows
## # ℹ 2 more variables: rights <fct>, has_text <lgl>

5.2.3.3 Look at a specific work

Then we can search for a specific work

gutenberg_works(title == "Wuthering Heights")
## # A tibble: 1 × 8
##   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
##          <int> <chr>     <chr>                <int> <fct>    <chr>              
## 1          768 Wutherin… Bront…                 405 en       Best Books Ever Li…
## # ℹ 2 more variables: rights <fct>, has_text <lgl>
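
We can also filter on other metadata columns, for instance the author (a sketch, assuming the "Last, First" naming convention used in gutenberg_metadata):

gutenberg_works(author == "Brontë, Emily") %>% 
  select(gutenberg_id, title)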

5.2.3.4 Download specific work

We then use the gutenberg_download() function with this id to download the book.

book_768 <- gutenberg_download(768)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.

5.2.3.5 Summary

We use summary() to get an overview of the downloaded data: the gutenberg_id column and the number of text lines.

book_768 %>%
  summary()
##   gutenberg_id     text          
##  Min.   :768   Length:12342      
##  1st Qu.:768   Class :character  
##  Median :768   Mode  :character  
##  Mean   :768                     
##  3rd Qu.:768                     
##  Max.   :768

5.2.3.6 Look into data

We can then look at the first 60 rows.

book_768 %>%
  head(n = 60)
## # A tibble: 60 × 2
##    gutenberg_id text               
##           <int> <chr>              
##  1          768 "Wuthering Heights"
##  2          768 ""                 
##  3          768 "by Emily Brontë"  
##  4          768 ""                 
##  5          768 ""                 
##  6          768 ""                 
##  7          768 ""                 
##  8          768 "CHAPTER I"        
##  9          768 ""                 
## 10          768 ""                 
## # ℹ 50 more rows
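
gutenberg_download() also accepts a vector of ids, and its meta_fields argument keeps selected metadata columns such as the title (a sketch; the second id below is purely illustrative and should be replaced by an id found with gutenberg_works()):

#download several works in one call and keep their titles
two_books <- gutenberg_download(c(768, 1260), meta_fields = "title")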

5.2.4 textdata

The textdata package allows you to find and download textual datasets. See this link for details of the available datasets.

5.2.4.1 Available datasets

catalogue
##                                                            name
## 1                                                     AFINN-111
## 2                                        v1.0 sentence polarity
## 3                           Loughran-McDonald Sentiment lexicon
## 4                                        Bing Sentiment Lexicon
## 5                          NRC Word-Emotion Association Lexicon
## 6  NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)
## 7               The NRC Valence, Arousal, and Dominance Lexicon
## 8                                                       AG News
## 9                                                       DBpedia
## 10                                             TREC-6 & TREC-50
## 11                              IMDb Large Movie Review Dataset
## 12                                                     GloVe 6B
## 13                                            GloVe Twitter 27B
## 14                                       GloVe Common Crawl 42B
## 15                                      GloVe Common Crawl 840B
##                                                                   url
## 1  http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
## 2             http://www.cs.cornell.edu/people/pabo/movie-review-data
## 3                     https://sraf.nd.edu/textual-analysis/resources/
## 4            https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
## 5                      http://saifmohammad.com/WebPages/lexicons.html
## 6                   www.saifmohammad.com/WebPages/AffectIntensity.htm
## 7                      https://saifmohammad.com/WebPages/nrc-vad.html
## 8      https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
## 9                                           https://wiki.dbpedia.org/
## 10                         https://cogcomp.seas.upenn.edu/Data/QA/QC/
## 11                      http://ai.stanford.edu/~amaas/data/sentiment/
## 12                           https://nlp.stanford.edu/projects/glove/
## 13                           https://nlp.stanford.edu/projects/glove/
## 14                           https://nlp.stanford.edu/projects/glove/
## 15                           https://nlp.stanford.edu/projects/glove/
##                                                                                                 license
## 1                                                                     Open Database License (ODbL) v1.0
## 2                                                                             Cite the paper when used.
## 3                                  License required for commercial use. Please contact tloughra@nd.edu.
## 4                                             May be used (research, commercial, etc) with attribution.
## 5  License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 6  License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 7  License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 8                                You are encouraged to download this corpus for any non-commercial use.
## 9                                                   Creative Commons Attribution-ShareAlike 3.0 License
## 10                                                           Freely reusable public information licence
## 11                                        No license specified, the work may be protected by copyright.
## 12                                                            Public Domain Dedication and License v1.0
## 13                                                            Public Domain Dedication and License v1.0
## 14                                                            Public Domain Dedication and License v1.0
## 15                                                            Public Domain Dedication and License v1.0
##                                                   size       type download_mech
## 1                                78 KB (cleaned 59 KB)    lexicon         https
## 2                                2 MB (cleaned 1.4 MB)    dataset         https
## 3                              6.7 MB (cleaned 142 KB)    lexicon         https
## 4                              287 KB (cleaned 220 KB)    lexicon          http
## 5                             22.8 MB (cleaned 424 KB)    lexicon          http
## 6                              333 KB (cleaned 212 KB)    lexicon          http
## 7                            150.8 MB (cleaned 792 KB)    lexicon          http
## 8                            64.4 MB (cleaned 33.9 MB)    dataset         https
## 9                          279.5 MB (cleaned 211.1 MB)    dataset         https
## 10                             1.2 MB (cleaned 827 KB)    dataset         https
## 11                            376.4 MB (cleaned 71 MB)    dataset          http
## 12 822.2 MB (158MB, 311MB, 616MB, and 921MB processed) embeddings         https
## 13 1.42 GB (248MB, 476MB, 931MB, and 1.79GB processed) embeddings         https
## 14                          1.75 GB (4.31GB processed) embeddings         https
## 15                          2.03 GB (4.94GB processed) embeddings         https
##                                                                                      description
## 1                                                                                               
## 2                            Dataset with sentences labeled with negative or positive sentiment.
## 3                                                                                               
## 4                                                                                               
## 5                                                                                               
## 6                                                                                               
## 7                                                                                               
## 8                                                                                               
## 9                                                                                               
## 10                                                                                              
## 11                                                                                              
## 12 Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors)
## 13          Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors)
## 14                                  Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors)
## 15                                   Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 citation
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 5  Citation info:\n\nThis dataset was published in Saif M. Mohammad and Peter Turney. (2013), ``Crowdsourcing a Word-Emotion Association Lexicon.'' Computational Intelligence, 29(3): 436-465.\n\narticle{mohammad13,\nauthor = {Mohammad, Saif M. and Turney, Peter D.},\ntitle = {Crowdsourcing a Word-Emotion Association Lexicon},\njournal = {Computational Intelligence},\nvolume = {29},\nnumber = {3},\npages = {436-465},\ndoi = {10.1111/j.1467-8640.2012.00460.x},\nurl = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x},\neprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x},\nyear = {2013}\n}\nIf you use this lexicon, then please cite it.
## 6                                                                                                                                                                                                                                                                 Citation info:\nDetails of the lexicon are in this paper.\nWord Affect Intensities. Saif M. Mohammad. arXiv preprint arXiv, April 2017.\n\ninproceedings{LREC18-AIL,\nauthor = {Mohammad, Saif M.},\ntitle = {Word Affect Intensities},\nbooktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018)},\nyear = {2018},\naddress={Miyazaki, Japan}\n}\n\nIf you use this lexicon, then please cite it.
## 7                                                                                                                                                                                                                                                                                                                           Citation info:\n\ninproceedings{vad-acl2018,\ntitle={Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words},\nauthor={Mohammad, Saif M.},\nbooktitle={Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL)},\nyear={2018},\naddress={Melbourne, Australia}\n}\n\nIf you use this lexicon, then please cite it.
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 9                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   <NA>
## 10                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  <NA>
## 11                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  <NA>
## 12                                                                                                                                                                                                                                                                                                                                                      Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 13                                                                                                                                                                                                                                                                                                                                                      Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 14                                                                                                                                                                                                                                                                                                                                                      Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 15                                                                                                                                                                                                                                                                                                                                                      Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
with(catalogue, split(name, type))
## $dataset
## [1] "v1.0 sentence polarity"          "AG News"                        
## [3] "DBpedia"                         "TREC-6 & TREC-50"               
## [5] "IMDb Large Movie Review Dataset"
## 
## $embeddings
## [1] "GloVe 6B"                "GloVe Twitter 27B"      
## [3] "GloVe Common Crawl 42B"  "GloVe Common Crawl 840B"
## 
## $lexicon
## [1] "AFINN-111"                                                   
## [2] "Loughran-McDonald Sentiment lexicon"                         
## [3] "Bing Sentiment Lexicon"                                      
## [4] "NRC Word-Emotion Association Lexicon"                        
## [5] "NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)"
## [6] "The NRC Valence, Arousal, and Dominance Lexicon"

5.2.4.2 Download datasets

We download one of the smallest datasets from textdata, the AFINN-111 sentiment lexicon:

#AFINN-111 sentiment lexicon
lexicon_afinn()
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 2,467 more rows
lexicon_afinn() %>% 
  group_by(factor(word)) %>% 
  summary()
##      word               value             factor(word) 
##  Length:2477        Min.   :-5.0000   abandon   :   1  
##  Class :character   1st Qu.:-2.0000   abandoned :   1  
##  Mode  :character   Median :-2.0000   abandons  :   1  
##                     Mean   :-0.5894   abducted  :   1  
##                     3rd Qu.: 2.0000   abduction :   1  
##                     Max.   : 5.0000   abductions:   1  
##                                       (Other)   :2471

5.2.4.3 Look into data

lexicon_afinn() %>% 
  head(n = 60)
## # A tibble: 60 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ℹ 50 more rows
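
A quick way to see how the sentiment scores are distributed is to count the number of words per value (a minimal sketch using count()):

lexicon_afinn() %>% 
  count(value)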

5.2.5 readtext

The readtext package comes with various example datasets. We specify the path where these files are stored and import them from there.

Data_Dir <- system.file("extdata/", package = "readtext")

5.2.5.1 Inaugural Corpus USA

5.2.5.1.1 Importing data
dat_inaug <- read.csv(paste0(Data_Dir, "/csv/inaugCorpus.csv"))
5.2.5.1.2 Checking structure
dat_inaug %>% 
  str()
## 'data.frame':    5 obs. of  4 variables:
##  $ texts    : chr  "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life n"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magist"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submi"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our countr"| __truncated__ ...
##  $ Year     : int  1789 1793 1797 1801 1805
##  $ President: chr  "Washington" "Washington" "Adams" "Jefferson" ...
##  $ FirstName: chr  "George" "George" "John" "Thomas" ...
5.2.5.1.3 Unnest
dat_inaug %>% 
  unnest()
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 5 × 4
##   texts                                                 Year President FirstName
##   <chr>                                                <int> <chr>     <chr>    
## 1 "Fellow-Citizens of the Senate and of the House of …  1789 Washingt… George   
## 2 "Fellow citizens, I am again called upon by the voi…  1793 Washingt… George   
## 3 "When it was first perceived, in early times, that …  1797 Adams     John     
## 4 "Friends and Fellow Citizens:\n\nCalled upon to und…  1801 Jefferson Thomas   
## 5 "Proceeding, fellow citizens, to that qualification…  1805 Jefferson Thomas
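
The same csv file can also be imported with readtext() itself; the text_field argument indicates which column contains the text, and the remaining columns become document variables (a sketch based on the readtext documentation):

dat_inaug_rt <- readtext(paste0(Data_Dir, "/csv/inaugCorpus.csv"),
                         text_field = "texts")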

5.2.5.2 Universal Declaration of Human Rights

We import multiple files containing the Universal Declaration of Human Rights in 13 languages; there is one text file per language.

5.2.5.2.1 Importing data
dat_udhr <- readtext(paste0(Data_Dir, "/txt/UDHR/*"),
                      docvarsfrom = "filenames", 
                      docvarnames = c("document", "language"))
5.2.5.2.2 Checking structure
dat_udhr %>% 
  str()
## Classes 'readtext' and 'data.frame': 13 obs. of  4 variables:
##  $ doc_id  : chr  "UDHR_chinese.txt" "UDHR_czech.txt" "UDHR_danish.txt" "UDHR_english.txt" ...
##  $ text    : chr  "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布 1948 年 12 月 10 日, 联 合 国 大 会 通"| __truncated__ "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\nÚvod U vědomí toho, že uznání přirozené důstojnosti a rovných a nezcizitelný"| __truncated__ "Den 10. december 1948 vedtog og offentliggjorde FNs tredie generalforsamling Verdenserklæringen om Menneskerett"| __truncated__ "Universal Declaration of Human Rights\nPreamble Whereas recognition of the inherent dignity and of the equal an"| __truncated__ ...
##  $ document: chr  "UDHR" "UDHR" "UDHR" "UDHR" ...
##  $ language: chr  "chinese" "czech" "danish" "english" ...
5.2.5.2.3 Unnest
dat_udhr %>% 
  unnest()
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 13 × 4
##    doc_id              text                                    document language
##    <chr>               <chr>                                   <chr>    <chr>   
##  1 UDHR_chinese.txt    "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议通… UDHR     chinese 
##  2 UDHR_czech.txt      "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\nÚv… UDHR     czech   
##  3 UDHR_danish.txt     "Den 10. december 1948 vedtog og offen… UDHR     danish  
##  4 UDHR_english.txt    "Universal Declaration of Human Rights… UDHR     english 
##  5 UDHR_french.txt     "Déclaration universelle des droits de… UDHR     french  
##  6 UDHR_georgian.txt   "FLFVBFYBC EAKT<FSF CF>JDTKSFJ LTRKFHF… UDHR     georgian
##  7 UDHR_greek.txt      "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝ… UDHR     greek   
##  8 UDHR_hungarian.txt  "Az Emberi Jogok Egyetemes Nyilatkozat… UDHR     hungari…
##  9 UDHR_icelandic.txt  "Mannréttindayfirlýsing Sameinuðu þjóð… UDHR     iceland…
## 10 UDHR_irish.txt      "DEARBHÚ UILE-CHOITEANN CEARTA AN DUIN… UDHR     irish   
## 11 UDHR_japanese.txt   "『世界人権宣言』\n\n(1948.12.10 第3回国連総会採択)\n\… UDHR     japanese
## 12 UDHR_russian.txt    "Всеобщая декларация прав человека\nПр… UDHR     russian 
## 13 UDHR_vietnamese.txt "7X\\zQ QJ{Q WR\u007fQ WK\u009b JL±L Y… UDHR     vietnam…

5.2.5.3 Twitter data

We use the twitter.json data accessed from here. This is a JSON file (.json) downloaded from the Twitter stream API.

5.2.5.3.1 Importing data
dat_twitter <- readtext("data/twitter.json", source = "twitter")
5.2.5.3.2 Checking structure
dat_twitter %>% 
  str()
## Classes 'readtext' and 'data.frame': 7504 obs. of  44 variables:
##  $ doc_id                   : chr  "twitter.json.1" "twitter.json.2" "twitter.json.3" "twitter.json.4" ...
##  $ text                     : chr  "@EFC_Jayy UKIP" "RT @Corbynator2:@jeremycorbyn Reaction from people at the Watford Rally:\n“We believe in Jeremy Corbyn!”\n“We n"| __truncated__ "RT @ryvr: Stephen Hawking, the world’s smartest man, backs Jeremy Corbyn https://t.co/2kl3ayLd44 #TuesdayThoughts" "RT @TheGreenParty: How you cast your vote will shape the future. Every single vote counts. Tomorrow, #VoteGreen"| __truncated__ ...
##  $ retweet_count            : num  0 90 78 244 1896 ...
##  $ favorite_count           : num  0 108 104 218 2217 ...
##  $ favorited                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ truncated                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ id_str                   : chr  "872596537142116352" "872596536869363712" "872596537444093952" "872596538492637185" ...
##  $ in_reply_to_screen_name  : chr  "EFC_Jayy" NA NA NA ...
##  $ source                   : chr  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
##  $ retweeted                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ created_at               : chr  "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" ...
##  $ in_reply_to_status_id_str: chr  "872596176834572288" NA NA NA ...
##  $ in_reply_to_user_id_str  : chr  "4556760676" NA NA NA ...
##  $ lang                     : chr  "en" "en" "en" "en" ...
##  $ listed_count             : num  1 28 2 3 6 2 12 90 1 25 ...
##  $ verified                 : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ location                 : chr  "Japan" "Gondwana" "LDN rt/mention/follow/link ≠ e" "East, England" ...
##  $ user_id_str              : chr  "863929468984995840" "153295243" "273731990" "477177095" ...
##  $ description              : chr  NA "#Black. #Green. #Red. #Aboriginal. #Environmental. #Socialist. #Atheist." "Infovore, atheist, post-Ⓐ, p/t nihilist, lifelong radiophile, aspiring cultural terrorist 🏴  ☮️  🇵🇸  \nImages: @"| __truncated__ "think outside your own perspective and find transcendence bypassing state opulence" ...
##  $ geo_enabled              : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ user_created_at          : chr  "Mon May 15 01:30:11 +0000 2017" "Tue Jun 08 05:05:23 +0000 2010" "Tue Mar 29 01:59:32 +0000 2011" "Sat Jan 28 22:41:07 +0000 2012" ...
##  $ statuses_count           : num  2930 10223 20934 13603 13179 ...
##  $ followers_count          : num  367 845 761 321 386 ...
##  $ favourites_count         : num  1260 9813 14733 5421 5219 ...
##  $ protected                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ user_url                 : chr  NA NA NA "http://www.instagram.com/stuhornett/" ...
##  $ name                     : chr  "ジョージ" "Yara-ma-yha-who" "Openly classist" "Stu" ...
##  $ time_zone                : chr  NA "London" "London" "Casablanca" ...
##  $ user_lang                : chr  "en" "en" "en" "en" ...
##  $ utc_offset               : num  NA 3600 3600 0 3600 NA NA 3600 NA 3600 ...
##  $ friends_count            : num  304 439 2761 767 257 ...
##  $ screen_name              : chr  "CoysJoji" "Unkle_Ken" "OpenlyClassist" "StuHornett" ...
##  $ country_code             : chr  NA NA NA NA ...
##  $ country                  : chr  NA NA NA NA ...
##  $ place_type               : logi  NA NA NA NA NA NA ...
##  $ full_name                : chr  NA NA NA NA ...
##  $ place_name               : chr  NA NA NA NA ...
##  $ place_id                 : chr  NA NA NA NA ...
##  $ place_lat                : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ place_lon                : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ lat                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ lon                      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ expanded_url             : chr  NA NA "http://www.independent.co.uk/news/science/stephen-hawking-jeremy-corbyn-labour-theresa-may-conservatives-endors"| __truncated__ NA ...
##  $ url                      : chr  NA "" "https://t.co/2kl3ayLd44" NA ...
5.2.5.3.3 Unnest
dat_twitter %>% 
  unnest()
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 7,504 × 44
##    doc_id          text  retweet_count favorite_count favorited truncated id_str
##    <chr>           <chr>         <dbl>          <dbl> <lgl>     <lgl>     <chr> 
##  1 twitter.json.1  "@EF…             0              0 FALSE     FALSE     87259…
##  2 twitter.json.2  "RT …            90            108 FALSE     FALSE     87259…
##  3 twitter.json.3  "RT …            78            104 FALSE     FALSE     87259…
##  4 twitter.json.4  "RT …           244            218 FALSE     FALSE     87259…
##  5 twitter.json.5  "RT …          1896           2217 FALSE     FALSE     87259…
##  6 twitter.json.6  "RT …            55             52 FALSE     FALSE     87259…
##  7 twitter.json.7  "RT …            65             73 FALSE     FALSE     87259…
##  8 twitter.json.8  "RT …            30              9 FALSE     FALSE     87259…
##  9 twitter.json.9  "RT …          1896           2217 FALSE     FALSE     87259…
## 10 twitter.json.10 "Wha…             0              0 FALSE     TRUE      87259…
## # ℹ 7,494 more rows
## # ℹ 37 more variables: in_reply_to_screen_name <chr>, source <chr>,
## #   retweeted <lgl>, created_at <chr>, in_reply_to_status_id_str <chr>,
## #   in_reply_to_user_id_str <chr>, lang <chr>, listed_count <dbl>,
## #   verified <lgl>, location <chr>, user_id_str <chr>, description <chr>,
## #   geo_enabled <lgl>, user_created_at <chr>, statuses_count <dbl>,
## #   followers_count <dbl>, favourites_count <dbl>, protected <lgl>, …
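
Because the fields from the JSON file become ordinary columns, we can, for instance, keep only a few of them and sort by retweet count (a sketch using dplyr verbs):

dat_twitter %>% 
  as_tibble() %>% 
  select(doc_id, text, retweet_count) %>% 
  arrange(desc(retweet_count))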

5.2.5.4 Converting from a PDF file

We can also import data in PDF format and obtain document variables from the file names.

5.2.5.4.1 Importing data
dat_udhr_PDF <- readtext(paste0(Data_Dir, "/pdf/UDHR/*.pdf"), 
                      docvarsfrom = "filenames", 
                      docvarnames = c("document", "language"),
                      sep = "_")
5.2.5.4.2 Check encoding
Encoding(dat_udhr_PDF$text)
##  [1] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"  
##  [8] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"
5.2.5.4.3 Checking structure
dat_udhr_PDF %>% 
  str()
## Classes 'readtext' and 'data.frame': 11 obs. of  4 variables:
##  $ doc_id  : chr  "UDHR_chinese.pdf" "UDHR_czech.pdf" "UDHR_danish.pdf" "UDHR_english.pdf" ...
##  $ text    : chr  "世界人权宣言\n\n联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布\n1948 年 12 月 10 日, 联 合 国 大 会"| __truncated__ "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\n\nÚvod\n\nU vědomí toho,\nže uznání přirozené důstojnosti a rovných a nezciz"| __truncated__ "Den 10. december 1948 vedtog og offentliggjorde FNs tredie generalforsamling\nVerdenserklæringen om Menneskeret"| __truncated__ "Universal Declaration of Human Rights\n\nPreamble\n\nWhereas recognition of the inherent dignity and of the equ"| __truncated__ ...
##  $ document: chr  "UDHR" "UDHR" "UDHR" "UDHR" ...
##  $ language: chr  "chinese" "czech" "danish" "english" ...
5.2.5.4.4 Unnest
dat_udhr_PDF %>% 
  unnest()
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 11 × 4
##    doc_id              text                                    document language
##    <chr>               <chr>                                   <chr>    <chr>   
##  1 UDHR_chinese.pdf    "世界人权宣言\n\n联合国大会一九四八年十二月十日第217A(III)号决… UDHR     chinese 
##  2 UDHR_czech.pdf      "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\n\n… UDHR     czech   
##  3 UDHR_danish.pdf     "Den 10. december 1948 vedtog og offen… UDHR     danish  
##  4 UDHR_english.pdf    "Universal Declaration of Human Rights… UDHR     english 
##  5 UDHR_french.pdf     "Déclaration universelle des droits de… UDHR     french  
##  6 UDHR_greek.pdf      "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝ… UDHR     greek   
##  7 UDHR_hungarian.pdf  "Az Emberi Jogok Egyetemes Nyilatkozat… UDHR     hungari…
##  8 UDHR_irish.pdf      "DEARBHÚ UILE-CHOITEANN CEARTA AN DUIN… UDHR     irish   
##  9 UDHR_japanese.pdf   "『世界人権宣言』\n\n\n\n(1948.12.10 第3回国連総会採択… UDHR     japanese
## 10 UDHR_russian.pdf    "Всеобщая декларация прав человека\n\n… UDHR     russian 
## 11 UDHR_vietnamese.pdf "                       7X\\zQ⇤QJ{Q⇤WR… UDHR     vietnam…

5.2.5.5 Different encodings

We look into data with different encodings. This is important because the texts you work with can come in a variety of character encodings.
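
As a reminder of what an encoding conversion does, here is a tiny base-R illustration (a sketch; iconv() converts a string between declared encodings):

x_latin1 <- iconv("Déclaration", from = "UTF-8", to = "latin1")
Encoding(x_latin1)
iconv(x_latin1, from = "latin1", to = "UTF-8")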

5.2.5.5.1 Temp path
path_temp <- tempdir()
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = path_temp)
5.2.5.5.2 Importing data

We use a regular expression to find all files whose names start with “Indian” or “UDHR_” and end with “.txt”.

filename <- list.files(path_temp, "^(Indian|UDHR_).*\\.txt$")
head(filename)
## [1] "IndianTreaty_English_UTF-16LE.txt"  "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt"         "UDHR_Arabic_UTF-8.txt"             
## [5] "UDHR_Arabic_WINDOWS-1256.txt"       "UDHR_Chinese_GB2312.txt"
5.2.5.5.3 Extract encoding

We remove the .txt extension from the end of each file name, then split each name on “_” and keep the third element, which is the encoding.

filename <- filename %>% 
  str_replace(".txt$", "")
encoding <- purrr::map(str_split(filename, "_"), 3)
head(encoding)
## [[1]]
## [1] "UTF-16LE"
## 
## [[2]]
## [1] "UTF-8-BOM"
## 
## [[3]]
## [1] "ISO-8859-6"
## 
## [[4]]
## [1] "UTF-8"
## 
## [[5]]
## [1] "WINDOWS-1256"
## 
## [[6]]
## [1] "GB2312"

We feed these encodings to readtext(), which converts the various character encodings into UTF-8.

dat_txt <- readtext(paste0(Data_Dir, "/data_files_encodedtexts.zip"), 
                     encoding = encoding,
                     docvarsfrom = "filenames", 
                     docvarnames = c("document", "language", "input_encoding"))
dat_txt %>% 
  unnest()
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 36 × 5
##    doc_id                             text      document language input_encoding
##    <chr>                              <chr>     <chr>    <chr>    <chr>         
##  1 IndianTreaty_English_UTF-16LE.txt  "WHEREAS… IndianT… English  UTF-16LE      
##  2 IndianTreaty_English_UTF-8-BOM.txt "ARTICLE… IndianT… English  UTF-8-BOM     
##  3 UDHR_Arabic_ISO-8859-6.txt         "الديباج… UDHR     Arabic   ISO-8859-6    
##  4 UDHR_Arabic_UTF-8.txt              "الديباج… UDHR     Arabic   UTF-8         
##  5 UDHR_Arabic_WINDOWS-1256.txt       "الديباج… UDHR     Arabic   WINDOWS-1256  
##  6 UDHR_Chinese_GB2312.txt            "世界人权宣言\… UDHR     Chinese  GB2312        
##  7 UDHR_Chinese_GBK.txt               "世界人权宣言\… UDHR     Chinese  GBK           
##  8 UDHR_Chinese_UTF-8.txt             "世界人权宣言\… UDHR     Chinese  UTF-8         
##  9 UDHR_English_UTF-16BE.txt          "Univers… UDHR     English  UTF-16BE      
## 10 UDHR_English_UTF-16LE.txt          "Univers… UDHR     English  UTF-16LE      
## # ℹ 26 more rows

5.2.6 Web scraping

We use the rvest package to obtain data from a specific URL. See here for advanced web scraping, and look at this link as well for a more straightforward approach.

5.2.6.1 A single webpage

5.2.6.1.1 Read_html
web_page <- rvest::read_html("https://www.tidyverse.org/packages/")
web_page
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n  <div id="appTidyverseSite" class="shrinkHeader alwaysShrinkHead ...

Because the downloaded page contains a lot of unnecessary information, we process the data to extract only the text from the webpage.

5.2.6.1.2 Extract headline
header_web_page <- web_page %>%
  ## extract first-level headings
  rvest::html_nodes("h1") %>%
  ## extract text
  rvest::html_text() 
head(header_web_page)
## [1] "Tidyverse packages"
5.2.6.1.3 Extract text
web_page_txt <- web_page %>%
  ## extract paragraphs
  rvest::html_nodes("p") %>%
  ## extract text
  rvest::html_text()
head(web_page_txt)
## [1] "Install all the packages in the tidyverse by running install.packages(\"tidyverse\")."                                                                                                                                                              
## [2] "Run library(tidyverse) to load the core tidyverse and make it available\nin your current R session."                                                                                                                                                
## [3] "Learn more about the tidyverse package at https://tidyverse.tidyverse.org."                                                                                                                                                                         
## [4] "The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of tidyverse 1.3.0, the following packages are included in the core tidyverse:"                                                                    
## [5] "ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. Go to docs..."
## [6] "dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. Go to docs..."
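
Because the extracted paragraphs form a simple character vector, we can turn them into a tidy table for later processing (a minimal sketch using tibble()):

web_page_df <- tibble(paragraph = seq_along(web_page_txt),
                      text      = web_page_txt)
web_page_df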

5.2.6.2 Multiple webpages

5.2.6.2.1 Read_html
website <- "https://www.tidyverse.org/packages/" %>% 
  rvest::read_html()
website
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n  <div id="appTidyverseSite" class="shrinkHeader alwaysShrinkHead ...
a_elements <- website %>% 
  rvest::html_elements(css = "div.package > a")
a_elements
## {xml_nodeset (9)}
## [1] <a href="https://ggplot2.tidyverse.org/" target="_blank">\n    <img class ...
## [2] <a href="https://dplyr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [3] <a href="https://tidyr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [4] <a href="https://readr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [5] <a href="https://purrr.tidyverse.org/" target="_blank">\n    <img class=" ...
## [6] <a href="https://tibble.tidyverse.org/" target="_blank">\n    <img class= ...
## [7] <a href="https://stringr.tidyverse.org/" target="_blank">\n    <img class ...
## [8] <a href="https://forcats.tidyverse.org/" target="_blank">\n    <img class ...
## [9] <a href="https://lubridate.tidyverse.org/" target="_blank">\n    <img cla ...
5.2.6.2.2 Extract links
links <- a_elements %>%
  rvest::html_attr(name = "href")
links
## [1] "https://ggplot2.tidyverse.org/"   "https://dplyr.tidyverse.org/"    
## [3] "https://tidyr.tidyverse.org/"     "https://readr.tidyverse.org/"    
## [5] "https://purrr.tidyverse.org/"     "https://tibble.tidyverse.org/"   
## [7] "https://stringr.tidyverse.org/"   "https://forcats.tidyverse.org/"  
## [9] "https://lubridate.tidyverse.org/"
5.2.6.2.3 Extract subpages
pages <- links %>% 
  map(rvest::read_html)
pages
## [[1]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[2]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[3]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[4]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[5]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[6]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[7]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[8]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...
## 
## [[9]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <a href="#container" class="visually-hidden-focusable">Skip t ...

The structure seems to be similar across all pages, so we can extract the package names:

pages %>% 
  map(rvest::html_element, css = "a.navbar-brand") %>% 
  map_chr(rvest::html_text)
## [1] "ggplot2"   "dplyr"     "tidyr"     "readr"     "purrr"     "tibble"   
## [7] "stringr"   "forcats"   "lubridate"

and we can extract the version numbers

pages %>% 
  map(rvest::html_element, css = "small.nav-text.text-muted.me-auto") %>% 
  map_chr(rvest::html_text)
## [1] "3.5.2" "1.1.4" "1.3.1" "2.1.5" "1.1.0" "3.3.0" "1.5.1" "1.0.0" "1.9.4"
5.2.6.2.4 Extract text

and we can also combine everything into a tibble

pages_table <- tibble(
  name = pages %>% 
    map(rvest::html_element, css = "a.navbar-brand") %>% 
    map_chr(rvest::html_text),
  version = pages %>% 
    map(rvest::html_element, css = "small.nav-text.text-muted.me-auto") %>% 
    map_chr(rvest::html_text),
  CRAN = pages %>% 
    map(rvest::html_element, css = "ul.list-unstyled > li:nth-child(1) > a") %>% 
    map_chr(rvest::html_attr, name = "href"),
  Learn = pages %>% 
    map(rvest::html_element, css = "ul.list-unstyled > li:nth-child(4) > a") %>% 
    map_chr(rvest::html_attr, name = "href"), 
  text = pages %>%
    map(rvest::html_element,  css = "body") %>%
    map_chr(rvest::html_text2)
)
pages_table
## # A tibble: 9 × 5
##   name      version CRAN                                          Learn    text 
##   <chr>     <chr>   <chr>                                         <chr>    <chr>
## 1 ggplot2   3.5.2   https://cloud.r-project.org/package=ggplot2   https:/… "Ski…
## 2 dplyr     1.1.4   https://cloud.r-project.org/package=dplyr     http://… "Ski…
## 3 tidyr     1.3.1   https://cloud.r-project.org/package=tidyr     https:/… "Ski…
## 4 readr     2.1.5   https://cloud.r-project.org/package=readr     http://… "Ski…
## 5 purrr     1.1.0   https://cloud.r-project.org/package=purrr     http://… "Ski…
## 6 tibble    3.3.0   https://cloud.r-project.org/package=tibble    https:/… "Ski…
## 7 stringr   1.5.1   https://cloud.r-project.org/package=stringr   http://… "Ski…
## 8 forcats   1.0.0   https://cloud.r-project.org/package=forcats   http://… "Ski…
## 9 lubridate 1.9.4   https://cloud.r-project.org/package=lubridate https:/… "Ski…