5.2 Where to find textual datasets?
There are various sources where one can easily find textual corpora. You can look at this link.
This section is inspired by the data manipulations in the Quanteda Tutorials.
We have installed various packages that allow you to obtain textual data. Here are a few examples.
5.2.1 janeaustenr
5.2.1.1 Look at books
This package includes the full text of six novels by Jane Austen.
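A sketch of how the overview below can be produced, assuming janeaustenr and dplyr are installed:
library(janeaustenr)
library(dplyr)
# austen_books() returns one row per line of text, with the book as a factor
glimpse(austen_books())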
## Rows: 73,422
## Columns: 2
## $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811)", "", …
## $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibility, Se…
5.2.1.2 Summary
The books are available as the objects emma, mansfieldpark, northangerabbey, persuasion, prideprejudice, and sensesensibility.
We use summary() to get the number of lines in each book:
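Since the book column is already a factor, summary() tabulates the lines per book directly; a sketch:
summary(austen_books())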
## text book
## Length:73422 Sense & Sensibility:12624
## Class :character Pride & Prejudice :13030
## Mode :character Mansfield Park :15349
## Emma :16235
## Northanger Abbey : 7856
## Persuasion : 8328
5.2.1.3 Look into data
And we can get the first 60 lines of “sensesensibility”:
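Each novel is also exported as a plain character vector of text lines, so head() returns the opening lines; a sketch:
head(sensesensibility, 60)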
## [1] "SENSE AND SENSIBILITY"
## [2] ""
## [3] "by Jane Austen"
## [4] ""
## [5] "(1811)"
## [6] ""
## [7] ""
## [8] ""
## [9] ""
## [10] "CHAPTER 1"
## [11] ""
## [12] ""
## [13] "The family of Dashwood had long been settled in Sussex. Their estate"
## [14] "was large, and their residence was at Norland Park, in the centre of"
## [15] "their property, where, for many generations, they had lived in so"
## [16] "respectable a manner as to engage the general good opinion of their"
## [17] "surrounding acquaintance. The late owner of this estate was a single"
## [18] "man, who lived to a very advanced age, and who for many years of his"
## [19] "life, had a constant companion and housekeeper in his sister. But her"
## [20] "death, which happened ten years before his own, produced a great"
## [21] "alteration in his home; for to supply her loss, he invited and received"
## [22] "into his house the family of his nephew Mr. Henry Dashwood, the legal"
## [23] "inheritor of the Norland estate, and the person to whom he intended to"
## [24] "bequeath it. In the society of his nephew and niece, and their"
## [25] "children, the old Gentleman's days were comfortably spent. His"
## [26] "attachment to them all increased. The constant attention of Mr. and"
## [27] "Mrs. Henry Dashwood to his wishes, which proceeded not merely from"
## [28] "interest, but from goodness of heart, gave him every degree of solid"
## [29] "comfort which his age could receive; and the cheerfulness of the"
## [30] "children added a relish to his existence."
## [31] ""
## [32] "By a former marriage, Mr. Henry Dashwood had one son: by his present"
## [33] "lady, three daughters. The son, a steady respectable young man, was"
## [34] "amply provided for by the fortune of his mother, which had been large,"
## [35] "and half of which devolved on him on his coming of age. By his own"
## [36] "marriage, likewise, which happened soon afterwards, he added to his"
## [37] "wealth. To him therefore the succession to the Norland estate was not"
## [38] "so really important as to his sisters; for their fortune, independent"
## [39] "of what might arise to them from their father's inheriting that"
## [40] "property, could be but small. Their mother had nothing, and their"
## [41] "father only seven thousand pounds in his own disposal; for the"
## [42] "remaining moiety of his first wife's fortune was also secured to her"
## [43] "child, and he had only a life-interest in it."
## [44] ""
## [45] "The old gentleman died: his will was read, and like almost every other"
## [46] "will, gave as much disappointment as pleasure. He was neither so"
## [47] "unjust, nor so ungrateful, as to leave his estate from his nephew;--but"
## [48] "he left it to him on such terms as destroyed half the value of the"
## [49] "bequest. Mr. Dashwood had wished for it more for the sake of his wife"
## [50] "and daughters than for himself or his son;--but to his son, and his"
## [51] "son's son, a child of four years old, it was secured, in such a way, as"
## [52] "to leave to himself no power of providing for those who were most dear"
## [53] "to him, and who most needed a provision by any charge on the estate, or"
## [54] "by any sale of its valuable woods. The whole was tied up for the"
## [55] "benefit of this child, who, in occasional visits with his father and"
## [56] "mother at Norland, had so far gained on the affections of his uncle, by"
## [57] "such attractions as are by no means unusual in children of two or three"
## [58] "years old; an imperfect articulation, an earnest desire of having his"
## [59] "own way, many cunning tricks, and a great deal of noise, as to outweigh"
## [60] "all the value of all the attention which, for years, he had received"
5.2.2 proustr
5.2.2.1 Look at books
This package includes the seven volumes of Marcel Proust's À la recherche du temps perdu, in French.
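A sketch using proust_books(), which returns the texts together with book, volume, and year columns (assuming proustr is installed):
library(proustr)
glimpse(proust_books())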
## Rows: 4,690
## Columns: 4
## $ text <chr> "Longtemps, je me suis couché de bonne heure. Parfois, à peine …
## $ book <chr> "Du côté de chez Swann", "Du côté de chez Swann", "Du côté de c…
## $ volume <chr> "Première partie : Combray", "Première partie : Combray", "Prem…
## $ year <dbl> 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 191…
5.2.2.2 Summary
We use summary() to get the number of rows per book:
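Here the book column is character, so converting it to a factor first lets summary() tabulate rows per book; a sketch (the conversion step is an assumption):
proust_books() %>%
  mutate(book = as.factor(book)) %>%
  summary()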
## text book
## Length:4690 À l’ombre des jeunes filles en fleurs: 792
## Class :character Albertine disparue : 259
## Mode :character Du côté de chez Swann :1004
## La Prisonnière : 365
## Le Côté de Guermantes :1610
## Le Temps retrouvé : 248
## Sodome et Gomorrhe : 412
## volume year
## Length:4690 Min. :1913
## Class :character 1st Qu.:1919
## Mode :character Median :1921
## Mean :1919
## 3rd Qu.:1921
## Max. :1927
## NA's :159
5.2.2.3 Look into data
And we can get the top 60 rows from the first book:
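A sketch filtering on the first book and keeping its first 60 rows (the exact calls are assumptions):
proust_books() %>%
  filter(book == "Du côté de chez Swann") %>%
  head(60)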
## # A tibble: 60 × 4
## text book volume year
## <chr> <chr> <chr> <dbl>
## 1 "Longtemps, je me suis couché de bonne heure. Parfois, à … Du c… Premi… 1913
## 2 "J'appuyais tendrement mes joues contre les belles joues … Du c… Premi… 1913
## 3 "Je me rendormais, et parfois je n'avais plus que de cour… Du c… Premi… 1913
## 4 "Quelquefois, comme Ève naquit d'une côte d'Adam, une fem… Du c… Premi… 1913
## 5 "Un homme qui dort tient en cercle autour de lui le fil d… Du c… Premi… 1913
## 6 "Peut-être l'immobilité des choses autour de nous leur es… Du c… Premi… 1913
## 7 "Puis renaissait le souvenir d'une nouvelle attitude ; le… Du c… Premi… 1913
## 8 "Ces évocations tournoyantes et confuses ne duraient jama… Du c… Premi… 1913
## 9 "Certes, j'étais bien éveillé maintenant : mon corps avai… Du c… Premi… 1913
## 10 "À Combray, tous les jours dès la fin de l'après-midi, lo… Du c… Premi… 1913
## # ℹ 50 more rows
5.2.3 gutenbergr
The gutenbergr package allows for search and download of public-domain texts from Project Gutenberg. To use gutenbergr you must know the Gutenberg ID of the work you wish to analyse. A text search of the works can be done using the gutenberg_metadata dataset.
5.2.3.1 Search for available works
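Printing the gutenberg_metadata tibble lists all works; a sketch:
library(gutenbergr)
gutenberg_metadata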
## # A tibble: 79,491 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <fct> <chr>
## 1 1 "The De… Jeffe… 1638 en Politics/American …
## 2 2 "The Un… Unite… 1 en Politics/American …
## 3 3 "John F… Kenne… 1666 en Browsing: History …
## 4 4 "Lincol… Linco… 3 en US Civil War/Brows…
## 5 5 "The Un… Unite… 1 en United States/Poli…
## 6 6 "Give M… Henry… 4 en American Revolutio…
## 7 7 "The Ma… <NA> NA en Browsing: History …
## 8 8 "Abraha… Linco… 3 en US Civil War/Brows…
## 9 9 "Abraha… Linco… 3 en US Civil War/Brows…
## 10 10 "The Ki… <NA> NA en Banned Books List …
## # ℹ 79,481 more rows
## # ℹ 2 more variables: rights <fct>, has_text <lgl>
5.2.3.2 Filter available text
We can filter to keep only works for which the full text is available:
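A sketch using the has_text column:
gutenberg_metadata %>%
  filter(has_text)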
## # A tibble: 61,693 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <fct> <chr>
## 1 1 "The De… Jeffe… 1638 en Politics/American …
## 2 2 "The Un… Unite… 1 en Politics/American …
## 3 3 "John F… Kenne… 1666 en Browsing: History …
## 4 4 "Lincol… Linco… 3 en US Civil War/Brows…
## 5 5 "The Un… Unite… 1 en United States/Poli…
## 6 6 "Give M… Henry… 4 en American Revolutio…
## 7 7 "The Ma… <NA> NA en Browsing: History …
## 8 8 "Abraha… Linco… 3 en US Civil War/Brows…
## 9 9 "Abraha… Linco… 3 en US Civil War/Brows…
## 10 10 "The Ki… <NA> NA en Banned Books List …
## # ℹ 61,683 more rows
## # ℹ 2 more variables: rights <fct>, has_text <lgl>
5.2.3.3 Look at a specific work
Then we can search for a specific work.
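For example, Wuthering Heights (the exact filter is an assumption based on the output):
gutenberg_metadata %>%
  filter(title == "Wuthering Heights")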
## # A tibble: 1 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <fct> <chr>
## 1 768 Wutherin… Bront… 405 en Best Books Ever Li…
## # ℹ 2 more variables: rights <fct>, has_text <lgl>
5.2.3.4 Download specific work
Then we use the gutenberg_download() function with the ID to download the book:
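A sketch (the object name is hypothetical; 768 is the ID found above):
wuthering_heights <- gutenberg_download(768)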
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest.
## Using mirror http://aleph.gutenberg.org.
5.2.3.5 Summary
We use summary() to get an overview of the downloaded text:
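A sketch, reusing the object downloaded above:
summary(wuthering_heights)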
## gutenberg_id text
## Min. :768 Length:12342
## 1st Qu.:768 Class :character
## Median :768 Mode :character
## Mean :768
## 3rd Qu.:768
## Max. :768
5.2.4 textdata
The textdata package allows you to find and download textual datasets. See this link for details of the datasets.
5.2.4.1 Available datasets
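The package exports a catalogue data frame describing every available dataset; printing it gives the listing below (a sketch):
library(textdata)
catalogue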
## name
## 1 AFINN-111
## 2 v1.0 sentence polarity
## 3 Loughran-McDonald Sentiment lexicon
## 4 Bing Sentiment Lexicon
## 5 NRC Word-Emotion Association Lexicon
## 6 NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)
## 7 The NRC Valence, Arousal, and Dominance Lexicon
## 8 AG News
## 9 DBpedia
## 10 TREC-6 & TREC-50
## 11 IMDb Large Movie Review Dataset
## 12 GloVe 6B
## 13 GloVe Twitter 27B
## 14 GloVe Common Crawl 42B
## 15 GloVe Common Crawl 840B
## url
## 1 http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
## 2 http://www.cs.cornell.edu/people/pabo/movie-review-data
## 3 https://sraf.nd.edu/textual-analysis/resources/
## 4 https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
## 5 http://saifmohammad.com/WebPages/lexicons.html
## 6 www.saifmohammad.com/WebPages/AffectIntensity.htm
## 7 https://saifmohammad.com/WebPages/nrc-vad.html
## 8 https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
## 9 https://wiki.dbpedia.org/
## 10 https://cogcomp.seas.upenn.edu/Data/QA/QC/
## 11 http://ai.stanford.edu/~amaas/data/sentiment/
## 12 https://nlp.stanford.edu/projects/glove/
## 13 https://nlp.stanford.edu/projects/glove/
## 14 https://nlp.stanford.edu/projects/glove/
## 15 https://nlp.stanford.edu/projects/glove/
## license
## 1 Open Database License (ODbL) v1.0
## 2 Cite the paper when used.
## 3 License required for commercial use. Please contact tloughra@nd.edu.
## 4 May be used (research, commercial, etc) with attribution.
## 5 License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 6 License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 7 License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
## 8 You are encouraged to download this corpus for any non-commercial use.
## 9 Creative Commons Attribution-ShareAlike 3.0 License
## 10 Freely reusable public information licence
## 11 No license specified, the work may be protected by copyright.
## 12 Public Domain Dedication and License v1.0
## 13 Public Domain Dedication and License v1.0
## 14 Public Domain Dedication and License v1.0
## 15 Public Domain Dedication and License v1.0
## size type download_mech
## 1 78 KB (cleaned 59 KB) lexicon https
## 2 2 MB (cleaned 1.4 MB) dataset https
## 3 6.7 MB (cleaned 142 KB) lexicon https
## 4 287 KB (cleaned 220 KB) lexicon http
## 5 22.8 MB (cleaned 424 KB) lexicon http
## 6 333 KB (cleaned 212 KB) lexicon http
## 7 150.8 MB (cleaned 792 KB) lexicon http
## 8 64.4 MB (cleaned 33.9 MB) dataset https
## 9 279.5 MB (cleaned 211.1 MB) dataset https
## 10 1.2 MB (cleaned 827 KB) dataset https
## 11 376.4 MB (cleaned 71 MB) dataset http
## 12 822.2 MB (158MB, 311MB, 616MB, and 921MB processed) embeddings https
## 13 1.42 GB (248MB, 476MB, 931MB, and 1.79GB processed) embeddings https
## 14 1.75 GB (4.31GB processed) embeddings https
## 15 2.03 GB (4.94GB processed) embeddings https
## description
## 1
## 2 Dataset with sentences labeled with negative or positive sentiment.
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12 Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors)
## 13 Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors)
## 14 Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors)
## 15 Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors)
## citation
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 Citation info:\n\nThis dataset was published in Saif M. Mohammad and Peter Turney. (2013), ``Crowdsourcing a Word-Emotion Association Lexicon.'' Computational Intelligence, 29(3): 436-465.\n\narticle{mohammad13,\nauthor = {Mohammad, Saif M. and Turney, Peter D.},\ntitle = {Crowdsourcing a Word-Emotion Association Lexicon},\njournal = {Computational Intelligence},\nvolume = {29},\nnumber = {3},\npages = {436-465},\ndoi = {10.1111/j.1467-8640.2012.00460.x},\nurl = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x},\neprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x},\nyear = {2013}\n}\nIf you use this lexicon, then please cite it.
## 6 Citation info:\nDetails of the lexicon are in this paper.\nWord Affect Intensities. Saif M. Mohammad. arXiv preprint arXiv, April 2017.\n\ninproceedings{LREC18-AIL,\nauthor = {Mohammad, Saif M.},\ntitle = {Word Affect Intensities},\nbooktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018)},\nyear = {2018},\naddress={Miyazaki, Japan}\n}\n\nIf you use this lexicon, then please cite it.
## 7 Citation info:\n\ninproceedings{vad-acl2018,\ntitle={Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words},\nauthor={Mohammad, Saif M.},\nbooktitle={Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL)},\nyear={2018},\naddress={Melbourne, Australia}\n}\n\nIf you use this lexicon, then please cite it.
## 8 <NA>
## 9 <NA>
## 10 <NA>
## 11 <NA>
## 12 Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 13 Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 14 Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
## 15 Citation info:\ninproceedings{pennington2014glove,\nauthor = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},\nbooktitle = {Empirical Methods in Natural Language Processing (EMNLP)},\ntitle = {GloVe: Global Vectors for Word Representation},\nyear = {2014},\npages = {1532--1543},\nurl = {http://www.aclweb.org/anthology/D14-1162},\n}
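The dataset names can also be grouped by type; a sketch using base split():
split(catalogue$name, catalogue$type)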
## $dataset
## [1] "v1.0 sentence polarity" "AG News"
## [3] "DBpedia" "TREC-6 & TREC-50"
## [5] "IMDb Large Movie Review Dataset"
##
## $embeddings
## [1] "GloVe 6B" "GloVe Twitter 27B"
## [3] "GloVe Common Crawl 42B" "GloVe Common Crawl 840B"
##
## $lexicon
## [1] "AFINN-111"
## [2] "Loughran-McDonald Sentiment lexicon"
## [3] "Bing Sentiment Lexicon"
## [4] "NRC Word-Emotion Association Lexicon"
## [5] "NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)"
## [6] "The NRC Valence, Arousal, and Dominance Lexicon"
5.2.4.2 Download datasets
We download the smallest of these datasets, the AFINN-111 lexicon:
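A sketch using lexicon_afinn(), which asks for confirmation before downloading on first use (the object name is hypothetical):
afinn <- lexicon_afinn()
afinn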
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
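The summary below treats the words as a factor; a sketch of one way to produce it (the unnamed mutate(), which creates the factor(word) column, is an assumption):
afinn %>%
  mutate(factor(word)) %>%
  summary()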
## word value factor(word)
## Length:2477 Min. :-5.0000 abandon : 1
## Class :character 1st Qu.:-2.0000 abandoned : 1
## Mode :character Median :-2.0000 abandons : 1
## Mean :-0.5894 abducted : 1
## 3rd Qu.: 2.0000 abduction : 1
## Max. : 5.0000 abductions: 1
## (Other) :2471
5.2.5 readtext
The readtext package comes with various sample datasets. We specify the path where the files are located and load them.
5.2.5.1 Inaugural Corpus USA
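The sample files ship with the readtext package itself; a sketch of locating them and importing the inaugural-address CSV (the Data_Dir name is an assumption, reused in later examples; read.csv() is used because the structure check below shows a plain data frame):
library(readtext)
# path to the sample data shipped with readtext
Data_Dir <- system.file("extdata/", package = "readtext")
dat_inaug <- read.csv(paste0(Data_Dir, "/csv/inaugCorpus.csv"))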
5.2.5.1.1 Checking structure
## 'data.frame': 5 obs. of 4 variables:
## $ texts : chr "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life n"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magist"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submi"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our countr"| __truncated__ ...
## $ Year : int 1789 1793 1797 1801 1805
## $ President: chr "Washington" "Washington" "Adams" "Jefferson" ...
## $ FirstName: chr "George" "George" "John" "Thomas" ...
5.2.5.1.2 Unnest
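A sketch of the conversion to a tibble; calling unnest() without cols is what triggers the deprecation warning shown below, and the same pattern is used in the later Unnest sections:
library(tidyr)
dat_inaug %>%
  as_tibble() %>%
  unnest()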
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 5 × 4
## texts Year President FirstName
## <chr> <int> <chr> <chr>
## 1 "Fellow-Citizens of the Senate and of the House of … 1789 Washingt… George
## 2 "Fellow citizens, I am again called upon by the voi… 1793 Washingt… George
## 3 "When it was first perceived, in early times, that … 1797 Adams John
## 4 "Friends and Fellow Citizens:\n\nCalled upon to und… 1801 Jefferson Thomas
## 5 "Proceeding, fellow citizens, to that qualification… 1805 Jefferson Thomas
5.2.5.2 Universal Declaration of Human Rights
We import multiple files containing the Universal Declaration of Human Rights in 13 languages, one text file per language.
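A sketch of the import, taking the document name and the language from each file name:
dat_udhr <- readtext(paste0(Data_Dir, "/txt/UDHR/*"),
                     docvarsfrom = "filenames",
                     docvarnames = c("document", "language"))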
5.2.5.2.1 Checking structure
## Classes 'readtext' and 'data.frame': 13 obs. of 4 variables:
## $ doc_id : chr "UDHR_chinese.txt" "UDHR_czech.txt" "UDHR_danish.txt" "UDHR_english.txt" ...
## $ text : chr "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布 1948 年 12 月 10 日, 联 合 国 大 会 通"| __truncated__ "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\nÚvod U vědomí toho, že uznání přirozené důstojnosti a rovných a nezcizitelný"| __truncated__ "Den 10. december 1948 vedtog og offentliggjorde FNs tredie generalforsamling Verdenserklæringen om Menneskerett"| __truncated__ "Universal Declaration of Human Rights\nPreamble Whereas recognition of the inherent dignity and of the equal an"| __truncated__ ...
## $ document: chr "UDHR" "UDHR" "UDHR" "UDHR" ...
## $ language: chr "chinese" "czech" "danish" "english" ...
5.2.5.2.2 Unnest
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 13 × 4
## doc_id text document language
## <chr> <chr> <chr> <chr>
## 1 UDHR_chinese.txt "世界人权宣言\n联合国大会一九四八年十二月十日第217A(III)号决议通… UDHR chinese
## 2 UDHR_czech.txt "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\nÚv… UDHR czech
## 3 UDHR_danish.txt "Den 10. december 1948 vedtog og offen… UDHR danish
## 4 UDHR_english.txt "Universal Declaration of Human Rights… UDHR english
## 5 UDHR_french.txt "Déclaration universelle des droits de… UDHR french
## 6 UDHR_georgian.txt "FLFVBFYBC EAKT<FSF CF>JDTKSFJ LTRKFHF… UDHR georgian
## 7 UDHR_greek.txt "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝ… UDHR greek
## 8 UDHR_hungarian.txt "Az Emberi Jogok Egyetemes Nyilatkozat… UDHR hungari…
## 9 UDHR_icelandic.txt "Mannréttindayfirlýsing Sameinuðu þjóð… UDHR iceland…
## 10 UDHR_irish.txt "DEARBHÚ UILE-CHOITEANN CEARTA AN DUIN… UDHR irish
## 11 UDHR_japanese.txt "『世界人権宣言』\n\n(1948.12.10 第3回国連総会採択)\n\… UDHR japanese
## 12 UDHR_russian.txt "Всеобщая декларация прав человека\nПр… UDHR russian
## 13 UDHR_vietnamese.txt "7X\\zQ QJ{Q WR\u007fQ WK\u009b JL±L Y… UDHR vietnam…
5.2.5.3 Twitter data
We use the twitter.json data accessed from here. This is a JSON file (.json) downloaded from the Twitter streaming API.
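A sketch of the import (the file location is an assumption, since the file is downloaded separately; source = "twitter" tells readtext() to expect Twitter-stream JSON):
dat_twitter <- readtext("twitter.json", source = "twitter")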
5.2.5.3.1 Checking structure
## Classes 'readtext' and 'data.frame': 7504 obs. of 44 variables:
## $ doc_id : chr "twitter.json.1" "twitter.json.2" "twitter.json.3" "twitter.json.4" ...
## $ text : chr "@EFC_Jayy UKIP" "RT @Corbynator2:@jeremycorbyn Reaction from people at the Watford Rally:\n“We believe in Jeremy Corbyn!”\n“We n"| __truncated__ "RT @ryvr: Stephen Hawking, the world’s smartest man, backs Jeremy Corbyn https://t.co/2kl3ayLd44 #TuesdayThoughts" "RT @TheGreenParty: How you cast your vote will shape the future. Every single vote counts. Tomorrow, #VoteGreen"| __truncated__ ...
## $ retweet_count : num 0 90 78 244 1896 ...
## $ favorite_count : num 0 108 104 218 2217 ...
## $ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ truncated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ id_str : chr "872596537142116352" "872596536869363712" "872596537444093952" "872596538492637185" ...
## $ in_reply_to_screen_name : chr "EFC_Jayy" NA NA NA ...
## $ source : chr "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>" ...
## $ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ created_at : chr "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" "Wed Jun 07 23:30:01 +0000 2017" ...
## $ in_reply_to_status_id_str: chr "872596176834572288" NA NA NA ...
## $ in_reply_to_user_id_str : chr "4556760676" NA NA NA ...
## $ lang : chr "en" "en" "en" "en" ...
## $ listed_count : num 1 28 2 3 6 2 12 90 1 25 ...
## $ verified : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ location : chr "Japan" "Gondwana" "LDN rt/mention/follow/link ≠ e" "East, England" ...
## $ user_id_str : chr "863929468984995840" "153295243" "273731990" "477177095" ...
## $ description : chr NA "#Black. #Green. #Red. #Aboriginal. #Environmental. #Socialist. #Atheist." "Infovore, atheist, post-Ⓐ, p/t nihilist, lifelong radiophile, aspiring cultural terrorist 🏴 ☮️ 🇵🇸 \nImages: @"| __truncated__ "think outside your own perspective and find transcendence bypassing state opulence" ...
## $ geo_enabled : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ user_created_at : chr "Mon May 15 01:30:11 +0000 2017" "Tue Jun 08 05:05:23 +0000 2010" "Tue Mar 29 01:59:32 +0000 2011" "Sat Jan 28 22:41:07 +0000 2012" ...
## $ statuses_count : num 2930 10223 20934 13603 13179 ...
## $ followers_count : num 367 845 761 321 386 ...
## $ favourites_count : num 1260 9813 14733 5421 5219 ...
## $ protected : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ user_url : chr NA NA NA "http://www.instagram.com/stuhornett/" ...
## $ name : chr "ジョージ" "Yara-ma-yha-who" "Openly classist" "Stu" ...
## $ time_zone : chr NA "London" "London" "Casablanca" ...
## $ user_lang : chr "en" "en" "en" "en" ...
## $ utc_offset : num NA 3600 3600 0 3600 NA NA 3600 NA 3600 ...
## $ friends_count : num 304 439 2761 767 257 ...
## $ screen_name : chr "CoysJoji" "Unkle_Ken" "OpenlyClassist" "StuHornett" ...
## $ country_code : chr NA NA NA NA ...
## $ country : chr NA NA NA NA ...
## $ place_type : logi NA NA NA NA NA NA ...
## $ full_name : chr NA NA NA NA ...
## $ place_name : chr NA NA NA NA ...
## $ place_id : chr NA NA NA NA ...
## $ place_lat : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
## $ place_lon : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
## $ lat : num NA NA NA NA NA NA NA NA NA NA ...
## $ lon : num NA NA NA NA NA NA NA NA NA NA ...
## $ expanded_url : chr NA NA "http://www.independent.co.uk/news/science/stephen-hawking-jeremy-corbyn-labour-theresa-may-conservatives-endors"| __truncated__ NA ...
## $ url : chr NA "" "https://t.co/2kl3ayLd44" NA ...
5.2.5.3.2 Unnest
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 7,504 × 44
## doc_id text retweet_count favorite_count favorited truncated id_str
## <chr> <chr> <dbl> <dbl> <lgl> <lgl> <chr>
## 1 twitter.json.1 "@EF… 0 0 FALSE FALSE 87259…
## 2 twitter.json.2 "RT … 90 108 FALSE FALSE 87259…
## 3 twitter.json.3 "RT … 78 104 FALSE FALSE 87259…
## 4 twitter.json.4 "RT … 244 218 FALSE FALSE 87259…
## 5 twitter.json.5 "RT … 1896 2217 FALSE FALSE 87259…
## 6 twitter.json.6 "RT … 55 52 FALSE FALSE 87259…
## 7 twitter.json.7 "RT … 65 73 FALSE FALSE 87259…
## 8 twitter.json.8 "RT … 30 9 FALSE FALSE 87259…
## 9 twitter.json.9 "RT … 1896 2217 FALSE FALSE 87259…
## 10 twitter.json.10 "Wha… 0 0 FALSE TRUE 87259…
## # ℹ 7,494 more rows
## # ℹ 37 more variables: in_reply_to_screen_name <chr>, source <chr>,
## # retweeted <lgl>, created_at <chr>, in_reply_to_status_id_str <chr>,
## # in_reply_to_user_id_str <chr>, lang <chr>, listed_count <dbl>,
## # verified <lgl>, location <chr>, user_id_str <chr>, description <chr>,
## # geo_enabled <lgl>, user_created_at <chr>, statuses_count <dbl>,
## # followers_count <dbl>, favourites_count <dbl>, protected <lgl>, …
5.2.5.4 Converting from a PDF file
We can also import data in PDF format and obtain document variables from the file names.
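A sketch mirroring the text-file import above, pointing at the PDF versions shipped with readtext; readtext() converts each PDF to plain text:
dat_udhr_pdf <- readtext(paste0(Data_Dir, "/pdf/UDHR/*.pdf"),
                         docvarsfrom = "filenames",
                         docvarnames = c("document", "language"))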
5.2.5.4.1 Check encoding
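Base Encoding() reports the declared encoding of each converted text; a sketch:
Encoding(dat_udhr_pdf$text)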
## [1] "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8" "UTF-8" "UTF-8"
## [8] "UTF-8" "UTF-8" "UTF-8" "UTF-8"
5.2.5.4.2 Checking structure
## Classes 'readtext' and 'data.frame': 11 obs. of 4 variables:
## $ doc_id : chr "UDHR_chinese.pdf" "UDHR_czech.pdf" "UDHR_danish.pdf" "UDHR_english.pdf" ...
## $ text : chr "世界人权宣言\n\n联合国大会一九四八年十二月十日第217A(III)号决议通过并颁布\n1948 年 12 月 10 日, 联 合 国 大 会"| __truncated__ "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\n\nÚvod\n\nU vědomí toho,\nže uznání přirozené důstojnosti a rovných a nezciz"| __truncated__ "Den 10. december 1948 vedtog og offentliggjorde FNs tredie generalforsamling\nVerdenserklæringen om Menneskeret"| __truncated__ "Universal Declaration of Human Rights\n\nPreamble\n\nWhereas recognition of the inherent dignity and of the equ"| __truncated__ ...
## $ document: chr "UDHR" "UDHR" "UDHR" "UDHR" ...
## $ language: chr "chinese" "czech" "danish" "english" ...
5.2.5.4.3 Unnest
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 11 × 4
## doc_id text document language
## <chr> <chr> <chr> <chr>
## 1 UDHR_chinese.pdf "世界人权宣言\n\n联合国大会一九四八年十二月十日第217A(III)号决… UDHR chinese
## 2 UDHR_czech.pdf "VŠEOBECNÁ DEKLARACE LIDSKÝCH PRÁV\n\n… UDHR czech
## 3 UDHR_danish.pdf "Den 10. december 1948 vedtog og offen… UDHR danish
## 4 UDHR_english.pdf "Universal Declaration of Human Rights… UDHR english
## 5 UDHR_french.pdf "Déclaration universelle des droits de… UDHR french
## 6 UDHR_greek.pdf "ΟΙΚΟΥΜΕΝΙΚΗ ΔΙΑΚΗΡΥΞΗ ΓΙΑ ΤΑ ΑΝΘΡΩΠΙΝ… UDHR greek
## 7 UDHR_hungarian.pdf "Az Emberi Jogok Egyetemes Nyilatkozat… UDHR hungari…
## 8 UDHR_irish.pdf "DEARBHÚ UILE-CHOITEANN CEARTA AN DUIN… UDHR irish
## 9 UDHR_japanese.pdf "『世界人権宣言』\n\n\n\n(1948.12.10 第3回国連総会採択… UDHR japanese
## 10 UDHR_russian.pdf "Всеобщая декларация прав человека\n\n… UDHR russian
## 11 UDHR_vietnamese.pdf " 7X\\zQ⇤QJ{Q⇤WR… UDHR vietnam…
5.2.5.5 Different encodings
We look into data with different encodings. This is important because the texts you work with can come in many different character encodings.
5.2.5.5.1 Importing data
We use a regular expression to find all files whose names start with “IndianTreaty_” or “UDHR_” and end with “.txt”:
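A sketch: the files live in a zip archive shipped with readtext, so we first unzip them to a temporary folder (the temporary-folder step is an assumption):
path_temp <- tempdir()
unzip(paste0(Data_Dir, "/data_files_encodedtexts.zip"), exdir = path_temp)
filename <- list.files(path_temp, pattern = "^(IndianTreaty|UDHR)_.*\\.txt$")
head(filename)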
## [1] "IndianTreaty_English_UTF-16LE.txt" "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt" "UDHR_Arabic_UTF-8.txt"
## [5] "UDHR_Arabic_WINDOWS-1256.txt" "UDHR_Chinese_GB2312.txt"
5.2.5.5.2 Extract encoding
We remove the .txt extension from each file name, split each name on _, and keep the third element, which is the encoding.
library(stringr)
# drop the ".txt" extension, then keep the third "_"-separated field
filename <- filename %>%
  str_replace(".txt$", "")
encoding <- purrr::map(str_split(filename, "_"), 3)
head(encoding)
## [[1]]
## [1] "UTF-16LE"
##
## [[2]]
## [1] "UTF-8-BOM"
##
## [[3]]
## [1] "ISO-8859-6"
##
## [[4]]
## [1] "UTF-8"
##
## [[5]]
## [1] "WINDOWS-1256"
##
## [[6]]
## [1] "GB2312"
We feed the encodings to readtext() to convert the various character encodings into UTF-8.
dat_txt <- readtext(paste0(Data_Dir, "/data_files_encodedtexts.zip"),
                    encoding = encoding,
                    docvarsfrom = "filenames",
                    docvarnames = c("document", "language", "input_encoding"))
## Warning: `cols` is now required when using `unnest()`.
## ℹ Please use `cols = c()`.
## # A tibble: 36 × 5
## doc_id text document language input_encoding
## <chr> <chr> <chr> <chr> <chr>
## 1 IndianTreaty_English_UTF-16LE.txt "WHEREAS… IndianT… English UTF-16LE
## 2 IndianTreaty_English_UTF-8-BOM.txt "ARTICLE… IndianT… English UTF-8-BOM
## 3 UDHR_Arabic_ISO-8859-6.txt "الديباج… UDHR Arabic ISO-8859-6
## 4 UDHR_Arabic_UTF-8.txt "الديباج… UDHR Arabic UTF-8
## 5 UDHR_Arabic_WINDOWS-1256.txt "الديباج… UDHR Arabic WINDOWS-1256
## 6 UDHR_Chinese_GB2312.txt "世界人权宣言\… UDHR Chinese GB2312
## 7 UDHR_Chinese_GBK.txt "世界人权宣言\… UDHR Chinese GBK
## 8 UDHR_Chinese_UTF-8.txt "世界人权宣言\… UDHR Chinese UTF-8
## 9 UDHR_English_UTF-16BE.txt "Univers… UDHR English UTF-16BE
## 10 UDHR_English_UTF-16LE.txt "Univers… UDHR English UTF-16LE
## # ℹ 26 more rows
5.2.6 Web scraping
We use the rvest package to obtain data from a specific URL. See here for advanced web scraping, and look at this link for a more straightforward approach.
5.2.6.1 A single webpage
5.2.6.1.1 Read_html
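A sketch; judging from the output, the page being scraped is the tidyverse packages page (the URL is an assumption):
library(rvest)
web_page <- read_html("https://www.tidyverse.org/packages/")
web_page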
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <div id="appTidyverseSite" class="shrinkHeader alwaysShrinkHead ...
Because the downloaded object contains unnecessary information, we process it to extract only the text from the webpage.
5.2.6.1.2 Extract headline
header_web_page <- web_page %>%
  ## extract the h1 headline
  rvest::html_nodes("h1") %>%
  ## extract text
  rvest::html_text()
head(header_web_page)
## [1] "Tidyverse packages"
5.2.6.1.3 Extract text
web_page_txt <- web_page %>%
  ## extract paragraphs
  rvest::html_nodes("p") %>%
  ## extract text
  rvest::html_text()
head(web_page_txt)
## [1] "Install all the packages in the tidyverse by running install.packages(\"tidyverse\")."
## [2] "Run library(tidyverse) to load the core tidyverse and make it available\nin your current R session."
## [3] "Learn more about the tidyverse package at https://tidyverse.tidyverse.org."
## [4] "The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of tidyverse 1.3.0, the following packages are included in the core tidyverse:"
## [5] "ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. Go to docs..."
## [6] "dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges. Go to docs..."
5.2.6.2 Multiple webpages
5.2.6.2.1 Read_html
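A sketch: we read the main page again and collect the link nodes pointing to the nine package sites (the object names and the CSS selector are assumptions):
main_page <- rvest::read_html("https://www.tidyverse.org/packages/")
main_page
# one <a> node per package card; the selector is a guess
package_links <- main_page %>% rvest::html_nodes(".package a")
package_links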
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <div id="appTidyverseSite" class="shrinkHeader alwaysShrinkHead ...
## {xml_nodeset (9)}
## [1] <a href="https://ggplot2.tidyverse.org/" target="_blank">\n <img class ...
## [2] <a href="https://dplyr.tidyverse.org/" target="_blank">\n <img class=" ...
## [3] <a href="https://tidyr.tidyverse.org/" target="_blank">\n <img class=" ...
## [4] <a href="https://readr.tidyverse.org/" target="_blank">\n <img class=" ...
## [5] <a href="https://purrr.tidyverse.org/" target="_blank">\n <img class=" ...
## [6] <a href="https://tibble.tidyverse.org/" target="_blank">\n <img class= ...
## [7] <a href="https://stringr.tidyverse.org/" target="_blank">\n <img class ...
## [8] <a href="https://forcats.tidyverse.org/" target="_blank">\n <img class ...
## [9] <a href="https://lubridate.tidyverse.org/" target="_blank">\n <img cla ...
5.2.6.2.2 Extract links
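We pull the href attribute from each node; a sketch using the hypothetical package_links object from above:
urls <- package_links %>% rvest::html_attr("href")
urls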
## [1] "https://ggplot2.tidyverse.org/" "https://dplyr.tidyverse.org/"
## [3] "https://tidyr.tidyverse.org/" "https://readr.tidyverse.org/"
## [5] "https://purrr.tidyverse.org/" "https://tibble.tidyverse.org/"
## [7] "https://stringr.tidyverse.org/" "https://forcats.tidyverse.org/"
## [9] "https://lubridate.tidyverse.org/"
5.2.6.2.3 Extract subpages
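Each URL is then parsed with read_html(), yielding a list of documents; a sketch:
pages <- purrr::map(urls, rvest::read_html)
pages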
## [[1]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[2]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[3]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[4]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[5]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[6]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[7]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[8]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
##
## [[9]]
## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <a href="#container" class="visually-hidden-focusable">Skip t ...
The structure seems to be similar across all pages.
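Since the structure repeats, the same CSS selector can be mapped over every page; a sketch extracting each package name (the a.navbar-brand selector also appears in the tibble code below):
pages %>%
  map(rvest::html_element, css = "a.navbar-brand") %>%
  map_chr(rvest::html_text)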
## [1] "ggplot2" "dplyr" "tidyr" "readr" "purrr" "tibble"
## [7] "stringr" "forcats" "lubridate"
and we can extract the version number in the same way:
pages %>%
  map(rvest::html_element, css = "small.nav-text.text-muted.me-auto") %>%
  map_chr(rvest::html_text)
## [1] "3.5.2" "1.1.4" "1.3.1" "2.1.5" "1.1.0" "3.3.0" "1.5.1" "1.0.0" "1.9.4"
5.2.6.2.4 Extract text
and we can also gather everything into a tibble:
pages_table <- tibble(
  name = pages %>%
    map(rvest::html_element, css = "a.navbar-brand") %>%
    map_chr(rvest::html_text),
  version = pages %>%
    map(rvest::html_element, css = "small.nav-text.text-muted.me-auto") %>%
    map_chr(rvest::html_text),
  CRAN = pages %>%
    map(rvest::html_element, css = "ul.list-unstyled > li:nth-child(1) > a") %>%
    map_chr(rvest::html_attr, name = "href"),
  Learn = pages %>%
    map(rvest::html_element, css = "ul.list-unstyled > li:nth-child(4) > a") %>%
    map_chr(rvest::html_attr, name = "href"),
  text = pages %>%
    map(rvest::html_element, css = "body") %>%
    map_chr(rvest::html_text2)
)
pages_table
## # A tibble: 9 × 5
## name version CRAN Learn text
## <chr> <chr> <chr> <chr> <chr>
## 1 ggplot2 3.5.2 https://cloud.r-project.org/package=ggplot2 https:/… "Ski…
## 2 dplyr 1.1.4 https://cloud.r-project.org/package=dplyr http://… "Ski…
## 3 tidyr 1.3.1 https://cloud.r-project.org/package=tidyr https:/… "Ski…
## 4 readr 2.1.5 https://cloud.r-project.org/package=readr http://… "Ski…
## 5 purrr 1.1.0 https://cloud.r-project.org/package=purrr http://… "Ski…
## 6 tibble 3.3.0 https://cloud.r-project.org/package=tibble https:/… "Ski…
## 7 stringr 1.5.1 https://cloud.r-project.org/package=stringr http://… "Ski…
## 8 forcats 1.0.0 https://cloud.r-project.org/package=forcats http://… "Ski…
## 9 lubridate 1.9.4 https://cloud.r-project.org/package=lubridate https:/… "Ski…