Unsupervised Machine Learning II

Session 10 - Exercise

Published

13.01.2025

Focus of the exercise: getting to know stm topic modeling with R

Reproduce typical steps in the evaluation of an stm topic model with the help of tidytext (Silge and Robinson 2016).
Sharpen your understanding of how topic models are interpreted.
Examine and interpret the influence of metadata.

Background

Today's data basis: OpenAlex

“Works” from the OpenAlex database related to literature reviews in the social sciences between 2013 and 2023, collected via the API with openalexR (Aria et al. 2024).
Detailed information on the search query and its results can be found here.

Preparation

Important information

Please make sure that you have opened the RStudio project that belongs to this exercise. Only then will all dependencies work correctly.
To keep the exercise running smoothly, the tasks do not involve collecting data yourself; a practice dataset is provided instead.

Packages

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
    here, qs, # file management
    magrittr, janitor, # data wrangling
    easystats, sjmisc, # data analysis
    gt, gtExtras, # table visualization
    ggpubr, ggwordcloud, # visualization
    # text analysis    
    tidytext, widyr, # based on tidytext
    quanteda, # based on quanteda
    quanteda.textmodels, quanteda.textplots, quanteda.textstats, 
    stm, # structural topic modeling
    openalexR, pushoverr, tictoc, 
    tidyverse # load last to avoid masking issues
  )

Data import and preprocessing

review_works <- qs::qread(here("data/session-07/openalex-review_works-2013_2023.qs"))

# Create correct data
review_subsample <- review_works %>% 
    # Create additional factor variables
    mutate(
        publication_year_fct = as.factor(publication_year), 
        type_fct = as.factor(type)
        ) %>%
    # Restriction: language and type
    filter(language == "en") %>% 
    filter(type == "article") %>%
    # Data transformation
    unnest(topics, names_sep = "_") %>%
    filter(topics_name == "field") %>% 
    filter(topics_i == 1) %>% 
    # Restriction: research field
    filter(
        topics_display_name == "Social Sciences" |
        topics_display_name == "Psychology"
    ) %>% 
    mutate(
        field = as.factor(topics_display_name)
    ) %>% 
    # Restriction: no entries without an abstract
    filter(!is.na(ab))

# Create corpus
quanteda_corpus <- review_subsample %>% 
  quanteda::corpus(
    docid_field = "id", 
    text_field = "ab"
  )

# Tokenize
quanteda_token <- quanteda_corpus %>% 
  quanteda::tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE, 
    remove_numbers = TRUE, 
    remove_url = TRUE, 
    split_tags = FALSE # keep hashtags and mentions
  ) %>% 
  quanteda::tokens_tolower() %>% 
  quanteda::tokens_remove(
    pattern = stopwords("en")
    )

# Convert to Document-Feature-Matrix (DFM)
quanteda_dfm <- quanteda_token %>% 
  quanteda::dfm()

# Pruning
quanteda_dfm_trim <- quanteda_dfm %>% 
  dfm_trim( 
    min_docfreq = 10/nrow(review_subsample),
    max_docfreq = 0.99, 
    docfreq_type = "prop")

# Convert for stm topic modeling
quanteda_stm <- quanteda_dfm_trim %>% 
   convert(to = "stm")
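
Because docfreq_type = "prop" is used in the pruning step, min_docfreq = 10/nrow(review_subsample) expresses a count threshold as a proportion: a term is kept only if it occurs in at least about 10 documents. The following is a minimal base-R sketch of that arithmetic. The matrix toy_dtm and the threshold of 2 documents are made-up values for illustration, and the filter only imitates the minimum-document-frequency idea; it does not call dfm_trim() itself.

```r
# Toy document-term matrix (made-up counts): 4 documents x 3 terms
toy_dtm <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    3, 1, 0,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("review", "rare", "study"))
)

# Document frequency: number of documents in which each term occurs
doc_freq <- colSums(toy_dtm > 0)

# A minimum of 2 documents written as a proportion,
# analogous to 10 / nrow(review_subsample) in the exercise
min_prop <- 2 / nrow(toy_dtm)

# Keep only terms whose share of documents reaches the threshold
keep <- (doc_freq / nrow(toy_dtm)) >= min_prop
toy_dtm_trim <- toy_dtm[, keep, drop = FALSE]

colnames(toy_dtm_trim)
```

In this toy case the term "rare" occurs in only one document and is dropped, while the other two terms remain.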

🛠️ Practical application

Attention, please read!

Before you start working on the following 📋 Exercises, please make sure that you have run all chunks of the Preparation section. You can do this with the “Run all chunks above” button of the next chunk.
If you have questions about the code, take a look at the showcase (.qmd or .html). The showcase is a compact presentation of the R code used in the slides, so you can use it to look up the code building blocks behind the R outputs shown there.

📋 Exercise 1: Visualizing topic prevalence

1.1. Selecting the appropriate model

Create a new dataset stm_mdl_k40
- based on the dataset stm_search
  1. Use filter(k == 40) to select the model with 40 topics.
  2. Use pull(mdl) %>% .[[1]] to extract the column and the element that contains the model.
  3. Save this transformation by creating a new dataset named stm_mdl_k40.
Check the transformation by typing stm_mdl_k40 into the console.

Show solution

# Pull tpm with 40 topics
stm_mdl_k40 <- stm_search %>% 
  filter(k == 40) %>% 
  pull(mdl) %>% 
  .[[1]]

# Check
stm_mdl_k40

A topic model with 40 topics, 36650 documents and a 14322 word dictionary.
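
The step pull(mdl) %>% .[[1]] is needed because the fitted models sit in a list-column: after filtering, the column is still a list of length one, and [[1]] unwraps the model itself. A small base-R sketch of the same idea, assuming a made-up stand-in (toy_search) in which character strings take the place of fitted models:

```r
# Toy stand-in for stm_search: one row per candidate k,
# with the "models" stored in a list-column (strings instead of real models)
toy_search <- data.frame(k = c(20, 40, 60))
toy_search$mdl <- list(
  "model with 20 topics",
  "model with 40 topics",
  "model with 60 topics"
)

# Selecting the row for k == 40 still returns a list of length 1
mdl_column <- toy_search[toy_search$k == 40, "mdl"]
class(mdl_column)
length(mdl_column)

# [[1]] extracts the single element, like .[[1]] in the pipeline
toy_mdl_k40 <- mdl_column[[1]]
toy_mdl_k40
```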

1.2. Identifying the top terms for each topic

Create a new dataset td_beta
- based on the dataset stm_mdl_k40,
- Use tidy(matrix = "beta") to create the beta matrix.
Create a new dataset top_terms
- based on the dataset td_beta,
  1. Use arrange(beta) to sort the terms by beta.
  2. Group the terms by topic with group_by(topic).
  3. Extract the 7 terms with the highest beta per topic with top_n(7, beta).
  4. Sort the terms in descending order with arrange(-beta).
  5. Select the variables topic and term with select(topic, term).
  6. Collect the top terms per topic with summarise(terms = list(term)).
  7. Turn the collected terms per topic into a single string with map(terms, paste, collapse = ", ").
  8. “Unpack” the terms from the list (unnest) with unnest(cols = c(terms)).
Check the transformation by typing top_terms into the console.

Show solution

# Create tidy beta matrix (per-topic word probabilities)
td_beta <- tidy(stm_mdl_k40, matrix = "beta")

# Create top terms
top_terms <- td_beta %>%
  arrange(beta) %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(-beta) %>%
  select(topic, term) %>%
  summarise(terms = list(term)) %>%
  mutate(terms = map(terms, paste, collapse = ", ")) %>% 
  unnest(cols = c(terms))

# Output
top_terms

# A tibble: 40 × 2
   topic terms                                                                  
   <int> <chr>                                                                  
 1     1 care, nursing, healthcare, nurses, professionals, patients, patient    
 2     2 students, school, academic, education, educational, schools, literacy  
 3     3 的, 研究, 和, rs, 在, 了, 性                                           
 4     4 elderly, #x0d, can, review, literature, google, keywords               
 5     5 article, journal, decision, describes, aids, pressure, section         
 6     6 prevalence, countries, among, studies, rates, population, higher       
 7     7 depression, anxiety, psychological, stress, life, symptoms, cancer     
 8     8 people, services, community, service, barriers, participation, support 
 9     9 factors, relationship, positive, studies, associated, behavior, negati…
10    10 b, et, al, r, s, c, d                                                  
# ℹ 30 more rows
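
The logic of the top_terms pipeline (rank terms within each topic, keep the best ones, collapse them into one string) can be sketched in base R on a tiny table. The data frame toy_beta and its values are made up for illustration, and only the top 2 terms per topic are kept instead of 7:

```r
# Toy beta table (made-up values): one row per topic-term pair
toy_beta <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("care", "nursing", "the", "school", "students", "of"),
  beta  = c(0.30, 0.20, 0.01, 0.15, 0.40, 0.02)
)

# Sort by topic and by descending beta within each topic
toy_sorted <- toy_beta[order(toy_beta$topic, -toy_beta$beta), ]

# Keep the top 2 terms per topic
toy_top <- do.call(
  rbind,
  lapply(split(toy_sorted, toy_sorted$topic), head, n = 2)
)

# Collapse the terms of each topic into one comma-separated string
toy_top_terms <- aggregate(
  term ~ topic, data = toy_top, FUN = paste, collapse = ", "
)
toy_top_terms
```

The low-probability filler terms ("the", "of") drop out, and each topic ends up with a single label string, which is the shape that top_terms has in the exercise.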

1.3. Creating the prevalence table for the topics

Create a new dataset td_gamma
- based on the dataset stm_mdl_k40,
- Use tidy(matrix = "gamma") to create the gamma matrix.
- Use document_names = names(quanteda_stm$documents) to store the document names.
Create a new dataset prevalence
- based on the dataset td_gamma,
  1. Group the topics by topic with group_by(topic).
  2. Compute the mean gamma value per topic with summarise(gamma = mean(gamma)).
  3. Sort the topics by gamma in descending order with arrange(desc(gamma)).
  4. Join the top terms to the topics with left_join(top_terms, by = "topic").
  5. Rework the variable topic with mutate():
    1. Create a new variable topic with paste0("Topic ", sprintf("%02d", topic)).
    2. Order the topics by gamma with reorder(topic, gamma).
Create a table as output
- based on the dataset prevalence,
  1. Use gt() to create a table.
  2. Format the column gamma with fmt_number(columns = c(gamma), decimals = 2) to show only two decimal places.
  3. Use gtExtras::gt_theme_538() to adjust the design of the table.
✍️ Based on the output of prevalence, note which topics you consider problematic and why.
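
The relabeling in step 5 combines two small tricks: sprintf("%02d", ...) pads topic numbers with a leading zero so that alphabetical and numeric order agree, and reorder() turns the labels into a factor whose levels follow a second variable. A short base-R sketch, with made-up topic numbers and gamma values:

```r
# Topic numbers as they come out of the model (made-up selection)
topic_id <- c(3, 16, 9, 20)

# Without padding, alphabetical sorting puts "Topic 16" before "Topic 3"
sort(paste0("Topic ", topic_id))

# With zero-padding, alphabetical and numeric order agree
topic_label <- paste0("Topic ", sprintf("%02d", topic_id))
sort(topic_label)

# reorder() orders the factor levels by a second variable (here: toy gamma)
toy_gamma <- c(0.00, 0.09, 0.04, 0.03)
topic_fct <- reorder(topic_label, toy_gamma)
levels(topic_fct)
```

Ordering the factor levels by gamma mainly matters for plots, where the levels determine the order of the bars or axis entries.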

Show solution

# Create tidy gamma matrix
td_gamma <- tidy(
  stm_mdl_k40, 
  matrix = "gamma", 
  document_names = names(quanteda_stm$documents)
  )

# Create prevalence
prevalence <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ",sprintf("%02d", topic)),
         topic = reorder(topic, gamma))

# Output
prevalence %>% 
  gt() %>% 
  fmt_number(
    columns = c(gamma), 
    decimals = 2) %>% 
  gtExtras::gt_theme_538()

topic	gamma	terms
Topic 16	0.09	research, review, literature, future, findings, systematic, studies
Topic 19	0.07	studies, included, review, quality, evidence, data, systematic
Topic 38	0.05	articles, search, science, databases, review, systematic, criteria
Topic 39	0.04	study, research, literature, analysis, used, results, method
Topic 09	0.04	factors, relationship, positive, studies, associated, behavior, negative
Topic 25	0.04	learning, education, students, teaching, teachers, skills, higher
Topic 29	0.04	cultural, change, policy, political, human, identity, different
Topic 35	0.04	interventions, intervention, effectiveness, n, outcomes, effective, studies
Topic 20	0.03	management, tourism, development, public, paper, economic, marketing
Topic 14	0.03	digital, use, information, technology, online, communication, technologies
Topic 34	0.03	effect, effects, meta-analysis, p, ptsd, ci, significant
Topic 30	0.03	disorders, disorder, suicide, eating, risk, suicidal, psychiatric
Topic 33	0.03	physical, cognitive, activity, studies, body, exercise, weight
Topic 28	0.03	sleep, ci, meta-analysis, risk, studies, pooled, p
Topic 21	0.03	treatment, therapy, trials, patients, music, pain, controlled
Topic 18	0.03	measures, assessment, used, tools, measurement, instruments, measure
Topic 32	0.03	health, mental, stigma, problems, wellbeing, outcomes, review
Topic 06	0.02	prevalence, countries, among, studies, rates, population, higher
Topic 08	0.02	people, services, community, service, barriers, participation, support
Topic 15	0.02	reviews, outcomes, reporting, systematic, outcome, items, preferred
Topic 13	0.02	family, support, resilience, experiences, caregivers, parents, parental
Topic 07	0.02	depression, anxiety, psychological, stress, life, symptoms, cancer
Topic 17	0.02	children, adolescents, language, development, early, child, skills
Topic 01	0.02	care, nursing, healthcare, nurses, professionals, patients, patient
Topic 23	0.02	social, media, older, adults, use, loneliness, people
Topic 37	0.02	training, programs, work, program, professional, skills, workplace
Topic 36	0.02	literature, history, american, black, book, literary, historical
Topic 24	0.02	violence, women, abuse, sexual, ipv, child, trauma
Topic 26	0.02	covid-19, pandemic, vaccine, vaccination, health, acceptance, disease
Topic 27	0.02	use, gender, sexual, substance, alcohol, sex, men
Topic 02	0.02	students, school, academic, education, educational, schools, literacy
Topic 31	0.02	environment, environmental, urban, travel, physical, transport, safety
Topic 40	0.01	authors, interest, information, group, studies, case, term
Topic 12	0.01	crime, review, police, et, al, studies, may
Topic 11	0.01	university, author, papers, college, search, review, share
Topic 04	0.01	elderly, #x0d, can, review, literature, google, keywords
Topic 10	0.01	b, et, al, r, s, c, d
Topic 05	0.00	article, journal, decision, describes, aids, pressure, section
Topic 22	0.00	de, la, y, en, los, el, se
Topic 03	0.00	的, 研究, 和, rs, 在, 了, 性
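
The gamma column in this table is the mean document-topic proportion, so it can be read as the share of an average abstract that is devoted to a topic. Since the proportions of each document sum to 1, the prevalence values across all topics also sum to 1 (up to rounding). A minimal base-R sketch with a made-up gamma matrix (toy_gamma) illustrates this:

```r
# Toy gamma matrix (made-up values): 3 documents x 3 topics, each row sums to 1
toy_gamma <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.4, 0.3, 0.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("topic", 1:3))
)

# Prevalence: mean gamma per topic, sorted in descending order
toy_prevalence <- sort(colMeans(toy_gamma), decreasing = TRUE)
toy_prevalence

# Because every row sums to 1, the prevalence values sum to 1 as well
sum(toy_prevalence)
```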

📋 Exercise 2: Influence of the metadata

2.1. Estimating the metadata effects

Create a new dataset effects:
- Use the function estimateEffect() to estimate the effects.
- Use 1:40 ~ publication_year_fct + field as the formula argument to estimate the effects of publication year and research field.
- Use stm_mdl_k40 as the model to be analyzed.
- Use meta = quanteda_stm$meta to supply the metadata for the estimation.

Show solution

# Create data
effects <- estimateEffect(
    1:40 ~ publication_year_fct + field,
    stm_mdl_k40,
    meta = quanteda_stm$meta)

2.2. Examining the effects

Create a new dataset effects_tidy, a cleaned table of the effects:
- based on the dataset effects
1. Use the tidy() function to bring the effects into a tidy format.
2. Filter the data:
  1. Remove rows in which term has the value (Intercept).
  2. Keep only the rows in which term == "fieldSocial Sciences".
3. Remove the column term with select(-term).
Create a table to inspect the effects
- based on the dataset effects_tidy:
1. Use the function gt() to create a table.
2. Format all numeric variables with fmt_number(columns = -c(topic), decimals = 3) to display only three decimal places.
3. Use data_color(columns = estimate, method = "numeric", palette = "viridis") to color-code the estimates.
4. Apply the design gtExtras::gt_theme_538().
✍️ Note which topic is most strongly represented in the research field “Social Sciences”.

Show solution

# Filter effect data
effects_tidy <- effects %>% 
  tidy() %>% 
  filter(
    term != "(Intercept)",
    term == "fieldSocial Sciences") %>% 
    select(-term)


# Explore effects (table output)
effects_tidy %>% 
  gt() %>% 
  fmt_number(
    columns = -c(topic),
    decimals = 3
  ) %>% 
  data_color(
    columns = estimate,
    method = "numeric",
    palette = "viridis"
  ) %>% 
  gtExtras::gt_theme_538()

topic	estimate	std.error	statistic	p.value
1	−0.008	0.001	−8.704	0.000
2	0.011	0.001	14.856	0.000
3	0.000	0.000	−1.652	0.099
4	0.005	0.001	6.650	0.000
5	0.003	0.000	11.989	0.000
6	0.005	0.001	5.766	0.000
7	−0.020	0.001	−28.707	0.000
8	0.012	0.001	14.356	0.000
9	−0.015	0.001	−17.875	0.000
10	−0.001	0.000	−1.250	0.211
11	0.006	0.001	9.776	0.000
12	0.008	0.001	10.280	0.000
13	−0.009	0.001	−10.087	0.000
14	0.021	0.001	17.756	0.000
15	0.001	0.001	1.894	0.058
16	0.037	0.001	30.223	0.000
17	−0.011	0.001	−11.977	0.000
18	−0.011	0.001	−13.983	0.000
19	−0.026	0.001	−27.781	0.000
20	0.052	0.001	40.332	0.000
21	−0.041	0.001	−35.699	0.000
22	−0.001	0.000	−1.943	0.052
23	0.013	0.001	18.159	0.000
24	0.002	0.001	1.601	0.109
25	0.036	0.001	25.743	0.000
26	0.008	0.001	6.687	0.000
27	−0.008	0.001	−8.271	0.000
28	−0.028	0.001	−23.281	0.000
29	0.038	0.001	38.214	0.000
30	−0.048	0.001	−36.278	0.000
31	0.007	0.001	7.455	0.000
32	−0.013	0.001	−16.339	0.000
33	−0.034	0.001	−32.499	0.000
34	−0.039	0.001	−32.592	0.000
35	−0.025	0.001	−26.281	0.000
36	0.022	0.001	20.535	0.000
37	0.011	0.001	16.406	0.000
38	−0.004	0.001	−5.334	0.000
39	0.042	0.001	42.204	0.000
40	0.001	0.000	2.560	0.010

Show solution

#### Notes:
#
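
When reading the estimates, it helps to recall how the coefficient is coded. The term name fieldSocial Sciences indicates that Psychology is the reference level of field, since as.factor() orders levels alphabetically. Under that reading, a positive estimate means the topic takes up a larger expected share in Social Sciences abstracts than in Psychology abstracts, with publication year held constant. The sketch below shows the level ordering and one way to pick out the largest estimate; toy_field and toy_effects contain made-up values, not the results from the table above.

```r
# as.factor()/factor() orders levels alphabetically,
# so "Psychology" becomes the reference level
toy_field <- factor(c("Social Sciences", "Psychology", "Psychology"))
levels(toy_field)[1]

# Toy effects table with made-up estimates
toy_effects <- data.frame(
  topic    = c(1, 2, 3, 4),
  estimate = c(-0.010, 0.030, 0.005, -0.020)
)

# Sort from the largest positive to the largest negative difference
toy_effects[order(-toy_effects$estimate), ]

# Topic with the largest positive difference relative to the reference level
top_topic <- toy_effects$topic[which.max(toy_effects$estimate)]
top_topic
```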

📋 Exercise 3: A single topic in focus

3.1. Naming topic k = 20

Name topic k = 20 from the model stm_mdl_k40:
- Use the function labelTopics().
- Specify topic 20 as a parameter with topics = 20.
✍️ Note down the topic name. Briefly justify your decision.

Show solution

# Create topic label
stm_mdl_k40 %>% labelTopics(topics = 20)

Topic 20 Top Words:
     Highest Prob: management, tourism, development, public, paper, economic, marketing 
     FREX: tourism, marketing, consumer, halal, sustainable, disaster, economy 
     Lift: hotel, mega-events, tourism, post-disaster, smes, tourist, b2b 
     Score: tourism, marketing, halal, business, governance, sustainable, disaster

Show solution

# Topic name:

3.2. Merging with the OpenAlex data

Create a new dataset gamma_export
- based on the dataset stm_mdl_k40:
  1. Use tidy() to create the gamma matrix. Pass matrix = "gamma" and document_names = names(quanteda_stm$documents) as parameters.
  2. Group the documents by document with group_by(document).
  3. For each document, keep the row with the highest gamma value with slice_max(gamma).
  4. Remove the grouping with dplyr::ungroup().
  5. Join the data with review_subsample using left_join(review_subsample, by = c("document" = "id")).
- Rename the column document to id with dplyr::rename(id = document).
- Create a new variable stm_topic with mutate(). Use as.factor(paste("Topic", sprintf("%02d", topic))) to name the topics and store them as a factor.
Check the transformation with the glimpse() function to make sure the data were created correctly.

Show solution

# Create gamma export
gamma_export <- stm_mdl_k40 %>% 
  tidytext::tidy(
    matrix = "gamma", 
    document_names = names(quanteda_stm$documents)) %>%
  dplyr::group_by(document) %>% 
  dplyr::slice_max(gamma) %>% 
  dplyr::ungroup() %>% 
  dplyr::left_join(review_subsample, by = c("document" = "id")) %>% 
  dplyr::rename(id = document) %>% 
  dplyr::mutate(
    stm_topic = as.factor(paste("Topic", sprintf("%02d", topic)))
  )

# Check
glimpse(gamma_export)

Rows: 36,650
Columns: 49
$ id                          <chr> "https://openalex.org/W1000529773", "https…
$ topic                       <int> 25, 14, 14, 14, 7, 16, 34, 10, 30, 19, 9, …
$ gamma                       <dbl> 0.5042240, 0.2308276, 0.4425161, 0.3935272…
$ title                       <chr> "A critical evaluation of the teaching of …
$ display_name                <chr> "A critical evaluation of the teaching of …
$ author                      <list> [<data.frame[1 x 12]>], [<data.frame[2 x …
$ ab                          <chr> "A Critical Evaluation of the Teaching of …
$ publication_date            <chr> "2014-01-16", "2015-05-26", "2014-05-22", …
$ relevance_score             <dbl> 4.012791, 27.896568, 32.074608, 23.670475,…
$ so                          <chr> NA, "Proceedings of the annual conference …
$ so_id                       <chr> NA, "https://openalex.org/S4306523984", "h…
$ host_organization           <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Taylor & …
$ issn_l                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1381-1118…
$ url                         <chr> "https://uwispace.sta.uwi.edu/dspace/bitst…
$ pdf_url                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "h…
$ license                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "c…
$ version                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "p…
$ first_page                  <chr> NA, "76", NA, NA, NA, NA, NA, NA, "1", "11…
$ last_page                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, "21", "122…
$ volume                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "20", "78"…
$ issue                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, "1", NA, "…
$ is_oa                       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ is_oa_anywhere              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ oa_status                   <chr> "closed", "closed", "closed", "closed", "c…
$ oa_url                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://e…
$ any_repository_has_fulltext <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ language                    <chr> "en", "en", "en", "en", "en", "en", "en", …
$ grants                      <list> NA, NA, NA, NA, NA, NA, NA, NA, <"https:/…
$ cited_by_count              <int> 0, 1, 1, 1, 1, 0, 2, 0, 226, 159, 122, 31,…
$ counts_by_year              <list> NA, [<data.frame[1 x 2]>], [<data.frame[1…
$ publication_year            <int> 2014, 2015, 2014, 2013, 2013, 2015, 2015, …
$ cited_by_api_url            <chr> "https://api.openalex.org/works?filter=cit…
$ ids                         <list> <"https://openalex.org/W1000529773", "100…
$ doi                         <chr> NA, "https://doi.org/10.5555/2814058.28141…
$ type                        <chr> "article", "article", "article", "article"…
$ referenced_works            <list> NA, <"https://openalex.org/W1526029332", …
$ related_works               <list> <"https://openalex.org/W958254955", "http…
$ is_paratext                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ is_retracted                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ concepts                    <list> [<data.frame[7 x 5]>], [<data.frame[16 x …
$ topics_i                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ topics_score                <dbl> 0.9566, 0.9812, 0.9894, 0.9999, 0.9752, 0.…
$ topics_name                 <chr> "field", "field", "field", "field", "field…
$ topics_id                   <chr> "https://openalex.org/fields/32", "https:/…
$ topics_display_name         <chr> "Psychology", "Social Sciences", "Psycholo…
$ publication_year_fct        <fct> 2014, 2015, 2014, 2013, 2013, 2015, 2015, …
$ type_fct                    <fct> article, article, article, article, articl…
$ field                       <fct> Psychology, Social Sciences, Psychology, S…
$ stm_topic                   <fct> Topic 25, Topic 14, Topic 14, Topic 14, To…
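
The grouped slice_max(gamma) step assigns each document to its dominant topic, that is, the topic with the highest gamma in that document. Note that slice_max() keeps ties by default; the 36,650 rows above match the number of documents in the model, so no ties occurred here. The same assignment can be sketched in base R on a made-up gamma matrix (toy_gamma, with hypothetical document ids W1 to W3):

```r
# Toy gamma matrix (made-up values): rows are documents, columns are topics
toy_gamma <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.2, 0.3, 0.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("W1", "W2", "W3"), NULL)
)

# Dominant topic per document: column index of the largest gamma in each row
dominant_topic <- apply(toy_gamma, 1, which.max)

# One row per document with its main topic, the matching gamma and a label
toy_export <- data.frame(
  id        = rownames(toy_gamma),
  topic     = unname(dominant_topic),
  gamma     = apply(toy_gamma, 1, max),
  stm_topic = factor(paste("Topic", sprintf("%02d", dominant_topic)))
)
toy_export
```

A document assigned this way can still contain substantial shares of other topics; the gamma value shows how dominant the main topic actually is.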

3.3. Distribution parameters of topic 20

Create an output to inspect the distribution parameters
- based on the dataset gamma_export:
  1. Filter the data by topic == 20.
  2. Use the select() function to choose the variables gamma, relevance_score and cited_by_count.
  3. Use the function datawizard::describe_distribution() to compute the distribution parameters.
✍️ Identify and note the following information:
- How many abstracts have topic 20 as their main topic?
- What is the average relevance score?
- How many citations do the documents have on average?
- How many citations does the most cited document have?

Show solution

# Create distribution parameters
gamma_export %>% 
  filter(topic == 20) %>%
  select(gamma, relevance_score, cited_by_count) %>% 
  datawizard::describe_distribution()

Variable        |  Mean |    SD |   IQR |          Range | Skewness | Kurtosis
------------------------------------------------------------------------------
gamma           |  0.37 |  0.13 |  0.17 |   [0.12, 0.87] |     0.74 |     0.43
relevance_score | 32.55 | 40.35 | 36.74 | [2.01, 402.59] |     3.07 |    14.76
cited_by_count  | 13.54 | 50.51 |  7.00 | [0.00, 948.00] |    10.41 |   143.55

Variable        |    n | n_Missing
----------------------------------
gamma           | 1477 |         0
relevance_score | 1477 |         0
cited_by_count  | 1477 |         0

Show solution

#### Notes
# Number of abstracts with topic 20 as main topic:
# Average relevance score:
# Average number of citations:
# Number of citations of the most cited document:
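
The four questions map onto the describe_distribution() output as follows: the number of abstracts is the n column, the averages are in the Mean column, and the maximum is the upper end of Range. Citation counts are typically strongly right-skewed (see the Skewness column), so the mean can sit far above a typical document. A base-R sketch with made-up citation counts (toy_citations) shows the corresponding calculations and why the median is a useful companion to the mean:

```r
# Made-up citation counts with one highly cited document
toy_citations <- c(0, 1, 1, 2, 3, 5, 120)

# Number of documents (corresponds to n)
n_docs <- length(toy_citations)

# Mean and maximum (correspond to Mean and the upper end of Range)
mean_citations <- mean(toy_citations)
max_citations  <- max(toy_citations)

# The median is far less affected by the single outlier
median_citations <- median(toy_citations)

c(n = n_docs, mean = mean_citations, median = median_citations, max = max_citations)
```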

3.4. Top documents of the topic

Identify the top documents by creating a new dataset top_docs_k20
- based on the dataset gamma_export:
  1. Filter the dataset by stm_topic == "Topic 20".
  2. Sort the data by gamma in descending order with arrange(-gamma).
  3. Select the variables title, so, gamma, type and ab with select().
  4. Keep the top 5 rows with slice_head(n = 5).
Create an output to inspect the top documents
- based on the dataset top_docs_k20:
  1. Use gt() to create a table.
  2. Format the column gamma with fmt_number(columns = c(gamma), decimals = 2) to show only two decimal places.
  3. Use gtExtras::gt_theme_538() to adjust the design of the table.
✍️ Based on the abstracts and titles of the top documents:
- Which subject areas do the documents cover?
- Would you keep or change the topic name chosen in section 3.1?

Show solution

# Identify top documents for topic 20
top_docs_k20 <- gamma_export %>% 
  filter(stm_topic == "Topic 20") %>%
  arrange(-gamma) %>%
  select(title, so, gamma, type, ab) %>%
  slice_head(n = 5) 

# Create output
top_docs_k20 %>% 
  gt() %>% 
  fmt_number(
    columns = c(gamma), 
    decimals = 2) %>% 
  gtExtras::gt_theme_538()

title	so	gamma	type	ab
Literature Review of Overseas Tourism Destination Brand Research	Journal of Chongqing Technology and Business University	0.87	article	Applying brand theory to the study on tourism destination has been always hot issues for overseas scholars since 1990s.Tourism destination branding management is a significant marketing tool which can bring about effective identification internally,achieving differentiation with external competitors.Systematically reviewing and analyzing recent overseas tourism destination brand literatures,this paper makes conclusion and evaluation of the tourism destination brand construction,branding,brand stakeholders,brand operation,branding performance evaluation to provide reference for domestic tourism destination brand research and management.
Natural Disasters in Colombia and Their Impact on the Food Security of the Affected Population. a Quick Review of the Literature.	Social Science Research Network	0.85	article	Natural disasters in Colombia significantly impact multiple domains of the affected population, including food security. Those events can cause food production and distribution interruptions, leading to scarcity and increased prices. Additionally, they can damage infrastructure and limit the communities' ability to access food. Food assistance during disasters is crucial to ensuring food security. The former is crucial for human survival and development, and entities responsible for risk management and food assistance play a fundamental role in protecting populations affected by natural disasters. Entities responsible for risk management in Colombia, such as the National Unit for Disaster Risk Management (UNGRD) and the Departmental and Municipal Risk Management Councils, coordinate efforts to provide food assistance and other basic needs. The Colombia Food Bank also plays an essential role in responding to food emergencies as a first responder. In this article, we investigate some of the leading natural disasters that Colombia has suffered in recent years and how these events have affected different communities. Likewise, we explore how the response has been made from the risk management framework, highlighting food assistance.
Government Responsibility as the Main Stakeholder in Tourism Development With Collaboration Approach: Literature Review on Heritage Tourism	NA	0.84	article	currently, the economic growth of some countries is a contribution by the vast development of the tourism sector, and one of the potential destinations is heritage motivation.More than fifty-three of previous study founds five elements that associated with Heritage Tourism Development, which are four elements influencing directly and one element is impact after development as an outcome.The study is focusing on stakeholders responsibility which led by government to do the development of heritage tourism.Barriers of policies and low attention to strategic plants and policies are an influencer to the obstacles because of less attention from the leader.Collaboration approach helped government to control the system in heritage tourism process.
Business Strategy in Management Perspective: A Literature Review	Indonesian Journal of Economic & Management Sciences	0.83	article	Business development in the world has entered the era of free markets and broad competition, not only in small areas but also in large areas. Efforts made by a company to win the market are by providing competitive advantages, analyzing competitors, and implementing effective and efficient marketing strategies
A Literature Review on Structural Reform of Agricultural Supply Side	NA	0.82	article	The structural reform of the agricultural supply side is the major deployment of the "No. 1 document" on agriculture, not only for the direction of agricultural industry development, but also for the agricultural industry structure optimization adjustment to play a needle "tonic", also engaged in agricultural economic research experts and scholars Correct future and long-term research direction to play a "heading" role.This paper summarizes the policy of "structural reform of agricultural supply side", which is related to the optimization and upgrading of agricultural industry structure, the cultivation of agricultural enterprises, the integration of agriculture, the development strategy of agricultural brand, the innovation of agricultural technology, "Rural land management system reform", "agricultural development policy" and other research results.

References

Aria, Massimo, Trang Le, Corrado Cuccurullo, Alessandra Belfiore, and June Choe. 2024. “openalexR: An R-Tool for Collecting Bibliometric Data from OpenAlex.” The R Journal 15 (4): 167–80. https://doi.org/10.32614/rj-2023-089.

Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3): 37. https://doi.org/10.21105/joss.00037.