Title: | Keyword Assisted Topic Models |
---|---|
Description: | Fits keyword assisted topic models (keyATM) using collapsed Gibbs samplers. The keyATM combines latent Dirichlet allocation (LDA) with a small number of keywords selected by researchers in order to improve the interpretability and topic classification of the LDA. The keyATM can also incorporate covariates and directly model time trends. The keyATM is proposed in Eshima, Imai, and Sasaki (2024) <doi:10.1111/ajps.12779>. |
Authors: | Shusei Eshima [aut, cre] |
Maintainer: | Shusei Eshima <[email protected]> |
License: | GPL-3 |
Version: | 0.5.2 |
Built: | 2025-03-12 05:17:33 UTC |
Source: | https://github.com/keyatm/keyatm |
The implementation of keyATM models.
Maintainer: Shusei Eshima [email protected] (ORCID)
Authors:
Tomoya Sasaki [email protected]
Kosuke Imai [email protected]
Other contributors:
Chung-hong Chan [email protected] (ORCID) [contributor]
Romain François (ORCID) [contributor]
William Lowe [email protected] [contributor]
Seo-young Silvia Kim [email protected] (ORCID) [contributor]
Estimate document-topic distribution by strata (for covariate models)
by_strata_DocTopic(x, by_var, labels, by_values = NULL, ...)
x |
the output from the covariate keyATM model (see keyATM()). |
by_var |
character. The name of the variable to use. |
labels |
character. The labels for the values specified in by_var. |
by_values |
numeric. Specific values for by_var. |
... |
other arguments passed on to the predict.keyATM_output() function. |
strata_doctopic object (a list).
Estimate subsetted topic-word distribution
by_strata_TopicWord(x, keyATM_docs, by)
x |
the output from a keyATM model (see keyATM()). |
keyATM_docs |
an object generated by keyATM_read(). |
by |
a vector whose length is the number of documents. |
strata_topicword object (a list).
Return covariates used in the iteration
covariates_get(x)
x |
the output from the covariate keyATM model (see keyATM()). |
Show covariates information
covariates_info(x)
x |
the output from the covariate keyATM model (see keyATM()). |
Fit keyATM models.
keyATM( docs, model, no_keyword_topics, keywords = list(), model_settings = list(), priors = list(), options = list(), keep = c() )
docs |
texts read via keyATM_read(). |
model |
keyATM model: "base", "covariates", and "dynamic". |
no_keyword_topics |
the number of regular topics. |
keywords |
a list of keywords. |
model_settings |
a list of model specific settings (details are in the online documentation). |
priors |
a list of priors of parameters. |
options |
a list of options. |
keep |
a vector of the names of elements you want to keep in output. |
A keyATM_output object containing:
number of keyword topics
number of no-keyword topics
number of terms (number of unique words)
number of documents
the name of the model
topic proportions for each document (document-topic distribution)
topic specific word generation probabilities (topic-word distribution)
number of tokens assigned to each topic
number of times each word type appears
length of each document in tokens
words in the vocabulary (a vector of unique words)
priors
options
specified keywords
perplexity and log-likelihood
estimated pi (the probability of using keyword topic word distribution) for the last iteration
values stored during iterations
outputs you specified to store in the keep option
information about the fitting
https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html
## Not run:
library(keyATM)
library(quanteda)
data(keyATM_data_bills)
bills_keywords <- keyATM_data_bills$keywords
bills_dfm <- keyATM_data_bills$doc_dfm  # quanteda dfm object
keyATM_docs <- keyATM_read(bills_dfm)

# keyATM Base
out <- keyATM(docs = keyATM_docs, model = "base",
              no_keyword_topics = 5, keywords = bills_keywords)

# Visit our website for full examples: https://keyatm.github.io/keyATM/
## End(Not run)
Bills data
keyATM_data_bills
A list with the following objects:
A quanteda dfm object of 140 documents. The text data is a part of the Congressional Bills scraped from CONGRESS.GOV.
An integer vector which takes one if the Republican proposed the bill.
A list of length 4 which contains keywords for four selected topics.
An integer vector indicating the session number of each bill.
An integer vector indicating 40 labels.
An integer vector indicating all labels.
CONGRESS.GOV
Read texts and create a keyATM_docs object, which is a list of texts.
keyATM_read( texts, encoding = "UTF-8", check = TRUE, keep_docnames = FALSE, split = 0 )
texts |
input. keyATM takes a quanteda dfm (dgCMatrix), data.frame, tibble tbl_df, or a vector of file paths. |
encoding |
character. Only used when texts is a vector of file paths. |
check |
logical. If |
keep_docnames |
logical. If |
split |
numeric. This option works only with a quanteda dfm. It creates two subsets of the dfm by randomly splitting each document (i.e., the total number of documents is the same in both subsets). This option specifies the split proportion. Default is 0. |
a keyATM_docs object. The first element is a list whose elements are split texts. The length of the list equals the number of documents.
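The effect of the split argument described above can be pictured with a small base R sketch. It is not the package's implementation: the helper split_tokens() and the toy documents are hypothetical, and the sketch only assumes that each document's tokens are divided at random according to the given proportion, so that both subsets keep one entry per document.

```r
# Hypothetical helper (not part of keyATM): randomly split one document's
# tokens into two parts, sending roughly `prop` of them to the first part.
split_tokens <- function(tokens, prop) {
  n_first <- floor(length(tokens) * prop)
  idx <- sample.int(length(tokens), size = n_first)
  in_first <- seq_along(tokens) %in% idx  # logical mask keeps token order
  list(first = tokens[in_first], second = tokens[!in_first])
}

set.seed(1)
# Toy corpus: each document is a character vector of tokens
docs <- list(
  c("tax", "bill", "senate", "vote", "budget"),
  c("health", "care", "program", "public")
)

halves <- lapply(docs, split_tokens, prop = 0.4)
first_subset <- lapply(halves, `[[`, "first")
second_subset <- lapply(halves, `[[`, "second")

# Both subsets contain one (shorter) entry per original document
lengths(first_subset)
lengths(second_subset)
```

Every document appears in both subsets, which matches the statement that the total number of documents is the same in the two subsets.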
## Not run:
# Use quanteda dfm
keyATM_docs <- keyATM_read(texts = quanteda_dfm)

# Use data.frame or tibble (texts should be stored in a column named `text`)
keyATM_docs <- keyATM_read(texts = data_frame_object)
keyATM_docs <- keyATM_read(texts = tibble_object)

# Use a vector that stores full paths to the text files
# (list.files() expects a regular expression, so match the extension explicitly)
files <- list.files(doc_folder, pattern = "\\.txt$", full.names = TRUE)
keyATM_docs <- keyATM_read(texts = files)
## End(Not run)
Experimental feature: Fit keyATM base with Collapsed Variational Bayes
keyATMvb( docs, model, no_keyword_topics, keywords = list(), model_settings = list(), vb_options = list(), priors = list(), options = list(), keep = list() )
docs |
texts read via keyATM_read(). |
model |
keyATM model: |
no_keyword_topics |
the number of regular topics |
keywords |
a list of keywords |
model_settings |
a list of model specific settings (details are in the online documentation) |
vb_options |
a list of settings for Variational Bayes. |
priors |
a list of priors of parameters |
options |
a list of options, the same as in keyATM(). |
keep |
a vector of the names of elements you want to keep in output |
A keyATM_output object
https://keyatm.github.io/keyATM/articles/pkgdown_files/keyATMvb.html
Run multinomial regression with Polya-Gamma augmentation. There is no need to call this function directly. The keyATM Covariate internally uses this.
multiPGreg(Y, X, num_topics, PG_params, iter = 1, store_lambda = 0)
Y |
Outcomes. |
X |
Covariates. |
num_topics |
Number of topics. |
PG_params |
Parameters used in this function. |
iter |
The default is 1. |
store_lambda |
The default is 0. |
Show a diagnosis plot of alpha
plot_alpha(x, start = 0, show_topic = NULL, scales = "fixed")
x |
the output from a keyATM model (see keyATM()). |
start |
integer. The start of slice iteration. Default is 0. |
show_topic |
a vector to specify topic indexes to show. Default is NULL. |
scales |
character. Controls the scale of the y-axis (the scales parameter in ggplot2::facet_wrap()). Default is "fixed". |
keyATM_fig object
Show a diagnosis plot of log-likelihood and perplexity
plot_modelfit(x, start = 1)
x |
the output from a keyATM model (see keyATM()). |
start |
integer. The starting value of iteration to use in the plot. Default is 1. |
keyATM_fig object.
Show a diagnosis plot of pi
plot_pi( x, show_topic = NULL, start = 0, ci = 0.9, method = c("hdi", "eti"), point = c("mean", "median") )
x |
the output from a keyATM model (see keyATM()). |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
start |
integer. The starting value of iteration to use in the plot. Default is 0. |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. "mean" (default) or "median". |
keyATM_fig object.
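The two interval types named in the method argument ("hdi" and "eti") differ in how they pick the bounds. The following base R sketch illustrates the general definitions on simulated draws. It is not the package's implementation; the helper names eti() and hdi() and the Beta-distributed stand-in draws are assumptions made for illustration.

```r
# Equal-tailed interval: cut the same probability mass from both tails
eti <- function(draws, ci = 0.9) {
  tail_prob <- (1 - ci) / 2
  unname(quantile(draws, probs = c(tail_prob, 1 - tail_prob)))
}

# Highest density interval: the narrowest window of sorted draws
# that still contains a share `ci` of them
hdi <- function(draws, ci = 0.9) {
  sorted <- sort(draws)
  n <- length(sorted)
  window <- ceiling(ci * n)               # number of draws inside the interval
  starts <- seq_len(n - window + 1)       # every possible left end point
  widths <- sorted[starts + window - 1] - sorted[starts]
  best <- which.min(widths)
  c(sorted[best], sorted[best + window - 1])
}

set.seed(42)
# Right-skewed draws standing in for posterior samples of a proportion
draws <- rbeta(5000, shape1 = 2, shape2 = 20)

eti_bounds <- eti(draws, ci = 0.9)
hdi_bounds <- hdi(draws, ci = 0.9)

eti_bounds
hdi_bounds
```

For a skewed distribution the highest density interval is typically shifted toward the mode and is never wider than the equal-tailed interval computed from the same draws.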
Plot time trend
plot_timetrend( x, show_topic = NULL, time_index_label = NULL, ci = 0.9, method = c("hdi", "eti"), point = c("mean", "median"), xlab = "Time", scales = "fixed", show_point = TRUE, ... )
x |
the output from the dynamic keyATM model (see keyATM()). |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
time_index_label |
a vector. The label for time index. The length should be equal to the number of documents (time index provided to keyATM()). |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. "mean" (default) or "median". |
xlab |
a character. |
scales |
character. Controls the scale of the y-axis (the scales parameter in ggplot2::facet_wrap()). Default is "fixed". |
show_point |
logical. The default is TRUE. |
... |
additional arguments not used. |
keyATM_fig object.
Show the expected proportion of the corpus belonging to each topic
plot_topicprop( x, n = 3, show_topic = NULL, show_topwords = TRUE, label_topic = NULL, order = c("proportion", "topicid"), xmax = NULL )
x |
the output from a keyATM model (see keyATM()). |
n |
The number of top words to show. Default is 3. |
show_topic |
an integer or a vector. Indicate topics to visualize. Default is NULL. |
show_topwords |
logical. Show top words. The default is TRUE. |
label_topic |
a character vector. The name of the topics in the plot. |
order |
The order of topics. |
xmax |
a numeric. Indicate the max value on the x axis |
keyATM_fig object
Plot document-topic distribution by strata (for covariate models)
## S3 method for class 'strata_doctopic' plot( x, show_topic = NULL, var_name = NULL, by = c("topic", "covariate"), ci = 0.9, method = c("hdi", "eti"), point = c("mean", "median"), width = 0.1, show_point = TRUE, ... )
x |
a strata_doctopic object (see by_strata_DocTopic()). |
show_topic |
a vector or an integer. Indicate topics to visualize. |
var_name |
the name of the variable in the plot. |
by |
"topic" or "covariate". Default is "topic". |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. "mean" (default) or "median". |
width |
numeric. Width of the error bars. |
show_point |
logical. Show point estimates. The default is TRUE. |
... |
additional arguments not used. |
keyATM_fig object.
save_fig(), by_strata_DocTopic()
Predict topic proportions for the covariate keyATM
## S3 method for class 'keyATM_output' predict( object, newdata, transform = FALSE, burn_in = NULL, parallel = TRUE, posterior_mean = TRUE, ci = 0.9, method = c("hdi", "eti"), point = c("mean", "median"), label = NULL, raw_values = FALSE, ... )
object |
the keyATM_output object for the covariate model. |
newdata |
New observations which should be predicted. |
transform |
Transform and standardize the |
burn_in |
integer. Burn-in period. If not specified, it is half of the samples. Default is NULL. |
parallel |
logical. If |
posterior_mean |
logical. If |
ci |
value of the credible interval (between 0 and 1) to be estimated. Default is 0.9. |
method |
method for computing the credible interval. The Highest Density Interval ("hdi", default) or Equal-tailed Interval ("eti"). |
point |
method for computing the point estimate. "mean" (default) or "median". |
label |
a character. Add a |
raw_values |
a logical. Returns raw values. The default is FALSE. |
... |
additional arguments not used. |
This function converts or reads a dictionary object from quanteda to a named list. "Glob"-style wildcard expressions (e.g. politic*) are resolved based on the available terms in your texts.
read_keywords(file = NULL, docs = NULL, dictionary = NULL, split = TRUE, ...)
file |
file identifier for a foreign dictionary, e.g. path to a dictionary in YAML or LIWC format |
docs |
texts read via keyATM_read(). |
dictionary |
a quanteda dictionary object, ignored if file is not NULL |
split |
logical. Whether multi-word terms should be separated, e.g. "air force" splits into "air" and "force". |
... |
additional parameters for quanteda::dictionary() |
a named list which can be used as keywords for e.g. keyATM()
## Not run:
library(keyATM)
library(quanteda)

## using the moral foundation dictionary example from quanteda
dictfile <- tempfile()
download.file("http://bit.ly/37cV95h", dictfile)

data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm
keyATM_docs <- keyATM_read(bills_dfm)
read_keywords(file = dictfile, docs = keyATM_docs, format = "LIWC")
## End(Not run)
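The idea of resolving "glob"-style wildcard entries against the terms available in the texts can be sketched in base R. This is an illustration only, not how read_keywords() is implemented: the helper resolve_globs(), the toy vocabulary, and the toy dictionary are hypothetical, and the sketch relies on utils::glob2rx() to turn a wildcard pattern into a regular expression.

```r
# Hypothetical helper (not part of keyATM): expand wildcard patterns
# into the matching terms of a vocabulary.
resolve_globs <- function(patterns, vocabulary) {
  hits <- lapply(patterns, function(p) {
    grep(glob2rx(p), vocabulary, value = TRUE)  # glob2rx("politic*") is "^politic"
  })
  unique(unlist(hits))
}

# Toy vocabulary standing in for the unique terms of a corpus
vocabulary <- c("politic", "politics", "political", "policy", "air", "force")

# Toy dictionary with one wildcard entry and two plain entries
dictionary <- list(
  Politics = c("politic*"),
  Military = c("air", "force")
)

keywords <- lapply(dictionary, resolve_globs, vocabulary = vocabulary)
keywords
```

The result is a named list of character vectors, which is the same shape as the keywords argument used elsewhere in this manual.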
Save a figure
save_fig(x, filename, ...)
x |
the keyATM_fig object. |
filename |
file name to create on disk. |
... |
other arguments passed on to the ggplot2::ggsave() function. |
visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic(), values_fig()
Mimno, David et al. 2011. “Optimizing Semantic Coherence in Topic Models.” In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK.: Association for Computational Linguistics, 262–72. https://aclanthology.org/D11-1024.
semantic_coherence(x, docs, n = 10)
x |
the output from a keyATM model (see keyATM()). |
docs |
texts read via keyATM_read(). |
n |
integer. The number of terms to visualize. Default is 10. |
Equation 1 of Mimno et al. 2011 adapted to keyATM.
A vector of topic coherence metrics, one for each topic.
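For orientation, Equation 1 of Mimno et al. (2011) sums, over ordered pairs of a topic's top words, the log of the co-document frequency plus one divided by the document frequency of the higher-ranked word. The base R sketch below computes that quantity for a single topic on a toy document-term matrix. It illustrates the original equation only; how the package adapts it to keyATM is not shown here, and the helper coherence() and the toy data are assumptions.

```r
# Hypothetical helper: coherence of one topic following Equation 1 of
# Mimno et al. (2011). `top_words` is ordered from most to least probable,
# `dtm` is a documents x terms count matrix with column names.
coherence <- function(top_words, dtm) {
  present <- (dtm[, top_words, drop = FALSE] > 0) * 1  # 1 if the word occurs in the document
  doc_freq <- colSums(present)                         # D(v): documents containing v
  co_freq <- crossprod(present)                        # D(v, v'): documents containing both
  score <- 0
  for (m in seq_along(top_words)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_freq[m, l] + 1) / doc_freq[[l]])
    }
  }
  score
}

# Toy corpus: 4 documents, 3 terms
dtm <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("tax", "budget", "health"))
)

coherent_topic <- coherence(c("tax", "budget"), dtm)    # words that co-occur often
incoherent_topic <- coherence(c("tax", "health"), dtm)  # words that rarely co-occur
c(coherent_topic, incoherent_topic)
```

Values closer to zero indicate top words that tend to appear in the same documents. The sketch assumes every top word occurs in at least one document; otherwise the ratio is undefined.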
Show the top documents for each topic
top_docs(x, n = 10)
x |
the output from a keyATM model (see keyATM()). |
n |
How many documents to show. Default is 10. |
An n x k table of the top n documents for each topic; each number is a document index.
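The n x k layout of document indices can be pictured with a base R sketch that ranks documents by their topic proportions. It is not the package's code: the helper top_docs_sketch() and the toy document-topic matrix theta are hypothetical, and the sketch simply assumes that documents are ranked within each topic by their proportion for that topic.

```r
# Toy document-topic matrix: 4 documents (rows) x 3 topics (columns),
# each row sums to one
theta <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.60, 0.30,
    0.25, 0.35, 0.40,
    0.50, 0.15, 0.35),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("Topic1", "Topic2", "Topic3"))
)

# Hypothetical helper: for each topic (column), return the indices of the
# n documents with the largest proportion (assumes n >= 2 so that apply()
# returns a matrix)
top_docs_sketch <- function(theta, n) {
  apply(theta, 2, function(p) order(p, decreasing = TRUE)[seq_len(n)])
}

top <- top_docs_sketch(theta, n = 2)
top
```

Each column lists document indices for one topic, with the most representative document in the first row.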
Show the top topics for each document
top_topics(x, n = 2)
x |
the output from a keyATM model (see keyATM()). |
n |
integer. The number of topics to show. Default is 2. |
An n x k table of the top n topics in each document.
If show_keyword is TRUE then words in their keyword topics are suffixed with a check mark. Words from another keyword topic are labeled with the name of that category.
top_words(x, n = 10, measure = c("probability", "lift"), show_keyword = TRUE)
x |
the output (see keyATM()). |
n |
integer. The number of terms to visualize. Default is 10. |
measure |
character. The way to sort the terms: "probability" (default) or "lift". |
show_keyword |
logical. If TRUE, keywords are marked. Default is TRUE. |
An n x k table of the top n words in each topic
Get values used to create a figure
values_fig(x)
x |
the keyATM_fig object. |
save_fig(), visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic()
Visualize the proportion of keywords in the documents.
visualize_keywords(docs, keywords, prune = TRUE, label_size = 3.2)
docs |
a keyATM_docs object, generated by keyATM_read(). |
keywords |
a list of keywords |
prune |
logical. If |
label_size |
the size of keyword labels in the output plot. Default is 3.2. |
keyATM_fig object
## Not run:
# Prepare a keyATM_docs object
keyATM_docs <- keyATM_read(input)

# Keywords are in a list
keywords <- list(Education = c("education", "child", "student"),
                 Health = c("public", "health", "program"))

# Visualize keywords
keyATM_viz <- visualize_keywords(keyATM_docs, keywords)

# View a figure
keyATM_viz

# Save a figure
save_fig(keyATM_viz, filename)
## End(Not run)
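To make "the proportion of keywords in the documents" concrete, the base R sketch below computes, for each keyword, its share of all tokens in a toy corpus. It is an assumption about the kind of quantity being visualized, not the package's implementation; the helper keyword_proportion() and the toy documents are hypothetical.

```r
# Hypothetical helper: share of all corpus tokens taken by each keyword,
# sorted within each keyword topic from most to least frequent.
keyword_proportion <- function(docs, keywords) {
  tokens <- unlist(docs)
  total <- length(tokens)
  lapply(keywords, function(words) {
    counts <- vapply(words, function(w) sum(tokens == w), numeric(1))
    sort(counts / total, decreasing = TRUE)
  })
}

# Toy corpus: each document is a character vector of tokens
docs <- list(
  c("education", "student", "health", "public"),
  c("health", "program", "child", "health")
)

keywords <- list(Education = c("education", "child", "student"),
                 Health = c("public", "health", "program"))

props <- keyword_proportion(docs, keywords)
props
```

Keywords with a very small share give the model little information to work with, so a check of this kind is useful before fitting.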
Fit weighted LDA models.
weightedLDA( docs, model, number_of_topics, model_settings = list(), priors = list(), options = list(), keep = c() )
docs |
texts read via keyATM_read(). |
model |
Weighted LDA model: |
number_of_topics |
the number of regular topics. |
model_settings |
a list of model specific settings (details are in the online documentation). |
priors |
a list of priors of parameters. |
options |
a list of options (details are in the documentation of keyATM()). |
keep |
a vector of the names of elements you want to keep in output. |
A keyATM_output object containing:
number of terms (number of unique words)
number of documents
the name of the model
topic proportions for each document (document-topic distribution)
topic specific word generation probabilities (topic-word distribution)
number of tokens assigned to each topic
number of times each word type appears
length of each document in tokens
words in the vocabulary (a vector of unique words)
priors
options
NULL for LDA models
perplexity and log-likelihood
estimated pi for the last iteration (NULL for LDA models)
values stored during iterations
number of topics
outputs you specified to store in the keep option
information about the fitting
https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html
## Not run:
library(keyATM)
library(quanteda)
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm  # quanteda dfm object
keyATM_docs <- keyATM_read(bills_dfm)

# Weighted LDA
out <- weightedLDA(docs = keyATM_docs, model = "base",
                   number_of_topics = 5)

# Visit our website for full examples: https://keyatm.github.io/keyATM/
## End(Not run)