Package 'keyATM'

Title: Keyword Assisted Topic Models
Description: Fits keyword assisted topic models (keyATM) using collapsed Gibbs samplers. The keyATM combines latent Dirichlet allocation (LDA) models with a small number of keywords selected by researchers in order to improve the interpretability and topic classification of the LDA. The keyATM can also incorporate covariates and directly model time trends. The keyATM is proposed in Eshima, Imai, and Sasaki (2024) <doi:10.1111/ajps.12779>.
Authors: Shusei Eshima [aut, cre], Tomoya Sasaki [aut], Kosuke Imai [aut], Chung-hong Chan [ctb], Romain François [ctb], William Lowe [ctb], Seo-young Silvia Kim [ctb]
Maintainer: Shusei Eshima <[email protected]>
License: GPL-3
Version: 0.5.2
Built: 2025-03-12 05:17:33 UTC
Source: https://github.com/keyatm/keyatm


Keyword Assisted Topic Models

Description

The implementation of keyATM models.

Author(s)

Maintainer: Shusei Eshima [email protected] (ORCID)

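Authors:

  • Tomoya Sasaki

  • Kosuke Imai

Other contributors:

  • Chung-hong Chan

  • Romain François

  • William Lowe

  • Seo-young Silvia Kim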

See Also

Useful links:

  • https://keyatm.github.io/keyATM/

  • https://github.com/keyatm/keyatm


Estimate document-topic distribution by strata (for covariate models)

Description

Estimate document-topic distribution by strata (for covariate models)

Usage

by_strata_DocTopic(x, by_var, labels, by_values = NULL, ...)

Arguments

x

the output from the covariate keyATM model (see keyATM()).

by_var

character. The name of the variable to use.

labels

character. The labels for the values specified in by_var (ascending order).

by_values

numeric. Specific values for by_var, ordered from small to large. If it is not specified, all values in by_var will be used.

...

other arguments passed on to the predict.keyATM_output() function.

Value

strata_doctopic object (a list).


Estimate subsetted topic-word distribution

Description

Estimate subsetted topic-word distribution

Usage

by_strata_TopicWord(x, keyATM_docs, by)

Arguments

x

the output from a keyATM model (see keyATM()).

keyATM_docs

an object generated by keyATM_read().

by

a vector whose length is the number of documents.

Value

strata_topicword object (a list).
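
The following is a minimal sketch of estimating topic-word distributions by strata, using the bundled bills data. Keeping "Z" and "S" through the keep argument is an assumption taken from the package's online documentation, not from this manual; the seed, the reduced iteration count, and the choice of the party indicator as the strata variable are illustrative only.

```r
library(keyATM)

data(keyATM_data_bills)
keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm)

# Assumption: the fitted object must retain "Z" and "S" for this function.
out <- keyATM(
  docs = keyATM_docs,
  model = "base",
  no_keyword_topics = 3,
  keywords = keyATM_data_bills$keywords,
  options = list(seed = 250, iterations = 200),  # short run, for illustration
  keep = c("Z", "S")
)

# One stratum value per document: here, the party indicator shipped with the data.
strata_tw <- by_strata_TopicWord(out, keyATM_docs, by = keyATM_data_bills$cov)

# Top words within each stratum.
top_words(strata_tw, n = 3)
```

The by vector must have one entry per document, so any document-level grouping variable of the right length could be used in place of the party indicator.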


Return covariates used in the iteration

Description

Return covariates used in the iteration

Usage

covariates_get(x)

Arguments

x

the output from the covariate keyATM model (see keyATM()).


Show covariates information

Description

Show covariates information

Usage

covariates_info(x)

Arguments

x

the output from the covariate keyATM model (see keyATM()).


keyATM main function

Description

Fit keyATM models.

Usage

keyATM(
  docs,
  model,
  no_keyword_topics,
  keywords = list(),
  model_settings = list(),
  priors = list(),
  options = list(),
  keep = c()
)

Arguments

docs

texts read via keyATM_read().

model

keyATM model: base, covariates, or dynamic.

no_keyword_topics

the number of regular topics.

keywords

a list of keywords.

model_settings

a list of model specific settings (details are in the online documentation).

priors

a list of priors of parameters.

options

a list of options:

  • seed: A numeric value for random seed. If it is not provided, the package randomly selects a seed.

  • iterations: An integer. Number of iterations. Default is 1500.

  • verbose: If TRUE, it prints loglikelihood and perplexity. Default is FALSE.

  • llk_per: An integer. If the value is j, keyATM stores the log-likelihood and perplexity every j iterations. Default is 10.

  • use_weights: If TRUE, use weights. Default is TRUE.

  • weights_type: There are four types of weights: weights based on information theory (information-theory), weights based on inverse frequency (inv-freq), and normalized versions of each (information-theory-normalized and inv-freq-normalized). Default is information-theory.

  • prune: If TRUE, prune keywords that do not appear in the corpus. Default is TRUE.

  • store_theta: If TRUE or 1, it stores θ (document-topic distribution) for the iterations specified by thinning. Default is FALSE (same as 0).

  • store_pi: If TRUE or 1, it stores π (the probability of using keyword topic word distribution) for the iterations specified by thinning. Default is FALSE (same as 0).

  • thinning: An integer. If the value is j, keyATM stores the following parameters every j iterations. The default is 5.

    • theta: For all models. If store_theta is TRUE document-level topic assignment is stored (sufficient statistics to calculate document-topic distributions theta).

    • alpha: For the base and dynamic models. In the base model alpha is shared across all documents whereas each state has different alpha in the dynamic model.

    • lambda: coefficients in the covariate model.

    • R: For the dynamic model. The state each document belongs to.

    • P: For the dynamic model. The state transition probability.

  • parallel_init: Parallelize processes to speed up initialization. Default is FALSE. Call plan() (from the future package) before using this feature.

  • resume: The resume argument is used to save and load the intermediate results of the keyATM fitting process, allowing you to resume the fitting from a previous state. The default value is NULL (do not resume).

keep

a vector of the names of elements you want to keep in output.

Value

A keyATM_output object containing:

keyword_k

number of keyword topics

no_keyword_topics

number of no-keyword topics

V

number of terms (number of unique words)

N

number of documents

model

the name of the model

theta

topic proportions for each document (document-topic distribution)

phi

topic specific word generation probabilities (topic-word distribution)

topic_counts

number of tokens assigned to each topic

word_counts

number of times each word type appears

doc_lens

length of each document in tokens

vocab

words in the vocabulary (a vector of unique words)

priors

priors

options

options

keywords_raw

specified keywords

model_fit

perplexity and log-likelihood

pi

estimated π (the probability of using keyword topic word distribution) for the last iteration

values_iter

values stored during iterations

kept_values

outputs you specified to store in the keep option

information

information about the fitting

See Also

https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html

Examples

## Not run: 
  library(keyATM)
  library(quanteda)
  data(keyATM_data_bills)
  bills_keywords <- keyATM_data_bills$keywords
  bills_dfm <- keyATM_data_bills$doc_dfm  # quanteda dfm object
  keyATM_docs <- keyATM_read(bills_dfm)

  # keyATM Base
  out <- keyATM(docs = keyATM_docs, model = "base",
                no_keyword_topics = 5, keywords = bills_keywords)

  # Visit our website for full examples: https://keyatm.github.io/keyATM/

## End(Not run)
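
The options described under Arguments are passed as a named list. The following is a minimal sketch using the bundled bills data; the seed, the reduced iteration count, and the thinning value are illustrative choices rather than recommendations.

```r
library(keyATM)

data(keyATM_data_bills)
keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm)

out <- keyATM(
  docs = keyATM_docs,
  model = "base",
  no_keyword_topics = 5,
  keywords = keyATM_data_bills$keywords,
  options = list(
    seed = 250,          # fix the random seed for reproducibility
    iterations = 200,    # far fewer than the default 1500; for a quick check only
    llk_per = 10,        # store log-likelihood and perplexity every 10 iterations
    store_theta = TRUE,  # keep theta at the thinned iterations
    thinning = 5
  )
)

top_words(out, n = 5)      # top 5 words per topic
fig <- plot_modelfit(out)  # keyATM_fig object; print it to view the plot
```

A short run like this is only useful for checking that the inputs are well formed; the diagnostic plot from plot_modelfit() is one way to judge whether more iterations are needed.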

Bills data

Description

Bills data

Usage

keyATM_data_bills

Format

A list with the following objects:

doc_dfm

A quanteda dfm object of 140 documents. The text data is a part of the Congressional Bills scraped from CONGRESS.GOV.

cov

An integer vector that takes the value 1 if a Republican proposed the bill.

keywords

A list of length 4 which contains keywords for four selected topics.

time_index

An integer vector indicating the session number of each bill.

labels

An integer vector indicating 40 labels.

labels_all

An integer vector indicating all labels.

Source

CONGRESS.GOV


Read texts

Description

Read texts and create a keyATM_docs object, which is a list of texts.

Usage

keyATM_read(
  texts,
  encoding = "UTF-8",
  check = TRUE,
  keep_docnames = FALSE,
  split = 0
)

Arguments

texts

input. keyATM takes a quanteda dfm (dgCMatrix), data.frame, tibble tbl_df, or a vector of file paths.

encoding

character. Only used when texts is a vector of file paths. Default is UTF-8.

check

logical. If TRUE, check whether there is anything wrong with the structure of texts. Default is TRUE.

keep_docnames

logical. If TRUE, it keeps the document names in a quanteda dfm. Default is FALSE.

split

numeric. This option works only with a quanteda dfm. It creates two subsets of the dfm by randomly splitting each document (i.e., both subsets contain the same number of documents). This option specifies the split proportion. Default is 0.

Value

a keyATM_docs object. The first element is a list whose elements are split texts. The length of the list equals the number of documents.

Examples

## Not run: 
 # Use quanteda dfm
 keyATM_docs <- keyATM_read(texts = quanteda_dfm)

 # Use data.frame or tibble (texts should be stored in a column named `text`)
 keyATM_docs <- keyATM_read(texts = data_frame_object)
 keyATM_docs <- keyATM_read(texts = tibble_object)

 # Use a vector that stores full paths to the text files
 files <- list.files(doc_folder, pattern = "\\.txt$", full.names = TRUE)  # pattern is a regular expression
 keyATM_docs <- keyATM_read(texts = files)


## End(Not run)

keyATM with Collapsed Variational Bayes

Description

Experimental feature: Fit keyATM base with Collapsed Variational Bayes

Usage

keyATMvb(
  docs,
  model,
  no_keyword_topics,
  keywords = list(),
  model_settings = list(),
  vb_options = list(),
  priors = list(),
  options = list(),
  keep = list()
)

Arguments

docs

texts read via keyATM_read()

model

keyATM model: base, covariates, or dynamic

no_keyword_topics

the number of regular topics

keywords

a list of keywords

model_settings

a list of model specific settings (details are in the online documentation)

vb_options

a list of settings for Variational Bayes

  • convtol: the default is 1e-4

  • init: mcmc (default) or random

priors

a list of priors of parameters

options

a list of options, the same as in keyATM(). Options are used when the initialization method is mcmc.

keep

a vector of the names of elements you want to keep in output

Value

A keyATM_output object

See Also

https://keyatm.github.io/keyATM/articles/pkgdown_files/keyATMvb.html


Run multinomial regression with Polya-Gamma augmentation

Description

Run multinomial regression with Polya-Gamma augmentation. There is no need to call this function directly. The covariate keyATM model uses it internally.

Usage

multiPGreg(Y, X, num_topics, PG_params, iter = 1, store_lambda = 0)

Arguments

Y

Outcomes.

X

Covariates.

num_topics

Number of topics.

PG_params

Parameters used in this function.

iter

The default is 1.

store_lambda

The default is 0.


Show a diagnosis plot of alpha

Description

Show a diagnosis plot of alpha

Usage

plot_alpha(x, start = 0, show_topic = NULL, scales = "fixed")

Arguments

x

the output from a keyATM model (see keyATM()).

start

integer. The start of slice iteration. Default is 0.

show_topic

a vector to specify topic indexes to show. Default is NULL.

scales

character. Control the scale of y-axis (the parameter in ggplot2::facet_wrap()): free adjusts y-axis for parameters. Default is fixed.

Value

keyATM_fig object

See Also

save_fig()


Show a diagnosis plot of log-likelihood and perplexity

Description

Show a diagnosis plot of log-likelihood and perplexity

Usage

plot_modelfit(x, start = 1)

Arguments

x

the output from a keyATM model (see keyATM()).

start

integer. The starting value of iteration to use in plot. Default is 1.

Value

keyATM_fig object.

See Also

save_fig()


Show a diagnosis plot of pi

Description

Show a diagnosis plot of pi

Usage

plot_pi(
  x,
  show_topic = NULL,
  start = 0,
  ci = 0.9,
  method = c("hdi", "eti"),
  point = c("mean", "median")
)

Arguments

x

the output from a keyATM model (see keyATM()).

show_topic

an integer or a vector. Indicate topics to visualize. Default is NULL.

start

integer. The starting value of iteration to use in the plot. Default is 0.

ci

value of the credible interval (between 0 and 1) to be estimated. Default is 0.9 (90%). This is an option when calculating credible intervals (you need to set store_pi = TRUE in keyATM()).

method

method for computing the credible interval. The Highest Density Interval (hdi, default) or Equal-tailed Interval (eti). This is an option when calculating credible intervals (you need to set store_pi = TRUE in keyATM()).

point

method for computing the point estimate. mean (default) or median. This is an option when calculating credible intervals (you need to set store_pi = TRUE in keyATM()).

Value

keyATM_fig object.

See Also

save_fig()
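
The method argument chooses between two kinds of credible interval. The following base-R sketch illustrates the difference on simulated draws; the helper functions eti() and hdi() are hypothetical illustrations written for this example and are not the package's internal code.

```r
# Equal-tailed interval: cut the same probability mass from each tail.
eti <- function(x, ci = 0.9) {
  tail_prob <- (1 - ci) / 2
  unname(quantile(x, c(tail_prob, 1 - tail_prob)))
}

# Highest density interval: the shortest interval that contains
# a fraction `ci` of the sorted draws.
hdi <- function(x, ci = 0.9) {
  x <- sort(x)
  n <- length(x)
  window <- ceiling(ci * n)          # number of draws inside the interval
  n_candidates <- n - window + 1     # number of possible starting positions
  widths <- x[window:n] - x[1:n_candidates]
  start <- which.min(widths)
  c(x[start], x[start + window - 1])
}

set.seed(1)
draws <- rbeta(2000, 2, 8)  # right-skewed stand-in for stored draws of pi

eti(draws)
hdi(draws)
```

For a skewed distribution the two intervals differ: the highest density interval shifts toward the mode and is never wider than the equal-tailed interval at the same level.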


Plot time trend

Description

Plot time trend

Usage

plot_timetrend(
  x,
  show_topic = NULL,
  time_index_label = NULL,
  ci = 0.9,
  method = c("hdi", "eti"),
  point = c("mean", "median"),
  xlab = "Time",
  scales = "fixed",
  show_point = TRUE,
  ...
)

Arguments

x

the output from the dynamic keyATM model (see keyATM()).

show_topic

an integer or a vector. Indicate topics to visualize. Default is NULL.

time_index_label

a vector. The label for time index. The length should be equal to the number of documents (time index provided to keyATM()).

ci

value of the credible interval (between 0 and 1) to be estimated. Default is 0.9 (90%). This is an option when calculating credible intervals (you need to set store_theta = TRUE in keyATM()).

method

method for computing the credible interval. The Highest Density Interval (hdi, default) or Equal-tailed Interval (eti). This is an option when calculating credible intervals (you need to set store_theta = TRUE in keyATM()).

point

method for computing the point estimate. mean (default) or median. This is an option when calculating credible intervals (you need to set store_theta = TRUE in keyATM()).

xlab

a character. The label for the x-axis. Default is "Time".

scales

character. Control the scale of y-axis (the parameter in ggplot2::facet_wrap()): free adjusts y-axis for parameters. Default is fixed.

show_point

logical. The default is TRUE. This is an option when calculating credible intervals.

...

additional arguments not used.

Value

keyATM_fig object.

See Also

save_fig()
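
The following is a minimal sketch of fitting the dynamic model and plotting the time trend with the bundled bills data. The model_settings names time_index and num_states are assumptions taken from the package's online documentation, not from this manual; the number of states, the seed, and the reduced iteration count are illustrative only.

```r
library(keyATM)

data(keyATM_data_bills)
keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm)

out <- keyATM(
  docs = keyATM_docs,
  model = "dynamic",
  no_keyword_topics = 3,
  keywords = keyATM_data_bills$keywords,
  model_settings = list(
    time_index = keyATM_data_bills$time_index,  # session number of each bill
    num_states = 5                              # assumed value, for illustration
  ),
  options = list(
    seed = 250,
    iterations = 200,    # short run, for illustration
    store_theta = TRUE,  # needed for credible intervals in the plot
    thinning = 5
  )
)

fig <- plot_timetrend(out, xlab = "Session")  # keyATM_fig object; print to view
```

Setting store_theta = TRUE follows the note in the ci, method, and point arguments above, which apply only when credible intervals are calculated.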


Show the expected proportion of the corpus belonging to each topic

Description

Show the expected proportion of the corpus belonging to each topic

Usage

plot_topicprop(
  x,
  n = 3,
  show_topic = NULL,
  show_topwords = TRUE,
  label_topic = NULL,
  order = c("proportion", "topicid"),
  xmax = NULL
)

Arguments

x

the output from a keyATM model (see keyATM()).

n

The number of top words to show. Default is 3.

show_topic

an integer or a vector. Indicate topics to visualize. Default is NULL.

show_topwords

logical. Show topwords. The default is TRUE.

label_topic

a character vector. The name of the topics in the plot.

order

The order of topics: proportion (default) or topicid.

xmax

a numeric. Indicate the maximum value on the x-axis.

Value

keyATM_fig object

See Also

save_fig()


Plot document-topic distribution by strata (for covariate models)

Description

Plot document-topic distribution by strata (for covariate models)

Usage

## S3 method for class 'strata_doctopic'
plot(
  x,
  show_topic = NULL,
  var_name = NULL,
  by = c("topic", "covariate"),
  ci = 0.9,
  method = c("hdi", "eti"),
  point = c("mean", "median"),
  width = 0.1,
  show_point = TRUE,
  ...
)

Arguments

x

a strata_doctopic object (see by_strata_DocTopic()).

show_topic

a vector or an integer. Indicate topics to visualize.

var_name

the name of the variable in the plot.

by

topic or covariate. Default is by topic.

ci

value of the credible interval (between 0 and 1) to be estimated. Default is 0.9 (90%).

method

method for computing the credible interval. The Highest Density Interval (hdi, default) or Equal-tailed Interval (eti).

point

method for computing the point estimate. mean (default) or median.

width

numeric. Width of the error bars.

show_point

logical. Show point estimates. The default is TRUE.

...

additional arguments not used.

Value

keyATM_fig object.

See Also

save_fig(), by_strata_DocTopic()
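
The following is a minimal sketch of the covariate workflow: fit the covariate model, estimate document-topic distributions by strata, and plot them. The model_settings names covariates_data and covariates_formula are assumptions taken from the package's online documentation, not from this manual; the column name Republican, the labels, and the reduced iteration count are hypothetical choices for illustration.

```r
library(keyATM)

data(keyATM_data_bills)
keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm)

# Hypothetical column name for the party indicator in the bills data.
covariates <- data.frame(Republican = keyATM_data_bills$cov)

out <- keyATM(
  docs = keyATM_docs,
  model = "covariates",
  no_keyword_topics = 3,
  keywords = keyATM_data_bills$keywords,
  model_settings = list(
    covariates_data = covariates,
    covariates_formula = ~ Republican
  ),
  options = list(seed = 250, iterations = 200)  # short run, for illustration
)

covariates_info(out)  # show information about the covariates used in fitting

strata_topic <- by_strata_DocTopic(
  out,
  by_var = "Republican",
  labels = c("Non-Republican", "Republican")  # ascending order of by_var values
)

fig <- plot(strata_topic, var_name = "Party", by = "topic")
```

The resulting keyATM_fig object can be printed to view the plot or passed to save_fig().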


Predict topic proportions for the covariate keyATM

Description

Predict topic proportions for the covariate keyATM

Usage

## S3 method for class 'keyATM_output'
predict(
  object,
  newdata,
  transform = FALSE,
  burn_in = NULL,
  parallel = TRUE,
  posterior_mean = TRUE,
  ci = 0.9,
  method = c("hdi", "eti"),
  point = c("mean", "median"),
  label = NULL,
  raw_values = FALSE,
  ...
)

Arguments

object

the keyATM_output object for the covariate model.

newdata

New observations which should be predicted.

transform

Transform and standardize the newdata with the same formula and option as model_settings used in keyATM().

burn_in

integer. Burn-in period. If not specified, it is set to half of the samples. Default is NULL.

parallel

logical. If TRUE, computation is parallelized to speed it up. Default is TRUE. Call plan() (from the future package) before using this function.

posterior_mean

logical. If TRUE, the quantity of interest to estimate is the posterior mean. Default is TRUE.

ci

value of the credible interval (between 0 and 1) to be estimated. Default is 0.9 (90%).

method

method for computing the credible interval. The Highest Density Interval (hdi, default) or Equal-tailed Interval (eti).

point

method for computing the point estimate. mean (default) or median.

label

a character. Add a label column to the output. The default is NULL (do not add it).

raw_values

a logical. Returns raw values. The default is FALSE.

...

additional arguments not used.


Convert a quanteda dictionary to keywords

Description

This function converts or reads a dictionary object from quanteda to a named list. "Glob"-style wildcard expressions (e.g. politic*) are resolved based on the available terms in your texts.

Usage

read_keywords(file = NULL, docs = NULL, dictionary = NULL, split = TRUE, ...)

Arguments

file

file identifier for a foreign dictionary, e.g. path to a dictionary in YAML or LIWC format

docs

texts read via keyATM_read()

dictionary

a quanteda dictionary object, ignore if file is not NULL

split

logical. If TRUE, multi-word terms are separated, e.g. "air force" is split into "air" and "force".

...

additional parameters for quanteda::dictionary()

Value

a named list which can be used as keywords for e.g. keyATM()

See Also

dictionary

Examples

## Not run: 
  library(keyATM)
  library(quanteda)
  ## using the moral foundation dictionary example from quanteda
  dictfile <- tempfile()
  download.file("http://bit.ly/37cV95h", dictfile)
  data(keyATM_data_bills)
  bills_dfm <- keyATM_data_bills$doc_dfm
  keyATM_docs <- keyATM_read(bills_dfm)
  read_keywords(file = dictfile, docs = keyATM_docs, format = "LIWC")

## End(Not run)
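
A dictionary can also be supplied as a quanteda object instead of a file. The following is a minimal sketch; the dictionary entries are hypothetical and chosen only to show how glob patterns are resolved against the vocabulary of the bundled bills data.

```r
library(keyATM)
library(quanteda)

data(keyATM_data_bills)
keyATM_docs <- keyATM_read(keyATM_data_bills$doc_dfm)

# Hypothetical dictionary with glob-style wildcards.
dict <- dictionary(list(
  Education = c("educat*", "student*"),
  Health = c("health*", "medic*")
))

# Wildcards are expanded to the matching terms found in the documents.
keywords <- read_keywords(docs = keyATM_docs, dictionary = dict)
keywords
```

The returned named list has one element per dictionary key and can be passed as the keywords argument of keyATM().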

Save a figure

Description

Save a figure

Usage

save_fig(x, filename, ...)

Arguments

x

the keyATM_fig object.

filename

file name to create on disk.

...

other arguments passed on to the ggplot2::ggsave() function.

See Also

visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic(), values_fig()


Semantic Coherence: Mimno et al. (2011)

Description

Mimno, David et al. 2011. “Optimizing Semantic Coherence in Topic Models.” In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK.: Association for Computational Linguistics, 262–72. https://aclanthology.org/D11-1024.

Usage

semantic_coherence(x, docs, n = 10)

Arguments

x

the output from a keyATM model (see keyATM()).

docs

texts read via keyATM_read().

n

integer. The number of top terms per topic to use in the calculation. Default is 10.

Details

Equation 1 of Mimno et al. 2011 adapted to keyATM.

Value

A vector of topic coherence metrics, one for each topic.
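
The metric referenced in Details sums, over ordered pairs of a topic's top words, the log of the co-document frequency plus one divided by the document frequency of the higher-ranked word. The following base-R sketch computes the original Equation 1 of Mimno et al. (2011) on a toy matrix; the function mimno_coherence() and the toy data are hypothetical, and the sketch does not reproduce how keyATM adapts the equation.

```r
# Toy document-term count matrix: rows are documents, columns are words.
dtm <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("tax", "budget", "health"))
)

# Coherence of one topic, given its top words ordered from most to least probable.
mimno_coherence <- function(dtm, top_words) {
  stopifnot(length(top_words) >= 2)
  present <- dtm[, top_words, drop = FALSE] > 0  # does the word occur in the document?
  score <- 0
  for (m in 2:length(top_words)) {
    for (l in 1:(m - 1)) {
      co_doc_freq <- sum(present[, m] & present[, l])  # documents containing both words
      doc_freq <- sum(present[, l])                    # documents containing the higher-ranked word
      score <- score + log((co_doc_freq + 1) / doc_freq)
    }
  }
  score
}

coherence <- mimno_coherence(dtm, c("tax", "budget", "health"))
coherence  # equals log(2/3) for this toy matrix
```

Values closer to zero indicate that the top words tend to appear in the same documents.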


Show the top documents for each topic

Description

Show the top documents for each topic

Usage

top_docs(x, n = 10)

Arguments

x

the output from a keyATM model (see keyATM()).

n

How many documents to show. Default is 10.

Value

An n x k table of the top n documents for each topic; each number is a document index.


Show the top topics for each document

Description

Show the top topics for each document

Usage

top_topics(x, n = 2)

Arguments

x

the output from a keyATM model (see keyATM()).

n

integer. The number of topics to show. Default is 2.

Value

An n x k table of the top n topics in each document.


Show the top words for each topic

Description

If show_keyword is TRUE then words in their keyword topics are suffixed with a check mark. Words from another keyword topic are labeled with the name of that category.

Usage

top_words(x, n = 10, measure = c("probability", "lift"), show_keyword = TRUE)

Arguments

x

the output (see keyATM() and by_strata_TopicWord()).

n

integer. The number of terms to visualize. Default is 10.

measure

character. The way to sort the terms: probability (default) or lift.

show_keyword

logical. If TRUE, mark keywords. Default is TRUE.

Value

An n x k table of the top n words in each topic


Get values used to create a figure

Description

Get values used to create a figure

Usage

values_fig(x)

Arguments

x

the keyATM_fig object.

See Also

save_fig(), visualize_keywords(), plot_alpha(), plot_modelfit(), plot_pi(), plot_timetrend(), plot_topicprop(), by_strata_DocTopic()


Visualize keywords

Description

Visualize the proportion of keywords in the documents.

Usage

visualize_keywords(docs, keywords, prune = TRUE, label_size = 3.2)

Arguments

docs

a keyATM_docs object, generated by keyATM_read() function

keywords

a list of keywords

prune

logical. If TRUE, prune keywords that do not appear in docs. Default is TRUE.

label_size

the size of keyword labels in the output plot. Default is 3.2.

Value

keyATM_fig object

See Also

save_fig()

Examples

## Not run: 
 # Prepare a keyATM_docs object
 keyATM_docs <- keyATM_read(input)

 # Keywords are in a list
 keywords <- list(Education = c("education", "child", "student"),
                  Health    = c("public", "health", "program"))

 # Visualize keywords
 keyATM_viz <- visualize_keywords(keyATM_docs, keywords)

 # View a figure
 keyATM_viz

 # Save a figure
 save_fig(keyATM_viz, filename)

## End(Not run)

Weighted LDA main function

Description

Fit weighted LDA models.

Usage

weightedLDA(
  docs,
  model,
  number_of_topics,
  model_settings = list(),
  priors = list(),
  options = list(),
  keep = c()
)

Arguments

docs

texts read via keyATM_read().

model

Weighted LDA model: base, covariates, or dynamic.

number_of_topics

the number of regular topics.

model_settings

a list of model specific settings (details are in the online documentation).

priors

a list of priors of parameters.

options

a list of options (details are in the documentation of keyATM()).

keep

a vector of the names of elements you want to keep in output.

Value

A keyATM_output object containing:

V

number of terms (number of unique words)

N

number of documents

model

the name of the model

theta

topic proportions for each document (document-topic distribution)

phi

topic specific word generation probabilities (topic-word distribution)

topic_counts

number of tokens assigned to each topic

word_counts

number of times each word type appears

doc_lens

length of each document in tokens

vocab

words in the vocabulary (a vector of unique words)

priors

priors

options

options

keywords_raw

NULL for LDA models

model_fit

perplexity and log-likelihood

pi

estimated pi for the last iteration (NULL for LDA models)

values_iter

values stored during iterations

number_of_topics

number of topics

kept_values

outputs you specified to store in the keep option

information

information about the fitting

See Also

https://keyatm.github.io/keyATM/articles/pkgdown_files/Options.html

Examples

## Not run: 
  library(keyATM)
  library(quanteda)
  data(keyATM_data_bills)
  bills_dfm <- keyATM_data_bills$doc_dfm  # quanteda dfm object
  keyATM_docs <- keyATM_read(bills_dfm)

  # Weighted LDA
  out <- weightedLDA(docs = keyATM_docs, model = "base",
                     number_of_topics = 5)

  # Visit our website for full examples: https://keyatm.github.io/keyATM/

## End(Not run)