| Title: | Cell Type Annotation Using Large Language Models |
|---|---|
| Description: | Automated cell type annotation for single-cell RNA sequencing data using consensus predictions from multiple large language models. Integrates with Seurat objects and provides uncertainty quantification for annotations. Supports various LLM providers including OpenAI, Anthropic, and Google. For details see Yang et al. (2025) <doi:10.1101/2025.04.10.647852>. |
| Authors: | Chen Yang [aut, cre, cph] |
| Maintainer: | Chen Yang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.0.5 |
| Built: | 2026-06-06 21:06:42 UTC |
| Source: | https://github.com/cafferychen777/mllmcelltype |
A comprehensive function for automated cell type annotation using multiple Large Language Models (LLMs). This function supports both Seurat's differential gene expression results and custom gene lists as input. It implements a sophisticated annotation pipeline that leverages state-of-the-art LLMs to identify cell types based on marker gene expression patterns.
A data frame from Seurat's FindAllMarkers() function containing differential gene expression results (must have columns: 'cluster', 'gene', and 'avg_log2FC'). The function will select the top genes based on avg_log2FC for each cluster.
A list where each element has a 'genes' field containing marker genes for a cluster. This can be in one of these formats:
Named with cluster IDs: list("0" = list(genes = c(...)), "1" = list(genes = c(...)))
Named with cell type names: list(t_cells = list(genes = c(...)), b_cells = list(genes = c(...)))
Unnamed list: list(list(genes = c(...)), list(genes = c(...)))
Cluster IDs are preserved as-is. The function does not modify or re-index cluster IDs. 'mouse brain'). This helps provide context for more accurate annotations.
OpenAI: 'gpt-5.5', 'gpt-5.4', 'gpt-5.4-mini'
Anthropic: 'claude-opus-4-7', 'claude-opus-4-6', 'claude-sonnet-4-6', 'claude-haiku-4-5-20251001'
DeepSeek: 'deepseek-v4-flash', 'deepseek-v4-pro'
Google: 'gemini-3.1-pro-preview', 'gemini-3-flash-preview', 'gemini-3.1-flash-lite'
Alibaba: 'qwen3.6-max-preview', 'qwen3.6-plus', 'qwen3.6-flash'
Stepfun: 'step-3.5-flash', 'step-3.5-flash-2603', 'step-3'
Zhipu/Z.AI: 'glm-5.1', 'glm-5-turbo', 'glm-5'
MiniMax: 'MiniMax-M2.7', 'MiniMax-M2.7-highspeed', 'MiniMax-M2.5'
X.AI: 'grok-4.3', 'grok-4.3-latest', 'grok-latest'
OpenRouter: Provides access to models from multiple providers through a single API. Format: 'provider/model-name'
OpenAI models: 'openai/gpt-5.5', 'openai/gpt-5.4-mini'
Anthropic models: 'anthropic/claude-opus-4.7', 'anthropic/claude-sonnet-4.6'
Google models: 'google/gemini-3.1-pro-preview', 'google/gemini-3-flash-preview'
X.AI models: 'x-ai/grok-4.3'
Stepfun models: 'stepfun/step-3.5-flash' Each provider requires a specific API key format and authentication method:
OpenAI: "sk-..." (obtain from OpenAI platform)
Anthropic: "sk-ant-..." (obtain from Anthropic console)
Google: A Google API key for Gemini models (obtain from Google AI)
DeepSeek: API key from DeepSeek platform
Qwen: API key from Alibaba Cloud
Stepfun: API key from Stepfun AI
Zhipu: API key from Zhipu AI
MiniMax: API key from MiniMax
X.AI: API key for Grok models
OpenRouter: "sk-or-..." (obtain from OpenRouter) OpenRouter provides access to multiple models through a single API key
The API key can be provided directly or stored in environment variables:
# Direct API key
result <- annotate_cell_types(input, tissue_name, model="gpt-5.5",
api_key="sk-...")
# Using environment variables
Sys.setenv(OPENAI_API_KEY="sk-...")
Sys.setenv(ANTHROPIC_API_KEY="sk-ant-...")
Sys.setenv(OPENROUTER_API_KEY="sk-or-...")
# Then use with environment variables
result <- annotate_cell_types(input, tissue_name, model="claude-sonnet-4-6",
api_key=Sys.getenv("ANTHROPIC_API_KEY"))
If NA, returns the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API. when input is from Seurat's FindAllMarkers(). Default: 10
A single character string: Applied to all providers (e.g., "https://api.proxy.com/v1")
A named list: Provider-specific URLs (e.g., list(openai = "https://openai-proxy.com/v1", anthropic = "https://anthropic-proxy.com/v1")). This is useful for:
Users accessing international APIs through proxies
Enterprise users with internal API gateways
Development/testing with local or alternative endpoints If NULL (default), uses official API endpoints for each provider.
annotate_cell_types( input, tissue_name, model = "gpt-5.5", api_key = NA, top_gene_count = 10, debug = FALSE, base_urls = NULL )annotate_cell_types( input, tissue_name, model = "gpt-5.5", api_key = NA, top_gene_count = 10, debug = FALSE, base_urls = NULL )
input |
Either a data frame from Seurat's FindAllMarkers() containing columns 'cluster', 'gene', and 'avg_log2FC', or a list with 'genes' field for each cluster |
tissue_name |
Required tissue context (e.g., 'human PBMC', 'mouse brain') for more accurate annotations |
model |
Model name to use. Default: 'gpt-5.5'. See details for supported models |
api_key |
API key for the selected model provider as a non-empty character scalar.
If |
top_gene_count |
Number of top genes to use per cluster when input is from Seurat. Default: 10 |
debug |
Logical indicating whether to enable debug output. Default: FALSE |
base_urls |
Optional base URLs for API endpoints. Can be a string or named list for custom endpoints |
When api_key is provided, the provider response split by newline as
a character vector. When api_key is NA, the generated prompt string.
# Example 1: Using custom gene lists, returning prompt only (no API call) annotate_cell_types( input = list( t_cells = list(genes = c('CD3D', 'CD3E', 'CD3G', 'CD28')), b_cells = list(genes = c('CD19', 'CD79A', 'CD79B', 'MS4A1')), monocytes = list(genes = c('CD14', 'CD68', 'CSF1R', 'FCGR3A')) ), tissue_name = 'human PBMC', model = 'gpt-5.5', api_key = NA # Returns prompt only without making API call ) # Example 2: Using with Seurat pipeline and OpenAI model ## Not run: library(Seurat) # Load example data data("pbmc_small") # Find marker genes all.markers <- FindAllMarkers( object = pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25 ) # Set API key in environment variable (recommended approach) Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # Get cell type annotations using OpenAI model openai_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'gpt-5.5', api_key = Sys.getenv("OPENAI_API_KEY"), top_gene_count = 15 ) # Example 3: Using Anthropic Claude model Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") claude_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'claude-opus-4-7', api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 15 ) # Example 4: Using OpenRouter to access multiple models Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Access OpenAI models through OpenRouter openrouter_gpt4_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'openai/gpt-5.5', # Note the provider/model format api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 15 ) # Access Anthropic models through OpenRouter openrouter_claude_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'anthropic/claude-opus-4.6', # Note the provider/model format api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 15 ) # Example 5: Using with mouse brain data mouse_annotations <- annotate_cell_types( input = mouse_markers, # Your mouse marker genes tissue_name = 'mouse brain', # Specify correct tissue for context model = 'gpt-5.5', api_key = Sys.getenv("OPENAI_API_KEY"), top_gene_count = 20, # Use more genes for complex tissues debug = TRUE # Enable debug output ) ## End(Not run)# Example 1: Using custom gene lists, returning prompt only (no API call) annotate_cell_types( input = list( t_cells = list(genes = c('CD3D', 'CD3E', 'CD3G', 'CD28')), b_cells = list(genes = c('CD19', 'CD79A', 'CD79B', 'MS4A1')), monocytes = list(genes = c('CD14', 'CD68', 'CSF1R', 'FCGR3A')) ), tissue_name = 'human PBMC', model = 'gpt-5.5', api_key = NA # Returns prompt only without making API call ) # Example 2: Using with Seurat pipeline and OpenAI model ## Not run: library(Seurat) # Load example data data("pbmc_small") # Find marker genes all.markers <- FindAllMarkers( object = pbmc_small, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25 ) # Set API key in environment variable (recommended approach) Sys.setenv(OPENAI_API_KEY = "your-openai-api-key") # Get cell type annotations using OpenAI model openai_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'gpt-5.5', api_key = Sys.getenv("OPENAI_API_KEY"), top_gene_count = 15 ) # Example 3: Using Anthropic Claude model Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key") claude_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'claude-opus-4-7', api_key = Sys.getenv("ANTHROPIC_API_KEY"), top_gene_count = 15 ) # Example 4: Using OpenRouter to access multiple models Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Access OpenAI models through OpenRouter openrouter_gpt4_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'openai/gpt-5.5', # Note the provider/model format api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 15 ) # Access Anthropic models through OpenRouter openrouter_claude_annotations <- annotate_cell_types( input = all.markers, tissue_name = 'human PBMC', model = 'anthropic/claude-opus-4.6', # Note the provider/model format api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 15 ) # Example 5: Using with mouse brain data mouse_annotations <- annotate_cell_types( input = mouse_markers, # Your mouse marker genes tissue_name = 'mouse brain', # Specify correct tissue for context model = 'gpt-5.5', api_key = Sys.getenv("OPENAI_API_KEY"), top_gene_count = 20, # Use more genes for complex tissues debug = TRUE # Enable debug output ) ## End(Not run)
Anthropic API Processor
Anthropic API Processor
Concrete implementation of BaseAPIProcessor for Anthropic models. Handles Anthropic-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> AnthropicProcessor
new()
Initialize Anthropic processor
AnthropicProcessor$new(base_url = NULL)
get_default_api_url()
Get default Anthropic API URL
AnthropicProcessor$get_default_api_url()
make_api_call()
Make API call to Anthropic
AnthropicProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Anthropic API response
AnthropicProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
AnthropicProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Base API Processor Class
Base API Processor Class
Abstract base class for API processors that provides common functionality including unified logging, error handling, input processing, and response validation. This eliminates code duplication across all provider-specific processors.
provider_nameName of the API provider
loggerUnified logger instance
base_urlCustom base URL for API endpoints
new()
Initialize the base API processor
BaseAPIProcessor$new(provider_name, base_url = NULL)
process_request()
Main entry point for processing API requests
BaseAPIProcessor$process_request(prompt, model, api_key)
get_api_url()
Get the API URL to use for requests
BaseAPIProcessor$get_api_url()
get_default_api_url()
Abstract method to be implemented by subclasses for getting default API URL
BaseAPIProcessor$get_default_api_url()
make_api_call()
Abstract method to be implemented by subclasses for making the actual API call
BaseAPIProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Abstract method to be implemented by subclasses for extracting content from response Make API call and extract response content
BaseAPIProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
BaseAPIProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Manages caching of consensus analysis results
cache_dirDirectory to store cache files. Options:
NULL (default): Uses system cache directory
"local": Uses .mllmcelltype_cache in current directory
"temp": Uses temporary directory
Custom path: Any other string is used as directory path
cache_versionCurrent cache version
new()
Initialize cache manager
NULL (default): Uses system cache directory via tools::R_user_dir()
"local": Uses .mllmcelltype_cache in current directory
"temp": Uses temporary directory (cleared on R restart)
Custom path: Any other string is used as directory path
CacheManager$new(cache_dir = NULL)
get_cache_dir()
Get actual cache directory path
CacheManager$get_cache_dir()
generate_key()
Generate cache key from input parameters (improved version)
CacheManager$generate_key( input, models, cluster_id, tissue_name = "", top_gene_count = 10 )
save_to_cache()
Save results to cache
CacheManager$save_to_cache(key, data)
load_from_cache()
Load results from cache
CacheManager$load_from_cache(key)
has_cache()
Check if results exist in cache
CacheManager$has_cache(key)
get_cache_stats()
Get cache statistics
CacheManager$get_cache_stats()
clear_cache()
Clear all cache
CacheManager$clear_cache(confirm = FALSE)
validate_cache()
Validate cache content Extract genes from input in a standardized way Create stable hash from genes list Create stable hash from models list Create stable hash from tissue_name and top_gene_count Create stable hash from cluster ID
CacheManager$validate_cache(key)
clone()
The objects of this class are cloneable with this method.
CacheManager$clone(deep = FALSE)
deepWhether to make a deep clone.
This function runs the same input through multiple models and compares their predictions. It provides both individual predictions and a consensus analysis.
compare_model_predictions( input, tissue_name, models = c("claude-opus-4-7", "gpt-5.5", "gemini-3.1-pro-preview", "deepseek-v4-flash", "qwen3.6-plus", "grok-4.3"), api_keys, top_gene_count = 10, consensus_threshold = 0.5, base_urls = NULL )compare_model_predictions( input, tissue_name, models = c("claude-opus-4-7", "gpt-5.5", "gemini-3.1-pro-preview", "deepseek-v4-flash", "qwen3.6-plus", "grok-4.3"), api_keys, top_gene_count = 10, consensus_threshold = 0.5, base_urls = NULL )
input |
Either a data frame from Seurat's FindAllMarkers() containing columns 'cluster', 'gene', and 'avg_log2FC', or a list with 'genes' field for each cluster |
tissue_name |
Tissue context (e.g., 'human PBMC', 'mouse brain') for more accurate annotations |
models |
Vector of model names to use for comparison. Default includes top models from each provider |
api_keys |
Named list of API keys for the models, with provider or model names as keys.
Every model in |
top_gene_count |
Number of top genes to use per cluster when input is from Seurat. Default: 10 |
consensus_threshold |
Minimum agreement threshold for consensus (0-1). Default: 0.5. Consensus is only evaluated when at least two non-missing model predictions are available for a cluster. |
base_urls |
Optional base URLs for API endpoints. Can be a string or named list for provider-specific custom endpoints. |
List containing individual model predictions and consensus analysis
If a cluster has fewer than two valid predictions after alignment/padding,
its consensus-related outputs are NA.
This function uses create_standardization_prompt from prompt_templates.R Supported models:
OpenAI: 'gpt-5.5', 'gpt-5.4', 'gpt-5.4-mini'
Anthropic: 'claude-opus-4-7', 'claude-opus-4-6', 'claude-sonnet-4-6', 'claude-haiku-4-5-20251001'
DeepSeek: 'deepseek-v4-flash', 'deepseek-v4-pro'
Google: 'gemini-3.1-pro-preview', 'gemini-3-flash-preview', 'gemini-3.1-flash-lite'
Alibaba: 'qwen3.6-max-preview', 'qwen3.6-plus', 'qwen3.6-flash'
Stepfun: 'step-3.5-flash', 'step-3.5-flash-2603', 'step-3'
Zhipu/Z.AI: 'glm-5.1', 'glm-5-turbo', 'glm-5'
MiniMax: 'MiniMax-M2.7', 'MiniMax-M2.7-highspeed', 'MiniMax-M2.5'
X.AI: 'grok-4.3', 'grok-4.3-latest', 'grok-latest'
OpenRouter: Provides access to models from multiple providers through a single API. Format: 'provider/model-name'
OpenAI models: 'openai/gpt-5.5', 'openai/gpt-5.4-mini'
Anthropic models: 'anthropic/claude-opus-4.7', 'anthropic/claude-sonnet-4.6'
Google models: 'google/gemini-3.1-pro-preview', 'google/gemini-3-flash-preview'
X.AI models: 'x-ai/grok-4.3'
Stepfun models: 'stepfun/step-3.5-flash'
With provider names as keys: list("openai" = "sk-...", "anthropic" = "sk-ant-...", "openrouter" = "sk-or-...")
With model names as keys: list("gpt-5.5" = "sk-...", "claude-sonnet-4-6" = "sk-ant-...")
The system first tries to find the API key using the provider name. If not found, it then tries using the model name. Example:
api_keys <- list(
"openai" = Sys.getenv("OPENAI_API_KEY"),
"anthropic" = Sys.getenv("ANTHROPIC_API_KEY"),
"openrouter" = Sys.getenv("OPENROUTER_API_KEY"),
"claude-opus-4-7" = "your-claude-opus-key"
)
## Not run: # Compare predictions using different models api_keys <- list( "claude-sonnet-4-6" = "your-anthropic-key", "deepseek-v4-pro" = "your-deepseek-key", "gemini-3.1-pro-preview" = "your-gemini-key", "qwen3.6-plus" = "your-qwen-key" ) results <- compare_model_predictions( input = list(gs1=c('CD4','CD3D'), gs2='CD14'), tissue_name = 'PBMC', api_keys = api_keys ) ## End(Not run)## Not run: # Compare predictions using different models api_keys <- list( "claude-sonnet-4-6" = "your-anthropic-key", "deepseek-v4-pro" = "your-deepseek-key", "gemini-3.1-pro-preview" = "your-gemini-key", "qwen3.6-plus" = "your-qwen-key" ) results <- compare_model_predictions( input = list(gs1=c('CD4','CD3D'), gs2='CD14'), tissue_name = 'PBMC', api_keys = api_keys ) ## End(Not run)
Set global logger configuration
configure_logger(level = "INFO", console_output = FALSE, json_format = TRUE)configure_logger(level = "INFO", console_output = FALSE, json_format = TRUE)
level |
Logging level: "DEBUG", "INFO", "WARN", or "ERROR". Default: "INFO" |
console_output |
Whether to enable console output. Default: FALSE |
json_format |
Whether to use JSON format for log messages. Default: TRUE |
Invisible logger object
Create prompt for cell type annotation
create_annotation_prompt(input, tissue_name, top_gene_count = 10)create_annotation_prompt(input, tissue_name, top_gene_count = 10)
input |
Either a data frame from Seurat's FindAllMarkers() or a list for each cluster
where each element is either a character vector of genes or a list containing a |
tissue_name |
Tissue context for the annotation (e.g., 'human PBMC', 'mouse brain') |
top_gene_count |
Number of top genes to use per cluster when input is from Seurat. Default: 10 |
A list with prompt (formatted prompt text), expected_count
(number of clusters), and gene_lists (cluster ID to marker genes mapping).
DeepSeek API Processor
DeepSeek API Processor
Concrete implementation of BaseAPIProcessor for DeepSeek models. Handles DeepSeek-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> DeepSeekProcessor
new()
Initialize DeepSeek processor
DeepSeekProcessor$new(base_url = NULL)
get_default_api_url()
Get default DeepSeek API URL
DeepSeekProcessor$get_default_api_url()
make_api_call()
Make API call to DeepSeek
DeepSeekProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from DeepSeek API response
DeepSeekProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
DeepSeekProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Gemini API Processor
Gemini API Processor
Concrete implementation of BaseAPIProcessor for Gemini models. Handles Gemini-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> GeminiProcessor
new()
Initialize Gemini processor
GeminiProcessor$new(base_url = NULL)
get_default_api_url()
Get default Gemini API URL template
GeminiProcessor$get_default_api_url()
get_api_url_for_model()
Get API URL for specific model
GeminiProcessor$get_api_url_for_model(model)
make_api_call()
Make API call to Gemini
GeminiProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Gemini API response
GeminiProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
GeminiProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
This file contains utility functions for managing API keys and related operations. Get API key for a specific model
get_api_key(model, api_keys)get_api_key(model, api_keys)
model |
Model name to get API key for |
api_keys |
Named list of API keys with provider or model names as keys |
This function retrieves the appropriate API key for a given model by first checking the provider name and then the model name in the provided API keys list.
API key string for the specified model
Get the global logger instance
get_logger()get_logger()
This function determines the appropriate provider (e.g., OpenAI, Anthropic, Google, OpenRouter) based on the model name. Uses prefix-based matching for efficient and maintainable provider detection. New models following existing naming conventions are automatically supported.
get_provider(model)get_provider(model)
model |
Character string specifying the model name (e.g., "gpt-5.5", "claude-opus-4-7"). |
Supported providers and model prefixes:
OpenAI: gpt-, o1, o3*, o4*, chatgpt-, codex- (e.g., 'gpt-5.5', 'gpt-5.4-mini')
Anthropic: claude-* (e.g., 'claude-opus-4-7', 'claude-sonnet-4-6')
DeepSeek: deepseek-* (e.g., 'deepseek-v4-flash', 'deepseek-v4-pro')
Google: gemini-* (e.g., 'gemini-3.1-pro-preview', 'gemini-3-flash-preview')
Qwen: qwen*, qwq-* (e.g., 'qwen3.6-plus', 'qwen3.6-flash')
Stepfun: step-* (e.g., 'step-3.5-flash', 'step-3')
Zhipu: glm-, chatglm (e.g., 'glm-5.1', 'glm-5-turbo')
MiniMax: minimax-* (e.g., 'MiniMax-M2.7', 'MiniMax-M2.5')
Grok: grok-* (e.g., 'grok-4.3', 'grok-4.3-latest')
OpenRouter: Any model with '/' in the name (e.g., 'openai/gpt-5.5', 'anthropic/claude-opus-4.7')
Character string of the provider name (e.g., "openai", "anthropic").
Grok API Processor
Grok API Processor
Concrete implementation of BaseAPIProcessor for Grok models. Handles Grok-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> GrokProcessor
new()
Initialize Grok processor
GrokProcessor$new(base_url = NULL)
get_default_api_url()
Get default Grok API URL
GrokProcessor$get_default_api_url()
make_api_call()
Make API call to Grok
GrokProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Grok API response
GrokProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
GrokProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
This function implements an interactive voting and discussion mechanism where multiple LLMs collaborate to reach a consensus on cell type annotations, particularly focusing on clusters with low agreement. The process includes:
Initial voting by all LLMs
Identification of controversial clusters
Detailed discussion for controversial clusters
Final summary by a designated LLM (default: Claude)
interactive_consensus_annotation( input, tissue_name, models = c("claude-opus-4-7", "gpt-5.5", "gemini-3.1-pro-preview", "deepseek-v4-flash", "grok-4.3"), api_keys, top_gene_count = 10, controversy_threshold = 0.7, entropy_threshold = 1, max_discussion_rounds = 3, consensus_check_model = NULL, log_dir = "logs", cache_dir = NULL, use_cache = TRUE, base_urls = NULL, clusters_to_analyze = NULL, force_rerun = FALSE )interactive_consensus_annotation( input, tissue_name, models = c("claude-opus-4-7", "gpt-5.5", "gemini-3.1-pro-preview", "deepseek-v4-flash", "grok-4.3"), api_keys, top_gene_count = 10, controversy_threshold = 0.7, entropy_threshold = 1, max_discussion_rounds = 3, consensus_check_model = NULL, log_dir = "logs", cache_dir = NULL, use_cache = TRUE, base_urls = NULL, clusters_to_analyze = NULL, force_rerun = FALSE )
input |
Either a data frame from Seurat's FindAllMarkers() function containing
differential gene expression results (must have columns: 'cluster', 'gene',
and 'avg_log2FC'), or a list where each element is either a character vector
of genes or a list containing a |
tissue_name |
Character string specifying the tissue type for context-aware cell type annotation (e.g., 'human PBMC', 'mouse brain'). Required. |
models |
Character vector of model names to use for consensus annotation. Minimum 2 models required. Supports models from OpenAI, Anthropic, DeepSeek, Google, Alibaba, Stepfun, Zhipu, MiniMax, X.AI, and OpenRouter. |
api_keys |
Named, non-empty list of API keys. Can use provider names as keys (e.g., "openai", "anthropic") or model names as keys (e.g., "gpt-5"). |
top_gene_count |
Integer specifying the number of top marker genes to use for annotation per cluster (default: 10). |
controversy_threshold |
Numeric value between 0 and 1 for consensus proportion threshold. Clusters below this threshold are considered controversial (default: 0.7). |
entropy_threshold |
Numeric value for entropy threshold. Higher entropy indicates more disagreement among models (default: 1.0). |
max_discussion_rounds |
Integer specifying maximum number of discussion rounds for controversial clusters (default: 3). |
consensus_check_model |
Character string specifying which model to use for consensus checking. If NULL, uses the first model that succeeds during initial annotation. |
log_dir |
Character scalar specifying directory for log files (default: "logs"). This function reinitializes the session logger with this directory at the start of each call. |
cache_dir |
Character string or NULL. Cache directory for storing results. NULL uses system cache, "local" uses current directory, "temp" uses temporary directory, or specify custom path. |
use_cache |
Logical indicating whether to use caching (default: TRUE). |
base_urls |
Named list or character string specifying custom API base URLs. Useful for proxies or alternative endpoints. If NULL, uses official endpoints. |
clusters_to_analyze |
Character or numeric vector specifying which clusters to analyze. If NULL (default), all clusters are analyzed. |
force_rerun |
Logical indicating whether to force rerun of all specified clusters, ignoring cache. Only affects controversial cluster discussions (default: FALSE). |
A list containing:
initial_results: Initial voting results, consensus checks, and controversial cluster IDs
final_annotations: Final annotations keyed by cluster ID
controversial_clusters: Clusters identified as controversial
discussion_logs: Detailed discussion logs for controversial clusters
session_id: Logger session identifier
voting_results: Backward-compatible alias of initial_results
discussion_results: Backward-compatible alias of discussion_logs
final_consensus: Backward-compatible alias of final_annotations
Get list of registered custom models
list_custom_models()list_custom_models()
Get list of registered custom providers
list_custom_providers()list_custom_providers()
Convenience functions for logging
log_debug(message, context = NULL) log_info(message, context = NULL) log_warn(message, context = NULL) log_error(message, context = NULL)log_debug(message, context = NULL) log_info(message, context = NULL) log_warn(message, context = NULL) log_error(message, context = NULL)
message |
Log message string |
context |
Optional context information (list or character) |
Invisible NULL
Minimax API Processor
Minimax API Processor
Concrete implementation of BaseAPIProcessor for Minimax models. Handles Minimax-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> MinimaxProcessor
new()
Initialize Minimax processor
MinimaxProcessor$new(base_url = NULL)
get_default_api_url()
Get default MiniMax OpenAI-compatible chat completions API URL
MinimaxProcessor$get_default_api_url()
make_api_call()
Make API call to Minimax
MinimaxProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Minimax API response
MinimaxProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
MinimaxProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Display the cache directory location
mllmcelltype_cache_dir(cache_dir = NULL)mllmcelltype_cache_dir(cache_dir = NULL)
cache_dir |
Cache directory specification. NULL uses system default, "local" uses current dir, "temp" uses temp dir, or custom path |
Invisible cache directory path
## Not run: mllmcelltype_cache_dir() mllmcelltype_cache_dir("local") ## End(Not run)## Not run: mllmcelltype_cache_dir() mllmcelltype_cache_dir("local") ## End(Not run)
Clear the mLLMCelltype cache
mllmcelltype_clear_cache(cache_dir = NULL)mllmcelltype_clear_cache(cache_dir = NULL)
cache_dir |
Cache directory specification. NULL uses system default, "local" uses current dir, "temp" uses temp dir, or custom path |
Invisible NULL
## Not run: mllmcelltype_clear_cache() mllmcelltype_clear_cache("local") ## End(Not run)## Not run: mllmcelltype_clear_cache() mllmcelltype_clear_cache("local") ## End(Not run)
OpenAI API Processor
OpenAI API Processor
Concrete implementation of BaseAPIProcessor for OpenAI models. Handles OpenAI-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> OpenAIProcessor
new()
Initialize OpenAI processor
OpenAIProcessor$new(base_url = NULL)
get_default_api_url()
Get default OpenAI API URL
OpenAIProcessor$get_default_api_url()
make_api_call()
Make API call to OpenAI
OpenAIProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from OpenAI API response
OpenAIProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
OpenAIProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
OpenRouter API Processor
OpenRouter API Processor
Concrete implementation of BaseAPIProcessor for OpenRouter models. Handles OpenRouter-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> OpenRouterProcessor
new()
Initialize OpenRouter processor
OpenRouterProcessor$new(base_url = NULL)
get_default_api_url()
Get default OpenRouter API URL
OpenRouterProcessor$get_default_api_url()
make_api_call()
Make API call to OpenRouter
OpenRouterProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from OpenRouter API response
OpenRouterProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
OpenRouterProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Qwen API Processor
Qwen API Processor
Concrete implementation of BaseAPIProcessor for Qwen models. Handles Qwen-specific API calls, authentication, and response parsing.
Qwen has OpenAI-compatible chat completions endpoints:
International (US): https://dashscope-us.aliyuncs.com/compatible-mode/v1/chat/completions
Domestic (China): https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions
Legacy international: https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions The processor automatically tries international first, then domestic, then legacy international.
mLLMCelltype::BaseAPIProcessor -> QwenProcessor
new()
Test if an endpoint is accessible
Initialize Qwen processor
QwenProcessor$new(base_url = NULL)
get_default_api_url()
Get default Qwen OpenAI-compatible chat completions API URL
QwenProcessor$get_default_api_url()
get_working_api_url()
Get working Qwen API URL with automatic endpoint detection
QwenProcessor$get_working_api_url(api_key)
make_api_call()
Make API call to Qwen
QwenProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Qwen API response
QwenProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
QwenProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Register a custom model for a provider
register_custom_model(model_name, provider_name, model_config = list())register_custom_model(model_name, provider_name, model_config = list())
model_name |
Unique name for the custom model |
provider_name |
Name of the provider this model belongs to |
model_config |
List of configuration parameters for the model (e.g., temperature, max_tokens) |
Invisible TRUE on success
## Not run: register_custom_model( model_name = "my_model", provider_name = "my_provider", model_config = list( temperature = 0.7, max_tokens = 2000 ) ) ## End(Not run)## Not run: register_custom_model( model_name = "my_model", provider_name = "my_provider", model_config = list( temperature = 0.7, max_tokens = 2000 ) ) ## End(Not run)
Register a custom LLM provider
register_custom_provider(provider_name, process_fn, description = NULL)register_custom_provider(provider_name, process_fn, description = NULL)
provider_name |
Unique name for the custom provider |
process_fn |
Function that processes LLM requests. Must accept parameters: prompt, model, api_key; may optionally accept model_config |
description |
Optional description of the provider |
Invisible NULL
## Not run: register_custom_provider( provider_name = "my_provider", process_fn = function(prompt, model, api_key) { # Custom implementation response <- httr::POST( url = "your_api_endpoint", body = list(prompt = prompt), encode = "json" ) return(httr::content(response)$choices[[1]]$text) } ) ## End(Not run)## Not run: register_custom_provider( provider_name = "my_provider", process_fn = function(prompt, model, api_key) { # Custom implementation response <- httr::POST( url = "your_api_endpoint", body = list(prompt = prompt), encode = "json" ) return(httr::content(response)$choices[[1]]$text) } ) ## End(Not run)
StepFun API Processor
StepFun API Processor
Concrete implementation of BaseAPIProcessor for StepFun models. Handles StepFun-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> StepFunProcessor
new()
Initialize StepFun processor
StepFunProcessor$new(base_url = NULL)
get_default_api_url()
Get default StepFun API URL
StepFunProcessor$get_default_api_url()
make_api_call()
Make API call to StepFun
StepFunProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from StepFun API response
StepFunProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
StepFunProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.
Unified Logger for mLLMCelltype Package
Unified Logger for mLLMCelltype Package
This logger provides centralized, multi-level logging with structured output, log rotation, and performance monitoring capabilities.
log_dirDirectory for storing log files
log_levelCurrent logging level
session_idUnique identifier for the current session
max_log_sizeMaximum log file size in MB (default: 10MB)
max_log_filesMaximum number of log files to keep (default: 5)
enable_consoleWhether to output to console (default: FALSE)
enable_jsonWhether to use JSON format (default: TRUE)
performance_statsPerformance monitoring statistics
new()
Initialize the unified logger
UnifiedLogger$new( base_dir = "logs", level = "INFO", max_size = 10, max_files = 5, console_output = FALSE, json_format = TRUE )
debug()
Log a debug message
UnifiedLogger$debug(message, context = NULL)
info()
Log an info message
UnifiedLogger$info(message, context = NULL)
warn()
Log a warning message
UnifiedLogger$warn(message, context = NULL)
error()
Log an error message
UnifiedLogger$error(message, context = NULL)
log_api_call()
Log API call performance
UnifiedLogger$log_api_call( provider, model, duration, success = TRUE, tokens = NULL )
log_api_request_response()
Log complete API request and response for debugging and audit
UnifiedLogger$log_api_request_response( provider, model, prompt_content, response_content, request_metadata = NULL, response_metadata = NULL )
log_cache_operation()
Log cache operations
UnifiedLogger$log_cache_operation(operation, key, size = NULL)
log_cluster_progress()
Log cluster annotation progress
UnifiedLogger$log_cluster_progress(cluster_id, stage, progress = NULL)
log_discussion()
Log detailed cluster discussion with complete model conversations
UnifiedLogger$log_discussion(cluster_id, event_type, data = NULL)
log_model_response()
Log model response with concise summary in main log and full text in file
UnifiedLogger$log_model_response( provider, model, response, stage = "annotation", cluster_id = NULL )
get_performance_summary()
Get performance summary
UnifiedLogger$get_performance_summary()
cleanup_logs()
Clean up old log files
UnifiedLogger$cleanup_logs(force = FALSE)
set_level()
Set logging level
UnifiedLogger$set_level(level)
clone()
The objects of this class are cloneable with this method.
UnifiedLogger$clone(deep = FALSE)
deepWhether to make a deep clone.
Zhipu API Processor
Zhipu API Processor
Concrete implementation of BaseAPIProcessor for Zhipu models. Handles Zhipu-specific API calls, authentication, and response parsing.
mLLMCelltype::BaseAPIProcessor -> ZhipuProcessor
new()
Initialize Zhipu processor
ZhipuProcessor$new(base_url = NULL)
get_default_api_url()
Get default Zhipu API URL
ZhipuProcessor$get_default_api_url()
make_api_call()
Make API call to Zhipu
ZhipuProcessor$make_api_call(chunk_content, model, api_key)
extract_response_content()
Extract response content from Zhipu API response
ZhipuProcessor$extract_response_content(response, model)
clone()
The objects of this class are cloneable with this method.
ZhipuProcessor$clone(deep = FALSE)
deepWhether to make a deep clone.