Software
Coding Packages, Projects, and Methodological Expertise

Overview

I have over seven years of experience in Python, R, and Stata, specializing in statistical modeling, causal inference, data visualization, machine learning, and natural language processing. My work focuses on developing reproducible, user-friendly tools that enable robust quantitative analysis across diverse research domains.

My programming expertise spans the full data science workflow, from data ingestion and transformation to statistical modeling and interactive result delivery. I have developed and maintained software packages, both independently and collaboratively, and implemented modular, end-to-end workflows for statistical inference.

I have deep expertise in causal inference methods, including matching (e.g., propensity score matching, Mahalanobis distance matching, genetic matching, and coarsened exact matching), regression discontinuity designs (standard and in time), and modern synthetic control and panel estimators such as Generalized Synthetic Control (GSC), Synthetic Difference-in-Differences (SDID), and event-study difference-in-differences (DiD). These approaches support robust policy evaluation and impact analysis in observational settings.
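As a minimal illustration of one of these designs, the sketch below estimates an event-study DiD specification with the fixest R package. The data frame `panel_df` and its variables (`outcome`, `treated`, `rel_time`, `unit`, `period`) are hypothetical placeholders, not data from any project described here.

```r
# Event-study DiD sketch using fixest. `panel_df` is a hypothetical
# unit-by-period panel; `rel_time` is time relative to treatment.
library(fixest)

es <- feols(
  outcome ~ i(rel_time, treated, ref = -1) |  # leads/lags of treatment, period -1 omitted
    unit + period,                            # two-way fixed effects
  data    = panel_df,
  cluster = ~unit                             # cluster standard errors by unit
)

iplot(es)  # plot event-study coefficients around the treatment date
```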

I also specialize in natural language processing and text analysis, developing tools for keyword extraction, sentiment and emotion analysis, multilingual processing, and topic modeling using both lexicon-based methods and transformer-based architectures (e.g., BERTopic, RoBERTa, XLM-R).
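The transformer-based side of this work runs in Python, but the lexicon-based side can be sketched compactly in R. Below is a minimal example using the syuzhet (NRC emotion lexicon) and quanteda (N-grams) packages; `docs` is a hypothetical character vector of documents.

```r
# Lexicon-based emotion scoring and N-gram extraction sketch.
# `docs` is a hypothetical character vector of raw documents.
library(syuzhet)
library(quanteda)

# NRC lexicon scoring: one row per document, one column per emotion.
emotions <- get_nrc_sentiment(docs)
colSums(emotions)  # aggregate emotion counts across the corpus

# Bigram frequencies via quanteda.
bigrams <- docs |>
  tokens(remove_punct = TRUE) |>
  tokens_ngrams(n = 2) |>
  dfm()
topfeatures(bigrams, 10)  # ten most frequent bigrams
```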

Complementing my statistical and NLP expertise is my background in web application development using Streamlit and Shiny, through which I build interactive, browser-based tools for applied researchers and analysts. These tools are designed to eliminate the coding barrier, making complex methods usable through intuitive interfaces. Overall, my work sits at the intersection of methodological rigor, data accessibility, and applied insight, integrating computational tools with causal reasoning to support impactful, transparent, and reproducible research.
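A minimal Shiny sketch of the idea: the user fits a regression through dropdown menus rather than code. The app below uses the built-in mtcars data purely as a stand-in dataset and is not part of any tool described on this page.

```r
# Minimal no-code regression interface in Shiny; mtcars is a stand-in dataset.
library(shiny)

ui <- fluidPage(
  titlePanel("Point-and-click regression"),
  selectInput("y", "Outcome", names(mtcars), selected = "mpg"),
  selectInput("x", "Predictor", names(mtcars), selected = "wt"),
  verbatimTextOutput("fit")
)

server <- function(input, output) {
  output$fit <- renderPrint({
    # Build the formula from the user's menu choices and fit OLS.
    summary(lm(reformulate(input$x, input$y), data = mtcars))
  })
}

shinyApp(ui, server)
```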

TextViz Studio

TextViz Studio is a browser-based, no-code platform I developed to make advanced statistical and text analysis techniques accessible to researchers without programming backgrounds. Built entirely in Python using Streamlit, the application encapsulates a range of quantitative and natural language processing tools through an intuitive user interface that abstracts away the complexity of code.

The platform includes modular tools for statistical data exploration, regression modeling, keyword and N-gram analysis, multilingual sentiment and emotion detection, and transformer-based topic modeling. These tools are designed to support common workflows in the social sciences, policy evaluation, and qualitative research. Users can perform everything from basic descriptive analysis to causal modeling and topic discovery with just a few clicks, and export visualizations, summaries, and formatted tables for reporting or publication.

TextViz Studio reflects my broader goal of democratizing data science: building transparent, customizable tools that preserve methodological rigor while enabling broader access. It supports both exploratory and confirmatory research designs, integrates modern machine learning backends (e.g., BERTopic, GPT-based labeling, RoBERTa), and allows batch processing for large-scale document analysis.

Packages

AppendIDs (R Package)

The AppendIDs R package helps researchers working with cross-sectional and cross-sectional time-series (CSTS) country data to resolve inconsistencies across differing country-coding schemes. Designed for scholars in international relations and comparative politics, the package simplifies the merging of datasets that rely on divergent systems such as Gleditsch-Ward, Correlates of War (COW), and International Financial Statistics (IFS) codes.

Its core function automatically appends standardized Gleditsch-Ward identifiers to country-year and dyadic datasets, flags duplicate entries, and supports integration with tools like the countrycode package. This enables seamless dataset alignment across varying naming conventions and coding systems, including for countries with historical name or status changes (e.g., pre-1990 West Germany vs. post-reunification Germany).
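To convey the kind of harmonization this automates, the sketch below maps Correlates of War codes to Gleditsch-Ward codes. The countrycode calls are real; the commented `append_gw_ids()` call is a hypothetical stand-in for AppendIDs' actual interface, shown only to illustrate the workflow.

```r
library(countrycode)

# A toy country-year panel keyed by Correlates of War numeric codes.
df <- data.frame(cowcode = c(2, 20, 200), year = 1985)

# Real countrycode call: map COW codes to Gleditsch-Ward numeric codes.
df$gwcode <- countrycode(df$cowcode, origin = "cown", destination = "gwn")

# Hypothetical AppendIDs usage (illustrative only; see the package
# documentation for its actual interface):
# library(AppendIDs)
# df <- append_gw_ids(df, code = "cowcode", time = "year")
```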

AppendIDs is a key tool behind the International Political Economy (IPE) Data Resource, which unifies over 90 commonly used datasets in the IPE literature. It promotes replicability and efficient data integration for large-scale comparative research.

topicmodelCV (R Package)

The topicmodelCV R package provides a streamlined way to determine the optimal number of topics (k) for Latent Dirichlet Allocation (LDA) models using 10-fold cross-validation. Designed for applied researchers working with text data, the package evaluates model quality across multiple metrics—including perplexity, held-out likelihood, and semantic coherence.

By supplying a pre-processed document-term matrix (DTM) to the function, users can run multiple LDA models using the topicmodels package and compare model performance across values of k. The package outputs a tidy results table for inspection and includes built-in visualization functions to support quick and interpretable model selection.
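Below is a minimal sketch of the cross-validation loop the package streamlines, written against the topicmodels API directly. The fold construction and object names are illustrative rather than the package's internals, and `dtm` is assumed to be a pre-processed DocumentTermMatrix.

```r
# 10-fold cross-validated perplexity over candidate k values,
# using topicmodels directly; `dtm` is an assumed DocumentTermMatrix.
library(topicmodels)

ks    <- c(5, 10, 15, 20)                          # candidate topic counts
folds <- sample(rep(1:10, length.out = nrow(dtm))) # random fold assignment

grid <- expand.grid(k = ks, fold = 1:10)
grid$perplexity <- apply(grid, 1, function(row) {
  held_out <- which(folds == row[["fold"]])
  fit <- LDA(dtm[-held_out, ], k = row[["k"]], control = list(seed = 42))
  perplexity(fit, newdata = dtm[held_out, ])       # held-out perplexity; lower is better
})

# Mean held-out perplexity per candidate k.
aggregate(perplexity ~ k, data = grid, FUN = mean)
```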

topicmodelCV simplifies a critical step in unsupervised topic modeling, making cross-validated LDA evaluation more accessible, reproducible, and interpretable for researchers and data scientists.