
Democratization of Analysis: An Exercise in Accessing and Visualizing New York Health Data

Writer: Step Two Policy Project

Updated: 5 hours ago

Health Data and Information Series # 4

 


 

Key Takeaways

  • There have been significant efforts in New York to make State-level healthcare data available to the public, especially through reporting in the health.data.ny.gov environment, which offers important information describing the data, robust visualization tools, and options for direct export and API connection.

  • Many topics of policy relevance are not available on health.data.ny.gov and are instead posted directly to agency-based webpages and managed with less standardization and clarity about relevant definitions, time frames, sources, update frequency, and other characteristics.

  • We thought it would be helpful to illustrate the types of resources that could be developed from the health delivery system data that is already publicly accessible but is not organized or presented in ways conducive to generating insights on public policy.

  • We selected a few topics as case studies: Medicaid managed long term care, Medicaid managed care, nursing homes, and the Essential Plan.

  • Putting together relevant datasets for ingestion, analysis, and visualization highlighted challenges related to: definitions and labels; formats and structure; and timeliness.

  • There are many opportunities to facilitate the democratization of health data and information:

    • Host more data on Health Data NY 

    • Publish data on additional topics of policy importance 

    • Produce dashboards and publish spreadsheet-friendly reports to facilitate interaction with the data 

    • Use more consistent structure and formatting conventions, standardized terminology, and shared demographic reference tables across health-related datasets

    • Provide return on reporting investment for organizations

    • Partner with stakeholders to streamline surveying and reporting efforts

  • We have included a detailed explanation of the methods we undertook for this project in the Appendix.


Introduction

The Step Two Policy Project has been focusing on the transparency of health data and information since our first paper in September of 2023. Our thesis is that more transparency would facilitate what we call the “democratization of analysis.” The premise of the “democratization of analysis” is that giving more people tools to understand what is happening with healthcare in New York State will generate insights which, in some instances, may not have been the focus of State staff and policymakers themselves.


Sometimes, it is hard to fully understand what an idea would look like if it were operationalized. Hospital global budgets, to pick one example, sound intuitively simple but, in reality, are quite complex. By contrast, the proposal for greater transparency of health data and information in New York is relatively easy to imagine because it has a direct analog.


The Step Two Policy Project has proposed that New York create a health data organization that would perform functions similar to those of the Massachusetts Center for Health Information and Analysis (CHIA) and would provide much more transparency about the State’s healthcare delivery system than is currently available to the public. We developed a crosswalk comparing the health data and information that is publicly available through CHIA in Massachusetts with what is available in New York.


The crosswalk examined whether the same data and information reported publicly by CHIA is captured in New York and whether that data is publicly accessible. We discovered that most of the data available through CHIA is already captured in New York, and much of that data is publicly available. However, the publicly available data is often difficult to use because it is siloed, presented in formats that are difficult to work with, and has ambiguities that make it challenging to be sure what is included in the analyses, i.e., exactly what information the data represents.  



“We are in an era of data-driven decision making and New York is working with an anachronistic data and information infrastructure that constrains both the value we can glean and the breadth of our ability to innovate. In order to improve healthcare delivery for New York’s communities and to address the long-standing inequities in access and outcomes, we need to modernize our data and information infrastructure to produce accurate, timely, and actionable information. We can do this by developing a comprehensive health and health-related data strategy with centralized leadership and the authority to engage partner State agencies to reach across the data silos.”


Recent developments in New York suggest that the state may be on the threshold of expanding the type of health data and information that is available to both policymakers and the public. At least two health data initiatives of significance are described in Gov. Hochul’s 2025 State of the State Book. The first addresses healthcare quality, which we know is also a focus of the Future of Healthcare Commission, as follows:


“New York implemented Quality Assurance Reporting Requirements (QARR) in 1994 to measure and report on healthcare quality. While Medicaid managed care plans meet or exceed national benchmarks for many key adult measures, the current measures look at the population as a whole without insight into how quality may differ across different segments of the population. As a result, the State lacks data needed to identify and address health inequities in New York State's Medicaid managed care population. To address ongoing health disparities in New York, Governor Hochul will direct managed care plans to analyze their populations through a health equity lens to determine the largest disparities in quality and outcomes. The plans will be required to develop quality measures, stratified by key demographic factors, and implement strategies to address any gaps, including developing value-based payments to improve health equity.” 


We were also encouraged to see the announcement that the United Hospital Fund will serve as the lead entity to coordinate the Medicaid Health Equity Regional Organization (HERO), which will help to “better integrate Medicaid services” under the current 1115 waiver.


“As part of New York’s federal Medicaid waiver approved last year, the State is investing $125 million in building new health planning and data infrastructure through a new Health Equity Regional Organization (HERO). Investments in the HERO will lay the groundwork for a new statewide data infrastructure that can be used to support the design and development of new policies, interventions, and targeted investments to improve outcomes and reduce health disparities. A key goal of this infrastructure will be to enhance the State's capacity for program evaluation. By leveraging partnerships with academic centers and stakeholders, the State will develop and evaluate metrics of success for existing and future programs, including massive new investments in health-related social needs services.”


There is a degree of uncertainty surrounding the HERO, from the federal perspective, but also from the State, in terms of its ultimate scope. Will the HERO go beyond collecting and analyzing Medicaid data, to other payers? To what extent will it interact with the State’s all payer database? At this early stage, it remains unclear to what extent the HERO will be primarily focused on community-specific health-related social needs (HRSN) assessments and examining the impact of service connections and delivery, as opposed to a much broader range of data and information about individuals, populations, healthcare spending, and the healthcare delivery system.


It is important to analyze the healthcare delivery system, including the utilization and quality of clinical care, from the perspective of all types of payers, not just government payers. In this regard, we are supportive of the recent regulatory changes to the Statewide Health Information Network for New York (SHIN-NY) that, combined with continued progress on the State’s all payer database, will help to bridge clinical care and public health to enable better population health management.


We regularly utilize the interactive datasets that are available on Health Data NY and study the reports produced by the Medicaid program, the NY State of Health, and the Division of the Budget. But even with these resources, which have varying degrees of manipulability, consistency, and standardization, the reality is that much of New York’s health and health-related data and information that would be useful in policy analysis and development continues to be opaque. We doubt this opacity is by design. Rather, it is more a function of a non-strategic evolution of State data systems and of the fact that transparency and sharing of data and information have not been a priority.


We thought it would be helpful to illustrate the types of resources that could be developed from the health delivery system data that is currently publicly accessible but is not usually organized or presented in ways that facilitate insights into public policy. To implement this project, we worked with Isaac Michaels, an epidemiologist.[1] Although the data sources we were working with are relatively straightforward, it was important to have someone assisting us with our process who is familiar with working with large datasets and is skilled in methods to ingest, clean, analyze, and visualize data.


We started by selecting a few datasets as case studies. Putting together these datasets for the purposes of developing visualizations and interactive tools highlighted some of the challenges that arise when preparing data to be presented to the public. Definitions were not always straightforward, and tying enrollment data to spending frequently requires clarifications. One of the benefits of public transparency is that it forces people to present information in a way that resolves ambiguities.


Another benefit of putting together dashboards, visualizations, and other resources is that it enables secondary analysis by directly giving the public the tools to observe historical trends and make corresponding projections.


Orientation to the Data

Below is a bullet list of the data categories that we focused on, which can be found on the Data page of the Step Two Policy Project’s website.

Medicaid Managed Long Term Care (MLTC)


  • In New York State, the Department of Health (DOH) reports monthly on its website the enrollment in various types of Managed Long Term Care (MLTC) plans – i.e., partial capitation MLTC, Medicaid Advantage Plus (MAP), and PACE. These reports are posted in PDF and Excel formats, though the spreadsheet formatting requires significant cleaning to be used in analysis.

  • This enrollment data is also included on Health Data NY, where its format is more conducive to outside analysis, but the posting of data there lags the posting on the DOH website by approximately six months.[2]

  • Both sources of enrollment data roll up New York City counties to “New York City” but report all other NYS counties separately, which makes truly statewide, county-level analysis impossible.

  • Total long term care spending – either managed or fee-for-service – is not reported with discrete categories in annual budget documents. However, the quarterly Medicaid Global Cap report published by DOH includes a spending category called “Managed Long Term Care,” which we used for this analysis. It is not clear exactly what is included in the “Managed Long Term Care” spending row, but the global cap reports include Medicaid Advantage Plus (MAP), PACE, and partial capitation plans in their total enrollment counts for “Managed Long Term Care.”

  • Despite the ambiguity of what “managed long term care” spending includes, for the sake of our analysis, we assume the spending number includes MAP and PACE, in addition to partial capitation plans, and we include MAP and PACE enrollment, since both are publicly available in the same source as for partial capitation.

  • Financial data on MLTC plans is well documented in annual Medicaid Managed Care Operational Reports (MMCORs), which are only accessible by filing a Freedom of Information request or by paying a third party. [3]

  • The lack of accessible, detailed public data on MLTC plans means critical information on financial allocations and utilization patterns is unavailable. Making this information more accessible would help hold plans accountable for their management of public funds and would help analysts better understand the forces behind significant budget growth in this sector every year.

  • Sources: Medicaid Program Enrollment by Month: Beginning 2009, Health Data NY and Global Spending Cap Updates (quarterly), NYS DOH

    • Enrollee Months by Plan Type, Total Enrollment, State Spending, and Per Capita Average Annual Spending, by Fiscal Year (table)

    • Enrollment and Spending, by Fiscal Year (graph)

    • Member Months by Plan Type, Plan Name, Year (table)

    • Number of Enrollees by Plan Name, Plan Type, Month (graphs)

    • Excel Workbook with Interactive Projection tool, previewed below

Medicaid Managed Care (MMC)

  • Medicaid managed care (MMC) covers the majority of the approximately seven million New Yorkers enrolled in Medicaid; the share is variably cited at 74-80%.

  • As in the case of MLTC, DOH reports enrollment in MMC plans on its website on a monthly basis. It is also included on Health Data NY, but there it lags the DOH website by approximately six months. Both roll up NYC counties to “New York City” but report other NYS counties separately.

  • MMC spending is not reported discretely in annual budget documents. However, the quarterly Medicaid Global Cap report published by DOH includes a spending category of “Mainstream Managed Care,” though it does not define exactly what that comprises. Global cap reports include HARP and HIV/SNPs in total enrollment counts for Mainstream Managed Care, but that is inconsistent with definitions of mainstream found elsewhere. These two programs are not Mainstream plans, but alternatives to them, so it is not intuitive to combine them unless one is describing Medicaid managed care writ large (i.e., not mainstream specifically).

  • Given the ambiguity of what “mainstream managed care” includes, for the sake of our analysis, we assume the spending number includes HARPs and HIV/SNPs, but we do not include HARP and HIV/SNP enrollment, in case spending excludes them.

  • The limited availability of financial information on MLTC plans, described above, is also true of MMC plans.

  • Sources: Medicaid Program Enrollment by Month: Beginning 2009, Health Data NY and Global Spending Cap Updates (quarterly), NYS DOH

    • Enrollee Months by Plan Type, Total Enrollment, State Spending, and Per Capita Average Annual Spending, by Fiscal Year (table)

    • Enrollment and Spending, by Fiscal Year (graph)

    • Excel Workbook with Interactive Projection tool, previewed below

Nursing Homes

  • This topic includes census and spending on a variety of long term and specialty bed types (nursing home, pediatric, behavioral intervention, ventilator, scattered ventilator, traumatic brain injury, and neurodegenerative disease).

  • Publicly available nursing home data does not include the proportion of beds attributed to any particular payer type, Medicaid or otherwise. The name of the dataset, “Nursing Home Weekly Bed Census,” is a bit of a misnomer because none of the columns directly report census (i.e., how many individuals are in beds at a given facility). Instead, the dataset reports “total capacity” and “available capacity,” from which the user can calculate the difference. The data dictionary notes that “beds” account for “beds that the facility is approved to operate” and “beds available” actually includes “non-operational” beds. Furthermore, there is no differentiation between certified beds and staffed beds. Given the current ubiquity of workforce shortages, it is likely that the bed availability reported this way is generally an overcount.

  • There is limited insight into nursing home spending within managed long term care, though there are also proposals to shift long stay nursing home mainstream spending to fee-for-service. Fee-for-service nursing home spending is available in quarterly Global Cap Reports, which only include Medicaid. Spending information can, technically, be gleaned from plans’ Medicaid Managed Care Operational Reports (MMCORs), but these are private. Efforts have been made to make the data more transparent (see: New York Legal Assistance Group: MLTC Data Transparency), but they have not been maintained. And even the private data is lagged such that the latest available is from calendar year 2023. Therefore, the most feasible way to access this information now is to request it directly from the State Department of Health, with Mainstream and MLTC separated.

  • Spending data in our tables and graphs is based only on fee-for-service spending, as reported in quarterly Global Spending Cap Updates, because public data on nursing home spending is very limited. These tables and graphs include “Gross Spending,” which is estimated by assuming a 50% federal match to reported State spending.

  • Sources: Nursing Home Weekly Bed Census: Beginning 2009, Health Data NY and Global Spending Cap Updates (quarterly), NYS DOH

    • Nursing Home Census and Spending, by Fiscal Year (table)

    • Nursing Home Census and Spending, by Fiscal Year (graph, previewed below)

    • Nursing Home, Annual Average Census (table)

    • Nursing Home Beds, Monthly Average Census (graph)

Essential Plan

  • The Essential Plan, New York’s Basic Health Program under the Affordable Care Act, offers health insurance for lower-income residents who do not qualify for Medicaid. It was launched in April 2015. At that time, eligibility was capped at 200% FPL. Effective April 1, 2024 (FY24 and beyond), the eligibility cap increased to 250% FPL, capturing a wider swath of New Yorkers.

  • As of this writing there are still extensive subsidies available to Essential Plan enrollees, but we calculated per capita spending using the overall program enrollment (regardless of plan level) and spending reported at the State level.

  • Public spending data about the Essential Plan is limited to what is available in annual budget documents, which only report disbursements.

  • Sources: Essential Plan enrollment data, compiled manually from New York State of Health monthly reporting dating back to September 2015, and spending data, aggregated manually from NYS budgets’ Financial Plans dating back to FY16.

    • Enrollment and Spending, by Fiscal Year (table)

    • Enrollment and Spending, by Fiscal Year (graph, previewed below)
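
The derived figures described in the list above reduce to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not the project's actual workbook logic. The function names are hypothetical, average enrollment is assumed to be enrollee months divided by 12, and gross spending is assumed to be reported State spending divided by the State share (1 minus the assumed 50% federal match).

```python
def average_enrollment(enrollee_months: float) -> float:
    """Average enrollment over a fiscal year (assumption: enrollee months / 12)."""
    return enrollee_months / 12


def per_capita_spending(state_spending: float, enrollee_months: float) -> float:
    """Average annual spending per enrollee for one fiscal year."""
    return state_spending / average_enrollment(enrollee_months)


def gross_spending(state_spending: float, federal_match: float = 0.50) -> float:
    """Estimate gross (State plus federal) spending from State-share spending.

    With a 50% match, gross spending is twice the reported State spending.
    """
    return state_spending / (1 - federal_match)


def occupied_beds(total_capacity: int, available_capacity: int) -> int:
    """Census implied by the bed dataset: total capacity minus available capacity."""
    return total_capacity - available_capacity
```

Because the bed dataset counts some non-operational beds as available, a census computed this way inherits the same limitations noted above.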

Observations

There have been significant efforts in New York to make State-level healthcare data available to the public, especially through reporting in the health.data.ny.gov environment, which offers helpful information about the data, robust visualization tools, and options for direct export and API connection. That data, too, is sometimes vulnerable to some of the challenges we describe below, although, in our experience, the “dataset owners” who maintain these resources are responsive to outreach and willing to answer questions and share data in alternative formats on an ad hoc basis.
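
As an illustration of what an API connection can look like, here is a minimal Python sketch. It assumes the portal exposes Socrata-style endpoints of the form `/resource/<dataset-id>.json` with `$limit` and `$offset` paging parameters; the dataset identifier in the usage note is a placeholder, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base path for Socrata-style dataset endpoints on the portal.
BASE_URL = "https://health.data.ny.gov/resource"


def build_url(dataset_id: str, limit: int = 1000, offset: int = 0) -> str:
    """Build a paged query URL for one dataset (dataset_id is supplied by the user)."""
    query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
    return f"{BASE_URL}/{dataset_id}.json?{query}"


def fetch_rows(dataset_id: str, limit: int = 1000, offset: int = 0) -> list:
    """Download one page of rows as a list of dictionaries (requires network access)."""
    with urlopen(build_url(dataset_id, limit, offset)) as response:
        return json.load(response)
```

A call such as `fetch_rows("xxxx-xxxx", limit=100)` would return the first 100 rows once the placeholder is replaced with the identifier shown on the dataset's page.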


Still, data on many other topics of policy relevance is not available on health.data.ny.gov and is instead posted directly to agency-based webpages and managed with less standardization and clarity about relevant definitions, time frames, sources, update frequency, and other characteristics.


Overall, there are a few themes among the challenges we have observed across New York’s publicly available data sources:

Definitions and Labels

Some datasets include inconsistencies and ambiguities in definitions and labels, which can make data interpretation and combining previous reports difficult. Vague or inconsistent labeling makes it difficult to determine exactly what is being measured. For example, when using the monthly managed care enrollment reports posted on the DOH website, we had to verify whether the “Integrated Benefits for Duals in HARP” within the HARP sheet was a subset of “Integrated Benefits for Duals in MMC” within the NYSOH sheet, or whether they were actually independent. Such ambiguity could result in double-counting or other errors.


Inconsistencies also arise from varying naming conventions, not only within datasets but between agencies, making cross-agency comparisons challenging. While seemingly minor, the common pain point of referring to Saint Lawrence County as “Saint Lawrence” or “St. Lawrence” makes a real difference when comparing data across two sources that use different conventions. Some datasets report the counties of all five New York City boroughs as one county called “New York City,” which makes it impossible to disaggregate and observe trends within the city. Additionally, there may be differences in population numbers or other demographic statistics that are used in State agency analyses. One solution would be for all State entities to utilize a standard reference table for county and state-level statistics, which could be updated at regular intervals and would provide standardization and consistency in analyses across New York’s agencies.
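
The county-name problem can be handled with a small normalization step before two sources are joined. The sketch below is illustrative only: the alias table and function name are hypothetical, and a real shared reference table would need to cover every county and be maintained centrally.

```python
import re

# Hypothetical alias table mapping cleaned variants to one canonical spelling.
COUNTY_ALIASES = {
    "st lawrence": "St. Lawrence",
    "saint lawrence": "St. Lawrence",
}


def normalize_county(name: str) -> str:
    """Return a canonical county name so that two sources can be joined."""
    key = re.sub(r"[^a-z ]", "", name.lower())  # drop punctuation and digits
    key = re.sub(r"\s+", " ", key).strip()      # collapse repeated whitespace
    key = re.sub(r" county$", "", key)          # drop a trailing "County"
    return COUNTY_ALIASES.get(key, key.title())
```

Applying the same function to the county column of each source before merging avoids silently dropping rows that differ only in spelling.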


Finally, while it is natural for definitions and reporting structures to evolve alongside programs, such changes should be well-documented and reflected in dataset metadata, column headings, and other supporting materials. Without proper documentation, data users may struggle to understand the rationale for shifts in reporting and accurately track trends over time. Health Data NY consistently includes information about the specific dataset, an overview, and a data dictionary, but this is not usually the case with data posted elsewhere. Standardized documentation and better transparency in reporting would improve the usability and consistency of state datasets and allow researchers to better contextualize their findings.


Formats and Structure

As we discussed in our first Policy Brief, Democratization of Health Data, Information, and Policy Analysis, “Data refinement is the process of transforming raw and unstructured data into clean and structured formats.” The effort of data refinement can be significant but is essential for optimizing the utility of public data.


Uploading regular reporting in formats that are conducive to analysis and interactive searches is aligned with a strategy of democratizing analysis. In New York State, however, many files are made available only in PDF format. While one can convert these to Excel using Adobe Pro, which requires a paid subscription, the resulting Excel file typically requires effort, sometimes considerable, to “clean.” The conversion often carries the original headers, footers, page numbers, and blocks of white space over to Excel, resulting in irregularly merged columns, logos or other images floating near where they originally occurred in the PDF, and empty rows that need to be removed. Quarterly Global Cap Reports, for example, are only available as PDFs; the tables are difficult to extract cleanly and have changed structure over time; and the reports contain significant blocks of narrative, which do not make sense to retain in a spreadsheet.


Essential Plan Enrollment Reports, while posted monthly and not difficult to locate, are posted to the Essential Plan website as PDFs. The format of the tables, when converted to Excel, retains subtotal rows after each county, which are not conducive to efficient analysis. These PDFs are certainly useful to other stakeholders, and they contain relevant policy detail that helps contextualize the data. Still, it would be easier to extract the relevant data for use in direct analysis if the spreadsheet versions could be posted alongside the PDF versions.[5]


In the Introduction, we mentioned the Massachusetts CHIA. Most of the data and information posted on the CHIA website is publicly available at multiple levels of refinement, from PDF reports to Excel workbooks to interactive Tableau dashboards.[6] These resources are also accompanied by detailed technical appendices that define terms and describe methods. This is a convention that, once it becomes standard, would ultimately save the State and public stakeholders time and effort.


Timeliness

Some datasets are significantly lagged, meaning the information they produce is outdated by the time the data is technically available. For example, annual Institutional Cost Reports are a valuable source of hospital financial data but are three years lagged (i.e., 2021 Hospital Cost Report Data was added to health.data.ny.gov in the fall of 2024), posted once the reports have been audited by a Certified Public Accounting Firm. The greater the delay, the less actionable the data for policymaking.


In addition to data lags, irregular reporting cadence (especially with significant gaps) is problematic because it complicates the potential for interpreting the impact of programs and initiatives. Essential Plan enrollment data, for example, is available for every month from September 2021 - November 2024, but prior to that is only available in February and October, or January and September, going back to 2015. Because eligibility criteria and other such circumstances often evolve while programs are being implemented, large gaps in the data make it difficult to know more precisely when the impact of these changes occurs.
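
Gaps of this kind are easy to surface programmatically once the reporting months are listed. The following Python sketch uses a hypothetical function name and made-up sample months; it simply walks a date range and reports every month with no data.

```python
def missing_months(reported, start, end):
    """List (year, month) pairs between start and end, inclusive, with no report.

    reported: set of (year, month) tuples for which data exists.
    start, end: (year, month) tuples bounding the period of interest.
    """
    missing = []
    year, month = start
    while (year, month) <= end:
        if (year, month) not in reported:
            missing.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return missing


# Made-up example: only three months reported in a thirteen-month window.
sample = {(2020, 2), (2020, 10), (2021, 1)}
gaps = missing_months(sample, (2020, 1), (2021, 1))
```

The length of the resulting list gives a quick measure of how sparse a series is before any trend analysis is attempted.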


On the financial side, Quarterly Global Spending Cap Reports have not always followed a consistent production schedule. Originally, they were produced as monthly reports. Even the Mid-Year Budget Update often does not revise previous estimates, limiting its usefulness for real-time fiscal planning. Additionally, spending forecasts are frequently designed to remain unchanged until nine months into the fiscal year, delaying the availability of updated projections. As a result, beyond the Global Spending Cap Report, there is often little timely spending data available to support informed decision-making.    


Again, there are opportunities to contact “dataset owners” or agency-specific staff to clarify confusion about various datasets, but direct remediation of specific data issues is not a complete or sustainable solution. Staff only have so much bandwidth to make these adjustments ad hoc, and data could be prepared in a more strategic way to be more useful to research and policy communities and the broader public.


Opportunities

Although the DOH public health data webpages have undergone a significant, positive transformation in recent years, the dashboards on the NYS Health Connector, envisioned as the public’s access to the delivery system data in the all payer database, remain relatively limited in scope and are not using timely data.[7]


Host more data on Health Data NY

As we have mentioned, Health Data NY pages consistently include valuable metadata, including a description of the specific dataset, the frequency of posting, ownership, and a data dictionary—not usually the case with data posted elsewhere. The existing infrastructure and established conventions of Health Data NY provide the most usability and consistency of health-related datasets and should be used for additional topics.


Publish data on additional topics of policy importance

There are many topics of public policy interest and priority that stakeholders have limited information on, despite that information existing within relevant agencies. For example:

  • Public Health Data Resources: Federal actions have led to the removal of key public health data from websites, impacting the availability of comprehensive and timely information on topics like avian flu and other infectious diseases. States must step in to fill expanding voids in epidemiological and other public health information sharing. The Department of Health’s new Global Health Update Report is one example of the contribution states will have to make in the absence of more robust federal resources.

  • CDPAP: There is currently no public disclosure of CDPAP enrollment other than in Institutional Cost Reports (ICRs), which are only truly accessible for inside stakeholders and State staff who have the technical ability to roll up ICRs, which are invariably lagged. This information advantage is reproduced in many instances, not just in CDPAP. Reporting on spending on CDPAP, an area of massive growth in the State budget, is also buried within the quarterly Global Cap Reports and reported as “Personal Care,” which would not be an apparent synonym for the unaccustomed reader.

  • Medicaid: Ideally, more comprehensive and detailed information on Medicaid services and spending should be publicly available, given the significant State investment in the program. For example, during DSRIP, DOH, in cooperation with Salient Management Company, provided Medicaid service delivery data by region that included claims, managed care encounters, and members served for a variety of clinical settings, which offered insight into Medicaid utilization.[8] This type of information should be available on an ongoing basis.

    • One important example is the significant gap in information regarding fee-for-service (FFS) Medicaid, which persists despite the availability of monthly enrollment data on various plan types in Medicaid Managed Care. Currently, the only available data on FFS enrollment is a total enrollee count provided in quarterly global cap reports. This lack of detail overlooks key aspects such as how many FFS enrollees use Medicaid as their primary insurance versus those who have other primary insurance, such as Medicare or commercial plans, with Medicaid serving as their secondary coverage. Gaining insight into the distribution of primary and secondary Medicaid coverage among FFS enrollees, as well as understanding the expenditures associated with individuals using Medicaid as secondary coverage, would be valuable for policy development.

  • Emergency Medical Services: State EMS utilization and outcomes data, except for the FDNY EMS data in New York City,[9] is not publicly available. All EMS agencies are required to submit their electronic patient care reports (ePCRs) and other operations-related data to the State in a form that adheres to New York’s requirements and is compatible with the federal EMS data platform called NEMSIS. Some EMS agencies, if their vendor supports it and it is financially viable, transmit their ePCRs to the regional Qualified Entities (QEs) that make up the SHIN-NY. This clinical information is then available to healthcare providers who can see blood glucose readings, for example, from a patient’s diabetes-related emergency that required a 911 call. Adding EMS treatment encounters to the SHIN-NY admission/discharge/transfer (ADT) notifications to healthcare providers would assist in situations where a patient is treated in place or transported to a destination other than an emergency department. Ultimately, connecting EMS data to Statewide Planning and Research Cooperative System (SPARCS) data could provide information on patient outcomes and support policies related to expanded community-based interventions, alternate EMS transport destinations, or other emergent EMS-related innovative practices.


Produce dashboards and publish spreadsheet-friendly reports to facilitate interaction with the data

Higher-level, illustrative dashboards help highlight key areas of interest in healthcare and could be expanded, especially within the budgetary context.

  • Essential Plan: While enrollment reports are posted monthly in PDF format, CSV or Excel versions should also be posted.

  • Global Cap Reports: These reports include both large blocks of text and many tables, so conversion from PDF to Excel is inefficient. Since the data is financially oriented, it seems natural that the format should support additional analysis. The Global Cap Reports should have accompanying CSV or Excel workbooks for the tables they include.

  • Commercial Health Insurance Spending: The commercial market represents a large proportion of New Yorkers and is a well-documented source of cost growth in states nationwide. Public data on claims (received, paid, denied) by insurers in New York is available, but on a quarterly basis and with each plan listed individually. Creating a dashboard-style interface for the claims reports would offer a dynamic way to analyze the data across different parameters and time frames and would facilitate comparison.

  • Commercial Insurance Enrollment: Data on enrollment by commercial insurer is difficult to access. Each year, the NYS Department of Financial Services makes available payers’ Rate Applications and consolidates changes into tables that include the number of members by plan in the Individual, Small Group, Large Group, Medicare Supplement, and Long Term Care markets. Presenting this already-public data in a more interactive format would be valuable for observing trends over time and making comparisons across plans.


For these and other cases where no CSV or Excel files are posted, and where the omission is not required to protect private information, the State should make such files available and structure the spreadsheets according to best practices for data formatting, also known as “tidy data” principles. Otherwise, users must spend time cleaning the data: removing blank rows, reorganizing columns, and, when working with converted PDFs, correcting inconsistently merged columns.
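To make the cleanup burden concrete, the following is a minimal Python sketch of the kind of tidying a user has to do after converting a PDF table to CSV. The sample table, its column names, and the function name tidy_rows are all hypothetical and are not drawn from any State report; the sketch also assumes subtotal rows can be recognized by the label in the first column.

```python
import csv
import io
import re


def tidy_rows(raw_csv: str) -> list[dict[str, str]]:
    """Turn a messy converted table into one record per row with clean headers."""
    rows = list(csv.reader(io.StringIO(raw_csv)))
    # Drop rows that are completely blank (a common artifact of PDF conversion).
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    if not rows:
        return []
    # Normalize headers to lowercase snake_case so they are easy to use in code.
    header = [
        re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
        for name in rows[0]
    ]
    records = []
    for row in rows[1:]:
        # Assumption: subtotal and total rows are labeled in the first column.
        if row[0].strip().lower().startswith(("total", "subtotal")):
            continue
        records.append(dict(zip(header, (cell.strip() for cell in row))))
    return records


# Hypothetical example of a converted table with a blank row and a subtotal row.
sample = "Plan Name,Enrollment (Jan)\n,\nPlan A,100\n\nSubtotal,100\nPlan B,50\n"
cleaned = tidy_rows(sample)
```

If the source file were published in a tidy layout to begin with, none of these steps would be needed and the file could be read directly.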


Dashboards with dynamic underlying data would undoubtedly require upfront investment but would also save significant time and effort in the long term. In the absence of a robust, public-facing all payer database, i.e., a greatly expanded and updated NYS Health Connector, decentralized dashboards with spreadsheet-friendly reports, or accompanying CSV or Excel files, will facilitate a wider interaction with the State’s health data.


Provide return on reporting investment for organizations

There is much information on provider performance that the State requires to be collected regularly, but these processes are incomplete unless their result is information others can use. Too often, the very organizations that report into State systems cannot actually learn from the data they submit over time. The State may only share it back with them in the aggregate or under corrective circumstances. For example:

  • Federally Qualified Health Centers (FQHCs): FQHCs submit Institutional Cost Reports (ICRs) to the State, but that information is not available publicly on the health.data.ny.gov website, which does include the hospital and nursing home ICRs.

  • School-Based Health Centers (SBHCs): SBHCs must report on a wide range of measures, but the results of that reporting are not published or even shared back with the SBHCs themselves.

  • Health Homes: Health Home care management programs report regular data to the Health Home Care Management Assessment Reporting Tool (HH-CMART), but reporting is not public. Medicaid Analytics Performance Portal (MAPP) dashboards are designed to visualize program oversight and performance management, but these are only available to individuals with specific Health Commerce System (HCS) and MAPP access.


Partner with stakeholders to streamline surveying and reporting efforts

There is a rich network of researchers focused on health systems, coverage, and public health in New York. The State could more intentionally and regularly collaborate with these partners to consolidate efforts around surveying and other data collection processes, as well as reporting. This would help the State take advantage of third-party resources that have already been allocated to these areas and would reduce duplicative efforts.


 

Appendix: Project Methods

By Isaac Michaels


Overview of Analytical Approach

The objective of this project was to describe trends in New York State’s Medicaid program—specifically, enrollment trends for Medicaid Managed Care (MMC) and Medicaid Managed Long-Term Care (MLTC) and weekly census counts for nursing homes—as well as to evaluate spending patterns. Here we detail our recommended, automated approach to data analysis and explain the underlying methods, tools, and workflows that make it reproducible and efficient. We advocate for this approach because it minimizes manual intervention, improves consistency, and allows routine updates with minimal effort. The sections that follow describe the data sources, the tools and workflows used, the methods for data acquisition and preprocessing, the analytical techniques employed, and the limitations and generalizability of our method.


Data Sources

This project used three data sources. First, the Medicaid Program Enrollment dataset was obtained from Health Data NY—a statewide portal offering free, open access to a wide range of health-related data—and it provides monthly enrollment figures for MMC and MLTC. Second, the Nursing Home Weekly Bed Census dataset, also from Health Data NY, supplies weekly facility-level census counts, offering insight into occupancy trends in nursing homes. Third, the Medicaid Global Spending Cap Reports, published by the New York State Department of Health, provide spending data in PDF format. These sources were selected for their public accessibility and their capacity to present a comprehensive picture of both program utilization and spending. It is worth noting that while the Nursing Home Census data are available in a ready-to-use tabular format—which simplifies analysis and manipulation—the spending data require extensive extraction and preprocessing from PDFs, making their analysis more challenging.


Tools, Workflow, and Reproducibility Considerations

We employed several tools in this project. R is an open-source programming language for statistical computing, which we used for data manipulation, analysis, and visualization. RMarkdown, an extension of R, integrates code, narrative text, and data outputs into a single document. The Tidyverse, a collection of open-source R packages, facilitated data cleaning and transformation, and the pdftools package (also open source) was used for extracting data from PDFs. GitHub, a free but proprietary platform, was used for version control and to host the HTML pages that appear on the Step Two Policy Project website. Dropbox, a proprietary cloud storage service, provided secure storage for both raw and processed data. Conducting the analysis in a coding environment automated repetitive tasks, reduced manual errors, and allowed data updates to be performed on a regular schedule.

 

Data Acquisition and Preprocessing

Automated data acquisition was essential to avoid manual downloads and processing. We directly downloaded the Medicaid Program Enrollment and Nursing Home Census datasets from Health Data NY using their public URLs. For the spending data, we employed the pdftools package in R to extract information from the Medicaid Global Spending Cap Reports. A significant challenge in this process was the continuous, subtle change in the report and table formats. To overcome this, our code dynamically identifies the page containing the spending actuals table using keyword searches (e.g., "actual," "variance," "TOTAL").
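The keyword-based page search can be sketched in a few lines. The version below is an illustrative Python analogue, not the project's R code: it assumes the PDF text has already been extracted into one string per page, that matching is case-insensitive, and that the target page must contain all of the keywords. The function name find_actuals_page and the sample page text are hypothetical.

```python
from typing import Optional, Sequence

# Keywords taken from the description above; lowercased for case-insensitive matching.
KEYWORDS = ("actual", "variance", "total")


def find_actuals_page(
    pages: Sequence[str], keywords: Sequence[str] = KEYWORDS
) -> Optional[int]:
    """Return the zero-based index of the first page containing every keyword."""
    for index, text in enumerate(pages):
        lowered = text.lower()
        if all(keyword in lowered for keyword in keywords):
            return index
    return None  # No page matched; the caller should flag the report for review.


# Hypothetical page text standing in for extracted PDF pages.
pages = [
    "Executive summary of the reporting period",
    "Category   Estimated   Actual   Variance\nTOTAL   100   110   10",
]
page_index = find_actuals_page(pages)
```

Returning None rather than guessing a page makes format changes visible: a report whose layout has drifted too far fails loudly instead of yielding the wrong table.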


To streamline these tasks further, we developed custom R functions. For example, our function try_downloading_report() automates the retrieval of PDF reports. It uses the GET function from the httr package with a custom User-Agent header to download each PDF from dynamically constructed URLs. This function checks the HTTP status code to ensure a successful download before saving the report to a Dropbox folder; it also avoids redundant downloads by first checking whether the respective report has already been saved to that folder. In addition, our workflow iterates through fiscal years, quarters, and months to construct the proper URLs and file paths for each report. Once downloaded, the PDF text is extracted, and our custom logic, based on regular expressions and keyword detection, identifies the correct page from which to extract the spending table. The code then determines whether the numeric values are reported in thousands or millions and assigns the appropriate multiplier. Finally, the extracted tables from all reports are combined into a single dataset and saved as a CSV file. Preprocessing also involves feature engineering, such as converting calendar years to fiscal years to align with reporting periods, and standardizing numeric fields. We used the Tidyverse to standardize variable names, merge datasets, and address missing values consistently.
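The download-and-cache behavior described for try_downloading_report() can be illustrated with a short Python analogue. This is a sketch under stated assumptions, not the project's R function: it uses only the Python standard library, the User-Agent string and the return values ("cached", "downloaded", "failed") are hypothetical, and the local destination path stands in for the Dropbox folder.

```python
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable, Tuple, Union

USER_AGENT = "example-research-script/0.1"  # hypothetical value


def http_get(url: str) -> Tuple[int, bytes]:
    """Fetch a URL with a custom User-Agent and return (status code, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def try_downloading_report(
    url: str,
    dest: Union[str, Path],
    fetch: Callable[[str], Tuple[int, bytes]] = http_get,
) -> str:
    """Download a report unless it is already saved; save only on HTTP 200."""
    dest = Path(dest)
    if dest.exists():
        return "cached"  # Skip redundant downloads.
    status, body = fetch(url)
    if status != 200:
        return "failed"  # Nothing is written for unsuccessful responses.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(body)
    return "downloaded"
```

Passing the fetch function as a parameter is a design choice of this sketch: it lets the caching and status-check logic be exercised without touching the network.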


Another critical component of our data processing pipeline involves the integration of Medicaid enrollment and spending data. For example, our code reads the Medicaid Program Enrollment data directly from Health Data NY using the readr package function, read_csv(), and cleans the column names with the janitor package function, clean_names(). It then filters the dataset to retain records corresponding to MMC, creates new date variables by combining the eligibility year and month into proper Date objects, and harmonizes plan types by grouping similar values together. In parallel, the spending data are processed by stripping out currency symbols and commas, and converting the resulting strings into numeric values before applying the appropriate multiplier. The code further filters the spending data to include only year-end reports (identified by "q4" or "Mar" in the report name) and creates a fiscal year variable. Finally, the enrollment data are aggregated—calculating total enrollment months, month-to-month changes, and mean monthly change percentages—and then joined with the corresponding spending metrics. This comprehensive pipeline, which leverages Tidyverse functions such as filter(), mutate(), left_join(), and summarise(), automates the integration of heterogeneous data sources and ensures that every transformation is applied consistently, thereby reducing manual intervention and potential errors.
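The spending-side cleaning steps can be sketched as three small helpers. This is an illustrative Python analogue of the steps described above, with several assumptions: the unit is detected from the words "millions" or "thousands" in the page text, values in parentheses are treated as negatives (a common accounting convention that the description above does not confirm), and the report names in the usage lines are hypothetical.

```python
def detect_multiplier(page_text: str) -> int:
    """Guess the reporting unit from the page text (assumed keyword convention)."""
    lowered = page_text.lower()
    if "millions" in lowered:
        return 1_000_000
    if "thousands" in lowered:
        return 1_000
    return 1


def parse_dollars(raw: str, multiplier: int = 1) -> float:
    """Strip currency symbols and commas, then scale by the multiplier."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    # Assumption: "(12)" denotes a negative amount.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    value = float(cleaned.strip("()")) * multiplier
    return -value if negative else value


def is_year_end_report(report_name: str) -> bool:
    """Year-end reports are identified by "q4" or "Mar" in the report name."""
    return "q4" in report_name or "Mar" in report_name


# Hypothetical inputs for illustration only.
multiplier = detect_multiplier("Dollars in Millions")
amount = parse_dollars("$1,234.5", multiplier)
year_end_names = [
    name
    for name in ["cap_report_q4.pdf", "Mar_report.pdf", "Dec_report.pdf"]
    if is_year_end_report(name)
]
```

Keeping parsing, unit detection, and filtering as separate functions means each can be adjusted on its own when a report's format changes.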

 

Analytical Methods

We focused on quantifying trends in enrollment for MMC and MLTC and on examining census counts for nursing homes, along with associated spending patterns. Time aggregation techniques—methods that combine data from finer time scales (e.g., weeks) into coarser scales (e.g., months or years)—were applied to simplify trend analysis. For instance, total enrollment months were computed by summing values, whereas mean census counts were derived by averaging. Furthermore, we engineered features to capture both absolute changes (the difference between successive periods) and relative changes (percentage differences), enabling us to assess growth rates comprehensively. All computations were performed in R, ensuring that every analytical step is documented and can be replicated as new data become available.
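The aggregation and change calculations can be illustrated with a short Python sketch (the project itself used R). The numbers are made up for illustration. The fiscal year helper assumes New York's April-to-March State fiscal year and labels each fiscal year by the calendar year in which it ends; that labeling convention is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def fiscal_year(d: date) -> int:
    """April-March fiscal year, labeled by its ending calendar year (assumed)."""
    return d.year + 1 if d.month >= 4 else d.year


def monthly_mean(weekly_counts):
    """Average weekly (date, count) observations within each calendar month."""
    buckets = defaultdict(list)
    for day, count in weekly_counts:
        buckets[(day.year, day.month)].append(count)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def period_changes(values):
    """Return (absolute change, percent change) for each successive pair."""
    changes = []
    for previous, current in zip(values, values[1:]):
        absolute = current - previous
        percent = absolute * 100 / previous if previous else None
        changes.append((absolute, percent))
    return changes


# Hypothetical weekly census counts.
weekly = [(date(2024, 1, 1), 100), (date(2024, 1, 8), 110), (date(2024, 2, 5), 120)]
by_month = monthly_mean(weekly)
changes = period_changes([100, 110, 99])
```

Summing is appropriate for flows such as enrollment months, while averaging suits point-in-time counts such as a weekly census, which is why the two measures are aggregated differently.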


Application to MMC and MLTC Analyses

For MMC and MLTC, the analysis focused on evaluating enrollment trends and per capita spending. We explored various enrollment- and per capita spending-growth rate scenarios and developed linear projection models to estimate annual spending over the next five years. To facilitate user interaction, we generated downloadable Excel workbooks that contain raw enrollment and spending figures along with interactive tools. These workbooks allow users to adjust growth rate scenarios and view the resulting projections in both tabular and graphical formats. This interactive approach is designed to support policymakers and stakeholders by providing clear, accessible projections that inform evidence-based decision-making.
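A growth-rate scenario of the kind described above can be sketched as follows. This Python example is illustrative only: it assumes spending is the product of enrollment and per capita spending and that each grows at a constant annual rate that compounds, which may differ from the form of the linear projection models used in the project. All input values and the function name project_spending are hypothetical.

```python
def project_spending(
    base_enrollment: float,
    base_per_capita: float,
    enrollment_growth: float,
    per_capita_growth: float,
    years: int = 5,
) -> list[dict[str, float]]:
    """Project annual spending under constant enrollment and per capita growth."""
    rows = []
    for offset in range(1, years + 1):
        enrollment = base_enrollment * (1 + enrollment_growth) ** offset
        per_capita = base_per_capita * (1 + per_capita_growth) ** offset
        rows.append(
            {
                "year_offset": offset,
                "enrollment": enrollment,
                "per_capita": per_capita,
                "spending": enrollment * per_capita,
            }
        )
    return rows


# Hypothetical scenario: 2% enrollment growth and 3% per capita spending growth.
scenario = project_spending(
    base_enrollment=1_000,
    base_per_capita=10.0,
    enrollment_growth=0.02,
    per_capita_growth=0.03,
)
```

Changing the two growth-rate arguments and rerunning the function mirrors what a user does when adjusting the scenario inputs in a workbook.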


Limitations

The reliance on spending data from PDF reports makes the analysis vulnerable to data processing errors, as the extraction process must continuously adapt to subtle changes in report formatting. Additionally, delays in updating public datasets may mean that the analysis does not always capture the most current trends. Integrating data from heterogeneous sources also presents challenges that require ongoing refinement of the extraction and processing methods.


Generalizability and Benefits of Automation

The high degree of automation implemented in this project is a key strength. Automating tasks such as dataset downloads, preprocessing, and feature engineering not only reduces the time and effort required but also minimizes human error and ensures consistency across analyses. As a result, the analysis can be updated routinely. Furthermore, removing manual steps from the analysis enhances the project’s sustainability by ensuring that the ongoing effort required to keep the analysis up to date and accessible is minimal. Beyond its applicability for analyzing New York State’s Medicaid and nursing home data, this approach is generalizable for analyzing data on other subjects and in other settings. By combining automation with reproducible coding practices and open-source tools, our project demonstrates a robust framework for generating reliable, updatable insights that support informed policy decision-making. We encourage other researchers and policymakers to adopt similar methodologies to enhance transparency, efficiency, and the overall quality of health data analysis.


 
Endnotes:

[2] The site includes the following disclaimer: “There is approximately a three-month lag for eligibility information, meaning that the most complete and current data will always be at least three months old. This allows for the eligibility data to be complete at the time of the release.”

[3] New York Legal Assistance Group. NYS MLTC Data Transparency Project.

[4] It is difficult to glean detailed information on the 20-25% of Medicaid enrollees who are not included in MMC and are covered under fee-for-service arrangements. 

[5] Ideally, such spreadsheets should have no interruptions between rows, including for subtotal information. The raw data should be presented in a compact layout with clear column headers and no blank rows.

[7] The Volume and Estimated Cost of Hospital Services dashboard source is 2015-17 data; the Emergency Department Visits in New York State dashboard source is 2019 data; the Health Plan Quality Ratings dashboard source is 2022 survey data.
