Analytext

Unlock Insights from Financial Texts

A public repository of parsed textual data from SEC filings, including major sections and financial statement notes from 10-K and 10-Q reports, designed to accelerate business research.

Based on the paper: Codesso, M., Hoitash, R., & Hoitash, U. (2025). Textual Financial Data Repository and Python Code for Machine Learning, AI, and Textual Analyses.

A Better Way to Work with Financial Data

We solve the most common roadblocks in textual analysis so you can focus on research, not data wrangling.

Save Time and Effort

Stop reinventing the wheel. Our data is pre-parsed, eliminating the need for you to write and maintain complex extraction scripts.

Ensure Consistency

Using a standardized data source enhances the comparability and replicability of research across different studies and teams.

Comprehensive Data

Access key sections (Business Description, Risk Factors, MD&A) and all financial statement notes from 10-Ks and 10-Qs, covering all firms. Section data starts in 2008; notes data starts in 2009.

Pre-calculated Metrics

Jumpstart your analysis with ready-to-use metrics like readability scores, sentiment, word counts, and more, without any programming.

Novel Word Lists

Utilize new dictionaries for COVID-19 and Human Capital, developed using a unique, event-based methodology for higher relevance.

Open-Source Python Code

Access the full codebase to understand our methodology, verify results, or extend the tools for your own custom research needs.

SEC Filings Sections Data

This dataset provides the raw textual data extracted from the major sections of annual (10-K) and quarterly (10-Q) SEC filings from 2008 onwards. We also provide pre-calculated metrics for each section.

Business Description (Item 1)

Communicates the nature of the business, its products, services, markets, and recent events. Since the SEC's 2020 amendments to Regulation S-K, this item also includes human capital disclosures.

Risk Factors (Item 1A)

Informs investors about material risks that can affect future performance, including financial, operational, and market-related risks.

MD&A (Item 7 in 10-K / Item 2 in 10-Q)

Management's Discussion and Analysis of financial condition and results of operations. A key source for understanding management's perspective.
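
As a rough illustration of the kind of pre-calculated metrics mentioned above, the sketch below computes a word count, a sentence count, and a simple dictionary-based tone score for one section's text. It is not the repository's code: the function names, the tiny word lists, and the sample sentence are all hypothetical, and published dictionaries are far larger.

```python
import re

# Hypothetical mini word lists, for illustration only.
# Real studies use published dictionaries with many more entries.
NEGATIVE = {"loss", "decline", "adverse", "impairment"}
POSITIVE = {"growth", "improve", "gain", "strong"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def section_metrics(text):
    """Return basic length metrics and a net-tone score for one section."""
    words = tokenize(text)
    # Split on runs of sentence-ending punctuation; drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    positive_hits = sum(w in POSITIVE for w in words)
    negative_hits = sum(w in NEGATIVE for w in words)
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Net tone: (positive - negative) hits scaled by total words.
        "net_tone": (positive_hits - negative_hits) / n_words if n_words else 0.0,
    }


# Made-up sample text, not taken from any filing.
sample = "Revenue growth was strong. We recorded an impairment loss."
metrics = section_metrics(sample)
print(metrics)
```

Scaling the tone score by total words keeps long and short sections comparable, which is one common design choice; other normalizations are equally plausible.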

Financial Statement Notes Data

This dataset contains the text from financial statement notes, extracted using XBRL data from 2009 onwards. Footnotes provide essential, detailed explanations for the numbers in primary financial statements. We map notes to 32 distinct accounting topics.

Key Features

  • Granular Focus: Unlike broad sections, notes focus on specific topics (e.g., Tax, Fair Value, Leases), improving identification strategies for research.
  • Topic Classification: We categorize notes into 32 standardized accounting topics based on FASB Codification, making cross-firm comparison easier.
  • Comprehensive Coverage: The dataset includes notes from both 10-K and 10-Q filings, with pre-calculated metrics available for each note.
  • Rich Data: Includes 85 distinct XBRL tags, allowing researchers to aggregate or disaggregate topics as needed.
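
The idea of aggregating tagged notes into broader topics can be sketched as follows. This is a minimal example under stated assumptions: the tag names follow the US-GAAP text block naming style, but the tag-to-topic mapping, the `aggregate_by_topic` helper, and the sample notes are hypothetical and do not reproduce the repository's 32-topic classification.

```python
from collections import defaultdict

# Hypothetical mapping from XBRL note tags to topics (an assumption,
# not the classification used in the dataset).
TAG_TO_TOPIC = {
    "IncomeTaxDisclosureTextBlock": "Income Tax",
    "DebtDisclosureTextBlock": "Debt",
    "LongTermDebtTextBlock": "Debt",
    "CommitmentsAndContingenciesDisclosureTextBlock": "Commitments & Contingencies",
}


def aggregate_by_topic(notes):
    """Group note texts by topic; `notes` is a list of (tag, text) pairs."""
    grouped = defaultdict(list)
    for tag, text in notes:
        # Tags missing from the mapping fall into a catch-all bucket.
        topic = TAG_TO_TOPIC.get(tag, "Unclassified")
        grouped[topic].append(text)
    return dict(grouped)


# Made-up notes, for illustration only.
notes = [
    ("DebtDisclosureTextBlock", "We issued senior notes."),
    ("LongTermDebtTextBlock", "Long-term debt matures in 2030."),
    ("IncomeTaxDisclosureTextBlock", "The effective tax rate was stable."),
    ("SomeOtherTextBlock", "Other."),
]

by_topic = aggregate_by_topic(notes)

# Aggregate a simple metric (whitespace-delimited word count) per topic.
word_counts = {
    topic: sum(len(text.split()) for text in texts)
    for topic, texts in by_topic.items()
}
print(word_counts)
```

Working at the tag level and mapping upward means the same data can be analyzed per tag (disaggregated) or per topic (aggregated) by swapping the mapping.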

Most Prevalent Footnote Topics

  • Compensation
  • Commitments & Contingencies
  • Income Tax
  • Debt
  • Equity

Download Data & Resources

Citing the Data

If you decide to use the data, please consider citing one or more of the following papers. These papers develop, validate, provide code, and/or use the data:

Accounting Complexity Data

For researchers interested in firm complexity, the XBRL Research website provides a valuable, complementary dataset. It offers extensively validated measures of firm complexity based on Accounting Reporting Complexity (ARC).

Available Resources

  • Firm Complexity Measures: Access to the Accounting Reporting Complexity (ARC) data, which serves as a robust proxy for overall firm complexity.
  • XBRL Data Repository: The site hosts a repository of processed eXtensible Business Reporting Language (XBRL) data, which enables efficient analysis of structured financial information.
  • SAS Code: Download the SAS code used to generate the complexity measures for your own custom analysis.
  • Financial Notes Data: The XBRL data also allows for the accurate extraction and classification of financial statement notes.

About the Authors

This project is made possible by the dedicated work of a team of researchers passionate about advancing the field of financial analysis through accessible, high-quality data.

Mauricio Codesso

Northeastern University

Mauricio Codesso's research focuses on the intersection of data science and finance. He specializes in applying machine learning models and Python-based tools to analyze complex financial texts and build novel datasets for academic use. Access his publications at: Google Scholar | SSRN

Rani Hoitash

John E. Rhodes Professor of Accounting, Bentley University

Rani Hoitash is the John E. Rhodes Professor of Accountancy at Bentley University. His expertise spans financial reporting, corporate governance, auditing, textual analysis, Large Language Models (LLMs), and AI. Professor Hoitash's research focuses on human capital in corporate governance and auditing and includes developing measures for firm complexity and benchmarking. His work is published in top journals such as The Accounting Review, Journal of Accounting Research, and Journal of Accounting and Economics. He is a former Editor of Auditing: A Journal of Practice & Theory and a Certified Information Systems Auditor (CISA). He holds a PhD from Rutgers University. Access his publications at: Bentley Profile | Google Scholar | SSRN

Udi Hoitash

Lilian L. and Harry A. Cowan Endowed Professor of Accounting, D'Amore-McKim School of Business, Northeastern University

Udi Hoitash is the Lilian L. and Harry A. Cowan Endowed Professor of Accounting at the D'Amore-McKim School of Business, Northeastern University. His main research areas include auditing, disclosure quality, XBRL, natural language processing, Large Language Models (LLMs), corporate governance, and human capital. Professor Hoitash served two terms as an Editor for Auditing: A Journal of Practice & Theory. His work is highly cited, and he was listed in Stanford University’s top 2% scientist database. According to SSRN, he consistently ranks among the top 50 accounting professors worldwide based on downloads. Access his publications at: Northeastern Profile | Google Scholar | SSRN

Get in Touch

Have questions or feedback? We'd love to hear from you. Reach out to the authors directly or use the contact form for general inquiries.

Direct Contact

Mauricio Codesso

Northeastern University

m.codesso@northeastern.edu

Rani Hoitash

Bentley University

rhoitash@bentley.edu

Udi Hoitash

Northeastern University

u.hoitash@northeastern.edu