Leopere c04e56ab8f
feat: start.sh bootstrap; HTML upload fallback; executive summary in report
2025-10-22 19:49:08 -04:00

Pitch Deck Market Cap Validator

A Python-based application that extracts market cap claims from pitch deck PDFs and validates them with a RAG (Retrieval-Augmented Generation) pipeline, so that inaccurate financial claims can be debunked quickly. Validation is designed around specialized financial APIs; the current implementation still relies on generic web search.

Technical Overview

This tool processes PDF pitch decks through a multi-stage pipeline focused on financial claim validation. The system extracts market cap claims, validates them against financial data sources (real-time financial APIs are the target; generic web search is what runs today), and generates debunking reports.

Architecture

PDF Input → Slide Extraction → Claim Detection → Financial API Validation → Debunking Report
    ↓              ↓              ↓                    ↓                      ↓
PyMuPDF      Image Files    Pattern Matching    Financial APIs         Markdown Report

Core Mission

Fast market cap validation and claim debunking using proper financial APIs that track market intelligence accurately, not generic web search.

Core Components

1. Main Application (app.py)

  • Entry point for the pitch deck analysis pipeline
  • Orchestrates slide extraction and market cap validation workflow
  • Generates comprehensive debunking reports with a table of contents
  • Handles file validation and error management

2. PDF Processing (modules/pdf_processor.py)

  • PyMuPDF integration for high-quality PDF to image conversion
  • Extracts individual slides as PNG images (2x zoom for clarity)
  • Creates organized directory structure: processed/{document_name}/slides/
  • Handles page numbering and file naming conventions
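The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of modules/pdf_processor.py: the function names and the slide_NNN.png naming scheme are assumptions, while the PyMuPDF calls (fitz.open, fitz.Matrix, Page.get_pixmap, Pixmap.save) are the library's standard rendering API.

```python
from pathlib import Path
from typing import List


def slide_path(output_root: str, document_name: str, page_number: int) -> Path:
    """Build processed/{document_name}/slides/slide_NNN.png (file naming is assumed)."""
    return Path(output_root) / document_name / "slides" / f"slide_{page_number:03d}.png"


def extract_slides(pdf_path: str, output_root: str = "processed", zoom: float = 2.0) -> List[Path]:
    """Render every page of the PDF to a PNG at the given zoom factor."""
    import fitz  # PyMuPDF; imported lazily so slide_path works without it installed

    document_name = Path(pdf_path).stem
    written: List[Path] = []
    with fitz.open(pdf_path) as doc:
        matrix = fitz.Matrix(zoom, zoom)  # scale both axes, 2x by default
        for page_number, page in enumerate(doc, start=1):
            target = slide_path(output_root, document_name, page_number)
            target.parent.mkdir(parents=True, exist_ok=True)
            page.get_pixmap(matrix=matrix).save(str(target))
            written.append(target)
    return written
```

Calling extract_slides("presentation.pdf") would write one PNG per page under processed/presentation/slides/.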

3. Market Cap Validation Engine (modules/market_cap_validator.py)

  • Main interface for market cap claim validation
  • Coordinates between claim extraction and validation processes
  • Generates comprehensive validation reports
  • Handles multiple input formats (files, processed folders, direct data)

4. RAG Agent (modules/rag_agent.py)

  • Pattern-based claim extraction using regex patterns for market cap detection
  • Financial API integration for real-time market data validation (planned; validation currently uses generic web search)
  • Confidence scoring based on context and claim specificity
  • Discrepancy analysis between claimed and actual market caps
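As an illustration of the discrepancy analysis step, the sketch below compares a claimed market cap with a reference value and labels the result. The function name, the 10% default tolerance, and the verdict labels are assumptions chosen for the example, not values taken from modules/rag_agent.py.

```python
def analyze_discrepancy(claimed: float, actual: float, tolerance: float = 0.10) -> dict:
    """Compare a claimed market cap with a reference value (both in USD)."""
    if actual <= 0:
        raise ValueError("actual market cap must be positive")
    # Signed relative error: positive means the deck overstates the value.
    relative_error = (claimed - actual) / actual
    if abs(relative_error) <= tolerance:
        verdict = "accurate"
    elif relative_error > 0:
        verdict = "overstated"
    else:
        verdict = "understated"
    return {
        "claimed": claimed,
        "actual": actual,
        "relative_error": relative_error,
        "verdict": verdict,
    }
```

A signed relative error keeps the direction of the mismatch, which is usually what a debunking report needs to state.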

5. Document Validator (modules/document_validator.py)

  • Batch processing for multiple documents
  • Organized reporting with document-specific validation results
  • Error handling for invalid or corrupted slide data

6. Validation Report Generator (modules/validation_report.py)

  • Comprehensive reporting with executive summaries
  • Slide source tracking for claim attribution
  • RAG search details for transparency and verification
  • Recommendations for improving claim accuracy

Technical Stack

Dependencies

  • PyMuPDF: PDF processing and image extraction
  • OpenAI (Python SDK): AI model integration, with requests routed through OpenRouter
  • requests: HTTP API communications for financial data
  • python-dotenv: Environment variable management
  • docling: Advanced document processing capabilities

Financial Data Sources (Planned)

  • Yahoo Finance API: Real-time market cap data
  • Alpha Vantage: Historical and current market data
  • Financial Modeling Prep: Comprehensive financial metrics
  • IEX Cloud: Real-time stock data and market intelligence (the service was retired in August 2024, so a replacement provider is needed)
  • Quandl (now Nasdaq Data Link): Financial and economic data

Environment Configuration

  • OpenRouter API Key: Required for AI model access
  • Financial API Keys: Multiple providers for redundancy and accuracy
  • Rate Limiting: Configurable API call limits and retry logic
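The retry logic mentioned above could take the form of exponential backoff. The helper below is a hypothetical sketch (call_with_retry and its parameters are not part of the project); the sleep function is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In practice the except clause would likely be narrowed to the HTTP and timeout errors raised by the API client in use.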

Current Limitations & Improvements Needed

RAG System Issues

  • Generic Web Search: Currently uses basic web search instead of specialized financial APIs
  • Accuracy Problems: Web search results are inconsistent and often outdated
  • No Real-time Data: Cannot access current market cap information
  • Limited Financial Context: Lacks understanding of market dynamics and valuation metrics

Required API Integrations

  1. Real-time Market Data APIs:

    • Yahoo Finance API for current market caps
    • Alpha Vantage for historical data and trends
    • Financial Modeling Prep for comprehensive metrics
  2. Enhanced Validation Logic:

    • Time-based validation (check if claim was accurate at time of presentation)
    • Market cap calculation verification (shares outstanding × price)
    • Industry benchmarking and comparison
  3. Improved Pattern Recognition:

    • Better company name extraction from slides
    • Context-aware claim detection
    • Support for different valuation metrics (enterprise value, etc.)
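Two of the validation ideas above, recomputing market cap as shares outstanding × price and checking the claim against the presentation date rather than today, can be sketched together. All names here are hypothetical, and the price history is passed in as a plain dictionary because no financial API is integrated yet.

```python
from datetime import date
from typing import Dict


def implied_market_cap(shares_outstanding: float, share_price: float) -> float:
    """Market cap = shares outstanding x price per share."""
    return shares_outstanding * share_price


def price_on_or_before(history: Dict[date, float], as_of: date) -> float:
    """Return the most recent closing price at or before the given date."""
    eligible = [day for day in history if day <= as_of]
    if not eligible:
        raise LookupError(f"no price data on or before {as_of.isoformat()}")
    return history[max(eligible)]


def claim_was_accurate(
    claimed: float,
    shares_outstanding: float,
    history: Dict[date, float],
    presented_on: date,
    tolerance: float = 0.10,  # assumed threshold, not taken from the project
) -> bool:
    """Check a claim against the market cap implied at presentation time."""
    price = price_on_or_before(history, presented_on)
    reference = implied_market_cap(shares_outstanding, price)
    return abs(claimed - reference) <= tolerance * reference
```

Falling back to the latest price on or before the presentation date handles weekends and holidays, when no closing price exists for the exact day.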

Usage

Quick Start

# Make start script executable
chmod +x start.sh

# Run market cap validation on a PDF file
./start.sh presentation.pdf

Manual Execution

# Create the virtual environment (first run only), then activate it
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run validation
python3 app.py presentation.pdf

Market Cap Validation Only

# Validate market caps from processed folder
python3 modules/validate_market_caps.py --all

# Validate specific document
python3 modules/validate_market_caps.py --file slides.json --document "Company-Pitch"

Output Structure

The tool generates:

  1. Processed Images: Individual slide images in processed/{document_name}/slides/
  2. Validation Report: Comprehensive debunking report with:
    • Executive summary of claim accuracy
    • Detailed validation results for each claim
    • Source attribution and confidence scores
    • Discrepancy analysis and explanations
    • Recommendations for improving accuracy
  3. Shareable Link: Automatic upload to Hastebin for easy sharing

Technical Features

Market Cap Claim Detection

  • Pattern Recognition: Multiple regex patterns for market cap identification
  • Context Analysis: Confidence scoring based on surrounding text
  • Company Name Extraction: Automatic identification of company names
  • Value Normalization: Standardized handling of different value formats (B, M, K)
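A minimal sketch of pattern-based detection with value normalization and a context-based confidence score might look like this. The single regex, the suffix table, and the scoring weights are illustrative assumptions; the patterns actually used in modules/rag_agent.py are not shown in this README, and company name extraction is left out.

```python
import re
from typing import Dict, List

# Suffix multipliers (B, M, K as listed above, plus a few spelled-out forms).
_MULTIPLIERS = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "mn": 1e6, "million": 1e6,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
    "t": 1e12, "trillion": 1e12,
}
# Longest alternatives first so "billion" wins over "b".
_UNITS = "|".join(sorted(_MULTIPLIERS, key=len, reverse=True))

# One illustrative pattern: "market cap(italization) [of|is|at|:] $1.5B"
_CLAIM_RE = re.compile(
    r"market\s+cap(?:italization)?\s*(?:of|is|at|:)?\s*"
    r"\$\s*(?P<number>\d+(?:,\d{3})*(?:\.\d+)?)(?![.,]?\d)"
    rf"\s*(?P<unit>{_UNITS})?(?!\w)",
    re.IGNORECASE,
)


def normalize_value(number: str, unit: str = "") -> float:
    """Convert strings such as ("2,500", "M") into a plain USD amount."""
    return float(number.replace(",", "")) * _MULTIPLIERS.get(unit.lower(), 1.0)


def score_confidence(context: str, has_unit: bool) -> float:
    """Heuristic score: explicit units and dating/valuation cues raise it."""
    score = 0.5
    if has_unit:
        score += 0.2
    if re.search(r"\b(?:as of|valued|valuation|nasdaq|nyse)\b", context, re.IGNORECASE):
        score += 0.2
    return round(min(score, 1.0), 2)


def extract_claims(text: str) -> List[Dict]:
    """Find market cap claims in slide text and normalize their values."""
    claims = []
    for match in _CLAIM_RE.finditer(text):
        start, end = match.span()
        context = text[max(0, start - 60): end + 60]  # surrounding text window
        unit = match.group("unit") or ""
        claims.append({
            "raw": match.group(0),
            "value_usd": normalize_value(match.group("number"), unit),
            "confidence": score_confidence(context, bool(unit)),
        })
    return claims
```

A production version would need several patterns (for example "valued at $X" or "$X market cap") and a way to tie each claim to a company name.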

Financial Validation (Planned)

  • Real-time API Integration: Direct access to current market data
  • Historical Validation: Check if claims were accurate at presentation time
  • Market Context: Industry comparisons and benchmarking
  • Multiple Data Sources: Redundancy for accuracy verification
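Since the financial integrations are still planned, the cross-referencing idea can only be sketched against stand-in providers. In the example below each provider is a hypothetical callable that takes a ticker and returns a market cap (or None); the median is used as the consensus value and the 5% spread threshold is an assumption.

```python
from statistics import median
from typing import Callable, Dict, Optional

Provider = Callable[[str], Optional[float]]


def cross_check(ticker: str, providers: Dict[str, Provider], max_spread: float = 0.05) -> dict:
    """Query several providers and report whether their values agree."""
    quotes: Dict[str, float] = {}
    for name, fetch in providers.items():
        try:
            value = fetch(ticker)
        except Exception:
            continue  # one failing provider should not abort the whole check
        if value is not None and value > 0:
            quotes[name] = value
    if not quotes:
        return {"ticker": ticker, "consensus": None, "agree": False, "quotes": {}}
    consensus = median(quotes.values())
    # Spread: distance between the extremes, relative to the consensus value.
    spread = (max(quotes.values()) - min(quotes.values())) / consensus
    return {"ticker": ticker, "consensus": consensus, "agree": spread <= max_spread, "quotes": quotes}
```

Using the median rather than the mean keeps a single stale or mis-scaled source from dragging the reference value.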

Report Generation

  • Executive Summary: High-level accuracy metrics and key findings
  • Detailed Analysis: Slide-by-slide validation results
  • Source Transparency: Clear attribution of validation sources
  • Actionable Insights: Specific recommendations for improvement
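To make the report layout concrete, here is a small sketch that renders an executive summary and a per-slide list as Markdown. The dictionary keys (slide, company, claimed, actual, verdict, source) and the headings are assumptions for illustration and may differ from what modules/validation_report.py produces.

```python
from typing import Dict, List


def render_report(document_name: str, results: List[Dict]) -> str:
    """Render validation results as a short Markdown report."""
    total = len(results)
    accurate = sum(1 for r in results if r["verdict"] == "accurate")
    rate = (accurate / total * 100) if total else 0.0
    lines = [
        f"# Market Cap Validation Report: {document_name}",
        "",
        "## Executive Summary",
        "",
        f"- Claims checked: {total}",
        f"- Accurate claims: {accurate} ({rate:.0f}%)",
        "",
        "## Detailed Analysis",
        "",
    ]
    for r in results:
        # One line per claim, keeping slide number and source for attribution.
        lines.append(
            f"- Slide {r['slide']}: {r['company']} claimed ${r['claimed']:,.0f}, "
            f"reference ${r['actual']:,.0f} ({r['verdict']}; source: {r['source']})"
        )
    return "\n".join(lines)
```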

Error Handling

  • API Rate Limiting: Intelligent handling of API call limits
  • Data Validation: Verification of extracted financial data
  • Graceful Degradation: Continues processing even if individual validations fail
  • Comprehensive Logging: Detailed error tracking and debugging
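Graceful degradation combined with logging can be as simple as isolating each validation in its own try block. The sketch below assumes a hypothetical validate callable and logger name; it records failures instead of stopping the run.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List

logger = logging.getLogger("market_cap_validator")  # logger name is an assumption


def validate_all(claims: Iterable[Any], validate: Callable[[Any], Any]) -> List[Dict]:
    """Validate every claim, recording failures without aborting the batch."""
    results: List[Dict] = []
    for claim in claims:
        try:
            results.append({"claim": claim, "status": "ok", "result": validate(claim)})
        except Exception as exc:
            # Log the full traceback, then continue with the next claim.
            logger.exception("validation failed for claim %r", claim)
            results.append({"claim": claim, "status": "error", "error": str(exc)})
    return results
```

Keeping the failed entries in the result list lets the report show which claims could not be checked rather than silently dropping them.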

Development Setup

Prerequisites

  • Python 3.7+
  • Virtual environment support
  • OpenRouter API account
  • Financial API accounts (Yahoo Finance, Alpha Vantage, etc.)

Installation

  1. Clone the repository
  2. Create virtual environment: python3 -m venv venv
  3. Activate environment: source venv/bin/activate
  4. Install dependencies: pip install -r requirements.txt
  5. Configure .env file with API keys

Configuration

  • Copy example.env to .env
  • Add OpenRouter API key
  • Add financial API keys:
    YAHOO_FINANCE_API_KEY=your_key_here
    ALPHA_VANTAGE_API_KEY=your_key_here
    FINANCIAL_MODELING_PREP_API_KEY=your_key_here
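A possible way to load these settings at startup is sketched below. The three financial key names come from the list above; OPENROUTER_API_KEY is an assumed variable name, and load_config is a hypothetical helper rather than code from the repository. python-dotenv is used when available and skipped otherwise.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_KEYS = ("OPENROUTER_API_KEY",)  # variable name is an assumption
OPTIONAL_KEYS = (
    "YAHOO_FINANCE_API_KEY",
    "ALPHA_VANTAGE_API_KEY",
    "FINANCIAL_MODELING_PREP_API_KEY",
)


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect API keys, failing early if a required one is missing."""
    if env is None:
        try:
            from dotenv import load_dotenv  # python-dotenv, listed under Dependencies
            load_dotenv()  # reads .env from the working directory if present
        except ImportError:
            pass  # fall back to variables already set in the environment
        env = os.environ
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS + OPTIONAL_KEYS if env.get(key)}
```

Treating the financial keys as optional matches the current state of the project, where those integrations are still planned.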
    

Planned Improvements

Phase 1: Financial API Integration

  • Implement Yahoo Finance API for real-time market cap data
  • Add Alpha Vantage for historical data and trends
  • Create API rate limiting and error handling

Phase 2: Enhanced Validation Logic

  • Time-based validation (check accuracy at presentation date)
  • Market cap calculation verification
  • Industry benchmarking and comparison

Phase 3: Advanced Features

  • Support for different valuation metrics (enterprise value, etc.)
  • Automated fact-checking for other financial claims
  • Integration with SEC filings for public companies
  • Machine learning for improved claim detection

Technical Considerations

Performance

  • API Optimization: Efficient use of financial API calls
  • Caching Strategy: Store validation results to avoid redundant API calls
  • Batch Processing: Process multiple claims efficiently
  • Rate Limiting: Respect API limits while maintaining speed
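The caching and rate limiting points can be combined in one small wrapper around whatever function eventually fetches market data. CachedFetcher below is a hypothetical sketch: the TTL and minimum interval defaults are arbitrary, and the clock and sleep functions are injectable so the behaviour can be tested without real delays.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachedFetcher:
    """Wrap a fetch function with a TTL cache and a minimum call interval."""

    def __init__(
        self,
        fetch: Callable[[str], float],
        ttl_seconds: float = 300.0,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, Tuple[float, float]] = {}  # ticker -> (stored_at, value)
        self._last_call: Optional[float] = None

    def get(self, ticker: str) -> float:
        now = self._clock()
        cached = self._cache.get(ticker)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: no API call needed
        if self._last_call is not None:
            wait = self._min_interval - (now - self._last_call)
            if wait > 0:
                self._sleep(wait)  # space out consecutive API calls
        value = self._fetch(ticker)
        self._last_call = self._clock()
        self._cache[ticker] = (self._last_call, value)
        return value
```

An in-memory cache is enough for a single run; persisting results under processed/ would be one way to reuse them across runs.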

Accuracy

  • Multiple Data Sources: Cross-reference validation results
  • Time Context: Consider when claims were made vs. current data
  • Market Dynamics: Account for market volatility and timing
  • Data Quality: Validate API responses for accuracy

Security

  • API Key Management: Secure storage and rotation of API keys
  • Data Privacy: Handle sensitive financial information appropriately
  • Rate Limiting: Prevent API abuse and excessive costs
  • Error Handling: Graceful handling of API failures

File Structure

boxone-technical/
├── app.py                      # Main application entry point
├── start.sh                    # Development startup script
├── requirements.txt            # Python dependencies
├── .env                        # Environment configuration
├── example.env                 # Environment template
├── modules/                    # Core application modules
│   ├── market_cap_validator.py # Main market cap validation interface
│   ├── rag_agent.py            # RAG agent for claim extraction and validation
│   ├── document_validator.py   # Document-level validation processing
│   ├── validation_report.py    # Report generation utilities
│   ├── pdf_processor.py        # PDF extraction and processing
│   ├── client.py               # OpenRouter API client
│   └── ...                     # Additional utility modules
├── processed/                  # Output directory for validation results
└── venv/                       # Python virtual environment

Current Status

⚠️ Important: The current RAG system uses generic web search, which is insufficient for accurate financial validation. The system needs integration with proper financial APIs before it can provide reliable market cap validation and claim debunking.

This tool is designed to be a comprehensive solution for fast, accurate financial claim validation using real-time market data and specialized financial intelligence APIs.