Pitch Deck Market Cap Validator
A Python-based application that automatically extracts and validates market cap claims from pitch deck PDFs using specialized financial APIs and RAG (Retrieval-Augmented Generation) systems to quickly debunk inaccurate financial claims.
Technical Overview
This tool processes PDF pitch decks through a multi-stage pipeline focused on financial claim validation. The system extracts market cap claims, checks them against external data (currently generic web search; real-time financial APIs are planned), and generates a debunking report.
Architecture
PDF Input → Slide Extraction → Claim Detection → Financial API Validation → Debunking Report
    ↓              ↓                  ↓                     ↓                      ↓
 PyMuPDF      Image Files     Pattern Matching       Financial APIs         Markdown Report
Core Mission
Fast market cap validation and claim debunking using proper financial APIs that track market intelligence accurately, not generic web search.
Core Components
1. Main Application (app.py)
- Entry point for the pitch deck analysis pipeline
- Orchestrates slide extraction and market cap validation workflow
- Generates comprehensive debunking reports with Table of Contents
- Handles file validation and error management
2. PDF Processing (modules/pdf_processor.py)
- PyMuPDF integration for high-quality PDF to image conversion
- Extracts individual slides as PNG images (2x zoom for clarity)
- Creates organized directory structure: processed/{document_name}/slides/
- Handles page numbering and file naming conventions
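The slide extraction step can be sketched roughly as follows. This is a minimal illustration, not the code in modules/pdf_processor.py: the function names and the slide_001.png naming scheme are assumptions, while the PyMuPDF calls (fitz.open, get_pixmap, fitz.Matrix, Pixmap.save) are the library's standard rendering API.

```python
from pathlib import Path
from typing import List


def slide_path(output_root: str, document_name: str, page_number: int) -> Path:
    """Build the PNG path for a 1-based page number (naming scheme is assumed)."""
    return Path(output_root) / document_name / "slides" / f"slide_{page_number:03d}.png"


def extract_slides(pdf_path: str, output_root: str = "processed", zoom: float = 2.0) -> List[Path]:
    """Render every page of a PDF to a PNG at the given zoom factor."""
    import fitz  # PyMuPDF; imported lazily so slide_path works without it

    document_name = Path(pdf_path).stem
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # A 2x matrix doubles the resolution in both directions.
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = slide_path(output_root, document_name, page_number)
            target.parent.mkdir(parents=True, exist_ok=True)
            pixmap.save(str(target))
            written.append(target)
    return written
```

Calling extract_slides("presentation.pdf") would write one PNG per page under processed/presentation/slides/.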
3. Market Cap Validation Engine (modules/market_cap_validator.py)
- Main interface for market cap claim validation
- Coordinates between claim extraction and validation processes
- Generates comprehensive validation reports
- Handles multiple input formats (files, processed folders, direct data)
4. RAG Agent (modules/rag_agent.py)
- Pattern-based claim extraction using regex patterns for market cap detection
- Financial API integration for real-time market data validation
- Confidence scoring based on context and claim specificity
- Discrepancy analysis between claimed and actual market caps
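As an illustration of the discrepancy analysis, the sketch below compares a claimed value with a reference value and labels the result. The function name, the verdict labels, and the 10% tolerance are assumptions chosen for the example, not values taken from modules/rag_agent.py.

```python
def discrepancy(claimed_usd: float, actual_usd: float, tolerance: float = 0.10) -> dict:
    """Compare a claimed market cap with a reference value.

    relative_error is signed: positive means the claim is higher than the reference.
    """
    if actual_usd <= 0:
        raise ValueError("actual_usd must be positive")
    relative_error = (claimed_usd - actual_usd) / actual_usd
    if abs(relative_error) <= tolerance:
        verdict = "accurate"
    elif relative_error > 0:
        verdict = "overstated"
    else:
        verdict = "understated"
    return {"relative_error": relative_error, "verdict": verdict}
```

A claim of $5B against a reference of $4B yields a relative error of 0.25 and the verdict "overstated".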
5. Document Validator (modules/document_validator.py)
- Batch processing for multiple documents
- Organized reporting with document-specific validation results
- Error handling for invalid or corrupted slide data
6. Validation Report Generator (modules/validation_report.py)
- Comprehensive reporting with executive summaries
- Slide source tracking for claim attribution
- RAG search details for transparency and verification
- Recommendations for improving claim accuracy
Technical Stack
Dependencies
- PyMuPDF: PDF processing and image extraction
- OpenAI: AI model integration via OpenRouter
- requests: HTTP API communications for financial data
- python-dotenv: Environment variable management
- docling: Advanced document processing capabilities
Financial Data Sources (Planned)
- Yahoo Finance API: Real-time market cap data
- Alpha Vantage: Historical and current market data
- Financial Modeling Prep: Comprehensive financial metrics
- IEX Cloud: Real-time stock data and market intelligence
- Quandl (now Nasdaq Data Link): Financial and economic data
Environment Configuration
- OpenRouter API Key: Required for AI model access
- Financial API Keys: Multiple providers for redundancy and accuracy
- Rate Limiting: Configurable API call limits and retry logic
Current Limitations & Improvements Needed
RAG System Issues
- Generic Web Search: Currently uses basic web search instead of specialized financial APIs
- Accuracy Problems: Web search results are inconsistent and often outdated
- No Real-time Data: Cannot access current market cap information
- Limited Financial Context: Lacks understanding of market dynamics and valuation metrics
Required API Integrations
Real-time Market Data APIs:
- Yahoo Finance API for current market caps
- Alpha Vantage for historical data and trends
- Financial Modeling Prep for comprehensive metrics
Enhanced Validation Logic:
- Time-based validation (check if claim was accurate at time of presentation)
- Market cap calculation verification (shares outstanding × price)
- Industry benchmarking and comparison
Improved Pattern Recognition:
- Better company name extraction from slides
- Context-aware claim detection
- Support for different valuation metrics (enterprise value, etc.)
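The time-based validation and the shares outstanding × price check could be combined along these lines. This is a sketch under stated assumptions: the price history is a date-sorted list of (date, closing price) pairs already fetched from some data provider, the function names are hypothetical, and the share count is treated as constant even though it changes over time in practice.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple


def price_as_of(history: List[Tuple[date, float]], as_of: date) -> Optional[float]:
    """Return the latest closing price on or before as_of.

    history must be sorted by date in ascending order.
    """
    dates = [day for day, _ in history]
    position = bisect_right(dates, as_of)
    if position == 0:
        return None  # no data on or before the requested date
    return history[position - 1][1]


def market_cap_as_of(history: List[Tuple[date, float]],
                     shares_outstanding: float,
                     as_of: date) -> Optional[float]:
    """Market cap = shares outstanding x share price at the presentation date."""
    price = price_as_of(history, as_of)
    if price is None:
        return None
    return shares_outstanding * price
```

Using the last close on or before the presentation date handles weekends and holidays, when no price is recorded for the exact day.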
Usage
Quick Start
# Make start script executable
chmod +x start.sh
# Run market cap validation on a PDF file
./start.sh presentation.pdf
Manual Execution
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run validation
python3 app.py presentation.pdf
Market Cap Validation Only
# Validate market caps from processed folder
python3 modules/validate_market_caps.py --all
# Validate specific document
python3 modules/validate_market_caps.py --file slides.json --document "Company-Pitch"
Output Structure
The tool generates:
- Processed Images: Individual slide images in processed/{document_name}/slides/
- Validation Report: Comprehensive debunking report with:
  - Executive summary of claim accuracy
  - Detailed validation results for each claim
  - Source attribution and confidence scores
  - Discrepancy analysis and explanations
  - Recommendations for improving accuracy
- Shareable Link: Automatic upload to Hastebin for easy sharing
Technical Features
Market Cap Claim Detection
- Pattern Recognition: Multiple regex patterns for market cap identification
- Context Analysis: Confidence scoring based on surrounding text
- Company Name Extraction: Automatic identification of company names
- Value Normalization: Standardized handling of different value formats (B, M, K)
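Pattern recognition and value normalization might look like the sketch below. The regular expression and the unit table are illustrative assumptions covering phrasings such as "market cap of $4.2B" and "market capitalization: $850 million"; the actual patterns in the project may differ.

```python
import re
from typing import Dict, List

# Hypothetical pattern: "market cap" or "market capitalization", an optional
# connector, then a dollar amount with an optional magnitude suffix.
MARKET_CAP_PATTERN = re.compile(
    r"market\s+cap(?:italization)?\s*(?:of|is|at|:)?\s*"
    r"\$\s*(?P<number>\d+(?:,\d{3})*(?:\.\d+)?)\s*"
    r"(?P<unit>trillion|billion|million|thousand|bn|[TBMK])?"
    r"(?!\.?\w)",  # refuse to stop in the middle of a number or word
    re.IGNORECASE,
)

UNIT_MULTIPLIERS: Dict[str, float] = {
    "t": 1e12, "trillion": 1e12,
    "b": 1e9, "bn": 1e9, "billion": 1e9,
    "m": 1e6, "million": 1e6,
    "k": 1e3, "thousand": 1e3,
}


def extract_market_cap_claims(text: str) -> List[dict]:
    """Find market cap claims in slide text and normalize them to US dollars."""
    claims = []
    for match in MARKET_CAP_PATTERN.finditer(text):
        number = float(match.group("number").replace(",", ""))
        unit = (match.group("unit") or "").lower()
        claims.append({
            "text": match.group(0),
            "value_usd": number * UNIT_MULTIPLIERS.get(unit, 1.0),
        })
    return claims
```

The trailing negative lookahead keeps the pattern from silently truncating inputs it does not understand, so an unrecognized suffix produces no match rather than a wrong value.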
Financial Validation (Planned)
- Real-time API Integration: Direct access to current market data
- Historical Validation: Check if claims were accurate at presentation time
- Market Context: Industry comparisons and benchmarking
- Multiple Data Sources: Redundancy for accuracy verification
Report Generation
- Executive Summary: High-level accuracy metrics and key findings
- Detailed Analysis: Slide-by-slide validation results
- Source Transparency: Clear attribution of validation sources
- Actionable Insights: Specific recommendations for improvement
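The executive summary can be derived from the per-claim results with a few lines of aggregation. The sketch below assumes each result is a dict carrying a "verdict" string such as "accurate" or "overstated"; both the data shape and the Markdown layout are assumptions for illustration.

```python
from collections import Counter
from typing import List


def executive_summary(results: List[dict]) -> str:
    """Render high-level accuracy metrics as a Markdown section."""
    total = len(results)
    counts = Counter(result["verdict"] for result in results)
    accurate = counts.get("accurate", 0)
    rate = accurate / total if total else 0.0
    lines = [
        "## Executive Summary",
        "",
        f"- Claims checked: {total}",
        f"- Accurate: {accurate} ({rate:.0%})",
    ]
    # List the remaining verdicts alphabetically so the output is stable.
    for verdict in sorted(name for name in counts if name != "accurate"):
        lines.append(f"- {verdict.capitalize()}: {counts[verdict]}")
    return "\n".join(lines)
```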
Error Handling
- API Rate Limiting: Intelligent handling of API call limits
- Data Validation: Verification of extracted financial data
- Graceful Degradation: Continues processing even if individual validations fail
- Comprehensive Logging: Detailed error tracking and debugging
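One common way to combine rate-limit handling with graceful degradation is retry with exponential backoff, falling back to an "unverified" result when all attempts fail. The sketch below is an assumption about how this could be done, not a description of the existing code; all names are hypothetical.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


def call_with_retry(fetch: Callable[[], Any],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> Tuple[Any, Optional[Exception]]:
    """Call fetch(); on failure wait base_delay * 2**n seconds and try again.

    Returns (value, None) on success or (None, last_error) after the final attempt.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(), None
        except Exception as error:  # deliberately broad in this sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None, last_error


def validate_all(claims: List[Any],
                 validate_one: Callable[[Any], Any],
                 **retry_options: Any) -> List[dict]:
    """Validate every claim; a failing claim is recorded, not fatal."""
    report = []
    for claim in claims:
        value, error = call_with_retry(lambda: validate_one(claim), **retry_options)
        report.append({
            "claim": claim,
            "status": "validated" if error is None else "unverified",
            "result": value,
            "error": None if error is None else str(error),
        })
    return report
```

Passing sleep as a parameter keeps the backoff testable without real delays.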
Development Setup
Prerequisites
- Python 3.7+
- Virtual environment support
- OpenRouter API account
- Financial API accounts (Yahoo Finance, Alpha Vantage, etc.)
Installation
- Clone the repository
- Create virtual environment: python3 -m venv venv
- Activate environment: source venv/bin/activate
- Install dependencies: pip install -r requirements.txt
- Configure .env file with API keys
Configuration
- Copy example.env to .env
- Add OpenRouter API key
- Add financial API keys:
YAHOO_FINANCE_API_KEY=your_key_here
ALPHA_VANTAGE_API_KEY=your_key_here
FINANCIAL_MODELING_PREP_API_KEY=your_key_here
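At startup, the application could check these variables before running the pipeline. The sketch below is one possible approach: the variable name OPENROUTER_API_KEY is an assumption (this README does not name it), and treating the financial keys as optional reflects that those integrations are still planned.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed name; the README only says an OpenRouter API key is required.
REQUIRED_KEYS = ("OPENROUTER_API_KEY",)
OPTIONAL_KEYS = (
    "YAHOO_FINANCE_API_KEY",
    "ALPHA_VANTAGE_API_KEY",
    "FINANCIAL_MODELING_PREP_API_KEY",
)


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the configured API keys, failing early if a required one is missing.

    With python-dotenv installed, calling dotenv.load_dotenv() beforehand
    copies the values from .env into os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS + OPTIONAL_KEYS if env.get(name)}
```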
Planned Improvements
Phase 1: Financial API Integration
- Implement Yahoo Finance API for real-time market cap data
- Add Alpha Vantage for historical data and trends
- Create API rate limiting and error handling
Phase 2: Enhanced Validation Logic
- Time-based validation (check accuracy at presentation date)
- Market cap calculation verification
- Industry benchmarking and comparison
Phase 3: Advanced Features
- Support for different valuation metrics (enterprise value, etc.)
- Automated fact-checking for other financial claims
- Integration with SEC filings for public companies
- Machine learning for improved claim detection
Technical Considerations
Performance
- API Optimization: Efficient use of financial API calls
- Caching Strategy: Store validation results to avoid redundant API calls
- Batch Processing: Process multiple claims efficiently
- Rate Limiting: Respect API limits while maintaining speed
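A caching strategy along these lines could be as simple as an in-memory cache with a time-to-live, so repeated claims about the same company reuse one API response. The class below is a hypothetical sketch; the 15-minute default is an arbitrary assumption.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling fetch() only when needed."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = fetch()
        self._entries[key] = (now, value)
        return value
```

A short TTL suits market data, where a stale value could itself cause a false discrepancy.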
Accuracy
- Multiple Data Sources: Cross-reference validation results
- Time Context: Consider when claims were made vs. current data
- Market Dynamics: Account for market volatility and timing
- Data Quality: Validate API responses for accuracy
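Cross-referencing several providers could work as in the sketch below, which takes the median as the consensus value and flags sources that stray too far from it. The function name, the 5% threshold, and the input shape (a mapping from source name to reported market cap, with None for a failed lookup) are assumptions for illustration.

```python
from statistics import median
from typing import Dict, Optional


def cross_reference(quotes: Dict[str, Optional[float]],
                    max_spread: float = 0.05) -> Optional[dict]:
    """Combine market cap values from several sources into one consensus."""
    # Drop failed lookups and implausible values before comparing.
    usable = {source: value for source, value in quotes.items()
              if value is not None and value > 0}
    if not usable:
        return None
    consensus = median(usable.values())
    outliers = sorted(source for source, value in usable.items()
                      if abs(value - consensus) / consensus > max_spread)
    return {
        "consensus_usd": consensus,
        "sources_used": len(usable),
        "outliers": outliers,
    }
```

The median is used here instead of the mean so that a single bad response does not shift the consensus.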
Security
- API Key Management: Secure storage and rotation of API keys
- Data Privacy: Handle sensitive financial information appropriately
- Rate Limiting: Prevent API abuse and excessive costs
- Error Handling: Graceful handling of API failures
File Structure
boxone-technical/
├── app.py # Main application entry point
├── start.sh # Development startup script
├── requirements.txt # Python dependencies
├── .env # Environment configuration
├── example.env # Environment template
├── modules/ # Core application modules
│ ├── market_cap_validator.py # Main market cap validation interface
│ ├── rag_agent.py # RAG agent for claim extraction and validation
│ ├── document_validator.py # Document-level validation processing
│ ├── validation_report.py # Report generation utilities
│ ├── pdf_processor.py # PDF extraction and processing
│ ├── client.py # OpenRouter API client
│ └── ... # Additional utility modules
├── processed/ # Output directory for validation results
└── venv/ # Python virtual environment
Current Status
⚠️ Important: The current RAG system uses generic web search which is insufficient for accurate financial validation. The system needs integration with proper financial APIs to provide reliable market cap validation and claim debunking capabilities.
This tool is designed to be a comprehensive solution for fast, accurate financial claim validation using real-time market data and specialized financial intelligence APIs.