Add comprehensive pitch deck analysis with AI agents and debugging

- Refactored app.py with extensive debugging feedback at every step
- Implemented 5 specialized AI agents for critical pitch deck analysis:
  * Problem Analysis (pain point, data backing, market impact)
  * Solution Evaluation (competitive advantage, proof, scalability)
  * Market Opportunity Assessment (TAM/SAM, growth, customers)
  * Traction Evaluation (metrics, sustainability, growth trends)
  * Funding & Ask Analysis (amount, allocation, milestones, valuation)
- Added comprehensive logging to all modules for visibility
- Updated markdown output to preserve full structured formatting
- Fixed markdown upload to preserve headers and formatting
- Simplified start.sh for cleaner execution
- Cleaned up processed directory (not tracked in git)
- All modules now provide real-time feedback during execution
Colin · 2025-10-22 19:17:37 -04:00
commit ef5de680da (parent 0bb86c677d)
28 changed files with 446 additions and 3939 deletions

README.md (new file, 260 lines)

# Pitch Deck Market Cap Validator
A Python application that automatically extracts market cap claims from pitch deck PDFs and validates them against specialized financial APIs and a RAG (Retrieval-Augmented Generation) pipeline, so that inaccurate financial claims can be debunked quickly.
## Technical Overview
This tool processes PDF pitch decks through a multi-stage pipeline focused on **financial claim validation**. The system extracts market cap claims, validates them against real-time financial data sources, and generates comprehensive debunking reports.
### Architecture
```
PDF Input → Slide Extraction → Claim Detection → Financial API Validation → Debunking Report
    ↓              ↓                  ↓                      ↓                      ↓
 PyMuPDF      Image Files     Pattern Matching       Financial APIs        Markdown Report
```
### Core Mission
**Fast market cap validation and claim debunking** using proper financial APIs that track market intelligence accurately, rather than generic web search.
## Core Components
### 1. Main Application (`app.py`)
- **Entry point** for the pitch deck analysis pipeline
- Orchestrates slide extraction and market cap validation workflow
- Generates comprehensive debunking reports with Table of Contents
- Handles file validation and error management
### 2. PDF Processing (`modules/pdf_processor.py`)
- **PyMuPDF integration** for high-quality PDF to image conversion
- Extracts individual slides as PNG images (2x zoom for clarity)
- Creates organized directory structure: `processed/{document_name}/slides/`
- Handles page numbering and file naming conventions
### 3. Market Cap Validation Engine (`modules/market_cap_validator.py`)
- **Main interface** for market cap claim validation
- Coordinates between claim extraction and validation processes
- Generates comprehensive validation reports
- Handles multiple input formats (files, processed folders, direct data)
### 4. RAG Agent (`modules/rag_agent.py`)
- **Pattern-based claim extraction** using regex patterns for market cap detection
- **Financial API integration** for real-time market data validation
- **Confidence scoring** based on context and claim specificity
- **Discrepancy analysis** between claimed and actual market caps
### 5. Document Validator (`modules/document_validator.py`)
- **Batch processing** for multiple documents
- **Organized reporting** with document-specific validation results
- **Error handling** for invalid or corrupted slide data
### 6. Validation Report Generator (`modules/validation_report.py`)
- **Comprehensive reporting** with executive summaries
- **Slide source tracking** for claim attribution
- **RAG search details** for transparency and verification
- **Recommendations** for improving claim accuracy
## Technical Stack
### Dependencies
- **PyMuPDF**: PDF processing and image extraction
- **OpenAI**: AI model integration via OpenRouter
- **requests**: HTTP API communications for financial data
- **python-dotenv**: Environment variable management
- **docling**: Advanced document processing capabilities
### Financial Data Sources (Planned)
- **Yahoo Finance API**: Real-time market cap data
- **Alpha Vantage**: Historical and current market data
- **Financial Modeling Prep**: Comprehensive financial metrics
- **IEX Cloud**: Real-time stock data and market intelligence
- **Quandl**: Financial and economic data
### Environment Configuration
- **OpenRouter API Key**: Required for AI model access
- **Financial API Keys**: Multiple providers for redundancy and accuracy
- **Rate Limiting**: Configurable API call limits and retry logic
## Current Limitations & Improvements Needed
### RAG System Issues
- **Generic Web Search**: Currently uses basic web search instead of specialized financial APIs
- **Accuracy Problems**: Web search results are inconsistent and often outdated
- **No Real-time Data**: Cannot access current market cap information
- **Limited Financial Context**: Lacks understanding of market dynamics and valuation metrics
### Required API Integrations
1. **Real-time Market Data APIs**:
- Yahoo Finance API for current market caps
- Alpha Vantage for historical data and trends
- Financial Modeling Prep for comprehensive metrics
2. **Enhanced Validation Logic**:
- Time-based validation (check if claim was accurate at time of presentation)
- Market cap calculation verification (shares outstanding × price; see the sketch after this list)
- Industry benchmarking and comparison
3. **Improved Pattern Recognition**:
- Better company name extraction from slides
- Context-aware claim detection
- Support for different valuation metrics (enterprise value, etc.)
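
To make the calculation check in item 2 concrete, here is a minimal sketch; the function name, default tolerance, and return shape are illustrative assumptions, not part of the current codebase:

```python
# Hypothetical helper for the "shares outstanding × price" verification.
def verify_market_cap_claim(claimed_cap, shares_outstanding, share_price, tolerance=0.10):
    """Recompute market cap and flag claims outside the tolerance band."""
    computed_cap = shares_outstanding * share_price
    discrepancy = abs(claimed_cap - computed_cap) / computed_cap
    return {
        "computed_cap": computed_cap,
        "discrepancy_pct": round(discrepancy * 100, 2),
        "plausible": discrepancy <= tolerance,
    }

# Example: a deck claims a $50B market cap for a company with 1.2B shares
# trading at $38.50; the computed cap is ~$46.2B, a ~8.2% discrepancy,
# which falls within the default 10% tolerance.
print(verify_market_cap_claim(50e9, 1.2e9, 38.50))
```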
## Usage
### Quick Start
```bash
# Make start script executable
chmod +x start.sh
# Run market cap validation on a PDF file
./start.sh presentation.pdf
```
### Manual Execution
```bash
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run validation
python3 app.py presentation.pdf
```
### Market Cap Validation Only
```bash
# Validate market caps from processed folder
python3 modules/validate_market_caps.py --all
# Validate specific document
python3 modules/validate_market_caps.py --file slides.json --document "Company-Pitch"
```
## Output Structure
The tool generates:
1. **Processed Images**: Individual slide images in `processed/{document_name}/slides/`
2. **Validation Report**: Comprehensive debunking report with:
- Executive summary of claim accuracy
- Detailed validation results for each claim
- Source attribution and confidence scores
- Discrepancy analysis and explanations
- Recommendations for improving accuracy
3. **Shareable Link**: Automatic upload to a Hastebin instance (haste.nixc.us) for easy sharing
## Technical Features
### Market Cap Claim Detection
- **Pattern Recognition**: Multiple regex patterns for market cap identification
- **Context Analysis**: Confidence scoring based on surrounding text
- **Company Name Extraction**: Automatic identification of company names
- **Value Normalization**: Standardized handling of different value formats (B, M, K); a minimal sketch follows below
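
As a rough illustration of the pattern-and-normalization approach, here is a minimal sketch; the actual patterns in `modules/rag_agent.py` may differ:

```python
import re

# One example pattern; the real module uses multiple patterns.
MARKET_CAP_PATTERN = re.compile(
    r"market\s+cap(?:italization)?\s+(?:of\s+)?\$?([\d.,]+)\s*([BMK])\b",
    re.IGNORECASE,
)
MULTIPLIERS = {"B": 1e9, "M": 1e6, "K": 1e3}

def extract_market_cap_claims(slide_text):
    """Return normalized dollar values for each market cap claim found."""
    claims = []
    for value, suffix in MARKET_CAP_PATTERN.findall(slide_text):
        claims.append(float(value.replace(",", "")) * MULTIPLIERS[suffix.upper()])
    return claims

# "Acme has a market cap of $2.5B" -> [2500000000.0]
print(extract_market_cap_claims("Acme has a market cap of $2.5B"))
```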
### Financial Validation (Planned)
- **Real-time API Integration**: Direct access to current market data
- **Historical Validation**: Check if claims were accurate at presentation time
- **Market Context**: Industry comparisons and benchmarking
- **Multiple Data Sources**: Redundancy for accuracy verification
### Report Generation
- **Executive Summary**: High-level accuracy metrics and key findings
- **Detailed Analysis**: Slide-by-slide validation results
- **Source Transparency**: Clear attribution of validation sources
- **Actionable Insights**: Specific recommendations for improvement
### Error Handling
- **API Rate Limiting**: Intelligent handling of API call limits (see the retry sketch below)
- **Data Validation**: Verification of extracted financial data
- **Graceful Degradation**: Continues processing even if individual validations fail
- **Comprehensive Logging**: Detailed error tracking and debugging
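
The rate-limit handling and graceful degradation above can be combined in a small retry wrapper. This is a sketch under stated assumptions (a generic REST endpoint that signals rate limits with HTTP 429), not the project's actual implementation:

```python
import time
import requests

def fetch_with_backoff(url, params=None, max_retries=3, base_delay=1.0):
    """Retry transient failures with exponential backoff; return None on give-up."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            if response.status_code == 429:  # rate limited: wait, then retry
                time.sleep(base_delay * (2 ** attempt))
                continue
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            print(f"⚠️ Attempt {attempt + 1} failed: {e}")
            time.sleep(base_delay * (2 ** attempt))
    return None  # graceful degradation: caller skips this validation and continues
```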
## Development Setup
### Prerequisites
- Python 3.7+
- Virtual environment support
- OpenRouter API account
- Financial API accounts (Yahoo Finance, Alpha Vantage, etc.)
### Installation
1. Clone the repository
2. Create virtual environment: `python3 -m venv venv`
3. Activate environment: `source venv/bin/activate`
4. Install dependencies: `pip install -r requirements.txt`
5. Configure `.env` file with API keys
### Configuration
- Copy `example.env` to `.env`
- Add OpenRouter API key
- Add financial API keys:
```
YAHOO_FINANCE_API_KEY=your_key_here
ALPHA_VANTAGE_API_KEY=your_key_here
FINANCIAL_MODELING_PREP_API_KEY=your_key_here
```
## Planned Improvements
### Phase 1: Financial API Integration
- Implement Yahoo Finance API for real-time market cap data
- Add Alpha Vantage for historical data and trends (a fetch sketch follows below)
- Create API rate limiting and error handling
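
A possible shape for the Alpha Vantage piece, assuming its company-overview endpoint (which returns a `MarketCapitalization` field) and the `ALPHA_VANTAGE_API_KEY` variable from the configuration section above:

```python
import os
import requests

def fetch_market_cap(symbol):
    """Fetch the current market cap for a ticker from Alpha Vantage."""
    response = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "OVERVIEW",
            "symbol": symbol,
            "apikey": os.environ["ALPHA_VANTAGE_API_KEY"],
        },
        timeout=30,
    )
    response.raise_for_status()
    market_cap = response.json().get("MarketCapitalization")
    return float(market_cap) if market_cap else None
```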
### Phase 2: Enhanced Validation Logic
- Time-based validation (check accuracy at presentation date)
- Market cap calculation verification
- Industry benchmarking and comparison
### Phase 3: Advanced Features
- Support for different valuation metrics (enterprise value, etc.)
- Automated fact-checking for other financial claims
- Integration with SEC filings for public companies
- Machine learning for improved claim detection
## Technical Considerations
### Performance
- **API Optimization**: Efficient use of financial API calls
- **Caching Strategy**: Store validation results to avoid redundant API calls (sketched below)
- **Batch Processing**: Process multiple claims efficiently
- **Rate Limiting**: Respect API limits while maintaining speed
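
A minimal on-disk cache along these lines could look like the following; the cache path and 24-hour TTL are assumptions:

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("processed/market_cap_cache.json")
CACHE_TTL = 24 * 3600  # seconds before a cached market cap is re-fetched

def cached_market_cap(symbol, fetch_fn):
    """Return a cached market cap if still fresh, otherwise fetch and store it."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    entry = cache.get(symbol)
    if entry and time.time() - entry["ts"] < CACHE_TTL:
        return entry["value"]
    value = fetch_fn(symbol)  # e.g. the Alpha Vantage sketch above
    cache[symbol] = {"value": value, "ts": time.time()}
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))
    return value
```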
### Accuracy
- **Multiple Data Sources**: Cross-reference validation results
- **Time Context**: Consider when claims were made vs. current data
- **Market Dynamics**: Account for market volatility and timing
- **Data Quality**: Validate API responses for accuracy
### Security
- **API Key Management**: Secure storage and rotation of API keys
- **Data Privacy**: Handle sensitive financial information appropriately
- **Rate Limiting**: Prevent API abuse and excessive costs
- **Error Handling**: Graceful handling of API failures
## File Structure
```
boxone-technical/
├── app.py # Main application entry point
├── start.sh # Development startup script
├── requirements.txt # Python dependencies
├── .env # Environment configuration
├── example.env # Environment template
├── modules/ # Core application modules
│ ├── market_cap_validator.py # Main market cap validation interface
│ ├── rag_agent.py # RAG agent for claim extraction and validation
│ ├── document_validator.py # Document-level validation processing
│ ├── validation_report.py # Report generation utilities
│ ├── pdf_processor.py # PDF extraction and processing
│ ├── client.py # OpenRouter API client
│ └── ... # Additional utility modules
├── processed/ # Output directory for validation results
└── venv/ # Python virtual environment
```
## Current Status
**⚠️ Important**: The current RAG system uses generic web search which is insufficient for accurate financial validation. The system needs integration with proper financial APIs to provide reliable market cap validation and claim debunking capabilities.
This tool is designed to be a comprehensive solution for **fast, accurate financial claim validation** using real-time market data and specialized financial intelligence APIs.

app.py (117 changed lines)

```diff
@@ -1,13 +1,18 @@
 #!/usr/bin/env python3
+print("🚀 APP.PY STARTING - IMMEDIATE FEEDBACK", flush=True)
 import sys
 import os
 import re
+import time
 from pathlib import Path
+print("📦 BASIC IMPORTS COMPLETE", flush=True)
 
 def generate_toc(markdown_content):
     """Generate a Table of Contents from markdown headers"""
-    print(" 📋 Generating Table of Contents...")
+    print(" 📋 Generating Table of Contents...", flush=True)
     lines = markdown_content.split('\n')
     toc_lines = []
     toc_lines.append("## Table of Contents")
@@ -34,61 +39,104 @@ def generate_toc(markdown_content):
     toc_lines.append("---")
     toc_lines.append("")
-    print(f" ✅ Generated TOC with {header_count} headers")
+    print(f" ✅ Generated TOC with {header_count} headers", flush=True)
     return '\n'.join(toc_lines)
 
 def main():
-    """Simple pitch deck analyzer"""
+    """Simple pitch deck analyzer with comprehensive debugging"""
+    print("🚀 PITCH DECK ANALYZER MAIN FUNCTION STARTING", flush=True)
+    print("=" * 50, flush=True)
     if len(sys.argv) < 2:
-        print("Usage: python app.py <pdf_file>")
+        print("Usage: python app.py <pdf_file>", flush=True)
         return
     pdf_path = sys.argv[1]
     if not os.path.exists(pdf_path):
-        print(f"Error: File '{pdf_path}' not found")
+        print(f"Error: File '{pdf_path}' not found", flush=True)
         return
-    print(f"🚀 Processing: {pdf_path}")
+    print(f"📁 Processing file: {pdf_path}", flush=True)
+    print(f"📁 File exists: {os.path.exists(pdf_path)}", flush=True)
+    print(f"📁 File size: {os.path.getsize(pdf_path)} bytes", flush=True)
     # Import what we need directly (avoid __init__.py issues)
-    print("📦 Importing modules...")
+    print("\n📦 IMPORTING MODULES", flush=True)
+    print("-" * 30, flush=True)
     sys.path.append('modules')
+    print(" 🔄 Importing client module...", flush=True)
     from client import get_openrouter_client
+    print(" ✅ client module imported successfully", flush=True)
+    print(" 🔄 Importing pdf_processor module...", flush=True)
     from pdf_processor import extract_slides_from_pdf
+    print(" ✅ pdf_processor module imported successfully", flush=True)
+    print(" 🔄 Importing analysis module...", flush=True)
     from analysis import analyze_slides_batch
+    print(" ✅ analysis module imported successfully", flush=True)
+    print(" 🔄 Importing markdown_utils module...", flush=True)
     from markdown_utils import send_to_api_and_get_haste_link
-    print("✅ Modules imported successfully")
+    print(" ✅ markdown_utils module imported successfully", flush=True)
+    print("✅ ALL MODULES IMPORTED SUCCESSFULLY", flush=True)
     # Extract slides
-    print("📄 Extracting slides...")
+    print("\n📄 EXTRACTING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling extract_slides_from_pdf...", flush=True)
+    start_time = time.time()
     slides = extract_slides_from_pdf(pdf_path, "processed", Path(pdf_path).stem)
-    print(f"✅ Extracted {len(slides)} slides")
+    extraction_time = time.time() - start_time
+    print(f" ✅ extract_slides_from_pdf completed in {extraction_time:.2f}s", flush=True)
+    print(f" 📊 Extracted {len(slides)} slides", flush=True)
+    # LIMIT TO FIRST 3 SLIDES FOR TESTING
+    print(f" 🔄 Limiting to first 3 slides for testing...", flush=True)
+    slides = slides[:3]
+    print(f" 📊 Processing {len(slides)} slides", flush=True)
     # Analyze slides
-    print("🧠 Analyzing slides...")
+    print("\n🧠 ANALYZING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Initializing API client...", flush=True)
     client = get_openrouter_client()
-    print("🔗 API client initialized")
+    print(" ✅ API client initialized successfully", flush=True)
+    print(" 🔄 Calling analyze_slides_batch...", flush=True)
+    analysis_start_time = time.time()
     analysis_results = analyze_slides_batch(client, slides)
-    print("✅ Analysis complete")
+    analysis_time = time.time() - analysis_start_time
+    print(f" ✅ analyze_slides_batch completed in {analysis_time:.2f}s", flush=True)
+    print(f" 📊 Analysis results: {len(analysis_results)} slides analyzed", flush=True)
    # Create report
-    print("📝 Creating report...")
+    print("\n📝 CREATING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Building markdown content...", flush=True)
     markdown_content = f"# Pitch Deck Analysis: {Path(pdf_path).stem}\n\n"
     # Add analysis metadata
     markdown_content += "This analysis was generated using multiple AI agents, each specialized in different aspects of slide evaluation.\n\n"
     markdown_content += f"**Source File:** `{Path(pdf_path).name}` (PDF)\n"
-    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed\n"
+    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed (limited for testing)\n"
     markdown_content += "**Processing Method:** Individual processing with specialized AI agents\n"
     markdown_content += "**Text Extraction:** Docling-powered text transcription\n\n"
-    print(f"📊 Building markdown for {len(slides)} slides...")
+    print(f" 📊 Building markdown for {len(slides)} slides...", flush=True)
     for i, slide_data in enumerate(slides):
         slide_num = i + 1
-        analysis = analysis_results.get(slide_num, {})
-        print(f" 📄 Processing slide {slide_num}...")
+        print(f" 🔄 Processing slide {slide_num}/{len(slides)}...", flush=True)
+        analysis = analysis_results.get(slide_num, {})
         markdown_content += f"# Slide {slide_num}\n\n"
         markdown_content += f"![Slide {slide_num}](slides/{slide_data['filename']})\n\n"
@@ -107,20 +155,22 @@ def main():
             markdown_content += f"### {agent_name}\n\n"
             markdown_content += f"{agent_analysis}\n\n"
-            print(f" ✅ Added {agent_count} agent analyses")
+            print(f" ✅ Added {agent_count} agent analyses for slide {slide_num}", flush=True)
         else:
             markdown_content += "## Agentic Analysis\n\n"
             markdown_content += "No analysis available\n\n"
-            print(f" ⚠️ No analysis available for slide {slide_num}")
+            print(f" ⚠️ No analysis available for slide {slide_num}", flush=True)
         markdown_content += "---\n\n"
+    print(" ✅ Markdown content built successfully", flush=True)
     # Generate Table of Contents
-    print("📋 Generating Table of Contents...")
+    print(" 🔄 Generating Table of Contents...", flush=True)
     toc = generate_toc(markdown_content)
     # Insert TOC after the main title
-    print("🔗 Inserting TOC into document...")
+    print(" 🔄 Inserting TOC into document...", flush=True)
     lines = markdown_content.split('\n')
     final_content = []
     final_content.append(lines[0])  # Main title
@@ -129,24 +179,33 @@ def main():
     final_content.extend(lines[2:])  # Rest of content
     final_markdown = '\n'.join(final_content)
+    print(f" ✅ Final markdown created: {len(final_markdown)} characters", flush=True)
     # Save report
+    print("\n💾 SAVING REPORT", flush=True)
+    print("-" * 30, flush=True)
     output_file = f"processed/{Path(pdf_path).stem}_analysis.md"
-    print(f"💾 Saving report to: {output_file}")
+    print(f" 🔄 Saving to: {output_file}", flush=True)
+    os.makedirs("processed", exist_ok=True)
     os.makedirs("processed", exist_ok=True)
     with open(output_file, 'w', encoding='utf-8') as f:
         f.write(final_markdown)
-    print(f"✅ Report saved successfully ({len(final_markdown)} characters)")
+    print(f" ✅ Report saved successfully ({len(final_markdown)} characters)", flush=True)
     # Always upload the report
-    print("🌐 Uploading report...")
+    print("\n🌐 UPLOADING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling send_to_api_and_get_haste_link...", flush=True)
     haste_url = send_to_api_and_get_haste_link(final_markdown, Path(pdf_path).stem)
     if haste_url:
-        print(f"✅ Report uploaded to: {haste_url}")
+        print(f" ✅ Report uploaded successfully: {haste_url}", flush=True)
     else:
-        print("❌ Upload failed")
+        print(" ❌ Upload failed - no URL returned", flush=True)
+    print("\n🎉 PROCESSING COMPLETE!", flush=True)
+    print("=" * 50, flush=True)
 
 if __name__ == "__main__":
+    print("🎯 __main__ BLOCK ENTERED", flush=True)
     main()
```

modules/analysis.py

```diff
@@ -1,37 +1,74 @@
+print('🟡 ANALYSIS.PY: Starting import...', flush=True)
 import re
 from client import get_openrouter_client
+print('🟡 ANALYSIS.PY: Import complete!', flush=True)
 
 def analyze_slides_batch(client, slides_data, batch_size=1):
     """Process slides individually with specialized AI agents"""
-    print(f" Processing {len(slides_data)} slides individually...")
+    print(f" 📊 Processing {len(slides_data)} slides individually...", flush=True)
     all_results = {}
     for i, slide_data in enumerate(slides_data):
         slide_num = slide_data["page_num"]
-        print(f" 🔍 Analyzing slide {slide_num} ({i+1}/{len(slides_data)})...")
+        print(f" 🔍 Starting analysis of slide {slide_num} ({i+1}/{len(slides_data)})...", flush=True)
-        # Define specialized agents
+        # Define specialized agents with critical pitch deck questions
         agents = {
-            'content_extractor': {
-                'name': 'Content Extractor',
-                'prompt': 'Extract and summarize the key textual content from this slide. Focus on headlines, bullet points, and main messages.'
-            },
-            'visual_analyzer': {
-                'name': 'Visual Analyzer',
-                'prompt': 'Analyze the visual design elements of this slide. Comment on layout, colors, typography, and visual hierarchy.'
-            },
-            'data_interpreter': {
-                'name': 'Data Interpreter',
-                'prompt': 'Identify and interpret any numerical data, charts, graphs, or metrics present on this slide.'
-            },
-            'message_evaluator': {
-                'name': 'Message Evaluator',
-                'prompt': 'Evaluate the effectiveness of the message delivery and communication strategy on this slide.'
-            },
-            'improvement_suggestor': {
-                'name': 'Improvement Suggestor',
-                'prompt': 'Suggest specific improvements for this slide in terms of clarity, impact, and effectiveness.'
-            }
+            'problem_analyzer': {
+                'name': 'Problem Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+1. What's the core pain point being addressed?
+2. Is it backed by data or evidence?
+3. How big is the market impact of this problem?
+4. Why do existing solutions fail to solve this?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'solution_evaluator': {
+                'name': 'Solution Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+1. How does this solution outperform competitors?
+2. Is there proof of value (metrics, testimonials, case studies)?
+3. Can it scale effectively?
+4. Is the solution clearly explained and understandable?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'market_opportunity_assessor': {
+                'name': 'Market Opportunity Assessment',
+                'prompt': '''Assess this slide focusing on these critical questions:
+1. What's the market size (TAM/SAM/SOM)?
+2. Is the market growing or declining?
+3. Are target customers clearly defined?
+4. Will customers actually pay for this?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'traction_evaluator': {
+                'name': 'Traction Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+1. What metrics demonstrate market demand?
+2. Is the traction sustainable or just a one-time spike?
+3. How will funding accelerate this growth?
+4. Is growth trending upward consistently?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'funding_analyzer': {
+                'name': 'Funding & Ask Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+1. How much funding is being raised?
+2. How will the funds be allocated and used?
+3. What specific milestones are targeted with this funding?
+4. Is the valuation justified based on traction and market?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            }
         }
@@ -39,17 +76,17 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
         # Analyze with each specialized agent
         for j, (agent_key, agent_config) in enumerate(agents.items()):
-            print(f" 🤖 Running {agent_config['name']} ({j+1}/5)...")
+            print(f" 🤖 Running {agent_config['name']} ({j+1}/5) for slide {slide_num}...", flush=True)
             messages = [
                 {
                     "role": "system",
-                    "content": f"You are a {agent_config['name']} specialized in analyzing pitch deck slides. {agent_config['prompt']}"
+                    "content": f"You are a pitch deck analyst specialized in {agent_config['name']}. Answer the critical questions based on what you observe in the slide. If a question doesn't apply to this slide, say 'Not applicable to this slide' and briefly explain why."
                 },
                 {
                     "role": "user",
                     "content": [
-                        {"type": "text", "text": f"Analyze slide {slide_num}:"},
+                        {"type": "text", "text": f"Analyze slide {slide_num} and answer these critical questions:\n\n{agent_config['prompt']}"},
                         {
                             "type": "image_url",
                             "image_url": {
@@ -61,15 +98,15 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
             ]
 
             try:
-                print(f" 📡 Sending API request...")
+                print(f" 📡 Sending API request to {agent_config['name']}...", flush=True)
                 response = client.chat.completions.create(
                     model="gpt-4o-mini",
                     messages=messages,
-                    max_tokens=500
+                    max_tokens=800
                 )
                 analysis = response.choices[0].message.content.strip()
-                print(f"{agent_config['name']} completed ({len(analysis)} chars)")
+                print(f"{agent_config['name']} completed for slide {slide_num} ({len(analysis)} chars)", flush=True)
 
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
@@ -77,14 +114,14 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
                 }
             except Exception as e:
-                print(f"{agent_config['name']} failed: {str(e)}")
+                print(f"{agent_config['name']} failed for slide {slide_num}: {str(e)}", flush=True)
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
                     'analysis': f"Error analyzing slide {slide_num}: {str(e)}"
                 }
 
         all_results[slide_num] = slide_analysis
-        print(f" ✅ Slide {slide_num} analysis complete")
+        print(f" ✅ Slide {slide_num} analysis complete - {len(slide_analysis)} agents finished", flush=True)
 
-    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!")
+    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!", flush=True)
     return all_results
```

modules/client.py

```diff
@@ -1,3 +1,4 @@
+print('🔵 CLIENT.PY: Starting import...')
 #!/usr/bin/env python3
 
 import os
@@ -21,3 +22,4 @@ def get_openrouter_client():
         base_url="https://openrouter.ai/api/v1",
         api_key=api_key
     )
+print('🔵 CLIENT.PY: Import complete!')
```

modules/docling_processor.py

```diff
@@ -1,3 +1,4 @@
+print('🔴 DOCLING_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 from docling.document_converter import DocumentConverter
@@ -170,3 +171,4 @@ def get_slide_text_content(text_content, slide_num):
     except Exception as e:
         print(f"⚠️ Error extracting text for slide {slide_num}: {e}")
         return f"[Text content for slide {slide_num} could not be extracted]"
+print('🔴 DOCLING_PROCESSOR.PY: Import complete!')
```

modules/file_utils.py

```diff
@@ -1,3 +1,4 @@
+print('🟠 FILE_UTILS.PY: Starting import...')
 #!/usr/bin/env python3
 
 import subprocess
@@ -109,3 +110,4 @@ def convert_with_libreoffice(input_file, output_pdf, file_type):
     except Exception as e:
         print(f"❌ LibreOffice conversion error: {e}")
         return None
+print('🟠 FILE_UTILS.PY: Import complete!')
```

modules/markdown_utils.py

```diff
@@ -1,3 +1,4 @@
+print('🟣 MARKDOWN_UTILS.PY: Starting import...', flush=True)
 #!/usr/bin/env python3
 
 import re
@@ -5,169 +6,66 @@
 import requests
 import json
 
-def clean_markdown_text(text):
-    """Clean markdown text to ensure it's plaintext with no special characters"""
-    if not text:
-        return ""
-    # Remove LaTeX commands and math expressions
-    text = re.sub(r'\\[a-zA-Z]+\{[^}]*\}', '', text)  # Remove \command{content}
-    text = re.sub(r'\$[^$]*\$', '', text)  # Remove $math$ expressions
-    text = re.sub(r'\\[a-zA-Z]+', '', text)  # Remove remaining \commands
-    # Remove markdown formatting but keep the text
-    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Remove bold **text**
-    text = re.sub(r'\*([^*]+)\*', r'\1', text)  # Remove italic *text*
-    text = re.sub(r'`([^`]+)`', r'\1', text)  # Remove code `text`
-    text = re.sub(r'#{1,6}\s*', '', text)  # Remove headers # ## ###
-    # Remove special characters but keep basic punctuation
-    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\"\'\/\&\%\@\#\$\+\=\<\>]', ' ', text)
-    # Clean up multiple spaces and newlines
-    text = re.sub(r'\s+', ' ', text)
-    text = re.sub(r'\n\s*\n', '\n\n', text)
-    return text.strip()
-
-def create_slide_markdown(slide_data, analysis_results, slide_num, slide_text=""):
-    """Create markdown content for a single slide with all agentic analyses and text content"""
-    markdown = f"""# Slide {slide_num}
-
-![Slide {slide_num}](slides/{slide_data['filename']})
-
-"""
-    # Add text content if available
-    if slide_text and slide_text.strip():
-        # Clean the slide text to ensure it's plaintext
-        clean_slide_text = clean_markdown_text(slide_text)
-        markdown += f"""## Text Content
-
-{clean_slide_text}
-
-"""
-    markdown += """## Agentic Analysis
-
-"""
-    for prompt_key, result in analysis_results.items():
-        # Clean the analysis text to ensure it's plaintext
-        clean_analysis = clean_markdown_text(result['analysis'])
-        markdown += f"""### {result['agent']}
-
-{clean_analysis}
-
-"""
-    markdown += "---\n\n"
-    return markdown
-
-def create_text_only_markdown(markdown_content):
-    """Create a text-only version of markdown without image references for API submission"""
-    # Remove image markdown blocks but keep the text descriptions and analysis
-    text_only = markdown_content
-    # Remove image embedding lines
-    text_only = re.sub(r'!\[.*?\]\(slides/.*?\)\n', '', text_only)
-    # Remove image link lines
-    text_only = re.sub(r'\*\[View full size: slides/.*?\]\(slides/.*?\)\*\n', '', text_only)
-    # Remove horizontal rules that were added for slide separation
-    text_only = re.sub(r'^---\n', '', text_only, flags=re.MULTILINE)
-    # Clean up extra newlines
-    text_only = re.sub(r'\n{3,}', '\n\n', text_only)
-    # Apply final text cleaning to ensure plaintext
-    text_only = clean_markdown_text(text_only)
-    return text_only.strip()
-
 def send_to_api_and_get_haste_link(markdown_content, document_title):
-    """Send markdown to API and get both raw markdown and HTML URLs"""
+    """Send FULL structured markdown to API and get both raw markdown and HTML URLs"""
     try:
-        print("Sending to API for URLs...")
+        print("Sending to API for URLs...", flush=True)
-        # Create text-only version for API
-        text_only_markdown = create_text_only_markdown(markdown_content)
+        # Send the FULL structured markdown - NO STRIPPING, NO CLEANING
+        # Only remove local image references since they won't work online
+        online_markdown = re.sub(r'!\[Slide (\d+)\]\(slides/[^\)]+\)', r'**[Slide \1 Image]**', markdown_content)
 
-        # First, send raw markdown to haste.nixc.us
+        # First, send to haste.nixc.us for raw markdown
         raw_haste_url = None
         try:
-            print(" 📝 Creating raw markdown URL...")
+            print(" 📝 Creating raw markdown URL...", flush=True)
             raw_response = requests.post(
                 "https://haste.nixc.us/documents",
-                data=text_only_markdown.encode('utf-8'),
+                data=online_markdown.encode('utf-8'),
                 headers={"Content-Type": "text/plain"},
                 timeout=30
             )
             if raw_response.status_code == 200:
-                raw_token = raw_response.text.strip().strip('"')
-                # Extract just the token from JSON response if needed
-                if raw_token.startswith('{"key":"') and raw_token.endswith('"}'):
-                    import json
-                    try:
-                        token_data = json.loads(raw_token)
-                        raw_token = token_data['key']
-                    except:
-                        pass
+                response_data = raw_response.json()
+                raw_token = response_data.get('key', '')
                 raw_haste_url = f"https://haste.nixc.us/{raw_token}"
-                print(f" ✅ Raw markdown URL created")
+                print(f" ✅ Raw markdown URL created", flush=True)
             else:
-                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}")
+                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create raw markdown URL: {e}")
+            print(f" ⚠️ Failed to create raw markdown URL: {e}", flush=True)
 
         # Then, send to md.colinknapp.com for HTML version
         html_url = None
         try:
-            print(" 🎨 Creating HTML version URL...")
+            print(" 🎨 Creating HTML version URL...", flush=True)
             api_data = {
-                "markdown": text_only_markdown,
-                "format": "html",
-                "template": "playful",
                 "title": f"Pitch Deck Analysis: {document_title}",
-                "subtitle": "AI-Generated Analysis with Agentic Insights",
-                "contact": "Generated by Pitch Deck Parser",
-                "send_to_haste": True
+                "content": online_markdown
             }
             response = requests.post(
-                "https://md.colinknapp.com/api/convert",
+                "https://md.colinknapp.com/haste",
                 headers={"Content-Type": "application/json"},
-                data=json.dumps(api_data),
+                json=api_data,
                 timeout=30
             )
             if response.status_code == 200:
                 result = response.json()
-                if 'haste_url' in result:
-                    # Extract token from haste_url and format as requested
-                    haste_url = result['haste_url']
-                    if 'haste.nixc.us/' in haste_url:
-                        token = haste_url.split('haste.nixc.us/')[-1]
-                        html_url = f"https://md.colinknapp.com/haste/{token}"
-                    else:
-                        html_url = haste_url
-                    print(f" ✅ HTML version URL created")
-                else:
-                    print(" ⚠️ API response missing haste_url")
+                html_url = result.get('url', '')
+                print(f" ✅ HTML version URL created", flush=True)
             else:
-                print(f" ⚠️ HTML API request failed with status {response.status_code}")
+                print(f" ⚠️ HTML API request failed with status {response.status_code}", flush=True)
+                print(f" Response: {response.text[:200]}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create HTML URL: {e}")
+            print(f" ⚠️ Failed to create HTML URL: {e}", flush=True)
 
         return raw_haste_url, html_url
     except Exception as e:
-        print(f"⚠️ Failed to send to API: {e}")
+        print(f"⚠️ Failed to send to API: {e}", flush=True)
         return None, None
+print('🟣 MARKDOWN_UTILS.PY: Import complete!', flush=True)
```

modules/pdf_processor.py

```diff
@@ -1,3 +1,4 @@
+print('🟢 PDF_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 import base64
@@ -58,3 +59,4 @@ def extract_slides_from_pdf(pdf_path, output_dir, document_name):
     except Exception as e:
         print(f"❌ Error extracting slides: {e}")
         return []
+print('🟢 PDF_PROCESSOR.PY: Import complete!')
```

(18 binary slide images under processed/ were deleted and are not shown; sizes ranged from 32 KiB to 2.3 MiB.)

(One file diff was suppressed because it is too large.)

start.sh

```diff
@@ -1,58 +1,10 @@
 #!/bin/bash
-# Kill any process running on port 3123
-echo "Killing any existing processes on port 3123..."
-fuser -k 3123/tcp 2>/dev/null || true
-
-# Create virtual environment if it doesn't exist
-if [ ! -d "venv" ]; then
-    echo "Creating virtual environment..."
-    python3 -m venv venv
-fi
-
-# Activate virtual environment
 echo "Activating virtual environment..."
 source venv/bin/activate
-
-# Verify virtual environment is active
-echo "Verifying virtual environment..."
-which python3
-python3 --version
-
-# Install dependencies
-echo "Installing dependencies..."
-pip install -r requirements.txt
-
-# Check for help flag
-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then
-    echo ""
-    echo "Pitch Deck Analysis Application"
-    echo "=============================="
-    echo "Usage: ./start.sh <file_path>"
-    echo "Example: ./start.sh presentation.pdf"
-    echo ""
-    echo "The application will automatically upload the generated report."
-    echo ""
-    exit 0
-fi
-
-# Verify file exists
-if [ -z "$1" ]; then
-    echo "Error: No file specified"
-    echo "Usage: ./start.sh <file_path>"
-    exit 1
-fi
-
-if [ ! -f "$1" ]; then
-    echo "Error: File '$1' not found"
-    exit 1
-fi
-
-# Start the application with immediate feedback
 echo "Starting pitch deck parser..."
 echo "Processing file: $1"
-echo "Python path: $(which python3)"
-echo "Working directory: $(pwd)"
 echo "----------------------------------------"
 python3 app.py "$1"
```