Add comprehensive pitch deck analysis with AI agents and debugging

- Refactored app.py with extensive debugging feedback at every step
- Implemented 5 specialized AI agents for critical pitch deck analysis:
  * Problem Analysis (pain point, data backing, market impact)
  * Solution Evaluation (competitive advantage, proof, scalability)
  * Market Opportunity Assessment (TAM/SAM, growth, customers)
  * Traction Evaluation (metrics, sustainability, growth trends)
  * Funding & Ask Analysis (amount, allocation, milestones, valuation)
- Added comprehensive logging to all modules for visibility
- Updated markdown output to preserve full structured formatting
- Fixed markdown upload to preserve headers and formatting
- Simplified start.sh for cleaner execution
- Cleaned up processed directory (not tracked in git)
- All modules now provide real-time feedback during execution
Colin · 2025-10-22 19:17:37 -04:00
commit ef5de680da (parent 0bb86c677d)
28 changed files with 446 additions and 3939 deletions

README.md (new file, 260 lines)

# Pitch Deck Market Cap Validator
A Python application that automatically extracts market cap claims from pitch deck PDFs and validates them against specialized financial APIs and a RAG (Retrieval-Augmented Generation) pipeline, so that inaccurate financial claims can be debunked quickly.
## Technical Overview
This tool processes PDF pitch decks through a multi-stage pipeline focused on **financial claim validation**. The system extracts market cap claims, validates them against real-time financial data sources, and generates comprehensive debunking reports.
### Architecture
```
PDF Input → Slide Extraction → Claim Detection → Financial API Validation → Debunking Report
    ↓              ↓                  ↓                      ↓                      ↓
 PyMuPDF      Image Files     Pattern Matching       Financial APIs        Markdown Report
```
### Core Mission
**Fast market cap validation and claim debunking** using proper financial APIs that track market intelligence accurately, rather than generic web search.
## Core Components
### 1. Main Application (`app.py`)
- **Entry point** for the pitch deck analysis pipeline
- Orchestrates slide extraction and market cap validation workflow
- Generates comprehensive debunking reports with Table of Contents
- Handles file validation and error management
### 2. PDF Processing (`modules/pdf_processor.py`)
- **PyMuPDF integration** for high-quality PDF to image conversion
- Extracts individual slides as PNG images (2x zoom for clarity)
- Creates organized directory structure: `processed/{document_name}/slides/`
- Handles page numbering and file naming conventions
### 3. Market Cap Validation Engine (`modules/market_cap_validator.py`)
- **Main interface** for market cap claim validation
- Coordinates between claim extraction and validation processes
- Generates comprehensive validation reports
- Handles multiple input formats (files, processed folders, direct data)
### 4. RAG Agent (`modules/rag_agent.py`)
- **Pattern-based claim extraction** using regex patterns for market cap detection
- **Financial API integration** for real-time market data validation
- **Confidence scoring** based on context and claim specificity
- **Discrepancy analysis** between claimed and actual market caps
### 5. Document Validator (`modules/document_validator.py`)
- **Batch processing** for multiple documents
- **Organized reporting** with document-specific validation results
- **Error handling** for invalid or corrupted slide data
### 6. Validation Report Generator (`modules/validation_report.py`)
- **Comprehensive reporting** with executive summaries
- **Slide source tracking** for claim attribution
- **RAG search details** for transparency and verification
- **Recommendations** for improving claim accuracy
## Technical Stack
### Dependencies
- **PyMuPDF**: PDF processing and image extraction
- **OpenAI**: AI model integration via OpenRouter
- **requests**: HTTP API communications for financial data
- **python-dotenv**: Environment variable management
- **docling**: Advanced document processing capabilities
### Financial Data Sources (Planned)
- **Yahoo Finance API**: Real-time market cap data
- **Alpha Vantage**: Historical and current market data
- **Financial Modeling Prep**: Comprehensive financial metrics
- **IEX Cloud**: Real-time stock data and market intelligence
- **Quandl**: Financial and economic data
### Environment Configuration
- **OpenRouter API Key**: Required for AI model access
- **Financial API Keys**: Multiple providers for redundancy and accuracy
- **Rate Limiting**: Configurable API call limits and retry logic
## Current Limitations & Improvements Needed
### RAG System Issues
- **Generic Web Search**: Currently uses basic web search instead of specialized financial APIs
- **Accuracy Problems**: Web search results are inconsistent and often outdated
- **No Real-time Data**: Cannot access current market cap information
- **Limited Financial Context**: Lacks understanding of market dynamics and valuation metrics
### Required API Integrations
1. **Real-time Market Data APIs**:
- Yahoo Finance API for current market caps
- Alpha Vantage for historical data and trends
- Financial Modeling Prep for comprehensive metrics
2. **Enhanced Validation Logic**:
- Time-based validation (check if claim was accurate at time of presentation)
- Market cap calculation verification (shares outstanding × price; see the sketch after this list)
- Industry benchmarking and comparison
3. **Improved Pattern Recognition**:
- Better company name extraction from slides
- Context-aware claim detection
- Support for different valuation metrics (enterprise value, etc.)
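
To make the calculation check in item 2 concrete, here is a minimal sketch; the function name, default tolerance, and return shape are illustrative assumptions, not part of the current codebase:

```python
# Hypothetical helper for the "shares outstanding × price" verification.
def verify_market_cap_claim(claimed_cap, shares_outstanding, share_price, tolerance=0.10):
    """Recompute market cap and flag claims outside the tolerance band."""
    computed_cap = shares_outstanding * share_price
    discrepancy = abs(claimed_cap - computed_cap) / computed_cap
    return {
        "computed_cap": computed_cap,
        "discrepancy_pct": round(discrepancy * 100, 2),
        "plausible": discrepancy <= tolerance,
    }

# Example: a deck claims a $50B market cap for a company with 1.2B shares
# trading at $38.50; the computed cap is ~$46.2B, a ~8.2% discrepancy,
# which falls within the default 10% tolerance.
print(verify_market_cap_claim(50e9, 1.2e9, 38.50))
```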
## Usage
### Quick Start
```bash
# Make start script executable
chmod +x start.sh
# Run market cap validation on a PDF file
./start.sh presentation.pdf
```
### Manual Execution
```bash
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run validation
python3 app.py presentation.pdf
```
### Market Cap Validation Only
```bash
# Validate market caps from processed folder
python3 modules/validate_market_caps.py --all
# Validate specific document
python3 modules/validate_market_caps.py --file slides.json --document "Company-Pitch"
```
## Output Structure
The tool generates:
1. **Processed Images**: Individual slide images in `processed/{document_name}/slides/`
2. **Validation Report**: Comprehensive debunking report with:
- Executive summary of claim accuracy
- Detailed validation results for each claim
- Source attribution and confidence scores
- Discrepancy analysis and explanations
- Recommendations for improving accuracy
3. **Shareable Link**: Automatic upload to a Hastebin instance (haste.nixc.us) for easy sharing
## Technical Features
### Market Cap Claim Detection
- **Pattern Recognition**: Multiple regex patterns for market cap identification
- **Context Analysis**: Confidence scoring based on surrounding text
- **Company Name Extraction**: Automatic identification of company names
- **Value Normalization**: Standardized handling of different value formats (B, M, K); a minimal sketch follows below
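
As a rough illustration of the pattern-and-normalization approach, here is a minimal sketch; the actual patterns in `modules/rag_agent.py` may differ:

```python
import re

# One example pattern; the real module uses multiple patterns.
MARKET_CAP_PATTERN = re.compile(
    r"market\s+cap(?:italization)?\s+(?:of\s+)?\$?([\d.,]+)\s*([BMK])\b",
    re.IGNORECASE,
)
MULTIPLIERS = {"B": 1e9, "M": 1e6, "K": 1e3}

def extract_market_cap_claims(slide_text):
    """Return normalized dollar values for each market cap claim found."""
    claims = []
    for value, suffix in MARKET_CAP_PATTERN.findall(slide_text):
        claims.append(float(value.replace(",", "")) * MULTIPLIERS[suffix.upper()])
    return claims

# "Acme has a market cap of $2.5B" -> [2500000000.0]
print(extract_market_cap_claims("Acme has a market cap of $2.5B"))
```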
### Financial Validation (Planned)
- **Real-time API Integration**: Direct access to current market data
- **Historical Validation**: Check if claims were accurate at presentation time
- **Market Context**: Industry comparisons and benchmarking
- **Multiple Data Sources**: Redundancy for accuracy verification
### Report Generation
- **Executive Summary**: High-level accuracy metrics and key findings
- **Detailed Analysis**: Slide-by-slide validation results
- **Source Transparency**: Clear attribution of validation sources
- **Actionable Insights**: Specific recommendations for improvement
### Error Handling
- **API Rate Limiting**: Intelligent handling of API call limits (see the retry sketch below)
- **Data Validation**: Verification of extracted financial data
- **Graceful Degradation**: Continues processing even if individual validations fail
- **Comprehensive Logging**: Detailed error tracking and debugging
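
The rate-limit handling and graceful degradation above can be combined in a small retry wrapper. This is a sketch under stated assumptions (a generic REST endpoint that signals rate limits with HTTP 429), not the project's actual implementation:

```python
import time
import requests

def fetch_with_backoff(url, params=None, max_retries=3, base_delay=1.0):
    """Retry transient failures with exponential backoff; return None on give-up."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            if response.status_code == 429:  # rate limited: wait, then retry
                time.sleep(base_delay * (2 ** attempt))
                continue
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            print(f"⚠️ Attempt {attempt + 1} failed: {e}")
            time.sleep(base_delay * (2 ** attempt))
    return None  # graceful degradation: caller skips this validation and continues
```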
## Development Setup
### Prerequisites
- Python 3.7+
- Virtual environment support
- OpenRouter API account
- Financial API accounts (Yahoo Finance, Alpha Vantage, etc.)
### Installation
1. Clone the repository
2. Create virtual environment: `python3 -m venv venv`
3. Activate environment: `source venv/bin/activate`
4. Install dependencies: `pip install -r requirements.txt`
5. Configure `.env` file with API keys
### Configuration
- Copy `example.env` to `.env`
- Add OpenRouter API key
- Add financial API keys:
```
YAHOO_FINANCE_API_KEY=your_key_here
ALPHA_VANTAGE_API_KEY=your_key_here
FINANCIAL_MODELING_PREP_API_KEY=your_key_here
```
## Planned Improvements
### Phase 1: Financial API Integration
- Implement Yahoo Finance API for real-time market cap data
- Add Alpha Vantage for historical data and trends (a fetch sketch follows below)
- Create API rate limiting and error handling
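
A possible shape for the Alpha Vantage piece, assuming its company-overview endpoint (which returns a `MarketCapitalization` field) and the `ALPHA_VANTAGE_API_KEY` variable from the configuration section above:

```python
import os
import requests

def fetch_market_cap(symbol):
    """Fetch the current market cap for a ticker from Alpha Vantage."""
    response = requests.get(
        "https://www.alphavantage.co/query",
        params={
            "function": "OVERVIEW",
            "symbol": symbol,
            "apikey": os.environ["ALPHA_VANTAGE_API_KEY"],
        },
        timeout=30,
    )
    response.raise_for_status()
    market_cap = response.json().get("MarketCapitalization")
    return float(market_cap) if market_cap else None
```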
### Phase 2: Enhanced Validation Logic
- Time-based validation (check accuracy at presentation date)
- Market cap calculation verification
- Industry benchmarking and comparison
### Phase 3: Advanced Features
- Support for different valuation metrics (enterprise value, etc.)
- Automated fact-checking for other financial claims
- Integration with SEC filings for public companies
- Machine learning for improved claim detection
## Technical Considerations
### Performance
- **API Optimization**: Efficient use of financial API calls
- **Caching Strategy**: Store validation results to avoid redundant API calls (sketched below)
- **Batch Processing**: Process multiple claims efficiently
- **Rate Limiting**: Respect API limits while maintaining speed
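
A minimal on-disk cache along these lines could look like the following; the cache path and 24-hour TTL are assumptions:

```python
import json
import time
from pathlib import Path

CACHE_FILE = Path("processed/market_cap_cache.json")
CACHE_TTL = 24 * 3600  # seconds before a cached market cap is re-fetched

def cached_market_cap(symbol, fetch_fn):
    """Return a cached market cap if still fresh, otherwise fetch and store it."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    entry = cache.get(symbol)
    if entry and time.time() - entry["ts"] < CACHE_TTL:
        return entry["value"]
    value = fetch_fn(symbol)  # e.g. the Alpha Vantage sketch above
    cache[symbol] = {"value": value, "ts": time.time()}
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))
    return value
```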
### Accuracy
- **Multiple Data Sources**: Cross-reference validation results
- **Time Context**: Consider when claims were made vs. current data
- **Market Dynamics**: Account for market volatility and timing
- **Data Quality**: Validate API responses for accuracy
### Security
- **API Key Management**: Secure storage and rotation of API keys
- **Data Privacy**: Handle sensitive financial information appropriately
- **Rate Limiting**: Prevent API abuse and excessive costs
- **Error Handling**: Graceful handling of API failures
## File Structure
```
boxone-technical/
├── app.py # Main application entry point
├── start.sh # Development startup script
├── requirements.txt # Python dependencies
├── .env # Environment configuration
├── example.env # Environment template
├── modules/ # Core application modules
│ ├── market_cap_validator.py # Main market cap validation interface
│ ├── rag_agent.py # RAG agent for claim extraction and validation
│ ├── document_validator.py # Document-level validation processing
│ ├── validation_report.py # Report generation utilities
│ ├── pdf_processor.py # PDF extraction and processing
│ ├── client.py # OpenRouter API client
│ └── ... # Additional utility modules
├── processed/ # Output directory for validation results
└── venv/ # Python virtual environment
```
## Current Status
**⚠️ Important**: The current RAG system uses generic web search which is insufficient for accurate financial validation. The system needs integration with proper financial APIs to provide reliable market cap validation and claim debunking capabilities.
This tool is designed to be a comprehensive solution for **fast, accurate financial claim validation** using real-time market data and specialized financial intelligence APIs.

app.py (117 changed lines)

```diff
@@ -1,13 +1,18 @@
 #!/usr/bin/env python3
+print("🚀 APP.PY STARTING - IMMEDIATE FEEDBACK", flush=True)
 import sys
 import os
 import re
+import time
 from pathlib import Path
+print("📦 BASIC IMPORTS COMPLETE", flush=True)
 
 def generate_toc(markdown_content):
     """Generate a Table of Contents from markdown headers"""
-    print(" 📋 Generating Table of Contents...")
+    print(" 📋 Generating Table of Contents...", flush=True)
     lines = markdown_content.split('\n')
     toc_lines = []
     toc_lines.append("## Table of Contents")
@@ -34,61 +39,104 @@ def generate_toc(markdown_content):
     toc_lines.append("---")
     toc_lines.append("")
-    print(f" ✅ Generated TOC with {header_count} headers")
+    print(f" ✅ Generated TOC with {header_count} headers", flush=True)
     return '\n'.join(toc_lines)
 
 def main():
-    """Simple pitch deck analyzer"""
+    """Simple pitch deck analyzer with comprehensive debugging"""
+    print("🚀 PITCH DECK ANALYZER MAIN FUNCTION STARTING", flush=True)
+    print("=" * 50, flush=True)
     if len(sys.argv) < 2:
-        print("Usage: python app.py <pdf_file>")
+        print("Usage: python app.py <pdf_file>", flush=True)
         return
     pdf_path = sys.argv[1]
     if not os.path.exists(pdf_path):
-        print(f"Error: File '{pdf_path}' not found")
+        print(f"Error: File '{pdf_path}' not found", flush=True)
         return
-    print(f"🚀 Processing: {pdf_path}")
+    print(f"📁 Processing file: {pdf_path}", flush=True)
+    print(f"📁 File exists: {os.path.exists(pdf_path)}", flush=True)
+    print(f"📁 File size: {os.path.getsize(pdf_path)} bytes", flush=True)
     # Import what we need directly (avoid __init__.py issues)
-    print("📦 Importing modules...")
+    print("\n📦 IMPORTING MODULES", flush=True)
+    print("-" * 30, flush=True)
     sys.path.append('modules')
+    print(" 🔄 Importing client module...", flush=True)
     from client import get_openrouter_client
+    print(" ✅ client module imported successfully", flush=True)
+    print(" 🔄 Importing pdf_processor module...", flush=True)
     from pdf_processor import extract_slides_from_pdf
+    print(" ✅ pdf_processor module imported successfully", flush=True)
+    print(" 🔄 Importing analysis module...", flush=True)
     from analysis import analyze_slides_batch
+    print(" ✅ analysis module imported successfully", flush=True)
+    print(" 🔄 Importing markdown_utils module...", flush=True)
     from markdown_utils import send_to_api_and_get_haste_link
-    print("✅ Modules imported successfully")
+    print(" ✅ markdown_utils module imported successfully", flush=True)
+    print("✅ ALL MODULES IMPORTED SUCCESSFULLY", flush=True)
     # Extract slides
-    print("📄 Extracting slides...")
+    print("\n📄 EXTRACTING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling extract_slides_from_pdf...", flush=True)
+    start_time = time.time()
     slides = extract_slides_from_pdf(pdf_path, "processed", Path(pdf_path).stem)
-    print(f"✅ Extracted {len(slides)} slides")
+    extraction_time = time.time() - start_time
+    print(f" ✅ extract_slides_from_pdf completed in {extraction_time:.2f}s", flush=True)
+    print(f" 📊 Extracted {len(slides)} slides", flush=True)
+    # LIMIT TO FIRST 3 SLIDES FOR TESTING
+    print(f" 🔄 Limiting to first 3 slides for testing...", flush=True)
+    slides = slides[:3]
+    print(f" 📊 Processing {len(slides)} slides", flush=True)
     # Analyze slides
-    print("🧠 Analyzing slides...")
+    print("\n🧠 ANALYZING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Initializing API client...", flush=True)
     client = get_openrouter_client()
-    print("🔗 API client initialized")
+    print(" ✅ API client initialized successfully", flush=True)
+    print(" 🔄 Calling analyze_slides_batch...", flush=True)
+    analysis_start_time = time.time()
     analysis_results = analyze_slides_batch(client, slides)
-    print("✅ Analysis complete")
+    analysis_time = time.time() - analysis_start_time
+    print(f" ✅ analyze_slides_batch completed in {analysis_time:.2f}s", flush=True)
+    print(f" 📊 Analysis results: {len(analysis_results)} slides analyzed", flush=True)
    # Create report
-    print("📝 Creating report...")
+    print("\n📝 CREATING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Building markdown content...", flush=True)
     markdown_content = f"# Pitch Deck Analysis: {Path(pdf_path).stem}\n\n"
     # Add analysis metadata
     markdown_content += "This analysis was generated using multiple AI agents, each specialized in different aspects of slide evaluation.\n\n"
     markdown_content += f"**Source File:** `{Path(pdf_path).name}` (PDF)\n"
-    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed\n"
+    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed (limited for testing)\n"
     markdown_content += "**Processing Method:** Individual processing with specialized AI agents\n"
     markdown_content += "**Text Extraction:** Docling-powered text transcription\n\n"
-    print(f"📊 Building markdown for {len(slides)} slides...")
+    print(f" 📊 Building markdown for {len(slides)} slides...", flush=True)
     for i, slide_data in enumerate(slides):
         slide_num = i + 1
-        analysis = analysis_results.get(slide_num, {})
-        print(f" 📄 Processing slide {slide_num}...")
+        print(f" 🔄 Processing slide {slide_num}/{len(slides)}...", flush=True)
+        analysis = analysis_results.get(slide_num, {})
         markdown_content += f"# Slide {slide_num}\n\n"
         markdown_content += f"![Slide {slide_num}](slides/{slide_data['filename']})\n\n"
@@ -107,20 +155,22 @@ def main():
             markdown_content += f"### {agent_name}\n\n"
             markdown_content += f"{agent_analysis}\n\n"
-            print(f" ✅ Added {agent_count} agent analyses")
+            print(f" ✅ Added {agent_count} agent analyses for slide {slide_num}", flush=True)
         else:
             markdown_content += "## Agentic Analysis\n\n"
             markdown_content += "No analysis available\n\n"
-            print(f" ⚠️ No analysis available for slide {slide_num}")
+            print(f" ⚠️ No analysis available for slide {slide_num}", flush=True)
         markdown_content += "---\n\n"
+    print(" ✅ Markdown content built successfully", flush=True)
     # Generate Table of Contents
-    print("📋 Generating Table of Contents...")
+    print(" 🔄 Generating Table of Contents...", flush=True)
     toc = generate_toc(markdown_content)
     # Insert TOC after the main title
-    print("🔗 Inserting TOC into document...")
+    print(" 🔄 Inserting TOC into document...", flush=True)
     lines = markdown_content.split('\n')
     final_content = []
     final_content.append(lines[0])  # Main title
@@ -129,24 +179,33 @@ def main():
     final_content.extend(lines[2:])  # Rest of content
     final_markdown = '\n'.join(final_content)
+    print(f" ✅ Final markdown created: {len(final_markdown)} characters", flush=True)
     # Save report
+    print("\n💾 SAVING REPORT", flush=True)
+    print("-" * 30, flush=True)
     output_file = f"processed/{Path(pdf_path).stem}_analysis.md"
-    print(f"💾 Saving report to: {output_file}")
+    print(f" 🔄 Saving to: {output_file}", flush=True)
+    os.makedirs("processed", exist_ok=True)
     os.makedirs("processed", exist_ok=True)
     with open(output_file, 'w', encoding='utf-8') as f:
         f.write(final_markdown)
-    print(f"✅ Report saved successfully ({len(final_markdown)} characters)")
+    print(f" ✅ Report saved successfully ({len(final_markdown)} characters)", flush=True)
     # Always upload the report
-    print("🌐 Uploading report...")
+    print("\n🌐 UPLOADING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling send_to_api_and_get_haste_link...", flush=True)
     haste_url = send_to_api_and_get_haste_link(final_markdown, Path(pdf_path).stem)
     if haste_url:
-        print(f"✅ Report uploaded to: {haste_url}")
+        print(f" ✅ Report uploaded successfully: {haste_url}", flush=True)
     else:
-        print("❌ Upload failed")
+        print(" ❌ Upload failed - no URL returned", flush=True)
+    print("\n🎉 PROCESSING COMPLETE!", flush=True)
+    print("=" * 50, flush=True)
 
 if __name__ == "__main__":
+    print("🎯 __main__ BLOCK ENTERED", flush=True)
     main()
```

modules/analysis.py

```diff
@@ -1,37 +1,74 @@
+print('🟡 ANALYSIS.PY: Starting import...', flush=True)
 import re
 from client import get_openrouter_client
+print('🟡 ANALYSIS.PY: Import complete!', flush=True)
 
 def analyze_slides_batch(client, slides_data, batch_size=1):
     """Process slides individually with specialized AI agents"""
-    print(f" Processing {len(slides_data)} slides individually...")
+    print(f" 📊 Processing {len(slides_data)} slides individually...", flush=True)
     all_results = {}
     for i, slide_data in enumerate(slides_data):
         slide_num = slide_data["page_num"]
-        print(f" 🔍 Analyzing slide {slide_num} ({i+1}/{len(slides_data)})...")
+        print(f" 🔍 Starting analysis of slide {slide_num} ({i+1}/{len(slides_data)})...", flush=True)
-        # Define specialized agents
+        # Define specialized agents with critical pitch deck questions
         agents = {
-            'content_extractor': {
-                'name': 'Content Extractor',
-                'prompt': 'Extract and summarize the key textual content from this slide. Focus on headlines, bullet points, and main messages.'
-            },
-            'visual_analyzer': {
-                'name': 'Visual Analyzer',
-                'prompt': 'Analyze the visual design elements of this slide. Comment on layout, colors, typography, and visual hierarchy.'
-            },
-            'data_interpreter': {
-                'name': 'Data Interpreter',
-                'prompt': 'Identify and interpret any numerical data, charts, graphs, or metrics present on this slide.'
-            },
-            'message_evaluator': {
-                'name': 'Message Evaluator',
-                'prompt': 'Evaluate the effectiveness of the message delivery and communication strategy on this slide.'
-            },
-            'improvement_suggestor': {
-                'name': 'Improvement Suggestor',
-                'prompt': 'Suggest specific improvements for this slide in terms of clarity, impact, and effectiveness.'
-            }
+            'problem_analyzer': {
+                'name': 'Problem Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+1. What's the core pain point being addressed?
+2. Is it backed by data or evidence?
+3. How big is the market impact of this problem?
+4. Why do existing solutions fail to solve this?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'solution_evaluator': {
+                'name': 'Solution Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+1. How does this solution outperform competitors?
+2. Is there proof of value (metrics, testimonials, case studies)?
+3. Can it scale effectively?
+4. Is the solution clearly explained and understandable?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'market_opportunity_assessor': {
+                'name': 'Market Opportunity Assessment',
+                'prompt': '''Assess this slide focusing on these critical questions:
+1. What's the market size (TAM/SAM/SOM)?
+2. Is the market growing or declining?
+3. Are target customers clearly defined?
+4. Will customers actually pay for this?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'traction_evaluator': {
+                'name': 'Traction Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+1. What metrics demonstrate market demand?
+2. Is the traction sustainable or just a one-time spike?
+3. How will funding accelerate this growth?
+4. Is growth trending upward consistently?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            },
+            'funding_analyzer': {
+                'name': 'Funding & Ask Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+1. How much funding is being raised?
+2. How will the funds be allocated and used?
+3. What specific milestones are targeted with this funding?
+4. Is the valuation justified based on traction and market?
+Provide clear, specific answers to each question based on what you see in the slide.'''
+            }
         }
@@ -39,17 +76,17 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
         # Analyze with each specialized agent
         for j, (agent_key, agent_config) in enumerate(agents.items()):
-            print(f" 🤖 Running {agent_config['name']} ({j+1}/5)...")
+            print(f" 🤖 Running {agent_config['name']} ({j+1}/5) for slide {slide_num}...", flush=True)
             messages = [
                 {
                     "role": "system",
-                    "content": f"You are a {agent_config['name']} specialized in analyzing pitch deck slides. {agent_config['prompt']}"
+                    "content": f"You are a pitch deck analyst specialized in {agent_config['name']}. Answer the critical questions based on what you observe in the slide. If a question doesn't apply to this slide, say 'Not applicable to this slide' and briefly explain why."
                 },
                 {
                     "role": "user",
                     "content": [
-                        {"type": "text", "text": f"Analyze slide {slide_num}:"},
+                        {"type": "text", "text": f"Analyze slide {slide_num} and answer these critical questions:\n\n{agent_config['prompt']}"},
                         {
                             "type": "image_url",
                             "image_url": {
@@ -61,15 +98,15 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
             ]
 
             try:
-                print(f" 📡 Sending API request...")
+                print(f" 📡 Sending API request to {agent_config['name']}...", flush=True)
                 response = client.chat.completions.create(
                     model="gpt-4o-mini",
                     messages=messages,
-                    max_tokens=500
+                    max_tokens=800
                 )
                 analysis = response.choices[0].message.content.strip()
-                print(f"{agent_config['name']} completed ({len(analysis)} chars)")
+                print(f"{agent_config['name']} completed for slide {slide_num} ({len(analysis)} chars)", flush=True)
 
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
@@ -77,14 +114,14 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
                 }
             except Exception as e:
-                print(f"{agent_config['name']} failed: {str(e)}")
+                print(f"{agent_config['name']} failed for slide {slide_num}: {str(e)}", flush=True)
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
                     'analysis': f"Error analyzing slide {slide_num}: {str(e)}"
                 }
 
         all_results[slide_num] = slide_analysis
-        print(f" ✅ Slide {slide_num} analysis complete")
+        print(f" ✅ Slide {slide_num} analysis complete - {len(slide_analysis)} agents finished", flush=True)
 
-    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!")
+    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!", flush=True)
     return all_results
```

modules/client.py

```diff
@@ -1,3 +1,4 @@
+print('🔵 CLIENT.PY: Starting import...')
 #!/usr/bin/env python3
 
 import os
@@ -21,3 +22,4 @@ def get_openrouter_client():
         base_url="https://openrouter.ai/api/v1",
         api_key=api_key
     )
+print('🔵 CLIENT.PY: Import complete!')
```

modules/docling_processor.py

```diff
@@ -1,3 +1,4 @@
+print('🔴 DOCLING_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 from docling.document_converter import DocumentConverter
@@ -170,3 +171,4 @@ def get_slide_text_content(text_content, slide_num):
     except Exception as e:
         print(f"⚠️ Error extracting text for slide {slide_num}: {e}")
         return f"[Text content for slide {slide_num} could not be extracted]"
+print('🔴 DOCLING_PROCESSOR.PY: Import complete!')
```

modules/file_utils.py

```diff
@@ -1,3 +1,4 @@
+print('🟠 FILE_UTILS.PY: Starting import...')
 #!/usr/bin/env python3
 
 import subprocess
@@ -109,3 +110,4 @@ def convert_with_libreoffice(input_file, output_pdf, file_type):
     except Exception as e:
         print(f"❌ LibreOffice conversion error: {e}")
         return None
+print('🟠 FILE_UTILS.PY: Import complete!')
```

modules/markdown_utils.py

```diff
@@ -1,3 +1,4 @@
+print('🟣 MARKDOWN_UTILS.PY: Starting import...', flush=True)
 #!/usr/bin/env python3
 
 import re
@@ -5,169 +6,66 @@
 import requests
 import json
 
-def clean_markdown_text(text):
-    """Clean markdown text to ensure it's plaintext with no special characters"""
-    if not text:
-        return ""
-    # Remove LaTeX commands and math expressions
-    text = re.sub(r'\\[a-zA-Z]+\{[^}]*\}', '', text)  # Remove \command{content}
-    text = re.sub(r'\$[^$]*\$', '', text)  # Remove $math$ expressions
-    text = re.sub(r'\\[a-zA-Z]+', '', text)  # Remove remaining \commands
-    # Remove markdown formatting but keep the text
-    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Remove bold **text**
-    text = re.sub(r'\*([^*]+)\*', r'\1', text)  # Remove italic *text*
-    text = re.sub(r'`([^`]+)`', r'\1', text)  # Remove code `text`
-    text = re.sub(r'#{1,6}\s*', '', text)  # Remove headers # ## ###
-    # Remove special characters but keep basic punctuation
-    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\"\'\/\&\%\@\#\$\+\=\<\>]', ' ', text)
-    # Clean up multiple spaces and newlines
-    text = re.sub(r'\s+', ' ', text)
-    text = re.sub(r'\n\s*\n', '\n\n', text)
-    return text.strip()
-
-def create_slide_markdown(slide_data, analysis_results, slide_num, slide_text=""):
-    """Create markdown content for a single slide with all agentic analyses and text content"""
-    markdown = f"""# Slide {slide_num}
-
-![Slide {slide_num}](slides/{slide_data['filename']})
-
-"""
-    # Add text content if available
-    if slide_text and slide_text.strip():
-        # Clean the slide text to ensure it's plaintext
-        clean_slide_text = clean_markdown_text(slide_text)
-        markdown += f"""## Text Content
-
-{clean_slide_text}
-
-"""
-    markdown += """## Agentic Analysis
-
-"""
-    for prompt_key, result in analysis_results.items():
-        # Clean the analysis text to ensure it's plaintext
-        clean_analysis = clean_markdown_text(result['analysis'])
-        markdown += f"""### {result['agent']}
-
-{clean_analysis}
-
-"""
-    markdown += "---\n\n"
-    return markdown
-
-def create_text_only_markdown(markdown_content):
-    """Create a text-only version of markdown without image references for API submission"""
-    # Remove image markdown blocks but keep the text descriptions and analysis
-    text_only = markdown_content
-    # Remove image embedding lines
-    text_only = re.sub(r'!\[.*?\]\(slides/.*?\)\n', '', text_only)
-    # Remove image link lines
-    text_only = re.sub(r'\*\[View full size: slides/.*?\]\(slides/.*?\)\*\n', '', text_only)
-    # Remove horizontal rules that were added for slide separation
-    text_only = re.sub(r'^---\n', '', text_only, flags=re.MULTILINE)
-    # Clean up extra newlines
-    text_only = re.sub(r'\n{3,}', '\n\n', text_only)
-    # Apply final text cleaning to ensure plaintext
-    text_only = clean_markdown_text(text_only)
-    return text_only.strip()
-
 def send_to_api_and_get_haste_link(markdown_content, document_title):
-    """Send markdown to API and get both raw markdown and HTML URLs"""
+    """Send FULL structured markdown to API and get both raw markdown and HTML URLs"""
     try:
-        print("Sending to API for URLs...")
+        print("Sending to API for URLs...", flush=True)
-        # Create text-only version for API
-        text_only_markdown = create_text_only_markdown(markdown_content)
+        # Send the FULL structured markdown - NO STRIPPING, NO CLEANING
+        # Only remove local image references since they won't work online
+        online_markdown = re.sub(r'!\[Slide (\d+)\]\(slides/[^\)]+\)', r'**[Slide \1 Image]**', markdown_content)
 
-        # First, send raw markdown to haste.nixc.us
+        # First, send to haste.nixc.us for raw markdown
         raw_haste_url = None
         try:
-            print(" 📝 Creating raw markdown URL...")
+            print(" 📝 Creating raw markdown URL...", flush=True)
             raw_response = requests.post(
                 "https://haste.nixc.us/documents",
-                data=text_only_markdown.encode('utf-8'),
+                data=online_markdown.encode('utf-8'),
                 headers={"Content-Type": "text/plain"},
                 timeout=30
             )
             if raw_response.status_code == 200:
-                raw_token = raw_response.text.strip().strip('"')
-                # Extract just the token from JSON response if needed
-                if raw_token.startswith('{"key":"') and raw_token.endswith('"}'):
-                    import json
-                    try:
-                        token_data = json.loads(raw_token)
-                        raw_token = token_data['key']
-                    except:
-                        pass
+                response_data = raw_response.json()
+                raw_token = response_data.get('key', '')
                 raw_haste_url = f"https://haste.nixc.us/{raw_token}"
-                print(f" ✅ Raw markdown URL created")
+                print(f" ✅ Raw markdown URL created", flush=True)
             else:
-                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}")
+                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create raw markdown URL: {e}")
+            print(f" ⚠️ Failed to create raw markdown URL: {e}", flush=True)
 
         # Then, send to md.colinknapp.com for HTML version
         html_url = None
         try:
-            print(" 🎨 Creating HTML version URL...")
+            print(" 🎨 Creating HTML version URL...", flush=True)
             api_data = {
-                "markdown": text_only_markdown,
-                "format": "html",
-                "template": "playful",
                 "title": f"Pitch Deck Analysis: {document_title}",
-                "subtitle": "AI-Generated Analysis with Agentic Insights",
-                "contact": "Generated by Pitch Deck Parser",
-                "send_to_haste": True
+                "content": online_markdown
             }
             response = requests.post(
-                "https://md.colinknapp.com/api/convert",
+                "https://md.colinknapp.com/haste",
                 headers={"Content-Type": "application/json"},
-                data=json.dumps(api_data),
+                json=api_data,
                 timeout=30
             )
             if response.status_code == 200:
                 result = response.json()
-                if 'haste_url' in result:
-                    # Extract token from haste_url and format as requested
-                    haste_url = result['haste_url']
-                    if 'haste.nixc.us/' in haste_url:
-                        token = haste_url.split('haste.nixc.us/')[-1]
-                        html_url = f"https://md.colinknapp.com/haste/{token}"
-                    else:
-                        html_url = haste_url
-                    print(f" ✅ HTML version URL created")
-                else:
-                    print(" ⚠️ API response missing haste_url")
+                html_url = result.get('url', '')
+                print(f" ✅ HTML version URL created", flush=True)
             else:
-                print(f" ⚠️ HTML API request failed with status {response.status_code}")
+                print(f" ⚠️ HTML API request failed with status {response.status_code}", flush=True)
+                print(f" Response: {response.text[:200]}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create HTML URL: {e}")
+            print(f" ⚠️ Failed to create HTML URL: {e}", flush=True)
 
         return raw_haste_url, html_url
     except Exception as e:
-        print(f"⚠️ Failed to send to API: {e}")
+        print(f"⚠️ Failed to send to API: {e}", flush=True)
         return None, None
+print('🟣 MARKDOWN_UTILS.PY: Import complete!', flush=True)
```

modules/pdf_processor.py

```diff
@@ -1,3 +1,4 @@
+print('🟢 PDF_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 import base64
@@ -58,3 +59,4 @@ def extract_slides_from_pdf(pdf_path, output_dir, document_name):
     except Exception as e:
         print(f"❌ Error extracting slides: {e}")
         return []
+print('🟢 PDF_PROCESSOR.PY: Import complete!')
```

(18 binary slide images under processed/ were deleted and are not shown; sizes ranged from 32 KiB to 2.3 MiB.)

(One file diff was suppressed because it is too large.)

start.sh

```diff
@@ -1,58 +1,10 @@
 #!/bin/bash
-# Kill any process running on port 3123
-echo "Killing any existing processes on port 3123..."
-fuser -k 3123/tcp 2>/dev/null || true
-
-# Create virtual environment if it doesn't exist
-if [ ! -d "venv" ]; then
-    echo "Creating virtual environment..."
-    python3 -m venv venv
-fi
-
-# Activate virtual environment
 echo "Activating virtual environment..."
 source venv/bin/activate
-
-# Verify virtual environment is active
-echo "Verifying virtual environment..."
-which python3
-python3 --version
-
-# Install dependencies
-echo "Installing dependencies..."
-pip install -r requirements.txt
-
-# Check for help flag
-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then
-    echo ""
-    echo "Pitch Deck Analysis Application"
-    echo "=============================="
-    echo "Usage: ./start.sh <file_path>"
-    echo "Example: ./start.sh presentation.pdf"
-    echo ""
-    echo "The application will automatically upload the generated report."
-    echo ""
-    exit 0
-fi
-
-# Verify file exists
-if [ -z "$1" ]; then
-    echo "Error: No file specified"
-    echo "Usage: ./start.sh <file_path>"
-    exit 1
-fi
-
-if [ ! -f "$1" ]; then
-    echo "Error: File '$1' not found"
-    exit 1
-fi
-
-# Start the application with immediate feedback
 echo "Starting pitch deck parser..."
 echo "Processing file: $1"
-echo "Python path: $(which python3)"
-echo "Working directory: $(pwd)"
 echo "----------------------------------------"
 python3 app.py "$1"
```