Add comprehensive pitch deck analysis with AI agents and debugging
- Refactored app.py with extensive debugging feedback at every step
- Implemented 5 specialized AI agents for critical pitch deck analysis:
  * Problem Analysis (pain point, data backing, market impact)
  * Solution Evaluation (competitive advantage, proof, scalability)
  * Market Opportunity Assessment (TAM/SAM, growth, customers)
  * Traction Evaluation (metrics, sustainability, growth trends)
  * Funding & Ask Analysis (amount, allocation, milestones, valuation)
- Added comprehensive logging to all modules for visibility
- Updated markdown output to preserve full structured formatting
- Fixed markdown upload to preserve headers and formatting
- Simplified start.sh for cleaner execution
- Cleaned up processed directory (not tracked in git)
- All modules now provide real-time feedback during execution
@@ -0,0 +1,260 @@
# Pitch Deck Market Cap Validator

A Python-based application that automatically extracts and validates market cap claims from pitch deck PDFs, using specialized financial APIs and RAG (Retrieval-Augmented Generation) systems to quickly debunk inaccurate financial claims.

## Technical Overview

This tool processes PDF pitch decks through a multi-stage pipeline focused on **financial claim validation**. The system extracts market cap claims, validates them against real-time financial data sources, and generates comprehensive debunking reports.

### Architecture

```
PDF Input → Slide Extraction → Claim Detection → Financial API Validation → Debunking Report
    ↓              ↓                  ↓                     ↓                      ↓
 PyMuPDF      Image Files     Pattern Matching      Financial APIs        Markdown Report
```
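
The stages map naturally onto a thin orchestration layer. A minimal sketch of the wiring, assuming the module layout described below (only `extract_slides_from_pdf` is confirmed by the code; the other function names are illustrative stand-ins, not the exact `app.py` code):

```python
# Illustrative pipeline wiring; function names other than
# extract_slides_from_pdf are hypothetical stand-ins.
from pathlib import Path

from pdf_processor import extract_slides_from_pdf    # PDF → PNG slide images
from rag_agent import extract_market_cap_claims      # slides → candidate claims (hypothetical name)
from market_cap_validator import validate_claims     # claims → verdicts (hypothetical name)
from validation_report import build_report           # verdicts → markdown report (hypothetical name)

def run_pipeline(pdf_path):
    name = Path(pdf_path).stem
    slides = extract_slides_from_pdf(pdf_path, "processed", name)
    claims = extract_market_cap_claims(slides)
    results = validate_claims(claims)
    return build_report(name, results)
```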

### Core Mission

**Fast market cap validation and claim debunking** using proper financial APIs that track market intelligence accurately, not generic web search.

## Core Components

### 1. Main Application (`app.py`)
- **Entry point** for the pitch deck analysis pipeline
- Orchestrates the slide extraction and market cap validation workflow
- Generates comprehensive debunking reports with a Table of Contents
- Handles file validation and error management

### 2. PDF Processing (`modules/pdf_processor.py`)
- **PyMuPDF integration** for high-quality PDF-to-image conversion
- Extracts individual slides as PNG images (2x zoom for clarity; see the sketch below)
- Creates an organized directory structure: `processed/{document_name}/slides/`
- Handles page numbering and file naming conventions
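
The rendering step itself is only a few lines of PyMuPDF. A minimal sketch of the 2x-zoom extraction (the directory layout follows the convention above; the exact code in `modules/pdf_processor.py` may differ, and the file naming here is illustrative):

```python
import fitz  # PyMuPDF
from pathlib import Path

def extract_slides(pdf_path, output_dir, document_name):
    """Render each PDF page as a PNG at 2x zoom and return the file paths."""
    slides_dir = Path(output_dir) / document_name / "slides"
    slides_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))      # 2x zoom for clarity
            out_path = slides_dir / f"slide_{page_num:02d}.png"  # naming convention is illustrative
            pix.save(str(out_path))
            paths.append(str(out_path))
    return paths
```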

### 3. Market Cap Validation Engine (`modules/market_cap_validator.py`)
- **Main interface** for market cap claim validation
- Coordinates between claim extraction and validation processes
- Generates comprehensive validation reports
- Handles multiple input formats (files, processed folders, direct data)

### 4. RAG Agent (`modules/rag_agent.py`)
- **Pattern-based claim extraction** using regex patterns for market cap detection (see the sketch below)
- **Financial API integration** for real-time market data validation
- **Confidence scoring** based on context and claim specificity
- **Discrepancy analysis** between claimed and actual market caps
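
A minimal sketch of what pattern-based extraction with context-driven confidence scoring can look like (the single pattern and the scoring weights here are illustrative; the real module likely uses several patterns):

```python
import re

# One of many possible patterns: matches "$3.5B", "700 million", "market cap of $2.1 billion", ...
CLAIM_PATTERN = re.compile(
    r"\$?\s*(\d+(?:\.\d+)?)\s*(billion|million|trillion|[BMT])\b",
    re.IGNORECASE,
)

def extract_claims(text):
    """Return candidate market cap claims with a crude context-based confidence."""
    claims = []
    for m in CLAIM_PATTERN.finditer(text):
        context = text[max(0, m.start() - 60):m.end() + 60].lower()
        confidence = 0.4                  # base score for a bare money figure
        if "market cap" in context:
            confidence += 0.4             # explicit mention of market cap nearby
        if "$" in m.group(0):
            confidence += 0.2             # explicit currency symbol in the claim
        claims.append({"raw": m.group(0).strip(), "confidence": round(confidence, 2)})
    return claims
```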

### 5. Document Validator (`modules/document_validator.py`)
- **Batch processing** for multiple documents
- **Organized reporting** with document-specific validation results
- **Error handling** for invalid or corrupted slide data

### 6. Validation Report Generator (`modules/validation_report.py`)
- **Comprehensive reporting** with executive summaries
- **Slide source tracking** for claim attribution
- **RAG search details** for transparency and verification
- **Recommendations** for improving claim accuracy

## Technical Stack

### Dependencies
- **PyMuPDF**: PDF processing and image extraction
- **OpenAI**: AI model integration via OpenRouter
- **requests**: HTTP API communications for financial data
- **python-dotenv**: Environment variable management
- **docling**: Advanced document processing capabilities

### Financial Data Sources (Planned)
- **Yahoo Finance API**: Real-time market cap data
- **Alpha Vantage**: Historical and current market data
- **Financial Modeling Prep**: Comprehensive financial metrics
- **IEX Cloud**: Real-time stock data and market intelligence
- **Quandl**: Financial and economic data

### Environment Configuration
- **OpenRouter API Key**: Required for AI model access
- **Financial API Keys**: Multiple providers for redundancy and accuracy
- **Rate Limiting**: Configurable API call limits and retry logic

## Current Limitations & Improvements Needed

### RAG System Issues
- **Generic Web Search**: Currently uses basic web search instead of specialized financial APIs
- **Accuracy Problems**: Web search results are inconsistent and often outdated
- **No Real-time Data**: Cannot access current market cap information
- **Limited Financial Context**: Lacks understanding of market dynamics and valuation metrics

### Required API Integrations
1. **Real-time Market Data APIs**:
   - Yahoo Finance API for current market caps
   - Alpha Vantage for historical data and trends
   - Financial Modeling Prep for comprehensive metrics

2. **Enhanced Validation Logic**:
   - Time-based validation (check whether the claim was accurate at the time of the presentation)
   - Market cap calculation verification (shares outstanding × price; sketched below)
   - Industry benchmarking and comparison

3. **Improved Pattern Recognition**:
   - Better company name extraction from slides
   - Context-aware claim detection
   - Support for different valuation metrics (enterprise value, etc.)
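
As an illustration of the calculation check in item 2, market cap can be recomputed from shares outstanding × price and compared against the claimed figure. The sketch below uses the `yfinance` package as a stand-in for the planned API integrations (it is not currently a project dependency, and `.info` keys vary by ticker):

```python
import yfinance as yf  # stand-in for the planned financial API integrations

def check_market_cap_claim(ticker, claimed_usd, tolerance=0.15):
    """Recompute market cap (shares outstanding × price) and compare it to the claim."""
    info = yf.Ticker(ticker).info             # note: 'marketCap' is also available directly
    actual = info["sharesOutstanding"] * info["currentPrice"]
    discrepancy = abs(actual - claimed_usd) / actual
    return {
        "ticker": ticker,
        "claimed": claimed_usd,
        "actual": actual,
        "discrepancy_pct": round(discrepancy * 100, 1),
        "verdict": "plausible" if discrepancy <= tolerance else "debunked",
    }

# Example: a deck claims a "$3.5T market cap" for AAPL
# print(check_market_cap_claim("AAPL", 3.5e12))
```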

## Usage

### Quick Start
```bash
# Make start script executable
chmod +x start.sh

# Run market cap validation on a PDF file
./start.sh presentation.pdf
```

### Manual Execution
```bash
# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run validation
python3 app.py presentation.pdf
```

### Market Cap Validation Only
```bash
# Validate market caps from processed folder
python3 modules/validate_market_caps.py --all

# Validate specific document
python3 modules/validate_market_caps.py --file slides.json --document "Company-Pitch"
```

## Output Structure

The tool generates:
1. **Processed Images**: Individual slide images in `processed/{document_name}/slides/`
2. **Validation Report**: A comprehensive debunking report with:
   - Executive summary of claim accuracy
   - Detailed validation results for each claim
   - Source attribution and confidence scores
   - Discrepancy analysis and explanations
   - Recommendations for improving accuracy
3. **Shareable Link**: Automatic upload to Hastebin for easy sharing

## Technical Features

### Market Cap Claim Detection
- **Pattern Recognition**: Multiple regex patterns for market cap identification
- **Context Analysis**: Confidence scoring based on surrounding text
- **Company Name Extraction**: Automatic identification of company names
- **Value Normalization**: Standardized handling of different value formats (B, M, K); see the sketch below
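
A minimal sketch of the value normalization (the accepted suffixes match the formats named above; the real module may handle more):

```python
_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12,
                "THOUSAND": 1e3, "MILLION": 1e6, "BILLION": 1e9, "TRILLION": 1e12}

def normalize_value(raw):
    """Convert '3.5B', '$700 million', '120K', etc. to a plain float in USD."""
    cleaned = raw.upper().replace("$", "").replace(",", "").strip()
    # Check longer (word) suffixes before single letters.
    for suffix in sorted(_MULTIPLIERS, key=len, reverse=True):
        if cleaned.endswith(suffix):
            return float(cleaned[: -len(suffix)].strip()) * _MULTIPLIERS[suffix]
    return float(cleaned)

assert normalize_value("$3.5B") == 3.5e9
assert normalize_value("700 million") == 700e6
```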

### Financial Validation (Planned)
- **Real-time API Integration**: Direct access to current market data
- **Historical Validation**: Check whether claims were accurate at presentation time
- **Market Context**: Industry comparisons and benchmarking
- **Multiple Data Sources**: Redundancy for accuracy verification

### Report Generation
- **Executive Summary**: High-level accuracy metrics and key findings
- **Detailed Analysis**: Slide-by-slide validation results
- **Source Transparency**: Clear attribution of validation sources
- **Actionable Insights**: Specific recommendations for improvement

### Error Handling
- **API Rate Limiting**: Intelligent handling of API call limits (see the retry sketch below)
- **Data Validation**: Verification of extracted financial data
- **Graceful Degradation**: Continues processing even if individual validations fail
- **Comprehensive Logging**: Detailed error tracking and debugging
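
Rate limiting and graceful degradation combine naturally into one small retry helper. A sketch assuming simple exponential backoff (the retry count and delays would come from the configurable limits mentioned above):

```python
import logging
import time

logger = logging.getLogger("validator")

def call_with_backoff(fn, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a flaky API call with exponential backoff; return None on final failure."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad by design: degrade gracefully, keep processing
            delay = base_delay * (2 ** attempt)
            logger.warning("call failed (%s); retry %d/%d in %.1fs",
                           exc, attempt + 1, retries, delay)
            time.sleep(delay)
    return None  # caller skips this validation instead of aborting the run
```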

## Development Setup

### Prerequisites
- Python 3.7+
- Virtual environment support
- OpenRouter API account
- Financial API accounts (Yahoo Finance, Alpha Vantage, etc.)

### Installation
1. Clone the repository
2. Create a virtual environment: `python3 -m venv venv`
3. Activate the environment: `source venv/bin/activate`
4. Install dependencies: `pip install -r requirements.txt`
5. Configure the `.env` file with API keys

### Configuration
- Copy `example.env` to `.env`
- Add your OpenRouter API key
- Add financial API keys:
```
YAHOO_FINANCE_API_KEY=your_key_here
ALPHA_VANTAGE_API_KEY=your_key_here
FINANCIAL_MODELING_PREP_API_KEY=your_key_here
```
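
Loading these keys at startup is a one-liner with python-dotenv. A minimal sketch (`OPENROUTER_API_KEY` is an assumed variable name; the financial key names match the template above):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]       # assumed name; required
ALPHA_VANTAGE_API_KEY = os.getenv("ALPHA_VANTAGE_API_KEY")  # optional provider
if ALPHA_VANTAGE_API_KEY is None:
    print("⚠️ ALPHA_VANTAGE_API_KEY not set; skipping that provider")
```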

## Planned Improvements

### Phase 1: Financial API Integration
- Implement Yahoo Finance API for real-time market cap data
- Add Alpha Vantage for historical data and trends
- Create API rate limiting and error handling

### Phase 2: Enhanced Validation Logic
- Time-based validation (check accuracy at presentation date)
- Market cap calculation verification
- Industry benchmarking and comparison

### Phase 3: Advanced Features
- Support for different valuation metrics (enterprise value, etc.)
- Automated fact-checking for other financial claims
- Integration with SEC filings for public companies
- Machine learning for improved claim detection

## Technical Considerations

### Performance
- **API Optimization**: Efficient use of financial API calls
- **Caching Strategy**: Store validation results to avoid redundant API calls (see the sketch below)
- **Batch Processing**: Process multiple claims efficiently
- **Rate Limiting**: Respect API limits while maintaining speed
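
The caching strategy can be as simple as an in-memory TTL map keyed by ticker. A sketch (the TTL value is illustrative):

```python
import time

_CACHE = {}             # ticker -> (expiry_timestamp, market_cap)
_TTL_SECONDS = 15 * 60  # illustrative: refresh market caps every 15 minutes

def cached_market_cap(ticker, fetch):
    """Return a cached market cap, calling fetch(ticker) only on a miss or expiry."""
    now = time.time()
    entry = _CACHE.get(ticker)
    if entry and entry[0] > now:
        return entry[1]                       # cache hit: no API call spent
    value = fetch(ticker)                     # cache miss: exactly one API call
    _CACHE[ticker] = (now + _TTL_SECONDS, value)
    return value
```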

### Accuracy
- **Multiple Data Sources**: Cross-reference validation results (see the sketch below)
- **Time Context**: Consider when claims were made vs. current data
- **Market Dynamics**: Account for market volatility and timing
- **Data Quality**: Validate API responses for accuracy
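
Cross-referencing reduces single-provider errors: take the consensus of whatever providers responded and flag wide spreads. A sketch (the 10% spread threshold is illustrative):

```python
import statistics

def cross_reference(values_by_source, max_spread=0.10):
    """Combine market caps from several providers; flag disagreement beyond max_spread."""
    values = [v for v in values_by_source.values() if v is not None]
    if not values:
        return {"consensus": None, "reliable": False, "sources": 0}
    consensus = statistics.median(values)
    spread = (max(values) - min(values)) / consensus
    return {"consensus": consensus,
            "reliable": spread <= max_spread,
            "sources": len(values),
            "spread_pct": round(spread * 100, 1)}

# cross_reference({"yahoo": 3.41e12, "alpha_vantage": 3.39e12, "fmp": None})
```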

### Security
- **API Key Management**: Secure storage and rotation of API keys
- **Data Privacy**: Handle sensitive financial information appropriately
- **Rate Limiting**: Prevent API abuse and excessive costs
- **Error Handling**: Graceful handling of API failures

## File Structure

```
boxone-technical/
├── app.py                        # Main application entry point
├── start.sh                      # Development startup script
├── requirements.txt              # Python dependencies
├── .env                          # Environment configuration
├── example.env                   # Environment template
├── modules/                      # Core application modules
│   ├── market_cap_validator.py   # Main market cap validation interface
│   ├── rag_agent.py              # RAG agent for claim extraction and validation
│   ├── document_validator.py     # Document-level validation processing
│   ├── validation_report.py      # Report generation utilities
│   ├── pdf_processor.py          # PDF extraction and processing
│   ├── client.py                 # OpenRouter API client
│   └── ...                       # Additional utility modules
├── processed/                    # Output directory for validation results
└── venv/                         # Python virtual environment
```

## Current Status

**⚠️ Important**: The current RAG system uses generic web search, which is insufficient for accurate financial validation. The system needs integration with proper financial APIs to provide reliable market cap validation and claim debunking.

This tool is designed to be a comprehensive solution for **fast, accurate financial claim validation** using real-time market data and specialized financial intelligence APIs.
app.py
@@ -1,13 +1,18 @@
 #!/usr/bin/env python3
 
+print("🚀 APP.PY STARTING - IMMEDIATE FEEDBACK", flush=True)
+
 import sys
 import os
 import re
+import time
 from pathlib import Path
 
+print("📦 BASIC IMPORTS COMPLETE", flush=True)
+
 def generate_toc(markdown_content):
     """Generate a Table of Contents from markdown headers"""
-    print(" 📋 Generating Table of Contents...")
+    print(" 📋 Generating Table of Contents...", flush=True)
     lines = markdown_content.split('\n')
     toc_lines = []
     toc_lines.append("## Table of Contents")
@@ -34,61 +39,104 @@ def generate_toc(markdown_content):
     toc_lines.append("---")
     toc_lines.append("")
 
-    print(f" ✅ Generated TOC with {header_count} headers")
+    print(f" ✅ Generated TOC with {header_count} headers", flush=True)
     return '\n'.join(toc_lines)
 
 def main():
-    """Simple pitch deck analyzer"""
+    """Simple pitch deck analyzer with comprehensive debugging"""
+    print("🚀 PITCH DECK ANALYZER MAIN FUNCTION STARTING", flush=True)
+    print("=" * 50, flush=True)
+
     if len(sys.argv) < 2:
-        print("Usage: python app.py <pdf_file>")
+        print("❌ Usage: python app.py <pdf_file>", flush=True)
         return
 
     pdf_path = sys.argv[1]
     if not os.path.exists(pdf_path):
-        print(f"Error: File '{pdf_path}' not found")
+        print(f"❌ Error: File '{pdf_path}' not found", flush=True)
         return
 
-    print(f"🚀 Processing: {pdf_path}")
+    print(f"📁 Processing file: {pdf_path}", flush=True)
+    print(f"📁 File exists: {os.path.exists(pdf_path)}", flush=True)
+    print(f"📁 File size: {os.path.getsize(pdf_path)} bytes", flush=True)
 
     # Import what we need directly (avoid __init__.py issues)
-    print("📦 Importing modules...")
+    print("\n📦 IMPORTING MODULES", flush=True)
+    print("-" * 30, flush=True)
 
     sys.path.append('modules')
 
+    print(" 🔄 Importing client module...", flush=True)
    from client import get_openrouter_client
+    print(" ✅ client module imported successfully", flush=True)
+
+    print(" 🔄 Importing pdf_processor module...", flush=True)
    from pdf_processor import extract_slides_from_pdf
+    print(" ✅ pdf_processor module imported successfully", flush=True)
+
+    print(" 🔄 Importing analysis module...", flush=True)
    from analysis import analyze_slides_batch
+    print(" ✅ analysis module imported successfully", flush=True)
+
+    print(" 🔄 Importing markdown_utils module...", flush=True)
    from markdown_utils import send_to_api_and_get_haste_link
-    print("✅ Modules imported successfully")
+    print(" ✅ markdown_utils module imported successfully", flush=True)
+
+    print("✅ ALL MODULES IMPORTED SUCCESSFULLY", flush=True)
 
     # Extract slides
-    print("📄 Extracting slides...")
+    print("\n📄 EXTRACTING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling extract_slides_from_pdf...", flush=True)
+    start_time = time.time()
 
     slides = extract_slides_from_pdf(pdf_path, "processed", Path(pdf_path).stem)
-    print(f"✅ Extracted {len(slides)} slides")
+    extraction_time = time.time() - start_time
+    print(f" ✅ extract_slides_from_pdf completed in {extraction_time:.2f}s", flush=True)
+    print(f" 📊 Extracted {len(slides)} slides", flush=True)
+
+    # LIMIT TO FIRST 3 SLIDES FOR TESTING
+    print(f" 🔄 Limiting to first 3 slides for testing...", flush=True)
+    slides = slides[:3]
+    print(f" 📊 Processing {len(slides)} slides", flush=True)
 
     # Analyze slides
-    print("🧠 Analyzing slides...")
+    print("\n🧠 ANALYZING SLIDES", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Initializing API client...", flush=True)
 
     client = get_openrouter_client()
-    print("🔗 API client initialized")
+    print(" ✅ API client initialized successfully", flush=True)
+
+    print(" 🔄 Calling analyze_slides_batch...", flush=True)
+    analysis_start_time = time.time()
 
     analysis_results = analyze_slides_batch(client, slides)
-    print("✅ Analysis complete")
+    analysis_time = time.time() - analysis_start_time
+    print(f" ✅ analyze_slides_batch completed in {analysis_time:.2f}s", flush=True)
+    print(f" 📊 Analysis results: {len(analysis_results)} slides analyzed", flush=True)
 
     # Create report
-    print("📝 Creating report...")
+    print("\n📝 CREATING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Building markdown content...", flush=True)
 
     markdown_content = f"# Pitch Deck Analysis: {Path(pdf_path).stem}\n\n"
 
     # Add analysis metadata
     markdown_content += "This analysis was generated using multiple AI agents, each specialized in different aspects of slide evaluation.\n\n"
     markdown_content += f"**Source File:** `{Path(pdf_path).name}` (PDF)\n"
-    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed\n"
+    markdown_content += f"**Analysis Generated:** {len(slides)} slides processed (limited for testing)\n"
     markdown_content += "**Processing Method:** Individual processing with specialized AI agents\n"
     markdown_content += "**Text Extraction:** Docling-powered text transcription\n\n"
 
-    print(f"📊 Building markdown for {len(slides)} slides...")
+    print(f" 📊 Building markdown for {len(slides)} slides...", flush=True)
 
     for i, slide_data in enumerate(slides):
         slide_num = i + 1
+        analysis = analysis_results.get(slide_num, {})
+        print(f" 🔄 Processing slide {slide_num}/{len(slides)}...", flush=True)
 
-        print(f" 📄 Processing slide {slide_num}...")
-        analysis = analysis_results.get(slide_num, {})
 
         markdown_content += f"# Slide {slide_num}\n\n"
         markdown_content += f"\n\n"
@@ -107,20 +155,22 @@ def main():
             markdown_content += f"### {agent_name}\n\n"
             markdown_content += f"{agent_analysis}\n\n"
 
-            print(f" ✅ Added {agent_count} agent analyses")
+            print(f" ✅ Added {agent_count} agent analyses for slide {slide_num}", flush=True)
         else:
             markdown_content += "## Agentic Analysis\n\n"
             markdown_content += "No analysis available\n\n"
-            print(f" ⚠️ No analysis available for slide {slide_num}")
+            print(f" ⚠️ No analysis available for slide {slide_num}", flush=True)
 
         markdown_content += "---\n\n"
 
+    print(" ✅ Markdown content built successfully", flush=True)
+
     # Generate Table of Contents
-    print("📋 Generating Table of Contents...")
+    print(" 🔄 Generating Table of Contents...", flush=True)
     toc = generate_toc(markdown_content)
 
     # Insert TOC after the main title
-    print("🔗 Inserting TOC into document...")
+    print(" 🔄 Inserting TOC into document...", flush=True)
     lines = markdown_content.split('\n')
     final_content = []
     final_content.append(lines[0])  # Main title
@@ -129,24 +179,33 @@ def main():
     final_content.extend(lines[2:])  # Rest of content
 
     final_markdown = '\n'.join(final_content)
+    print(f" ✅ Final markdown created: {len(final_markdown)} characters", flush=True)
 
     # Save report
+    print("\n💾 SAVING REPORT", flush=True)
+    print("-" * 30, flush=True)
     output_file = f"processed/{Path(pdf_path).stem}_analysis.md"
-    print(f"💾 Saving report to: {output_file}")
-    os.makedirs("processed", exist_ok=True)
+    print(f" 🔄 Saving to: {output_file}", flush=True)
+
+    os.makedirs("processed", exist_ok=True)
     with open(output_file, 'w', encoding='utf-8') as f:
         f.write(final_markdown)
 
-    print(f"✅ Report saved successfully ({len(final_markdown)} characters)")
+    print(f" ✅ Report saved successfully ({len(final_markdown)} characters)", flush=True)
 
     # Always upload the report
-    print("🌐 Uploading report...")
+    print("\n🌐 UPLOADING REPORT", flush=True)
+    print("-" * 30, flush=True)
+    print(" 🔄 Calling send_to_api_and_get_haste_link...", flush=True)
 
     haste_url = send_to_api_and_get_haste_link(final_markdown, Path(pdf_path).stem)
     if haste_url:
-        print(f"✅ Report uploaded to: {haste_url}")
+        print(f" ✅ Report uploaded successfully: {haste_url}", flush=True)
     else:
-        print("❌ Upload failed")
+        print(" ❌ Upload failed - no URL returned", flush=True)
 
+    print("\n🎉 PROCESSING COMPLETE!", flush=True)
+    print("=" * 50, flush=True)
 
 if __name__ == "__main__":
+    print("🎯 __main__ BLOCK ENTERED", flush=True)
     main()
@@ -1,37 +1,74 @@
+print('🟡 ANALYSIS.PY: Starting import...', flush=True)
 import re
 from client import get_openrouter_client
+print('🟡 ANALYSIS.PY: Import complete!', flush=True)
 
 def analyze_slides_batch(client, slides_data, batch_size=1):
     """Process slides individually with specialized AI agents"""
-    print(f" Processing {len(slides_data)} slides individually...")
+    print(f" 📊 Processing {len(slides_data)} slides individually...", flush=True)
 
     all_results = {}
 
     for i, slide_data in enumerate(slides_data):
         slide_num = slide_data["page_num"]
-        print(f" 🔍 Analyzing slide {slide_num} ({i+1}/{len(slides_data)})...")
+        print(f" 🔍 Starting analysis of slide {slide_num} ({i+1}/{len(slides_data)})...", flush=True)
 
-        # Define specialized agents
+        # Define specialized agents with critical pitch deck questions
         agents = {
-            'content_extractor': {
-                'name': 'Content Extractor',
-                'prompt': 'Extract and summarize the key textual content from this slide. Focus on headlines, bullet points, and main messages.'
+            'problem_analyzer': {
+                'name': 'Problem Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+
+1. What's the core pain point being addressed?
+2. Is it backed by data or evidence?
+3. How big is the market impact of this problem?
+4. Why do existing solutions fail to solve this?
+
+Provide clear, specific answers to each question based on what you see in the slide.'''
             },
-            'visual_analyzer': {
-                'name': 'Visual Analyzer',
-                'prompt': 'Analyze the visual design elements of this slide. Comment on layout, colors, typography, and visual hierarchy.'
+            'solution_evaluator': {
+                'name': 'Solution Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+
+1. How does this solution outperform competitors?
+2. Is there proof of value (metrics, testimonials, case studies)?
+3. Can it scale effectively?
+4. Is the solution clearly explained and understandable?
+
+Provide clear, specific answers to each question based on what you see in the slide.'''
             },
-            'data_interpreter': {
-                'name': 'Data Interpreter',
-                'prompt': 'Identify and interpret any numerical data, charts, graphs, or metrics present on this slide.'
+            'market_opportunity_assessor': {
+                'name': 'Market Opportunity Assessment',
+                'prompt': '''Assess this slide focusing on these critical questions:
+
+1. What's the market size (TAM/SAM/SOM)?
+2. Is the market growing or declining?
+3. Are target customers clearly defined?
+4. Will customers actually pay for this?
+
+Provide clear, specific answers to each question based on what you see in the slide.'''
             },
-            'message_evaluator': {
-                'name': 'Message Evaluator',
-                'prompt': 'Evaluate the effectiveness of the message delivery and communication strategy on this slide.'
+            'traction_evaluator': {
+                'name': 'Traction Evaluation',
+                'prompt': '''Evaluate this slide focusing on these critical questions:
+
+1. What metrics demonstrate market demand?
+2. Is the traction sustainable or just a one-time spike?
+3. How will funding accelerate this growth?
+4. Is growth trending upward consistently?
+
+Provide clear, specific answers to each question based on what you see in the slide.'''
             },
-            'improvement_suggestor': {
-                'name': 'Improvement Suggestor',
-                'prompt': 'Suggest specific improvements for this slide in terms of clarity, impact, and effectiveness.'
+            'funding_analyzer': {
+                'name': 'Funding & Ask Analysis',
+                'prompt': '''Analyze this slide focusing on these critical questions:
+
+1. How much funding is being raised?
+2. How will the funds be allocated and used?
+3. What specific milestones are targeted with this funding?
+4. Is the valuation justified based on traction and market?
+
+Provide clear, specific answers to each question based on what you see in the slide.'''
             }
         }
@@ -39,17 +76,17 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
 
         # Analyze with each specialized agent
         for j, (agent_key, agent_config) in enumerate(agents.items()):
-            print(f" 🤖 Running {agent_config['name']} ({j+1}/5)...")
+            print(f" 🤖 Running {agent_config['name']} ({j+1}/5) for slide {slide_num}...", flush=True)
 
             messages = [
                 {
                     "role": "system",
-                    "content": f"You are a {agent_config['name']} specialized in analyzing pitch deck slides. {agent_config['prompt']}"
+                    "content": f"You are a pitch deck analyst specialized in {agent_config['name']}. Answer the critical questions based on what you observe in the slide. If a question doesn't apply to this slide, say 'Not applicable to this slide' and briefly explain why."
                 },
                 {
                     "role": "user",
                     "content": [
-                        {"type": "text", "text": f"Analyze slide {slide_num}:"},
+                        {"type": "text", "text": f"Analyze slide {slide_num} and answer these critical questions:\n\n{agent_config['prompt']}"},
                         {
                             "type": "image_url",
                             "image_url": {
@@ -61,15 +98,15 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
             ]
 
             try:
-                print(f" 📡 Sending API request...")
+                print(f" 📡 Sending API request to {agent_config['name']}...", flush=True)
                 response = client.chat.completions.create(
                     model="gpt-4o-mini",
                     messages=messages,
-                    max_tokens=500
+                    max_tokens=800
                 )
 
                 analysis = response.choices[0].message.content.strip()
-                print(f" ✅ {agent_config['name']} completed ({len(analysis)} chars)")
+                print(f" ✅ {agent_config['name']} completed for slide {slide_num} ({len(analysis)} chars)", flush=True)
 
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
@@ -77,14 +114,14 @@ def analyze_slides_batch(client, slides_data, batch_size=1):
                 }
 
             except Exception as e:
-                print(f" ❌ {agent_config['name']} failed: {str(e)}")
+                print(f" ❌ {agent_config['name']} failed for slide {slide_num}: {str(e)}", flush=True)
                 slide_analysis[agent_key] = {
                     'agent': agent_config['name'],
                     'analysis': f"Error analyzing slide {slide_num}: {str(e)}"
                 }
 
         all_results[slide_num] = slide_analysis
-        print(f" ✅ Slide {slide_num} analysis complete")
+        print(f" ✅ Slide {slide_num} analysis complete - {len(slide_analysis)} agents finished", flush=True)
 
-    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!")
+    print(f" 🎉 All {len(slides_data)} slides analyzed successfully!", flush=True)
     return all_results
@@ -1,3 +1,4 @@
+print('🔵 CLIENT.PY: Starting import...')
 #!/usr/bin/env python3
 
 import os
@@ -21,3 +22,4 @@ def get_openrouter_client():
         base_url="https://openrouter.ai/api/v1",
         api_key=api_key
     )
+print('🔵 CLIENT.PY: Import complete!')
@@ -1,3 +1,4 @@
+print('🔴 DOCLING_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 from docling.document_converter import DocumentConverter
@@ -170,3 +171,4 @@ def get_slide_text_content(text_content, slide_num):
     except Exception as e:
         print(f"⚠️ Error extracting text for slide {slide_num}: {e}")
         return f"[Text content for slide {slide_num} could not be extracted]"
+print('🔴 DOCLING_PROCESSOR.PY: Import complete!')
@@ -1,3 +1,4 @@
+print('🟠 FILE_UTILS.PY: Starting import...')
 #!/usr/bin/env python3
 
 import subprocess
@@ -109,3 +110,4 @@ def convert_with_libreoffice(input_file, output_pdf, file_type):
     except Exception as e:
         print(f"❌ LibreOffice conversion error: {e}")
         return None
+print('🟠 FILE_UTILS.PY: Import complete!')
@@ -1,3 +1,4 @@
+print('🟣 MARKDOWN_UTILS.PY: Starting import...', flush=True)
 #!/usr/bin/env python3
 
 import re
@@ -5,169 +6,66 @@ import requests
 import json
 
 
-def clean_markdown_text(text):
-    """Clean markdown text to ensure it's plaintext with no special characters"""
-    if not text:
-        return ""
-
-    # Remove LaTeX commands and math expressions
-    text = re.sub(r'\\[a-zA-Z]+\{[^}]*\}', '', text)  # Remove \command{content}
-    text = re.sub(r'\$[^$]*\$', '', text)  # Remove $math$ expressions
-    text = re.sub(r'\\[a-zA-Z]+', '', text)  # Remove remaining \commands
-
-    # Remove markdown formatting but keep the text
-    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Remove bold **text**
-    text = re.sub(r'\*([^*]+)\*', r'\1', text)  # Remove italic *text*
-    text = re.sub(r'`([^`]+)`', r'\1', text)  # Remove code `text`
-    text = re.sub(r'#{1,6}\s*', '', text)  # Remove headers # ## ###
-
-    # Remove special characters but keep basic punctuation
-    text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)\[\]\"\'\/\&\%\@\#\$\+\=\<\>]', ' ', text)
-
-    # Clean up multiple spaces and newlines
-    text = re.sub(r'\s+', ' ', text)
-    text = re.sub(r'\n\s*\n', '\n\n', text)
-
-    return text.strip()
-
-
-def create_slide_markdown(slide_data, analysis_results, slide_num, slide_text=""):
-    """Create markdown content for a single slide with all agentic analyses and text content"""
-
-    markdown = f"""# Slide {slide_num}
-
-
-"""
-
-    # Add text content if available
-    if slide_text and slide_text.strip():
-        # Clean the slide text to ensure it's plaintext
-        clean_slide_text = clean_markdown_text(slide_text)
-        markdown += f"""## Text Content
-
-{clean_slide_text}
-
-"""
-
-    markdown += """## Agentic Analysis
-
-"""
-
-    for prompt_key, result in analysis_results.items():
-        # Clean the analysis text to ensure it's plaintext
-        clean_analysis = clean_markdown_text(result['analysis'])
-
-        markdown += f"""### {result['agent']}
-
-{clean_analysis}
-
-"""
-
-    markdown += "---\n\n"
-    return markdown
-
-
-def create_text_only_markdown(markdown_content):
-    """Create a text-only version of markdown without image references for API submission"""
-    # Remove image markdown blocks but keep the text descriptions and analysis
-    text_only = markdown_content
-
-    # Remove image embedding lines
-    text_only = re.sub(r'!\[.*?\]\(slides/.*?\)\n', '', text_only)
-
-    # Remove image link lines
-    text_only = re.sub(r'\*\[View full size: slides/.*?\]\(slides/.*?\)\*\n', '', text_only)
-
-    # Remove horizontal rules that were added for slide separation
-    text_only = re.sub(r'^---\n', '', text_only, flags=re.MULTILINE)
-
-    # Clean up extra newlines
-    text_only = re.sub(r'\n{3,}', '\n\n', text_only)
-
-    # Apply final text cleaning to ensure plaintext
-    text_only = clean_markdown_text(text_only)
-
-    return text_only.strip()
-
-
 def send_to_api_and_get_haste_link(markdown_content, document_title):
-    """Send markdown to API and get both raw markdown and HTML URLs"""
+    """Send FULL structured markdown to API and get both raw markdown and HTML URLs"""
     try:
-        print("Sending to API for URLs...")
+        print("Sending to API for URLs...", flush=True)
 
-        # Create text-only version for API
-        text_only_markdown = create_text_only_markdown(markdown_content)
+        # Send the FULL structured markdown - NO STRIPPING, NO CLEANING
+        # Only remove local image references since they won't work online
+        online_markdown = re.sub(r'!\[Slide (\d+)\]\(slides/[^\)]+\)', r'**[Slide \1 Image]**', markdown_content)
 
-        # First, send raw markdown to haste.nixc.us
+        # First, send to haste.nixc.us for raw markdown
         raw_haste_url = None
         try:
-            print(" 📝 Creating raw markdown URL...")
+            print(" 📝 Creating raw markdown URL...", flush=True)
             raw_response = requests.post(
                 "https://haste.nixc.us/documents",
-                data=text_only_markdown.encode('utf-8'),
+                data=online_markdown.encode('utf-8'),
                 headers={"Content-Type": "text/plain"},
                 timeout=30
             )
 
             if raw_response.status_code == 200:
-                raw_token = raw_response.text.strip().strip('"')
-                # Extract just the token from JSON response if needed
-                if raw_token.startswith('{"key":"') and raw_token.endswith('"}'):
-                    import json
-                    try:
-                        token_data = json.loads(raw_token)
-                        raw_token = token_data['key']
-                    except:
-                        pass
+                response_data = raw_response.json()
+                raw_token = response_data.get('key', '')
                 raw_haste_url = f"https://haste.nixc.us/{raw_token}"
-                print(f" ✅ Raw markdown URL created")
+                print(f" ✅ Raw markdown URL created", flush=True)
             else:
-                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}")
+                print(f" ⚠️ Raw markdown upload failed with status {raw_response.status_code}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create raw markdown URL: {e}")
+            print(f" ⚠️ Failed to create raw markdown URL: {e}", flush=True)
 
         # Then, send to md.colinknapp.com for HTML version
         html_url = None
         try:
-            print(" 🎨 Creating HTML version URL...")
+            print(" 🎨 Creating HTML version URL...", flush=True)
             api_data = {
-                "markdown": text_only_markdown,
-                "format": "html",
-                "template": "playful",
                 "title": f"Pitch Deck Analysis: {document_title}",
-                "subtitle": "AI-Generated Analysis with Agentic Insights",
-                "contact": "Generated by Pitch Deck Parser",
-                "send_to_haste": True
+                "content": online_markdown
            }
 
             response = requests.post(
-                "https://md.colinknapp.com/api/convert",
+                "https://md.colinknapp.com/haste",
                 headers={"Content-Type": "application/json"},
-                data=json.dumps(api_data),
+                json=api_data,
                 timeout=30
             )
 
             if response.status_code == 200:
                 result = response.json()
-                if 'haste_url' in result:
-                    # Extract token from haste_url and format as requested
-                    haste_url = result['haste_url']
-                    if 'haste.nixc.us/' in haste_url:
-                        token = haste_url.split('haste.nixc.us/')[-1]
-                        html_url = f"https://md.colinknapp.com/haste/{token}"
-                    else:
-                        html_url = haste_url
-                    print(f" ✅ HTML version URL created")
-                else:
-                    print(" ⚠️ API response missing haste_url")
+                html_url = result.get('url', '')
+                print(f" ✅ HTML version URL created", flush=True)
             else:
-                print(f" ⚠️ HTML API request failed with status {response.status_code}")
+                print(f" ⚠️ HTML API request failed with status {response.status_code}", flush=True)
+                print(f" Response: {response.text[:200]}", flush=True)
         except Exception as e:
-            print(f" ⚠️ Failed to create HTML URL: {e}")
+            print(f" ⚠️ Failed to create HTML URL: {e}", flush=True)
 
         return raw_haste_url, html_url
 
     except Exception as e:
-        print(f"⚠️ Failed to send to API: {e}")
+        print(f"⚠️ Failed to send to API: {e}", flush=True)
         return None, None
+
+print('🟣 MARKDOWN_UTILS.PY: Import complete!', flush=True)
@@ -1,3 +1,4 @@
+print('🟢 PDF_PROCESSOR.PY: Starting import...')
 #!/usr/bin/env python3
 
 import base64
@@ -58,3 +59,4 @@ def extract_slides_from_pdf(pdf_path, output_dir, document_name):
     except Exception as e:
         print(f"❌ Error extracting slides: {e}")
         return []
+print('🟢 PDF_PROCESSOR.PY: Import complete!')
(18 binary slide images deleted from the processed/ directory, 32 KiB to 2.3 MiB each; image diffs not shown)
start.sh
@@ -1,58 +1,10 @@
 #!/bin/bash
 
-# Kill any process running on port 3123
-echo "Killing any existing processes on port 3123..."
-fuser -k 3123/tcp 2>/dev/null || true
-
-# Create virtual environment if it doesn't exist
-if [ ! -d "venv" ]; then
-    echo "Creating virtual environment..."
-    python3 -m venv venv
-fi
-
-# Activate virtual environment
-echo "Activating virtual environment..."
-source venv/bin/activate
-
-# Verify virtual environment is active
-echo "Verifying virtual environment..."
-which python3
-python3 --version
-
-# Install dependencies
-echo "Installing dependencies..."
-pip install -r requirements.txt
-
-# Check for help flag
-if [ "$1" = "--help" ] || [ "$1" = "-h" ]; then
-    echo ""
-    echo "Pitch Deck Analysis Application"
-    echo "=============================="
-    echo "Usage: ./start.sh <file_path>"
-    echo "Example: ./start.sh presentation.pdf"
-    echo ""
-    echo "The application will automatically upload the generated report."
-    echo ""
-    exit 0
-fi
-
-# Verify file exists
 if [ -z "$1" ]; then
     echo "Error: No file specified"
     echo "Usage: ./start.sh <file_path>"
     exit 1
 fi
 
-if [ ! -f "$1" ]; then
-    echo "Error: File '$1' not found"
-    exit 1
-fi
-
-# Start the application with immediate feedback
 echo "Starting pitch deck parser..."
-echo "Processing file: $1"
-echo "Python path: $(which python3)"
-echo "Working directory: $(pwd)"
-echo "----------------------------------------"
-
 python3 app.py "$1"