diff --git a/docker/crawler/README.md b/docker/crawler/README.md
new file mode 100644
index 0000000..d1a0991
--- /dev/null
+++ b/docker/crawler/README.md
@@ -0,0 +1,60 @@
+## Ploughshares Crawler
+
+This directory contains the crawler that retrieves information from various sources (Marketline, Google dorks) about military exports, uses Google Gemini to intelligently (allegedly) extract the deals that have a connection to **Canadian** military exports, and then pushes those deals to our temporary Postgres DB.
+
+From there, they can be approved or looked into further before being added to the main Ploughshares DB.
+
+```mermaid
+flowchart TD
+    A[Marketline Scraper<br/>Manual Login Required] -->|Extracts raw markdown<br/>Saves JSON| R[Crawl Results Folder]
+    B[Google Dorks Scraper] -->|Saves JSON| R
+    C[Other URLs<br/>via Marketline Crawler] -->|Saves JSON| R
+
+    R --> D[analyze.py<br/>Parses JSON and extracts<br/>Canada military exports]
+    D -->|Writes deals JSON| E[extracted_deals/.json]
+
+    E --> F[Uploader Script<br/>Pushes to Backend API]
+    F --> G[(Postgres Database)]
+```
+
+### Marketline Scraper: Two Approaches
+
+#### Approach One - Saved Cookies
+Marketline requires a university login to access, and occasionally uses CAPTCHAs to prevent scraping. Our first, more hands-off/automatic approach is implemented in `marketline_crawler.py`: the user logs in once, their cookies are saved, and a scraping session is then started using those cookies for auth.
+This was very easy to use, but it could not handle CAPTCHAs effectively, and the user still had to log in every time the cookies expired.
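+
+The cookie capture looks roughly like this (a minimal sketch using Playwright's `storage_state`; the URL and file name are illustrative, not necessarily what `marketline_crawler.py` uses):
+
+```python
+import asyncio
+from playwright.async_api import async_playwright
+
+async def save_login_state(state_file: str = "marketline_state.json"):
+    """Open a visible browser, let the user log in manually, then save the session."""
+    async with async_playwright() as p:
+        browser = await p.chromium.launch(headless=False)
+        context = await browser.new_context()
+        page = await context.new_page()
+        await page.goto("https://advantage.marketline.com")  # illustrative URL
+        input("Log in (and clear any CAPTCHA), then press Enter here... ")
+        # Persist cookies and local storage; later runs pass this file as storage_state.
+        await context.storage_state(path=state_file)
+        await browser.close()
+
+asyncio.run(save_login_state())
+```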
+
+#### Approach Two - User Handoff
+In this approach, implemented in `marketline_handoff.py`, every time a scrape needs to be run the user must first log in, solve any CAPTCHAs, and then leave their browser open at the desired scraping page. At that point they press Enter in the program, and the scraper takes it from there, starting its scrape already authenticated and on the correct page.
+This requires more manual work, but it is more robust to sign-in errors, CAPTCHAs, and changes to the sign-in flow.
+
+#### How The Scraper Works
+The scraper uses [Crawl4AI](https://github.com/unclecode/crawl4ai) to scrape Marketline. The beauty of this solution is that the scraper can be easily adapted to **pretty much any source**, not just Marketline (though some specifics of our implementation were optimized for Marketline and would need to be tweaked, such as the URL filtering that keeps the scraper out of irrelevant parts of the site like the "about us" section or non-news articles).
+
+The scraper is seeded from a given URL and branches out to the pages linked from it, forming a tree. We set the max depth to 2 and the max total pages scraped to 50, but both are adaptable. The scraper extracts the page content as markdown, which is included in the JSON under `content`, along with other info such as the source URL, the scraping timestamp, the depth at which the page was found, and the score. The score is determined by the prevalence of specified keywords such as "canada", "military", etc., and it guides the crawler in deciding which pages to explore next. These keywords can be easily adapted, ideally from the front end; as of writing, they are hardcoded in a Python list in both crawler files. The core of the crawl configuration is sketched below.
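+
+A sketch against Crawl4AI's deep-crawling API (the exact keyword list and parameter values live in the crawler files; treat these as illustrative):
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
+from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
+
+# Pages scoring higher on these keywords get explored first.
+scorer = KeywordRelevanceScorer(keywords=["canada", "military", "export"], weight=0.7)
+
+config = CrawlerRunConfig(
+    deep_crawl_strategy=BestFirstCrawlingStrategy(
+        max_depth=2,   # follow links at most 2 hops from the seed URL
+        max_pages=50,  # total page budget for the whole crawl
+        url_scorer=scorer,
+    ),
+    stream=True,  # yield results as they complete
+)
+
+async def crawl(seed_url: str):
+    async with AsyncWebCrawler() as crawler:
+        async for result in await crawler.arun(seed_url, config=config):
+            if result.success:
+                print(result.url, result.metadata.get("score", 0))
+
+asyncio.run(crawl("https://example.com/defence-news"))
+```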
+
+### Gemini Analyzer
+Implemented in `analyze.py`, this script goes through all the extracted content and has Gemini read it and extract deals that may relate to the Canadian arms trade. If you want to see the extraction prompt, it lives in `analyze.py`. Any extracted transactions are written to `extracted_arms_deals.json`. The output schema is below, with an illustrative example after the list:
+
+**Required fields (must be provided; use "Not Found" if absent):**
+- transaction_type (string)  # e.g., "Export", "Purchase Order", "Component Supply", "Maintenance Contract", "Grant"
+- company_division (string)  # The primary company or division involved.
+- recipient (string)  # The receiving country, company, or entity.
+
+**Optional fields (include if present, otherwise omit the key):**
+- amount (string or number)  # e.g., "15,000,000 CAD"
+- description (string)  # A summary of the transaction.
+- address_1, address_2, city, province, region, postal_code
+- source_date (string YYYY-MM-DD)
+- source_description (string)
+- grant_type (string)
+- commodity_class (string)  # e.g., "Armoured Vehicles", "Avionics", "Engine Components", "Naval Systems"
+- contract_number (string)
+- comments (string)
+- is_primary (boolean)
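+
+For illustration, one extracted deal might look like this (the values are invented; the two relevance fields at the end are required by the prompt in `analyze.py`):
+
+```json
+{
+  "transaction_type": "Export",
+  "company_division": "Example Corp Canada",
+  "recipient": "Country X",
+  "amount": "3,000,000 CAD",
+  "commodity_class": "Avionics",
+  "description": "Example summary of the transaction.",
+  "canadian_relevance": "direct",
+  "relation_explanation": "Company is based in Canada and shipped avionics modules."
+}
+```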
+
+### API Integration
+The arms deals found by Gemini are then sent to the Flask backend via the API endpoint, where they get written to the Postgres DB. From there, they should show up in a list view on the frontend to be reviewed, and someone at Ploughshares can investigate each one further through its source URL. Each deal can then be added to the Ploughshares DB and archived, or rejected, as determined by a human in the loop. A sketch of the upload step follows.
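+
+A minimal sketch of the upload step (the endpoint path and port are placeholders; check the uploader script for the real values):
+
+```python
+import json
+import requests
+
+API_URL = "http://localhost:5000/api/transaction"  # placeholder endpoint
+
+with open("crawl_results/extracted_arms_deals.json") as f:
+    deals = json.load(f)
+
+for deal in deals:
+    # Skip entries Gemini classified as having no Canadian connection.
+    if deal.get("canadian_relevance") == "none":
+        continue
+    resp = requests.post(API_URL, json=deal, timeout=30)
+    resp.raise_for_status()
+```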
+
+## MORE TODO
+- Dedupe: right now we're not really doing anything to prevent scraping and analyzing the same page twice. We could dedupe URLs against the database or some other way; one possible shape is sketched below.
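+
+A sketch only (the table and column names here are made up and would need to match the real schema):
+
+```python
+import psycopg2
+
+def filter_new_urls(urls):
+    """Drop URLs we have already scraped, using the DB as the source of truth."""
+    conn = psycopg2.connect("dbname=ploughshares")  # placeholder DSN
+    with conn, conn.cursor() as cur:
+        cur.execute(
+            "SELECT source_url FROM transactions WHERE source_url = ANY(%s)",
+            (list(urls),),
+        )
+        seen = {row[0] for row in cur.fetchall()}
+    return [u for u in urls if u not in seen]
+```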
\ No newline at end of file
diff --git a/docker/crawler/analyze.py b/docker/crawler/analyze.py
index 63b11b0..1cd36bd 100644
--- a/docker/crawler/analyze.py
+++ b/docker/crawler/analyze.py
@@ -27,71 +27,98 @@
 INPUT_FILE = os.path.join("crawl_results", "successful_pages.json")
 # output JSON any extracted deals from the scraped data (API-ready schema)
 OUTPUT_FILE = os.path.join("crawl_results", "extracted_arms_deals.json")
-MODEL_NAME = "gemini-2.0-flash-lite"
+
+# TODO: we can use 2.0 Flash-Lite; it just has a lower requests-per-day limit and seems to perform slightly worse...
+# we should switch if we run into significant requests-per-minute issues...
+# see the most up-to-date docs below; my info above may become outdated by the time anyone reads this
+# see the overview: https://ai.google.dev/gemini-api/docs/rate-limits
+
+MODEL_NAME = "gemini-2.5-flash-lite"
 
 # Prompt: instruct model to return API schema fields and to explicitly indicate
 # if and how the result is related to Canada (direct, indirect, none).
 EXTRACTION_PROMPT = """
-You are a precise data-extraction system.
+You are an expert intelligence analyst specializing in the global defense supply chain. Your task is to act as a precise data-extraction system.
 
-Given the DOCUMENT TEXT below, extract ALL transactions or arms-export relevant
-entries and output a JSON array (possibly empty) of objects that match the
-Project Ploughshares API schema. Output ONLY the JSON array — no markdown,
-no commentary, no code fences.
+Given the DOCUMENT TEXT below, your mission is to identify and extract ALL potential transactions, contracts, supply chain mentions, or other arms-export relevant events with a potential connection to Canada. Your primary objective is high recall; you should err on the side of including an entry if it has any plausible link to Canada.
 
-Each object must use the following fields (required fields must be provided
-and set to "Not Found" if absent):
+Output a JSON array of objects that match the Project Ploughshares API schema. Output ONLY the JSON array — no markdown, no commentary, no code fences.
-Required fields:
-- transaction_type (string) # e.g., "Export", "Purchase Order", "Component Supply"
-- company_division (string) # company or division name (use "Not Found" if unknown)
-- recipient (string) # receiving country or recipient (use "Not Found" if unknown)
+---
+### Guiding Principles & Heuristics
 
-Optional fields (include if present):
-- amount (string or number) # monetary value if present (e.g., "15,000,000 CAD")
-- description (string)
+To determine Canadian relevance, use the following rules:
+
+1. **Canadian Company Identification:** A company or division is considered Canadian if:
+   a. Its name explicitly includes "Canada" (e.g., "L3Harris Canada").
+   b. Its address is located within Canada.
+   c. It is one of the following known major players in the Canadian defense industry:
+      - General Dynamics Land Systems-Canada (GDLS-C)
+      - CAE Inc.
+      - Bombardier
+      - L3Harris Technologies Canada
+      - Thales Canada
+      - MDA
+      - IMP Group
+      - Magellan Aerospace
+      - Heroux-Devtek
+      - PAL Aerospace
+      - Irving Shipbuilding
+      - Seaspan Shipyards
+      - Babcock Canada
+   d. The text describes it as a "Canadian company" or "based in Canada".
+
+2. **Indirect Link Identification:** An 'indirect' link exists when Canadian-made parts, materials, or sub-systems are part of a larger product assembled or sold by a non-Canadian entity. Look for phrases like:
+   - "powered by engines from..."
+   - "utilizing components supplied by..."
+   - "avionics provided by..."
+   - "built with steel from..."
+   - "the supply chain includes..."
+
+3. **Transaction Definition:** A "transaction" is defined broadly. It can be a direct sale, a purchase order, a maintenance contract, a government grant for development, a component supply agreement, or even a confirmed report of a transfer.
+
+---
+### JSON Output Schema
+
+Each object in the output array must use the following fields.
+
+**Required fields (must be provided; use "Not Found" if absent):**
+- transaction_type (string) # e.g., "Export", "Purchase Order", "Component Supply", "Maintenance Contract", "Grant"
+- company_division (string) # The primary company or division involved.
+- recipient (string) # The receiving country, company, or entity.
+
+**Optional fields (include if present, otherwise omit the key):**
+- amount (string or number) # e.g., "15,000,000 CAD"
+- description (string) # A summary of the transaction.
 - address_1, address_2, city, province, region, postal_code
 - source_date (string YYYY-MM-DD)
 - source_description (string)
 - grant_type (string)
-- commodity_class (string) # e.g., missile components, avionics, engines
+- commodity_class (string) # e.g., "Armoured Vehicles", "Avionics", "Engine Components", "Naval Systems"
 - contract_number (string)
 - comments (string)
 - is_primary (boolean)
 
-Additionally, include these two new fields to help filter relevance:
-- canadian_relevance (string) # one of: "direct", "indirect", "none"
-  - "direct" = Canadian company or Canada-origin export of military goods/components
-  - "indirect" = Canadian-made parts/components appear in a larger export (final assembly elsewhere)
-  - "none" = no meaningful Canadian connection
-- relation_explanation (string) # short explanation why this is direct/indirect/none (1-2 sentences)
+**Mandatory Relevance Analysis Fields:**
+- canadian_relevance (string) # Must be one of: "direct", "indirect", "none"
+  - "direct": A Canadian company or the Canadian government is directly exporting/selling military goods or services.
+  - "indirect": Canadian-made parts, materials, or sub-systems are identified as being part of a larger system exported by another country.
+  - "none": No meaningful Canadian connection can be established.
+- relation_explanation (string) # A brief (1-2 sentence) explanation for the 'canadian_relevance' classification, citing the evidence from the text.
 
-Rules:
-1. If a piece of info cannot be found, set it to the string "Not Found" (not null).
-2. If multiple transactions are described in the text, output them as separate objects.
-3. If the text contains the same transaction repeated, ensure you only output one object per distinct transaction.
-4. Output must be valid JSON (an array). Example:
-  [
-    {{
-      "transaction_type": "Export",
-      "company_division": "Example Corp Canada",
-      "recipient": "Country X",
-      "amount": "3,000,000 CAD",
-      "commodity_class": "avionics modules",
-      "description": "Example summary ...",
-      "source_url": "https://example.com/article",
-      "canadian_relevance": "direct",
-      "relation_explanation": "Company is based in Canada and shipped avionics modules."
-    }}
-  ]
+---
+### Final Output Rules
+1. If a required field's value cannot be found in the text, you MUST set its value to the string "Not Found". Do not use null.
+2. If multiple distinct transactions are described, output them as separate objects in the array.
+3. Do not duplicate the same transaction. If mentioned multiple times, consolidate into one object.
+4. Your final output must be ONLY the raw, valid JSON array.
+
+---
 
 DOCUMENT TEXT:
 {text_content}
 """
 
-# -------------------------
-# Helper functions
-# -------------------------
 def load_scraped_data(filepath):
     """Loads the scraped data from the JSON file created by the crawler."""
     try:
diff --git a/docker/crawler/marketline_handoff.py b/docker/crawler/marketline_handoff.py
index 5fa08e3..2762c9c 100644
--- a/docker/crawler/marketline_handoff.py
+++ b/docker/crawler/marketline_handoff.py
@@ -4,6 +4,7 @@
 # more reliable, easier to debug and captcha resistant
 import asyncio
+from crawl4ai import UndetectedAdapter
 from itertools import chain
 from playwright.async_api import async_playwright
 import json
@@ -103,9 +104,10 @@ async def main():
         await browser.close()
 
     # --- STEP 2: Configure and Run the Crawler with the Captured State ---
-
+    adapter = UndetectedAdapter()
     # Pass the captured 'storage_state' dictionary to the crawler's browser configuration.
     browser_config = BrowserConfig(
+        # enable_stealth=True,
        headless=False,
        storage_state=storage_state  # This injects your logged-in session.
    )
@@ -123,6 +125,7 @@
     # This configuration remains the same
     config = CrawlerRunConfig(
+
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=DEPTH,
@@ -141,7 +144,7 @@
     logging.info("Starting crawler with the captured session state...")
 
-    async with AsyncWebCrawler(config=browser_config) as crawler:
+    async with AsyncWebCrawler(config=browser_config, browser_adapter=adapter) as crawler:
         # The crawler will now begin at the correct URL you navigated to.
         async for result in await crawler.arun(start_url, config=config):
             if result.success:
diff --git a/docker/crawler/run_all.sh b/docker/crawler/run_all.sh
index 42f8888..86781d8 100644
--- a/docker/crawler/run_all.sh
+++ b/docker/crawler/run_all.sh
@@ -8,9 +8,9 @@
 echo "📡 Crawling data..."
 python marketline_crawler.py
 
 echo "🧠 Analyzing with Gemini..."
-python analyze.py crawled_data.json results.json
+python analyze.py crawl_results/successful_pages.json crawl_results/extracted_arms_deals.json
 
 echo "📤 Sending to API..."
-python write_to_api.py results.json
+python write_to_api.py crawl_results/extracted_arms_deals.json
 
 echo "✅ All done!"