Ploughshares Crawler

This directory contains the crawler that retrieves information from various sources (Marketline, Google dorks) about military exports, uses Google Gemini to intelligently (allegedly) extract the deals that have a connection to Canadian military exports, and then pushes those deals to our temporary Postgres DB.

From there, they can be approved or investigated further before being added to the main Ploughshares DB.

```mermaid
flowchart TD
    A[Marketline Scraper<br>Manual Login Required] -->|Extracts raw markdown<br>Saves JSON| R[Crawl Results Folder]
    B[Google Dorks Scraper] -->|Saves JSON| R
    C[Other URLs<br>via Marketline Crawler] -->|Saves JSON| R

    R --> D[analyze.py<br>Parses JSON and extracts<br>Canada military exports]
    D -->|Writes deals JSON| E[extracted_deals/<date>.json]

    E --> F[Uploader Script<br>Pushes to Backend API]
    F --> G[(Postgres Database)]
```

Marketline Scraper: Two Approaches

Approach One - Saved Cookies

Marketline requires a university login to access, and occasionally serves CAPTCHAs to prevent scraping. Our first, more hands-off/automatic approach is implemented in marketline_crawler.py: a user logs in once, their cookies are saved, and scraping sessions are then started using those cookies for auth. This was very easy to use, but it could not handle CAPTCHAs effectively, and the user still had to log in every time the cookies expired.
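
A minimal sketch of the cookie-reuse idea, using Playwright directly (Crawl4AI drives Playwright under the hood). The state-file name and URLs are illustrative, not the exact code in marketline_crawler.py:

```python
from playwright.sync_api import sync_playwright

STATE_FILE = "marketline_state.json"  # hypothetical filename, for illustration

def save_login_state():
    """Run once: let the user log in, then persist cookies/session state."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto("https://advantage.marketline.com/")
        input("Log in through your university portal, then press Enter... ")
        context.storage_state(path=STATE_FILE)  # writes cookies + local storage
        browser.close()

def authenticated_context(p):
    """Later scraping runs reuse the saved state until it expires."""
    browser = p.chromium.launch(headless=True)
    return browser.new_context(storage_state=STATE_FILE)
```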

Approach Two - User Handoff

In this approach, implemented in marketline_handoff.py, every time a scrape needs to run the user must first log in, solve any CAPTCHAs, and leave their browser open on the desired scraping page. They then press Enter in the program, and the scraper takes over, starting its scrape already authenticated and on the correct page. This is more work per run, but it is more robust to sign-in errors, CAPTCHAs, and changes to the sign-in flow.
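
A sketch of the handoff flow, again in plain Playwright (names and URLs are illustrative, not the exact marketline_handoff.py code): the script opens a visible browser, blocks until the user confirms, then scrapes inside the already-authenticated session.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://advantage.marketline.com/")
    input(
        "Log in, solve any CAPTCHAs, navigate to the page you want scraped, "
        "then press Enter here... "
    )
    # From this point the session is guaranteed to be authenticated and
    # sitting on the right page; the real scraper takes over from here.
    print(page.url, len(page.content()))
    browser.close()
```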

How The Scraper Works

The scraper uses Crawl4AI to scrape Marketline. The beauty of this solution is that the scraper can easily be adapted to almost any source, not just Marketline (though some specifics of our implementation were tuned for Marketline and would need tweaking, such as the URL filtering that keeps the scraper out of irrelevant parts of the site like the "about us" section or non-news articles).

The scraper is seeded from a given URL, and from there it branches out to pages linked from the source, forming a tree. We set the max depth to 2 and the max total pages scraped to 50, but both are adjustable. The scraper extracts each page's content as markdown, which is included in the JSON under [content], along with other info: the URL the content came from, the scraping timestamp, the depth at which the page was found, and the score. The score is determined by the prevalence of specified keywords such as "canada" and "military", and it guides the crawler in choosing which pages to explore next. These keywords should be easily adaptable, ideally from the front end; as of writing, they are hardcoded in a Python list in both crawler files.
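
A sketch of what that configuration could look like with Crawl4AI's deep-crawl API; the strategy class, filter pattern, keyword list, and seed URL here are assumptions for illustration, the real values live in the crawler files.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

KEYWORDS = ["canada", "military", "defence", "export"]  # illustrative list

async def crawl(seed_url: str) -> list[dict]:
    strategy = BestFirstCrawlingStrategy(
        max_depth=2,    # how far to branch from the seed
        max_pages=50,   # cap on total pages scraped
        url_scorer=KeywordRelevanceScorer(keywords=KEYWORDS),
        filter_chain=FilterChain([URLPatternFilter(patterns=["*news*"])]),
    )
    records = []
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            seed_url, config=CrawlerRunConfig(deep_crawl_strategy=strategy)
        )
        for r in results:
            records.append({
                "url": r.url,
                "content": str(r.markdown),          # page content as markdown
                "depth": r.metadata.get("depth"),
                "score": r.metadata.get("score"),
            })
    return records

records = asyncio.run(crawl("https://example.com/defence-news"))
```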

Gemini Analyzer

Implemented in analyze.py, this script goes through all the extracted content and has Gemini read it and extract deals that may relate to the Canadian arms trade. If you want to see the extraction prompt, check it out in the analyze.py file. Any extracted transactions are written to extracted_arms_deals.json. The output schema is below; a sketch of the Gemini call follows it.

Required fields (must be provided; use "Not Found" if absent):

  • transaction_type (string) # e.g., "Export", "Purchase Order", "Component Supply", "Maintenance Contract", "Grant"
  • company_division (string) # The primary company or division involved.
  • recipient (string) # The receiving country, company, or entity.

Optional fields (include if present, otherwise omit the key):

  • amount (string or number) # e.g., "15,000,000 CAD"
  • description (string) # A summary of the transaction.
  • address_1, address_2, city, province, region, postal_code
  • source_date (string YYYY-MM-DD)
  • source_description (string)
  • grant_type (string)
  • commodity_class (string) # e.g., "Armoured Vehicles", "Avionics", "Engine Components", "Naval Systems"
  • contract_number (string)
  • comments (string)
  • is_primary (boolean)
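
A rough sketch of the extraction step, assuming the google-generativeai client; the actual prompt and model name are in analyze.py, and the prompt text here only paraphrases the schema above.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name

def extract_deals(page_markdown: str) -> list[dict]:
    prompt = (
        "From the text below, extract transactions connected to Canadian "
        "military exports as a JSON list. Required keys: transaction_type, "
        'company_division, recipient (use "Not Found" if absent); optional '
        "keys may be omitted entirely.\n\n" + page_markdown
    )
    response = model.generate_content(prompt)
    return json.loads(response.text)  # assumes the model returns bare JSON
```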

API Integration

The arms deals found by Gemini are then sent to the Flask backend via its API endpoint, which writes them to the Postgres DB. From there, they show up on the frontend in a list view for review, and someone at Ploughshares can investigate each one further through its source URL. A human in the loop then decides whether each deal is added to the Ploughshares DB and archived, or rejected.
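
A sketch of the upload step; the endpoint path and payload shape are assumptions here, the real ones are in write_to_api.py.

```python
import json
import requests

API_URL = "http://localhost:5000/api/transaction"  # hypothetical endpoint

def push_deals(deals_file: str) -> None:
    with open(deals_file) as f:
        deals = json.load(f)
    for deal in deals:
        resp = requests.post(API_URL, json=deal, timeout=30)
        resp.raise_for_status()  # surface failures instead of dropping deals

push_deals("extracted_arms_deals.json")
```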

MORE TODO

  • Dedupe: right now we're not doing anything to prevent scraping and analyzing the same page twice. We could dedupe by URL against the database or some other way; see the sketch below.
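
One possible (not yet implemented) approach: normalize each URL and skip anything already recorded, with the seen-set backed by a Postgres table instead of the in-memory set used here.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Drop fragments/query strings and trailing slashes, lowercase the host."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path.rstrip("/"), "", ""))

seen: set[str] = set()  # in practice: a table of already-scraped URLs

def should_scrape(url: str) -> bool:
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```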