AWS S3 — Cloud Scanner (batch)

The aws-s3 client scans S3 buckets across an entire AWS Organization for files matching configured MIME types and submits them to highvolt-server for PII analysis.

How it works

  1. Authenticates to JSONAir and highvolt-server.

  2. Lists all accounts in the AWS Organization using the Organizations API.

  3. For each account (excluding any in the s3.exclude_users list), assumes the OrganizationAccountAccessRole IAM role via STS.

  4. Lists all S3 buckets in that account.

  5. For each bucket, paginates through all objects.

  6. For each object, calls Analyze():

    • Checks the local registry (GOB file keyed by s3://<bucket>/<key>) — skips if already seen.

    • Downloads the first 1 MB of the object to detect the MIME type via magic bytes.

    • If the MIME type is not in the configured list, records the path in the registry and moves on.

    • If the object exceeds core.max_file_size, it is recorded and skipped.

    • Queries highvolt-server to check if the SHA256 has already been analyzed.

    • If not analyzed, streams the full object while simultaneously computing SHA256/SHA1/MD5 and building a base64 payload in a single-pass io.MultiWriter.

    • Submits to highvolt-server and records the path in the registry.

IAM requirements

The scanner requires an IAM role in the management (root) account with:

Each member account must have an OrganizationAccountAccessRole that trusts the management account and has at minimum:

Configuration (via JSONAir)

Local registry

The aws-s3 client maintains a local GOB registry keyed by S3 path (s3://<bucket>/<key>). An object is recorded in the registry when:

  • Its MIME type does not match the configured list.

  • It exceeds the file size limit.

  • It has already been analyzed by highvolt-server.

  • It is successfully submitted for analysis.

Objects in the registry are never re-downloaded, so subsequent runs only fetch objects whose paths are not yet recorded.

Submitted JSON structure

Notes

  • The scanner runs once and exits. Schedule it with cron or a workflow orchestrator for periodic scanning.

  • The single-pass streaming approach means each S3 object is downloaded exactly once, even though four operations are performed on it (SHA256, SHA1, MD5, and base64 encoding). MIME detection uses a buffered header read, and the remainder of the object is then streamed.

  • If a member account is inaccessible (permission denied or suspended), the scanner logs an error and skips it; the scan continues with the next account.
