Python script that extracts all unique domains from incoming email addresses in a .mbox
file (Gmail export format) to help you find your accounts; outputs to .csv
To use this you will need Python and a few libraries.
All of them ship with Python except pandas, which has to be installed separately (for example with pip install pandas):
- mailbox
- pandas
- os
- threading
- time
- random
Get python here.
- Go to Google Takeout and log in (if not logged in)
- Locate and select only Gmail
- Press "All mail data included" and choose which folders of your mailbox to export
- Scroll to the bottom and press "Next step"
- Select the file size the export should be split into (I don't recommend more than 2GB)
- Press "Create export"
- Wait for the email saying that your export is ready (possibly hours or days)
- Download the export to your computer
- Clone or download the repo: git clone https://github.com/Kaktur/Email-domain-extractor.git
- Put the downloaded .mbox files in the input folder
- Optionally adjust the settings
- Run program
- Wait for program to finish
- Results are in the file named output.csv
- Profit
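
As a rough illustration of what the run step does, here is a minimal sketch of the core idea, not the actual code from the repo. It reads one .mbox file with the standard mailbox module and keeps the first sender address seen for each unique domain. The function name extract_domains and the ignore_gmail parameter are hypothetical names chosen for this example; the real script is configured through its CONFIG sections.

```python
import mailbox
from email.utils import parseaddr


def extract_domains(mbox_path, ignore_gmail=True):
    """Map each unique sender domain to the first sender address seen with it.

    Hypothetical helper for illustration; not taken from the repo.
    """
    domains = {}
    for message in mailbox.mbox(mbox_path):
        # parseaddr splits 'Name <user@host>' into ('Name', 'user@host').
        _, address = parseaddr(str(message.get("From", "")))
        if "@" not in address:
            continue  # skip messages with a missing or malformed From header
        domain = address.rsplit("@", 1)[1].lower()
        if ignore_gmail and domain == "gmail.com":
            continue  # personal contacts are usually not service accounts
        # setdefault keeps the first address found for a domain.
        domains.setdefault(domain, address)
    return domains
```

Calling extract_domains on each file in the input folder and merging the resulting dictionaries would give one combined set of domains across multiple exports.
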
- Works on multiple files
- Visual representation of file processing
  - Loading a 1GB file into memory takes a long time; this phase is shown as "Loading file"
  - Shows live how many messages were processed and how many are left
  - Shows live how many domains were found
  - Shows time elapsed
- Output format is .csv, with the following columns:
  - no.
  - Sender's email
  - Domain
  - First found message
- Ignores the gmail.com domain, to filter out everyone who contacts you that is not a service
  - Configure in code, in the section marked #general CONFIG
    - gmail: disable/enable ignoring of gmail.com domains
- Additional result that can be enabled: occurrence. Splits domains by "." and saves how often each part appears, to help you find your way around the data. Results will be in occurrence.csv
  - Configure in code, in the section marked #occurrence CONFIG. Config options:
    - occurrence: enable/disable this extra output
    - com: enable/disable excluding all .com parts
    - sing: enable/disable removing all parts that appear once
  - For this output the format is also .csv, with the following columns:
    - no.
    - part string
    - instance count
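
The occurrence output described above can be pictured with the following minimal sketch. It is an assumption about how such counting could work, not the code from the repo: the function name count_parts and the parameters exclude_com and drop_singles are hypothetical stand-ins for the com and sing options.

```python
from collections import Counter


def count_parts(domains, exclude_com=True, drop_singles=True):
    """Count how often each dot-separated part appears across all domains.

    Hypothetical helper for illustration; not taken from the repo.
    """
    counts = Counter()
    for domain in domains:
        # "mail.shop.com" contributes the parts "mail", "shop" and "com".
        counts.update(part for part in domain.lower().split(".") if part)
    if exclude_com:
        counts.pop("com", None)  # "com" appears almost everywhere and adds noise
    if drop_singles:
        # Keep only the parts that were seen more than once.
        counts = Counter({part: n for part, n in counts.items() if n > 1})
    return counts
```

Parts that survive both filters tend to be company or service names shared by several subdomains, which is what makes this extra output useful for spotting accounts.
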
All contributions, issues, and messages are welcome! If you aren't sure about something or have any questions please reach out to me.