Setting up Whois Data
The Cisco OpenSOC development team obtains its Whois data from a private third party in CSV format. Your source data may differ, but the bolt code expects specific field names:
Key name | Use |
---|---|
domainName | The search key |
fieldName | Added to source string during enrichment |
fieldName | Added to source string during enrichment |
fieldName | Added to source string during enrichment |
fieldName | Added to source string during enrichment |
This behavior can be changed in the bolt source code.
OpenSOC comes with a conversion utility that takes source CSV files with header fields and converts each line to a JSON string. It expects each CSV file to sit in a directory named for its TLD, and it combines the TLD and the filename so that all output is consolidated into one directory for HBase consumption.
As an example, the directory /whois/csv has the following structure:
-- whois/
   -- csv/
      -- com/1.csv
      -- us/1.csv
With the command below, these files are processed and saved as /whois/json/com_1.json and /whois/json/us_1.json.
$ OpenSOC-PlatformScripts/WhoisEnrichment/Whois_CSV_to_JSON.py -s /whois/csv -o /whois/json
INFO:root:Processing Whois files from /whois/csv
INFO:root:Starting 8 pool workers
INFO:root:Starting activities on 2 CSV files
DEBUG:root:PoolWorker-1: Converting /whois/csv/com/1.csv to /whois/json/com_1.json
DEBUG:root:PoolWorker-2: Converting /whois/csv/us/1.csv to /whois/json/us_1.json
INFO:root:Completed
An example line of the JSON output:
{"standardRegCreatedDate": "2012-02-20 14:09:17 UTC", "technicalContact_telephoneExt": "",
"expiresDate": "Tue Feb 19 23:59:59 GMT 2013", "technicalContact_city": "San Nicola La Strada",
"billingContact_fax": "", "whoisServer": "", "administrativeContact_faxExt": "", "registrant_fax": "",
"registrant_postalCode": "00132", "registrant_city": "Roma", "billingContact_postalCode": "81020",
"billingContact_city": "San Nicola La Strada", "technicalContact_street4": "", "technicalContact_street1": "Via Caserta, 5",
"technicalContact_street3": "", "technicalContact_state": "CE", "technicalContact_email": "domainmanager@interferenza.com",
"technicalContact_street2": "", "technicalContact_name": "Giancarlo Russo", "zoneContact_street4": "",
"zoneContact_street3": "", "zoneContact_street2": "", "zoneContact_street1": "",
"administrativeContact_street1": "Via Caserta, 5", "administrativeContact_street3": "",
"administrativeContact_street2": "", "administrativeContact_street4": "", "zoneContact_postalCode": "",
"administrativeContact_postalCode": "81020", "technicalContact_organization": "Interferenza s.r.l.",
"zoneContact_city": "", "registrant_name": "Roberto Delle Fratte", "standardRegExpiresDate": "2013-02-19 23:59:59 UTC",
"billingContact_email": "domainmanager@interferenza.com", "registrant_email": "domainmanager@interferenza.com",
"billingContact_name": "Giancarlo Russo", "billingContact_organization": "Interferenza s.r.l.",
"administrativeContact_organization": "Interferenza s.r.l.", "administrativeContact_telephone": "39390823454016",
"technicalContact_fax": "", "zoneContact_telephoneExt": "", "updatedDate": "Sat Feb 23 03:42:03 GMT 2013",
"standardRegUpdatedDate": "2013-02-23 03:42:03 UTC", "zoneContact_name": "", "administrativeContact_telephoneExt": "",
"technicalContact_postalCode": "81020", "billingContact_street3": "", "billingContact_street2": "",
"billingContact_street1": "Via Caserta, 5", "registrant_street4": "", "registrant_street3": "",
"registrant_street2": "", "registrant_street1": "Via Fermignano, 90", "billingContact_street4": "",
"zoneContact_email": "", "zoneContact_telephone": "", "registrant_organization": "Delle Fratte Roberto",
"zoneContact_organization": "", "registrant_telephoneExt": "", "administrativeContact_fax": "",
"billingContact_telephoneExt": "", "createdDate": "Mon Feb 20 14:09:17 GMT 2012", "zoneContact_fax": "",
"administrativeContact_city": "San Nicola La Strada", "administrativeContact_state": "CE",
"zoneContact_country": "", "technicalContact_telephone": "39390823454016", "contactEmail": "domainmanager@interferenza.com",
"registrant_state": "RM", "billingContact_state": "CE", "technicalContact_country": "ITALY",
"technicalContact_faxExt": "", "registrarName": "DOMAIN.COM, LLC|DOTSTER",
"administrativeContact_country": "ITALY", "status": "ok", "registrant_telephone": "39393338043201",
"nameServers": "EIG1.RENEWYOURNAME.NET|EIG2.RENEWYOURNAME.NET|", "billingContact_telephone": "39390823454016",
"billingContact_country": "ITALY", "zoneContact_state": "", "registrant_country": "ITALY",
"administrativeContact_email": "domainmanager@interferenza.com", "administrativeContact_name": "Giancarlo Russo",
"registrant_faxExt": "", "billingContact_faxExt": "", "domainName": "antinfortunistica.us", "zoneContact_faxExt": ""}
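The directory-to-filename mapping described earlier can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of Whois_CSV_to_JSON.py: the function names `output_name`, `convert_file`, and `convert_tree` are hypothetical, and the sketch runs sequentially, whereas the log above shows the real utility using pool workers.

```python
import csv
import json
from pathlib import Path
from typing import List


def output_name(csv_path: Path) -> str:
    """Combine the TLD directory name and the CSV file stem: com/1.csv -> com_1.json."""
    return f"{csv_path.parent.name}_{csv_path.stem}.json"


def convert_file(csv_path: Path, out_dir: Path) -> Path:
    """Write one JSON object per CSV row, keyed by the CSV header fields."""
    out_path = out_dir / output_name(csv_path)
    with csv_path.open(newline="", encoding="utf-8") as src, \
            out_path.open("w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")
    return out_path


def convert_tree(src_dir: Path, out_dir: Path) -> List[Path]:
    """Convert every <src_dir>/<tld>/<name>.csv into <out_dir>/<tld>_<name>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return [convert_file(p, out_dir) for p in sorted(src_dir.glob("*/*.csv"))]
```

Calling `convert_tree(Path("/whois/csv"), Path("/whois/json"))` on the example layout would write com_1.json and us_1.json into a single output directory, which is why the TLD is folded into the filename: two files named 1.csv would otherwise collide.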
JSON files can then be uploaded to an HDFS directory and loaded with the HBase ImportTsv tool, after creating the target table through the HBase shell:
$ echo "create 'whois', {NAME => 'data', COMPRESSION => 'LZO', VERSIONS => '2'}" | ./bin/hbase shell
$ ./bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,data:json whois hdfs://whois/load/
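By default ImportTsv reads tab-separated fields, and with the column mapping HBASE_ROW_KEY,data:json it takes the first field as the row key and the second as the JSON payload. This page does not say whether the converter output already has that layout. If it contains only the JSON string, as in the example line above, a preprocessing step along these lines could prepend the key. This is a sketch under that assumption: `to_tsv_line` is a hypothetical helper, and using domainName as the row key is inferred from its role as the search key.

```python
import json


def to_tsv_line(json_line: str, key_field: str = "domainName") -> str:
    """Return '<row key><TAB><json>' for one line of converter output."""
    record = json.loads(json_line)
    row_key = str(record.get(key_field, "")).strip()
    if not row_key:
        raise ValueError(f"record has no {key_field!r} value")
    # json.dumps escapes tabs and newlines inside strings, so the payload
    # cannot introduce an extra field separator or line break.
    return f"{row_key}\t{json.dumps(record)}"
```

Applied to each line of a file such as us_1.json, this yields lines that start with the domain name, followed by a tab and the original record.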
CIF data can be loaded into HBase using the utility class com.opensoc.dataloads.cif.HBaseTableLoad, which is in the OpenSoc-Dataloads folder.
The class takes a directory name and a table name as inputs.
java -cp OpenSOC-Topologies-0.3BETA-SNAPSHOT.jar com.opensoc.dataloads.cif.HBaseTableLoad directoryname hbaseTableName
The HBase configuration is loaded from an hbase-site.xml file; one is included in the OpenSoc-Dataloads folder. To override it, put the directory that contains a different hbase-site.xml ahead of the jar on the classpath (classpath entries are directories or jar files, so list the directory rather than the file itself), e.g. java -cp /etc/hbase/conf:OpenSOC-Topologies-0.3BETA-SNAPSHOT.jar com.opensoc.dataloads.cif.HBaseTableLoad directoryname hbaseTableName
The class assumes that the source files are gzip-compressed and that the data is JSON. Currently, domain, email, and infrastructure data are loaded; URL and malware data are not.
The following datasets are loaded into the HBase table:
domain_botnet/ domain_fastflux/ domain_malware/ domain_phishing/ domain_spam/ domain_spamvertising/ domain_suspicious/ domain_whitelist/ email_phishing/ email_registrant/ email_spam/ email_spamvertising/ email_suspicious/ email_whitelist/ infrastructure_botnet/ infrastructure_fastflux/ infrastructure_malware/ infrastructure_phishing/ infrastructure_scan/ infrastructure_spam/ infrastructure_spamvertising/ infrastructure_suspicious/ infrastructure_warez/ infrastructure_whitelist/
The first part of the directory name is the column family and the second part is the column qualifier. For example, in domain_botnet, domain is the column family and botnet is the qualifier.
The loader uses the value of the JSON field "address" as the HBase row key. The stored value is a simple flag, "Y".
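The mapping from dataset directory to HBase cell can be sketched in Python as follows. This is not the HBaseTableLoad implementation, which is a Java class. The names `split_dataset_name` and `cells_for_file` are hypothetical, and the sketch assumes each gzip file holds one JSON object per line, which this page does not state.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, Tuple


def split_dataset_name(dir_name: str) -> Tuple[str, str]:
    """Split 'domain_botnet/' into (column family, qualifier) = ('domain', 'botnet')."""
    family, qualifier = dir_name.strip("/").split("_", 1)
    return family, qualifier


def cells_for_file(gz_path: Path) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (row_key, family, qualifier, value) for each JSON record in a .gz file."""
    family, qualifier = split_dataset_name(gz_path.parent.name)
    with gzip.open(gz_path, "rt", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # assumption: one JSON object per line
            # The "address" field is the row key; the cell value is the flag "Y".
            yield record["address"], family, qualifier, "Y"
```

With this layout, a single row keyed by an address can carry flags from several datasets, for example domain:botnet and domain:spam, so one lookup by address shows every list on which it appears.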