Skip to content

I wrote a bash script so you can download a Dataset of 3 Months of AOL User Search Queries from 2006. Data Analysis, Machine Learning, Natural Language Processing, Artificial Intelligence, OSINT, Data Broker

Notifications You must be signed in to change notification settings

ajtran303/aol_data_analysis_v1

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

AOL Data Analysis v1

I'm trying out some machine learning and natural language processing (ML / NLP) techniques with this dataset of AOL User Search Queries.

Background

In 2006, America Online (aka "AOL") accidentally leaked 3 months worth of "anonymized" search query logs. It was a big "uh oh." Read the Wikipedia article and do some searching on your search engine of choice to learn more.

It turned into a cultural phenomenon - tons of articles were written, websites to search and navigate the data, theater, and a tv show... and I guess it's mostly forgotten now.

It's 2023 and there's all sorts of data leaks all the time. Just this week, the entire state of Lousiana had information stolen from every person with a driver's license. This was a cyber crime, for sure, made by people with mal-intent. I'm sure there's still accidents like this AOL leak happening though.

Back to the dataset though. I'll just lift text from the README:

# U500k_README.txt

...

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged. 

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research. 

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
        AnonID - an anonymous user ID number.
        Query  - the query issued by the user, case shifted with
                 most punctuation removed.
        QueryTime - the time at which the query was submitted for search.
        ItemRank  - if the user clicked on a search result, the rank of the
                    item on which they clicked is listed. 
        ClickURL  - if the user clicked on a search result, the domain portion of 
                    the URL in the clicked result is listed.

...

Source

AOL took the archive offline a few days later, but not before other people made copies and uploaded it elsewhere.

I got a copy of the archive from some server at McGill CIM (Centre for Intelligent Machines). I make no guarantees that this data will continue to be available from them.

Interestingly, there's a README included and at the end of it, it says Copyright (2006) AOL.

Caveat emptor: This dataset contains explicit text references (like search queries for XXX) and personally identifiable information (people have been searching their own name since the beginning of search engines). I make no guarantees or endorsements about its contents.

Get the Data

I used command line tools to download, decompress, and collect all the data into one text file. I wrote a script so you can benefit from my experience:

$ bash scripts/download-aol-data.sh

This script will create a directory, data/ and download the raw data in tab-delimted text files, around 2.2G in size. Also included is U500k_README.txt which has really useful information about the data.

Note: The data/ directory is in .gitignore, so those GB won't be uploaded to source control.

About

I wrote a bash script so you can download a Dataset of 3 Months of AOL User Search Queries from 2006. Data Analysis, Machine Learning, Natural Language Processing, Artificial Intelligence, OSINT, Data Broker

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published