Searching for anomalies in call detail records (CDRs). This is my BSc thesis. You can check the announcement at docs/thesis-announcement.pdf.
Telecom Italia's data, recorded in Milano in September and November 2013. You can download the dataset from here. It is licensed under ODbL.
From dandelion.eu:
This dataset provides information about the telecommunication activity over the city.
The dataset is the result of a computation over the Call Detail Records (CDRs) generated by the Telecom Italia cellular network over the city. CDRs log the user activity for billing purposes and network management. There are many types of CDRs; for the generation of this dataset we considered those related to the following activities:
- Received SMS: a CDR is generated each time a user receives an SMS
- Sent SMS: a CDR is generated each time a user sends an SMS
- Incoming Calls: a CDR is generated each time a user receives a call
- Outgoing Calls: a CDR is generated each time a user issues a call
- Internet: a CDR is generated each time
  - a user starts an internet connection
  - a user ends an internet connection
  - during the same connection one of the following limits is reached:
    - 15 minutes from the last generated CDR
    - 5 MB from the last generated CDR
This dataset was created by aggregating the aforementioned records; it provides SMS, call and Internet traffic activity. It measures the level of interaction of the users with the mobile phone network; for example, the higher the number of SMS sent by the users, the higher the sent SMS activity. Measurements of call and SMS activity have the same scale (and are therefore comparable); those referring to Internet traffic do not.
From dandelion.eu:
- Square id: The id of the square that is part of the city GRID.
- Time interval: The beginning of the time interval expressed as the number of milliseconds elapsed from the Unix Epoch on January 1st, 1970 at UTC. The end of the time interval can be obtained by adding 600000 milliseconds (10 minutes) to this value.
- Country code: The phone country code of a nation. Depending on the measured activity this value assumes different meanings that are explained later.
- SMS-in activity: The activity in terms of received SMS inside the Square id, during the Time interval and sent from the nation identified by the Country code.
- SMS-out activity: The activity in terms of sent SMS inside the Square id, during the Time interval and received by the nation identified by the Country code.
- Call-in activity: The activity in terms of received calls inside the Square id, during the Time interval and issued from the nation identified by the Country code.
- Call-out activity: The activity in terms of issued calls inside the Square id, during the Time interval and received by the nation identified by the Country code.
- Internet traffic activity: The activity in terms of performed internet traffic inside the Square id, during the Time interval and by the nation of the users performing the connection identified by the Country code.
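The Time interval field can be turned into readable timestamps with a few lines of Python. This is only a minimal sketch: the function name `interval_bounds` and the sample value are my own illustration, not part of the dataset or of the thesis code.

```python
from datetime import datetime, timedelta, timezone

# Length of one interval, as given in the schema: 600000 ms (10 minutes).
INTERVAL_MS = 600_000

# The Unix Epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def interval_bounds(start_ms):
    """Return (start, end) of a Time interval as UTC datetimes.

    `start_ms` is the raw Time interval value: milliseconds since the Epoch.
    """
    start = EPOCH + timedelta(milliseconds=start_ms)
    end = start + timedelta(milliseconds=INTERVAL_MS)
    return start, end


# Hypothetical sample value, chosen to fall on 2013-11-01 00:00 UTC.
start, end = interval_bounds(1383264000000)
print(start.isoformat())  # 2013-11-01T00:00:00+00:00
print(end.isoformat())    # 2013-11-01T00:10:00+00:00
```

The values are in UTC; converting to Milano local time is a separate step if the analysis needs it.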
From dandelion.eu:
Files are in tsv format. If no activity was recorded for a field specified in the schema above then the corresponding value is missing from the file. For example, if for a given combination of the Square id `s`, the Time interval `i` and the Country code `c` no SMS was received, the corresponding record looks as follows:

`s \t i \t c \t \t SMSout \t Callin \t Callout \t Internettraffic`

where `\t` corresponds to the tab character, `SMSout` is the value corresponding to the SMS-out activity, `Callin` is the value corresponding to the Call-in activity, `Callout` is the value corresponding to the Call-out activity and `Internettraffic` is the value corresponding to the Internet traffic activity.

Moreover, if for a given combination of the Square id `s`, the Time interval `i` and the Country code `c` no activity is recorded, the corresponding record is missing from the dataset. This means that records of the following type

`s \t i \t c \t \t \t \t \t`

are not stored in the dataset.
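When reading the files, empty fields have to be told apart from zero values. The following is a rough sketch of how a single line could be parsed under that rule; the function name `parse_record`, the dictionary keys and the sample line are assumptions made up for illustration, and only the column order comes from the schema above.

```python
# Labels for the five activity columns, in schema order (names are my own).
ACTIVITY_FIELDS = ("sms_in", "sms_out", "call_in", "call_out", "internet")


def parse_record(line):
    """Parse one tab-separated line into a dict; empty fields become None."""
    # Strip only the line ending: trailing tabs mark empty fields.
    parts = line.rstrip("\r\n").split("\t")
    # Assumption: pad short rows in case trailing empty fields were trimmed.
    parts += [""] * (8 - len(parts))
    record = {
        "square_id": int(parts[0]),
        "interval_ms": int(parts[1]),
        "country_code": int(parts[2]),
    }
    for name, raw in zip(ACTIVITY_FIELDS, parts[3:8]):
        record[name] = float(raw) if raw.strip() else None
    return record


# Hypothetical line whose SMS-in field is empty.
sample = "1\t1383264000000\t39\t\t0.25\t0.5\t0.125\t8.0"
record = parse_record(sample)
print(record["sms_in"], record["sms_out"])  # None 0.25
```

Keeping missing values as `None` instead of `0.0` preserves the difference between "nothing recorded" and "recorded as zero", which may matter when looking for anomalies.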