Global flight route information, which naturally reveals the connections and communication among countries and regions, has long been of great importance in politics and economics. Rather than being regular, the derived flight route graph contains regions in which flights are denser than elsewhere. This unevenness motivates the search for communities in the graph. In this project, we use network analysis and machine learning techniques to discover communities in the flight route graph in weakly supervised and unsupervised fashion.
The following Python modules are required to reproduce the results.
- NumPy 1.15.0
pip3 install --user numpy==1.15.0
- Matplotlib 3.0.3
pip3 install --user matplotlib==3.0.3
- Networkx 2.3
pip install --user networkx==2.3
- Pandas 0.23.0
pip install --user pandas==0.23.0
- Geopy 1.20.0
pip install --user geopy==1.20.0
- scipy 1.1.0
pip install --user scipy==1.1.0
- folium 0.8.0
pip install --user folium==0.8.0
- scikit-learn 0.19.1
pip3 install --user scikit-learn==0.19.1
- seaborn 0.8.1
pip install --user seaborn==0.8.1
The dataset used in the project consists of an airport database and a flight routes database from https://openflights.org/data.html#route.
The airport database, with 3188 entries, provides the following essential information about each airport:
Attribute | Airport_id | Name | City | Country | IATA | ICAO | Latitude | Longitude | Timezone | Tzdatabase |
---|---|---|---|---|---|---|---|---|---|---|
Example | 1665 | Geneva Cointrin International Airport | Geneva | Switzerland | GVA | LSGG | 46.238098 | 6.10895 | 1 | Europe/Paris |
The routes database with 66771 entries provides the following essential information about routes:
Attribute | Airline | Airline_id | Source_airport | Source_airport_id | Destination_airport | Destination_airport_id | Stops | Equipment |
---|---|---|---|---|---|---|---|---|
Example | AB | 214 | GVA | 1665 | MAD | 1229 | 0 | 320 319 |
preprocessing.ipynb contains the code for preprocessing. We remove the routes whose source or destination airport is unrecorded or unspecified. Based on the preprocessed data, we build the weighted and unweighted adjacency matrices and the graph.
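The preprocessing step described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's code: the function name `build_adjacency`, the toy airport IDs, the choice to treat the graph as undirected, and the use of the route count as the edge weight are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: known airport IDs and raw routes given as
# (source_airport_id, destination_airport_id); None marks a missing field.
airport_ids = [1665, 1229, 507]
raw_routes = [
    (1665, 1229),
    (1665, 1229),  # a second airline flying the same route
    (1229, 507),
    (1665, 9999),  # destination airport is not in the airport table
    (None, 507),   # source airport is not specified
]


def build_adjacency(airport_ids, routes):
    """Return (weighted, unweighted) symmetric adjacency matrices.

    Assumption: the graph is undirected and the weight of an edge is
    the number of routes between the two airports.
    """
    index = {airport_id: i for i, airport_id in enumerate(airport_ids)}
    n = len(airport_ids)
    weighted = np.zeros((n, n), dtype=int)
    for src, dst in routes:
        # Drop routes with a missing or unrecorded endpoint.
        if src not in index or dst not in index:
            continue
        i, j = index[src], index[dst]
        if i == j:
            continue  # assumption: self-loops are ignored
        weighted[i, j] += 1
        weighted[j, i] += 1
    # The unweighted matrix only records whether a connection exists.
    unweighted = (weighted > 0).astype(int)
    return weighted, unweighted


weighted, unweighted = build_adjacency(airport_ids, raw_routes)
```

Keeping both matrices makes it possible to compare clustering on connection strength with clustering on connectivity alone.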
explore_data.ipynb contains the code for data exploration. We explore the graph by computing its degree distribution and centrality metrics. We also visualize all airports on a world map with continent borders.
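As an illustration of the exploration step, the sketch below computes the degree distribution and the degree centrality of a small hypothetical graph with NumPy. The toy adjacency matrix and the function name `degree_statistics` are assumptions for the example. Degree centrality is only one of several centrality metrics, and the notebook may use others.

```python
import numpy as np

# Hypothetical unweighted adjacency matrix of a 4-airport toy graph:
# airport 0 is a hub linked to all others, and airports 1 and 2 are linked.
adjacency = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])


def degree_statistics(adjacency):
    """Return node degrees, the degree distribution and degree centrality."""
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)
    # counts[k] is the number of nodes with degree k.
    counts = np.bincount(degrees)
    distribution = counts / n
    # Degree centrality: fraction of the other nodes a node is connected to.
    centrality = degrees / (n - 1)
    return degrees, distribution, centrality


degrees, distribution, centrality = degree_statistics(adjacency)
```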
exploit_data.ipynb contains the code for data exploitation. We apply the spectral clustering and K-means algorithms to both the weighted and unweighted adjacency matrices. To discover communities without initialization, we also apply greedy modularity maximization and label propagation.
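To make the idea concrete, here is a simplified sketch of spectral clustering on a toy graph, followed by a modularity score for the resulting partition. It is not the notebook's implementation. For two clusters, the K-means step is replaced by thresholding the Fiedler vector at zero (spectral bisection), and the unnormalized Laplacian is used. The toy graph and the function names `spectral_bisection` and `modularity` are assumptions for the example. Modularity is the quantity that greedy modularity maximization tries to increase.

```python
import numpy as np

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0


def spectral_bisection(adjacency):
    """Split a connected graph in two using the sign of the Fiedler vector."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency  # unnormalized Laplacian L = D - A
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]  # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)


def modularity(adjacency, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    two_m = adjacency.sum()  # twice the number of edges
    degrees = adjacency.sum(axis=1)
    same_community = labels[:, None] == labels[None, :]
    expected = np.outer(degrees, degrees) / two_m
    return ((adjacency - expected) * same_community).sum() / two_m


labels = spectral_bisection(adjacency)
q = modularity(adjacency, labels)
```

The sign of an eigenvector is arbitrary, so only the grouping of nodes is meaningful, not which group is labelled 0 or 1. For more than two clusters, one would typically run K-means on the rows of the first few eigenvectors instead of thresholding.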
data folder contains all raw data, preprocessed data, and additional data used for visualization.
map folder contains all HTML files of the visualization results.
doc folder contains our report for this project.
The visualizations of our results are available here.
This project is licensed under the MIT License - see the LICENSE file for details.