Jérôme BATON v0.6, 2018-07-25 :toc: right :imagesdir: images
This guide is written in asciidoc and converted to html.
Original URL is https://wadael.github.io/presse_fr/french_medias.adoc
It is a web page to be read within the Neo4j browser (the builtin webapp accessible on port 7474). You certainly already know the movies guide that you ran with
:play movies
This Neo4j formatted guide’s goal is to show who owns what in the french media community by playing with some features of Neo4j.
We will see:
-
data loading
-
apply some algorithms
-
answer a few obvious questions
Disclaimer: this guide has no journalistic pretention. I wrote it to share my reuse of existing data.
The data is published by the team of the best French weekly, Le Monde Diplomatique, at this URL https://github.com/mdiplo/Medias_francais.
Thanks a lot guys
Get the two .tsv files and drop them into the import folder of your Neo4j server.
Tsv files are text files using tabulation as a field separator
(Do the same with other files used later)
medias_francais.tsv contains lines (Entities) which can be either contain contient (real) persons, companies or medias (tv, press,..).
relations_medias_francais.tsv contains the relations between those entites. The field 'valeur' gives the owning percentage of the 'source' entity on the 'cible' (target) entity. The relation to be created between the entities will be 'OWNS'.
The labels to be used will be Les étiquettes que j’envisage sont: Entite, Person,and Media.
As the names will come from the content of the file which I dont want to change, and as this English version comes after the French version, I’ll keep the Fr names from the import and add equivalent EN labels thereafter
Because of the French content, in correct French, with spaces and accentuated e in the names, some data changes were to be made on import. There are kept.
Please note that only one transformation can be made at import time on a field with plain Cypher. Unless you decide to get something unreadable by chaining the transformations. The best option for several modifications is to use a custom function.
In my book, I repeated not to forget to create constraints, for the checking and the automatic index creation.
So, even with known data of a small volume thus no need of an index, lets run this:
CREATE CONSTRAINT ON (e:Entity) ASSERT e.name IS UNIQUE
The codes proposed are
LOAD CSV WITH HEADERS FROM 'file:///medias_francais.tsv' AS line FIELDTERMINATOR '\t'
MERGE (node0:Entity{name: line.nom})
FOREACH(uselessVar IN CASE WHEN trim(line.`rangChallenges`) <> "" THEN [1] ELSE [] END | SET node0.challengesRanking= toInt(line.`rangChallenges`))
FOREACH(uselessVa IN CASE WHEN trim(line.`mediaType`) <> "" THEN [1] ELSE [] END | SET node0.mediaType = line.`mediaType`)
FOREACH(uselessV IN CASE WHEN trim(line.`mediaPeriodicite`) <> "" THEN [1] ELSE [] END | SET node0.mediaPeriodicity = line.`mediaPeriodicite`)
FOREACH(useless IN CASE WHEN trim(line.`mediaEchelle`) <> "" THEN [1] ELSE [] END | SET node0.mediaScale = line.`mediaEchelle`)
FOREACH(u IN CASE WHEN trim(line.commentaire) <> "" THEN [1] ELSE [] END | SET node0.comment = line.commentaire)
WITH node0,line
CALL apoc.create.addLabels(node0,[replace(line.`typeLibelle`," ","_")]) YIELD node
RETURN node
The field 'challengesRanking' is the ranking of the richest French nationals. It comes from the 'Challenges' periodical.
The code for the second file is:
LOAD CSV WITH HEADERS FROM 'file:///relations_medias_francais.tsv' AS ligne FIELDTERMINATOR '\t'
MERGE (nodeO:Entity{name: ligne.origine})
MERGE (nodeC:Entity{name: ligne.cible})
MERGE (nodeO)-[p:OWNS]->(nodeC)
FOREACH(u IN CASE WHEN trim(ligne.valeur) <> "" THEN [1] ELSE [] END | SET p.percentage = toInt(ligne.valeur))
FOREACH(u IN CASE WHEN trim(ligne.source) <> "" THEN [1] ELSE [] END | SET p.source = ligne.source)
FOREACH(u IN CASE WHEN trim(ligne.datePublication) <> "" THEN [1] ELSE [] END | SET p.publicationDate = ligne.datePublication)
To add the English labels and remove the French ones
MATCH (pp:Personne_physique)
SET pp:Person
REMOVE pp:Personne_physique
WITH pp
MATCH (pm:Personne_morale)
SET pm:Company
REMOVE pm:Personne_morale
WITH pp, pm
MATCH (m:Média)
SET m:Media
REMOVE m:Média
RETURN pp,pm,m
Let’s see who owns what with a simple query
MATCH pom = (pp:Person)-[:OWNS*]->(m:Media)
RETURN pom
The star * character means that there can be more than one OWN relation in the path.
Let’s launch the PageRank algorithm, just to see the most connected nodes. This uses the APOC plugin.
MATCH (node:Entity)
WITH collect(node) AS nodes
CALL apoc.algo.pageRank(nodes) YIELD node, score
RETURN node.name, score
ORDER BY score DESC
We get the most connected nodes first. Surprisingly, the top 3 is
"La Tribune" |
0.69473 |
"Hima Groupe" |
0.64087 |
"Arte" |
0.62095 |
Now, for the persons :
MATCH (no:Person) WITH collect(no) AS nodes
CALL apoc.algo.pageRankWithConfig(nodes,{types:'OWNS'}) YIELD node, score
RETURN node.name, score
ORDER BY score DESC
Surprisingly, all the persons listed have a score of 0.15. It means they all have the same number of connections. Weird coincidence.
Lets look for who owns the most medias, with a graph answer
MATCH (pp:Person)-[p:OWNS]->(m:Media)
RETURN pp.name, collect(m.name) as titles, size(collect(m.name)) as nbTitles ORDER BY nbTitles DESC
Only direct possesions are given and they seem to only have one each. This explains that.
Let’s dig deeper.
MATCH (pp:Person)-[p:OWNS*]->(m:Media)
RETURN pp.name, collect(m.name) as titles, size(collect(m.name)) as nbTitles ORDER BY nbTitles DESC
Look at the star. It gives a more credible result.
Let’s see if the persons owning the most medias have the highest ranking. What do you think ?
Try this Cypher query
MATCH (pp:Person)-[p:OWNS*..]->(m:Media)
RETURN coalesce(toInt(pp.challengesRanking),0) as rank, size(collect(m.name)) as nbTitles ORDER BY rank ASC
For a quick visual result, I use Spoon (which I believe was created by Michael Hunger) (:play spoon.html and follow instructions). I got this chart as a result.
I attributed rank zero to the persons without a ChallengesRanking. That explains why the first column is high.
I expected a declining shape. Certainly it was true in the middle of the 20th century. Maybe.
Remember, extra data is about press which obviously puts aside tv.
A media list with owners is a good start but. The more data the better. Its not uncommon to hear that there is many taxes in France. It is also true there is a lot of governmental grants, and several for the press, on different criteria.
Amongst the questions a tax payer may have, I have:
-
what are the medias getting money ?
-
what are their diffusion ?
-
who owns them ?
-
are the owners filthy rich already ?
-
which owner has the most impact on the population ?
I found some answers via the Internet
This is human-readable XLS format, with fancy presentation inside that forbids to convert it to csv and import dramalessly.
MATCHing on the name sounds obvious. But this file does not use the same naming than the first two.
For example "L’Humanité" is referenced as "Humanité (L')". Same for "Le Figaro" et "Figaro (Le)"
And I’m writing this guide for fame, not for fortune so … I can not use this file as only part of it would be used. Unless I rework the data.
The two biggest beneficiaries are:
SOCIETE COMMUNE POUR LES INFRASTRUCTURES DE LA DISTRIBUTION DE LA PRESSE |
4 419 926,00 € |
IMPRIMERIE DE L’AVESNOIS |
2 503 123,00 € |
Unknown to me.
Other beneficiaries are listed as groups of websites like j7.agefi.fr; agefi.fr; L’Agefi Hebdo Magazine
for 159 708,00 €
Luckily, I have found this URL corresponding to a CSV file with both grants and diffusion informations.
Before to import and create nodes, I write a RETURN query to check the columns.
That helps to see that some newspapers are given with their website. This will need to do some string splitting.
WITH "file:///aides-a-la-presse-classement-des-titres-de-presse-aides.csv" AS url
LOAD CSV WITH HEADERS FROM url AS ligne FIELDTERMINATOR ";"
RETURN ligne.`Bénéficiaires` as Beneficiary, split( ligne.`Bénéficiaires`,"/"), ligne.`Total des aides individuelles (en €)` as totalGrants,ligne.`Diffusion annuelle (en exemplaires` as yearlyDiffusion, ligne.`Année` as year LIMIT 5
This new content will be inserted in a new node labelled 'Statistic' linked to the Media node with a 'DESCRIBE' relation.
First,
MATCH (m:Media)
SET m.upName = toUpper(m.name)
RETURN count(m)
WITH "file:///aides-a-la-presse-classement-des-titres-de-presse-aides.csv" AS url
LOAD CSV WITH HEADERS FROM url AS ligne FIELDTERMINATOR ";"
MERGE(m:Media {upName:toUpper( split( ligne.`Bénéficiaires`,"/")[0] ) })
SET m.name=split( ligne.`Bénéficiaires`,"/")[0]
CREATE (s:Statistic { year: toInt(ligne.`Année`), totalGrants: toInt(ligne.`Total des aides individuelles (en €)`), yearlyDiffusion: toInt(ligne.`Diffusion annuelle (en exemplaires`) })
CREATE (m)<-[d:DESCRIBE]-(s)
RETURN m,d,s
We notice that some press media do not have statistics for 2015 and 2016.
Who gets the most grant money
MATCH (pp:Person)-[p:OWNS*..]->(m:Media)<--(s:Statistic)
RETURN pp.name, coalesce(toInt(pp.challengesRanking),0) as challengesRanking, size(collect(m.name)) as nbTitles, sum(s.totalGrants) AS jackpot ORDER BY jackpot DESC
Interesting sums. However, before to cry wolf, checking the percentage owned of each node between the person and the media is necessary. It is not because a title of a billionaire gets millions that those millions ends in the billionaire’s (overseas tax-free) pockets.
This can be queried too :)
MATCH path=(p:Person)-[o:Owns*]->(m:Media)
RETURN p.name, m.name, reduce(prod=1.0, x in relationships(path) | prod*coalesce( x.value, 0.0 )/100 ) as possession
ORDER BY possession DESC
I thank Stefan Armbruster (Neo4j) who reminded me the existence of the reduce function.
From time to time, I like to see other languages than those I am used to. Even more when they are specialized to do quickly a job I want to do.
The R language is the current best choice to get charts quickly, like those two next representations of who gets the most governmental money.
I got the values as text in the Neo4j browser, then used a spreadsheet to convert from columns to lines. A text editor with a column mode can be used too.
group=c("Bernard Arnault","Famille Hutin","Famille Hurbain","Famille Lemoine","Famille Baylet","Xavier Niel","Matthieu Pigasse","Arnaud Lagardère","Bernard Tapie","Famille Mohn","Yves de Chaisemartin","Famille Saint-Cricq","François Pinault","Philippe Hersant","Édouard Coudurier")
value=c(18454766,7240740,2970948,2806813,2317190,1565190,1565190,1533157,1077799,550688 ,250628 ,243025 ,219387,43018,11565)
pie(value, group)
That creates an unreadable chart.
A treemap shares its total area between the results, according to their values
library(treemap)
group=c("Bernard Arnault","Famille Hutin","Famille Hurbain","Famille Lemoine","Famille Baylet","Xavier Niel","Matthieu Pigasse","Arnaud Lagardère","Bernard Tapie","Famille Mohn","Yves de Chaisemartin","Famille Saint-Cricq","François Pinault","Philippe Hersant","Édouard Coudurier")
value=c(18454766,7240740,2970948,2806813,2317190,1565190,1565190,1533157,1077799,550688 ,250628 ,243025 ,219387,43018,11565)
data=data.frame(group,value)
# treemap
treemap(data,index="group",vSize="value",type="index")
I say impact because in my opinion too many persons take what’s written for truth.
And some moguls are known to make print what serves them best. Or not print what would disserve them.
So, imho, it is an interesting metric. However, we know its TVs that makes opinions now, and fac*b*k.
MATCH (pp:Person)-[p:OWNS*..]->(m:Media)<--(s:Statistic)
RETURN pp.name, coalesce(toInt(pp.challengesRanking),0) as challengesRanking, size(collect(m.name)) as nbTitles, apoc.number.format( sum(s.yearlyDiffusion)) AS targetSize ORDER BY targetSize DESC
If you want to learn more about Neo4j, my book Learning Neo4j, Second Edition is available on many websites.
My editor: Packt Publishing often does discount campaigns (like any book for 10$, 3 for 25$)
Do you need a consultant ? Contact me via LinkedIn or via email: firstnameLastname followed by .pro@gmail.com
Thanks for reading the ad.
Your comments are welcome (pr, issues). This guide may get more content based on your feedback
jerome baton