
Define the Topics and Messages sent to Kafka #28

Open
VincenzoFerme opened this issue Dec 10, 2015 · 14 comments

@VincenzoFerme
Member

Define and implement the topics and the structure of the messages sent from the collectors to Kafka.

@Cerfoglg
Contributor

Cerfoglg commented Dec 11, 2015

@VincenzoFerme

For collectors, we need to use one topic per collector, named after the collector that is going to use it for sending messages. The topics we have right now are:

  • mysql
  • stats
  • properties
  • faban
  • logs

This way the spark task sender can receive messages from a given topic and know from its config file which scripts to launch.
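
The config file format isn't shown in this thread; purely as a hypothetical illustration, it could map each topic to the scripts to launch (all script names below are made up):

{
    "mysql": ["analyse_mysql.py"],
    "stats": ["analyse_stats.py"],
    "properties": ["analyse_properties.py"],
    "faban": ["analyse_faban.py"],
    "logs": ["analyse_logs.py"]
}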

The messages will contain the data to be passed to the scripts, sent as a JSON-formatted string, which for collectors would be:

  • The location of the stored data on minio
  • The trial ID
  • The experiment ID
  • The container ID
  • The host ID
  • The collector name

So the messages would be the marshalled JSON from the Golang struct:

type KafkaMessage struct {
    Minio_key      string `json:"minio_key"`
    Trial_id       string `json:"trial_id"`
    Experiment_id  string `json:"experiment_id"`
    Container_id   string `json:"container_id"`
    Host_id        string `json:"host_id"`
    Collector_name string `json:"collector_name"`
}

This way each collector performs its task, then signals on its topic that it has the data ready, and the spark task sender can react accordingly by sending the tasks to the Spark cluster and providing the necessary arguments to the scripts.
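
A minimal sketch of this flow on the collector side, assuming the sarama Kafka client and placeholder broker address and values (none of which are specified in this thread):

package main

import (
    "encoding/json"
    "log"

    "github.com/Shopify/sarama"
)

// KafkaMessage is the struct defined above.
type KafkaMessage struct {
    Minio_key      string `json:"minio_key"`
    Trial_id       string `json:"trial_id"`
    Experiment_id  string `json:"experiment_id"`
    Container_id   string `json:"container_id"`
    Host_id        string `json:"host_id"`
    Collector_name string `json:"collector_name"`
}

func main() {
    // Build and marshal the message (placeholder values).
    msg := KafkaMessage{
        Minio_key:      "trials/trial_1/stats",
        Trial_id:       "trial_1",
        Experiment_id:  "experiment_1",
        Container_id:   "container_1",
        Host_id:        "host_1",
        Collector_name: "stats",
    }
    payload, err := json.Marshal(msg)
    if err != nil {
        log.Fatal(err)
    }

    // The sync producer requires successes to be returned.
    config := sarama.NewConfig()
    config.Producer.Return.Successes = true
    producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
    if err != nil {
        log.Fatal(err)
    }
    defer producer.Close()

    // Each collector signals on its own topic; "stats" in this example.
    _, _, err = producer.SendMessage(&sarama.ProducerMessage{
        Topic: "stats",
        Value: sarama.ByteEncoder(payload),
    })
    if err != nil {
        log.Fatal(err)
    }
}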

@VincenzoFerme
Member Author

@Cerfoglg ok, go for it and update the collectors accordingly.

I would always use underscore-separated, lowercase names for Kafka topics, to be consistent. If you agree, please update benchflow/benchflow#2.

At some point we will probably also need to specify the SUT information somewhere, so that the spark-tasks-sender has all the information to instantiate the correct data-transformer and, consequently, the correct analysers.

@Cerfoglg
Contributor

@VincenzoFerme I altered the description a bit. Basically, instead of passing the experiment id and the replication number, I just pass the trial ID, since it's a composite of the two anyway. Also, I'm passing the container ID: we need to store it in the database, so the container should report it.

Also, the message format is a single string containing the three values, separated by commas.

@VincenzoFerme
Member Author

VincenzoFerme commented Dec 15, 2015

@Cerfoglg ok for just using trial_ID.

  1. Why do you need the container_ID for all the collectors? You should need it only for the collectors sending data that references the container table. Moreover, in the future we should move the mapping between the trial_ID and the container_ID somewhere else in the flow, so that we only need to pass around the trial_ID. NOTE: currently we always pass all of them to dynamically manage the containers from which the collectors collect data.
  2. Why did you switch to a 3-value comma-separated message instead of a structured message? Apart from communication performance, are there other advantages?

@Cerfoglg
Contributor

@VincenzoFerme

  1. You're right, maybe it is best to only send the trial id and use it to obtain other information, like the container id, instead of sending too much information via Kafka. I'll change it back to only the file location and trial id in the message.
  2. It's mostly for performance. We are not sending much data through Kafka, so we can keep it compact and use a simple comma-separated message, which is also easier to process when received, by just splitting the message. Of course, there's always the option to send a structured message in JSON format and unmarshal it once received, but the message will be bigger.

@VincenzoFerme
Member Author

@Cerfoglg

  1. ok
  2. ok. I would prefer the option in which we send a bigger, but self-descriptive, message. This is because:
    • We are not sending big messages, so we can spend some more bytes to add metadata;
    • We can define the data structure of these messages in commons and share it in our projects so that we are sure to always be compliant to the defined structure.

@Cerfoglg
Contributor

@VincenzoFerme

That second point about the commons is very true. Alright, I'll change it to send a JSON object instead of a comma-separated message. Unmarshalling JSON into a data structure in Golang is really quick, so it should be easy to deal with.
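
For instance, a minimal consumer-side sketch of that unmarshalling, using only the standard library (placeholder values; at this point in the thread the message carries just the Minio location and the trial id):

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

type KafkaMessage struct {
    Minio_key string `json:"minio_key"`
    Trial_id  string `json:"trial_id"`
}

func main() {
    // A received message body (placeholder values).
    raw := []byte(`{"minio_key":"trials/trial_1/stats","trial_id":"trial_1"}`)

    var msg KafkaMessage
    if err := json.Unmarshal(raw, &msg); err != nil {
        log.Fatal(err)
    }
    fmt.Println(msg.Minio_key, msg.Trial_id)
}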

@Cerfoglg
Contributor

Cerfoglg commented Jan 5, 2016

@VincenzoFerme

Now, when a collector signals on Kafka, it sends a JSON message with this structure:

{
    "minio_key": "MINIO_LOCATION",
    "trial_id": "TRIAL_ID",
    "experiment_id": "EXPERIMENT_ID",
    "container_id": "CONTAINER_ID",
    "host_id": "HOST_ID",
    "collector_name": "COLLECTOR_NAME"
}

Here minio_key is the key of the stored data on Minio, and trial_id is the trial associated with it.

@VincenzoFerme
Member Author

@Cerfoglg Update the structure of the message, so that the names match the ones we defined in the following issue: #38

@VincenzoFerme
Member Author

@Cerfoglg discuss why it is the right choice to have a unique key for each "container folder", and to use multiple comma-separated keys to represent information coming from different containers.

@Cerfoglg
Contributor

Cerfoglg commented May 27, 2016

@VincenzoFerme It's acceptable to send a single key containing the container folder because, with the Minio API, we can obtain a list of all the files in that "folder", essentially all the keys with that prefix. We can comma-separate keys belonging to different containers, which need to be processed separately by the scripts. This way we don't end up with large Kafka messages when many files have been collected. We send the container ids the same way as the Minio keys: as a comma-separated list, in the same order as the Minio keys.
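
A small sketch of how a consumer could pair the two comma-separated lists, assuming the ordering convention described above (standard library only, placeholder values):

package main

import (
    "fmt"
    "log"
    "strings"
)

func main() {
    // Comma-separated lists from a received message, in matching order:
    // keys[i] belongs to ids[i].
    minioKey := "trials/trial_1/container_1,trials/trial_1/container_2"
    containerID := "container_1,container_2"

    keys := strings.Split(minioKey, ",")
    ids := strings.Split(containerID, ",")
    if len(keys) != len(ids) {
        log.Fatal("mismatched minio_key and container_id lists")
    }

    // Each key is a Minio "folder" prefix: listing all the objects under
    // it yields every file collected for that container.
    for i, key := range keys {
        fmt.Printf("container %s -> prefix %s\n", ids[i], key)
    }
}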

@Cerfoglg
Contributor

The current Kafka messages are sent as the JSON marshalling of this Go structure:

type KafkaMessage struct {
    Minio_key      string `json:"minio_key"`
    Trial_id       string `json:"trial_id"`
    Experiment_id  string `json:"experiment_id"`
    Container_id   string `json:"container_id"`
    Host_id        string `json:"host_id"`
    Collector_name string `json:"collector_name"`
}

where the minio_key and container_id fields can hold comma-separated lists when dealing with multiple containers to collect data from, as with stats.
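
For example, a stats message covering two containers might look like this (all values are placeholders):

{
    "minio_key": "trials/trial_1/container_1,trials/trial_1/container_2",
    "trial_id": "trial_1",
    "experiment_id": "experiment_1",
    "container_id": "container_1,container_2",
    "host_id": "host_1",
    "collector_name": "stats"
}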

@Cerfoglg
Contributor

@VincenzoFerme This definition should be final.

@VincenzoFerme
Member Author

Evaluate the following:

  • Decide whether to use one topic for the same type of data from multiple sources (e.g., different DBMSs), or different topics as it is now.
