by Maisha Maliha, Vishal Pramanik https://arxiv.org/html/2410.09319
- Abstract
- 1 Introduction
- 2 Related Works
- 3 Background and Technologies
- 4 Dataset
- 5 Techniques and Implementation
- 6 Experimentation
- 7 Results and Analysis
- 8 Conclusion and Future Works
Automatic Essay Grading (AEG):
- Attracts attention of NLP community due to applications in scoring essays, short answers, etc.
- Can save time and money compared to manual grading.
- Existing works use a single network responsible for the whole process, which may be ineffective.
- This work introduces a new model that outperforms state-of-the-art AEG models.
- Uses collaborative learning and transfer learning:
- One network checks grammatical and structural features of sentences.
- Another network scores the overall idea in the essay.
- Learnings are transferred to another network for essay scoring.
- Compared performances of different models, proposing a new model with an accuracy of 85.50%.
Keywords:
- Automatic Essay Grading
- Collaborative learning
- Deep Learning
- Recursive Learning
- Recursive Neural Network.
Automatic Essay Grading (AEG)
- A subfield of Natural Language Processing (NLP) that has been around for over 50 years
- The first AEG system was proposed in [1]
- Little progress due to lack of resources and processing power until development of deep neural networks
- Deep Neural Networks: [2], [3]
- Automatic Essay Grading:
- Using a machine to grade a text in response to an essay prompt
- Holistic AEG: Awarding overall quality grade
- Trait-Specific AEG: Rating essays based on single trait/attribute (content, organization, style)
- Problem: Need for human-graded essays to evaluate the AEG system
- Solution: Domain adaptation techniques like cross-domain AEG
Contributions of the Paper:
- Introduce a collaborative deep learning model for automatic essay grading
- Model considers nature, idea, grammar, and structure of sentences in the essay before grading it
- Comparative analysis of results from different machine and deep learning models to the proposed deep learning network.
Automatic Essay Grading (AEG)
Holistic AEG:
- Entails providing an overall score/grade to the essay based on its quality
- Majority of AEG research focuses on this approach
Machine Learning with Classifiers for Holistic AEG:
- Early research used machine learning with classifiers to grade essays holistically
- Examples: e-rater [4] and Intelligent Essay Assessor [5]
- Use a variety of features:
- Surface-level features: Word count, average word length, average sentence length, etc.
- More complex features: Usage score (detecting usage errors)
- Kernels
Task-Independent Features for AEG:
- Study on the problem of cross-domain AEG [6]
- Findings:
- Best results for cross-domain AEG when source and target prompts are similar
Systems Used in AEG:
- Overview of various AEG systems discussed
AEG Systems
- Early systems employ machine learning techniques
- Approach: feature engineering and ordinal classification/regression
- Project Essay Grade (PEG) [1]: first AEG system, uses intrinsic properties called 'trins' for an essay score approximation
(Note: Trins are analogous to features)
Neural Networks in NLP (since 2010s)
- CNN and other hierarchical models used for various tasks [7]
- SSWE system developed by [8]: learns score-specific word embeddings and uses LSTMs to get essay representation for scoring
- Pre-trained word embeddings used in [9]'s architecture, similar to [8], for scoring essays with LSTMs and other RNNs
Recurrent Neural Network (RNN)
Definition:
- Deep neural network that uses previous step's output as feedback input
- Used for sequences, next-word prediction in sentences
Hidden State and Output:
- Hidden state: a<t>, the hidden state vector at time step t
- Output: y<t>, the output vector at time step t
Equations (1) and (2):
- Equation (1), hidden state update: a<t> = g1(Waa a<t−1> + Wax x<t> + ba)
  - g1: activation function; Waa, Wax, ba: coefficients shared over time steps
- Equation (2), output prediction: y<t> = g2(Wya a<t> + by)
  - g2: activation function; Wya, by: coefficients shared over time steps
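As a minimal illustration of Equations (1) and (2), the NumPy sketch below steps an RNN over a toy sequence; the tanh/softmax activations and all dimensions are illustrative choices, not taken from the paper.

```python
import numpy as np

def rnn_step(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """One RNN time step following Equations (1) and (2)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)        # Equation (1): hidden state update (g1 = tanh)
    z = W_ya @ a_t + b_y
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # Equation (2): output prediction (g2 = softmax)
    return a_t, y_t

# Illustrative sizes: 100-dim inputs, 64-dim hidden state, 10 output classes.
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(64, 64)) * 0.1, rng.normal(size=(64, 100)) * 0.1
W_ya, b_a, b_y = rng.normal(size=(10, 64)) * 0.1, np.zeros(64), np.zeros(10)

a = np.zeros(64)                      # initial hidden state
for x in rng.normal(size=(5, 100)):   # a toy sequence of five input vectors
    a, y = rnn_step(x, a, W_aa, W_ax, W_ya, b_a, b_y)
```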
Implementation:
- The Recursive Neural Network (RvNN) variant is used for sentence processing
- Word embeddings are fed into a neural network to obtain phrase embeddings
- Phrase embeddings then pass through the network to obtain sentence embedding output vectors
- Sentence embedding vectors store grammatical and structural properties of the sentence.
Convolutional Neural Network (CNN)
- Three main parts: convolutional layer, pooling layer, dense layer
- Convolutional Layer:
- Uses kernel function to extract information from images
- Calculated using the equation (a small numerical sketch appears at the end of this section):
  G[m,n] = (f∗h)[m,n] = Σj Σk h[j,k] f[m−j, n−k]
- Where:
  - G[m,n] is the resultant image matrix
  - f is the input image and h is the kernel function
  - m and n are the dimensions of the resultant image
  - j and k are the dimensions of the kernel function
- Pooling Layer:
- Decreases dimension of input matrix
- Extracts deeper meaning from feature matrix
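A small NumPy sketch of the convolution and average-pooling operations described above; the toy matrices and the pooling window size are illustrative, not values from the paper.

```python
import numpy as np

def conv2d(f, h):
    """Valid-region 2D convolution: G[m, n] = sum_j sum_k h[j, k] * f[m - j, n - k]."""
    kh, kw = h.shape
    out_h, out_w = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    h_flipped = h[::-1, ::-1]          # flipping the kernel turns cross-correlation into convolution
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            out[m, n] = np.sum(f[m:m + kh, n:n + kw] * h_flipped)
    return out

def avg_pool(x, size=2):
    """Average pooling: shrinks the feature matrix by taking the mean of each size x size block."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]
    return x.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

f = np.arange(36, dtype=float).reshape(6, 6)   # toy input matrix
h = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel
pooled = avg_pool(conv2d(f, h))                # convolution followed by average pooling
```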
BERT as Encoder in Transformer Model
- BERT [11] is the encoder part of the Transformer model [13].
- It extracts features from input sentences and passes output vectors to the Decoder part of the Transformer.
- BERT encodes input sentences with positional encodings, enabling parallel processing.
- The encoder consists of 12 blocks, each containing a multi-headed self-attention mechanism and a dense layer.
- The attention mechanism helps establish relationships between words within the sentence, enhancing the model's understanding capabilities.
ASAP Automatic Essay Grading Dataset
- Commonly used dataset for automatic essay grading: Automated Students Assessment Prize (ASAP) AEG dataset
- Comprises nearly 13,000 essays in response to 8 different essay prompts
- Data is freely available on Kaggle (https://www.kaggle.com/c/asap-aes/data)
Dataset Statistics:
- Consists of 8 essay sets (prompts), each corresponding to one question
- Originally authored by students in grades 7 through 10
- Essays scored based on four points: ideas, style, organization, and conventions
Table 1: Data Analysis:
- Shows the different prompts, number of essays, average length, and score range
- Prompt | No. of Essays | Avg Length | Score Range
- 1 | 1783 | 350 | 2-12
- 2 | 1800 | 350 | 1-6
- 3 | 1726 | 150 | 0-3
- 4 | 1772 | 150 | 0-3
- 5 | 1805 | 150 | 0-4
- 6 | 1800 | 150 | 0-4
- 7 | 1569 | 250 | 0-30
- 8 | 723 | 650 | 0-60
Essay Types:
- The four types of essays present are persuasive, narrative, expository, and source-dependent responses
- Scores given by raters belonging to domain one were added as the final score
- Dataset divided in a 4:1 ratio into training and testing sets, respectively.
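A minimal loading-and-splitting sketch for the ASAP data, assuming the public Kaggle release file training_set_rel3.tsv with its standard essay, essay_set, and domain1_score columns; the stratified 4:1 split mirrors the division described above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumes the public Kaggle file "training_set_rel3.tsv" and its standard columns.
df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

essays = df["essay"]            # essay text
scores = df["domain1_score"]    # domain-one score used as the final grade
prompts = df["essay_set"]       # prompt identifier (1-8)

# 4:1 split for training and testing, stratified by prompt so every essay set appears in both.
X_train, X_test, y_train, y_test = train_test_split(
    essays, scores, test_size=0.2, random_state=42, stratify=prompts
)
```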
Comparative Analysis Models Tested
- Evaluated CDLN, BERT, LSTM, RNN, ANN, and SVM models for comparison purposes.
Automatic Grading System using TF-IDF Vectorization and SVM Classification
Approach:
- Treated essay grading as a classification task
- Maximum mark: 60, minimum mark: 0
- Divided marks into 61 classes
- Used TF-IDF vectorisation with l2 normalisation for essay representation
TF-IDF Calculation:
- tfidf(w,e) = tf(w,e) × idf(w) (Equation 4)
- tf(w,e): term frequency of word w in document e
- idf(w): inverse document frequency of word w across all essays
- Smoothed the tf-idf by adding a constant n to the denominator (Equation 5)
Model Training:
- Used a multi-class SVM with a Gaussian (RBF) kernel for model training
- Trained on the essay set and evaluated with the SVM classifier, as sketched below. [13]
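A minimal scikit-learn sketch of this pipeline: TF-IDF with l2 normalisation and smoothed idf feeding a multi-class SVM with a Gaussian (RBF) kernel over the 61 score classes; hyperparameters are library defaults rather than the paper's settings, and X_train/y_train are assumed to be the essay strings and integer scores from the split above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# TF-IDF with l2 normalisation and smoothed idf, feeding a multi-class SVM with an RBF (Gaussian)
# kernel. SVC handles the 61 score classes (0-60) with a one-vs-one scheme internally.
model = make_pipeline(
    TfidfVectorizer(norm="l2", smooth_idf=True),
    SVC(kernel="rbf"),
)

model.fit(X_train, y_train)              # X_train: essay strings, y_train: integer scores
predicted_scores = model.predict(X_test)
```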
Automatic Essay Grading Model
- Standard bert-base-cased model used:
- 12 encoder blocks
- Consideration of word case
- Purposefully chosen for grading essays:
- Word case provides important information (beginning of sentence, proper nouns)
- Limitations:
- Cannot accept inputs greater than 512 tokens
- Eliminated essays longer than 500 words
Bert Model Architecture:
- Each encoder block contains:
- Self-attention layer
- Followed by a feedforward layer
- Attention layer:
- Multi-headed self-attention with 8 heads
- Stores information about sentence's words
- Feedforward network maintains output vector dimension and sends it to next encoder block
- Positional encodings combined with the BERT embeddings enable parallel processing.
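A hedged sketch of how bert-base-cased can be fine-tuned for essay scoring with Hugging Face Transformers, using a single-output regression head; the paper does not specify its exact head or training loop, so the snippet below is illustrative only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# bert-base-cased keeps word case; its 512-token limit is why essays over 500 words were dropped.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=1   # single output treated as a regression target (the essay score)
)

essay = "The author supports the claim with three distinct pieces of evidence ..."
inputs = tokenizer(essay, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():                  # fine-tuning on (essay, score) pairs would precede real use
    predicted_score = model(**inputs).logits.squeeze().item()
```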
Collaborative Learning Methods
Definition: Collaborative learning is a method of learning where pupils work in groups on separate tasks contributing to a common outcome or on shared tasks.
Transfer Learning:
- Instead of a single deep neural network, learning can be distributed among several networks
- Their collective knowledge can be shared
Collaborative Deep Learning Network (CDLN)
- Architecture:
- Recursive Neural Network (RvNN)
- Convolutional Neural Network (CNN)
- Long Short Term Memory (LSTM)
- Dense neural network
Convolutional Neural Network (CNN)
- Understands the idea conveyed in sentences using:
- Convolution and average pooling layers
- Word2Vec embeddings of 100 dimensions each
- Kernel sizes: 1x105x8 for convolution, 1x90x8 for average pooling
- Repeated 5 times
- Helps analyze the essay and brings out the idea conveyed in sentences
Recursive Neural Network (RvNN)
- Understands the structure of sentences
- Divides words into bigrams
- Representation vectors are fed into a neural network with:
- 200 neurons in the first layer
- 4 layers of 150 neurons each
- Output layer of 100 neurons to match word embedding dimension
- Helps check essays for grammatical and sentence construction errors
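A minimal PyTorch sketch of the recursive bigram composition described above; the layer widths follow the text (200, then four layers of 150, then 100), while the activation functions and the toy sentence are assumptions.

```python
import torch
import torch.nn as nn

class BigramComposer(nn.Module):
    """Composes two 100-dim word/phrase vectors into one 100-dim vector (layers: 200 -> 150 x 4 -> 100)."""
    def __init__(self, dim=100):
        super().__init__()
        layers = [nn.Linear(2 * dim, 200), nn.ReLU()]
        width = 200
        for _ in range(4):                      # four hidden layers of 150 neurons each
            layers += [nn.Linear(width, 150), nn.ReLU()]
            width = 150
        layers.append(nn.Linear(width, dim))    # back to the word-embedding dimension
        self.net = nn.Sequential(*layers)

    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=-1))

def sentence_embedding(word_vectors, composer):
    """Recursively merges adjacent (bigram) vectors until one sentence vector remains."""
    vecs = list(word_vectors)
    while len(vecs) > 1:
        vecs = [composer(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return vecs[0]

composer = BigramComposer()
words = [torch.randn(100) for _ in range(6)]     # toy sentence of six 100-dim word embeddings
sent_vec = sentence_embedding(words, composer)   # 100-dim sentence representation
```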
Long Short Term Memory (LSTM)
- Takes as input the concatenation of the RvNN and CNN output vectors
- Output vector is 1x10000 dimensions
- Gathers learnings of previous deep learning networks
- Stores knowledge about sentence structure and ideas conveyed in essays
- Information is forwarded to next layers for essay grading
Dense Layer and Output Layer:
- Dense layer takes input from LSTM output
- 5 hidden layers with 120 neurons each
- Last output layer gives the essay grade.
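A hedged PyTorch sketch of how the CDLN components could be wired together: per-sentence RvNN and CNN vectors are concatenated, summarised by an LSTM, and passed through five dense layers of 120 neurons to produce the grade; the LSTM width, dropout rate, and toy inputs are illustrative, not the paper's exact shapes.

```python
import torch
import torch.nn as nn

class CDLNHead(nn.Module):
    """Illustrative wiring of the CDLN: concat(RvNN, CNN) -> LSTM -> 5 dense layers -> score."""
    def __init__(self, rvnn_dim=100, cnn_dim=100, lstm_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(rvnn_dim + cnn_dim, lstm_dim, batch_first=True)
        dense, width = [], lstm_dim
        for _ in range(5):                      # five hidden layers of 120 neurons each
            dense += [nn.Linear(width, 120), nn.ReLU(), nn.Dropout(0.2)]
            width = 120
        dense.append(nn.Linear(width, 1))       # final layer outputs the essay grade
        self.dense = nn.Sequential(*dense)

    def forward(self, rvnn_feats, cnn_feats):
        # rvnn_feats, cnn_feats: (batch, num_sentences, dim) per-sentence representations
        x = torch.cat([rvnn_feats, cnn_feats], dim=-1)
        _, (h_n, _) = self.lstm(x)              # final hidden state summarises the essay
        return self.dense(h_n[-1]).squeeze(-1)

model = CDLNHead()
rvnn_feats = torch.randn(4, 20, 100)            # toy batch: 4 essays, 20 sentences, 100-dim RvNN vectors
cnn_feats = torch.randn(4, 20, 100)             # matching 100-dim CNN vectors per sentence
scores = model(rvnn_feats, cnn_feats)           # one predicted grade per essay
```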
Experiments Conducted on Six Models
- Models: CDLN, BERT, LSTM, RNN, ANN, SVM
- Detailed architecture outlined in previous section
- Learning rate: 0.0001, batch size for training: 32
- Epochs: 15 (CDLN, BERT, RNN, ANN), 8 (LSTM), 6 (SVM)
- Dropouts utilized in deep learning models to prevent overfitting
- Eight-fold cross-validation during training for all models
Evaluation Metrics
- Mean Square Error (MSE)
- Pearson’s Correlation Coefficient (PCC): Measures linear correlation between two variables (ranges from -1 to 1; 1 indicates perfect positive correlation, -1 perfect negative correlation, and 0 no linear correlation). [16]
- Quadratic Weighted Kappa (QWK): Most common metric for AEG system performance evaluation, used in our evaluation. [17]
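The three metrics can be computed with standard libraries, as in this short sketch (toy scores shown for illustration):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, mean_squared_error

y_true = np.array([8, 9, 7, 10, 6])   # human-assigned scores (toy values)
y_pred = np.array([8, 8, 7, 9, 6])    # model-assigned scores (toy values)

mse = mean_squared_error(y_true, y_pred)                       # Mean Square Error
pcc, _ = pearsonr(y_true, y_pred)                              # Pearson's Correlation Coefficient
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")   # Quadratic Weighted Kappa
```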
Comparison of Machine and Deep Learning Models for Automatic Essay Grading
Models Compared:
- CDLN model (proposed)
- LSTM
- RNN
- ANN
- SVM
- Baseline models from [20]: TDNN(ALL), CNN-LSTM, CNN-LSTM-ATT, and 2L-LSTM
Comparison Metrics:
- Accuracy (Accu.)
- Pearson Correlation Coefficient (PCC)
- Quadratic Weighted Kappa (QWK)
Results:
- CDLN model outperforms other models and baseline models on each prompt and overall.
- Sharing of knowledge between CNN, RvNN, and LSTM boosts results.
- Self-attention mechanism in BERT leads to better results than TDNN(Sem+Synt).
Performance Comparison Tables:
- Table 2: Prompt-wise performance comparison of CDLN model and other baseline models
- Table 3: Overall average performance comparison of CDLN model and other baseline models
Additional Observations:
- The original essays' grades are compared with the paraphrased ones in Figure 3.
Robustness Check of CDLN Model for Automatic Essay Grading
Testing the Robustness of CDLN Model:
- Conducted a robustness check by rephrasing 1000 random essays using Quillbot, a paraphrasing tool
- Graded the original and modified essays using the CDLN model
- Compared the grades before and after paraphrasing
Results:
- Grades given by the model were very close before and after paraphrasing
- Average marks for the modified essays were slightly higher than for the original essays
- This may be due to proper sentence structure added by Quillbot and removal of grammatical errors in original essays
Difference Between Grades:
- Measured the difference using the mean squared error formula:
  - Δ = (1/N) Σᵢ (g_original,i − g_modified,i)²
  - g_original: marks graded by the model for the original essays
  - g_modified: marks graded for the paraphrased essays
  - N: number of essays compared
- Resulted in a value of 0.34, indicating the grades are very close to each other and the model is quite robust.
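A short sketch of this Δ computation (toy grade values; the paper uses 1000 essays and reports Δ = 0.34):

```python
import numpy as np

# CDLN grades before and after paraphrasing (toy values; the paper samples 1000 essays).
g_original = np.array([10.0, 8.0, 9.0, 7.0])
g_modified = np.array([10.5, 8.0, 9.5, 7.5])

delta = np.mean((g_original - g_modified) ** 2)   # Δ = (1/N) Σ (g_original − g_modified)²
```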
Main Focus:
- Demonstrates effectiveness of collaborative learning in automatic essay grading (AEG)
- Multiple networks working together to analyze essay features
- Performance improvement shown through results
Key Achievements:
- Model outperformed pretrained models
- Surpassed state-of-the-art AEG systems
- Successfully implemented collaborative learning approach
Current Limitations:
- Provides only holistic essay scores
- Does not offer paragraph-level scoring
- Room for performance improvement
Future Opportunities:
- Integration with newer deep neural networks
- Potential use of new pretrained networks
- Possibility of paragraph-wise scoring implementation
- Scope for further research and investigation