language:
- en
widget: - text: theft 3
- text: forgery
- text: unlawful possession short-barreled shotgun
- text: criminal trespass 2nd degree
- text: eluding a police vehicle
- text: upcs synthetic narcotic
ROTA
Rapid Offense Text Autocoder
Criminal justice research often requires conversion of free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense of "eluding a police vehicle" would be coded to a charge category of "Obstruction - Law Enforcement". Since free-text offense descriptions aren't standardized and often need to be categorized in large volumes, this can result in a manual and time intensive process for researchers. ROTA is a machine learning model for converting offense text into offense codes.
Currently ROTA predicts the Charge Category of a given offense text. A charge category is one of the headings for offense codes in the 2009 NCRP Codebook: Appendix F.
The model was trained on publicly available data from a crosswalk containing offenses from all 50 states combined with three additional hand-labeled offense text datasets.
Data Preprocessing
The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand the text to a more full, correct English text. Some data-specific prefixes and suffixes are then removed from the text -- e.g. some states included a statute as a part of the text. Finally, punctuation (excluding dollar signs) are removed from the input, multiple spaces between words are removed, and the text is lowercased.
Cross-Validation Performance
This model was evaluated using 3-fold cross validation. Except where noted, numbers presented below are the mean value across the 3 folds.
The model in this repository is trained on all available data. Because of this, you can typically expect production performance to be (unknowably) better than the numbers presented below.
Overall Metrics
Metric | Value |
---|---|
Accuracy | 0.934 |
MCC | 0.931 |
Metric | precision | recall | f1-score |
---|---|---|---|
macro avg | 0.811 | 0.786 | 0.794 |
Note: These are the average of the values per fold, so macro avg is the average of the macro average of all categories per fold.
Per-Category Metrics
Category | precision | recall | f1-score | support |
---|---|---|---|---|
AGGRAVATED ASSAULT | 0.954 | 0.954 | 0.954 | 4085 |
ARMED ROBBERY | 0.961 | 0.955 | 0.958 | 1021 |
ARSON | 0.946 | 0.954 | 0.95 | 344 |
ASSAULTING PUBLIC OFFICER | 0.914 | 0.905 | 0.909 | 588 |
AUTO THEFT | 0.962 | 0.962 | 0.962 | 1660 |
BLACKMAIL/EXTORTION/INTIMIDATION | 0.872 | 0.871 | 0.872 | 627 |
BRIBERY AND CONFLICT OF INTEREST | 0.784 | 0.796 | 0.79 | 216 |
BURGLARY | 0.979 | 0.981 | 0.98 | 2214 |
CHILD ABUSE | 0.805 | 0.78 | 0.792 | 139 |
COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED | 0.827 | 0.815 | 0.821 | 47 |
COMMERCIALIZED VICE | 0.818 | 0.788 | 0.802 | 666 |
CONTEMPT OF COURT | 0.982 | 0.987 | 0.984 | 2952 |
CONTRIBUTING TO DELINQUENCY OF A MINOR | 0.544 | 0.333 | 0.392 | 50 |
CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED | 0.864 | 0.791 | 0.826 | 280 |
COUNTERFEITING (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
DESTRUCTION OF PROPERTY | 0.97 | 0.968 | 0.969 | 2560 |
DRIVING UNDER INFLUENCE - DRUGS | 0.567 | 0.603 | 0.581 | 34 |
DRIVING UNDER THE INFLUENCE | 0.951 | 0.946 | 0.949 | 2195 |
DRIVING WHILE INTOXICATED | 0.986 | 0.981 | 0.984 | 2391 |
DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED | 0.903 | 0.911 | 0.907 | 3100 |
DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT | 0.856 | 0.861 | 0.858 | 380 |
EMBEZZLEMENT | 0.865 | 0.759 | 0.809 | 100 |
EMBEZZLEMENT (FEDERAL ONLY) | 0 | 0 | 0 | 1 |
ESCAPE FROM CUSTODY | 0.988 | 0.991 | 0.989 | 4035 |
FAMILY RELATED OFFENSES | 0.739 | 0.773 | 0.755 | 442 |
FELONY - UNSPECIFIED | 0.692 | 0.735 | 0.712 | 122 |
FLIGHT TO AVOID PROSECUTION | 0.46 | 0.407 | 0.425 | 38 |
FORCIBLE SODOMY | 0.82 | 0.8 | 0.809 | 76 |
FORGERY (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
FORGERY/FRAUD | 0.911 | 0.928 | 0.919 | 4687 |
FRAUD (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
GRAND LARCENY - THEFT OVER $200 | 0.957 | 0.973 | 0.965 | 2412 |
HABITUAL OFFENDER | 0.742 | 0.627 | 0.679 | 53 |
HEROIN VIOLATION - OFFENSE UNSPECIFIED | 0.879 | 0.811 | 0.843 | 24 |
HIT AND RUN DRIVING | 0.922 | 0.94 | 0.931 | 303 |
HIT/RUN DRIVING - PROPERTY DAMAGE | 0.929 | 0.918 | 0.923 | 362 |
IMMIGRATION VIOLATIONS | 0.84 | 0.609 | 0.697 | 19 |
INVASION OF PRIVACY | 0.927 | 0.923 | 0.925 | 1235 |
JUVENILE OFFENSES | 0.928 | 0.866 | 0.895 | 144 |
KIDNAPPING | 0.937 | 0.93 | 0.933 | 553 |
LARCENY/THEFT - VALUE UNKNOWN | 0.955 | 0.945 | 0.95 | 3175 |
LEWD ACT WITH CHILDREN | 0.775 | 0.85 | 0.811 | 596 |
LIQUOR LAW VIOLATIONS | 0.741 | 0.768 | 0.755 | 214 |
MANSLAUGHTER - NON-VEHICULAR | 0.626 | 0.802 | 0.701 | 139 |
MANSLAUGHTER - VEHICULAR | 0.79 | 0.853 | 0.819 | 117 |
MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED | 0.741 | 0.662 | 0.699 | 62 |
MISDEMEANOR UNSPECIFIED | 0.63 | 0.243 | 0.347 | 57 |
MORALS/DECENCY - OFFENSE | 0.774 | 0.764 | 0.769 | 412 |
MURDER | 0.965 | 0.915 | 0.939 | 621 |
OBSTRUCTION - LAW ENFORCEMENT | 0.939 | 0.947 | 0.943 | 4220 |
OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881 | 0.895 | 0.888 | 1965 |
PAROLE VIOLATION | 0.97 | 0.953 | 0.962 | 946 |
PETTY LARCENY - THEFT UNDER $200 | 0.965 | 0.761 | 0.85 | 139 |
POSSESSION/USE - COCAINE OR CRACK | 0.893 | 0.928 | 0.908 | 68 |
POSSESSION/USE - DRUG UNSPECIFIED | 0.624 | 0.535 | 0.572 | 189 |
POSSESSION/USE - HEROIN | 0.884 | 0.852 | 0.866 | 25 |
POSSESSION/USE - MARIJUANA/HASHISH | 0.977 | 0.97 | 0.973 | 556 |
POSSESSION/USE - OTHER CONTROLLED SUBSTANCES | 0.975 | 0.965 | 0.97 | 3271 |
PROBATION VIOLATION | 0.963 | 0.953 | 0.958 | 1158 |
PROPERTY OFFENSES - OTHER | 0.901 | 0.87 | 0.885 | 446 |
PUBLIC ORDER OFFENSES - OTHER | 0.7 | 0.721 | 0.71 | 1871 |
RACKETEERING/EXTORTION (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
RAPE - FORCE | 0.842 | 0.873 | 0.857 | 641 |
RAPE - STATUTORY - NO FORCE | 0.707 | 0.55 | 0.611 | 140 |
REGULATORY OFFENSES (FEDERAL ONLY) | 0.847 | 0.567 | 0.674 | 70 |
RIOTING | 0.784 | 0.605 | 0.68 | 119 |
SEXUAL ASSAULT - OTHER | 0.836 | 0.836 | 0.836 | 971 |
SIMPLE ASSAULT | 0.976 | 0.967 | 0.972 | 4577 |
STOLEN PROPERTY - RECEIVING | 0.959 | 0.957 | 0.958 | 1193 |
STOLEN PROPERTY - TRAFFICKING | 0.902 | 0.888 | 0.895 | 491 |
TAX LAW (FEDERAL ONLY) | 0.373 | 0.233 | 0.286 | 30 |
TRAFFIC OFFENSES - MINOR | 0.974 | 0.977 | 0.976 | 8699 |
TRAFFICKING - COCAINE OR CRACK | 0.896 | 0.951 | 0.922 | 185 |
TRAFFICKING - DRUG UNSPECIFIED | 0.709 | 0.795 | 0.749 | 516 |
TRAFFICKING - HEROIN | 0.871 | 0.92 | 0.894 | 54 |
TRAFFICKING - OTHER CONTROLLED SUBSTANCES | 0.963 | 0.954 | 0.959 | 2832 |
TRAFFICKING MARIJUANA/HASHISH | 0.921 | 0.943 | 0.932 | 255 |
TRESPASSING | 0.974 | 0.98 | 0.977 | 1916 |
UNARMED ROBBERY | 0.941 | 0.939 | 0.94 | 377 |
UNAUTHORIZED USE OF VEHICLE | 0.94 | 0.908 | 0.924 | 304 |
UNSPECIFIED HOMICIDE | 0.61 | 0.554 | 0.577 | 60 |
VIOLENT OFFENSES - OTHER | 0.827 | 0.817 | 0.822 | 606 |
VOLUNTARY/NONNEGLIGENT MANSLAUGHTER | 0.619 | 0.513 | 0.542 | 54 |
WEAPON OFFENSE | 0.943 | 0.949 | 0.946 | 2466 |
Note: support
is the average number of observations predicted on per fold, so the total number of observations per class is roughly 3x support
.
Using Confidence Scores
If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model isn't as confident about. We applied this process in 3-fold cross validation. The numbers presented below indicate how much of the prediction data is retained given a confidence score cutoff of p
. We present the overall accuracy and MCC metrics as if the model was only evaluated on this subset of confident predictions.
cutoff | percent retained | mcc | acc | |
---|---|---|---|---|
0 | 0.85 | 0.952 | 0.96 | 0.961 |
1 | 0.9 | 0.943 | 0.964 | 0.965 |
2 | 0.95 | 0.928 | 0.97 | 0.971 |
3 | 0.975 | 0.912 | 0.975 | 0.976 |
4 | 0.99 | 0.886 | 0.982 | 0.983 |
5 | 0.999 | 0.733 | 0.995 | 0.996 |