language | widget | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Criminal justice research often requires conversion of free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense of "eluding a police vehicle" would be coded to a charge category of "Obstruction - Law Enforcement". Since free-text offense descriptions aren't standardized and often need to be categorized in large volumes, this can result in a manual and time intensive process for researchers. ROTA is a machine learning model for converting offense text into offense codes.
Currently ROTA predicts the Charge Category of a given offense text. A charge category is one of the headings for offense codes in the 2009 NCRP Codebook: Appendix F.
The model was trained on publicly available data from a crosswalk containing offenses from all 50 states combined with three additional hand-labeled offense text datasets.
The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand the text to a more full, correct English text. Some data-specific prefixes and suffixes are then removed from the text -- e.g. some states included a statute as a part of the text. Finally, punctuation (excluding dollar signs) are removed from the input, multiple spaces between words are removed, and the text is lowercased.
This model was evaluated using 3-fold cross validation. Except where noted, numbers presented below are the mean value across the 3 folds.
The model in this repository is trained on all available data. Because of this, you can typically expect production performance to be (unknowably) better than the numbers presented below.
Metric | Value |
---|---|
Accuracy | 0.934 |
MCC | 0.931 |
Metric | precision | recall | f1-score |
---|---|---|---|
macro avg | 0.811 | 0.786 | 0.794 |
Note: These are the average of the values per fold, so macro avg is the average of the macro average of all categories per fold.
Category | precision | recall | f1-score | support |
---|---|---|---|---|
AGGRAVATED ASSAULT | 0.954 | 0.954 | 0.954 | 4085 |
ARMED ROBBERY | 0.961 | 0.955 | 0.958 | 1021 |
ARSON | 0.946 | 0.954 | 0.95 | 344 |
ASSAULTING PUBLIC OFFICER | 0.914 | 0.905 | 0.909 | 588 |
AUTO THEFT | 0.962 | 0.962 | 0.962 | 1660 |
BLACKMAIL/EXTORTION/INTIMIDATION | 0.872 | 0.871 | 0.872 | 627 |
BRIBERY AND CONFLICT OF INTEREST | 0.784 | 0.796 | 0.79 | 216 |
BURGLARY | 0.979 | 0.981 | 0.98 | 2214 |
CHILD ABUSE | 0.805 | 0.78 | 0.792 | 139 |
COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED | 0.827 | 0.815 | 0.821 | 47 |
COMMERCIALIZED VICE | 0.818 | 0.788 | 0.802 | 666 |
CONTEMPT OF COURT | 0.982 | 0.987 | 0.984 | 2952 |
CONTRIBUTING TO DELINQUENCY OF A MINOR | 0.544 | 0.333 | 0.392 | 50 |
CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED | 0.864 | 0.791 | 0.826 | 280 |
COUNTERFEITING (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
DESTRUCTION OF PROPERTY | 0.97 | 0.968 | 0.969 | 2560 |
DRIVING UNDER INFLUENCE - DRUGS | 0.567 | 0.603 | 0.581 | 34 |
DRIVING UNDER THE INFLUENCE | 0.951 | 0.946 | 0.949 | 2195 |
DRIVING WHILE INTOXICATED | 0.986 | 0.981 | 0.984 | 2391 |
DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED | 0.903 | 0.911 | 0.907 | 3100 |
DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT | 0.856 | 0.861 | 0.858 | 380 |
EMBEZZLEMENT | 0.865 | 0.759 | 0.809 | 100 |
EMBEZZLEMENT (FEDERAL ONLY) | 0 | 0 | 0 | 1 |
ESCAPE FROM CUSTODY | 0.988 | 0.991 | 0.989 | 4035 |
FAMILY RELATED OFFENSES | 0.739 | 0.773 | 0.755 | 442 |
FELONY - UNSPECIFIED | 0.692 | 0.735 | 0.712 | 122 |
FLIGHT TO AVOID PROSECUTION | 0.46 | 0.407 | 0.425 | 38 |
FORCIBLE SODOMY | 0.82 | 0.8 | 0.809 | 76 |
FORGERY (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
FORGERY/FRAUD | 0.911 | 0.928 | 0.919 | 4687 |
FRAUD (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
GRAND LARCENY - THEFT OVER $200 | 0.957 | 0.973 | 0.965 | 2412 |
HABITUAL OFFENDER | 0.742 | 0.627 | 0.679 | 53 |
HEROIN VIOLATION - OFFENSE UNSPECIFIED | 0.879 | 0.811 | 0.843 | 24 |
HIT AND RUN DRIVING | 0.922 | 0.94 | 0.931 | 303 |
HIT/RUN DRIVING - PROPERTY DAMAGE | 0.929 | 0.918 | 0.923 | 362 |
IMMIGRATION VIOLATIONS | 0.84 | 0.609 | 0.697 | 19 |
INVASION OF PRIVACY | 0.927 | 0.923 | 0.925 | 1235 |
JUVENILE OFFENSES | 0.928 | 0.866 | 0.895 | 144 |
KIDNAPPING | 0.937 | 0.93 | 0.933 | 553 |
LARCENY/THEFT - VALUE UNKNOWN | 0.955 | 0.945 | 0.95 | 3175 |
LEWD ACT WITH CHILDREN | 0.775 | 0.85 | 0.811 | 596 |
LIQUOR LAW VIOLATIONS | 0.741 | 0.768 | 0.755 | 214 |
MANSLAUGHTER - NON-VEHICULAR | 0.626 | 0.802 | 0.701 | 139 |
MANSLAUGHTER - VEHICULAR | 0.79 | 0.853 | 0.819 | 117 |
MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED | 0.741 | 0.662 | 0.699 | 62 |
MISDEMEANOR UNSPECIFIED | 0.63 | 0.243 | 0.347 | 57 |
MORALS/DECENCY - OFFENSE | 0.774 | 0.764 | 0.769 | 412 |
MURDER | 0.965 | 0.915 | 0.939 | 621 |
OBSTRUCTION - LAW ENFORCEMENT | 0.939 | 0.947 | 0.943 | 4220 |
OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881 | 0.895 | 0.888 | 1965 |
PAROLE VIOLATION | 0.97 | 0.953 | 0.962 | 946 |
PETTY LARCENY - THEFT UNDER $200 | 0.965 | 0.761 | 0.85 | 139 |
POSSESSION/USE - COCAINE OR CRACK | 0.893 | 0.928 | 0.908 | 68 |
POSSESSION/USE - DRUG UNSPECIFIED | 0.624 | 0.535 | 0.572 | 189 |
POSSESSION/USE - HEROIN | 0.884 | 0.852 | 0.866 | 25 |
POSSESSION/USE - MARIJUANA/HASHISH | 0.977 | 0.97 | 0.973 | 556 |
POSSESSION/USE - OTHER CONTROLLED SUBSTANCES | 0.975 | 0.965 | 0.97 | 3271 |
PROBATION VIOLATION | 0.963 | 0.953 | 0.958 | 1158 |
PROPERTY OFFENSES - OTHER | 0.901 | 0.87 | 0.885 | 446 |
PUBLIC ORDER OFFENSES - OTHER | 0.7 | 0.721 | 0.71 | 1871 |
RACKETEERING/EXTORTION (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
RAPE - FORCE | 0.842 | 0.873 | 0.857 | 641 |
RAPE - STATUTORY - NO FORCE | 0.707 | 0.55 | 0.611 | 140 |
REGULATORY OFFENSES (FEDERAL ONLY) | 0.847 | 0.567 | 0.674 | 70 |
RIOTING | 0.784 | 0.605 | 0.68 | 119 |
SEXUAL ASSAULT - OTHER | 0.836 | 0.836 | 0.836 | 971 |
SIMPLE ASSAULT | 0.976 | 0.967 | 0.972 | 4577 |
STOLEN PROPERTY - RECEIVING | 0.959 | 0.957 | 0.958 | 1193 |
STOLEN PROPERTY - TRAFFICKING | 0.902 | 0.888 | 0.895 | 491 |
TAX LAW (FEDERAL ONLY) | 0.373 | 0.233 | 0.286 | 30 |
TRAFFIC OFFENSES - MINOR | 0.974 | 0.977 | 0.976 | 8699 |
TRAFFICKING - COCAINE OR CRACK | 0.896 | 0.951 | 0.922 | 185 |
TRAFFICKING - DRUG UNSPECIFIED | 0.709 | 0.795 | 0.749 | 516 |
TRAFFICKING - HEROIN | 0.871 | 0.92 | 0.894 | 54 |
TRAFFICKING - OTHER CONTROLLED SUBSTANCES | 0.963 | 0.954 | 0.959 | 2832 |
TRAFFICKING MARIJUANA/HASHISH | 0.921 | 0.943 | 0.932 | 255 |
TRESPASSING | 0.974 | 0.98 | 0.977 | 1916 |
UNARMED ROBBERY | 0.941 | 0.939 | 0.94 | 377 |
UNAUTHORIZED USE OF VEHICLE | 0.94 | 0.908 | 0.924 | 304 |
UNSPECIFIED HOMICIDE | 0.61 | 0.554 | 0.577 | 60 |
VIOLENT OFFENSES - OTHER | 0.827 | 0.817 | 0.822 | 606 |
VOLUNTARY/NONNEGLIGENT MANSLAUGHTER | 0.619 | 0.513 | 0.542 | 54 |
WEAPON OFFENSE | 0.943 | 0.949 | 0.946 | 2466 |
Note: support
is the average number of observations predicted on per fold, so the total number of observations per class is roughly 3x support
.
If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model isn't as confident about. We applied this process in 3-fold cross validation. The numbers presented below indicate how much of the prediction data is retained given a confidence score cutoff of p
. We present the overall accuracy and MCC metrics as if the model was only evaluated on this subset of confident predictions.
cutoff | percent retained | mcc | acc | |
---|---|---|---|---|
0 | 0.85 | 0.952 | 0.96 | 0.961 |
1 | 0.9 | 0.943 | 0.964 | 0.965 |
2 | 0.95 | 0.928 | 0.97 | 0.971 |
3 | 0.975 | 0.912 | 0.975 | 0.976 |
4 | 0.99 | 0.886 | 0.982 | 0.983 |
5 | 0.999 | 0.733 | 0.995 | 0.996 |