Removal of all whitespace during PDF conversion #120

Open
tsmdt opened this issue Dec 18, 2024 · 3 comments
Labels: bug (Something isn't working) · open for contribution (Invites open-source developers to contribute to the project.)


tsmdt commented Dec 18, 2024

For one of my test PDFs, markitdown removes all whitespace during conversion. The PDF can be found here: https://aclanthology.org/2024.eacl-long.5.pdf

I ran the example code in a Jupyter notebook (Python 3.12.8) like this:

from markitdown import MarkItDown  # import needed for the snippet to run

md = MarkItDown()
result = md.convert('leak_cheat.pdf')
print(result.text_content)

The result looks like this (the head of the paper is preserved, but all whitespace is removed from the body):

67
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics
Volume 1: Long Papers, pages 67–93
March 17-22, 2024 c(cid:13)2024 Association for Computational Linguistics

Leak,Cheat,Repeat:DataContaminationandEvaluationMalpracticesinClosed-SourceLLMsSimoneBalloccuPatríciaSchmidtováMateuszLangoOndˇrejDušekCharlesUniversity,FacultyofMathematicsandPhysicsInstituteofFormalandAppliedLinguisticsPrague,CzechRepublic{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.czAbstractNaturalLanguageProcessing(NLP)researchisincreasinglyfocusingontheuseofLargeLanguageModels(LLMs),withsomeofthemostpopularonesbeingeitherfullyorpartiallyclosed-source.Thelackofaccesstomodeldetails,especiallyregardingtrainingdata,hasrepeatedlyraisedconcernsaboutdatacontam-inationamongresearchers.Severalattemptshavebeenmadetoaddressthisissue,buttheyarelimitedtoanecdotalevidenceandtrialanderror.Additionally,theyoverlooktheprob-lemofindirectdataleaking,wheremodelsareiterativelyimprovedbyusingdatacom-ingfromusers.Inthiswork,weconductthefirstsystematicanalysisofworkusingOpe-nAI’sGPT-3.5andGPT-4,themostpromi-nentlyusedLLMstoday,inthecontextofdatacontamination.Byanalysing255papersandconsideringOpenAI’sdatausagepolicy,weex-tensivelydocumenttheamountofdataleakedtothesemodelsduringthefirstyearafterthemodel’srelease.Wereportthatthesemodelshavebeengloballyexposedto∼4.7Msamplesfrom263benchmarks.Atthesametime,wedocumentanumberofevaluationmalpracticesemerginginthereviewedpapers,suchasun-fairormissingbaselinecomparisonsandrepro-ducibilityissues.Wereleaseourresultsasacol-laborativeprojectonhttps://leak-llm.github.io/,whereotherresearcherscancontributetoourefforts.1IntroductionTherecentemergenceoflargelanguagemodels(LLMs),thatshowremarkableperformanceonawiderangeoftasks,haslednotonlytoadramaticincreaseintheiruseinresearchbutalsotoagrow-ingnumberofcompaniesjoiningtheraceforthebiggestandmostpowerfulmodels.Inpursuingacompetitiveadvantage,manypopularLLMsto-dayarelockedbehindAPIaccessandtheirde-tailsareunknown(OpenAI,2023;Thoppilanetal.,2022;Touvronetal.,2023).Thisincludesmodelweights(OpenAI,2023),trainingdata(Piktusetal.,2023),orinfrastructuraldetailstoassessmodelcar-bonfootprint(Lacosteetal.,2019).Inparticular,thelackofinformationontrainingdataraisesimportantquestionsaboutthecredibilityofLLMsperformanceevaluation.Thedatafromwhichthesemodelslearn,typicallycollectedau-tomaticallybyscrapingdocumentsfromtheweb,maycontaintraining,validation,and–mostcrit-ically–testsetscomingfromNLPbenchmarks.Becauseofthis,researchersandstakeholdersmaylaterinadvertentlyevaluateLLMsonthesamedatatheyweretrainedon.Thisphenomenon,knownasdatacontamination,maynotbeanissueinthegeneraluseofcommercialLLMs,whereadherencetoresearchprinciplesisnotmandatory,butitbe-comesaseriousproblemwhenthesemodelsarewidelyusedandevaluatedinresearch.Unfortunately,manyproprietarymodelsarelockedbehindinference-onlyAPIs,makingithardtoinspectdatacontamination.Becauseofthis,ex-istingworkonthemattermostlyfocusesondetect-ingextremeformsofoverfittingandmemorization,suchasthemodel’sabilitytogeneratebenchmarksverbatim.TheseapproachesarenotonlylimitedbutalsoneglectthatrecentproprietaryLLMsgetiterativelyimprovedfromuserinteractions.Ifsuchinteractionsinvolvebenchmarkdata(forexamplewhenresearchersevaluateLLMsagainstbaselines),themodelmay,infact,becomecontaminatedevenifitwascontamination-freeduringitsinitialtrain-ing.Werefertothisphenomenonasindirectdataleaking.Inthispaper,weaddresstheissueofindirectdatacontaminationinclosed-source1LLMsbycon-ductingasystematicliteraturereview.Wereview255papersandcarefullydetaildataleakageemerg-ingfromthem.Wefocusprimarilyonthemodels1Inthispaperweusetheterms“proprietary”and“closed-source”interchangeablytorefertothesemodels.68
domaintextenrichedbytextualinstructionsleadstoanincreaseinthemodelperformanceevenifgoldlabelsarenotshowntothemodel.ThissetupperfectlymatchesthekindofdatashowntochatLLMswhenevaluatedbyresearchers.Thismeansthatclosed-sourceLLMssuchasGPT-3.5andGPT-4canmakeuseofthesegoldstandardexamplesfromwidelyusedNLPbenchmarkstogainanunfairadvantageoverothermodels.Wealsopointoutthatrecentwork(Aiyappaetal.,2023)showedthataftermodelupdates,Chat-GPTperformanceimprovedonbenchmarkstowhichitwaspreviouslyexposed(Zhangetal.,2022).Withthesemotivations,weconductasystematicreviewtoquantifyhowmuchofsuchdatathemodelspoweringChatGPTcouldhaveobtained.4MethodologyFollowingthestandardsystematicreviewproto-colfromthemedicaldomain(Khanetal.,2003),weanalysetheexistingworkonLLMsevaluationtoinspecttheissueofindirectdatacontaminationandotherevaluationmalpractices.WefocusonOpenAI’sGPT-3.5andGPT-4models,astheyarethemostprominentlyusedinrecentNLPresearch.Weorganizeourworkintofivemacro-steps,corre-spondingtothefollowingsubsections.4.1FramingquestionsInreviewingtheexistingworkevaluatingtheper-formaceofGPT-3.5andGPT-4,weposethefol-lowingresearchquestions:(1)WhichdatasetshavebeendemonstrablyleakedtoGPT-3.5andGPT-4duringthelastyear?70

...

Other PDFs that I've tested work fine.
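To narrow down whether the whitespace is already missing in pdfminer's own output (markitdown reportedly relies on pdfminer for PDF extraction, as the comments below suggest), here is a minimal sketch that runs pdfminer.six directly on the same file:

# Sketch: extract text with pdfminer.six directly (pip install pdfminer.six)
# to see whether the missing whitespace originates in pdfminer or in markitdown.
from pdfminer.high_level import extract_text

text = extract_text('leak_cheat.pdf')
print(text[:500])

If this output is also missing its spaces, the bug sits in pdfminer's layout analysis rather than in markitdown itself.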

@SigireddyBalasai (Contributor)

I also tested it and got the same problem. On analysis, I found that pdfminer cannot distinguish between the words; I think someone with experience with pdfminer could fix it.
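If pdfminer is misjudging word boundaries, one knob worth experimenting with is LAParams.word_margin: pdfminer inserts a space when the horizontal gap between two characters exceeds word_margin times the character size. A minimal sketch, assuming pdfminer.six; the value used is an illustrative guess, not a tested fix for this PDF:

# Sketch: tune pdfminer's word-boundary heuristic via LAParams.
# Lowering word_margin below its default of 0.1 makes pdfminer insert
# spaces for smaller gaps between characters. 0.05 is a guess.
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

text = extract_text('leak_cheat.pdf', laparams=LAParams(word_margin=0.05))
print(text[:500])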

@Viddesh1

Hello,

The problem is probably caused by the emoji PNG at the start of the file.

When I converted the PDF file to .docx, it worked fine both with and without the emoji. See below:

![](data:image/png;base64...)Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

# Simone Balloccu Patrícia Schmidtová Mateusz Lango Ondrˇej Dušek

Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics

Prague, Czech Republic

{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz

Also, after removing (deleting) the emoji PNG at the start and converting back to a PDF file, it worked fine. See below:

Leak, Cheat, Repeat:  Data Contamination

and Evaluation Malpractices in Closed-Source LLMs

Simone Balloccu  Patrícia Schmidtová  Mateusz Lango  Ondrˇej Dušek
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Prague, Czech Republic
{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz

The pdfminer Python package is probably unable to handle the embedded PNG/image files.
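One way to check that hypothesis is to list the image objects pdfminer sees on the first page. A minimal sketch, assuming pdfminer.six; iter_images is a hypothetical helper written for this example:

# Sketch: list image objects pdfminer finds on the first page,
# to check whether the leading PNG shows up in the layout.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage

def iter_images(objs):
    for obj in objs:
        if isinstance(obj, LTImage):
            yield obj
        elif isinstance(obj, LTFigure):  # images are often nested in figures
            yield from iter_images(obj)

for page in extract_pages('leak_cheat.pdf', maxpages=1):
    for img in iter_images(page):
        print(img.name, img.bbox)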

Regards!
Viddesh

@gagb added the "bug" (Something isn't working) and "open for contribution" (Invites open-source developers to contribute to the project.) labels on Dec 19, 2024
tsmdt (Author) commented Dec 20, 2024

Thanks for testing this out!
