Removal of all whitespace during PDF conversion #120

Open
tsmdt opened this issue Dec 18, 2024 · 3 comments
Labels: bug (Something isn't working) · open for contribution (Invites open-source developers to contribute to the project.)


tsmdt commented Dec 18, 2024

For one of my test PDFs, markitdown removes all whitespace during conversion. The PDF can be found here: https://aclanthology.org/2024.eacl-long.5.pdf

I ran the example code in a Jupyter notebook (Python 3.12.8) like this:

from markitdown import MarkItDown  # import needed for the snippet to run

md = MarkItDown()
result = md.convert('leak_cheat.pdf')
print(result.text_content)

The result looks like this (the head of the paper is preserved, but all whitespace is removed from the body):

67
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics
Volume 1: Long Papers, pages 67–93
March 17-22, 2024 c(cid:13)2024 Association for Computational Linguistics

Leak,Cheat,Repeat:DataContaminationandEvaluationMalpracticesinClosed-SourceLLMsSimoneBalloccuPatríciaSchmidtováMateuszLangoOndˇrejDušekCharlesUniversity,FacultyofMathematicsandPhysicsInstituteofFormalandAppliedLinguisticsPrague,CzechRepublic{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.czAbstractNaturalLanguageProcessing(NLP)researchisincreasinglyfocusingontheuseofLargeLanguageModels(LLMs),withsomeofthemostpopularonesbeingeitherfullyorpartiallyclosed-source.Thelackofaccesstomodeldetails,especiallyregardingtrainingdata,hasrepeatedlyraisedconcernsaboutdatacontam-inationamongresearchers.Severalattemptshavebeenmadetoaddressthisissue,buttheyarelimitedtoanecdotalevidenceandtrialanderror.Additionally,theyoverlooktheprob-lemofindirectdataleaking,wheremodelsareiterativelyimprovedbyusingdatacom-ingfromusers.Inthiswork,weconductthefirstsystematicanalysisofworkusingOpe-nAI’sGPT-3.5andGPT-4,themostpromi-nentlyusedLLMstoday,inthecontextofdatacontamination.Byanalysing255papersandconsideringOpenAI’sdatausagepolicy,weex-tensivelydocumenttheamountofdataleakedtothesemodelsduringthefirstyearafterthemodel’srelease.Wereportthatthesemodelshavebeengloballyexposedto∼4.7Msamplesfrom263benchmarks.Atthesametime,wedocumentanumberofevaluationmalpracticesemerginginthereviewedpapers,suchasun-fairormissingbaselinecomparisonsandrepro-ducibilityissues.Wereleaseourresultsasacol-laborativeprojectonhttps://leak-llm.github.io/,whereotherresearcherscancontributetoourefforts.1IntroductionTherecentemergenceoflargelanguagemodels(LLMs),thatshowremarkableperformanceonawiderangeoftasks,haslednotonlytoadramaticincreaseintheiruseinresearchbutalsotoagrow-ingnumberofcompaniesjoiningtheraceforthebiggestandmostpowerfulmodels.Inpursuingacompetitiveadvantage,manypopularLLMsto-dayarelockedbehindAPIaccessandtheirde-tailsareunknown(OpenAI,2023;Thoppilanetal.,2022;Touvronetal.,2023).Thisincludesmodelweights(OpenAI,2023),trainingdata(Piktusetal.,2023),orinfrastructuraldetailstoassessmodelcar-bonfootprint(Lacosteetal.,2019).Inparticular,thelackofinformationontrainingdataraisesimportantquestionsaboutthecredibilityofLLMsperformanceevaluation.Thedatafromwhichthesemodelslearn,typicallycollectedau-tomaticallybyscrapingdocumentsfromtheweb,maycontaintraining,validation,and–mostcrit-ically–testsetscomingfromNLPbenchmarks.Becauseofthis,researchersandstakeholdersmaylaterinadvertentlyevaluateLLMsonthesamedatatheyweretrainedon.Thisphenomenon,knownasdatacontamination,maynotbeanissueinthegeneraluseofcommercialLLMs,whereadherencetoresearchprinciplesisnotmandatory,butitbe-comesaseriousproblemwhenthesemodelsarewidelyusedandevaluatedinresearch.Unfortunately,manyproprietarymodelsarelockedbehindinference-onlyAPIs,makingithardtoinspectdatacontamination.Becauseofthis,ex-istingworkonthemattermostlyfocusesondetect-ingextremeformsofoverfittingandmemorization,suchasthemodel’sabilitytogeneratebenchmarksverbatim.TheseapproachesarenotonlylimitedbutalsoneglectthatrecentproprietaryLLMsgetiterativelyimprovedfromuserinteractions.Ifsuchinteractionsinvolvebenchmarkdata(forexamplewhenresearchersevaluateLLMsagainstbaselines),themodelmay,infact,becomecontaminatedevenifitwascontamination-freeduringitsinitialtrain-ing.Werefertothisphenomenonasindirectdataleaking.Inthispaper,weaddresstheissueofindirectdatacontaminationinclosed-source1LLMsbycon-ductingasystematicliteraturereview.Wereview255papersandcarefullydetaildataleakageemerg-ingfromthem.Wefocusprimarilyonthemodels1Inthispaperweusetheterms“proprietary”and“closed-source”interchangeablytorefertothesemodels.68
domaintextenrichedbytextualinstructionsleadstoanincreaseinthemodelperformanceevenifgoldlabelsarenotshowntothemodel.ThissetupperfectlymatchesthekindofdatashowntochatLLMswhenevaluatedbyresearchers.Thismeansthatclosed-sourceLLMssuchasGPT-3.5andGPT-4canmakeuseofthesegoldstandardexamplesfromwidelyusedNLPbenchmarkstogainanunfairadvantageoverothermodels.Wealsopointoutthatrecentwork(Aiyappaetal.,2023)showedthataftermodelupdates,Chat-GPTperformanceimprovedonbenchmarkstowhichitwaspreviouslyexposed(Zhangetal.,2022).Withthesemotivations,weconductasystematicreviewtoquantifyhowmuchofsuchdatathemodelspoweringChatGPTcouldhaveobtained.4MethodologyFollowingthestandardsystematicreviewproto-colfromthemedicaldomain(Khanetal.,2003),weanalysetheexistingworkonLLMsevaluationtoinspecttheissueofindirectdatacontaminationandotherevaluationmalpractices.WefocusonOpenAI’sGPT-3.5andGPT-4models,astheyarethemostprominentlyusedinrecentNLPresearch.Weorganizeourworkintofivemacro-steps,corre-spondingtothefollowingsubsections.4.1FramingquestionsInreviewingtheexistingworkevaluatingtheper-formaceofGPT-3.5andGPT-4,weposethefol-lowingresearchquestions:(1)WhichdatasetshavebeendemonstrablyleakedtoGPT-3.5andGPT-4duringthelastyear?70

...

Other PDFs that I've tested work fine.
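To narrow down whether the whitespace is already missing in pdfminer's own output (markitdown reportedly relies on pdfminer for PDF extraction, as the comments below suggest), here is a minimal sketch that runs pdfminer.six directly on the same file:

# Sketch: extract text with pdfminer.six directly (pip install pdfminer.six)
# to see whether the missing whitespace originates in pdfminer or in markitdown.
from pdfminer.high_level import extract_text

text = extract_text('leak_cheat.pdf')
print(text[:500])

If this output is also missing its spaces, the bug sits in pdfminer's layout analysis rather than in markitdown itself.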

@SigireddyBalasai (Contributor)

I also tested it and got the same problem. On analysis, I found that pdfminer cannot distinguish between the words; I think someone with experience with pdfminer could fix it.
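If pdfminer is misjudging word boundaries, one knob worth experimenting with is LAParams.word_margin: pdfminer inserts a space when the horizontal gap between two characters exceeds word_margin times the character size. A minimal sketch, assuming pdfminer.six; the value used is an illustrative guess, not a tested fix for this PDF:

# Sketch: tune pdfminer's word-boundary heuristic via LAParams.
# Lowering word_margin below its default of 0.1 makes pdfminer insert
# spaces for smaller gaps between characters. 0.05 is a guess.
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

text = extract_text('leak_cheat.pdf', laparams=LAParams(word_margin=0.05))
print(text[:500])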

@Viddesh1

Hello,

The problem is probably caused by the emoji PNG at the start of the file.

When I converted the PDF file to .docx, it worked fine both with and without the emoji. See below:

![](data:image/png;base64...)Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

# Simone Balloccu Patrícia Schmidtová Mateusz Lango Ondrˇej Dušek

Charles University, Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics

Prague, Czech Republic

{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz

Also, after removing (deleting) the emoji PNG at the start and converting back to a PDF file, it worked fine. See below:

Leak, Cheat, Repeat:  Data Contamination

and Evaluation Malpractices in Closed-Source LLMs

Simone Balloccu  Patrícia Schmidtová  Mateusz Lango  Ondrˇej Dušek
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Prague, Czech Republic
{balloccu,schmidtova,lango,odusek}@ufal.mff.cuni.cz

The pdfminer Python package is probably unable to handle the embedded PNG/image files.
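One way to check that hypothesis is to list the image objects pdfminer sees on the first page. A minimal sketch, assuming pdfminer.six; iter_images is a hypothetical helper written for this example:

# Sketch: list image objects pdfminer finds on the first page,
# to check whether the leading PNG shows up in the layout.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage

def iter_images(objs):
    for obj in objs:
        if isinstance(obj, LTImage):
            yield obj
        elif isinstance(obj, LTFigure):  # images are often nested in figures
            yield from iter_images(obj)

for page in extract_pages('leak_cheat.pdf', maxpages=1):
    for img in iter_images(page):
        print(img.name, img.bbox)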

Regards!
Viddesh

@gagb added the "bug" (Something isn't working) and "open for contribution" (Invites open-source developers to contribute to the project.) labels on Dec 19, 2024
tsmdt (Author) commented Dec 20, 2024

Thanks for testing this out!
