[SPARK-49181] Remove site/docs/{version}/api/python/_sources folders and util/build-error-docs.py #544

Merged
merged 2 commits into from
Aug 10, 2024

yaooqinn
Member

@yaooqinn yaooqinn commented Aug 9, 2024

This PR removes interim data under the _sources folder for each version listed below:

./site/docs/4.0.0-preview1/api/python/_sources
./site/docs/3.1.2/api/python/_sources
./site/docs/3.3.1/api/python/_sources
./site/docs/3.3.0/api/python/_sources
./site/docs/3.1.3/api/python/_sources
./site/docs/3.4.0/api/python/_sources
./site/docs/3.2.2/api/python/_sources
./site/docs/3.4.1/api/python/_sources
./site/docs/3.2.4/api/python/_sources
./site/docs/3.2.3/api/python/_sources
./site/docs/3.5.0/api/python/_sources
./site/docs/3.1.1/api/python/_sources
./site/docs/3.3.2/api/python/_sources
./site/docs/3.5.1/api/python/_sources
./site/docs/3.3.3/api/python/_sources
./site/docs/3.3.4/api/python/_sources
./site/docs/3.2.1/api/python/_sources
./site/docs/3.4.3/api/python/_sources
./site/docs/3.2.0/api/python/_sources
./site/docs/3.4.2/api/python/_sources

After removing them, dangling links such as
https://spark.apache.org/docs/3.5.1/api/python/_sources/user_guide/sql/index.rst.txt are no longer visible to end users.

We also remove util/build-error-docs.py from 4.0.0-preview1. It is a doc-generation tool, not something intended for users.

The main goal of this PR is to remove files that were published unnecessarily with the docs, which reduces the size of the spark-website repo.

@yaooqinn yaooqinn changed the title [SPARK-49181] RRemove site/docs/{version}/api/python/_sources f [SPARK-49181] RRemove site/docs/{version}/api/python/_sources folders Aug 9, 2024
@yaooqinn yaooqinn changed the title [SPARK-49181] RRemove site/docs/{version}/api/python/_sources folders [SPARK-49181] Remove site/docs/{version}/api/python/_sources folders Aug 9, 2024
@yaooqinn
Member Author

yaooqinn commented Aug 9, 2024

cc @HyukjinKwon @cloud-fan @dongjoon-hyun @srowen thanks

@srowen
Member

srowen commented Aug 9, 2024

The idea here is simply that these are unused and so deleting them helps a little bit with the space crunch?
That's OK if so.

How did you find them, out of curiosity?

@cloud-fan
Contributor

cloud-fan commented Aug 9, 2024

shall we update Spark scripts to clean these files after doc building?

@yaooqinn
Member Author

yaooqinn commented Aug 9, 2024

The idea here is simply that these are unused and so deleting them helps a little bit with the space crunch?
That's OK if so.
How did you find them, out of curiosity?

Hi @srowen, I have gained some experience using reStructuredText and Sphinx in other Apache projects, e.g. Apache Kyuubi.

shall we update Spark scripts to clean these files after doc building?

Hi @cloud-fan, apache/spark#47686 is also ready.

@cloud-fan
Contributor

How much space are we saving here?

@yaooqinn
Member Author

yaooqinn commented Aug 9, 2024

About 10-20 MB for each version.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Yes, I also double-checked.

$ du -h 3.5.1/api/python/_sources
 68K	3.5.1/api/python/_sources/development
148K	3.5.1/api/python/_sources/user_guide/pandas_on_spark
 40K	3.5.1/api/python/_sources/user_guide/sql
208K	3.5.1/api/python/_sources/user_guide
 20K	3.5.1/api/python/_sources/migration_guide
2.9M	3.5.1/api/python/_sources/reference/pyspark.sql/api
3.0M	3.5.1/api/python/_sources/reference/pyspark.sql
180K	3.5.1/api/python/_sources/reference/pyspark.ss/api
196K	3.5.1/api/python/_sources/reference/pyspark.ss
2.5M	3.5.1/api/python/_sources/reference/api
3.0M	3.5.1/api/python/_sources/reference/pyspark.pandas/api
3.1M	3.5.1/api/python/_sources/reference/pyspark.pandas
8.8M	3.5.1/api/python/_sources/reference
4.1M	3.5.1/api/python/_sources/getting_started
 13M	3.5.1/api/python/_sources
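The per-version check above could be extended to a total across all versions. The following is a minimal sketch, not the method used in this PR: the function name `sources_total_kib` and the default root `./site/docs` are assumptions made for illustration.

```shell
#!/bin/sh
# Sum the disk usage (in KiB) of every directory named _sources under a root.
# Assumption: the root defaults to ./site/docs, matching the paths in this PR.
sources_total_kib() {
    root="${1:-./site/docs}"
    # -prune stops find from descending into a matched directory;
    # "du -sk" prints one "<KiB><TAB><path>" line per directory.
    find "$root" -type d -name _sources -prune -exec du -sk {} + |
        awk '{ total += $1 } END { print total + 0 }'
}
```

Run from the root of a spark-website checkout, `sources_total_kib` prints a single number of KiB. The exact value depends on the filesystem's block size, so it may differ slightly from the `du -h` figures above.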

Member

@dongjoon-hyun dongjoon-hyun left a comment

Oh, @yaooqinn, it seems that this PR also deleted the following file, which is outside the _sources folders.

$ COLUMNS=1000 git diff HEAD~1 --stat | grep -v '/api/python/_sources'
 site/docs/4.0.0-preview1/util/build-error-docs.py                                                                                              |   152 ---------
 33288 files changed, 1194166 deletions(-)

This PR:
https://github.com/apache/spark-website/blob/14b7c9e0fb1dd3945672b1fde7fedd4f9b1e585d/site/docs/4.0.0-preview1/util/build-error-docs.py

asf-site:
https://github.com/apache/spark-website/blob/4fd6a38/site/docs/4.0.0-preview1/util/build-error-docs.py
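The check above filters `git diff --stat` output, which also prints the summary line. A variant that works on plain path names is sketched below, assuming a helper named `outside_sources` (hypothetical, not part of any Spark tooling) that reads paths on stdin.

```shell
#!/bin/sh
# Read file paths on stdin and print only those that are NOT inside an
# api/python/_sources directory. "|| true" keeps the exit status at 0
# when every path is filtered out (grep -v exits 1 on empty output).
outside_sources() {
    grep -v '/api/python/_sources/' || true
}

# Hypothetical usage inside a spark-website checkout, listing only deleted files:
#   git diff --name-only --diff-filter=D HEAD~1 | outside_sources
```

An empty result would mean that every deleted file lives under a `_sources` folder. Any line that is printed is a deletion worth reviewing by hand.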

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 9, 2024

I manually did the same thing as a test and am sharing it with you as a new PR. The result is not the same as this PR, @yaooqinn. In other words, this PR deletes a wrong file, util/build-error-docs.py, which is outside of _sources, and it also deletes fewer files.

Screenshot 2024-08-09 at 09 05 35

$ find . -name _sources
./site/docs/4.0.0-preview1/api/python/_sources
./site/docs/3.1.2/api/python/_sources
./site/docs/3.3.1/api/python/_sources
./site/docs/3.3.0/api/python/_sources
./site/docs/3.1.3/api/python/_sources
./site/docs/3.4.0/api/python/_sources
./site/docs/3.2.2/api/python/_sources
./site/docs/3.4.1/api/python/_sources
./site/docs/3.2.4/api/python/_sources
./site/docs/3.2.3/api/python/_sources
./site/docs/3.5.0/api/python/_sources
./site/docs/3.1.1/api/python/_sources
./site/docs/3.3.2/api/python/_sources
./site/docs/3.5.1/api/python/_sources
./site/docs/3.3.3/api/python/_sources
./site/docs/3.3.4/api/python/_sources
./site/docs/3.2.1/api/python/_sources
./site/docs/3.4.3/api/python/_sources
./site/docs/3.2.0/api/python/_sources
./site/docs/3.4.2/api/python/_sources

$ find . -name _sources | xargs rm -rf
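The one-liner above works for the paths listed here, but `xargs` splits its input on whitespace and `-name _sources` also matches regular files. A more defensive variant with a dry-run mode is sketched below. The function name `remove_sources` and the `--dry-run` argument are assumptions for illustration and were not used in this PR.

```shell
#!/bin/sh
# Remove every directory named _sources under a root directory.
# Pass "--dry-run" as the second argument to only list what would be removed.
remove_sources() {
    root="${1:?usage: remove_sources ROOT [--dry-run]}"
    if [ "${2:-}" = "--dry-run" ]; then
        # -type d restricts matches to directories; -prune avoids descending.
        find "$root" -type d -name _sources -prune -print
    else
        # -exec passes paths to rm directly, so spaces in names are safe.
        find "$root" -type d -name _sources -prune -exec rm -rf {} +
    fi
}
```

Running `remove_sources . --dry-run` first should print the same list as `find . -name _sources` above. Dropping `--dry-run` then deletes those directories and nothing else.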

@yaooqinn yaooqinn changed the title [SPARK-49181] Remove site/docs/{version}/api/python/_sources folders [SPARK-49181] Remove site/docs/{version}/api/python/_sources folders and util/build-error-docs.py Aug 9, 2024
@yaooqinn
Member Author

yaooqinn commented Aug 9, 2024

Thank you @dongjoon-hyun

I updated the PR title to mention util/build-error-docs.py too. Spark main has already moved it into _plugins.

@yaooqinn
Member Author

yaooqinn commented Aug 9, 2024

FYI, apache/spark@81948bb

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for the fix.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 9, 2024

I updated the PR title to mention util/build-error-docs.py too. Spark main has already moved it into _plugins.

BTW, for the above part, you cannot delete that file by using apache/spark@81948bb, because it is already part of the published Apache Spark 4.0.0-preview1 artifacts.

https://github.com/apache/spark-website/blob/4fd6a38/site/docs/4.0.0-preview1/util/build-error-docs.py

I'm wondering how you generated this PR initially. Could you revise the PR description to describe the reproducible steps, @yaooqinn?

@yaooqinn
Member Author

BTW, for the above part, you cannot delete that file by using apache/spark@81948bb, because it is already part of the published Apache Spark 4.0.0-preview1 artifacts.

I think it's similar to the "_sources" folders. Both of them are accessible via doc releases, but useless to users.

@dongjoon-hyun
Member

Sure, it's similar, of course, @yaooqinn. So, I gave +1 already.

BTW, please revise the PR description to explain the process you used to decide which files to remove.

I'm wondering how you generated this PR initially. Could you revise the PR description to describe the reproducible steps, @yaooqinn?

@yaooqinn
Member Author

Thank you @dongjoon-hyun, updated

@yaooqinn yaooqinn merged commit 0020708 into apache:asf-site Aug 10, 2024
1 check passed
@yaooqinn yaooqinn deleted the SPARK-49181 branch August 10, 2024 06:02
5 participants