feat: add element finder
Undertone0809 committed Aug 1, 2023
1 parent 6e7f7c3 commit 811e7ea
Showing 5 changed files with 108 additions and 34 deletions.
30 changes: 22 additions & 8 deletions README.md
@@ -16,6 +16,7 @@
- Batch Conversion: It supports batch conversion of single or multiple files, as well as formatting and renaming of generated files.
- Highly Customizable: By inheriting the `MdAdapter` class, you can easily implement custom URL conversion for different image servers.
- Image Server Adapters: Currently, only Aliyun OSS is supported as an image server. Contributions are welcome to add support for more types of image servers.
- Custom recognition format: With `ElementFinder`, you can customize how elements (such as image URLs) are found, so that formats the default pattern does not recognize can still be handled.

## Target Audience

@@ -270,27 +271,40 @@ if __name__ == "__main__":
```

### Custom Regular Expression
`imarkdown` uses a regular expression to find your images. It supports the `![](image_url)` and `<img src="image_url"/>` formats, but it cannot find images written in other formats.

To address this, `imarkdown` supports custom regular expressions. You can write a regular expression that matches your markdown image URLs and pass it to `MdImageConverter`. The following example shows how to use it.
`imarkdown` uses the regular-expression element finder `ReElementFinder` to recognize image URLs. The finder currently recognizes two formats, `![](image_url)` and `<img src="image_url"/>`. If your image URLs are written in some other form, the default regular expression may fail to recognize them.

In that case, you can define your own element finder, such as `CustomElementFinder`. It can recognize the content you need through a custom regular expression or any other recognition method, and you pass it to `MdImageConverter` for element replacement. The following example shows how to use a custom `ElementFinder` to identify image links.

```python
from imarkdown import MdImageConverter, LocalFileAdapter, MdFolder
import re
from typing import List

from imarkdown import BaseElementFinder, MdFile, MdImageConverter, LocalFileAdapter


class CustomElementFinder(BaseElementFinder):
    def find_all_elements(self, md_str) -> List[str]:
        re_rule: str = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
        images = re.findall(re_rule, md_str)
        return list(map(lambda item: item[1], images))


def main():
    custom_re = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
    adapter = LocalFileAdapter()
    md_converter = MdImageConverter(adapter=adapter)

    md_folder = MdFolder(name="mds")
    md_converter.convert(md_folder, output_directory="output_mds", re_rule=custom_re)
    converter = MdImageConverter(adapter=adapter)
    element_finder = CustomElementFinder()

    md_file = MdFile(name="test.md")
    converter.convert(md_file, element_finder=element_finder)


if __name__ == "__main__":
    main()
```

In this example, `CustomElementFinder` inherits from `BaseElementFinder` and implements `find_all_elements()`. That method holds the search logic: it builds a list of all elements found in the markdown (such as the URLs of all images) and returns it to `MdImageConverter`.
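It may help to see what the default pattern actually returns. `re.findall` yields one tuple per match with three groups: the alt text and URL of a `![]()` image, and the `src` of an `<img>` tag. Groups that did not take part in a match come back as empty strings, so a finder that returns only `item[1]` produces `""` for `<img>` matches. The following is a minimal standard-library sketch, not `imarkdown`'s implementation; `find_image_urls` and the sample URLs are illustrative names chosen here.

```python
import re
from typing import List

# The default pattern from this commit: groups 1 and 2 belong to ![alt](url),
# group 3 belongs to <img src="...">.
RE_RULE = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"


def find_image_urls(md_str: str) -> List[str]:
    """Hypothetical helper: return the URL from whichever alternative matched."""
    urls = []
    for _alt, md_url, html_url in re.findall(RE_RULE, md_str):
        url = md_url or html_url  # exactly one of the two is non-empty per match
        if url:
            urls.append(url)
    return urls


sample = '![logo](https://example.com/a.png)\n<img src="https://example.com/b.png"/>'

raw = re.findall(RE_RULE, sample)             # the raw three-group tuples
only_group_1 = [item[1] for item in raw]      # what `item[1]` alone yields
urls = find_image_urls(sample)                # both formats covered
```

Under these assumptions, a `find_all_elements()` implementation that wants `<img>` sources as well could return `item[1] or item[2]` instead of `item[1]`.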

## Roadmap

- [ ] Add client-side support
32 changes: 22 additions & 10 deletions README_zh.md
@@ -16,6 +16,7 @@
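- Batch conversion: supports batch conversion of one or more files, as well as formatted renaming of the generated files
- Highly customizable: just inherit from MdAdapter to implement URL conversion for a custom image server
- Image server adaptation: currently only Aliyun image hosting is supported; PRs adding more image servers are welcome
- Custom recognition format: with ElementFinder, users can customize how elements (such as image addresses) are found, to meet special element-recognition needs.

## Target Audience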

@@ -270,29 +271,40 @@ if __name__ == "__main__":
    main()
```

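### Custom Regular Expression
### Custom Element Finder

`imarkdown` uses a regular expression to recognize image URLs. It currently supports two image URL formats, `![](image_url)` and `<img src="image_url"/>`. If your image URLs are unusual, the default regular expression of `imarkdown` may sometimes fail to recognize them.
`imarkdown` uses the regular-expression element finder `ReElementFinder` to recognize image URLs. Its expression currently recognizes two image URL formats, `![](image_url)` and `<img src="image_url"/>`. If your image URLs are unusual, the default regular expression of `imarkdown` may sometimes fail to recognize them.

In that case, you can define a regular expression that recognizes your images and pass it to `imarkdown`. The following example shows how to use a custom regular expression to recognize images.
In that case, you can define a custom element finder, `CustomElementFinder`, that uses a custom regular expression or another recognition method to find exactly the content you need, and pass it to MdImageConverter for element replacement. The following example shows how to use a custom `ElementFinder` to recognize image links.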

```python
from imarkdown import MdImageConverter, LocalFileAdapter, MdFolder
import re
from typing import List

from imarkdown import BaseElementFinder, MdFile, MdImageConverter, LocalFileAdapter


class CustomElementFinder(BaseElementFinder):
    def find_all_elements(self, md_str) -> List[str]:
        re_rule: str = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
        images = re.findall(re_rule, md_str)
        return list(map(lambda item: item[1], images))


def main():
    custom_re = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
    adapter = LocalFileAdapter()
    md_converter = MdImageConverter(adapter=adapter)

    md_folder = MdFolder(name="mds")
    md_converter.convert(md_folder, output_directory="output_mds", re_rule=custom_re)
    converter = MdImageConverter(adapter=adapter)
    element_finder = CustomElementFinder()

    md_file = MdFile(name="test.md")
    converter.convert(md_file, element_finder=element_finder)


if __name__ == "__main__":
    main()
```

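In this example, `CustomElementFinder` must inherit from `BaseElementFinder` and implement the `find_all_elements` function with its specific search logic, building an array of all elements found in the markdown (such as the URLs of all images) and returning it to `MdImageConverter` for element replacement.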

## Roadmap

@@ -306,7 +318,7 @@ if __name__ == "__main__":
- [ ] Support custom file naming
- [ ] Support custom formatted naming for images
- [ ] Build a PDF converter
- [ ] Support replacing other markdown elements
- [x] Support replacing other markdown elements


## FAQ
24 changes: 24 additions & 0 deletions example/custom_element_finder.py
@@ -0,0 +1,24 @@
import re
from typing import List

from imarkdown import BaseElementFinder, MdFile, MdImageConverter, LocalFileAdapter


class CustomElementFinder(BaseElementFinder):
    def find_all_elements(self, md_str, *args, **kwargs) -> List[str]:
        re_rule: str = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
        images = re.findall(re_rule, md_str)
        return list(map(lambda item: item[1], images))


def main():
    adapter = LocalFileAdapter()
    converter = MdImageConverter(adapter=adapter)
    element_finder = CustomElementFinder()

    md_file = MdFile(name="test.md")
    converter.convert(md_file, element_finder=element_finder)


if __name__ == "__main__":
    main()
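
The example above reuses the default pattern, so it does not show the case the README motivates: an image format the default finder misses. As a hedged sketch of that case, the finder below collects URLs from reference-style definitions such as `[logo]: https://...`. `ReferenceImageFinder` is a hypothetical name, and the local `BaseElementFinder` is a stand-in so that the snippet runs without `imarkdown` installed; in real use you would import the base class from `imarkdown` as the example file does. Whether the converter's replacement step handles such URLs as intended has not been verified here.

```python
import re
from typing import List


class BaseElementFinder:
    """Stand-in for imarkdown.BaseElementFinder so this sketch is self-contained."""

    def find_all_elements(self, md_str: str) -> List[str]:
        raise NotImplementedError


class ReferenceImageFinder(BaseElementFinder):
    """Hypothetical finder for reference-style definitions: `[label]: url`."""

    # A definition starts a line (up to three leading spaces), then `[label]:` and the URL.
    _definition = re.compile(r"^[ ]{0,3}\[[^\]]+\]:\s*(\S+)", re.MULTILINE)

    def find_all_elements(self, md_str: str, *args, **kwargs) -> List[str]:
        # With a single capture group, findall returns a plain list of strings.
        return self._definition.findall(md_str)


md = "![logo][logo]\n\n[logo]: https://example.com/logo.png\n"
found = ReferenceImageFinder().find_all_elements(md)
```

The only contract visible in this commit is that `find_all_elements()` returns a list of strings, so any recognition method (a regular expression, an HTML parser, a markdown parser) fits as long as it produces that list.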
3 changes: 2 additions & 1 deletion imarkdown/__init__.py
@@ -1,7 +1,7 @@
from imarkdown.adapter.aliyun_adapter import AliyunAdapter
from imarkdown.adapter.base import BaseMdAdapter
from imarkdown.adapter.local_adapter import LocalFileAdapter
from imarkdown.converter import MdImageConverter, BaseMdImageConverter
from imarkdown.converter import MdImageConverter, BaseMdImageConverter, BaseElementFinder
from imarkdown.schema import MdFile, MdFolder

__all__ = [
@@ -12,4 +12,5 @@
    "BaseMdAdapter",
    "LocalFileAdapter",
    "AliyunAdapter",
    "BaseElementFinder"
]
53 changes: 38 additions & 15 deletions imarkdown/converter.py
@@ -4,6 +4,7 @@
import re
import time
import traceback
from abc import abstractmethod
from typing import List, Optional, Any, Union, Dict

import requests
@@ -73,6 +74,27 @@ def _load_default_adapter() -> BaseMdAdapter:
    return MdAdapterMapper[cfg.last_adapter_name]()


class BaseElementFinder:
    """Element finder that finds all specified elements (like images) in a markdown file.
    ReElementFinder uses a regular expression to find elements."""

    @abstractmethod
    def find_all_elements(self, md_str: str) -> List[str]:
        """Find all elements (images) and return them."""


class ReElementFinder(BaseElementFinder):
    def __init__(
        self, re_rule: str = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
    ):
        self.re_rule = re_rule
        """Default regular expression to find images; you can pass a custom re_rule."""

    def find_all_elements(self, md_str: str) -> List[str]:
        elements = re.findall(self.re_rule, md_str)
        return list(map(lambda item: item[1], elements))
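
A note on `@abstractmethod` as used in `BaseElementFinder`: the decorator is only enforced when the class uses `ABCMeta` as its metaclass, for example by inheriting from `abc.ABC`. On a plain class it has no effect at instantiation time, so a subclass that forgets to implement `find_all_elements()` would quietly return `None`. The sketch below illustrates the difference with two hypothetical classes, `PlainFinder` and `StrictFinder`; it is not a proposed change to the library, and whether the author intended strict enforcement is not stated in the commit.

```python
from abc import ABC, abstractmethod
from typing import List


class PlainFinder:
    """Same shape as the class in the diff: @abstractmethod without ABC."""

    @abstractmethod
    def find_all_elements(self, md_str: str) -> List[str]:
        """Find all elements and return them."""


class StrictFinder(ABC):
    """Variant that inherits from ABC, so the abstract method is enforced."""

    @abstractmethod
    def find_all_elements(self, md_str: str) -> List[str]:
        """Find all elements and return them."""


# The plain class can be instantiated, and the unimplemented method returns None.
plain_result = PlainFinder().find_all_elements("![](a.png)")

# The ABC-based class refuses to instantiate until the method is implemented.
try:
    StrictFinder()
    strict_error = None
except TypeError as exc:
    strict_error = exc
```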


class BaseMdImageConverter(BaseModel):
    adapter: BaseMdAdapter = Field(default_factory=_load_default_adapter)
    """Adapter determines the convert method you choose."""
@@ -90,8 +112,11 @@ class BaseMdImageConverter(BaseModel):
    """The storage directory of converted markdown file."""
    converted_md_file_name: Optional[str] = None
    """The converted markdown file name."""
    re_rule: str = r"(?:!\[(.*?)\]\((.*?)\))|<img.*?src=[\'\"](.*?)[\'\"].*?>"
    """Default regular expression to find images, you can custom re_rule."""
    element_finder: BaseElementFinder = Field(default=ReElementFinder())
    """Element Finder can find all specified elements(like images) in markdown file."""

    class Config:
        arbitrary_types_allowed = True

    @root_validator(pre=True)
    def variables_check(
@@ -170,7 +195,7 @@ def convert(
        image_local_storage_directory: Optional[str] = None,
        output_md_directory: Optional[str] = None,
        is_local_images: Optional[bool] = None,
        re_rule: Optional[str] = None,
        element_finder: Optional[BaseElementFinder] = None,
        **kwargs,
    ):
        """Convert Markdown image url and generate a new Markdown file.
@@ -180,8 +205,8 @@ def convert(
            image_local_storage_directory(Optional[str]): Specified image storage path. You can pass an absolute or a
                relative path. Default image directory path is the Markdown directory named `markdown_dir/images`.
            output_md_directory(Optional[str]): The storage directory of converted markdown file.
            re_rule(Optional[str]): Regular expression to find images, you can custom re_rule.
            is_local_images: It is a local images.
            element_finder: Element Finder can find all specified elements(like images) in markdown file.
            **kwargs:
                enable_rename(bool): Default is true, it means the generated markdown file will receive a new name.
                name_prefix(Optional[str]): Prefix name of generated markdown file.
@@ -194,9 +219,8 @@ def convert(
            return
        if is_local_images:
            self.is_local_images = is_local_images
        if re_rule:
            logger.debug(f"[imarkdown] reset regular expression <{re_rule}>")
            self.re_rule = re_rule
        if element_finder:
            self.element_finder = element_finder

        self.set_converted_md_file_name(md_file_path, **kwargs)
        self.set_md_file_original_directory(md_file_path)
@@ -220,29 +244,27 @@ def convert(
        _write_data(converted_md_path, modified_data)
        logger.info(f"[imarkdown] <{md_file_path}> converted task end")

    def _find_img_and_replace(self, md_str: str, re_rule: Optional[str] = None) -> str:
    def _find_img_and_replace(self, md_str: str) -> str:
        """Input original markdown str and replace images address
        It can find `[]()` type image url and `<img/>` type image url
        It can find `![]()` type image url and `<img/>` type image url
        Args:
            md_str: markdown original data
        Returns:
            Markdown data for the image url has been changed.
        """
        _images = re.findall(
            self.re_rule, md_str
        )
        _images = self.element_finder.find_all_elements(md_str)

        images = []
        for image in _images:
            if image[1] == "":
            if image == "":
                continue
            # If current image link is local path URL and you need to web URL to a local path,
            # the local path url will not be converted.
            if not self.is_local_images and not image[1].startswith("http"):
            if not self.is_local_images and not image.startswith("http"):
                continue
            images.append(image[1])
            images.append(image)

        for image in images:
            converted_image_url = self._get_converted_image_url(image)
@@ -328,6 +350,7 @@ def convert(
            **kwargs:
                re_rule(Optional[str]): custom regular expression to find specified element like image.
        """

        def check_warning(medium: Union[MdFile, MdFolder]):
            if not output_directory and isinstance(medium, MdFolder):
                raise ValueError(
